MagicTools
document · April 22, 2026 · 192 views · 8 min read

Practical Guide to Document Format Conversion: Comprehensive Analysis of Markdown, HTML, PDF Interconversion

Every document has its most suitable format. Markdown is suitable for writing, HTML for web display, PDF for printing and distribution, and Word for office collaboration. The trouble is, in real work, you often need to switch between these formats—converting blog posts to PDF to send to clients, migrating Word documents to knowledge bases, and extracting web content into Markdown for archiving.

Format conversion seems simple, but when you actually do it, problems often arise: formatting gets messed up, tables disappear, and Chinese characters become garbled. This guide outlines the best solutions for various conversion paths to help you avoid detours.

Comparison of Characteristics of the Four Major Formats

Before choosing a conversion path, first understand the essential characteristics of each format:

| Format | Readability | Editability | Rendering Effect | File Size | Application Scenarios |
| --- | --- | --- | --- | --- | --- |
| Markdown (.md) | Extremely high (plain text) | Extremely high | Depends on renderer | Extremely small | Technical documentation, blog writing, README |
| HTML (.html) | Medium (contains tags) | High | Browser rendering | Small to medium | Web pages, email templates, document display |
| PDF (.pdf) | High (visual) | Extremely low | Fixed layout | Medium to large | Printing, formal documents, cross-platform distribution |
| Word (.docx) | High | High | Office rendering | Medium | Office collaboration, requiring comments and revisions |

Core principle: editability and layout fixedness pull in opposite directions. The more editable a format is, the less fixed its layout; the more fixed the layout, the more is lost when converting out of it. PDF is effectively a one-way format, so converting from PDF to any other format loses the most.

Markdown → PDF

This is the most common conversion requirement. There are three solutions, each with different trade-offs.

Solution 1: Browser Print

  1. Open the document in a tool that supports Markdown preview (VS Code preview, MagicTools, Typora)
  2. Press Ctrl+P (Mac: Cmd+P) to open the print dialog
  3. Select 'Save as PDF' as the target printer
  4. Adjust page settings: uncheck headers and footers, set appropriate margins

Advantages: simple to do, nothing to learn, and what you see is what you get. Disadvantages: code highlighting, fonts, and pagination depend on the rendering engine of the preview tool, so the output may differ slightly between machines.

Solution 2: Pandoc

Pandoc is the Swiss Army knife of format conversion, converting among more than 40 formats.

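# Install (Mac)
brew install pandoc
# LaTeX is also required for PDF output
brew install --cask mactex-no-gui

# Basic conversion
pandoc input.md -o output.pdf

# With Chinese support (a CJK font must be specified, otherwise Chinese text is missing or shown as boxes)
pandoc input.md -o output.pdf \
  --pdf-engine=xelatex \
  -V CJKmainfont="PingFang SC" \
  -V geometry:margin=2cm

# Custom CSS styling (via an intermediate HTML step; wkhtmltopdf must be installed separately)
pandoc input.md -o output.pdf \
  --pdf-engine=wkhtmltopdf \
  --css=style.css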

Advantages: highest output quality, fine-grained control over formatting, and support for batch processing, which makes it suitable for production environments. Disadvantages: requires a LaTeX installation (about 4GB), and configuring Chinese fonts has a learning curve.
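
Because this route depends on both Pandoc and a LaTeX engine, it can help to check that they are installed before starting a long batch job. The sketch below is one way to do that. The function name `missing_tools` is hypothetical, and it assumes a POSIX shell in which `command -v` reports whether a program can be found on the PATH.

```shell
# Print the name of every required tool that cannot be found on the PATH.
# missing_tools is a hypothetical helper, not part of Pandoc.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: report anything the XeLaTeX route needs but cannot find
# missing_tools pandoc xelatex
```

Empty output means every listed tool was found; otherwise each missing tool is printed on its own line.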

Solution 3: Online Tools (For Temporary Needs)

MagicTools has a built-in export function. After writing Markdown, export it to PDF with one click, without installing any software. Suitable for occasional needs.

Markdown → HTML

Markdown to HTML is the closest thing to a "lossless" conversion, because Markdown was designed as a shorthand for HTML: each Markdown construct maps to a specific HTML element.
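
To make the relationship between the two formats concrete, here is a toy sketch that rewrites three Markdown constructs into HTML tags with `sed`. It is only an illustration under narrow assumptions (level 1 and 2 `#` headings, and `**bold**` on a single line). The function name `toy_md_to_html` is made up, and a real converter such as Pandoc handles far more.

```shell
# Toy illustration: map a few Markdown constructs to HTML tags.
# Not a real converter; it ignores paragraphs, lists, code blocks, escaping, etc.
toy_md_to_html() {
  sed -E \
    -e 's/^## (.*)$/<h2>\1<\/h2>/' \
    -e 's/^# (.*)$/<h1>\1<\/h1>/' \
    -e 's/\*\*([^*]+)\*\*/<strong>\1<\/strong>/g'
}

# Example:
# printf '# Title\nSome **bold** text.\n' | toy_md_to_html
```

The level 2 rule runs first so that a `## ` line is not caught by the level 1 rule.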

Static Blog Generation

Mainstream solutions: Hugo, Jekyll, Gatsby, Astro. Taking Hugo as an example:

# One command turns the whole articles/ directory into a website
hugo --source . --destination ./public

# The generated public/ directory contains the complete HTML site and can be deployed directly

These tools automatically handle code highlighting, table of contents generation, related article recommendations, and other functions.

Single File Conversion

# Pandoc conversion in standalone mode (a complete HTML document rather than a fragment)
pandoc input.md -o output.html --standalone

# Link an external stylesheet
pandoc input.md -o output.html --standalone --css=github-markdown.css

The standalone result can be opened directly in a browser. To embed the content in an existing web page, omit --standalone so that Pandoc emits only an HTML fragment.

Email Templates

After converting Markdown content to HTML, note that email clients have poor CSS support, so all styles need to be inlined:

# Inline the CSS with the juice tool
npm install -g juice
pandoc input.md -o temp.html --standalone --css=email.css
juice temp.html output-email.html

HTML → Markdown

This path is most commonly used for: blog platform migration (WordPress → Hexo/Hugo), extracting content from web pages for archiving, and cleaning crawled content from web scrapers.

MagicTools HTML to Markdown supports three input methods:

  • Directly paste HTML code
  • Enter a URL to automatically fetch the web page
  • Paste rich text (copy from a web page and paste)

It uses the Turndown engine behind the scenes, which can correctly handle complex structures such as tables, code blocks, and nested lists.

Command Line Batch Conversion

When migrating an entire WordPress blog, the command line solution is more efficient:

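# Install html2text (Python)
pip install html2text

# Convert a single file
html2text article.html > article.md

# Batch-convert a whole directory
mkdir -p markdown-output
for f in html-pages/*.html; do
  name="$(basename "$f" .html)"   # strip the directory and the .html extension
  html2text "$f" > "markdown-output/$name.md"
done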

Notes on conversion quality:

  • Navigation bars, sidebars, ads, and other irrelevant content need to be manually cleaned up
  • Image URLs will retain the original address, and need to be downloaded separately and replaced with local paths
  • Complex nested tables may lose formatting
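
The image cleanup noted above can be partly automated. As a rough sketch, the function below lists the remote image URLs referenced in a converted Markdown file, so they can be downloaded and rewritten afterwards. The name `list_remote_images` is hypothetical, and the pattern only covers inline images of the form `![alt](http...)`, not reference-style links or raw `<img>` tags.

```shell
# List remote image URLs used in a Markdown file, one per line, de-duplicated.
# Only matches inline images of the form ![alt](https://...).
list_remote_images() {
  grep -oE '!\[[^]]*\]\(https?://[^) ]+' "$1" \
    | sed -E 's/^!\[[^]]*\]\(//' \
    | sort -u
}

# Example:
# list_remote_images article.md > image-urls.txt
```

The resulting list can be fed to a download tool, after which the URLs in the Markdown file still need to be replaced with local paths.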

PDF → Markdown

This is the lowest-quality path of all, because a PDF is essentially a set of print instructions: the logical structure of the text (paragraphs, heading levels, reading order) is not stored directly in the file.

OCR Solution (Suitable for Scanned PDFs)

Scanned PDFs (produced by scanning paper documents) contain only page images, so the text has to be recovered with OCR:

  • Adobe Acrobat Pro: Highest recognition accuracy, supports Chinese, relatively expensive
  • Microsoft Office Lens: Free mobile app, scanning + OCR, outputs Word and then converts to Markdown
  • Online OCR: ilovepdf.com, smallpdf.com offer free OCR conversion

Text Layer PDF Conversion

If the PDF contains selectable text (not a scanned version), you can use:

# pdftotext (from the poppler toolkit)
pdftotext -layout input.pdf output.txt
# Note: only plain text is extracted; heading levels and formatting are lost

# pdf2md tool (preserves structure better)
npm install -g pdf2md-cli
pdf2md input.pdf > output.md
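
Before choosing between OCR and text extraction, it helps to know whether a PDF has a text layer at all. One hedged way to check is to see whether `pdftotext` extracts any non-whitespace characters. The helper name `pdf_has_text_layer` is hypothetical, and the check assumes poppler's `pdftotext` is installed. A scanned PDF that has already been through OCR will also pass.

```shell
# Succeed (exit 0) if pdftotext can extract any non-whitespace text.
# "-" as the output file makes pdftotext write to standard output.
pdf_has_text_layer() {
  text="$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')"
  [ -n "$text" ]
}

# Example:
# if pdf_has_text_layer input.pdf; then echo "text layer"; else echo "needs OCR"; fi
```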

Realistic expectations: PDF → Markdown almost always requires manual correction, only suitable for "better than nothing" scenarios. For important documents, it is recommended to keep the source files (Word, Markdown originals).

Word/DOCX → Markdown

In enterprise environments, a large number of documents exist in Word format, and this path is commonly used when migrating to Markdown knowledge bases.

# Basic conversion
pandoc input.docx -o output.md

# Extract the images in the Word file into the media/ directory
pandoc input.docx -o output.md --extract-media=./media

# Batch conversion
for f in *.docx; do
  pandoc "$f" -o "${f%.docx}.md" --extract-media="./media/${f%.docx}"
done

Pandoc can correctly identify Word's heading styles (Heading 1 → #, Heading 2 → ##), and retain bold, italic, tables, and images.

Word Save As (Simple but Poor Quality)

In Word, go to File → Save As → Plain Text, which only retains plain text and loses all formatting. Not recommended as a migration solution.

Quality Ratings for Each Conversion Path

| Conversion Path | Format Fidelity | Operation Difficulty | Recommended Tools |
| --- | --- | --- | --- |
| Markdown → HTML | ★★★★★ | Low | Pandoc / Online Tools |
| Markdown → PDF | ★★★★☆ | Medium | Pandoc + XeLaTeX |
| HTML → Markdown | ★★★★☆ | Low | MagicTools / Turndown |
| Word → Markdown | ★★★★☆ | Low | Pandoc |
| Markdown → Word | ★★★☆☆ | Low | Pandoc |
| PDF → Markdown | ★★☆☆☆ | High | OCR + Manual Correction |
| PDF → Word | ★★★☆☆ | Low | Adobe / Online Tools |

FAQ

Q: What to do if the formatting is messed up after conversion?

A: Messy formatting usually has one of three causes. First, the source file is not well structured (for example, blank lines used instead of paragraph styles in Word); standardize the source file before converting. Second, the conversion tool does not support certain features (for example, special fonts or complex tables); try switching to another tool. Third, the encoding is wrong (garbled Chinese characters); Pandoc expects UTF-8 input, so save the file as UTF-8 or re-encode it first with a tool such as iconv. For PDF → other formats, messy formatting is to be expected and requires manual correction.
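
For the encoding case, one option is to re-encode the source file to UTF-8 before handing it to Pandoc. The sketch below wraps `iconv` for that purpose. The function name `to_utf8` and the file names are illustrative, and you need to know the source encoding in advance (GB18030 is used here only as an example of a legacy Chinese encoding).

```shell
# Re-encode a text file to UTF-8 and write the result to standard output.
# $1 = input file, $2 = source encoding name as understood by iconv
to_utf8() {
  iconv -f "$2" -t UTF-8 "$1"
}

# Example: convert a GB18030 file, then pass the result to Pandoc
# to_utf8 legacy.md GB18030 > legacy-utf8.md
# pandoc legacy-utf8.md -o legacy.html
```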

Q: How to handle tables that are easily lost during conversion?

A: Tables are the most problematic element in format conversion. Handling suggestions: when converting HTML → Markdown, prioritize using MagicTools or Turndown, which have the best support for tables; table loss in PDFs is basically impossible to recover automatically and can only be manually rewritten as Markdown tables; Word tables converted with Pandoc have better quality, but merged cells will lose merge information; as a last resort, you can take a screenshot of the table and use an image instead—although not elegant, it's better than messy formatting.

Q: Can free tools handle batch conversion?

A: Absolutely. Pandoc is a free and open-source tool, and you can write Shell scripts to batch process entire directories. Python's html2text and markdownify libraries also support batch invocation. For batch conversion of up to 100 files, command-line scripts combined with Pandoc are the most efficient solution, completely free, and allow precise control of output format. For over 1000 files, consider parallel processing (xargs -P 4 or Python's multiprocessing) to improve speed.
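
As a sketch of the parallel approach mentioned above, the function below runs several Pandoc processes at once over every `.docx` file under a directory. The name `convert_docx_parallel` and the default of 4 jobs are assumptions for illustration. It relies on `find -print0` and `xargs -0 -P`, which the GNU and BSD tools provide but POSIX does not guarantee.

```shell
# Convert every .docx under a directory to Markdown, N files at a time.
# $1 = directory to scan, $2 = number of parallel jobs (default 4)
convert_docx_parallel() {
  find "$1" -type f -name '*.docx' -print0 \
    | xargs -0 -P "${2:-4}" -n 1 sh -c 'pandoc "$1" -o "${1%.docx}.md"' _
}

# Example:
# convert_docx_parallel ./docs 4
```

Each output file is written next to its source. The null-delimited file list keeps file names with spaces intact.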

Summary

There is no silver bullet for document format conversion, but there are patterns to follow:

  • High-frequency needs: Markdown ↔ HTML, Markdown → PDF, these paths have mature tools and reliable quality
  • Migration needs: Word → Markdown, HTML → Markdown, Pandoc + online tools can handle most situations
  • Hardest to convert: PDF → any format, be prepared for manual correction

The most important advice: Keep the original format files. Conversion is a temporary operation, and the original is the asset. Write in Markdown, use Git for version management, and you can convert it to any format you need at any time—this is the most flexible document workflow.

Published by MagicTools