A Practical Guide to Document Format Conversion: Converting Between Markdown, HTML, PDF, and Word

Every document format has a job it does best. Markdown suits writing, HTML suits web display, PDF suits printing and distribution, and Word suits office collaboration. In real work, though, you often have to move between them: turning a blog post into a PDF for a client, migrating Word documents into a knowledge base, or extracting web content into Markdown for archiving.

Format conversion looks simple, but in practice things go wrong: layouts break, tables disappear, and Chinese characters turn into garbage. This guide walks through the most reliable approach for each conversion path so you can avoid the common detours.

Comparing the Four Major Formats

Before choosing a conversion path, first understand the essential characteristics of each format:

Format	Readability	Editability	Rendering Effect	File Size	Application Scenarios
Markdown (.md)	Extremely high (plain text)	Extremely high	Depends on renderer	Extremely small	Technical documentation, blog writing, README
HTML (.html)	Medium (contains tags)	High	Browser rendering	Small to medium	Web pages, email templates, document display
PDF (.pdf)	High (visual)	Extremely low	Fixed layout	Medium to large	Printing, formal documents, cross-platform distribution
Word (.docx)	High	High	Office rendering	Medium	Office collaboration, requiring comments and revisions

Core principle: editability and layout fidelity pull in opposite directions. The more rigidly a format fixes its layout, the more is lost when converting out of it. PDF is effectively a one-way format: converting from PDF to anything else loses the most.

Markdown → PDF

This is the most common conversion need. There are three approaches, each with different trade-offs:

Solution 1: Browser Printing (Recommended for Beginners)

Open the document in a tool that supports Markdown preview (VS Code preview, MagicTools, Typora)
Press Ctrl+P (Mac: Cmd+P) to open the print dialog
Select 'Save as PDF' as the target printer
Adjust page settings: uncheck headers and footers, set appropriate margins

Advantages: Simple operation, zero learning cost, what you see is what you get. Disadvantages: Code highlighting, fonts, and pagination depend on browser rendering, and there may be subtle differences on different machines.

Solution 2: Pandoc Command Line (Recommended for Professionals)

Pandoc is the Swiss Army knife of format conversion, supporting over 40 formats for interconversion.

# Install (macOS)
brew install pandoc
# LaTeX is also required for PDF output
brew install --cask mactex-no-gui

# Basic conversion
pandoc input.md -o output.pdf

# With Chinese support (a CJK font must be specified, otherwise Chinese text is missing or shows as boxes)
pandoc input.md -o output.pdf \
  --pdf-engine=xelatex \
  -V CJKmainfont="PingFang SC" \
  -V geometry:margin=2cm

# Custom CSS styling (goes through an intermediate HTML step)
pandoc input.md -o output.pdf \
  --pdf-engine=wkhtmltopdf \
  --css=style.css

Advantages: Highest output quality, fine-grained format control, supports batch processing, suitable for production environments. Disadvantages: Requires installing a LaTeX environment (about 4 GB), and configuring Chinese fonts has a learning curve.
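The batch-processing point can be sketched as a small shell function. The function name md_dir_to_pdf, its arguments, and the default font are illustrative assumptions rather than part of Pandoc; CJKmainfont is the Pandoc template variable that selects the CJK font under XeLaTeX.

```shell
# Convert every .md file in SRC_DIR to a PDF in OUT_DIR.
# The third argument (CJK font) is an assumption: use any CJK font installed locally.
md_dir_to_pdf() {
  src_dir=$1
  out_dir=$2
  cjk_font=${3:-PingFang SC}
  mkdir -p "$out_dir"
  for f in "$src_dir"/*.md; do
    [ -e "$f" ] || continue            # skip when the glob matches nothing
    name=$(basename "$f" .md)
    pandoc "$f" -o "$out_dir/$name.pdf" \
      --pdf-engine=xelatex \
      -V CJKmainfont="$cjk_font" \
      -V geometry:margin=2cm || echo "failed: $f" >&2
  done
}
```

Calling md_dir_to_pdf articles pdf-out would write one PDF per Markdown file into pdf-out/ and report any file that Pandoc could not convert, without stopping the rest of the batch.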

Solution 3: Online Tools (For Temporary Needs)

MagicTools has a built-in export function. After writing Markdown, export it to PDF with one click, without installing any software. Suitable for occasional needs.

Markdown → HTML

Converting Markdown to HTML is the closest thing to a lossless conversion, because Markdown was designed as a shorthand that maps directly onto HTML elements.

Static Blog Generation

Mainstream solutions: Hugo, Jekyll, Gatsby, Astro. Taking Hugo as an example:

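# One command turns the whole articles/ directory into a website (assumes articles/ is configured as Hugo's content directory)
hugo --source . --destination ./public

# The generated public/ directory contains the complete HTML site and can be deployed as-is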

These tools automatically handle code highlighting, table of contents generation, related article recommendations, and other functions.

Single File Conversion

# Pandoc conversion to a complete HTML document with <head> and <body> (standalone mode)
pandoc input.md -o output.html --standalone

# Link an external stylesheet
pandoc input.md -o output.html --standalone --css=github-markdown.css

The result opens directly in a browser. To embed the content in an existing web page instead, drop --standalone so that Pandoc emits only an HTML fragment.
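The difference between a full page and an embeddable fragment can be sketched as a wrapper function. The name md_to_html and its mode argument are hypothetical; the only Pandoc behavior relied on is that --standalone adds the surrounding document structure and omitting it yields just the body content.

```shell
# Convert FILE.md to FILE.html.
# MODE "page" produces a complete document; "fragment" produces only the body HTML.
md_to_html() {
  input=$1
  mode=${2:-page}
  output="${input%.md}.html"
  case $mode in
    page)     pandoc "$input" -o "$output" --standalone ;;
    fragment) pandoc "$input" -o "$output" ;;
    *)        echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}
```

For example, md_to_html post.md fragment would produce a post.html that can be pasted into an existing template, while md_to_html post.md produces a file that opens on its own.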

Email Templates

After converting Markdown content to HTML, note that email clients have poor CSS support, so all styles need to be inlined:

# Inline the CSS with the juice tool
npm install -g juice
pandoc input.md -o temp.html --standalone --css=email.css
juice temp.html output-email.html

HTML → Markdown

This path is most commonly used for: blog platform migration (WordPress → Hexo/Hugo), extracting content from web pages for archiving, and cleaning crawled content from web scrapers.

Online Tools (Recommended)

MagicTools HTML to Markdown supports three input methods:

Directly paste HTML code
Enter a URL to automatically fetch the web page
Paste rich text (copy from a web page and paste)

It uses the Turndown engine behind the scenes, which can correctly handle complex structures such as tables, code blocks, and nested lists.

Command Line Batch Conversion

When migrating an entire WordPress blog, the command line solution is more efficient:

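# Install html2text (Python)
pip install html2text

# Convert a single file
html2text article.html > article.md

# Batch-convert a whole directory (basename strips the html-pages/ prefix from each path)
mkdir -p markdown-output
for f in html-pages/*.html; do
  html2text "$f" > "markdown-output/$(basename "$f" .html).md"
done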

Notes on conversion quality:

Navigation bars, sidebars, ads, and other irrelevant content need to be manually cleaned up
Image URLs will retain the original address, and need to be downloaded separately and replaced with local paths
Complex nested tables may lose formatting
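The image clean-up step from the notes above can be partly scripted. This is a minimal sketch, assuming images are written in the standard ![alt](url) form and that local copies should live in an images/ directory; both function names are hypothetical, and URLs with query strings would need extra handling.

```shell
# Print the distinct remote image URLs referenced as ![alt](url) in a Markdown file.
list_image_urls() {
  grep -oE '!\[[^]]*\]\(https?://[^) ]+' "$1" |
    sed -E 's/^!\[[^]]*\]\(//' |
    sort -u
}

# Print the Markdown file with each remote image link rewritten to images/<file name>.
localize_image_links() {
  sed -E 's#(!\[[^]]*\]\()https?://[^) ]*/([^/) ]+)#\1images/\2#g' "$1"
}
```

One possible workflow is to download the files first, for example with list_image_urls post.md | (mkdir -p images && cd images && xargs -n 1 curl -fsSLO), and then save the output of localize_image_links post.md as the migrated article.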

PDF → Markdown

This is the lowest-quality path of all, because a PDF is essentially a set of drawing instructions: it records where each glyph goes on the page, while the logical structure of the text (paragraphs, heading levels) is generally not stored in the file.

OCR Solution (Suitable for Scanned PDFs)

Scanned PDFs (scanned from paper documents) must use OCR:

Adobe Acrobat Pro: Highest recognition accuracy, supports Chinese, relatively expensive
Microsoft Office Lens: Free mobile app, scanning + OCR, outputs Word and then converts to Markdown
Online OCR: ilovepdf.com, smallpdf.com offer free OCR conversion

Text Layer PDF Conversion

If the PDF contains selectable text (not a scanned version), you can use:

# pdftotext (part of the poppler utilities)
pdftotext -layout input.pdf output.txt
# Note: extracts plain text only; heading levels and formatting are lost

# pdf2md tool (preserves structure better)
npm install -g pdf2md-cli
pdf2md input.pdf > output.md

Realistic expectations: PDF → Markdown almost always requires manual correction and suits only "better than nothing" scenarios. For important documents, keep the source files (Word or Markdown originals).
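Choosing between the OCR route and the text-layer route can be automated with a rough check. The sketch below assumes pdftotext from poppler is installed; the function name has_text_layer and the default threshold of 50 characters are arbitrary illustrative choices, not an established rule.

```shell
# Succeed if the PDF yields at least MIN_CHARS non-whitespace characters of text.
# A scanned PDF without a text layer typically yields almost nothing.
has_text_layer() {
  min_chars=${2:-50}
  # "-" tells pdftotext to write the extracted text to standard output
  chars=$(( $(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c) ))
  [ "$chars" -ge "$min_chars" ]
}
```

A script could then branch on it: if has_text_layer report.pdf; then pdftotext -layout report.pdf report.txt; else echo "needs OCR"; fi.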

Word/DOCX → Markdown

In enterprise environments, a large number of documents exist in Word format, and this path is commonly used when migrating to Markdown knowledge bases.

Pandoc (Recommended, Highest Quality)

# Basic conversion
pandoc input.docx -o output.md

# Extract images embedded in the Word file into the media/ directory
pandoc input.docx -o output.md --extract-media=./media

# Batch conversion
for f in *.docx; do
  pandoc "$f" -o "${f%.docx}.md" --extract-media="./media/${f%.docx}"
done

Pandoc can correctly identify Word's heading styles (Heading 1 → #, Heading 2 → ##), and retain bold, italic, tables, and images.
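When the target knowledge base renders GitHub-style Markdown, it may help to ask Pandoc for that flavor explicitly. The sketch below is one possible setup, not a required one: docx_to_gfm is a hypothetical helper name, and the per-document media directory naming is an assumption. The flags -t gfm, --wrap=none, and --extract-media are standard Pandoc options.

```shell
# Convert NAME.docx to NAME.md as GitHub-Flavored Markdown.
# --wrap=none keeps each paragraph on one line, which tends to produce cleaner Git diffs.
docx_to_gfm() {
  input=$1
  base=${input%.docx}
  pandoc "$input" -t gfm --wrap=none \
    --extract-media="${base}-media" \
    -o "${base}.md"
}
```

Running docx_to_gfm handbook.docx would write handbook.md and place any embedded images under handbook-media/.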

Word Save As (Simple but Poor Quality)

In Word, go to File → Save As → Plain Text, which only retains plain text and loses all formatting. Not recommended as a migration solution.

Quality Ratings for Each Conversion Path

Conversion Path	Format Fidelity	Operation Difficulty	Recommended Tools
Markdown → HTML	★★★★★	Low	Pandoc / Online Tools
Markdown → PDF	★★★★☆	Medium	Pandoc + XeLaTeX
HTML → Markdown	★★★★☆	Low	MagicTools / Turndown
Word → Markdown	★★★★☆	Low	Pandoc
Markdown → Word	★★★☆☆	Low	Pandoc
PDF → Markdown	★★☆☆☆	High	OCR + Manual Correction
PDF → Word	★★★☆☆	Low	Adobe / Online Tools

FAQ

Q: What to do if the formatting is messed up after conversion?

A: Broken formatting usually has one of three causes. First, the source file is not well structured (for example, blank lines used instead of paragraph styles in Word); clean up the source before converting. Second, the conversion tool does not support a feature (special fonts, complex tables); try a different tool. Third, there is an encoding problem (garbled Chinese characters): Pandoc expects UTF-8 input and has no option for other input encodings (--from selects the input format, not the character encoding), so save the file as UTF-8 or convert it with a tool such as iconv before running Pandoc. For PDF → other formats, broken formatting is essentially normal and requires manual correction.
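For the encoding case, a pre-conversion step along the following lines is one option. It is a sketch that assumes iconv is available; the function name to_utf8 is hypothetical, and the GB18030 default is only a guess that fits many legacy Chinese files, so pass the real source encoding when it is known.

```shell
# Convert IN_FILE from SRC_ENCODING to UTF-8 and write the result to OUT_FILE.
# The GB18030 default is an assumption, not something detected from the file.
to_utf8() {
  in_file=$1
  out_file=$2
  src_encoding=${3:-GB18030}
  iconv -f "$src_encoding" -t UTF-8 "$in_file" > "$out_file"
}
```

A typical use would be to_utf8 legacy.md input.md GBK followed by the usual pandoc input.md -o output.pdf command.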

Q: How to handle tables that are easily lost during conversion?

A: Tables are the most problematic element in format conversion. Suggestions: for HTML → Markdown, prefer MagicTools or Turndown, which handle tables well; tables lost from a PDF generally cannot be recovered automatically and have to be retyped as Markdown tables; Word tables convert reasonably well with Pandoc, but merged cells lose their merge information; as a last resort, take a screenshot of the table and use the image instead, which is inelegant but better than broken formatting.

Q: Can free tools handle batch conversion?

A: Absolutely. Pandoc is a free and open-source tool, and you can write Shell scripts to batch process entire directories. Python's html2text and markdownify libraries also support batch invocation. For batch conversion of up to 100 files, command-line scripts combined with Pandoc are the most efficient solution, completely free, and allow precise control of output format. For over 1000 files, consider parallel processing (xargs -P 4 or Python's multiprocessing) to improve speed.
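The xargs -P approach mentioned above might look like the following. The function name convert_docx_parallel and the default of four parallel jobs are illustrative assumptions; the sketch relies on find -print0 and xargs -0 so that file names containing spaces are handled safely.

```shell
# Convert every .docx under DIR to Markdown, running up to JOBS pandoc processes at once.
# Note: with no matching files, GNU xargs still runs the command once unless -r is added.
convert_docx_parallel() {
  dir=${1:-.}
  jobs=${2:-4}
  find "$dir" -type f -name '*.docx' -print0 |
    xargs -0 -P "$jobs" -n 1 sh -c 'pandoc "$1" -o "${1%.docx}.md"' _
}
```

Calling convert_docx_parallel ./docs 8 would write each Markdown file next to its Word source using eight workers; a sensible job count depends on the number of CPU cores available.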

Summary

There is no silver bullet for document format conversion, but there are patterns to follow:

High-frequency needs: Markdown ↔ HTML and Markdown → PDF; these paths have mature tools and reliable quality
Migration needs: Word → Markdown and HTML → Markdown; Pandoc plus online tools can handle most situations
Hardest to convert: PDF → any format; be prepared for manual correction

The most important advice: keep the original files. Conversion is a temporary operation; the original is the asset. Write in Markdown, manage versions with Git, and you can convert to any format you need at any time. This is the most flexible document workflow.