Practical Guide to Document Format Conversion: Converting Between Markdown, HTML, and PDF
Every format has a job it does best. Markdown suits writing, HTML suits web display, PDF suits printing and distribution, and Word suits office collaboration. In real work, though, you often need to move between them: turning a blog post into a PDF for a client, migrating Word documents into a knowledge base, or extracting web content into Markdown for archiving.
Format conversion looks simple, but in practice things go wrong: layouts break, tables disappear, and Chinese characters come out garbled. This guide covers the most reliable approach for each conversion path so you can avoid the common detours.
Comparison of Characteristics of the Four Major Formats
Before choosing a conversion path, first understand the essential characteristics of each format:
| Format | Readability | Editability | Rendering Effect | File Size | Application Scenarios |
|---|---|---|---|---|---|
| Markdown (.md) | Extremely high (plain text) | Extremely high | Depends on renderer | Extremely small | Technical documentation, blog writing, README |
| HTML (.html) | Medium (contains tags) | High | Browser rendering | Small to medium | Web pages, email templates, document display |
| PDF (.pdf) | High (visual) | Extremely low | Fixed layout | Medium to large | Printing, formal documents, cross-platform distribution |
| Word (.docx) | High | High | Office rendering | Medium | Office collaboration, requiring comments and revisions |
Core principle: the more editable a format is, the less fixed its layout; the more fixed the layout, the more is lost in conversion. PDF is effectively a one-way format, so converting from PDF to anything else loses the most.
Markdown → PDF
This is the most common conversion. There are three approaches, each with different trade-offs:
Solution 1: Browser Printing (Recommended for Beginners)
- Open the document in a tool that supports Markdown preview (VS Code preview, MagicTools, Typora)
- Press Ctrl+P (Mac: Cmd+P) to open the print dialog
- Select 'Save as PDF' as the target printer
- Adjust page settings: uncheck headers and footers, set appropriate margins
Advantages: Simple operation, zero learning cost, what you see is what you get. Disadvantages: Code highlighting, fonts, and pagination depend on browser rendering, and there may be subtle differences on different machines.
Solution 2: Pandoc Command Line (Recommended for Professionals)
Pandoc is the Swiss Army knife of format conversion, supporting over 40 formats for interconversion.
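# Install (Mac)
brew install pandoc
# LaTeX is also required for PDF output
brew install --cask mactex-no-gui
# Basic conversion
pandoc input.md -o output.pdf
# With Chinese support (a CJK font must be specified, otherwise Chinese text renders as empty boxes)
pandoc input.md -o output.pdf \
  --pdf-engine=xelatex \
  -V CJKmainfont="PingFang SC" \
  -V geometry:margin=2cm
# Custom CSS styling (goes through an intermediate HTML step; wkhtmltopdf must be installed separately)
pandoc input.md -o output.pdf \
  --pdf-engine=wkhtmltopdf \
  --css=style.css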
Advantages: Highest output quality, fine-grained format control, supports batch processing, suitable for production environments. Disadvantages: Requires installing LaTeX environment (about 4GB), and Chinese configuration has a learning curve.
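Batch processing with these settings can be sketched as a small shell function. This is a minimal sketch rather than a definitive script: the function name md_dir_to_pdf, the PANDOC_BIN override variable, and the directory names in the usage line are assumptions introduced here, and it relies on pandoc's CJKmainfont variable to select the Chinese font.

```shell
# Convert every Markdown file in SRC_DIR to a PDF in OUT_DIR.
# PANDOC_BIN is a hypothetical override hook (handy for dry runs); it defaults to pandoc.
md_dir_to_pdf() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$src_dir"/*.md; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    name=$(basename "$f" .md)
    "${PANDOC_BIN:-pandoc}" "$f" -o "$out_dir/$name.pdf" \
      --pdf-engine=xelatex \
      -V CJKmainfont="PingFang SC" \
      -V geometry:margin=2cm || echo "failed: $f" >&2
  done
}

# Usage (directory names are placeholders):
#   md_dir_to_pdf articles pdf-output
```

Setting PANDOC_BIN=echo prints the commands instead of running them, which is a cheap way to check the output paths before a long LaTeX run.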
Solution 3: Online Tools (For Temporary Needs)
MagicTools has a built-in export function. After writing Markdown, export it to PDF with one click, without installing any software. Suitable for occasional needs.
Markdown → HTML
Converting Markdown to HTML is the most "lossless" conversion because Markdown was designed to be converted to HTML, and its constructs map directly onto HTML elements.
Static Blog Generation
Mainstream solutions: Hugo, Jekyll, Gatsby, Astro. Taking Hugo as an example:
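# Build the whole site with one command (assumes articles/ is configured as Hugo's content directory)
hugo --source . --destination ./public
# The generated public/ directory holds the complete HTML site and can be deployed as-is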
These tools automatically handle code highlighting, table of contents generation, related article recommendations, and other functions.
Single File Conversion
# Pandoc conversion to a complete HTML document, including the default template's styles (standalone mode)
pandoc input.md -o output.html --standalone
# Link an external stylesheet
pandoc input.md -o output.html --standalone --css=github-markdown.css
The standalone output can be opened directly in a browser. To embed the result in an existing web page, omit --standalone so that Pandoc emits only an HTML fragment.
Email Templates
After converting Markdown content to HTML, note that email clients have poor CSS support, so all styles need to be inlined:
# Inline the CSS with the juice tool
npm install -g juice
pandoc input.md -o temp.html --standalone --css=email.css
juice temp.html output-email.html
HTML → Markdown
This path is most often used for blog platform migration (WordPress → Hexo/Hugo), extracting web page content for archiving, and cleaning up content collected by web scrapers.
Online Tools (Recommended)
MagicTools HTML to Markdown supports three input methods:
- Directly paste HTML code
- Enter a URL to automatically fetch the web page
- Paste rich text (copy from a web page and paste)
It uses the Turndown engine behind the scenes, which can correctly handle complex structures such as tables, code blocks, and nested lists.
Command Line Batch Conversion
When migrating an entire WordPress blog, the command line solution is more efficient:
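# Install html2text (Python)
pip install html2text
# Convert a single file
html2text article.html > article.md
# Batch-convert an entire directory
mkdir -p markdown-output
for f in html-pages/*.html; do
  # Strip the directory and extension so the output lands in markdown-output/
  html2text "$f" > "markdown-output/$(basename "$f" .html).md"
done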
Notes on conversion quality:
- Navigation bars, sidebars, ads, and other irrelevant content need to be manually cleaned up
- Image URLs will retain the original address, and need to be downloaded separately and replaced with local paths
- Complex nested tables may lose formatting
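The image clean-up step can be partly automated. The following is a minimal sketch, assuming images appear in standard Markdown syntax `![alt](url)`; the function names list_remote_images and download_images and the file names in the usage lines are hypothetical, and rewriting the links to local paths is left as a manual step.

```shell
# Print every remote image URL used in Markdown image syntax: ![alt](url)
list_remote_images() {
  grep -oE '!\[[^]]*\]\(https?://[^) ]+' "$1" |
    sed -E 's/^!\[[^]]*\]\(//'
}

# Download each listed image into a local directory (requires curl).
# The local file name is the last path segment of the URL, so URLs with
# query strings or duplicate names would need extra handling.
download_images() {
  md_file=$1
  img_dir=$2
  mkdir -p "$img_dir"
  list_remote_images "$md_file" | while IFS= read -r url; do
    curl -fsSL -o "$img_dir/$(basename "$url")" "$url" || echo "failed: $url" >&2
  done
}

# Usage (file and directory names are placeholders):
#   list_remote_images article.md
#   download_images article.md images
```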
PDF → Markdown
This is the lowest-quality path of all, because PDF is essentially a set of drawing instructions for the page: the logical structure of the text (paragraphs, heading levels, reading order) is not directly stored in the file.
OCR Solution (Suitable for Scanned PDFs)
Scanned PDFs (scanned from paper documents) must use OCR:
- Adobe Acrobat Pro: Highest recognition accuracy, supports Chinese, relatively expensive
- Microsoft Office Lens: Free mobile app, scanning + OCR, outputs Word and then converts to Markdown
- Online OCR: ilovepdf.com, smallpdf.com offer free OCR conversion
Text Layer PDF Conversion
If the PDF contains selectable text (not a scanned version), you can use:
# pdftotext (part of the poppler toolkit)
pdftotext -layout input.pdf output.txt
# Note: this extracts plain text only; heading levels and formatting are lost
# pdf2md tool (preserves structure better)
npm install -g pdf2md-cli
pdf2md input.pdf > output.md
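Whether a PDF needs the OCR route or the text-layer route can be checked before picking a tool. Here is a minimal sketch of that check: the function name has_text_layer and the 50-byte default threshold are assumptions made for illustration, not values from any tool's documentation.

```shell
# Succeeds when the text on stdin contains at least MIN non-whitespace bytes.
# MIN defaults to 50, an arbitrary threshold chosen for this sketch.
has_text_layer() {
  min=${1:-50}
  count=$(( $(tr -d '[:space:]' | wc -c) ))
  [ "$count" -ge "$min" ]
}

# Usage (requires pdftotext from poppler; "-" sends the text to stdout):
#   if pdftotext input.pdf - | has_text_layer; then
#     echo "text layer found: extract the text directly"
#   else
#     echo "no usable text: treat as a scanned PDF and run OCR"
#   fi
```

A scanned PDF typically yields little more than page-break characters, which count as whitespace here, so it falls below the threshold.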
Realistic expectations: PDF → Markdown almost always requires manual correction, only suitable for "better than nothing" scenarios. For important documents, it is recommended to keep the source files (Word, Markdown originals).
Word/DOCX → Markdown
In enterprise environments, a large number of documents exist in Word format, and this path is commonly used when migrating to Markdown knowledge bases.
Pandoc (Recommended, Highest Quality)
# Basic conversion
pandoc input.docx -o output.md
# Extract images from the Word file into the media/ directory
pandoc input.docx -o output.md --extract-media=./media
# Batch conversion
for f in *.docx; do
  pandoc "$f" -o "${f%.docx}.md" --extract-media="./media/${f%.docx}"
done
Pandoc can correctly identify Word's heading styles (Heading 1 → #, Heading 2 → ##), and retain bold, italic, tables, and images.
Word Save As (Simple but Poor Quality)
In Word, go to File → Save As → Plain Text, which only retains plain text and loses all formatting. Not recommended as a migration solution.
Quality Ratings for Each Conversion Path
| Conversion Path | Format Fidelity | Operation Difficulty | Recommended Tools |
|---|---|---|---|
| Markdown → HTML | ★★★★★ | Low | Pandoc / Online Tools |
| Markdown → PDF | ★★★★☆ | Medium | Pandoc + XeLaTeX |
| HTML → Markdown | ★★★★☆ | Low | MagicTools / Turndown |
| Word → Markdown | ★★★★☆ | Low | Pandoc |
| Markdown → Word | ★★★☆☆ | Low | Pandoc |
| PDF → Markdown | ★★☆☆☆ | High | OCR + Manual Correction |
| PDF → Word | ★★★☆☆ | Low | Adobe / Online Tools |
FAQ
Q: What to do if the formatting is messed up after conversion?
A: Messy formatting usually has one of three causes. First, the source file is not cleanly structured (for example, blank lines used in place of paragraph styles in Word); standardize the source file before converting. Second, the conversion tool does not support certain features (special fonts, complex tables); try another tool. Third, encoding problems (garbled Chinese characters): Pandoc expects UTF-8 input and has no option for other encodings (--from selects the input format, not the encoding), so save the file as UTF-8 or convert it first with a tool such as iconv. For PDF → other formats, messy formatting is essentially normal and requires manual correction.
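Making sure a file is UTF-8 before conversion can be sketched as a small helper. This is a minimal sketch, assuming iconv is installed; the function name ensure_utf8, the file names, and the GB18030 source encoding in the usage line are assumptions, and the real source encoding has to be known or guessed by the caller.

```shell
# Print FILE as UTF-8. A file that is already valid UTF-8 is passed through
# unchanged; anything else is converted from SRC_ENCODING.
ensure_utf8() {
  file=$1
  src_encoding=$2
  if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
    cat "$file"
  else
    iconv -f "$src_encoding" -t UTF-8 "$file"
  fi
}

# Usage (GB18030 is an assumed source encoding for a legacy Chinese text file):
#   ensure_utf8 notes.md GB18030 | pandoc -f markdown -o notes.html
```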
Q: How to handle tables that are easily lost during conversion?
A: Tables are the most fragile element in format conversion. For HTML → Markdown, prefer MagicTools or Turndown (which needs its GFM plugin for tables), as they handle tables best. Tables lost from a PDF can rarely be recovered automatically and usually have to be rewritten by hand as Markdown tables. Word tables converted with Pandoc come through well, but merged cells lose their merge information. As a last resort, take a screenshot of the table and use the image instead. It is not elegant, but it beats broken formatting.
Q: Can free tools handle batch conversion?
A: Absolutely. Pandoc is a free and open-source tool, and you can write Shell scripts to batch process entire directories. Python's html2text and markdownify libraries also support batch invocation. For batch conversion of up to 100 files, command-line scripts combined with Pandoc are the most efficient solution, completely free, and allow precise control of output format. For over 1000 files, consider parallel processing (xargs -P 4 or Python's multiprocessing) to improve speed.
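The xargs -P 4 approach can be sketched as follows. This is a minimal sketch under stated assumptions: the function name convert_tree_parallel and the CONVERTER override variable are hypothetical, and find -print0 together with xargs -0 and -P are GNU/BSD extensions rather than POSIX features.

```shell
# Convert every .docx under DIR to Markdown next to its source file, four jobs at a time.
# CONVERTER is a hypothetical override hook read from the environment; it defaults to pandoc.
convert_tree_parallel() {
  dir=$1
  find "$dir" -type f -name '*.docx' -print0 |
    xargs -0 -P 4 -n 1 sh -c '
      [ -n "$1" ] || exit 0   # nothing to do when find matched no files
      "${CONVERTER:-pandoc}" "$1" -o "${1%.docx}.md" || echo "failed: $1" >&2
    ' _
}

# Usage (the directory name is a placeholder):
#   convert_tree_parallel ./word-docs
```

Passing file names NUL-separated keeps names with spaces intact, and reporting failures per file lets the remaining conversions continue.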
Summary
There is no silver bullet for document format conversion, but there are patterns to follow:
- High-frequency needs: Markdown ↔ HTML, Markdown → PDF, these paths have mature tools and reliable quality
- Migration needs: Word → Markdown, HTML → Markdown, Pandoc + online tools can handle most situations
- Hardest to convert: PDF → any format, be prepared for manual correction
The most important advice: keep the original files. Conversion is a temporary operation, and the original is the asset. Write in Markdown, manage versions with Git, and you can convert to whatever format you need at any time. That is the most flexible document workflow.
