How can doc/docx files be converted to markdown or structured text?

难免孤独 2021-01-29 21:45

Is there a program or workflow to convert .doc or .docx files to Markdown or similar text?

PS: Ideally, I would welcome the option that a spec

11 Answers
  • 2021-01-29 22:34

    You can use Word to Markdown (Ruby Gem) to convert it in one step. Conversion can be as simple as:

    $ gem install word-to-markdown
    $ w2m path/to/document.docx
    

    It routes the document through LibreOffice, but also does its best to mark up headings semantically based on their relative font size.

    There's also a hosted version, where converting is as simple as dragging and dropping the file.

  • 2021-01-29 22:34

    Word to Markdown might be worth a shot. Alternatively, try the procedure described here, which uses Calibre and Pandoc via HTMLZ. Here's the bash script they use:

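    #!/bin/bash
    # Convert a Word document to Markdown via Calibre (HTMLZ) and Pandoc.
    set -e                      # stop at the first failing step
    mkdir temp
    cp "$1" temp
    cd temp
    # The copy sits directly inside temp/, so refer to it by its base name
    ebook-convert "$(basename "$1")" output.htmlz
    unzip output.htmlz
    cd ..
    pandoc -f html -t markdown -o output.md temp/index.html
    rm -R temp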
    
  • 2021-01-29 22:35

    Pandoc supports conversion from docx to markdown directly:

    pandoc -f docx -t markdown foo.docx -o foo.markdown
    

    Several markdown formats are supported:

    -t gfm (GitHub-Flavored Markdown)  
    -t markdown_mmd (MultiMarkdown)  
    -t markdown (pandoc’s extended Markdown)  
    -t markdown_strict (original unextended Markdown)  
    -t markdown_phpextra (PHP Markdown Extra)  
    -t commonmark (CommonMark Markdown)  
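
    To convert a whole folder rather than a single file, the same command can be wrapped in a loop. This is only a minimal sketch: the function name `convert_all_docx` is made up for illustration, `gfm` is just one of the output formats listed above, and `--wrap=none` is an optional Pandoc flag that keeps each paragraph on one line.

    ```shell
    # Hypothetical helper: convert every .docx in a directory to GitHub-Flavored Markdown.
    # Usage: convert_all_docx [directory]   (defaults to the current directory)
    convert_all_docx() {
      dir=${1:-.}
      for f in "$dir"/*.docx; do
        # Skip the literal pattern when the directory holds no .docx files
        [ -e "$f" ] || continue
        # Write foo.md next to foo.docx
        pandoc -f docx -t gfm --wrap=none "$f" -o "${f%.docx}.md"
      done
    }
    ```

    Swap `gfm` for any of the other `-t` values if a different Markdown flavour is needed.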
    
  • 2021-01-29 22:40

    For bulleted lists, you can paste the list into Sublime Text and use multi-select (tested) or find and replace (not tested) to replace, e.g., the proprietary MS Word bullet characters with -, --, etc.

    This doesn't work with headings but it may be possible to use a similar technique with other elements.
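
    The same find-and-replace idea can also be tried outside the editor. The following is a rough sketch with `sed`, not something tested against real Word output: the function name `word_bullets_to_md` is invented here, and the set of bullet characters (•, ◦, ▪, ·) is an assumption about what a pasted list contains, so adjust it to whatever your document actually produces.

    ```shell
    # Hypothetical helper: turn pasted Word-style bullet lines into Markdown list items.
    # Reads text on stdin and writes the result to stdout.
    word_bullets_to_md() {
      # Replace a leading bullet character (plus surrounding whitespace) with "- ".
      # Leading indentation is dropped, so nested bullets end up at the top level.
      sed -E 's/^[[:space:]]*(•|◦|▪|·)[[:space:]]*/- /'
    }

    # Example use (file names are placeholders):
    #   word_bullets_to_md < pasted.txt > list.md
    ```

    Lines that do not start with one of the listed characters pass through unchanged.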

  • 2021-01-29 22:41

    From here:

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
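
    If the intermediate HTML file is not wanted, the two steps can be chained through a temporary directory. This is a sketch under assumptions: the function name `docx_to_md_via_html` is hypothetical, and it relies on unoconv's `-o` option to choose where the HTML is written.

    ```shell
    # Hypothetical helper: docx -> HTML (unoconv) -> Markdown (pandoc), without leftovers.
    # Usage: docx_to_md_via_html test.docx   (writes test.md in the current directory)
    docx_to_md_via_html() {
      src=$1
      base=$(basename "$src" .docx)
      workdir=$(mktemp -d) || return 1
      # Run pandoc only if unoconv succeeded
      unoconv -f html -o "$workdir/$base.html" "$src" &&
        pandoc -f html -t markdown -o "$base.md" "$workdir/$base.html"
      status=$?
      # Remove the intermediate HTML whether or not the conversion worked
      rm -rf "$workdir"
      return "$status"
    }
    ```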
    