How can doc/docx files be converted to markdown or structured text?

难免孤独 2021-01-29 21:45

Is there a program or workflow to convert .doc or .docx files to Markdown or similar text?

PS: Ideally, I would welcome the option that a spec

11 answers
  • 2021-01-29 22:17

    Here's an open-source web application built in Ruby to do this exact thing: https://word2md.com

  • 2021-01-29 22:17

    If you're using Linux, try Pandoc: first convert the .doc/.docx file to HTML with LibreOffice (or a similar tool), then run Pandoc on the resulting HTML.
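The two-step route described above can be scripted. What follows is only a minimal sketch, not a definitive implementation: it assumes LibreOffice's `soffice` binary and `pandoc` are both on the PATH, and the helper names `build_pipeline` and `convert` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_pipeline(source: Path, out_dir: Path) -> List[List[str]]:
    """Return the two commands: LibreOffice to HTML, then Pandoc to Markdown."""
    # LibreOffice names its output after the source file's stem.
    html_path = out_dir / (source.stem + ".html")
    md_path = out_dir / (source.stem + ".md")
    to_html = [
        "soffice", "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(source),
    ]
    to_markdown = [
        "pandoc", "-f", "html", "-t", "markdown",
        str(html_path), "-o", str(md_path),
    ]
    return [to_html, to_markdown]


def convert(source: Path, out_dir: Path) -> Path:
    """Run both steps; raises if a tool is missing or a step fails."""
    for tool in ("soffice", "pandoc"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for command in build_pipeline(source, out_dir):
        subprocess.run(command, check=True)
    return out_dir / (source.stem + ".md")
```

Calling `convert(Path("report.doc"), Path("out"))` would leave `out/report.html` and `out/report.md` behind. Keeping the command construction separate from the execution makes it easy to print the commands and try them by hand first.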

    On Windows (or if Pandoc doesn't work), you can try Markdownify, a website with an online demo that can also be downloaded.

  • 2021-01-29 22:29

    Mammoth is best known as a Word-to-HTML converter, but it now also includes a Markdown writer. When I last checked, Mammoth's Markdown support was still in its early stages, so you may find that some features are unsupported. As usual, check the website for the latest details.

    Install

    To use the JavaScript version, install Node.js and then install Mammoth:

    npm install -g mammoth
    

    Command line

    To convert a Word document to Markdown from the command line:

    mammoth document.docx --output-format=markdown
    

    API

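    To convert to Markdown with the Node.js API:

    var mammoth = require("mammoth");
    // convertToMarkdown returns a promise; the Markdown text is in result.value
    mammoth.convertToMarkdown({path: "path/to/document.docx"})
        .then(function (result) { console.log(result.value); });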
    

    Features:

    Mammoth Markdown writer currently supports:

    • Lists (numbered and bulleted)
    • Links
    • Font styles such as bold, italic
    • Images

    The Mammoth command line tools and API have been ported to several languages:

    Without Markdown support (as of May 2016):

    • .NET
    • Java/JVM
    • WordPress

    With Markdown support:

    • JavaScript
    • Python
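For the Python port, a rough equivalent of the Node.js call might look like the sketch below. It assumes the port is installed with `pip install mammoth`. The wrapper name `docx_to_markdown` is hypothetical, and printing the conversion messages is just one way to surface unsupported features.

```python
from pathlib import Path


def docx_to_markdown(docx_path: str) -> str:
    """Convert a .docx file to Markdown with the Python port of Mammoth."""
    path = Path(docx_path)
    if not path.is_file():
        raise FileNotFoundError(f"no such document: {docx_path}")

    # Imported lazily so the path check above works without Mammoth installed.
    import mammoth  # assumption: installed with `pip install mammoth`

    with path.open("rb") as docx_file:
        result = mammoth.convert_to_markdown(docx_file)

    # result.messages holds warnings about anything Mammoth could not convert.
    for message in result.messages:
        print(f"{message.type}: {message.message}")
    return result.value
```

A call such as `docx_to_markdown("path/to/document.docx")` returns the Markdown as a string, so the caller decides where to write it.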
  • 2021-01-29 22:29

    You can convert Word documents from within MS Word to Markdown using this Visual Basic Script:

    https://gist.github.com/hawkrives/2305254

    Follow the instructions under "To use the code" to create a new Macro in Word.

    Note: This converts the currently open Word document to Markdown, which removes all the Word formatting (headings, lists, etc.). First save the Word document you plan to convert, and then save it again as a new document before running the macro. This way you can always go back to the original Word document to make changes.

    There are more examples of Word to markdown VB scripts here:

    https://www.mediawiki.org/wiki/Microsoft_Word_Macros

  • 2021-01-29 22:33

    Options

    1. Use a Conversion Tool for multi-file conversion.
    2. Use a WYSIWYG Editor for single files and superior fonts.


    Which Conversion Tools?

    I've tested these three: (1) Pandoc, (2) Mammoth, (3) w2m


    Pandoc

    By far the superior tool for conversions with support for a multitude of file types (see Pandoc's man page for supported file types):

    pandoc -f docx -t gfm somedoc.docx -o somedoc.md
    


    NB
    • To get pandoc to export Markdown tables ('pipe_tables' in pandoc), use the MultiMarkdown (markdown_mmd) or gfm output format.

    • When converting to PDF, pandoc goes through LaTeX by default, so you may need to install a LaTeX distribution for your OS if the command does not work out of the box. Instructions at LaTeX Installation
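Since Pandoc is suggested here for multi-file conversion, the single-file command above can be wrapped in a small batch script. This is a hedged sketch rather than a tested tool: it assumes `pandoc` is on the PATH, writes each `.md` next to its source file, and the names `pandoc_command` and `convert_folder` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pandoc_command(source: Path, output_format: str = "gfm") -> List[str]:
    """Build the pandoc call for one .docx file; the .md lands next to it."""
    target = source.with_suffix(".md")
    return [
        "pandoc", "-f", "docx", "-t", output_format,
        str(source), "-o", str(target),
    ]


def convert_folder(folder: Path, output_format: str = "gfm") -> List[Path]:
    """Convert every .docx under folder and return the Markdown paths."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    written = []
    for source in sorted(folder.rglob("*.docx")):
        # Word creates "~$name.docx" lock files while a document is open.
        if source.name.startswith("~$"):
            continue
        subprocess.run(pandoc_command(source, output_format), check=True)
        written.append(source.with_suffix(".md"))
    return written
```

Passing `output_format="markdown_mmd"` to `convert_folder` switches the output to MultiMarkdown.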


    Which WYSIWYG Editors?

    Writage

    In answer to this specific question (docx --> markdown), use the Writage plugin for Microsoft Word. It also works the other way round (markdown --> docx).


    If you wish to preserve Unicode characters and emojis and keep superior fonts, you'll get some mileage from the editors below when copying and pasting between file formats. Note that these do not read or write docx natively.

    • Typora
    • iA Writer
    • Markdown Viewer for Chrome.


    Update: A4 vs US Letter

    Outside the US, set the geometry variable to get A4 output:

    pandoc -s -V geometry:a4paper -o outfile.pdf infile.md
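If the paper size needs to vary per document, the same command can be built in code. The sketch below is illustrative only. It assumes pandoc's default PDF engine, `pdflatex`, is the one in use, and the helper names `pdf_command` and `missing_tools` are hypothetical.

```python
import shutil
from typing import List


def pdf_command(markdown_file: str, pdf_file: str, paper: str = "a4paper") -> List[str]:
    """Build the pandoc call from the answer with a configurable paper size."""
    return [
        "pandoc", "-s",
        "-V", f"geometry:{paper}",
        "-o", pdf_file, markdown_file,
    ]


def missing_tools() -> List[str]:
    """List the required programs that are not on PATH."""
    # Assumption: pandoc's default PDF engine, pdflatex, is the one in use.
    return [tool for tool in ("pandoc", "pdflatex") if shutil.which(tool) is None]
```

Checking `missing_tools()` before running the command gives a clearer error than a failed pandoc run when LaTeX is not installed. Passing `paper="letterpaper"` produces US Letter output.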
    


    Footnote

    It's worth mentioning here, because it's not obvious when first discovering Markdown, that MultiMarkdown is by far the most feature-rich Markdown format. It supports, among other things, metadata, tables of contents, footnotes, maths, tables and YAML.

    GitHub's default format, however, is gfm (GitHub Flavored Markdown), which also supports tables. I use gfm for GitHub/GitLab and MultiMarkdown for everything else.

  • 2021-01-29 22:34

    Given that you asked this question on Stack Overflow, you probably want a programmatic or command-line solution, for which I've posted a separate answer.

    However, an alternative solution might be to use the Writage Markdown plugin for Microsoft Word.

    Writage turns Word into your Markdown WYSIWYG editor, so you will be able to open a Markdown file and edit it like you normally edit any document in Microsoft Word. Also it will be possible to save your Word document as a Markdown file without any other converters.

    Under the covers, Writage uses Pandoc, which you'll also need to install for the plugin to work.

    It currently supports the following Markdown elements:

    • Headings
    • Lists (numbered and bulleted)
    • Links
    • Font styles such as bold, italic
    • Tables
    • Footnotes

    This might be the ideal solution for many end users, as they won't need to install or run any command line tools and can instead stick with the tool they are most familiar with.
