I've been manually converting articles into Markdown syntax for a few days now, and it's getting rather tedious. Some of these are 3 or 4 pages, italics and other emphasized t
Pandoc is a good command-line conversion tool, but again, you will first need to get the input into a format that Pandoc can read, which is:
Have you tried this one? Not sure about feature richness, but it works for simple texts. http://markitdown.medusis.com/
As part of a university Ruby course I developed a tool that converts OpenOffice Writer files (.odt) to Markdown. A lot of assumptions have to be made to get the formatting right. For example, it is hard to determine what text size should be treated as a heading. However, the only thing you can lose in this conversion is the formatting: any text that is encountered is always appended to the Markdown document. The tool supports lists, bold and italic text, and it has syntax for tables.
http://github.com/bostko/doc2text Give it a try and please give me your feedback.
If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to HTML, and pandoc does a good job of converting the resulting HTML to Markdown:
$ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md
I have a script that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to markdown.
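For anyone who wants a starting point, here is a minimal sketch of what such a wrapper might look like. It is not the script mentioned above: the function names pick_route and to_markdown are made up for illustration, the routing is by file extension only (case-sensitive), and PDF handling is left out. The textutil and pandoc invocations are the same ones shown in the one-liner above.

```shell
#!/bin/sh
# Hypothetical helper: choose a conversion route from the file extension.
pick_route() {
  case "$1" in
    *.doc|*.docx|*.rtf) echo textutil ;;    # textutil -> HTML -> pandoc (macOS only)
    *.html|*.htm)       echo pandoc ;;      # pandoc reads HTML directly
    *)                  echo unsupported ;;
  esac
}

# Hypothetical helper: convert one file, writing the result next to the
# input (file.doc -> file.md).
to_markdown() {
  in=$1
  out="${in%.*}.md"
  case "$(pick_route "$in")" in
    textutil)
      textutil -convert html "$in" -stdout | pandoc -f html -t markdown -o "$out" ;;
    pandoc)
      pandoc -f html -t markdown -o "$out" "$in" ;;
    *)
      echo "don't know how to convert: $in" >&2
      return 1 ;;
  esac
}
```

After sourcing the file, something like to_markdown report.doc should leave report.md next to the original. Unknown extensions are rejected with a non-zero status rather than passed to pandoc blindly.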
If you're open to using the .docx format, you could use this PHP script that I put together. It extracts the XML, runs some XSL transformations, and outputs a pretty decent Markdown equivalent:
https://github.com/matb33/docx2md
Note that it is meant to work from the command-line, and is rather basic in its interface. However, it will get the job done!
If the script doesn't work well enough for you, I encourage you to send me your .docx files so I can reproduce your problem and fix it. Log an issue in GitHub or contact me directly if you prefer.
We had the same problem of having to convert Word documents to markdown. Some were more complicated and (very) large documents, with math equations and images and such. So I made this script which converts using a number of different tools: https://github.com/Versal/word2markdown
Because it uses a chain of several tools it is a bit more error-prone, but it can be a good starting point if you have more complicated documents. Hope it can be helpful! :)
Update: It currently only works on Mac OS X, and you need to have some requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). For it to work properly, you also need to open an empty Word document, and do: File->Save As Webpage->Compatibility->Encoding->UTF-8. Then this encoding is saved as default. See the README for more details on how to set up.
Then run this in the console:
$ git clone git@github.com:Versal/word2markdown.git
$ cd word2markdown
$ npm install
(copy over the Word files, for example, "document.docx")
$ ./doc-to-md.sh document.docx document_files > document.md
Then you can find the Markdown in document.md and images in the directory document_files.
It's perhaps a bit complicated now, so I would welcome any contributions that make this easier or make this work on other operating systems! :)
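If you have a whole folder of Word files, the single-file command above can be wrapped in a loop. This is only a sketch: convert_all is a made-up name, it assumes you run it from inside the word2markdown checkout with the .docx files copied there, and it assumes doc-to-md.sh behaves the same for every file as it does in the example above.

```shell
# Hypothetical helper: run doc-to-md.sh on every .docx in the current
# directory, producing NAME.md and NAME_files for each NAME.docx.
convert_all() {
  script=${1:-./doc-to-md.sh}      # path to the converter; defaults to the repo script
  for doc in *.docx; do
    [ -e "$doc" ] || continue      # no .docx files: the glob stays literal, so skip it
    base=${doc%.docx}
    "$script" "$doc" "${base}_files" > "${base}.md"
  done
}
```

Calling convert_all with no arguments uses ./doc-to-md.sh. Passing a different path as the first argument lets you point it at a copy of the script elsewhere.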