Quickly Convert (.rtf|.doc) Files to Markdown Syntax with PHP

北城余情 提交于 2019-12-02 13:53:22
David

If you happen to be on a mac, textutil does a good job of converting doc, docx, and rtf to html, and pandoc does a good job of converting the resulting html to markdown:

$ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md

I have a script that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to markdown.

Taj Moore

ProgTips has a possible solution with a Word macro (source download):

A simple macro (source download) for converting the most trivial things automatically. This macro does:

  • Replace bold and italics
  • Replace headings (marked heading 1-6)
  • Replace numbered and bulleted lists

It's very buggy, I believe it hangs on larger documents, however I'm NOT stating it's a stable release anyway! :-) Experimental use only, recode and reuse it as you like, post a comment if you've found a better solution.

Source: ProgTips

Macro source

Installation

  • open WinWord,
  • press Alt+F11 to open the VBA editor,
  • right click the first project in the project browser
  • choose insert->module
  • paste the code from the file
  • close macro editor
  • go tools>macro>macros; run the macro named MarkDown

Source: ProgTips

Source

Macro source for safe keeping if ProgTips deletes the post or the site gets wiped out:

'*** A simple MsWord->Markdown replacement macro by Kriss Rauhvargers, 2006.02.02.
'*** This tool does NOT implement all the markup specified in MarkDown definition by John Gruber, only
'*** the most simple things. These are:
'*** 1) Replaces all non-list paragraphs to ^p paragraph so MarkDown knows it is a stand-alone paragraph
'*** 2) Converts tables to text. In fact, tables get lost.
'*** 3) Adds a single indent to all indented paragraphs
'*** 4) Replaces all the text in italics to _text_
'*** 5) Replaces all the text in bold to **text**
'*** 6) Replaces Heading1-6 to #..#Heading (Heading numbering gets lost)
'*** 7) Replaces bulleted lists with ^p *  listitem ^p*  listitem2...
'*** 8) Replaces numbered lists with ^p 1. listitem ^p2.  listitem2...
'*** Feel free to use and redistribute this code
Sub MarkDown()
    Dim bReplace As Boolean
    Dim i As Integer
    Dim oPara As Paragraph


    'remove formatting from paragraph sign so that we dont get **blablabla^p** but rather **blablabla**^p
    Call RemoveBoldEnters


    For i = Selection.Document.Tables.Count To 1 Step -1
            Call Selection.Document.Tables(i).ConvertToText
    Next

    'simple text indent + extra paragraphs for non-numbered paragraphs
    For i = Selection.Document.Paragraphs.Count To 1 Step -1
        Set oPara = Selection.Document.Paragraphs(i)
        If oPara.Range.ListFormat.ListType = wdListNoNumbering Then
            If oPara.LeftIndent > 0 Then
                oPara.Range.InsertBefore (">")
            End If
            oPara.Range.InsertBefore (vbCrLf)
        End If


    Next

    'italic -> _italic_
    Selection.HomeKey Unit:=wdStory
    bReplace = ReplaceOneItalic  'first replacement
    While bReplace 'other replacements
        bReplace = ReplaceOneItalic
    Wend

    'bold-> **bold**
    Selection.HomeKey Unit:=wdStory
    bReplace = ReplaceOneBold 'first replacement
    While bReplace
        bReplace = ReplaceOneBold 'other replacements
    Wend



    'Heading -> ##heading
    For i = 1 To 6 'heading1 to heading6
        Selection.HomeKey Unit:=wdStory
        bReplace = ReplaceH(i) 'first replacement
        While bReplace
            bReplace = ReplaceH(i) 'other replacements
        Wend
    Next

    Call ReplaceLists


    Selection.HomeKey Unit:=wdStory
End Sub


'***************************************************************
' Function to replace bold with _bold_, only the first occurance
' Returns true if any occurance found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'***************************************************************
Function ReplaceOneBold() As Boolean
    Dim bReturn As Boolean

    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Font.Bold = True
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With

    bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Text = "**" & Selection.Text & "**"
        Selection.Font.Bold = False
        Selection.Find.Execute
    Wend

    ReplaceOneBold = bReturn
End Function

'*******************************************************************
' Function to replace italic with _italic_, only the first occurance
' Returns true if any occurance found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'********************************************************************
Function ReplaceOneItalic() As Boolean
    Dim bReturn As Boolean

        Selection.Find.ClearFormatting

    With Selection.Find
        .Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Font.Italic = True
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With

    bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Text = "_" & Selection.Text & "_"
        Selection.Font.Italic = False
        Selection.Find.Execute
    Wend
    ReplaceOneItalic = bReturn
End Function

'*********************************************************************
' Function to replace headingX with #heading, only the first occurance
' Returns true if any occurance found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'*********************************************************************
Function ReplaceH(ByVal ipNumber As Integer) As Boolean
    Dim sReplacement As String

    Select Case ipNumber
    Case 1: sReplacement = "#"
    Case 2: sReplacement = "##"
    Case 3: sReplacement = "###"
    Case 4: sReplacement = "####"
    Case 5: sReplacement = "#####"
    Case 6: sReplacement = "######"
    End Select

    Selection.Find.ClearFormatting
    Selection.Find.Style = ActiveDocument.Styles("Heading " & ipNumber)
    With Selection.Find
        .Text = ""
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With


     bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Range.InsertBefore (vbCrLf & sReplacement & " ")
        Selection.Style = ActiveDocument.Styles("Normal")
        Selection.Find.Execute
    Wend

    ReplaceH = bReturn
End Function



'***************************************************************
' A fix-up for paragraph marks that ar are bold or italic
'***************************************************************
Sub RemoveBoldEnters()
    Selection.HomeKey Unit:=wdStory
    Selection.Find.ClearFormatting
    Selection.Find.Font.Italic = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Bold = False
    Selection.Find.Replacement.Font.Italic = False
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    Selection.HomeKey Unit:=wdStory
    Selection.Find.ClearFormatting
    Selection.Find.Font.Bold = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Bold = False
    Selection.Find.Replacement.Font.Italic = False
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

'***************************************************************
' Function to replace bold with _bold_, only the first occurance
' Returns true if any occurance found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'***************************************************************
Sub ReplaceLists()
    Dim i As Integer
    Dim j As Integer
    Dim Para As Paragraph

    Selection.HomeKey Unit:=wdStory

    'iterate through all the lists in the document
    For i = Selection.Document.Lists.Count To 1 Step -1
        'check each paragraph in the list
        For j = Selection.Document.Lists(i).ListParagraphs.Count To 1 Step -1
            Set Para = Selection.Document.Lists(i).ListParagraphs(j)
            'if it's a bulleted list
            If Para.Range.ListFormat.ListType = wdListBullet Then
                        Para.Range.InsertBefore (ListIndent(Para.Range.ListFormat.ListLevelNumber, "*"))
            'if it's a numbered list
            ElseIf Para.Range.ListFormat.ListType = wdListSimpleNumbering Or _
                                                    wdListMixedNumbering Or _
                                                    wdListListNumOnly Then
                Para.Range.InsertBefore (Para.Range.ListFormat.ListValue & ".  ")
            End If
        Next j
        'inserts paragraph marks before and after, removes the list itself
        Selection.Document.Lists(i).Range.InsertParagraphBefore
        Selection.Document.Lists(i).Range.InsertParagraphAfter
        Selection.Document.Lists(i).RemoveNumbers
    Next i
End Sub

'***********************************************************
' Returns the MarkDown indent text
'***********************************************************
Function ListIndent(ByVal ipNumber As Integer, ByVal spChar As String) As String
    Dim i  As Integer
    For i = 1 To ipNumber - 1
        ListIndent = ListIndent & "    "
    Next
    ListIndent = ListIndent & spChar & "    "
End Function

Source: ProgTips

If you're open to using the .docx format, you could use this PHP script that I put together that will extract the XML, run some XSL transformations and output a pretty decent Markdown equivalent:

https://github.com/matb33/docx2md

Note that it is meant to work from the command-line, and is rather basic in its interface. However, it will get the job done!

If the script doesn't work well enough for you, I encourage you to send me your .docx files so I can reproduce your problem and fix it. Log an issue in GitHub or contact me directly if you prefer.

Pandoc is a good command-line conversion tool, but again, you will first need to get the input into a format that Pandoc can read, which is:

  • markdown
  • reStructuredText
  • textile
  • HTML
  • LaTeX

We had the same problem of having to convert Word documents to markdown. Some were more complicated and (very) large documents, with math equations and images and such. So I made this script which converts using a number of different tools: https://github.com/Versal/word2markdown

Because it uses a chain of several tools it is a bit more error-prone, but it can be a good starting point if you have more complicated documents. Hope it can be helpful! :)

Update: It currently only works on Mac OS X, and you need to have some requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). For it to work properly, you also need to open an empty Word document, and do: File->Save As Webpage->Compatibility->Encoding->UTF-8. Then this encoding is saved as default. See the README for more details on how to set up.

Then run this in the console:

$ git clone git@github.com:Versal/word2markdown.git
$ cd word2markdown
$ npm install
(copy over the Word files, for example, "document.docx")
$ ./doc-to-md.sh document.docx document_files > document.md

Then you can find the Markdown in document.md and images in the directory document_files.

It's perhaps a bit complicated now, so I would welcome any contributions that make this easier or make this work on other operating systems! :)

Have you tried this one? Not sure about feature richness, but it works for simple texts. http://markitdown.medusis.com/

As part of the university ruby course I developed a tool which can convert openoffice word files (.odt) to markdown. A lot of assumptions has to be made in order to turn it to correct formatting. For example it is hard to determine the size of a text which has to be considered as Heading. However the only think that you can loose with this conversion is the formatting any text that is met is always appends to the markdown document. The tool I've developed supports lists, bold and italic text, and it has syntax for tables.

http://github.com/bostko/doc2text Give it a try and please give me your feedback.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!