Rich text format (with formatting tags) in Excel to unformatted text

后端 未结 4 1271
天涯浪人
天涯浪人 2020-12-10 20:32

I have approx. 12000 cells in excel containing RTF (including formatting tags). I need to parse them to get to the unformatted text.

This is the example of one of th

相关标签:
4条回答
  • 2020-12-10 21:12

    If you did want to go down the route of using Word to parse the text, this function should help you out. As the comments suggest, you'll need a reference to the MS Word Object Library.

    Function ParseRTF(strRTF As String) As String
    Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'
    Dim f     As Integer       'Variable to store the file I/O number'
    
    'File path for a temporary .rtf file'
    Const strFileTemp = "C:\TempFile_ParseRTF.rtf"
    
    'Obtain the next valid file I/O number'
    f = FreeFile
    
    'Open the temp file and save the RTF string in it'
    Open strFileTemp For Output As #f
        Print #f, strRTF
    Close #f
    
    'Open the .rtf file as a Word.Document'
    Set wdDoc = GetObject(strFileTemp)
    
    'Read the now parsed text from the Word.Document'
    ParseRTF = wdDoc.Range.Text
    
    'Delete the temporary .rtf file'
    Kill strFileTemp
    
    'Close the Word connection'
    wdDoc.Close False
    Set wdDoc = Nothing
    End Function
    

    You could call it for each of your 12,000 cells using something similar to this:

    Sub ParseAllRange()
    Dim rngCell As Range
    Dim strRTF  As String
    
    For Each rngCell In Range("A1:A12000")
    
        'Parse the cell contents'
        strRTF = ParseRTF(CStr(rngCell))
    
        'Output to the cell one column over'
        rngCell.Offset(0, 1) = strRTF
    Next
    End Sub
    

    The ParseRTF function takes about a second to run (on my machine at least), so for 12,000 cells this will work out at about three and a half hours.


    Having thought about this problem over the weekend, I was sure there was a better (quicker) solution for this.

    I remembered the RTF capabilities of the clipboard, and realised that a class could be created that would copy RTF data to the clipboard, paste to a word doc, and output the resulting plain text. The benefit of this solution is that the word doc object would not have to be opened and closed for each rtf string; it could be opened before the loop and closed after.

    Below is the code to achieve this. It is a Class module named clsRTFParser.

    Private Declare Function GlobalAlloc Lib "kernel32" _
                    (ByVal wFlags&, ByVal dwBytes As Long) As Long
    Private Declare Function GlobalLock Lib "kernel32" _
                    (ByVal hMem As Long) As Long
    Private Declare Function GlobalUnlock Lib "kernel32" _
                    (ByVal hMem As Long) As Long
    Private Declare Function lstrcpy Lib "kernel32" _
                    (ByVal lpString1 As Any, ByVal lpString2 As Any) As Long
    
    Private Declare Function OpenClipboard Lib "user32" _
                    (ByVal Hwnd As Long) As Long
    Private Declare Function EmptyClipboard Lib "user32" () As Long
    Private Declare Function RegisterClipboardFormat Lib "user32" Alias _
                    "RegisterClipboardFormatA" (ByVal lpString As String) As Long
    Private Declare Function SetClipboardData Lib "user32" _
                    (ByVal wFormat As Long, ByVal hMem As Long) As Long
    Private Declare Function CloseClipboard Lib "user32" () As Long
    
    '---'
    
    Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'
    
    Private Sub Class_Initialize()
    Set wdDoc = New Word.Document
    End Sub
    
    Private Sub Class_Terminate()
    wdDoc.Close False
    Set wdDoc = Nothing
    End Sub
    
    '---'
    
    Private Function CopyRTF(strCopyString As String) As Boolean
    Dim hGlobalMemory  As Long
    Dim lpGlobalMemory As Long
    Dim hClipMemory    As Long
    Dim lngFormatRTF   As Long
    
    'Allocate and copy string to memory'
    hGlobalMemory = GlobalAlloc(&H42, Len(strCopyString) + 1)
    lpGlobalMemory = GlobalLock(hGlobalMemory)
    lpGlobalMemory = lstrcpy(lpGlobalMemory, strCopyString)
    
    'Unlock the memory and then copy to the clipboard'
    If GlobalUnlock(hGlobalMemory) = 0 Then
        If OpenClipboard(0&) <> 0 Then
            Call EmptyClipboard
    
            'Save the data as Rich Text Format'
            lngFormatRTF = RegisterClipboardFormat("Rich Text Format")
            hClipMemory = SetClipboardData(lngFormatRTF, hGlobalMemory)
    
            CopyRTF = CBool(CloseClipboard)
        End If
    End If
    End Function
    
    '---'
    
    Private Function PasteRTF() As String
    Dim strOutput As String
    
    'Paste the clipboard data to the wdDoc and read the plain text result'
    wdDoc.Range.Paste
    strOutput = wdDoc.Range.Text
    
    'Get rid of the new lines at the beginning and end of the document'
    strOutput = Left(strOutput, Len(strOutput) - 2)
    strOutput = Right(strOutput, Len(strOutput) - 2)
    
    PasteRTF = strOutput
    End Function
    
    '---'
    
    Public Function ParseRTF(strRTF As String) As String
    If CopyRTF(strRTF) Then
        ParseRTF = PasteRTF
    Else
        ParseRTF = "Error in copying to clipboard"
    End If
    End Function
    

    You could call it for each of your 12,000 cells using something similar to this:

    Sub CopyParseAllRange()
    Dim rngCell As Range
    Dim strRTF  As String
    
    'Create new instance of clsRTFParser'
    Dim RTFParser As clsRTFParser
    Set RTFParser = New clsRTFParser
    
    For Each rngCell In Range("A1:A12000")
    
        'Parse the cell contents'
        strRTF = RTFParser.ParseRTF(CStr(rngCell))
    
        'Output to the cell one column over'
        rngCell.Offset(0, 1) = strRTF
    Next
    End Sub
    

    I have simulated this using example RTF strings on my machine. For 12,000 cells it took two and a half minutes, a much more reasonable time frame!

    0 讨论(0)
  • 2020-12-10 21:16

    Your post made it sound as if each RTF document was stored in a single Excell cell. If so, then

    Solution using .Net Framework RichTextBox control

    will convert the RTF in each cell to plain text in 2 lines of code (after a little system configuration to get the right .tlb file to allow reference to the .Net Framework). Put the cell value in rtfsample and

    Set miracle = New System_Windows_Forms.RichTextBox
    With miracle
        .RTF = rtfText
        PlainText = .TEXT
    End With
    
    0 讨论(0)
  • 2020-12-10 21:19

    Some of the solutions here require a reference to the MS Word Object Library. Playing with the cards I am dealt, I found a solution that does not rely on it. It strips RTF tags, and other fluff like font tables and stylesheets, all in VBA. It might be helpful to you. I ran it across your data, and other than the whitespace, I get the same output as what you expected.

    Here is the code.

    First, something to check if a string is alphanumeric or not. Give it a string that's one character long. This function is used to work out delimitation here and there.

    Public Function Alphanumeric(Character As String) As Boolean
       If InStr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-", Character) Then
           Alphanumeric = True
       Else
           Alphanumeric = False
       End If
    End Function
    

    Next up is to remove an entire group. I use this to remove font tables and other rubbish.

    Public Function RemoveGroup(RTFString As String, GroupName As String) As String
        Dim I As Integer
        Dim J As Integer
        Dim Count As Integer
    
        I = InStr(RTFString, "{\" & GroupName)
    
        ' If the group was not found in the RTF string, then just return that string unchanged.
        If I = 0 Then
            RemoveGroup = RTFString
            Exit Function
        End If
    
        ' Otherwise, we will need to scan along, from the start of the group, until we find the end of the group.
        ' The group is delimited by { and }. Groups may be nested, so we need to count up if we encounter { and
        ' down if we encounter }. When that count reaches zero, then the end of the group has been found.
        J = I
        Do
            If Mid(RTFString, J, 1) = "{" Then Count = Count + 1
            If Mid(RTFString, J, 1) = "}" Then Count = Count - 1
            J = J + 1
        Loop While Count > 0
    
        RemoveGroup = Replace(RTFString, Mid(RTFString, I, J - I), "")
    
    End Function
    

    Okay, and this function removes any tags.

    Public Function RemoveTags(RTFString As String) As String
        Dim L As Long
        Dim R As Long
        L = 1
        ' Search to the end of the string.
        While L < Len(RTFString)
            ' Append anything that's not a tag to the return value.
            While Mid(RTFString, L, 1) <> "\" And L < Len(RTFString)
                RemoveTags = RemoveTags & Mid(RTFString, L, 1)
                L = L + 1
            Wend
        
            'Search to the end of the tag.
            R = L + 1
            While Alphanumeric(Mid(RTFString, R, 1)) And R < Len(RTFString)
                R = R + 1
            Wend
            L = R
        Wend
    End Function
    

    We can remove curly braces in the obvious way:

    Public Function RemoveBraces(RTFString As String) As String
        RemoveBraces = Replace(RTFString, "{", "")
        RemoveBraces = Replace(RemoveBraces, "}", "")
    End Function
    

    Once you have the functions above copy-pasted into your module, you can create a function that uses them to strip away any stuff you don't need or want. The following works perfectly in my case.

    Public Function RemoveTheFluff(RTFString As String) As String
        RemoveTheFluff = Replace(RTFString, vbCrLf, "")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "fonttbl")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "colortbl")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "stylesheet")
        RemoveTheFluff = RemoveTags(RemoveBraces(RemoveTheFluff))
    End Function
    

    I hope this helps. I wouldn't use it in a word processor or anything, but it might do for scraping data if that's what you're doing.

    0 讨论(0)
  • 2020-12-10 21:20

    You can try to parse every cell with regular expression and leave only the content you need.

    Every RTF control code start with "\" and ends with space, without any additional space between. "{}" are use for grouping. If your text won't contain any, you can just remove them (the same for ";"). So now you stay with your text and some unnecessary words as "Arial", "Normal" etc. You can build the dictionary to remove them also. After some tweaking, you will stay with only the text you need.

    Look at http://www.regular-expressions.info/ for more information and great tool to write RegExp's (RegexBuddy - unfortunately it isn't free, but it's worth the money. AFAIR there is also trial).

    UPDATE: Of course, I don't encourage you to do it manually for every cell. Just iterate through active range: Refer this thread: SO: About iterating through cells in VBA

    Personally, I'll give a try to this idea:

    Sub Iterate()
       For Each Cell in ActiveSheet.UsedRange.Cells
          'Do something
       Next
    End Sub
    

    And how to use RegExp's in VBA (Excel)?

    Refer: Regex functions in Excel and Regex in VBA

    Basically you've to use VBScript.RegExp object through COM.

    0 讨论(0)
提交回复
热议问题