Rich text format (with formatting tags) in Excel to unformatted text

后端 未结 4 1270
天涯浪人
天涯浪人 2020-12-10 20:32

I have approx. 12000 cells in excel containing RTF (including formatting tags). I need to parse them to get to the unformatted text.

This is the example of one of th

4条回答
  •  囚心锁ツ
    2020-12-10 21:19

    Some of the solutions here require a reference to the MS Word Object Library. Playing with the cards I am dealt, I found a solution that does not rely on it. It strips RTF tags, and other fluff like font tables and stylesheets, all in VBA. It might be helpful to you. I ran it across your data, and other than the whitespace, I get the same output as what you expected.

    Here is the code.

    First, something to check if a string is alphanumeric or not. Give it a string that's one character long. This function is used to work out delimitation here and there.

    Public Function Alphanumeric(Character As String) As Boolean
       If InStr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-", Character) Then
           Alphanumeric = True
       Else
           Alphanumeric = False
       End If
    End Function
    

    Next up is to remove an entire group. I use this to remove font tables and other rubbish.

    Public Function RemoveGroup(RTFString As String, GroupName As String) As String
        Dim I As Integer
        Dim J As Integer
        Dim Count As Integer
    
        I = InStr(RTFString, "{\" & GroupName)
    
        ' If the group was not found in the RTF string, then just return that string unchanged.
        If I = 0 Then
            RemoveGroup = RTFString
            Exit Function
        End If
    
        ' Otherwise, we will need to scan along, from the start of the group, until we find the end of the group.
        ' The group is delimited by { and }. Groups may be nested, so we need to count up if we encounter { and
        ' down if we encounter }. When that count reaches zero, then the end of the group has been found.
        J = I
        Do
            If Mid(RTFString, J, 1) = "{" Then Count = Count + 1
            If Mid(RTFString, J, 1) = "}" Then Count = Count - 1
            J = J + 1
        Loop While Count > 0
    
        RemoveGroup = Replace(RTFString, Mid(RTFString, I, J - I), "")
    
    End Function
    

    Okay, and this function removes any tags.

    Public Function RemoveTags(RTFString As String) As String
        Dim L As Long
        Dim R As Long
        L = 1
        ' Search to the end of the string.
        While L < Len(RTFString)
            ' Append anything that's not a tag to the return value.
            While Mid(RTFString, L, 1) <> "\" And L < Len(RTFString)
                RemoveTags = RemoveTags & Mid(RTFString, L, 1)
                L = L + 1
            Wend
        
            'Search to the end of the tag.
            R = L + 1
            While Alphanumeric(Mid(RTFString, R, 1)) And R < Len(RTFString)
                R = R + 1
            Wend
            L = R
        Wend
    End Function
    

    We can remove curly braces in the obvious way:

    Public Function RemoveBraces(RTFString As String) As String
        RemoveBraces = Replace(RTFString, "{", "")
        RemoveBraces = Replace(RemoveBraces, "}", "")
    End Function
    

    Once you have the functions above copy-pasted into your module, you can create a function that uses them to strip away any stuff you don't need or want. The following works perfectly in my case.

    Public Function RemoveTheFluff(RTFString As String) As String
        RemoveTheFluff = Replace(RTFString, vbCrLf, "")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "fonttbl")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "colortbl")
        RemoveTheFluff = RemoveGroup(RemoveTheFluff, "stylesheet")
        RemoveTheFluff = RemoveTags(RemoveBraces(RemoveTheFluff))
    End Function
    

    I hope this helps. I wouldn't use it in a word processor or anything, but it might do for scraping data if that's what you're doing.

提交回复
热议问题