Cleaning up RTF text

谎友^ 2021-01-02 06:38

I'd like to take some RTF input and clean it to remove all RTF formatting except \ul, \b, and \i, so that I can paste it into Word with minimal formatting information.

The command used

4 answers
  • 2021-01-02 07:06

    I would use a hidden RichTextBox: set its Rtf member, then read its Text member to strip the RTF in a well-supported way. Then I would manually inject the desired formatting afterwards.

  • 2021-01-02 07:13

    Regex it. It won't parse absolutely everything correctly (tables, for example), but it does the job in most cases.

    string unformatted = Regex.Replace(rtfString, @"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?", "");
    

    Magic =)
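
A minimal sketch of the same idea, ported to Python for illustration (the answer itself is C#). It reuses the answer's pattern and adds a negative lookahead so that \ul, \b, and \i survive, which is what the question asks for. The names `KEEP`, `RTF_STRIP`, and `strip_rtf_keep_basic`, and the choice to also keep the "off" forms (\ulnone, \ul0, \b0, \i0), are assumptions made here, not part of the original answer.

```python
import re

# Control words to keep: \ul, \b, \i and their "off" forms (\ulnone, \ul0, \b0, \i0).
# The trailing lookahead stops "b" from matching the start of e.g. \blue.
KEEP = r"(?:ulnone|ul0?|b0?|i0?)(?![A-Za-z0-9])"

RTF_STRIP = re.compile(
    r"\{\*?\\[^{}]+\}"      # brace groups without nested braces (font entries, \* destinations)
    r"|[{}]"                # any remaining braces
    r"|\\(?!" + KEEP + r")\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?"  # every other control word
)

def strip_rtf_keep_basic(rtf):
    """Remove RTF markup except the underline, bold, and italic control words."""
    return RTF_STRIP.sub("", rtf)

sample = (r"{\rtf1\ansi{\fonttbl{\f0\fnil Arial;}}\viewkind4\uc1\pard\f0\fs20 "
          r"Hello \b bold\b0  and \i italic\i0\par}")
print(strip_rtf_keep_basic(sample))
```

The output is an RTF fragment rather than a complete document, so it would presumably need to be wrapped in an `{\rtf1 ...}` group again before Word accepts it as RTF. Like the original pattern, this also deletes an entire un-nested group such as `{\b bold}`, text included.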

  • 2021-01-02 07:27

    I'd do something like the following:

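    ' Caveat: a plain Replace also hits longer control words that start the same
    ' way (e.g. "\b" inside "\blue", "\ul" inside "\ulnone"), so this is only
    ' safe on simple input.
    Dim unformatedtext As String
    
    ' Swap the control words to keep for placeholders that the RichTextBox treats as plain text.
    someRTFtext = Replace(someRTFtext, "\ul", "[ul]")
    someRTFtext = Replace(someRTFtext, "\b", "[b]")
    someRTFtext = Replace(someRTFtext, "\i", "[i]")
    
    ' Let the RichTextBox parse the RTF and hand back the plain text.
    Dim RTFConvert As RichTextBox = New RichTextBox
    RTFConvert.Rtf = someRTFtext
    unformatedtext = RTFConvert.Text
    
    ' Turn the placeholders back into control words.
    unformatedtext = Replace(unformatedtext, "[ul]", "\ul")
    unformatedtext = Replace(unformatedtext, "[b]", "\b")
    unformatedtext = Replace(unformatedtext, "[i]", "\i")
    
    ' Put the result on the clipboard and paste it into Word.
    Clipboard.SetText(unformatedtext)
    
    oWord.ActiveDocument.ActiveWindow.Selection.PasteAndFormat(0)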
    
  • 2021-01-02 07:30

    You can strip out the tags with regular expressions. Just make sure that your expressions do not filter out tags that were actually text. If the text had "\b" in its body, it would appear as "\\b" in the RTF stream, because a literal backslash is escaped by doubling it. In other words, you would match on "\b" but not "\\b".
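
One way to make that distinction, sketched in Python for illustration: scan with a pattern that consumes an escaped backslash pair first, so that a `\b` following it is never mistaken for a control word. The names `TOKEN` and `find_bold_control_words` are hypothetical, and only the bold control word is handled here.

```python
import re

# Either an escaped backslash pair (consumed and ignored), or a real \b control
# word with an optional numeric parameter. The lookahead rejects longer words
# such as \blue.
TOKEN = re.compile(r"\\\\|(?P<bold>\\b(?![A-Za-z])\d*)")

def find_bold_control_words(rtf):
    # Return (offset, text) for each real bold control word in the stream.
    return [
        (m.start("bold"), m.group("bold"))
        for m in TOKEN.finditer(rtf)
        if m.group("bold") is not None
    ]

sample = r"\b bold\b0  path C:\\bin \blue"
print(find_bold_control_words(sample))
```

Because the alternation tries the escaped pair first at each position, the literal text `C:\bin` (stored as `C:\\bin`) is skipped, and `\blue` is rejected by the lookahead.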

    You could probably take a shortcut and filter out the RTF header tags. Look for the first occurrence of "\viewkind4" in the input, then read ahead to the first space character. Remove all of the characters from the start of the text up to and including that space character. That strips out the RTF header information (fonts, colors, etc.).
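
The shortcut above can be sketched as follows, again in Python for illustration. The function name `strip_rtf_header` is hypothetical, and returning the input unchanged when the marker or the space is missing is an assumption; the answer does not say what to do in that case.

```python
def strip_rtf_header(rtf, marker="\\viewkind4"):
    # Locate the marker that ends the header in this kind of RTF input.
    start = rtf.find(marker)
    if start == -1:
        return rtf  # marker absent: leave the input untouched (assumption)
    # Read ahead to the first space after the marker.
    space = rtf.find(" ", start)
    if space == -1:
        return rtf  # no space after the marker: nothing safe to cut (assumption)
    # Drop everything up to and including that space.
    return rtf[space + 1:]

sample = (r"{\rtf1\ansi{\fonttbl{\f0\fnil Arial;}}\viewkind4\uc1\pard\f0\fs20 "
          r"Hello \b bold\b0\par}")
print(strip_rtf_header(sample))
```

This removes only the header. Control words in the body and the closing brace are left in place, so it would still need to be combined with one of the stripping approaches from the other answers.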
