Question
I was looking for a way to extract the plain text from an RTF string, and I found the following regex:
({\\)(.+?)(})|(\\)(.+?)(\b)
However, the resulting string still contains two stray closing curly braces "}":
Before: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }
After: } can u send me info for the call pls }
Any thoughts on how to improve the regex?
Edit: A more complicated string such as this one does not work: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\test\\myapp\\Apps\\\{3423234-283B-43d2-BCE6-A324B84CC70E\}\par }
Answer 1:
In RTF, { and } mark a group. Groups can be nested. \ marks the beginning of a control word. A control word ends with either a space or a non-alphabetic character. A control word can be followed by a numeric parameter, with no delimiter in between. Some control words also take text parameters, separated by ';'. Those control words are usually in their own groups.
I think I have managed to make a pattern that takes care of most of the cases.
\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?
It leaves a few spaces when run on your sample string, though.
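As a quick check of the pattern above, here is a minimal Python sketch that applies it to the first sample string from the question. The names RTF_PATTERN, SAMPLE and strip_rtf_regex are made up for this illustration; only the regex itself comes from the answer.

```python
import re

# Pattern from the answer above, copied verbatim.
RTF_PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# First sample string from the question, split only for readability.
SAMPLE = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }"
)


def strip_rtf_regex(rtf):
    """Delete everything the pattern recognises as RTF markup."""
    return RTF_PATTERN.sub("", rtf)


result = strip_rtf_regex(SAMPLE)
print(repr(result))          # the spaces between the header groups survive
print(repr(result.strip()))  # stripping the ends removes them
```

The leftover spaces are the ones that sit between the header groups in the input, which is why trimming the result is enough for this sample.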
Going through (some of) the RTF specification, I see that there are a lot of pitfalls for purely regex-based strippers. The most obvious one is that some groups should be ignored (headers, footers, etc.), while others should be rendered (formatting).
I have written a Python script that should work better than my regex above:
import re

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3: chr() already covers the full Unicode range

def striprtf(text):
    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # control words which specify a "destination".
    destinations = frozenset((
        'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
        'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
        'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
        'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
        'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
        'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
        'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
        'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
        'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
        'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
        'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
        'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
        'listoverridetable','listpicture','liststylename','listtable','listtext',
        'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
        'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
        'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
        'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
        'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
        'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
        'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
        'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
        'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
        'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
        'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
        'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
        'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
        'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
        'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
        'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
        'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
        'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
        'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
        'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
        'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
        'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
        'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
        'svb','tc','template','themedata','title','txe','ud','upr','userprops',
        'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
        'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
        'xmlopen',
    ))
    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': u'\u2014',
        'endash': u'\u2013',
        'emspace': u'\u2003',
        'enspace': u'\u2002',
        'qmspace': u'\u2005',
        'bullet': u'\u2022',
        'lquote': u'\u2018',
        'rquote': u'\u2019',
        'ldblquote': u'\u201C',
        'rdblquote': u'\u201D',
    }
    stack = []
    ignorable = False       # Whether this group (and all inside it) are "ignorable".
    ucskip = 1              # Number of ASCII characters to skip after a unicode character.
    curskip = 0             # Number of ASCII characters left to skip
    out = []                # Output buffer.
    for match in pattern.finditer(text):
        word,arg,hex,char,brace,tchar = match.groups()
        if brace:
            curskip = 0
            if brace == '{':
                # Push state
                stack.append((ucskip,ignorable))
            elif brace == '}':
                # Pop state
                ucskip,ignorable = stack.pop()
        elif char: # \x (not a letter)
            curskip = 0
            if char == '~':
                if not ignorable:
                    out.append(u'\xA0')
            elif char in '{}\\':
                if not ignorable:
                    out.append(char)
            elif char == '*':
                ignorable = True
        elif word: # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                c = int(arg)
                if c < 0: c += 0x10000
                if c > 127: out.append(unichr(c))
                else: out.append(chr(c))
                curskip = ucskip
        elif hex: # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                c = int(hex,16)
                if c > 127: out.append(unichr(c))
                else: out.append(chr(c))
        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)
    return ''.join(out)
It works by parsing the RTF code and skipping any group that has a "destination" specified, as well as all "ignorable" groups ({\*...}). I also added handling of some special characters.
There are lots of features missing to make this a full parser, but it should be enough for simple documents.
UPDATED: A version of this script updated to run on Python 3.x is available at:
https://gist.github.com/gilsondev/7c1d2d753ddb522e7bc22511cfb08676
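To isolate the group-stack idea from the rest of the script, here is a deliberately stripped-down sketch. It is not the script above: the names TOKEN, DESTINATIONS and visible_text are hypothetical, the destination set is a tiny subset chosen to fit the question's sample, and Unicode escapes, \'xx bytes and most special characters are left out.

```python
import re

# Simplified tokenizer: control word | control symbol | brace | line break | text char.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|\\(.)|([{}])|[\r\n]+|(.)", re.I | re.S)

# Tiny illustrative subset of destination control words (assumption, not the full list).
DESTINATIONS = {"fonttbl", "colortbl", "generator"}


def visible_text(rtf):
    stack = []          # saved "ignorable" flags, one per open group
    ignorable = False   # True while inside a destination or {\* ...} group
    out = []
    for match in TOKEN.finditer(rtf):
        word, _arg, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(ignorable)      # entering a group: remember outer state
        elif brace == "}":
            ignorable = stack.pop()      # leaving a group: restore outer state
        elif word:
            if word in DESTINATIONS:
                ignorable = True
            elif word == "par" and not ignorable:
                out.append("\n")
        elif symbol:
            if symbol == "*":
                ignorable = True         # {\* ...} marks an ignorable group
            elif symbol in "{}\\" and not ignorable:
                out.append(symbol)       # escaped literal: \{ \} \\
        elif char and not ignorable:
            out.append(char)
    return "".join(out)


# Second sample string from the question (the one with escaped braces and backslashes).
SAMPLE = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 "
    r"HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\test\\myapp\\Apps"
    r"\\\{3423234-283B-43d2-BCE6-A324B84CC70E\}\par }"
)

print(visible_text(SAMPLE).strip())
```

Because the ignorable flag is saved on "{" and restored on "}", text inside the font table, color table and generator groups is dropped, while escaped braces and backslashes in the body are kept as literal characters.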
Answer 2:
So far, we haven't found a good answer to this either, other than using a RichTextBox control:
/// <summary>
/// Strip RichTextFormat from the string
/// </summary>
/// <param name="rtfString">The string to strip RTF from</param>
/// <returns>The string without RTF</returns>
public static string StripRTF(string rtfString)
{
    string result = rtfString;
    try
    {
        if (IsRichText(rtfString))
        {
            // Put body into a RichTextBox so we can strip RTF
            using (System.Windows.Forms.RichTextBox rtfTemp = new System.Windows.Forms.RichTextBox())
            {
                rtfTemp.Rtf = rtfString;
                result = rtfTemp.Text;
            }
        }
        else
        {
            result = rtfString;
        }
    }
    catch
    {
        throw;
    }
    return result;
}

/// <summary>
/// Checks testString for RichTextFormat
/// </summary>
/// <param name="testString">The string to check</param>
/// <returns>True if testString is in RichTextFormat</returns>
public static bool IsRichText(string testString)
{
    if ((testString != null) &&
        (testString.Trim().StartsWith("{\\rtf")))
    {
        return true;
    }
    else
    {
        return false;
    }
}
Edit: Added IsRichText method.
Answer 3:
I've used this before and it worked for me:
\\\w+|\{.*?\}|}
You will probably want to trim the ends of the result to get rid of the extra spaces left over.
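For illustration, a minimal Python sketch of this approach, run against the first sample string from the question. SIMPLE_PATTERN, SAMPLE and strip_rtf_simple are names invented for this example; the regex is the one shown above.

```python
import re

# Regex from this answer, copied verbatim.
SIMPLE_PATTERN = re.compile(r"\\\w+|\{.*?\}|}")

# First sample string from the question, split only for readability.
SAMPLE = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }"
)


def strip_rtf_simple(rtf):
    """Remove markup with the regex, then trim the leftover spaces at both ends."""
    return SIMPLE_PATTERN.sub("", rtf).strip()


raw = SIMPLE_PATTERN.sub("", SAMPLE)
print(repr(raw))                       # spaces remain at both ends
print(repr(strip_rtf_simple(SAMPLE)))  # trimmed result
```

Note that this pattern treats any backslash followed by word characters as markup, so text containing escaped backslashes (as in the question's second sample) would likely be mangled.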
Answer 4:
I made this helper function to do this in JavaScript. So far this has worked well for simple RTF formatting removal for me.
function stripRtf(str){
    var basicRtfPattern = /\{\*?\\[^{}]+;}|[{}]|\\[A-Za-z]+\n?(?:-?\d+)?[ ]?/g;
    var newLineSlashesPattern = /\\\n/g;
    var ctrlCharPattern = /\n\\f[0-9]\s/g;

    // Remove RTF formatting, replace RTF new lines with real line breaks, and remove whitespace
    return str
        .replace(ctrlCharPattern, "")
        .replace(basicRtfPattern, "")
        .replace(newLineSlashesPattern, "\n")
        .trim();
}
Of note:
- I slightly modified the regex written by @Markus Jarderot above. It now removes slashes at the end of lines in two steps to avoid a more complex regex.
- .trim() is only supported in newer browsers. If you need to support older browsers, see: Trim string in JavaScript?
EDIT: I've updated the regex to work around some issues I've found since posting this originally. I'm using this in a project, see it in context here: https://github.com/chrismbarr/LyricConverter/blob/865f17613ee8f43fbeedeba900009051c0aa2826/scripts/parser.js#L26-L37
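For readers following the Python examples in this thread, here is a rough Python port of the helper above. It is a sketch, not a tested equivalent: Python's re and JavaScript's RegExp differ slightly (for example in what \s matches), and the input string below is a hypothetical two-line snippet, not data from the project.

```python
import re

# Same three patterns as the JavaScript helper, applied in the same order.
BASIC_RTF = re.compile(r"\{\*?\\[^{}]+;}|[{}]|\\[A-Za-z]+\n?(?:-?\d+)?[ ]?")
NEWLINE_SLASHES = re.compile(r"\\\n")
CTRL_CHAR = re.compile(r"\n\\f[0-9]\s")


def strip_rtf(text):
    text = CTRL_CHAR.sub("", text)          # drop "\fN" font switches at line starts
    text = BASIC_RTF.sub("", text)          # drop groups, braces and control words
    text = NEWLINE_SLASHES.sub("\n", text)  # backslash + newline becomes a line break
    return text.strip()


# Hypothetical input: a backslash at the end of the first line marks an RTF line break.
example = "{\\rtf1\\ansi First line\\\nSecond line}"
print(strip_rtf(example))
```

Handling the trailing backslashes in a separate pass keeps the main pattern the same as the one it was derived from.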
Answer 5:
Regex will never solve this problem 100%; you need a parser. Check this implementation on CodeProject (it's in C#, though): http://www.codeproject.com/Articles/27431/Writing-Your-Own-RTF-Converter
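Answer 6:
According to RegexPal, the two leftover }'s are the second brace of the "}}" that closes the font table and the final brace of the string (they were bolded in the original post):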
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }
I was able to fix the first curly brace by adding a plus sign to the regex:
({\\)(.+?)(}+)|(\\)(.+?)(\b)
            ^
            plus sign added here
And to fix the curly brace at the end, I did this:
({\\)(.+?)(})|(\\)(.+?)(\b)|}$
                           ^^^
                           this checks if there is a curly brace at the end
I don't know the RTF format very well so this might not work in all cases, but it works on your example...
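A small Python sketch that combines both fixes into one pattern and compares it with the original regex on the question's first sample. ORIGINAL, FIXED and SAMPLE are names chosen for this illustration, and the braces are escaped here only to keep the Python patterns unambiguous.

```python
import re

# Original regex from the question, and the same regex with both fixes applied.
ORIGINAL = re.compile(r"(\{\\)(.+?)(\})|(\\)(.+?)(\b)")
FIXED = re.compile(r"(\{\\)(.+?)(\}+)|(\\)(.+?)(\b)|\}$")

# First sample string from the question, split only for readability.
SAMPLE = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }"
)

before = ORIGINAL.sub("", SAMPLE)
after = FIXED.sub("", SAMPLE)

print(repr(before))  # two stray "}" remain
print(repr(after))   # no braces left, only surrounding spaces
```

The "}+" swallows the doubled brace that closes the font table, and "}$" removes the brace that closes the whole document.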
Answer 7:
Late contributor, but the regex below helped us with the RTF code we found in our DB (we're using it within an RDL via SSRS).
This expression removed it for our team. It may only resolve our specific RTF, but it may be a helpful base for someone. This website is also incredibly handy for live testing:
http://regexpal.com/
{\*?\\.+(;})|\s?\\[A-Za-z0-9]+|\s?{\s?\\[A-Za-z0-9]+\s?|\s?}\s?
Hope this helps, K
Answer 8:
None of the answers were sufficient, so my solution was to use the RichTextBox control (yes, even in a non-WinForms app) to extract text from RTF.
Answer 9:
Here's an Oracle SQL statement that can strip RTF from an Oracle field:
SELECT REGEXP_REPLACE(
         REGEXP_REPLACE(
           CONTENT,
           '\\(fcharset|colortbl)[^;]+;', ''
         ),
         '(\\[^ ]+ ?)|[{}]', ''
       ) TEXT
FROM EXAMPLE WHERE CONTENT LIKE '{\rtf%';
This is designed for data from Windows rich text controls, not RTF files. Limitations are:
- \{ and \} are not replaced with { and }
- Headers and footers are not handled specially
- Images and other embedded objects are not handled specially (no idea what will happen if one of these is encountered!)
It works by first removing the \fcharset and \colortbl tags, which are special because data follows them until ; is reached. Then it removes all the \xxx tags (including a single optional trailing space), followed by all the { and } characters. This handles most simple RTF such as what you get from the rich text control.
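The same two-pass idea can be sketched in Python for experimenting outside the database. This is only an approximation: Oracle's regex dialect is not identical to Python's re, the names HEADER_DATA, TAGS_AND_BRACES and strip_simple_rtf are invented here, and the input is a hypothetical minimal snippet rather than real rich text control output.

```python
import re

# Pass 1: \fcharset and \colortbl carry data up to the next ';'.
HEADER_DATA = re.compile(r"\\(fcharset|colortbl)[^;]+;")
# Pass 2: any \xxx run up to the next space (plus one optional space), or a brace.
TAGS_AND_BRACES = re.compile(r"(\\[^ ]+ ?)|[{}]")


def strip_simple_rtf(content):
    without_header_data = HEADER_DATA.sub("", content)
    return TAGS_AND_BRACES.sub("", without_header_data)


# Hypothetical minimal RTF snippet used only for this illustration.
example = r"{\rtf1\ansi{\fonttbl{\f0\fnil\fcharset0 Arial;}}\f0\fs20 Hello world\par}"
print(strip_simple_rtf(example))
```

Since the second pass consumes everything from a backslash to the next space, text that directly follows a tag without a separating space would be removed along with it.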
Answer 10:
The following solution allows you to extract text from an RTF string:
FareRule = Encoding.ASCII.GetString(FareRuleInfoRS.Data);
System.Windows.Forms.RichTextBox rtf = new System.Windows.Forms.RichTextBox();
rtf.Rtf = FareRule;
FareRule = rtf.Text;
Source: https://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string