Search and replace in Word RTF

Question

I'm working on an application that has a workflow for postal mail. The letters are generated according to my application's business rules.

Templates are in HTML or RTF, and it works perfectly as long as the user does not create the RTF with Word. Word support is not in the specs, but my management would welcome Word compatibility if it doesn't involve too much work, and it would please our customer and make their life easier.

The RTF templates contain tags that are replaced by application values. In most RTF files the tags are not split, so the search and replace works perfectly. I would like to be able to handle Word with few modifications.

Example data: [[FooBuzz]]. In most RTF it is not split.

In Word 2003:

{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}

And their Word (Word 2007) also splits it: Foo{garbage inside} Buzz.

So I want to handle common RTF perfectly, and detect tags even if they are split.

I have two constraints: first, no regression; second, it has to stay simple. Performance is not an issue here.

I'm using symfony 1.4. The relevant part of the current search code:

$regExpression = '/\[\[([^\[\]]*)\]\]/';  

preg_match_all($regExpression, $sTemplate, $outKeys);
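
To illustrate the problem, here is a minimal sketch. The unsplit sample string is an assumption; the split one is the Word 2003 output shown above. As far as I can tell, the current pattern still matches the Word output, but the captured key is no longer the bare tag name:

```php
<?php
// Pattern currently used by the application (from this question).
$regExpression = '/\[\[([^\[\]]*)\]\]/';

// Assumed sample: RTF where the tag is not split.
$plainRtf = '{\someRtfTags [[FooBuzz]]}';

// Word 2003 output: the tag is spread over three groups.
$wordRtf = '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
         . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
         . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}';

preg_match_all($regExpression, $plainRtf, $plainKeys);
preg_match_all($regExpression, $wordRtf, $wordKeys);

// $plainKeys[1][0] is the clean key 'FooBuzz'.
// $wordKeys[1][0] also contains the braces and control words that sit
// between [[ and ]], so looking it up as a tag name fails.
var_dump($plainKeys[1][0], $wordKeys[1][0]);
```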

Update:

I guess I mostly need to perfect this regex. I'm working on some regexes, but they still need some improvement:

/([\a-zA-Z0-9]+)/

produces:

[0] => Array
    (
        [0] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[
        [1] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz
        [2] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]
    )

Update 2:

I still have a few problems with the regex. It currently finds both tag values and plain text. I'm not sure what I want is even possible in a reasonable amount of time.

I need to modify the regex so that it catches the same results, but only inside [[ ]]; currently it works on plain text too.

And, even harder, I have to be able to catch all my sample data (but not plain text) by whatever means necessary.

For my replace regex, which replaces my tag and all the garbage, I have almost succeeded:

/{.*?\[\[.*(?<!\\)\w+\b.*\]\].*?}/

But it is too greedy. I want to match the group { [[}{tag}{ ]]} and it matches {plain text}{ [[}{tag}{ ]]}{plain text}.

I added the ? because I read it would make the .* non-greedy, but it doesn't work. Any ideas?

I can't work out what's wrong with this regex (for finding the tag name):

\[\[(\b(?<!\\)\w+\b)\]\]

According to my understanding, it says: inside [[ ]], find any word that does not start with a backslash, followed by any word characters. Am I right?

Update 3:

Sorry, I was unclear.

My first regex aims to catch FooBuzz in [[FooBuzz]], and the second to catch [[FooBuzz]]. So with the first regex I want to catch only the text FooBuzz and ignore everything else (like {} \eoeoe).

Second, I have to replace [[FooBuzz]] completely. So I have to catch {[[}{FooBuzz}{]]} and nothing more.

Currently I'm catching {plain text I mustn't catch} {[[}{FooBuzz}{]]}. See, I catch too much here: I'm catching "plain text I mustn't catch [[FooBuzz]]".

For the [[ part, I need to catch only this: {\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}. I guess the regex can't find a non-greedy match, so it falls back to greedy mode, and it fails with this data sample:

{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}
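
A note on the [[ part, as a minimal sketch rather than a full solution: a lazy quantifier only makes the match end as early as possible, while the engine still starts at the leftmost { it can. One way around this, assuming the groups around the tag contain no nested braces, is to replace .*? with a negated character class that cannot cross a brace. Both patterns below are illustrative and not taken from the application:

```php
<?php
// Data sample from the question.
$rtf = '{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
     . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
     . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}';

// Lazy version: the match still begins at the first { of the string,
// so the unrelated {\toto toto} group is swallowed as well.
preg_match('/\{.*?\[\[\}/', $rtf, $lazy);

// Assumed alternative: [^{}]* cannot cross a brace, so the match has to
// stay inside the single group that holds [[.
preg_match('/\{[^{}]*\[\[\}/', $rtf, $bounded);

echo $lazy[0], "\n";
echo $bounded[0], "\n";
```

The same idea extends to the tag group and the ]] group, as long as each part sits in its own brace-delimited group.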

Answer 1:

After your edit: to find FooBuzz or any other tag, you can search for

(?<=\[\[).+?\b(?<!\\)(\w+)\b(?=.+?\]\])

and take the first group.

It finds a whole word not preceded by a \ using the negative lookbehind (?<!\\), while the lookbehind (?<=\[\[) and the lookahead (?=.+?\]\]) require that it is preceded by [[ and followed by ]].

Here is an example; you can see the first group correctly contains FooBuzz :)
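
For readers who want to try it locally, here is a minimal PHP sketch of that search against the Word 2003 sample from the question. The variable names are only illustrative. Note that in a single-quoted PHP string the lookbehind needs four backslashes to produce the regex \\. The second check is an observation rather than part of the original suggestion: because .+? needs at least one character before the word, an unsplit [[FooBuzz]] does not seem to match, so the existing pattern would have to be kept for that case.

```php
<?php
// Regex from this answer, written as a single-quoted PHP string.
$pattern = '/(?<=\[\[).+?\b(?<!\\\\)(\w+)\b(?=.+?\]\])/';

// Word 2003 sample from the question.
$word2003 = '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
          . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
          . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}';

preg_match_all($pattern, $word2003, $matches);
print_r($matches[1]); // group 1: the tag name

// Unsplit tag: no character sits between [[ and the word,
// so this pattern finds nothing here.
$unsplitCount = preg_match_all($pattern, '[[FooBuzz]]', $unsplit);
var_dump($unsplitCount);
```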

To better understand RTF I found a good link. You could also consider a non-regex approach, although I have no suggestions for one in this case.

EDIT:

Your last regex is wrong because it expects a \w+ directly after the opening square brackets, so it will only match something like [[wordWithoutSpaces]].

The replace regex from your update correctly matches the whole string, because what it says is: "start at the first { and match almost everything". Let's see:

{.*?\[\[ matches everything between { and [[
.*(?<!\\)\w+\b matches everything after [[ up to a word (\w+) not preceded by a backslash (you probably want a \b before the negative lookbehind and the \w here)
.*\]\].*?}/ matches everything between ]] and the first } it finds (non-greedy)

But if you want to match the individual parts, you need separate matches or separate groups.

EDIT:

As a single regex, it is possible to merge the two regexes crafted in this answer:

{[^{]?[[.(?<=[[).+?\b(?]].?}

preg_match_all will return two arrays: the first contains the data matched by the whole regex, the second contains the tag.

Then, thanks to the strtr function, only tags that have a matching translation are replaced (three rounds in the workflow).
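
A rough sketch of that last step, under several assumptions: the pattern below is a hypothetical stand-in, not the merged regex above. It expects the tag in three adjacent groups as in the Word 2003 sample, and it does not handle names that are split further (Word 2007) or values that need RTF escaping. The function name replaceSplitTags is also made up for illustration.

```php
<?php
// Hypothetical pattern: group holding [[, group ending with the tag name,
// group holding ]]. [^{}]* keeps every part inside one brace pair.
$pattern = '/\{[^{}]*\[\[[^{}]*\}\{[^{}]*\s(\w+)\}\{[^{}]*\]\][^{}]*\}/';

function replaceSplitTags($rtf, array $translations, $pattern)
{
    // $m[0] holds the full matches, $m[1] the tag names.
    preg_match_all($pattern, $rtf, $m);

    $pairs = array();
    foreach ($m[1] as $i => $tag) {
        // Only tags that have a translation are replaced.
        if (isset($translations[$tag])) {
            // Wrap the value in a group so the braces stay balanced.
            $pairs[$m[0][$i]] = '{' . $translations[$tag] . '}';
        }
    }

    return $pairs ? strtr($rtf, $pairs) : $rtf;
}

// Data sample from the question.
$rtf = '{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
     . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
     . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}';

echo replaceSplitTags($rtf, array('FooBuzz' => 'Hello'), $pattern), "\n";
```

Building the strtr map from the full matches means unknown tags are left untouched in the document.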

Answer 2:

In case other people run into the same problem, here is a better and more general solution. The RTF representation of words depends on ... the font. A simple text search for [[FooBuzz]] works in Times New Roman, but in Arial the word is split up, and you need a clever regex.

Examples:

Font              Text          RTF
Times New Roman   [[FooBuzz]]   {\someRtfTags [[FooBuzz]]}
Arial             [[FooBuzz]]   {\hich\af1\dbch\af12\loch\f1 [[Signature}{\rtlch\fcs1 \af0 \ltrch\fcs0 \i\insrsid15225063 \hich\af1\dbch\af12\loch\f1 President}{\rtlch\fcs1 \af0 \ltrch\fcs0 \i\insrsid1974114\charrsid1974114 \hich\af1\dbch\af12\loch\f1 ]]}

So use Times New Roman for tags.

Source: https://stackoverflow.com/questions/12856177/research-and-replace-word-rtf

Tags

php

regex

word

rtf