Find Lowercase immediately followed by uppercase

Posted by 你离开我真会死。 on 2019-12-13 04:13:36

Question


My text is as below:

<font size=+2 color=#F07500><b> [ba]</font></b>
<ul><li><font color =#0B610B> Word word wordWord word.<br></font></li></ul>
<ul><li><font color =#F07500> Word word word.<br></font></li></ul>
<ul><li><font color =#0B610B> Word word word wordWord.<br></font></li></ul>
<ul><li><font color =#0B610B> WordWord.<br></font></li></ul>
<br><font color =#E41B17><b>UPPERCASE LETTERS</b></font> 
<ul><li><font color =#0B610B> Word word wordWord word.<br></font><br><font color =#E41B17><b>PhD and dataBase</b></font> </li></ul>
<font color =#0B610B> Word word word.<br></font></li></ul><dd><font color =#F07500>     »» Word wordWord word.<br></font>

In several of the <font color =#0B610B>...</font> elements, a lowercase letter is immediately followed by an uppercase letter. For example:

<font color =#0B610B> Word word wordWord word.<br></font>

I want to correct this error by splitting the two words as follows (i.e., by adding a colon and a space between them):

<font color =#0B610B> Word word word: Word word.<br></font>

So far, I have been using:

(<font color =#0B610B\b[^>]*>)(.*?</font>)

to select each instance of <font color =#0B610B>...</font>, and it works fine, finding one instance at a time.

But when I use:

(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)

it does find matches, but the selection runs from <font color =#0B610B> across the rest of the line, regardless of other font-color tags, so unwanted instances get replaced too.

I want it to find and replace the error only inside each <font color =#0B610B>...</font> pair, not to grab everything that starts with <font color =#0B610B> and ends with some later </font>.

Are there any regular expressions to solve this problem? Many thanks in advance.


Answer 1:


In general, regex is not a good idea for parsing HTML (if it's a once-off you might be OK).

I think this might be the reason your regex is not working. Can you give an example of a case in which your regex fails?

One case I can think of is where there is no match for [a-z][A-Z] within a <font color =#0B610B>...</font> pair, but there is one in a neighbouring <font>...</font>. For example:

<font color =#0B610B>word word</font><font color =#000000>word wordWord</font>

In this case, the only possible match takes <font color =#0B610B>word word</font><font color =#000000>word word as the first two groups and the rest of the string, Word</font>, as the third, and so this is what the regex matches (if a regex can match, it will).
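This behaviour is easy to reproduce. The sketch below uses Python's re module, which the question does not mention; it is only an assumption made for illustration. The sample string is hypothetical and is written with the "color =" spacing that the asker's pattern expects.

```python
import re

# The asker's pattern, unchanged.
ORIGINAL = re.compile(r"(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)")

# Hypothetical sample: the green element is fine, its neighbour has the error.
sample = (
    "<font color =#0B610B>word word</font>"
    "<font color =#000000>word wordWord</font>"
)

match = ORIGINAL.search(sample)

# Group 2 is lazy, but it still expands until a lowercase/uppercase pair
# is found, even if that means running past the first </font>.
crossed_text = match.group(2)
tail = match.group(3)

print(repr(crossed_text))
print(repr(tail))
```

Group 2 ends up containing the first </font> and the opening tag of the second element, which is exactly the unwanted selection described in the question.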

I can think of a crude workaround, but I wouldn't recommend it unless this task is a once-off, because using regex for HTML is always prone to such errors. This regex is also pretty inefficient. Try (untested):

(<font color =#0B610B\b[^>]*>)(([^<]|<(?!/font))*?[a-z])([A-Z].*?</font>)

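It says: look for the <font color =#0B610B> tag, followed by a run of characters, each of which is either anything other than <, or a < that is not followed by /font, and then by the [a-z][A-Z] pair. So it makes sure that the match cannot cross a </font> boundary before reaching the lowercase/uppercase pair. Note that the inner parentheses add a capturing group, so the part starting at the uppercase letter is now group 4 rather than group 3.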
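As a rough check of this idea, here is a minimal Python sketch. Python, the name split_joined_words, and the replacement string are assumptions made for illustration, not part of the original answer. The pattern is a variant of the one above in which the inner group is made non-capturing, so the three groups keep the numbering of the asker's pattern.

```python
import re

# Variant of the answer's pattern: (?:...) keeps the inner alternation
# from shifting the group numbers.
TEMPERED = re.compile(
    r"(<font color =#0B610B\b[^>]*>)"   # group 1: the green opening tag
    r"((?:[^<]|<(?!/font))*?[a-z])"     # group 2: text that never crosses </font
    r"([A-Z].*?</font>)"                # group 3: from the uppercase letter to </font>
)


def split_joined_words(html: str) -> str:
    """Insert ': ' at the first lowercase/uppercase pair in each green element."""
    return TEMPERED.sub(r"\1\2: \3", html)


# Lines taken from the question's sample text.
with_error = "<ul><li><font color =#0B610B> Word word wordWord word.<br></font></li></ul>"
neighbour_only = (
    "<font color =#0B610B> Word word word.<br></font></li></ul><dd>"
    "<font color =#F07500>     »» Word wordWord word.<br></font>"
)

fixed = split_joined_words(with_error)
untouched = split_joined_words(neighbour_only)

print(fixed)
print(untouched)
```

On these two inputs the first line gets the colon inserted, and the second is left alone because its only lowercase/uppercase pair sits in the orange element. Each match fixes only the first pair inside an element, so an element with several joined words would need repeated passes.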



Source: https://stackoverflow.com/questions/8775419/find-lowercase-immediately-followed-by-uppercase
