regex to remove words from a list that are not A-Z a-z (exceptions)

Posted by 血红的双手。 on 2021-02-10 20:43:27

Question


I would like to remove non-alphabetic characters from a string and split it into a list of words, such that:

"All, the above." -> ["all", "the", "above"]

It would seem that the following function works:

re.split(r'\W+', text)

but it does not account for corner cases.

For example:

"The U.S. is where it's nice." -> ["the", "U", "S", "is", "where", "it", "s", "nice"]

I want the sentence-ending period removed, but neither the apostrophe nor the periods in "U.S."
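
The behavior is easy to reproduce. A minimal sketch, assuming the input is held in a variable named text (a name chosen here only for illustration):

```python
import re

text = "The U.S. is where it's nice."

# Split on every run of non-word characters (anything outside [A-Za-z0-9_]).
words = re.split(r'\W+', text)

# The periods in "U.S." and the apostrophe in "it's" are treated as separators,
# and the final period leaves a trailing empty string in the list.
print(words)
# ['The', 'U', 'S', 'is', 'where', 'it', 's', 'nice', '']
```

Note that the split keeps the original casing, so lowercasing would have to be a separate step (for example, calling text.lower() first).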

My idea is to split the string on whitespace and then strip the extra punctuation from each piece:

"I, live at home." -> ["I", "live", "at", "home"] (comma and period removed)
"I J.C. live at home." -> ["I", "J.C.", "live", "at", "home"] (acronym periods not removed but end of sentence period removed)

What I'm trying to do becomes considerably harder for sentences like:

"The flying saucer (which was green)." -> ["...", "green"] (ignore ").") 
"I J.C., live at home." -> ["I", "J.C.", "..."] (ignore punctuation)

Special case (strings are retrieved from a raw text file):

"I love you.<br /> Come home soon!" -> ["..."] (ignore the <br /> tag and punctuation)

I am relatively new to Python and find writing regexes confusing, so any help on how to parse strings this way would be much appreciated! If there is a catch-22 here and not everything I am trying to accomplish is possible, let me know.


Answer 1:


Although I understand you are asking specifically about regex, another solution to your overall problem is to use a library built for this purpose, such as nltk. It can split your strings in sane ways (separating punctuation into its own items in the list), and you can then filter out whatever you don't need.

You are right, the number of corner cases is huge precisely because human language is imprecise and vague. Using a library that already accounts for these edge cases should save you a lot of headache.

The NLTK book includes a helpful primer on processing raw text. The most useful function for your use case seems to be nltk.word_tokenize, which returns a list of strings with words and punctuation separated.
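
As a rough sketch of that tokenize-then-filter approach: the helper names drop_punctuation and tokenize_words below are made up for illustration, and the filtering rule (keep any token containing at least one letter or digit) is an assumption about what "filter out" should mean here. Only nltk.word_tokenize comes from NLTK itself; it needs the nltk package installed and its Punkt tokenizer data downloaded via nltk.download.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


def tokenize_words(text):
    """Tokenize text with NLTK, then drop the punctuation-only tokens."""
    # Imported lazily so drop_punctuation can be used without NLTK installed.
    from nltk import word_tokenize
    return drop_punctuation(word_tokenize(text))


# A hand-written token list (not actual NLTK output) to show the filter step.
print(drop_punctuation(["I", "live", "at", "home", "."]))
# ['I', 'live', 'at', 'home']
```

Tokens that mix letters and punctuation, such as an abbreviation with periods, pass through this filter unchanged, so how they come out depends entirely on how the tokenizer split them.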




Answer 2:


Here's a Python regex that should work for splitting the sentences you provided.

((?<![A-Z])\.)*[\W](?<!\.)|[\W]$


This assumes that every abbreviation with periods has a capital letter immediately before each period, so a negative lookbehind can exclude those periods:

((?<![A-Z])\.)*

It then splits on any other non-alphanumeric character that is not a period:

[\W](?<!\.)

or on a symbol at the end of a line:

|[\W]$

I tested the regex on these strings:

The R.N. lives in the U.S.

The R.N., lives in the U.S. here.
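
To make the pattern concrete, here is a minimal sketch of using it with re.split. Two things are assumptions rather than part of the regex above: the helper name split_words is made up for illustration, and the group is written as non-capturing, (?:...), because re.split would otherwise insert the captured group (or None) into the result list. Empty strings produced by adjacent separators are filtered out.

```python
import re

# Same pattern as above, except the group is non-capturing so that
# re.split does not add the group's contents to the output.
SPLIT_PATTERN = re.compile(r"(?:(?<![A-Z])\.)*[\W](?<!\.)|[\W]$")


def split_words(text):
    """Split text on the pattern and drop the empty strings left behind."""
    return [piece for piece in SPLIT_PATTERN.split(text) if piece]


print(split_words("The R.N. lives in the U.S."))
# ['The', 'R.N.', 'lives', 'in', 'the', 'U.S']

print(split_words("The R.N., lives in the U.S. here."))
# ['The', 'R.N.', 'lives', 'in', 'the', 'U.S.', 'here']
```

In the first string the last period of "U.S." is stripped as well, because the second alternative matches any symbol at the end of the line. The apostrophe in a word like "it's" is also still treated as a separator by this pattern.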



Source: https://stackoverflow.com/questions/34006169/regex-to-remove-words-from-a-list-that-are-not-a-z-a-z-exceptions
