Extracting whole words

借酒劲吻你 2020-12-03 15:38

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible with

4 answers
  • 2020-12-03 16:05

    Sample code

    import re
    print(re.search(r'(?u)ривет\b', 'Привет'))    # matches: "ривет" ends at a word boundary
    print(re.search(r'(?u)\bривет\b', 'Привет'))  # None: no boundary between "П" and "р"
    

    or

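    import re
    s = "abcd ААБВ"
    rx1 = re.compile(r"(?u)АБВ")      # matches anywhere, even inside a longer word
    rx2 = re.compile(r"(?u)АБВ\b")    # match must end at a word boundary
    rx3 = re.compile(r"(?u)\bАБВ\b")  # match must be a whole word
    print(rx1.findall(s))  # ['АБВ']
    print(rx2.findall(s))  # ['АБВ']
    print(rx3.findall(s))  # []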
    
  • 2020-12-03 16:11

    If you restrict yourself to ASCII letters, then use (with the re.I option set)

    \b[a-z]+\b
    

    \b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
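
As a quick check of that behaviour, here is a minimal sketch; the sample string is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample text: plain words, a number, and mixed tokens.
text = "pie 42 pie21 21pie Apple-pie"

# re.I makes [a-z] match upper-case ASCII letters as well.
ascii_words = re.findall(r"\b[a-z]+\b", text, re.I)
print(ascii_words)  # ['pie', 'Apple', 'pie']
```

Digits count as word characters, so there is no boundary between "pie" and "21" and both mixed tokens are skipped. The hyphen is not a word character, so "Apple-pie" yields two words.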

    To also allow non-ASCII letters, you can use something like this:

    \b[^\W\d_]+\b
    

    which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.

    [^\W\d_] is a negated character class: it matches any character that is not a non-word character, a digit, or an underscore, which leaves only letters.
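
A small sketch of the letters-only class, assuming Python 3, where str patterns are Unicode-aware by default; the sample string is invented for illustration.

```python
import re

# Hypothetical sample text mixing Cyrillic, German, and tokens that
# contain digits or underscores.
text = "Привет straße pie21 21pie foo_bar 42"

# [^\W\d_] = word characters minus digits and underscore, i.e. letters.
letter_words = re.findall(r"\b[^\W\d_]+\b", text)
print(letter_words)  # ['Привет', 'straße']
```

Note that "foo_bar" is rejected as a whole: the underscore is a word character, so there is no boundary on either side of it.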

  • 2020-12-03 16:25

    What about:

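    import re
    yourString = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42"
    # split into candidates, de-duplicate with set, keep letters-only tokens
    # (list(...) is needed on Python 3, where filter returns an iterator)
    words = list(filter(lambda x: re.match(r"^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString))))
    print(words)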
    

    Note that:

    • split breaks your string into candidates: it returns a list of "potential words"
    • set removes duplicates: it turns the list into a set, dropping entries that appear more than once. This step is optional.
    • filter reduces the number of candidates: it applies a test function to each element and keeps only the elements that pass the test (a list in Python 2, a lazy iterator in Python 3). In our case, the test function is anonymous
    • lambda: an anonymous function that takes an item and checks whether it is a word (upper- or lower-case ASCII letters only)
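
The same steps can also be written as a small function. This is only a sketch: the name letters_only_tokens is hypothetical, and the result is sorted because a set has no defined order.

```python
import re

def letters_only_tokens(text):
    """Hypothetical helper: split, de-duplicate, keep alphabetic tokens."""
    candidates = re.split(r"[\s:/,.]", text)   # potential words
    unique = set(candidates)                   # drop repeated entries
    # Keep tokens made only of ASCII letters; sort for a stable order.
    return sorted(tok for tok in unique if re.match(r"^[a-zA-Z]+$", tok))

tokens = letters_only_tokens(
    "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42"
)
print(tokens)  # ['com', 'foo', 'http', 'pie']
```

Fragments of the URL ("http", "foo", "com") pass the test, so a spell-checking pipeline may want to drop URLs before splitting.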

    EDIT : added some explanations

  • 2020-12-03 16:27

    Are you familiar with word boundaries (\b)? You can extract words by putting \b on both sides of the pattern and matching letters in between:

    \b([a-zA-Z]+)\b
    

    For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
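
A minimal sketch of how the pattern splits at punctuation; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample containing hyphens, periods, and a semicolon.
sample = "well-known words; e.g. semi-colons stop here."

# With one capture group, findall returns the captured text of each match.
parts = re.findall(r"\b([a-zA-Z]+)\b", sample)
print(parts)
# ['well', 'known', 'words', 'e', 'g', 'semi', 'colons', 'stop', 'here']
```

Hyphenated words come back as separate pieces, which may or may not be what a spell checker should see.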

    You can read about the \b sequence, and others, in the Python manual.

    EDIT Also, if you want to exclude matches that have a number directly following or preceding them, you can use a negative look-ahead/behind:

    (?!\d)   # negative look-ahead for numbers
    (?<!\d)  # negative look-behind for numbers
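
One way these look-arounds might be combined is sketched below. The full pattern is an assumption for illustration, not something the answer spells out: it replaces \b with look-arounds that reject a neighbouring letter or digit, so an underscore acts as a separator while tokens touching digits are still skipped.

```python
import re

# Assumed pattern: a run of ASCII letters that is neither preceded nor
# followed by another letter or a digit.
pattern = re.compile(r"(?<![a-zA-Z\d])[a-zA-Z]+(?![a-zA-Z\d])")

# Hypothetical sample text.
found = pattern.findall("foo_bar pie21 21pie pie")
print(found)  # ['foo', 'bar', 'pie']
```

With plain \b[a-zA-Z]+\b, "foo_bar" would produce no matches at all, because the underscore counts as a word character.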
    