Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.


  • 2020-11-21 05:56

    Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:

    def split_string(source, splitlist):
        output = []  # output list of cleaned words
        atsplit = True  # True while we are between words
        for char in source:
            if char in splitlist:
                atsplit = True
            else:
                if atsplit:
                    output.append(char)  # start a new word after a split
                    atsplit = False
                else:
                    output[-1] = output[-1] + char  # continue copying characters until next split
        return output
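
A quick check of the function above, as a minimal sketch. The definition is repeated here, with the `else` branches the logic relies on, so the snippet runs on its own; the sample sentence and the delimiter string `" ,-!?"` are illustrative assumptions, not part of the answer.

```python
def split_string(source, splitlist):
    """Split source at every character that appears in splitlist."""
    output = []
    atsplit = True  # True while we are between words
    for char in source:
        if char in splitlist:
            atsplit = True
        elif atsplit:
            output.append(char)  # first character of a new word
            atsplit = False
        else:
            output[-1] += char  # extend the current word
    return output


words = split_string("Hey, you - what are you doing here!?", " ,-!?")
print(words)
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Runs of consecutive delimiters (such as `" - "`) produce no empty strings, because a new word only starts at the first non-delimiter character.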
  • 2020-11-21 05:57

    Another way to achieve this is to use the Natural Language Tool Kit (nltk).

    import nltk
    data = "Hey, you - what are you doing here!?"
    word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
    print(word_tokens)

    This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

    The biggest drawback of this method is that you need to install the nltk package.

    The benefits are that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.

  • 2020-11-21 05:58

    A case where regular expressions are justified:

    import re
    DATA = "Hey, you - what are you doing here!?"
    print(re.findall(r"[\w']+", DATA))
    # Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
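
To show what the apostrophe in that character class buys you, here is a small comparison. It is only a sketch; the sample sentence is made up for illustration.

```python
import re

text = "Don't panic - it's only a test!"

# [\w']+ treats the apostrophe as part of a word, so contractions survive.
with_apostrophes = re.findall(r"[\w']+", text)

# \w+ alone breaks contractions at the apostrophe.
without_apostrophes = re.findall(r"\w+", text)

print(with_apostrophes)
# ["Don't", 'panic', "it's", 'only', 'a', 'test']
print(without_apostrophes)
# ['Don', 't', 'panic', 'it', 's', 'only', 'a', 'test']
```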
  • 2020-11-21 05:59

    I had the same problem as @ooboo and found this topic. @ghostdog74's answer inspired me; maybe someone will find my solution useful:

    ''.join([ s if s not in splitat else ' ' for s in str1]).split()

    If you don't want to split at spaces, put some other character in place of the ' ' and pass that same character to split().
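
The one-liner above leaves `splitat` and `str1` undefined, so here is one way it might be wrapped up. This is a sketch under assumptions: the function name `split_at` and the `placeholder` parameter are hypothetical, and the placeholder is assumed not to occur in the text itself.

```python
def split_at(text, splitat, placeholder=" "):
    """Replace every character in splitat with placeholder, then split on it."""
    replaced = "".join(ch if ch not in splitat else placeholder for ch in text)
    if placeholder == " ":
        # split() with no argument also collapses runs of whitespace.
        return replaced.split()
    # For any other placeholder, drop the empty pieces and trim each word.
    return [part.strip() for part in replaced.split(placeholder) if part.strip()]


print(split_at("Hey, you - what are you doing here!?", ",-!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# Keep spaces inside the pieces by splitting on a different character.
print(split_at("New York, Los Angeles;San Francisco", ",;", placeholder="|"))
# ['New York', 'Los Angeles', 'San Francisco']
```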

  • 2020-11-21 05:59

    I think the following is the best answer to suit your needs:

    \W+ may be suitable for this case, but may not be suitable for other cases.

    import re
    list(filter(None, re.split(r'[ ,\-!?]', "Hey, you - what are you doing here!?")))
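
If the delimiters are only known at run time, the character class can be built from a string. A rough sketch, assuming a hypothetical helper name `split_on` and a non-empty delimiter string; `re.escape` takes care of characters such as `-` and `]` that are special inside brackets.

```python
import re


def split_on(text, delimiters):
    """Split text on any run of the given delimiter characters."""
    # Assumes delimiters is non-empty; "[]+" would not be a valid pattern.
    pattern = "[" + re.escape(delimiters) + "]+"
    # re.split leaves empty strings at the edges; drop them.
    return [word for word in re.split(pattern, text) if word]


print(split_on("Hey, you - what are you doing here!?", " ,-!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# Inside [...] a pipe is a literal character, not alternation.
print(re.split(r"[,|]", "a|b,c"))
# ['a', 'b', 'c']
```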
  • 2020-11-21 06:00

    Another quick way to do this without a regexp is to replace the characters first, as below:

    >>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
    ['a', 'bcd', 'ef', 'g']
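
Chaining `replace` calls gets long once there are more than a few delimiters. One possible generalisation, offered as a sketch rather than as part of the answer, uses `str.maketrans` and `str.translate`; the function name `split_by_replacing` is hypothetical.

```python
def split_by_replacing(text, delimiters):
    """Map every delimiter character to a space, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in delimiters})
    return text.translate(table).split()


print(split_by_replacing("a;bcd,ef g", ";,"))
# ['a', 'bcd', 'ef', 'g']
```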