Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

\"H         


        
30 answers
  • 2020-11-21 05:46

    Here is my go at a split with multiple delimiters:

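    def msplit(text, delims):
        """Yield the substrings of text separated by any character in delims."""
        word = ''
        for ch in text:
            if ch not in delims:
                word += ch
            else:
                # Hit a delimiter: emit the word collected so far, if any
                if word:
                    yield word
                word = ''
        # Emit the trailing word when the text does not end with a delimiter
        if word:
            yield word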
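
Because msplit is a generator, wrapping it in list() gives the word list directly. A usage sketch follows; the function is repeated so the snippet runs on its own, and the delimiter string is an assumption chosen to fit the question's sample sentence:

```python
def msplit(text, delims):
    """Yield the substrings of text separated by any character in delims."""
    word = ''
    for ch in text:
        if ch not in delims:
            word += ch
        elif word:
            yield word
            word = ''
    if word:
        yield word

# Assumed delimiter set: a space plus the punctuation in the sample sentence
words = list(msplit("Hey, you - what are you doing here!?", " ,-!?"))
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Runs of adjacent delimiters (such as ", " or "!?") produce no empty strings, because a word is only yielded when it is non-empty.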
    
  • 2020-11-21 05:47

    Instead of re.split from the re module, you can get the same result with the Series.str.split method of pandas.

    First, create a Series from the string above, then apply the method to it.

    import pandas as pd

    thestring = pd.Series("Hey, you - what are you doing here!?")
    thestring.str.split(pat=',|-')

    The pat parameter takes the delimiters, and the call returns a Series whose elements are lists of the split pieces. Here the two delimiters are combined with | (the regex alternation operator). The output is as follows:

    [Hey, you , what are you doing here!?]
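
For comparison, here is a minimal sketch of the plain re.split call that this answer is an alternative to, using the same pattern. The whitespace-stripping step at the end is an added assumption, not part of the answer:

```python
import re

text = "Hey, you - what are you doing here!?"

# Same pattern as the pandas call: split on a comma or a hyphen
parts = re.split(r',|-', text)
print(parts)  # ['Hey', ' you ', ' what are you doing here!?']

# Assumed refinement: strip the whitespace left around each piece
cleaned = [part.strip() for part in parts]
print(cleaned)  # ['Hey', 'you', 'what are you doing here!?']
```

Note that neither version splits on spaces, so the last piece is still a phrase rather than a list of words.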

  • 2020-11-21 05:47

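    Using maketrans and translate, you can do it easily and neatly:

    # Python 3: maketrans is a static method of str (string.maketrans exists only in Python 2)
    body = "Hey, you - what are you doing here!?"  # sample input
    specials = ',.!?:;"()<>[]#$=-/'
    # Map every special character to a space
    trans = str.maketrans(specials, ' ' * len(specials))
    body = body.translate(trans)
    words = body.strip().split()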
    
  • 2020-11-21 05:47

    First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.

    I come across this pretty frequently, and my usual solution doesn't require re.

    One-liner lambda function w/ list comprehension:

    (requires import string):

    split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
        text.split() if word.strip(string.punctuation) != '']
    
    # Call function
    split_without_punc("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
    


    Function (traditional)

    As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

    def split_without_punctuation2(text):
    
        # Split by whitespace
        words = text.split()
    
        # Strip punctuation from each word
        return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']
    
    split_without_punctuation2("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
    

    It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.
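
A short sketch of that behaviour: the sample sentence is hypothetical, and the function is a restatement of the one above so the snippet runs on its own.

```python
import string

def split_without_punc(text):
    # Strip punctuation from both ends of each whitespace-separated token
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop tokens that were punctuation only
    return [word for word in stripped if word]

sample = "Don't over-think it, well-known tricks work!"  # hypothetical input

kept = split_without_punc(sample)
print(kept)
# ["Don't", 'over-think', 'it', 'well-known', 'tricks', 'work']

# Turn hyphens into spaces first to split hyphenated words as well
separated = split_without_punc(sample.replace("-", " "))
print(separated)
# ["Don't", 'over', 'think', 'it', 'well', 'known', 'tricks', 'work']
```

Only leading and trailing punctuation is stripped, which is why the apostrophe in "Don't" and the inner hyphens survive in the first call.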

    General Function w/o Lambda or List Comprehension

    For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

    def split_without(text: str, ignore: str) -> list:
    
        # Split by whitespace
        split_string = text.split()
    
        # Strip any characters in the ignore string, and ignore empty strings
        words = []
        for word in split_string:
            word = word.strip(ignore)
            if word != '':
                words.append(word)
    
        return words
    
    # Situation-specific call to general function
    import string
    final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
    

    Of course, you can always generalize the lambda function to any specified string of characters as well.
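
One way that generalization might look, as a sketch: the name split_without_chars and the custom character set in the second call are illustrative assumptions.

```python
import string

# Generalized lambda: the characters to strip are passed as a second argument
split_without_chars = lambda text, ignore: [
    word.strip(ignore) for word in text.split() if word.strip(ignore) != ''
]

all_punct = split_without_chars("Hey, you -- what are you doing?!", string.punctuation)
print(all_punct)
# ['Hey', 'you', 'what', 'are', 'you', 'doing']

# Strip only commas and exclamation marks (an arbitrary set for illustration)
custom = split_without_chars("Hey, you -- what are you doing?!", ",!")
print(custom)
# ['Hey', 'you', '--', 'what', 'are', 'you', 'doing?']
```

With the narrower set, the "--" token and the question mark remain, since neither character is in the ignore string.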

  • 2020-11-21 05:48

    Here's my take on it:

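    def split_string(source, splitlist):
        # frozenset gives fast membership tests for the delimiter characters
        splits = frozenset(splitlist)
        l = []
        s1 = ""
        for c in source:
            if c in splits:
                # Delimiter: store the current piece, if any, and start a new one
                if s1:
                    l.append(s1)
                    s1 = ""
            else:
                s1 = s1 + c
        # Store the last piece when the source does not end with a delimiter
        if s1:
            l.append(s1)
        return l
    
    >>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
    >>> print(out)
    ['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']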
    
  • 2020-11-21 05:51

    First of all, compile the pattern with re.compile() before using a regex in a loop. The compiled object is reused directly, so the pattern does not have to be looked up again on every call, which is typically somewhat faster.

    So for your problem, first compile the pattern and then run it against the text.

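    import re
    DATA = "Hey, you - what are you doing here!?"
    # Raw string so \w reaches the regex engine unchanged;
    # the pattern matches runs of word characters and apostrophes
    reg_tok = re.compile(r"[\w']+")
    print(reg_tok.findall(DATA))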
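
To illustrate the loop case mentioned above, here is a sketch that compiles the pattern once and reuses it for several lines. The name WORD_RE and the second input line are assumptions for illustration:

```python
import re

# Compile once, outside the loop
WORD_RE = re.compile(r"[\w']+")

lines = [
    "Hey, you - what are you doing here!?",
    "Don't panic; it's fine.",  # hypothetical second line
]

# Reuse the compiled pattern for every line
results = [WORD_RE.findall(line) for line in lines]
for words in results:
    print(words)
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
# ["Don't", 'panic', "it's", 'fine']
```

Including the apostrophe in the character class keeps contractions such as "Don't" together as one word.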
    