I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
Here is my go at a split with multiple delimiters:
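def msplit(text, delims):
    # Yield each maximal run of characters that are not in delims
    w = ''
    for z in text:
        if z not in delims:
            w += z
        else:
            if len(w) > 0:
                yield w
                w = ''
    # Emit the last word if the text does not end with a delimiter
    if len(w) > 0:
        yield w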
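As a cross-check, the same character-by-character split can be sketched with itertools.groupby from the standard library. This is a swapped-in technique rather than the generator above, and msplit_groupby is a hypothetical name used only for illustration:

```python
from itertools import groupby

def msplit_groupby(text, delims):
    # Group consecutive characters by whether they are delimiters,
    # then keep only the runs made of non-delimiter characters.
    return [''.join(run)
            for is_delim, run in groupby(text, key=lambda ch: ch in delims)
            if not is_delim]

print(msplit_groupby("Hey, you - what are you doing here!?", " ,-!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Note that whitespace has to be listed among the delimiters, otherwise spaces stay attached to the words.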
Instead of using the re module function re.split, you can achieve the same result with the pandas Series.str.split method.
First, create a Series holding the string, then apply the method to it.
import pandas as pd
thestring = pd.Series("Hey, you - what are you doing here!?")
thestring.str.split(pat=',|-')
The pat parameter takes the delimiters as a regular expression, and the method returns a Series whose element is the list of pieces. Here the two delimiters are joined with | (the regex "or" operator). The output is as follows:
[Hey, you , what are you doing here!?]
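For comparison, here is a minimal sketch of the same pattern passed to re.split from the standard library. The cleanup step that strips whitespace is an addition for illustration, not part of the answer above:

```python
import re

text = "Hey, you - what are you doing here!?"

# Split on a comma or a hyphen, the same pattern as in the pandas call
parts = re.split(r",|-", text)
print(parts)    # ['Hey', ' you ', ' what are you doing here!?']

# Strip surrounding whitespace and drop any empty pieces
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)  # ['Hey', 'you', 'what are you doing here!?']
```

Either way, splitting on only these two delimiters leaves the spaces and the trailing punctuation in place, so it does not yet give a list of words.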
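Using maketrans and translate, you can do it easily and neatly (in Python 3, maketrans is a static method of str rather than a function in the string module):
body = "Hey, you - what are you doing here!?"  # example input
specials = ',.!?:;"()<>[]#$=-/'
# Map every special character to a space
trans = str.maketrans(specials, ' ' * len(specials))
body = body.translate(trans)
words = body.strip().split()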
First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.
I come across this pretty frequently, and my usual solution doesn't require re.
(requires import string):
split_without_punc = lambda text: [word.strip(string.punctuation) for word in
    text.split() if word.strip(string.punctuation) != '']
# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
As a traditional function, this is still only two lines with a list comprehension (in addition to import string):
def split_without_punctuation2(text):
    # Split by whitespace
    words = text.split()
    # Strip punctuation from each word and drop words that end up empty
    return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.
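To make the point about contractions and hyphens concrete, here is a small sketch. The helper name split_words and the sample sentence are made up for illustration; the approach is the same strip-based one:

```python
import string

def split_words(text):
    # Split on whitespace, then strip punctuation from both ends of each piece
    return [w.strip(string.punctuation) for w in text.split()
            if w.strip(string.punctuation)]

sample = "Don't touch the well-known switch!"

# Interior apostrophes and hyphens survive because strip() only trims the ends
print(split_words(sample))
# ["Don't", 'touch', 'the', 'well-known', 'switch']

# Replacing hyphens with spaces first splits hyphenated words as well
print(split_words(sample.replace("-", " ")))
# ["Don't", 'touch', 'the', 'well', 'known', 'switch']
```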
For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:
def split_without(text: str, ignore: str) -> list:
    # Split by whitespace
    split_string = text.split()
    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)
    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
Of course, you can always generalize the lambda function to any specified string of characters as well.
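One way that generalization might look, as a sketch only. The name split_without_chars and its ignore parameter are assumptions for illustration:

```python
# Lambda taking both the text and the characters to strip from each word
split_without_chars = lambda text, ignore: [word.strip(ignore) for word in
    text.split() if word.strip(ignore) != '']

print(split_without_chars("Hey, you - what are you doing?!", ",-?!"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing']

# Stripping only commas leaves the other punctuation in place
print(split_without_chars("Hey, you - what are you doing?!", ","))
# ['Hey', 'you', '-', 'what', 'are', 'you', 'doing?!']
```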
Here's my take on it:
def split_string(source, splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            # A delimiter ends the current token
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    # Keep the trailing token, if any
    if s1:
        l.append(s1)
    return l

>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print(out)
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']
First of all, compile the pattern with re.compile() when you apply the same regex repeatedly in a loop: it avoids the pattern-cache lookup on every call, although re caches recently used patterns, so the gain is usually modest.
So for your problem, first compile the pattern and then apply it:
import re
DATA = "Hey, you - what are you doing here!?"
# Raw string so that \w reaches the regex engine unchanged
reg_tok = re.compile(r"[\w']+")
print(reg_tok.findall(DATA))
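Reusing the compiled pattern across several inputs might look like the sketch below. The name WORD_RE and the second sample line are made up for illustration:

```python
import re

# Compile once, outside the loop
WORD_RE = re.compile(r"[\w']+")

lines = [
    "Hey, you - what are you doing here!?",
    "Don't panic; it's fine.",
]

for line in lines:
    # findall returns every run of word characters and apostrophes
    print(WORD_RE.findall(line))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
# ["Don't", 'panic', "it's", 'fine']
```

Including the apostrophe in the character class keeps contractions such as "Don't" in one piece.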