I think what I want to do is a fairly common task, but I've found no reference for it on the web. I have text with punctuation, and I want a list of the words.
Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:
def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else:
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output
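For comparison, here is a rough sketch of the same interface built on the re module instead of a character loop. The name split_string_re is made up for this illustration, and it assumes Python 3 and that runs of delimiters should count as a single split:

```python
import re

def split_string_re(source, splitlist):
    # Hypothetical variant: same arguments as split_string above.
    if not splitlist:
        # No delimiters given: the whole string is one word (if non-empty).
        return [source] if source else []
    # re.escape makes characters such as '-', ']' and '^' safe inside [...].
    pattern = "[" + re.escape(splitlist) + "]+"
    # re.split leaves empty strings at the edges when the text starts
    # or ends with a delimiter, so filter them out.
    return [word for word in re.split(pattern, source) if word]

print(split_string_re("Hey, you - what are you doing here!?", " ,-!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Whether this is preferable to the explicit loop is mostly a matter of taste; the loop needs no imports and is easier to step through.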
Another way to achieve this is to use the Natural Language Toolkit (NLTK).
import nltk
data = "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print(word_tokens)
This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
The biggest drawback of this method is that you need to install the nltk package.
The benefits are that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.
A case where regular expressions are justified:
import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
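The apostrophe in the character class matters once the text contains contractions. A small sketch (the sample sentence is made up, Python 3 assumed) comparing the plain \w+ pattern with the one above:

```python
import re

sample = "Don't stop, it's fine"  # made-up sentence with contractions

plain = re.findall(r"\w+", sample)          # apostrophes break words apart
with_quote = re.findall(r"[\w']+", sample)  # apostrophes stay inside words

print(plain)       # ['Don', 't', 'stop', 'it', 's', 'fine']
print(with_quote)  # ["Don't", 'stop', "it's", 'fine']
```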
I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful:
str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()
If you don't want to split at spaces, put some other character in place of the space and split on that same character.
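As a sketch of that variation: the helper name split_keep_spaces and the default placeholder are assumptions for this example. The placeholder just has to be a character that never occurs in the text.

```python
def split_keep_spaces(text, splitat, placeholder="\x00"):
    # Replace every delimiter with the placeholder instead of a space,
    # so spaces inside the text survive.
    marked = "".join(placeholder if ch in splitat else ch for ch in text)
    # split() with an explicit separator returns empty strings between
    # adjacent delimiters, so drop those.
    return [part for part in marked.split(placeholder) if part]

print(split_keep_spaces("new york:los angeles.rome", ":."))
# ['new york', 'los angeles', 'rome']
```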
I think the following is the best answer to suit your needs:
\W+
It may be suitable for this case, but may not be suitable for other cases.
list(filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?")))
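To illustrate where the two approaches differ, here is a small sketch (made-up sample text, Python 3 assumed) that splits the same sentence on \W+ and on an explicit list of delimiters:

```python
import re

text = "A well-known trick, right?"  # made-up sample with a hyphenated word

# \W+ treats every non-word character, including '-', as a delimiter.
by_nonword = [w for w in re.split(r"\W+", text) if w]

# An explicit class splits only on the characters you list.
by_listed = [w for w in re.split(r"[ ,!?]+", text) if w]

print(by_nonword)  # ['A', 'well', 'known', 'trick', 'right']
print(by_listed)   # ['A', 'well-known', 'trick', 'right']
```

Both calls need the empty strings filtered out, because re.split returns one whenever the text starts or ends with a delimiter.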
Another quick way to do this without a regexp is to replace the characters first, as below:
>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']
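With more than a couple of delimiters the chain of replace() calls gets long. One possible generalization, assuming Python 3 (where str.maketrans is available), maps every delimiter to a space in a single pass; split_on_chars is a made-up name for this sketch:

```python
def split_on_chars(text, delimiters):
    # Build a translation table that maps each delimiter to a space.
    table = str.maketrans(delimiters, " " * len(delimiters))
    # split() with no arguments then splits on runs of whitespace.
    return text.translate(table).split()

print(split_on_chars("a;bcd,ef g", ";,"))
# ['a', 'bcd', 'ef', 'g']
```

Like the replace() version, this always splits on whitespace as well.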