Question
This is the code that I am using:
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# replace URLs with the token 'URL', strip the '#' from hashtags, drop quotes
ho = ho.replace(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', regex=True)
ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho = ho.replace(r'[\'"]', '', regex=True)
lem = WordNetLemmatizer()
stem = PorterStemmer()
eng_stopwords = stopwords.words('english')
ho = ho.to_frame(name=None)
a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                 index=True, na_rep='NaN', formatters=None, float_format=None,
                 sparsify=False, index_names=True, justify=None, line_width=None,
                 max_rows=None, max_cols=None, show_dimensions=False)
fg = stem.stem(a)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print(wordList)
When I print a I get the output below, and I am not able to perform word tokenization on it properly.
tweet
0 1495596971.6034188automotive auto ebc greenstu...
1 1495596972.330948new free stock photo of city ...
2 1495596972.775966ebay 1974 volkswagen beetle -...
3 1495596975.6460807cars fly off a hidden speed ...
4 1495596978.12868rt @jiikae guys i think mario ...
These are the first 5 lines of the CSV file:
"1495596971.6034188::automotive auto ebc greenstuff 6000 series supreme
truck and suv brake pads dp61603 https:\/\/t.co\/jpylzjyd5o cars\u2026
https:\/\/t.co\/gfsbz6pkj7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596972.330948::new free stock photo of city cars road
https:\/\/t.co\/qbkgvkfgpp""display_text_range:[0"
"1495596972.775966::ebay: 1974 volkswagen beetle - classic 1952 custom
conversion extremely rare 1974 vw beetle\u2026\u2026
https:\/\/t.co\/wdsnf2pmo7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596975.6460807::cars fly off a hidden speed bump
https:\/\/t.co\/fliiqwt1rk https:\/\/t.co\/klx7kfooro""display_text_range:
[056]source:""\u003ca href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
1495596978.12868::rt @jiikae: guys i think mario is going through a mid-life
crisis. buying expensive cars using guns hanging out with proport\u2026
Answer 1:
I think you need str.split to get a list of all the words - it splits on all whitespace. You also need ho['tweet'] to select the column tweet:

#output is a string per row
ho1 = ho['tweet'].str.split().apply(lambda x: ' '.join([word for word in x if word not in eng_stopwords]))
Or:
#output is a list per row
ho1 = ho['tweet'].str.split().apply(lambda x: [word for word in x if word not in eng_stopwords])
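A minimal, self-contained sketch of the per-row approach above, using a tiny hand-made DataFrame and a stand-in stopword set instead of the asker's CSV and NLTK's stopwords.words('english'):

```python
import pandas as pd

# stand-in for NLTK's English stopword list
eng_stopwords = {'a', 'the', 'of', 'is'}

ho = pd.DataFrame({'tweet': ['new free stock photo of city cars',
                             'cars fly off a hidden speed bump']})

# string output: stopwords removed, words rejoined per row
joined = ho['tweet'].str.split().apply(
    lambda words: ' '.join(w for w in words if w not in eng_stopwords))

# list output: one list of kept words per row
as_list = ho['tweet'].str.split().apply(
    lambda words: [w for w in words if w not in eng_stopwords])

print(joined.tolist())   # ['new free stock photo city cars', 'cars fly off hidden speed bump']
print(as_list.tolist())
```

The key point is that the lambda receives each row's own word list (here called words), so filtering happens row by row rather than on one big string built from the whole frame.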
instead of:
ho = ho.to_frame(name=None)
a=ho.to_string(buf=None, columns=None, col_space=None, header=True,
index=True, na_rep='NaN', formatters=None, float_format=None,
sparsify=False, index_names=True, justify=None, line_width=None,
max_rows=None, max_cols=None, show_dimensions=False)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print (wordList)
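To see why the to_string() route goes wrong, here is a small sketch (with plain str.split standing in for word_tokenize): the dump mixes the column header and the row index into the text, so they end up in the token list.

```python
import pandas as pd

ho = pd.DataFrame({'tweet': ['cars fly off a hidden speed bump']})

# to_string() renders the column header and the row index along with the data
dump = ho.to_string()
tokens = dump.split()
print(tokens[:3])   # ['tweet', '0', 'cars'] -- 'tweet' and '0' are formatting noise
```

Selecting the column and splitting per row, as in the answer above, avoids that noise entirely.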
Source: https://stackoverflow.com/questions/44157005/how-can-i-enlarge-the-below-output-in-python-because-want-to-use-it-as-an-input