Question
This is the code that I am using:
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# replace URLs with the token 'URL', strip the '#' from hashtags, drop quotes
ho = ho.replace(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', regex=True)
ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho = ho.replace(r'[\'"]', '', regex=True)
lem = WordNetLemmatizer()
stem = PorterStemmer()
eng_stopwords = stopwords.words('english')
ho = ho.to_frame(name=None)
a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                 index=True, na_rep='NaN', formatters=None, float_format=None,
                 sparsify=False, index_names=True, justify=None, line_width=None,
                 max_rows=None, max_cols=None, show_dimensions=False)
fg = stem.stem(a)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print(wordList)
When I print a I get the output below, and I am not able to perform word tokenization on it properly.
tweet
0 1495596971.6034188automotive auto ebc greenstu...
1 1495596972.330948new free stock photo of city ...
2 1495596972.775966ebay 1974 volkswagen beetle -...
3 1495596975.6460807cars fly off a hidden speed ...
4 1495596978.12868rt @jiikae guys i think mario ...
These are the first 5 lines of the CSV file:
"1495596971.6034188::automotive auto ebc greenstuff 6000 series supreme
truck and suv brake pads dp61603 https:\/\/t.co\/jpylzjyd5o cars\u2026
https:\/\/t.co\/gfsbz6pkj7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596972.330948::new free stock photo of city cars road
https:\/\/t.co\/qbkgvkfgpp""display_text_range:[0"
"1495596972.775966::ebay: 1974 volkswagen beetle - classic 1952 custom
conversion extremely rare 1974 vw beetle\u2026\u2026
https:\/\/t.co\/wdsnf2pmo7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596975.6460807::cars fly off a hidden speed bump
https:\/\/t.co\/fliiqwt1rk https:\/\/t.co\/klx7kfooro""display_text_range:
[056]source:""\u003ca href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
1495596978.12868::rt @jiikae: guys i think mario is going through a mid-life
crisis. buying expensive cars using guns hanging out with proport\u2026
Answer 1:
I think you need str.split to get a list of all the words - it splits on all whitespace. You also need ho['tweet'] to select the column tweet:

#output is a string per row
ho1 = ho['tweet'].str.split().apply(lambda x: ' '.join([word for word in x if word not in eng_stopwords]))
Or:
#output is a list per row
ho1 = ho['tweet'].str.split().apply(lambda x: [word for word in x if word not in eng_stopwords])
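A minimal, self-contained sketch of the per-row approach above, using a tiny hand-made DataFrame and a stand-in stopword set instead of the asker's CSV and NLTK's stopwords.words('english'):

```python
import pandas as pd

# stand-in for NLTK's English stopword list
eng_stopwords = {'a', 'the', 'of', 'is'}

ho = pd.DataFrame({'tweet': ['new free stock photo of city cars',
                             'cars fly off a hidden speed bump']})

# string output: stopwords removed, words rejoined per row
joined = ho['tweet'].str.split().apply(
    lambda words: ' '.join(w for w in words if w not in eng_stopwords))

# list output: one list of kept words per row
as_list = ho['tweet'].str.split().apply(
    lambda words: [w for w in words if w not in eng_stopwords])

print(joined.tolist())   # ['new free stock photo city cars', 'cars fly off hidden speed bump']
print(as_list.tolist())
```

The key point is that the lambda receives each row's own word list (here called words), so filtering happens row by row rather than on one big string built from the whole frame.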
instead of:
ho = ho.to_frame(name=None)
a=ho.to_string(buf=None, columns=None, col_space=None, header=True,
index=True, na_rep='NaN', formatters=None, float_format=None,
sparsify=False, index_names=True, justify=None, line_width=None,
max_rows=None, max_cols=None, show_dimensions=False)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print (wordList)
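To see why the to_string() route goes wrong, here is a small sketch (with plain str.split standing in for word_tokenize): the dump mixes the column header and the row index into the text, so they end up in the token list.

```python
import pandas as pd

ho = pd.DataFrame({'tweet': ['cars fly off a hidden speed bump']})

# to_string() renders the column header and the row index along with the data
dump = ho.to_string()
tokens = dump.split()
print(tokens[:3])   # ['tweet', '0', 'cars'] -- 'tweet' and '0' are formatting noise
```

Selecting the column and splitting per row, as in the answer above, avoids that noise entirely.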
Source: https://stackoverflow.com/questions/44157005/how-can-i-enlarge-the-below-output-in-python-because-want-to-use-it-as-an-input