Question
I have a text file which I converted to a dataframe using the command below:
df = pd.read_csv("C:\\Users\\Sriram\\Desktop\\New folder (4)\\aclImdb\\test\\result.txt",
                 sep='\t', names=['reviews', 'polarity'])
Here the reviews column contains the movie reviews, and the polarity column indicates whether each review is positive or negative.
I have the feature function below, to which the reviews column (nearly 1000 reviews) from the dataframe needs to be passed.
def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
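For reference, the function behaves as expected on a single review string. A quick check (a sketch; word_features here is a hypothetical vocabulary list standing in for the real one built elsewhere in the notebook):

from nltk.tokenize import word_tokenize

word_features = ['poor', 'excuse', 'great']  # hypothetical vocabulary, for illustration only
print(find_features("This is a poor excuse for a movie"))
# {'poor': True, 'excuse': True, 'great': False}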
I am creating a training dataset using the line below.
trainsets = [find_features(df.reviews), df.polarity]
By doing this, all the words in my reviews column should be split by the tokenize call in find_features and assigned a polarity (positive or negative).
For example:
reviews                              polarity
This is a poor excuse for a movie    negative
For the above case, after calling find_features, if the condition inside the function is satisfied, I expect output like:
poor - negative
excuse - negative
and so on....
When I call this function, I get the error below:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-79-76f9090c0532> in <module>()
     30     return features
     31
---> 32 featuresets = [find_features(df.reviews), df.polarity]
     33 #featuresets = [(find_features(rev), category) for ((rev, category)) in reviews]
     34 '''

<ipython-input-79-76f9090c0532> in find_features(document)
     24
     25 def find_features(document):
---> 26     words = word_tokenize(document)
     27     features = {}
     28     for w in word_features:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language)
    102     :param language: the model name in the Punkt corpus
    103     """
--> 104     return [token for sent in sent_tokenize(text, language)
    105             for token in _treebank_word_tokenize(sent)]
    106

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
     87     """
     88     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 89     return tokenizer.tokenize(text)
     90
     91 # Standard word tokenizer.

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1224         Given a text, returns a list of the sentences in that text.
   1225         """
-> 1226         return list(self.sentences_from_text(text, realign_boundaries))
   1227
   1228     def debug_decisions(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1272         follows the period.
   1273         """
-> 1274         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1275
   1276     def _slices_from_text(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1302         """
   1303         realign = 0
-> 1304         for sl1, sl2 in _pair_iter(slices):
   1305             sl1 = slice(sl1.start + realign, sl1.stop)
   1306             if not sl2:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    308     """
    309     it = iter(it)
--> 310     prev = next(it)
    311     for el in it:
    312         yield (prev, el)

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1276     def _slices_from_text(self, text):
   1277         last_break = 0
-> 1278         for match in self._lang_vars.period_context_re().finditer(text):
   1279             context = match.group() + match.group('after_tok')
   1280             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
How can I call a function directly on a dataframe column that has multiple rows of values (in my case, reviews)?
Answer 1:
Going by the expected output you mentioned:
poor - negative
excuse - negative
I would suggest:
trainsets = df.apply(lambda row: ([(kw, row.polarity) for kw in find_features(row.reviews)]), axis=1)
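The original error occurs because word_tokenize expects a single string, whereas df.reviews is the entire pandas Series; the row-wise apply above hands it one review string at a time. A minimal sketch of the mismatch (hypothetical one-row data, for illustration only):

import pandas as pd
from nltk.tokenize import word_tokenize

s = pd.Series(["This is a poor excuse for a movie"])  # hypothetical one-row Series
word_tokenize(s[0])  # OK: a single string
word_tokenize(s)     # TypeError: expected string or bytes-like object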
Adding a sample snippet for reference:
import pandas as pd
from io import StringIO

print('pandas-version:', pd.__version__)

data_str = """
col1,col2
'leoperd lion tiger','non-veg'
'buffalo antelope elephant','veg'
'dog cat crow','all'
"""
data_str = StringIO(data_str)

# a dataframe with 2 columns
df = pd.read_csv(data_str)

# a dummy function taking a col1 value from each row
# and splitting it into multiple values, returning a list
def my_fn(row_val):
    return row_val.split(' ')

# calling a row-wise apply (vector operation) on the dataframe
train_set = df.apply(lambda row: ([(kw, row.col2) for kw in my_fn(row.col1)]), axis=1)
print(train_set)
output:
pandas-version: 0.15.2
0 [('leoperd, 'non-veg'), (lion, 'non-veg'), (ti...
1 [('buffalo, 'veg'), (antelope, 'veg'), (elepha...
2 [('dog, 'all'), (cat, 'all'), (crow', 'all')]
dtype: object
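Mapping the same pattern back to the original question (a sketch; it assumes the df and find_features from the question, and pairs each row's feature dict with its label, which is the (features, label) shape NLTK classifiers expect):

trainsets = df.apply(lambda row: (find_features(row.reviews), row.polarity), axis=1).tolist()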
@SriramChandramouli, hope I understood your requirement correctly.
Source: https://stackoverflow.com/questions/36672475/python-getting-typeerror-expected-string-or-bytes-like-object-while-calling-a