Tokenise text and create more rows for each row in dataframe

Submitted by 吃可爱长大的小学妹 on 2019-12-25 17:37:34

Question


I want to do this with Python and pandas.

Let's suppose that I have the following:

file_id   text
1         I am the first document. I am a nice document.
2         I am the second document. I am an even nicer document.

and I finally want to have the following:

file_id   text
1         I am the first document
1         I am a nice document
2         I am the second document
2         I am an even nicer document

So I want the text of each file to be split at every full stop, with a new row created for each resulting sentence.

What is the most efficient way to do this?


Answer 1:


Use:

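s = (df.pop('text')                        # take 'text' out of df as a Series
       .str.rstrip('.')                    # drop the trailing full stop
       .str.split(r'\.\s+', expand=True)   # split on a full stop followed by whitespace
       .stack()                            # one row per sentence
       .rename('text')
       .reset_index(level=1, drop=True))   # keep only the original row index

df = df.join(s).reset_index(drop=True)
print(df)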
   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document

Explanation:

First extract the column with DataFrame.pop, remove the trailing . with Series.str.rstrip, and split with Series.str.split, escaping the . because it is a special regex character. Then reshape the result into a Series with DataFrame.stack, name it with Series.rename, and drop the inner index level with Series.reset_index so that DataFrame.join can attach it back to the original by row index.
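
As an alternative to the stack-based reshape described above, the same result can probably be reached with DataFrame.explode. This is not part of the original answer: it is a minimal sketch that assumes pandas 0.25 or later (where explode is available), and the sample frame is rebuilt from the question so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'file_id': [1, 2],
    'text': ["I am the first document. I am a nice document.",
             "I am the second document. I am an even nicer document."],
})

# Turn each text into a list of sentences, then give every list element its own row.
out = (df.assign(text=df['text'].str.rstrip('.').str.split(r'\.\s+'))
         .explode('text')
         .reset_index(drop=True))
print(out)
```

Because explode works on lists of any length, rows with different numbers of sentences need no padding with missing values.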




Answer 2:


import pandas as pd

df = pd.DataFrame({'file_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})

# Split on full stops, trim surrounding whitespace and drop empty fragments
df['sents'] = df.text.apply(lambda txt: [x.strip() for x in txt.split(".") if x.strip()])
# Spread each list across columns, then stack the columns into rows
df = (df.set_index(['file_id'])
        .apply(lambda x: pd.Series(x['sents']), axis=1)
        .stack()
        .reset_index(level=1, drop=True))
df = df.reset_index()
df.columns = ['file_id', 'text']


Source: https://stackoverflow.com/questions/56290155/tokenise-text-and-create-more-rows-for-each-row-in-dataframe
