python pandas dataframe words in context: get 3 words before and after

Submitted by 我的梦境 on 2020-01-06 15:43:27

Question


I am working in a Jupyter notebook and have a pandas dataframe "data":

Question_ID | Customer_ID | Answer
      1           234         Data is very important to use because ... 
      2           234         We value data since we need it ... 

I want to go through the text in column "Answer" and get the three words before and after the word "data". So in this scenario I would have gotten "is very important"; "We value", "since we need".

Is there a good way to do this within a pandas dataframe? So far I have only found solutions where "Answer" is its own file run through Python code (without a pandas dataframe). While I realize that I need to use the NLTK library, I haven't used it before, so I don't know what the best approach would be. (This was a great example: "Extracting a word and its prior 10 word context to a dataframe in Python".)


Answer 1:


This may work:

import pandas as pd
import re

df = pd.read_csv('data.csv')

for value in df.Answer.values:
    non_data = re.split('Data|data', value)  # split the text on "data", dropping the keyword itself
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty pieces
    # take the first three words of each piece; note that for text *before* "data"
    # this yields the first three words of the piece, not the three nearest the keyword
    substrs = [term.split()[0:3] for term in terms_list]
    result = [' '.join(term) for term in substrs]  # join the words back into substrings
    print(result)

Output:

['is very important']
['We value', 'since we need']
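
The split-based loop above takes the first three words of every piece, so a longer stretch of text before "data" would not give the three words nearest the keyword. A minimal sketch of a token-window alternative follows; the helper name context_windows, the "context" column, and the inline sample frame are illustrative assumptions, not part of the original answer:

```python
import re
import pandas as pd


def context_windows(text, keyword="data", n=3):
    """Return a list of (before, after) strings: up to n words on each
    side of every case-insensitive occurrence of keyword."""
    tokens = re.findall(r"\w+", text)  # crude word tokenizer, drops punctuation
    windows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            before = " ".join(tokens[max(0, i - n):i])
            after = " ".join(tokens[i + 1:i + 1 + n])
            windows.append((before, after))
    return windows


# Sample frame mirroring the table in the question (assumed, for illustration).
df = pd.DataFrame({
    "Question_ID": [1, 2],
    "Customer_ID": [234, 234],
    "Answer": [
        "Data is very important to use because ...",
        "We value data since we need it ...",
    ],
})

# Store the windows next to each answer instead of only printing them.
df["context"] = df["Answer"].apply(context_windows)
print(df["context"].tolist())
# [[('', 'is very important')], [('We value', 'since we need')]]
```

Keeping the result in a column preserves the link between each window and its Question_ID, which the printed lists lose.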



Answer 2:


A solution using a generator expression together with re.findall and itertools.chain.from_iterable:

import pandas as pd, re, itertools

data = pd.read_csv('test.csv')  # replace with the path to your own file

data_adjacents = ((i for sublist in (list(filter(None,t))
                         for t in re.findall(r'(\w*?\s*\w*?\s*\w*?\s+)(?=\bdata\b)|(?<=\bdata\b)(\s+\w*\s*\w*\s*\w*)', l, re.I)) for i in sublist)
                            for l in data.Answer.tolist())

print(list(itertools.chain.from_iterable(data_adjacents)))

The output:

[' is very important', 'We value ', ' since we need']
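
The flattened list above keeps leading and trailing whitespace and no longer records which row each fragment came from. One possible way to keep that link is sketched below with re.finditer and named groups; the pattern, the column names "row", "before", and "after", and the inline sample frame are assumptions made for illustration, and overlapping windows (two occurrences of "data" within three words of each other) are not handled:

```python
import re
import pandas as pd

# Up to three words before and after a whole-word, case-insensitive "data".
pattern = re.compile(
    r"(?P<before>(?:\w+\s+){0,3})\bdata\b(?P<after>(?:\s+\w+){0,3})",
    re.IGNORECASE,
)

# Sample frame mirroring the table in the question (assumed, for illustration).
df = pd.DataFrame({
    "Answer": [
        "Data is very important to use because ...",
        "We value data since we need it ...",
    ],
})

# One record per match, tagged with the index of the row it came from.
rows = [
    {
        "row": idx,
        "before": m.group("before").strip(),
        "after": m.group("after").strip(),
    }
    for idx, text in df["Answer"].items()
    for m in pattern.finditer(text)
]

context = pd.DataFrame(rows, columns=["row", "before", "after"])
print(context)
```

The resulting long-form frame can be merged back onto the original dataframe by its "row" column if the Question_ID or Customer_ID is needed alongside each window.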


Source: https://stackoverflow.com/questions/41127321/python-pandas-dataframe-words-in-context-get-3-words-before-and-after
