Question
I have a dataframe that looks like the following
Utterance Frequency
Directions to Starbucks 1045
Show me directions to Starbucks 754
Give me directions to Starbucks 612
Navigate me to Starbucks 498
Display navigation to Starbucks 376
Direct me to Starbucks 201
Navigate to Starbucks 180
Here there is some data showing utterances made by people and how frequently each was said.
I.e., "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, etc.
I was able to get the desired output with the following:
df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
      )
print (df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
However, I'm also trying to look for the following output:
print (df)
Words Frequency
0 Directions 2411
1 Give_Show_Direct_Navigate 2245
2 Display 376
3 Starbucks 3666
4 me 2065
5 navigation 376
6 to 3666
I.e., I'm trying to figure out a way to combine certain phrases and get the frequency of those words. For example, if the speaker says "Seattles_Best" or "Tullys", then ideally I would add it to "Starbucks" and rename the group "coffee_shop" or something like that.
Thanks!!
Answer 1:
Here is one way, sticking with collections.Counter from your previous question. You can add any number of tuples to lst to append additional results for combinations of your choice.
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

c = Counter()
for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
                  .rename(columns={0: 'Count'})\
                  .sort_values('Count', ascending=False)

def add_combinations(df, lst):
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

lst = [('Give', 'Show', 'Navigate', 'Direct')]
res = add_combinations(res, lst)
Result
Count
to 3666
Starbucks 3666
Give_Show_Navigate_Direct 2245
me 2065
directions 1366
Directions 1045
Show 754
Navigate 678
Give 612
Display 376
navigation 376
Direct 201
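As noted above, extra tuples can be appended to lst. A minimal sketch of that, reusing the add_combinations helper with a trimmed-down counts frame; the "Seattles_Best"/"Tullys" tokens are the hypothetical examples from the question (absent from the counts, so only Starbucks contributes to that combination):

```python
import pandas as pd

def add_combinations(df, lst):
    # For each tuple of words, add a row whose Count is the sum of those words' counts
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

# Stand-in for the `res` frame built above, trimmed to a few words
res = pd.DataFrame({'Count': [612, 754, 678, 201, 3666]},
                   index=['Give', 'Show', 'Navigate', 'Direct', 'Starbucks'])

# Two combinations at once: the verbs, and the coffee shops
lst = [('Give', 'Show', 'Navigate', 'Direct'),
       ('Starbucks', 'Seattles_Best', 'Tullys')]
res = add_combinations(res, lst)
print(res)
```

Words missing from the index simply contribute nothing to the sum, so speculative tokens are safe to list.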
Answer 2:
Here's a solution which begins with your current results set and edits appropriately:
print (df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
First, create a dictionary which maps current words to your chosen new word:
phrase_map = {'Starbucks': 'coffee_shop',
              'Seattles_Best': 'coffee_shop',
              'Tullys': 'coffee_shop',
              'Give': 'Give_Show_Direct_Navigate',
              'Show': 'Give_Show_Direct_Navigate',
              'Direct': 'Give_Show_Direct_Navigate',
              'Navigate': 'Give_Show_Direct_Navigate'}
Then lookup each word, replacing with the new value if found, else keep the original value:
df['Words'] = df['Words'].apply(lambda x: phrase_map.get(x, x))
Then group:
df.groupby('Words').sum()
Results:
Frequency
Words
Directions 1045
Display 376
Give_Show_Direct_Navigate 2245
coffee_shop 3666
directions 1366
me 2065
navigation 376
to 3666
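The same lookup-with-fallback can be done without a lambda: pandas' Series.replace takes the dictionary directly and leaves unmapped values unchanged. A self-contained sketch (the frame is trimmed to a few words from the results above):

```python
import pandas as pd

phrase_map = {'Starbucks': 'coffee_shop',
              'Give': 'Give_Show_Direct_Navigate',
              'Show': 'Give_Show_Direct_Navigate',
              'Direct': 'Give_Show_Direct_Navigate',
              'Navigate': 'Give_Show_Direct_Navigate'}

df = pd.DataFrame({'Words': ['Direct', 'Directions', 'Give', 'Show',
                             'Navigate', 'Starbucks', 'to'],
                   'Frequency': [201, 1045, 612, 754, 678, 3666, 3666]})

# replace() swaps values found as keys in phrase_map; everything else passes through
out = (df.assign(Words=df['Words'].replace(phrase_map))
         .groupby('Words', as_index=False)['Frequency'].sum())
print(out)
```

Both approaches give identical results here; replace() just avoids the explicit per-element function call.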
Answer 3:
IIUC:
(df.set_index('Frequency')['Utterance'].str.lower()
   .str.split(expand=True)
   .stack()
   .reset_index(name='Words')
   .groupby('Words', as_index=False)['Frequency'].sum()
)
Output:
Words Frequency
0 direct 201
1 directions 2411
2 display 376
3 give 612
4 me 2065
5 navigate 678
6 navigation 376
7 show 754
8 starbucks 3666
9 to 3666
Answer 4:
My solution iterates over each word, so if you plan to look for many more words you should switch to an NLP library like spaCy or NLTK; these have features for counting word occurrences.
But here is my solution:
lst = ['Directions', 'Give', 'Show', 'Direct', 'Navigate', 'Display',
       'Starbucks', 'me', 'navigation', 'to']

for word in lst:
    A[word + '_score'] = A['Phrase'].str.contains(word).astype(int) * A['Frequency'].astype(int)

A.iloc[:, 2:].sum()
This results in
Directions_score 1045
Give_score 612
Show_score 754
Direct_score 1246
Navigate_score 678
Display_score 376
Starbucks_score 3666
me_score 2065
navigation_score 376
to_score 3666
dtype: int64
And you just need to sum over the columns to get the number of occurrences.
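Note that str.contains does substring matching, which is why Direct_score above is 1246 (201 + 1045): "Direct" also matches inside "Directions". If whole-word counts are wanted, a regex word boundary fixes it; a small sketch on two rows:

```python
import pandas as pd

A = pd.DataFrame({'Phrase': ['Directions to Starbucks',
                             'Direct me to Starbucks'],
                  'Frequency': [1045, 201]})

# \b anchors the pattern to word boundaries, so 'Direct' no longer matches 'Directions'
scores = {}
for word in ['Direct', 'Directions']:
    mask = A['Phrase'].str.contains(rf'\b{word}\b', regex=True)
    scores[word] = int((mask * A['Frequency']).sum())
print(scores)
```

With the boundary anchors, "Direct" scores only 201 instead of 1246.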
Source: https://stackoverflow.com/questions/49496102/mapping-between-words-and-a-group-tuple-to-get-frequency-of-words