Question
I have a dataframe that looks like the following
Utterance Frequency
Directions to Starbucks 1045
Show me directions to Starbucks 754
Give me directions to Starbucks 612
Navigate me to Starbucks 498
Display navigation to Starbucks 376
Direct me to Starbucks 201
Navigate to Starbucks 180
Here there is some data showing utterances made by people and how frequently each was said.
I.e., "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, etc.
I was able to get the desired output with the following:
df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
      )
print (df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
However, I'm also trying to look for the following output:
print (df)
Words Frequency
0 Directions 2411
1 Give_Show_Direct_Navigate 2245
2 Display 376
3 Starbucks 3666
4 me 2065
5 navigation 376
6 to 3666
I.e., I'm trying to figure out a way to combine certain phrases and get the frequency of those words. For example, if the speaker says "Seattles_Best" or "Tullys", then ideally I would add it to "Starbucks" and rename the group "coffee_shop" or something like that.
Thanks!!
Answer 1:
Here is one way, sticking with collections.Counter from your previous question. You can add any number of tuples to lst to append additional results for combinations of your choice.
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

c = Counter()
for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
                  .rename(columns={0: 'Count'})\
                  .sort_values('Count', ascending=False)

def add_combinations(df, lst):
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

lst = [('Give', 'Show', 'Navigate', 'Direct')]
res = add_combinations(res, lst)
Result
Count
to 3666
Starbucks 3666
Give_Show_Navigate_Direct 2245
me 2065
directions 1366
Directions 1045
Show 754
Navigate 678
Give 612
Display 376
navigation 376
Direct 201
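As noted above, extra tuples can be appended to lst. A minimal sketch of that, reusing the add_combinations helper with a trimmed-down counts frame; the "Seattles_Best"/"Tullys" tokens are the hypothetical examples from the question (absent from the counts, so only Starbucks contributes to that combination):

```python
import pandas as pd

def add_combinations(df, lst):
    # For each tuple of words, add a row whose Count is the sum of those words' counts
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

# Stand-in for the `res` frame built above, trimmed to a few words
res = pd.DataFrame({'Count': [612, 754, 678, 201, 3666]},
                   index=['Give', 'Show', 'Navigate', 'Direct', 'Starbucks'])

# Two combinations at once: the verbs, and the coffee shops
lst = [('Give', 'Show', 'Navigate', 'Direct'),
       ('Starbucks', 'Seattles_Best', 'Tullys')]
res = add_combinations(res, lst)
print(res)
```

Words missing from the index simply contribute nothing to the sum, so speculative tokens are safe to list.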
Answer 2:
Here's a solution which begins with your current results set and edits appropriately:
print (df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
First, create a dictionary which maps current words to your chosen new word:
phrase_map = {'Starbucks': 'coffee_shop',
              'Seattles_Best': 'coffee_shop',
              'Tullys': 'coffee_shop',
              'Give': 'Give_Show_Direct_Navigate',
              'Show': 'Give_Show_Direct_Navigate',
              'Direct': 'Give_Show_Direct_Navigate',
              'Navigate': 'Give_Show_Direct_Navigate'}
Then lookup each word, replacing with the new value if found, else keep the original value:
df['Words'] = df['Words'].apply(lambda x: phrase_map.get(x, x))
Then group:
df.groupby('Words').sum()
Results:
Frequency
Words
Directions 1045
Display 376
Give_Show_Direct_Navigate 2245
coffee_shop 3666
directions 1366
me 2065
navigation 376
to 3666
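The same lookup-with-fallback can be done without a lambda: pandas' Series.replace takes the dictionary directly and leaves unmapped values unchanged. A self-contained sketch (the frame is trimmed to a few words from the results above):

```python
import pandas as pd

phrase_map = {'Starbucks': 'coffee_shop',
              'Give': 'Give_Show_Direct_Navigate',
              'Show': 'Give_Show_Direct_Navigate',
              'Direct': 'Give_Show_Direct_Navigate',
              'Navigate': 'Give_Show_Direct_Navigate'}

df = pd.DataFrame({'Words': ['Direct', 'Directions', 'Give', 'Show',
                             'Navigate', 'Starbucks', 'to'],
                   'Frequency': [201, 1045, 612, 754, 678, 3666, 3666]})

# replace() swaps values found as keys in phrase_map; everything else passes through
out = (df.assign(Words=df['Words'].replace(phrase_map))
         .groupby('Words', as_index=False)['Frequency'].sum())
print(out)
```

Both approaches give identical results here; replace() just avoids the explicit per-element function call.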
Answer 3:
IIUC:
(df.set_index('Frequency')['Utterance'].str.lower()
   .str.split(expand=True)
   .stack()
   .reset_index(name='Words')
   .groupby('Words', as_index=False)['Frequency'].sum()
)
Output:
Words Frequency
0 direct 201
1 directions 2411
2 display 376
3 give 612
4 me 2065
5 navigate 678
6 navigation 376
7 show 754
8 starbucks 3666
9 to 3666
Answer 4:
My solution iterates over each word, so if you plan to look for many more words you should switch to an NLP library like spaCy or NLTK; these have features for counting word occurrences.
But here is my solution:
lst = ['Directions', 'Give', 'Show', 'Direct', 'Navigate', 'Display',
       'Starbucks', 'me', 'navigation', 'to']

for word in lst:
    A[word + '_score'] = A['Phrase'].str.contains(word).astype(int) * A['Frequency'].astype(int)

A.iloc[:, 2:].sum()
This results in
Directions_score 1045
Give_score 612
Show_score 754
Direct_score 1246
Navigate_score 678
Display_score 376
Starbucks_score 3666
me_score 2065
navigation_score 376
to_score 3666
dtype: int64
And you just need to sum over the columns to get the number of occurrences.
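Note that str.contains does substring matching, which is why Direct_score above is 1246 (201 + 1045): "Direct" also matches inside "Directions". If whole-word counts are wanted, a regex word boundary fixes it; a small sketch on two rows:

```python
import pandas as pd

A = pd.DataFrame({'Phrase': ['Directions to Starbucks',
                             'Direct me to Starbucks'],
                  'Frequency': [1045, 201]})

# \b anchors the pattern to word boundaries, so 'Direct' no longer matches 'Directions'
scores = {}
for word in ['Direct', 'Directions']:
    mask = A['Phrase'].str.contains(rf'\b{word}\b', regex=True)
    scores[word] = int((mask * A['Frequency']).sum())
print(scores)
```

With the boundary anchors, "Direct" scores only 201 instead of 1246.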
Source: https://stackoverflow.com/questions/49496102/mapping-between-words-and-a-group-tuple-to-get-frequency-of-words