All the other answers I could find specifically referred to aggregating across all of the nested lists within a list of lists, whereas I'm looking to aggregate separately for
IMO, this question is a good way to show the real power of pandas. Instead of counting the boring [a,a,b,b,b,c,c,c], [d,d,d,a,a,a,c,c,c], [c,c,c,a,a,f,f,f],
let's count the frequency of words in real books. I've chosen the following three: 'Faust', 'Hamlet', 'Macbeth'.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import defaultdict
import string
import requests
import pandas as pd
books = {
    'Faust': 'http://www.gutenberg.org/cache/epub/2229/pg2229.txt',
    'Hamlet': 'http://www.gutenberg.org/cache/epub/2265/pg2265.txt',
    'Macbeth': 'http://www.gutenberg.org/cache/epub/2264/pg2264.txt',
}
# prepare a translation table that replaces all punctuation and digits with spaces
chars2remove = list(string.punctuation + string.digits)
transl_tab = str.maketrans(dict(zip(chars2remove, list(' ' * len(chars2remove)))))
# replace 'carriage return' and 'new line' characters with spaces
transl_tab[10] = ' '
transl_tab[13] = ' '
def tokenize(s):
    return s.translate(transl_tab).lower().split()
def get_data(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        return r.text
    else:
        r.raise_for_status()
# download and tokenize each book once
d = defaultdict(list)
for name, url in books.items():
    d[name] = tokenize(get_data(url))
# generate DF containing words from books (reuses d instead of downloading again)
df = pd.concat([pd.DataFrame({'book': name, 'word': words})
                for name, words in d.items()], ignore_index=True)
# let's count the frequency
frequency = df.groupby(['book','word']) \
              .size() \
              .sort_values(ascending=False)
# output
print(frequency.head(30))
print('[Macbeth]: macbeth\t', frequency.loc['Macbeth', 'macbeth'])
print('[Hamlet]: nay\t', frequency.loc['Hamlet', 'nay'])
print('[Faust]: faust\t', frequency.loc['Faust', 'faust'])
Output:
book     word
Hamlet   the      1105
         and       919
Faust    und       918
Hamlet   to        760
Macbeth  the       759
Hamlet   of        698
Faust    ich       691
         die       668
         der       610
Macbeth  and       602
Hamlet   you       588
         i         560
         a         542
         my        506
Macbeth  to        460
Hamlet   it        439
Macbeth  of        426
Faust    nicht     426
Hamlet   in        409
Faust    das       403
         ein       399
         zu        380
Hamlet   that      379
Faust    in        365
         ist       363
Hamlet   is        346
Macbeth  i         344
Hamlet   ham       337
         this      328
         not       316
dtype: int64
[Macbeth]: macbeth 67
[Hamlet]: nay 27
[Faust]: faust 272
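The same groupby-and-size idea also works on the small lists from the question, without downloading anything. The following is only a sketch under a few assumptions: the letters are written as strings, and the names `toy_df`, `toy_counts`, `list_id` and `item` are made up for this illustration.

```python
import pandas as pd

# Toy input from the question; the letters are assumed to be strings.
master_list = [
    ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    ['d', 'd', 'd', 'a', 'a', 'a', 'c', 'c', 'c'],
    ['c', 'c', 'c', 'a', 'a', 'f', 'f', 'f'],
]

# One row per item, tagged with the index of the sublist it came from.
toy_df = pd.DataFrame(
    [(i, item) for i, sub in enumerate(master_list) for item in sub],
    columns=['list_id', 'item'],
)

# Count every item separately within each sublist.
toy_counts = toy_df.groupby(['list_id', 'item']).size()

print(toy_counts.loc[0, 'b'])        # count of 'b' in the first sublist
print(toy_counts.loc[1].to_dict())   # all counts for the second sublist
```

Because the sublist index is part of the group key, the counts of one sublist never mix with those of another, which is the "aggregate separately" behaviour the question asks for.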
You can create a list and append the counters to it. (Also, you are using Counter, but still doing the counts yourself, which is unnecessary.)
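from collections import Counter
# the items are written as strings here so the snippet runs as-is
master_list = [['a','a','b','b','b','c','c','c'], ['d','d','d','a','a','a','c','c','c'], ['c','c','c','a','a','f','f','f']]
counters = []
for list_ in master_list:
    counters.append(Counter(list_))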
Now you can address each separate list with counters[i].
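As a rough illustration of what addressing each list looks like in practice, here is a self-contained sketch. It assumes the items are plain strings and builds the list of counters with a list comprehension, which is just a more compact form of the append loop.

```python
from collections import Counter

# Toy input from the question; the letters are assumed to be strings.
master_list = [
    ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    ['d', 'd', 'd', 'a', 'a', 'a', 'c', 'c', 'c'],
    ['c', 'c', 'c', 'a', 'a', 'f', 'f', 'f'],
]

# One Counter per sublist, in the same order as master_list.
counters = [Counter(list_) for list_ in master_list]

print(counters[0]['b'])    # count of 'b' in the first sublist
print(counters[0]['z'])    # an item that never occurs counts as 0, no KeyError
print(dict(counters[1]))   # all counts for the second sublist as a plain dict
```

Each Counter only ever sees its own sublist, so the counts stay separate per list.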