Creating a separate Counter() object and Pandas DataFrame for each list within a list of lists

再見小時候 2021-01-28 08:24

All the other answers I could find specifically referred to aggregating across all of the nested lists within a list of lists, whereas I'm looking to aggregate separately for

2 answers
  • 2021-01-28 09:12

    IMO, this question is a good showcase for the real power of pandas. Instead of counting the boring lists [a,a,b,b,b,c,c,c], [d,d,d,a,a,a,c,c,c], [c,c,c,a,a,f,f,f], let's count the frequency of words in real books. I've chosen the following three: 'Faust', 'Hamlet', 'Macbeth'.

    Code:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    from collections import defaultdict
    import string
    import requests
    import pandas as pd
    
    books = {
      'Faust': 'http://www.gutenberg.org/cache/epub/2229/pg2229.txt',
      'Hamlet': 'http://www.gutenberg.org/cache/epub/2265/pg2265.txt',
      'Macbeth': 'http://www.gutenberg.org/cache/epub/2264/pg2264.txt',
    }
    
    # prepare a translation table that replaces every punctuation character and digit with a space
    chars2remove = list(string.punctuation + string.digits)
    transl_tab = str.maketrans(dict(zip(chars2remove, list(' ' * len(chars2remove)))))
    # replace 'carriage return' and 'new line' characters with spaces
    transl_tab[10] = ' '
    transl_tab[13] = ' '
    
    
    def tokenize(s):
        return s.translate(transl_tab).lower().split()
    
    def get_data(url):
        r = requests.get(url)
        # raise an HTTPError on 4xx/5xx responses; otherwise return the body text
        r.raise_for_status()
        return r.text
    
    # download and tokenize each book exactly once
    d = defaultdict(list)
    for name, url in books.items():
        d[name] = tokenize(get_data(url))
    
    # generate DF containing words from books, reusing the tokens stored in d
    df = pd.concat([pd.DataFrame({'book': name, 'word': words})
                    for name, words in d.items()], ignore_index=True)
    
    # let's count the frequency
    frequency = df.groupby(['book','word']) \
                  .size() \
                  .sort_values(ascending=False)
    
    # output
    print(frequency.head(30))
    print('[Macbeth]: macbeth\t', frequency.loc['Macbeth', 'macbeth'])
    print('[Hamlet]: nay\t', frequency.loc['Hamlet', 'nay'])
    print('[Faust]: faust\t', frequency.loc['Faust', 'faust'])
    

    Output:

    book     word
    Hamlet   the      1105
             and       919
    Faust    und       918
    Hamlet   to        760
    Macbeth  the       759
    Hamlet   of        698
    Faust    ich       691
             die       668
             der       610
    Macbeth  and       602
    Hamlet   you       588
             i         560
             a         542
             my        506
    Macbeth  to        460
    Hamlet   it        439
    Macbeth  of        426
    Faust    nicht     426
    Hamlet   in        409
    Faust    das       403
             ein       399
             zu        380
    Hamlet   that      379
    Faust    in        365
             ist       363
    Hamlet   is        346
    Macbeth  i         344
    Hamlet   ham       337
             this      328
             not       316
    dtype: int64
    
    [Macbeth]: macbeth      67
    [Hamlet]: nay    27
    [Faust]: faust   272
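
    The same groupby pattern can be applied to the small lists from the question without any downloads. The following is only a sketch: it assumes the list items are strings, and the names `df_lists`, `list_id`, `item`, `per_list` and `table` are made up for illustration.

```python
import pandas as pd

# The three example lists from the question, assuming the items are strings.
master_list = [
    ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    ['d', 'd', 'd', 'a', 'a', 'a', 'c', 'c', 'c'],
    ['c', 'c', 'c', 'a', 'a', 'f', 'f', 'f'],
]

# One long frame: a row per item, tagged with the index of its source list.
df_lists = pd.concat(
    [pd.DataFrame({'list_id': i, 'item': items})
     for i, items in enumerate(master_list)],
    ignore_index=True,
)

# Count items separately within each list (not across all lists).
per_list = df_lists.groupby(['list_id', 'item']).size()

# Optional wide view: one row per list, one column per item, 0 where absent.
table = per_list.unstack(fill_value=0)

print(per_list)
print(table)
```

    Here `per_list.loc[(0, 'b')]` gives the count of 'b' in the first list only, and `table.loc[i]` gives all counts for list `i`.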
    
  • 2021-01-28 09:23

    You can create a list and append the counters to it. (Also, you are using Counter, but still doing the counts yourself, which is unnecessary.)

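    from collections import Counter
    
    # string items stand in for the question's a, b, c, ... placeholders
    master_list = [['a','a','b','b','b','c','c','c'], ['d','d','d','a','a','a','c','c','c'], ['c','c','c','a','a','f','f','f']]
    counters = []
    for list_ in master_list:
        counters.append(Counter(list_))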
    

    Now you can address each separate list with counters[i].
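
    Since the question also asks for a separate DataFrame per list, each Counter can be turned into its own small frame. This is a sketch, not part of the original answer: it assumes string items, and the `frames` name and the `item` / `count` column names are chosen here for illustration.

```python
from collections import Counter

import pandas as pd

# The three example lists from the question, assuming the items are strings.
master_list = [
    ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    ['d', 'd', 'd', 'a', 'a', 'a', 'c', 'c', 'c'],
    ['c', 'c', 'c', 'a', 'a', 'f', 'f', 'f'],
]

# One Counter per nested list.
counters = [Counter(list_) for list_ in master_list]

# One DataFrame per Counter, sorted by item so the row order is predictable.
frames = [
    pd.DataFrame(sorted(c.items()), columns=['item', 'count'])
    for c in counters
]

for i, frame in enumerate(frames):
    print(f'list {i}')
    print(frame)
```

    `frames[i]` then lines up with `counters[i]` and `master_list[i]`.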
