Pandas sum of all word counts in column

自闭症患者 2021-01-23 21:03

I have a pandas column that contains strings. I want to get a word count of all of the words in the entire column. What's the best way of doing that without looping through each…

4 Answers
  • 2021-01-23 21:19

    You could use the vectorized string operations:

    In [7]: df["a"].str.split().str.len().sum()
    Out[7]: 6
    

    which is built up from these intermediate steps:

    In [8]: df["a"].str.split()
    Out[8]: 
    0          [some, words]
    1    [lots, more, words]
    2                   [hi]
    Name: a, dtype: object
    
    In [9]: df["a"].str.split().str.len()
    Out[9]: 
    0    2
    1    3
    2    1
    Name: a, dtype: int64
    
    In [10]: df["a"].str.split().str.len().sum()
    Out[10]: 6
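
For anyone who wants to run this outside an interactive session, here is a minimal sketch. The frame is an assumption reconstructed from the output shown above (the original `df` was never posted), and `df_with_gap` is a hypothetical extra case showing how a missing value behaves.

```python
import pandas as pd

# Assumed data, reconstructed from the Out[8] listing above.
df = pd.DataFrame({"a": ["some words", "lots more words", "hi"]})

# split() with no arguments splits on any run of whitespace,
# so each row becomes a list of words; str.len() measures each list.
words_per_row = df["a"].str.split().str.len()
total_words = int(words_per_row.sum())

# Hypothetical case with a missing value: the .str methods propagate
# the gap as NaN, and sum() skips NaN by default.
df_with_gap = pd.DataFrame({"a": ["some words", None, "hi"]})
total_with_gap = int(df_with_gap["a"].str.split().str.len().sum())

print(total_words, total_with_gap)
```

Because `split()` is called without a separator, repeated spaces, tabs and leading or trailing whitespace do not inflate the count.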
    
  • 2021-01-23 21:28

    Another option is the cat string method: join all of the strings in the column into one, then split the result and count the pieces.

    len(df["a"].str.cat(sep=' ').split())
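
A small sketch of the same idea on made-up data, in case the one-liner is too terse. The column contents here are hypothetical; the `None` row is included only to show that `str.cat` with just `sep` leaves missing values out, since `na_rep` defaults to `None`.

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"a": ["some words", "lots more words", None, "hi"]})

# With only `sep` given, str.cat joins the whole column into a single
# Python string; rows with missing values are omitted from the result.
joined = df["a"].str.cat(sep=" ")

# From here on it is plain Python: split on whitespace and count.
total_words = len(joined.split())

print(joined)
print(total_words)
```

One trade-off worth keeping in mind: this builds a single string holding the entire column, so memory use grows with the total text size.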
    

    A more elaborate test data set:

    import numpy as np
    import pandas as pd

    li = [
        'Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur',
        'adipiscing', 'elit', 'Integer', 'et', 'tincidunt', 'nisl',
        'Sed', 'pretium', 'arcu', 'nec', 'est', 'hendrerit',
        'vestibulum', 'Curabitur', 'a', 'nibh', 'justo', 'Praesent',
        'non', 'pellentesque', 'enim', 'ac', 'nulla', 'ut', 'mi',
        'diam', 'Aenean', 'placerat', 'ante', 'euismod', 'pulvinar',
        'augue', 'purus', 'ornare', 'erat', 'pharetra', 'mauris',
        'sapien', 'vitae', 'In', 'id', 'velit', 'quis', 'mattis',
        'condimentum', 'Cras', 'congue', 'neque', 'faucibus', 'nisi',
        'tempor', 'eget', 'Etiam', 'semper', 'Nulla', 'elementum',
        'magna', 'Donec', 'vel', 'ex', 'dictum', 'Aliquam', 'lobortis',
        'rutrum', 'ligula', 'Vivamus', 'eu', 'eros', 'Morbi', 'blandit',
        'rhoncus', 'consequat', 'orci', 'convallis', 'finibus', 'lorem',
        'urna', 'molestie', 'in', 'sed', 'luctus', 'Ut', 'imperdiet',
        'felis', 'Mauris', 'nunc', 'malesuada', 'lacinia', 'Vestibulum',
        'bibendum', 'risus', 'tortor', 'sollicitudin', 'aliquam',
        'primis', 'ultrices', 'posuere', 'cubilia', 'Curae',
        'Phasellus', 'turpis', 'auctor', 'venenatis', 'Pellentesque',
        'fermentum', 'accumsan', 'maximus', 'Fusce', 'ultricies',
        'tristique', 'sodales', 'suscipit', 'sagittis', 'at', 'cursus',
        'Nullam', 'dui', 'fringilla', 'mollis', 'Orci', 'varius',
        'natoque', 'penatibus', 'magnis', 'dis', 'parturient', 'montes',
        'nascetur', 'ridiculus', 'mus', 'facilisi', 'sem', 'viverra',
        'feugiat', 'aliquet', 'lectus', 'porta', 'Nunc', 'facilisis',
        'Duis', 'volutpat', 'scelerisque', 'Maecenas', 'tempus',
        'massa', 'laoreet', 'gravida', 'odio', 'iaculis', 'libero',
        'eleifend', 'leo', 'Quisque', 'ullamcorper', 'dignissim',
        'interdum', 'vulputate', 'lacus', 'vehicula', 'Nam', 'commodo',
        'dapibus', 'efficitur', 'tellus', 'Suspendisse', 'metus',
        'Proin', 'quam', 'porttitor', 'egestas'
    ]
    
    # Build 10,000 rows, each a random sentence of 5 to 9 words drawn from li.
    df = pd.DataFrame(
        dict(a=[' '.join(
                np.random.choice(li, np.random.randint(5, 10))
        ) for _ in range(10000)]))
    


  • 2021-01-23 21:29

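    The number of words can be obtained by counting the spaces in each string and adding 1, then calling sum(). This assumes the words in every row are separated by exactly one space, with no leading or trailing spaces.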

    (df.a.str.count(' ')+1).sum()
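
Counting spaces only matches the real word count when the text is cleanly spaced. The sketch below uses hypothetical rows with a double space and padding to show the difference, plus one possible normalisation step (an assumption on my part, not something from the answer above).

```python
import pandas as pd

# Hypothetical rows: clean, double-spaced, and padded with spaces.
df = pd.DataFrame({"a": ["some words", "two  spaces", " padded "]})

# Space counting: every space adds one to the total, wanted or not.
by_spaces = int((df["a"].str.count(" ") + 1).sum())

# Whitespace splitting for comparison.
by_split = int(df["a"].str.split().str.len().sum())

# One way to make space counting safe: strip the ends and collapse
# runs of whitespace to a single space before counting.
cleaned = df["a"].str.strip().str.replace(r"\s+", " ", regex=True)
by_spaces_cleaned = int((cleaned.str.count(" ") + 1).sum())

print(by_spaces, by_split, by_spaces_cleaned)
```

Even after cleaning, an empty string would still be counted as one word by this method, so it is best kept for data that is known to be tidy.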
    
  • 2021-01-23 21:34

    df.a.str.extractall(r'(\w+)').count()[0]
    

    This extracts all words (every match of the regex (\w+)) in each cell of a and puts them in a new frame that looks something like:

                 0
      match       
    0 0       some
      1      words
    1 0       lots
      1       more
      2      words
    2 0         hi 
    

    You can then just do a count on the rows to get the number of words.

    Note that you can always change the regex if you want. For example, if some words might contain punctuation characters, you can define a word as any run of non-whitespace characters and use this instead:

    df.a.str.extractall(r'(\S+)').count()[0]
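
To make the difference between the two patterns concrete, here is a short sketch on hypothetical strings containing an apostrophe and a hyphen. The last line uses `str.count` with the same pattern, which is a swapped-in alternative that avoids building the intermediate frame; it is not part of the answer above.

```python
import pandas as pd

# Hypothetical rows with punctuation inside words.
df = pd.DataFrame({"a": ["don't stop", "e-mail me"]})

# \w+ breaks words at punctuation: don / t / stop and e / mail / me.
word_chars = int(df["a"].str.extractall(r"(\w+)").count()[0])

# \S+ treats any run of non-whitespace as one word.
non_space = int(df["a"].str.extractall(r"(\S+)").count()[0])

# Alternative: count the regex matches per row and sum them.
non_space_via_count = int(df["a"].str.count(r"\S+").sum())

print(word_chars, non_space, non_space_via_count)
```

Which pattern is right depends on what should count as a word in your data, so it is worth checking a few rows by hand first.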

    EDIT

    If you care about speed at all, use DSM's solution (the split/len/sum approach) instead.

    A basic timing test using IPython's %timeit:

    %timeit df.a.str.extractall('(\S+)').count()[0] 
    1000 loops, best of 3: 1.28 ms per loop
    
    %timeit df["a"].str.split().str.len().sum()
    1000 loops, best of 3: 447 µs per loop
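
If you are not in IPython, a rough equivalent with the standard `timeit` module could look like the sketch below. The vocabulary, row count, seed and the `candidates` dictionary are all assumptions made for illustration. The numbers will differ from those quoted above and vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 1,000 rows of 5 to 9 words from a tiny vocabulary.
rng = np.random.default_rng(0)
vocab = ["lorem", "ipsum", "dolor", "sit", "amet"]
df = pd.DataFrame(
    {"a": [" ".join(rng.choice(vocab, size=int(rng.integers(5, 10))))
           for _ in range(1000)]}
)

# The four approaches from this thread, wrapped as zero-argument callables.
candidates = {
    "split_len_sum": lambda: df["a"].str.split().str.len().sum(),
    "cat_split": lambda: len(df["a"].str.cat(sep=" ").split()),
    "count_spaces": lambda: (df["a"].str.count(" ") + 1).sum(),
    "extractall": lambda: df["a"].str.extractall(r"(\S+)").count()[0],
}

# Check that they agree on this clean data before comparing speed.
results = {name: int(fn()) for name, fn in candidates.items()}

# Best-of-3 average seconds per call, 5 calls per repeat.
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
    for name, fn in candidates.items()
}

for name in candidates:
    print(f"{name:15s} words={results[name]} time={timings[name] * 1e3:.3f} ms")
```

The agreement check only holds because the generated rows are single-spaced with no missing values; on messier data the approaches can legitimately return different totals.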
    