Counting particular occurrences in a CSV file with Python

南方客 2021-01-13 21:12

I have a csv file with 4 columns {Tag, User, Quality, Cluster_id}. Using python I would like to do the following: For every cluster_id (from 1 to 500), I want to see for eac

2 answers
  • 2021-01-13 21:22

    Since someone's already posted a defaultdict solution, I'm going to give a pandas one, just for variety. pandas is a very handy library for data processing. Among other nice features, it can handle this counting problem in one line, depending on what kind of output is required. Really:

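    import pandas as pd

    df = pd.read_csv("cluster.csv")
    # Number of rows for each (Cluster_id, User, Quality) combination
    counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
    # Write the counts (not the original frame) to disk
    counted.to_csv("counted.csv")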
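
To make the shape of that result concrete, here is a minimal sketch on a few made-up rows (the sample data is hypothetical, not taken from the real file). It assumes the same four column names as in the question and shows that `size()` gives a Series with a three-level index, which `reset_index(name="count")` flattens into an ordinary table.

```python
import pandas as pd

# Hypothetical sample rows with the same four columns as the question's file.
df = pd.DataFrame(
    {
        "Tag": ["ground", "ground", "ground", "xxx"],
        "User": ["u005", "u005", "u004", "u003"],
        "Quality": ["bad", "bad", "good", "bad"],
        "Cluster_id": [1, 1, 1, 1],
    }
)

# size() returns a Series indexed by (Cluster_id, User, Quality).
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()

# reset_index(name=...) turns that Series into a flat table with a named
# count column, which is often easier to read back from a csv file.
flat = counted.reset_index(name="count")
print(flat)
```

Whether the flat table or the indexed Series is the better thing to save depends on what will read the output afterwards.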
    

    --

    Just to give a trailer for what pandas makes easy, we can load the file -- the main data storage object in pandas is called a "DataFrame":

    >>> import pandas as pd
    >>> df = pd.read_csv("cluster.csv")
    >>> df
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 500000 entries, 0 to 499999
    Data columns:
    Tag           500000  non-null values
    User          500000  non-null values
    Quality       500000  non-null values
    Cluster_id    500000  non-null values
    dtypes: int64(1), object(3)
    

    We can check that the first few rows look okay:

    >>> df[:5]
       Tag  User Quality  Cluster_id
    0  bbb  u001     bad          39
    1  bbb  u002     bad          36
    2  bag  u003    good          11
    3  bag  u004    good           9
    4  bag  u005     bad          26
    

    and then we can group by Cluster_id and User, and do work on each group:

    >>> for name, group in df.groupby(["Cluster_id", "User"]):
    ...     print('group name:', name)
    ...     print('group rows:')
    ...     print(group)
    ...     print('counts of Quality values:')
    ...     print(group["Quality"].value_counts())
    ...     input()  # pause until Enter is pressed
    ...     
    group name: (1, 'u003')
    group rows:
            Tag  User Quality  Cluster_id
    372002  xxx  u003     bad           1
    counts of Quality values:
    bad    1
    
    group name: (1, 'u004')
    group rows:
               Tag  User Quality  Cluster_id
    126003  ground  u004     bad           1
    348003  ground  u004    good           1
    counts of Quality values:
    good    1
    bad     1
    
    group name: (1, 'u005')
    group rows:
               Tag  User Quality  Cluster_id
    42004   ground  u005     bad           1
    258004  ground  u005     bad           1
    390004  ground  u005     bad           1
    counts of Quality values:
    bad    3
    [etc.]
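
If a single table is more convenient than stepping through the groups one at a time, the same per-group counts can probably be collected in one go. The sketch below is one possible way to do it, not part of the original answer; the sample rows are hypothetical and only mirror the groups printed in the session above.

```python
import pandas as pd

# Hypothetical sample rows mirroring the groups printed in the session above.
df = pd.DataFrame(
    {
        "Tag": ["xxx", "ground", "ground", "ground", "ground", "ground"],
        "User": ["u003", "u004", "u004", "u005", "u005", "u005"],
        "Quality": ["bad", "bad", "good", "bad", "bad", "bad"],
        "Cluster_id": [1, 1, 1, 1, 1, 1],
    }
)

# Count Quality values within each (Cluster_id, User) group, then pivot the
# Quality level into columns; combinations that never occur are filled with 0.
table = (
    df.groupby(["Cluster_id", "User"])["Quality"]
    .value_counts()
    .unstack(fill_value=0)
)
print(table)
```

Each row of `table` is one (Cluster_id, User) pair, with one column per Quality value.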
    

    If you're going to be doing a lot of processing of csv files, it's definitely worth having a look at.

  • 2021-01-13 21:45

    collections.defaultdict should be a great help here:

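    # WARNING: Untested
    import csv
    from collections import defaultdict

    # Nested mapping: data[cluster][user] -> {"good": n, "bad": m}
    data = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    # Open your csv file (the file name and the header row are assumptions)
    with open("cluster.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for tag, user, quality, cluster in reader:
            # Assumes Quality holds the literal strings "good" and "bad"
            if quality == "good":
                data[cluster][user]["good"] += 1
            else:
                data[cluster][user]["bad"] += 1

    # Iterate over key/value pairs with items(); enumerate() would only
    # yield positions and keys, not the nested dictionaries.
    for cluster, users in data.items():
        print("Cluster:", cluster)
        for user, quality_metrics in users.items():
            print("User:", user)
            print(dict(quality_metrics))
            print()  # A blank line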
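
As a variation on the same idea, a flat `collections.Counter` keyed by `(cluster, user, quality)` tuples avoids the nesting altogether. This is only a sketch under stated assumptions: the in-memory sample below is hypothetical and stands in for the real file, and the header names are taken from the question.

```python
import csv
import io
from collections import Counter

# Hypothetical in-memory sample standing in for the real csv file.
sample = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "xxx,u003,bad,1\n"
    "ground,u004,bad,1\n"
    "ground,u004,good,1\n"
    "ground,u005,bad,1\n"
    "ground,u005,bad,1\n"
)

# One flat Counter keyed by (cluster, user, quality) replaces the nested dicts.
# DictReader uses the header row for field names, so no manual skipping is needed.
counts = Counter(
    (row["Cluster_id"], row["User"], row["Quality"])
    for row in csv.DictReader(sample)
)

for (cluster, user, quality), n in sorted(counts.items()):
    print(cluster, user, quality, n)
```

Note that the csv module yields strings, so the cluster ids here are `"1"` rather than `1`; convert with `int()` if numeric ordering matters.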
    