Sort CSV using a key computed from two columns, grab first n largest values

被撕碎了的回忆 2021-01-17 03:11

Python amateur here... let's say I have a snippet of an example CSV file:

Country, Year, GDP, Population
Country1         


        
3 answers
  • 2021-01-17 03:31

    This approach lets you get the top 10 rows for each year in a single scan of the file...

    It is possible to do this without pandas by using the heapq module. The following is untested, but it should give you a base to adapt for your purposes, with reference to the appropriate documentation:

    import csv
    import heapq
    from itertools import islice
    
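    freqs = {}  # year -> heap holding that year's 10 best rows
    with open('yourfile', newline='') as fin:
        csvin = csv.reader(fin)
        # Prepend GDP per capita to each row so the heap orders rows by it
        rows_with_gdp = ([float(row[2]) / float(row[3])] + row for row in islice(csvin, 1, None) if row[2] and row[3])
        for row in rows_with_gdp:
            cnt = freqs.setdefault(row[2], [[]] * 10) # 2 = year, 10 = num to keep
            # Push the new row, then drop the smallest entry in the heap
            heapq.heappushpop(cnt, row)
    
    for year, vals in freqs.items():
        # filter(None, ...) drops the empty placeholders left in the heap
        print(year, [row[1:] for row in sorted(filter(None, vals), reverse=True)])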
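
    As a hedged, self-contained sketch of the same heap-per-year idea for Python 3: the function name `top_n_per_year` and the sample rows are made up for illustration, and the column order (country, year, GDP, population) is assumed from the question's header.

```python
import csv
import heapq
import io
from collections import defaultdict


def top_n_per_year(lines, n=10):
    """Return {year: [(gdp_per_capita, country), ...]}, largest first."""
    heaps = defaultdict(list)  # year -> min-heap of (gdp_per_capita, country)
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) != 4 or not row[2] or not row[3]:
            continue  # skip blank, short, or incomplete rows
        country, year, gdp, population = row
        entry = (float(gdp) / float(population), country)
        heap = heaps[year]
        if len(heap) < n:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry, then drop the smallest of the n + 1
            heapq.heappushpop(heap, entry)
    return {year: sorted(heap, reverse=True) for year, heap in heaps.items()}


# Made-up sample data in the question's column order
sample = io.StringIO(
    "Country, Year, GDP, Population\n"
    "A,2004,100,10\n"
    "B,2004,300,10\n"
    "C,2004,200,10\n"
    "D,2005,50,10\n"
)
result = top_n_per_year(sample, n=2)
```

    Each heap never grows beyond n entries, so memory stays bounded by n times the number of distinct years, however large the file is.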
    
  • 2021-01-17 03:34

    Use the optional key argument to the sort function:

    array.sort(key=lambda x: x[2])
    

    will sort array using the third element of each item as the key. The value of the key argument should be a function (here a lambda expression) that takes a single argument (an arbitrary element of the array being sorted) and returns the key for sorting.

    For your GDP example, the lambda function to use would be:

    lambda x: float(x[2])/float(x[3]) # x[2] is GDP, x[3] is population
    

    The float function converts the CSV fields from strings into floating-point numbers. Since there is no guarantee that this will succeed (improper formatting, bad data, etc.), I'd typically do the conversion before sorting, when inserting rows into the array. You should use floating-point division here explicitly, because in Python 2 integer division with / truncates and won't give you the results you expect. If you find yourself doing this often, changing the behavior of the division operator is an option (http://www.python.org/dev/peps/pep-0238/ and related links); in Python 3, / always performs true division.
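
    A minimal sketch of that order of operations, assuming Python 3: convert and validate each row first, then sort with a key function and slice off the first n. The helper name `parse_row` and the malformed `Country5` row are hypothetical; the other values come from the sample data used elsewhere in this thread.

```python
def parse_row(fields):
    """Convert [country, year, gdp, population] strings; return None if bad."""
    try:
        country, year, gdp, population = fields
        return country, int(year), float(gdp), float(population)
    except ValueError:
        return None  # wrong number of fields or a non-numeric value


raw_rows = [
    ["Country1", "2002", "44545", "24352"],
    ["Country2", "2004", "14325", "75677"],
    ["Country3", "2004", "23132412", "1345234"],
    ["Country4", "2004", "2312421", "12412"],
    ["Country5", "2004", "n/a", "1000"],  # hypothetical malformed row
]

# Keep only rows that parsed cleanly and have a positive population
rows = [r for r in map(parse_row, raw_rows) if r is not None and r[3] > 0]

# Sort by GDP per capita, largest first, then take the first n rows
rows.sort(key=lambda r: r[2] / r[3], reverse=True)
top_two = rows[:2]
```

    Doing the conversion up front keeps the key function trivial and means a bad value is reported (or skipped) once, rather than raising in the middle of the sort.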

  • 2021-01-17 03:45

    The relevant modules would be:

    • csv for parsing the input
    • collections.namedtuple to name the fields
    • the filter() function to extract the specified year range
    • heapq.nlargest() to find the largest values
    • pprint.pprint() for nice output

    Here's a little bit to get you started (I would do it all, but what is the fun in having someone write your whole program and deprive you of the joy of finishing it?):

    from __future__ import division
    import csv, collections, heapq, pprint
    
    filecontents = '''\
    Country, Year, GDP, Population
    Country1,2002,44545,24352
    Country2,2004,14325,75677
    Country3,2004,23132412,1345234
    Country4,2004,2312421,12412
    '''
    
    CountryStats = collections.namedtuple('CountryStats', ['country', 'year', 'gdp', 'population'])
    dialect = csv.Sniffer().sniff(filecontents)
    
    data = []
    for country, year, gdp, pop in csv.reader(filecontents.splitlines()[1:], dialect):
        row = CountryStats(country, int(year), int(gdp), int(pop))
        if row.year == 2004:
            data.append(row)
    
    data.sort(key = lambda s: s.gdp / s.population)
    pprint.pprint(data)
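
    One possible way to finish this off, offered as a sketch rather than the answerer's own solution, is to combine filter() and heapq.nlargest() from the module list above. The year range 2003 to 2004 and n = 2 are arbitrary values chosen for illustration.

```python
import collections
import csv
import heapq

CountryStats = collections.namedtuple(
    "CountryStats", ["country", "year", "gdp", "population"]
)

# Data rows from the sample above, header already removed
lines = [
    "Country1,2002,44545,24352",
    "Country2,2004,14325,75677",
    "Country3,2004,23132412,1345234",
    "Country4,2004,2312421,12412",
]

stats = [
    CountryStats(country, int(year), int(gdp), int(pop))
    for country, year, gdp, pop in csv.reader(lines)
]

# Keep only the years of interest (hypothetical range)
in_range = filter(lambda s: 2003 <= s.year <= 2004, stats)

# nlargest returns the n biggest items by key, already in descending order
top = heapq.nlargest(2, in_range, key=lambda s: s.gdp / s.population)
```

    Compared with sorting the whole list, nlargest only has to track n candidates at a time, which helps when n is small relative to the number of rows.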
    