Python: how to do basic data manipulation like in R?

前端 未结 4 1110
半阙折子戏
半阙折子戏 2021-01-31 21:22

I have been working with R for several years. R is very strong in data manipulation. I\'m learning python and I would like to know how to manipulate data using python. Basically

相关标签:
4条回答
  • 2021-01-31 21:29

    I disagree with Cpfohl's comment - perhaps because I've been through this same transition myself, and it's not obvious how a naive user would be able to formulate the problem more precisely. It is actually an active development problem right now with a number of projects that have all come up with non-overlapping functionality (e.g. in the financial timeseries world, in the brain imaging world, etc.).

    The short answer is that python's various libraries for dealing with tables and csv files are not as good for a beginner as those in R, which are the end result of many years of users of varying levels.

    First, have a look at recarrays in numpy. This is probably the closest data structure that is in a commonly used library that is similar to a data.frame in R. In particular, you'll probably like the numpy.recfromcsv function, though it is not as robust as e.g. read.csv in R (it will have trouble with non-standard line-endings, for example).

    Subsetting a recarray is easy (though creating one can seem clunky):

    import numpy as np
    mydata = np.array([(1.0, 2), (3.0, 4)], dtype=[('x', float), ('y', int)])
    mydata = mydata.view(np.recarray)
    mydata[mydata.x > 2]
    

    Modifying the nature of a numpy array is not generally as easy as in R, but there is a nice library of functions in numpy.lib.recfunctions (which must be imported separately - it doesn't come along with a simple import numpy). In particular, check out rec_append_fields and rec_join for adding columns.

    Numpy has a function numpy.savetxt that will accept a simple delimiter argument to make a csv file, but it will not print column names sadly (at least, I don't see that it does). So, while I discourage adding unnecessary libraries (since it gives less portable code), you might just use matplotlib.mlab.rec2csv (you'll find some other similar functions in that neighborhood as well - the numpy community is trying to port generally useful numeric / data manip code to numpy proper. Who knows, maybe you'll do this?).

    You'll notice I didn't answer (4), because that doesn't make sense. Tables don't transpose in python or R. Arrays or matrices do. So, convert your data to an array with a uniform dtype, then just use myarray.T.

    Other tools you might look at are pytables (and the related package carray), larry, datarray, pandas and tabular. In particular, datarray is looking to create a system for labelled data arrays which would serve as a foundation to other projects (and I think has developers from the larry and pandas projects as well).

    Hope that helps! Dav

    0 讨论(0)
  • 2021-01-31 21:32

    Simple answer: use pandas

    1

    In [2]: df = read_csv('foo.csv', index_col=None)
    
    In [3]: df
    Out[3]: 
       var1  var2  var3
    0  1     2     3   
    1  4     5     6   
    2  7     8     9   
    

    2

    In [4]: df[df['var2'].isin([5, 8])]
    Out[4]: 
       var1  var2  var3
    1  4     5     6   
    2  7     8     9   
    

    3

    In [5]: df['var4'] = df['var3'] * 2
    
    In [6]: df
    Out[6]: 
       var1  var2  var3  var4
    0  1     2     3     6   
    1  4     5     6     12  
    2  7     8     9     18  
    

    4

    In [7]: df.T
    Out[7]: 
          0  1   2 
    var1  1  4   7 
    var2  2  5   8 
    var3  3  6   9 
    var4  6  12  18
    

    6

    In [8]: df.to_csv('foo2.csv')
    
    In [9]: !cat foo2.csv
    index,var1,var2,var3,var4
    0,1,2,3,6
    1,4,5,6,12
    2,7,8,9,18
    
    0 讨论(0)
  • 2021-01-31 21:52
    import csv
    from itertools import izip
    
    with open('source.csv') as f:
        reader = csv.reader(f)
        # filter data
        data = (row for row in reader if row[1].strip() in ('5', '8'))
        # make a new variable
        data = (row + [int(row[2]) * 3] for row in data)
        # transpose data
        data = izip(*data)
        # write data to a new csv file
        with open('destination.csv', 'w') as fw:
            csv.writer(fw).writerows(data)
    
    0 讨论(0)
  • 2021-01-31 21:53

    There's a module for CSV parsing in the standard library. To get a list of row each containing a list of the cells, you can use list(csv.reader(...)).

    Step 2 and 3 can be written in one list comprehension: [(var1, var2, var3, var3 * 3) for var1, var2, var3 in data if var2 in ('5', '8')].

    I'm not aware of anything in the standard library for transposing lists of lists. Perhaps NumPy or SciPy has something. A quick and dirty way would be zipping the rows (zip(*lists)).

    Writing back to file should be as simple as constructing as csv.writer and passing it each row in a loop.

    0 讨论(0)
提交回复
热议问题