reading csv files in scipy/numpy in Python

轻奢々 2020-12-16 08:23

I am having trouble reading a CSV file, delimited by tabs, in Python. I use the following function:

def csv2array(filename, skiprows=0, delimiter='\t', ra


        
5 Answers
  • 2020-12-16 08:27

    Check out the Python csv module: http://docs.python.org/library/csv.html

    import csv
    reader = csv.reader(open("myfile.csv", "rb"), 
                        delimiter='\t', quoting=csv.QUOTE_NONE)
    
    header = []
    records = []
    fields = 16
    
    if thereIsAHeader: header = reader.next()  # thereIsAHeader is a flag you set for your own file
    
    for row, record in enumerate(reader):
        if len(record) != fields:
            print "Skipping malformed record %i, contains %i fields (%i expected)" %
                (record, len(record), fields)
        else:
            records.append(record)
    
    # do numpy stuff.
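    Once the well-formed records are collected, the "numpy stuff" could be as simple as the sketch below (one assumption on my part: every field in a good record is numeric):

    import numpy as np

    # records is the list of well-formed rows collected above;
    # dtype=float assumes every field is numeric -- drop it to keep the strings
    data = np.array(records, dtype=float)
    print data.shape   # (number of good records, 16)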
    
  • 2020-12-16 08:33

    May I ask why you're not using the built-in csv reader? http://docs.python.org/library/csv.html

    I've used it very effectively with numpy/scipy. I would share my code, but unfortunately it's owned by my employer; it should be very straightforward to write your own, though, along the lines of the sketch below.
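    A bare-bones version might look like this (my assumptions: a tab-delimited file called myfile.csv with purely numeric columns and no header):

    import csv
    import numpy as np

    # read every record, convert each field to float, and stack into an array
    with open('myfile.csv') as fh:
        rows = [[float(x) for x in record]
                for record in csv.reader(fh, delimiter='\t')]
    data = np.array(rows)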

  • 2020-12-16 08:33

    I think Nick T's approach would be the better way to go, with one change: I would replace the following code:

    for row, record in enumerate(reader):
        if len(record) != fields:
            print "Skipping malformed record %i, contains %i fields (%i expected)" % \
                (row, len(record), fields)
        else:
            records.append(record)
    

    with

    rows = list(reader)   # materialize first: an iterator has no len(), unlike a list or tuple
    records = np.asarray([row for row in rows if len(row) == fields])
    print('Number of skipped records: %i' % (len(rows) - len(records)))
    

    Wrapping the list comprehension in np.asarray returns a numpy array and takes advantage of pre-compiled libraries, which should speed things up greatly. Also, I would recommend using print() as a function rather than the print statement, as the former is the standard for Python 3, which is most likely the future, and I would use logging over print.
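    If you do switch to logging, a minimal setup might look like this (a sketch only, reusing the rows and records names from the snippet above):

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger(__name__)

    # same count as above, routed through the logging module instead of print()
    log.info('Number of skipped records: %i', len(rows) - len(records))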

  • 2020-12-16 08:39

    Likely it came from Line 27100 in your data file... and it had 12 columns instead of 16. I.e. it had:

    separator,1,2,3,4,5,6,7,8,9,10,11,12,separator
    

    And it was expecting something like this:

    separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator
    

    I'm not sure how you want to convert your data, but if you have irregular line lengths, the easiest way would be something like this:

    # f is an open file object for your data file; 'someseparator' is your record separator
    lines = f.read().split('someseparator')
    for line in lines:
        splitline = line.split(',')
        #do something with splitline
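    To tie that back to the 12-versus-16 column problem, the "do something" step could simply drop the short records (a sketch; it assumes 16 numeric fields per good record):

    import numpy as np

    expected = 16
    good = []
    for line in lines:
        splitline = line.split(',')
        if len(splitline) != expected:   # skip records that are missing fields
            continue
        good.append([float(x) for x in splitline])
    data = np.array(good)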
    
  • 2020-12-16 08:51

    I have successfully used two approaches: (1) if I simply need to read arbitrary CSV, I use the csv module (as pointed out by other users), and (2) if I require repeated processing of a known CSV (or any other) format, I write a simple parser.

    It seems that your problem fits in the second category, and a parser should be very simple:

    f = open('file.txt', 'r').readlines()
    for line in f:
        tokens = line.strip().split('\t')
        gene = tokens[0]
        vals = [float(k) for k in tokens[1:10]]
        stuff = tokens[10:]
        # do something with gene, vals, and stuff
    

    You can add a line to the parser to skip comments (`if tokens[0] == '#': continue`) or to handle blank lines (`if tokens == ['']: continue`; note that stripping and splitting an empty line gives [''], not []). You get the idea; a fuller version is sketched below.
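    Putting those two extra checks into the parser, and collecting the numeric columns into a numpy array at the end, might look like this (a sketch, reusing the file name and column slices from above):

    import numpy as np

    vals_list = []
    for line in open('file.txt', 'r'):
        tokens = line.strip().split('\t')
        if tokens == [''] or tokens[0] == '#':   # skip blank lines and comments
            continue
        gene = tokens[0]
        vals_list.append([float(k) for k in tokens[1:10]])

    data = np.array(vals_list)   # one row of values per gene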
