I am having trouble reading a tab-delimited CSV file in Python. I use the following function:
def csv2array(filename, skiprows=0, delimiter='\t', ra
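(For context, NumPy's own genfromtxt already reads tab-delimited files directly; a minimal sketch, with invented data standing in for a real filename:)

```python
import io

import numpy as np

# Stand-in for a real tab-delimited file on disk.
data = io.StringIO("1\t2\t3\n4\t5\t6\n")

# delimiter='\t' splits on tabs; skip_header=n would drop n leading rows.
arr = np.genfromtxt(data, delimiter='\t')
print(arr.shape)  # (2, 3)
```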
Check out the python CSV module: http://docs.python.org/library/csv.html
import csv

reader = csv.reader(open("myfile.csv", "rb"),
                    delimiter='\t', quoting=csv.QUOTE_NONE)
header = []
records = []
fields = 16

if thereIsAHeader: header = reader.next()

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)
# do numpy stuff.
May I ask why you're not using the built-in csv reader? http://docs.python.org/library/csv.html
I've used it very effectively with numpy/scipy. I would share my code but unfortunately it's owned by my employer, but it should be very straightforward to write your own.
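To give a flavour of the csv + numpy combination, here is a minimal sketch (the file contents and column layout are invented for the example):

```python
import csv
import io

import numpy as np

# Stand-in for open('myfile.csv', newline=''); contents are invented.
f = io.StringIO("gene1\t1.0\t2.0\ngene2\t3.0\t4.0\n")

reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
names, rows = [], []
for record in reader:
    names.append(record[0])                       # first column: label
    rows.append([float(x) for x in record[1:]])   # remaining columns: data

values = np.array(rows)
print(names, values.shape)  # ['gene1', 'gene2'] (2, 2)
```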
I think Nick T's approach would be the better way to go. I would make one change. As I would replace the following code:
for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)
with
rows = list(reader)  # note: an iterator does not have a length like a list or tuple, so materialize it first
records = np.asarray([row for row in rows if len(row) == fields])
print('Number of skipped records: %i' % (len(rows) - len(records)))
Wrapping the filtered rows in np.asarray returns a NumPy array and takes advantage of pre-compiled libraries, which should speed things up greatly. Also, I would recommend using print() as a function rather than the print "" statement, since the former is the standard for Python 3, which is most likely the future, and I would use logging over print.
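To illustrate the logging suggestion, a self-contained sketch (file contents invented; materializing the reader into a list sidesteps the iterator-has-no-len() problem):

```python
import csv
import io
import logging

import numpy as np

logging.basicConfig(level=logging.INFO)

# Invented data: the middle row is malformed (only 2 fields).
f = io.StringIO("1\t2\t3\n4\t5\n6\t7\t8\n")
fields = 3

rows = list(csv.reader(f, delimiter='\t'))        # materialize the iterator
good = [row for row in rows if len(row) == fields]
records = np.asarray(good, dtype=float)

logging.info('Number of skipped records: %i', len(rows) - len(good))
print(records.shape)  # (2, 3)
```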
Likely it came from Line 27100 in your data file... and it had 12 columns instead of 16. I.e. it had:
separator,1,2,3,4,5,6,7,8,9,10,11,12,separator
And it was expecting something like this:
separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator
I'm not sure how you want to convert your data, but if you have irregular line lengths, the easiest way would be something like this:
lines = f.read().split('someseparator')
for line in lines:
    splitline = line.split(',')
    # do something with splitline
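For example, with a blank line as the record separator (all data invented), irregular field counts per record are no problem:

```python
# Records separated by blank lines; field counts vary per record.
text = "1,2,3\n\n4,5\n\n6,7,8,9"

records = [block.split(',') for block in text.split('\n\n')]
print([len(r) for r in records])  # [3, 2, 4]
```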
I have successfully used two methodologies: (1) if I simply need to read arbitrary CSV, I use the CSV module (as pointed out by other users); and (2) if I require repeated processing of a known CSV (or any) format, I write a simple parser.
It seems that your problem fits in the second category, and a parser should be very simple:
f = open('file.txt', 'r').readlines()
for line in f:
    tokens = line.strip().split('\t')
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    # do something with gene, vals, and stuff
You can add a line in the reader for skipping comments (`if tokens[0].startswith('#'): continue`) or to handle blank lines (`if tokens == ['']: continue`; note that ''.split('\t') yields [''], not []). You get the idea.
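Putting those pieces together, a sketch of such a parser (the file contents and the 1-name-plus-9-values column layout are invented for the example):

```python
import io

# Stand-in for open('file.txt'); layout: name, 9 floats, trailing fields.
f = io.StringIO(
    "# comment line\n"
    "geneA\t1\t2\t3\t4\t5\t6\t7\t8\t9\tx\ty\n"
    "\n"
    "geneB\t9\t8\t7\t6\t5\t4\t3\t2\t1\tz\n"
)

parsed = []
for line in f:
    tokens = line.strip().split('\t')
    if tokens == ['']:               # blank line
        continue
    if tokens[0].startswith('#'):    # comment
        continue
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    parsed.append((gene, vals, stuff))

print(len(parsed))  # 2
```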