Question
I'm wondering what the best way is to parse long-form data into wide form in Python. I've previously been doing this sort of task in R, but it really is taking too long, as my files can be upwards of 1 GB. Here is some dummy data:
Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
But I'd like to have it in wide form, where I've summed the scores over the strands at each position. Here is the output I hope for:
Sequence 0 1
Gene1 2 0.75
Gene2 0 0.6
Any help on how to attack such a problem conceptually would be really helpful.
Answer 1:
Both of these solutions seem like overkill when you can do it with pandas in a one-liner:
In [7]: df
Out[7]:
Sequence Position Strand Score
0 Gene1 0 + 1.00
1 Gene1 1 + 0.25
2 Gene1 0 - 1.00
3 Gene1 1 - 0.50
4 Gene2 0 + 0.00
5 Gene2 1 + 0.10
6 Gene2 0 - 0.00
7 Gene2 1 - 0.50
In [8]: df.groupby(['Sequence', 'Position']).Score.sum().unstack('Position')
Out[8]:
Position 0 1
Sequence
Gene1 2 0.75
Gene2 0 0.60
If you cannot load the file into memory, then an out-of-core solution like the ones in the other answers will work too.
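For reference, the same groupby can be done out-of-core with pandas itself by reading the file in chunks and combining the per-chunk sums. This is a sketch, not from the original answer; the chunk size and the in-memory `StringIO` stand-in for the real tab-separated file are assumptions for illustration:

```python
import io
import pandas as pd

# Dummy long-form data; in practice this would be the 1 GB file on disk.
data = io.StringIO(
    "Sequence\tPosition\tStrand\tScore\n"
    "Gene1\t0\t+\t1\nGene1\t1\t+\t0.25\nGene1\t0\t-\t1\nGene1\t1\t-\t0.5\n"
    "Gene2\t0\t+\t0\nGene2\t1\t+\t0.1\nGene2\t0\t-\t0\nGene2\t1\t-\t0.5\n"
)

# Read in chunks so the whole file never sits in memory: sum within each
# chunk, then concatenate the partial sums and sum again across chunks.
parts = [
    chunk.groupby(['Sequence', 'Position']).Score.sum()
    for chunk in pd.read_csv(data, sep='\t', chunksize=3)
]
totals = pd.concat(parts).groupby(level=['Sequence', 'Position']).sum()
wide = totals.unstack('Position')
print(wide)
```

Because addition is associative, summing partial sums gives the same result as a single pass; only the distinct (Sequence, Position) pairs ever need to be held in memory.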
Answer 2:
The easy way to solve a problem like this is with a dict, or a pair of them, or even a pair of collections.Counters:
from collections import Counter

positions, scores = Counter(), Counter()
for sequence, position, strand, score in reader:
    positions[sequence] += position
    scores[sequence] += score
for sequence in positions:
    writer.writerow((sequence, positions[sequence], scores[sequence]))
The problem is that this may be too large to fit into memory.
I'd certainly try it and see first. A 1GB input file doesn't mean you need 1GB of memory. Remember, you're only keeping track of the sums for each distinct gene, not for each row in the input file. In your example, that means only two genes, out of eight values.
But if you can't fit it in memory, you need a database. Here, you really just need a simple key-value database that acts like a dict, not something fancy. And Python has that built in, as dbm.
For example (written overly verbosely to make sure it's easy to understand):
import csv
import dbm

# Open (or create) the two on-disk key-value stores; dbm stores
# everything as bytes, so numbers are converted to/from strings.
with dbm.open('p.db', 'c') as positions, dbm.open('s.db', 'c') as scores:
    with open('input.tsv') as infile:
        for row in csv.DictReader(infile, delimiter='\t'):
            sequence = row['Sequence']
            position = row['Position']
            score = row['Score']
            old_position = positions.get(sequence, b'0')
            positions[sequence] = str(int(old_position) + int(position))
            old_score = scores.get(sequence, b'0')
            scores[sequence] = str(float(old_score) + float(score))
    with open('output.tsv', 'w') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for sequence in positions.keys():
            writer.writerow((sequence.decode(),
                             positions[sequence].decode(),
                             scores[sequence].decode()))
If you need to do something more complicated, where a simple key-value database won't work, then you'll probably want to rewrite your logic in SQL and use the sqlite3 module to execute it for you.
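A minimal sketch of that sqlite3 idea, not from the original answer: the table name, column names, and the in-memory `StringIO` input are all made up for illustration, and a real run would pass an on-disk path (e.g. 'scores.db') to `sqlite3.connect` to keep the data out of memory:

```python
import csv
import io
import sqlite3

# Stand-in for the tab-separated input file.
infile = io.StringIO(
    "Sequence\tPosition\tStrand\tScore\n"
    "Gene1\t0\t+\t1\nGene1\t1\t+\t0.25\n"
    "Gene2\t0\t+\t0\nGene2\t1\t+\t0.1\n"
)

# ':memory:' keeps the example self-contained; use a file path for real data.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE long (sequence TEXT, position INTEGER, score REAL)')

# Stream rows from the reader straight into the table.
rows = ((r['Sequence'], int(r['Position']), float(r['Score']))
        for r in csv.DictReader(infile, delimiter='\t'))
con.executemany('INSERT INTO long VALUES (?, ?, ?)', rows)

# Let SQL do the aggregation: one row per (sequence, position) pair.
result = con.execute(
    'SELECT sequence, position, SUM(score) FROM long '
    'GROUP BY sequence, position ORDER BY sequence, position'
).fetchall()
print(result)
```

The GROUP BY does the same summing as the Counter version, but SQLite spills to disk as needed, so the working set is bounded regardless of input size.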
And if your database is so huge that SQLite can't handle it, you can look at MySQL or another external database engine.
Source: https://stackoverflow.com/questions/18969034/parsing-data-from-long-to-wide-format-in-python