Parsing Data From Long to Wide Format in Python

Submitted by 岁酱吖の on 2019-12-11 02:12:17

Question


I'm wondering what the best way to parse long-form data into wide form is in Python. I've previously done this sort of task in R, but it's taking too long, as my files can be upwards of 1 GB. Here is some dummy data:

Sequence Position Strand Score
Gene1    0        +      1
Gene1    1        +      0.25
Gene1    0        -      1
Gene1    1        -      0.5
Gene2    0        +      0
Gene2    1        +      0.1
Gene2    0        -      0
Gene2    1        -      0.5

But I'd like to have it in wide form, with the scores summed over the strands at each position. Here is the output I hope for:

Sequence 0 1
Gene1    2 0.75
Gene2    0 0.6

Any help on how to attack such a problem conceptually would be really helpful.


Answer 1:


Both of these solutions seem like overkill when you can do it with pandas in a one-liner:

In [7]: df
Out[7]: 
  Sequence  Position Strand  Score
0    Gene1         0      +   1.00
1    Gene1         1      +   0.25
2    Gene1         0      -   1.00
3    Gene1         1      -   0.50
4    Gene2         0      +   0.00
5    Gene2         1      +   0.10
6    Gene2         0      -   0.00
7    Gene2         1      -   0.50

In [8]: df.groupby(['Sequence', 'Position']).Score.sum().unstack('Position')
Out[8]: 
Position  0     1
Sequence         
Gene1     2  0.75
Gene2     0  0.60

If you cannot load the file into memory, then one of the out-of-core solutions in the other answers will work too.
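If the file really is too large to load at once, a hedged middle ground is to stay in pandas but aggregate chunk by chunk and then combine the partial sums. This is a sketch, not the answer's own code; the chunk size is arbitrary, and `io.StringIO` stands in for the real file path:

```python
import io

import pandas as pd

# Inline stand-in for the 1 GB TSV file; replace io.StringIO(data)
# with the real path when reading from disk.
data = """Sequence\tPosition\tStrand\tScore
Gene1\t0\t+\t1
Gene1\t1\t+\t0.25
Gene1\t0\t-\t1
Gene1\t1\t-\t0.5
Gene2\t0\t+\t0
Gene2\t1\t+\t0.1
Gene2\t0\t-\t0
Gene2\t1\t-\t0.5
"""

# Sum per (Sequence, Position) one chunk at a time, then combine the
# partial sums -- the full file never has to be in memory at once.
chunks = pd.read_csv(io.StringIO(data), sep='\t', chunksize=3)
partials = [c.groupby(['Sequence', 'Position']).Score.sum() for c in chunks]
totals = pd.concat(partials).groupby(level=['Sequence', 'Position']).sum()
wide = totals.unstack('Position')
```

Because a sum of sums equals the overall sum, the chunk boundaries don't affect the result; the same trick would not work directly for non-decomposable aggregates like a median.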




Answer 2:


The easy way to solve a problem like this is with a dict, or a pair of them, or even a pair of collections.Counter objects:

from collections import Counter

positions, scores = Counter(), Counter()
for sequence, position, strand, score in reader:
    positions[sequence] += position
    scores[sequence] += score

for sequence in positions:
    writer.writerow((sequence, positions[sequence], scores[sequence]))

The problem is that this may be too large to fit into memory.

I'd certainly try it and see first. A 1GB input file doesn't mean you need 1GB of memory. Remember, you're only keeping track of the sums for each distinct gene, not for each row in the input file. In your example, that means only two genes, out of eight values.
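Note that to reproduce the wide table from the question, the running sum should be keyed on the (sequence, position) pair, not on the sequence alone. A minimal self-contained sketch of that variant (the inline `rows` list stands in for the CSV reader):

```python
from collections import Counter

# Stand-in for rows yielded by a csv.reader over the input file.
rows = [
    ('Gene1', 0, '+', 1.0), ('Gene1', 1, '+', 0.25),
    ('Gene1', 0, '-', 1.0), ('Gene1', 1, '-', 0.5),
    ('Gene2', 0, '+', 0.0), ('Gene2', 1, '+', 0.1),
    ('Gene2', 0, '-', 0.0), ('Gene2', 1, '-', 0.5),
]

totals = Counter()
for sequence, position, strand, score in rows:
    totals[sequence, position] += score   # one running sum per output cell

# Pivot the (sequence, position) sums into one row of scores per sequence.
sequences = sorted({seq for seq, _ in totals})
all_positions = sorted({pos for _, pos in totals})
wide = {seq: [totals[seq, pos] for pos in all_positions] for seq in sequences}
```

Memory use is proportional to the number of distinct (sequence, position) cells in the output, not to the number of input rows, so the same "try it in memory first" advice applies.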


But if you can't fit it in memory, you need a database. Here, you really just need a simple key-value database that acts like a dict, not something fancy. And Python has that built in, as dbm.

For example (written overly verbosely to make sure it's easy to understand):

import csv
import dbm

# dbm stores only string/bytes values, so the running sums are kept as
# strings; mode 'c' creates the databases if they don't exist yet.
with dbm.open('p.db', 'c') as positions, dbm.open('s.db', 'c') as scores:
    with open('input.tsv') as infile:
        for row in csv.DictReader(infile, delimiter='\t'):
            sequence = row['Sequence']
            position = row['Position']
            score = row['Score']
            old_position = positions.get(sequence, b'0')
            positions[sequence] = str(int(old_position) + int(position))
            old_score = scores.get(sequence, b'0')
            scores[sequence] = str(float(old_score) + float(score))
    with open('output.tsv', 'w', newline='') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for sequence in positions.keys():   # dbm yields keys as bytes
            writer.writerow((sequence.decode(),
                             positions[sequence].decode(),
                             scores[sequence].decode()))

If you need to do something more complicated, where a simple key-value database won't work, then you'll probably want to rewrite your logic in SQL and use the sqlite3 module to execute it for you.
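As a hedged sketch of what that rewrite might look like with sqlite3: the table name and column layout below are assumptions, and the inline `executemany` stands in for loading the real file, but the `GROUP BY` does exactly the sum-per-(sequence, position) aggregation from the question:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use a file path for on-disk storage
conn.execute('CREATE TABLE scores (sequence TEXT, position INTEGER, '
             'strand TEXT, score REAL)')
# Stand-in for bulk-loading the input file row by row.
conn.executemany('INSERT INTO scores VALUES (?, ?, ?, ?)', [
    ('Gene1', 0, '+', 1.0), ('Gene1', 1, '+', 0.25),
    ('Gene1', 0, '-', 1.0), ('Gene1', 1, '-', 0.5),
    ('Gene2', 0, '+', 0.0), ('Gene2', 1, '+', 0.1),
    ('Gene2', 0, '-', 0.0), ('Gene2', 1, '-', 0.5),
])
# The database engine does the aggregation, so Python's memory use stays
# small regardless of the input size (when backed by a file).
result = conn.execute('SELECT sequence, position, SUM(score) '
                      'FROM scores GROUP BY sequence, position '
                      'ORDER BY sequence, position').fetchall()
conn.close()
```

Pivoting the grouped rows into one line per sequence is then a small formatting step over `result`, like the dbm write-out loop above.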

And if your database is so huge that SQLite can't handle it, you can look at MySQL or another external database engine.



Source: https://stackoverflow.com/questions/18969034/parsing-data-from-long-to-wide-format-in-python
