Question
I need to parse a file which has contents that look like this:
20 31022550 G 1396 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:1391:60.00:36.08:36.97:719:672:0.51:0.01:7.59:719:0.49:126.00:0.50 T:1:60.00:33.00:37.00:0:1:0.37:0.02:47.00:0:0.00:126.00:0.18 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 +A:2:60.00:0.00:37.00:2:0:0.67:0.01:0.00:2:0.65:126.00:0.65
20 31022551 A 1271 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:960:60.00:35.23:36.99:496:464:0.50:0.00:6.38:496:0.49:126.00:0.52 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:13:60.00:35.00:35.92:4:9:0.13:0.02:44.92:4:0.98:126.00:0.37 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 +G:288:60.00:0.00:37.00:171:117:0.57:0.01:8.17:171:0.54:126.00:0.53 +GG:9:60.00:0.00:37.00:5:4:0.71:0.03:23.67:5:0.50:126.00:0.57 +GGG:1:60.00:0.00:37.00:1:0:0.51:0.03:14.00:1:0.24:126.00:0.24
After parsing, I want the output to look like this:
20 31022550 G 1396 = 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 A 2 60 33 37 2 0 0.02 0.02 40 2 0.98 126
20 31022550 G 1396 C 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 G 1391 60 36.08 36.97 719 672 0.51 0.01 7.59 719 0.49 126
20 31022550 G 1396 T 1 60 33 37 0 1 0.37 0.02 47 0 0 126
20 31022550 G 1396 N 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 +A 2 60 0 37 2 0 0.67 0.01 0 2 0.65 126
20 31022551 A 1271 = 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 A 960 60 35.23 36.99 496 464 0.5 0 6.38 496 0.49 126
20 31022551 A 1271 C 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 G 13 60 35 35.92 4 9 0.13 0.02 44.92 4 0.98 126
20 31022551 A 1271 T 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 N 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 +G 288 60 0 37 171 117 0.57 0.01 8.17 171 0.54 126
20 31022551 A 1271 +GG 9 60 0 37 5 4 0.71 0.03 23.67 5 0.5 126
20 31022551 A 1271 +GGG 1 60 0 37 1 0 0.51 0.03 14 1 0.24 126
The file has more lines, with the value in column[1] incrementing from line to line:
31022550...31022NNN
Code
What I am trying to do here is print only certain parts of the file with this pseudo code, keeping column[1] as the key:
from collections import defaultdict
ids = defaultdict(list)
with open('~/file.tsv', 'r') as f:
    for line in f:
        lines = line.strip().split('\t')
        pos = (lines[0:3])
        for ele in lines[4:]:
            # print pos
            p = pos[1].strip()
            base = ele.split(':')[0]
            ids[p] = {
                'pos': pos[0].strip(),
                'base': base,
                'count': ele.split(':')[1],
                '_pos': ele.split(':')[5],
                '_neg': ele.split(':')[6]
            }

for k,v in ids.iteritems():
    print k,v
Output
31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}
I am not sure why I do not see all the records for 31022550 stored under that key.
Answer 1:
You are assigning only the last dictionary to your p key:
ids[p] = {
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
}
This bypasses the factory for new keys altogether; each assignment simply replaces the value stored under the key with a new dictionary. If you want to build a list of dictionaries per key, you need to use list.append():
ids[p].append({
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
})
This looks up the ids[p] value (which is created as an empty list if the key does not yet exist) and then appends your dictionary to the end of that list.
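To make the difference concrete, here is a minimal sketch that contrasts the two patterns on a single key. It is written for Python 3, and the variable names `overwritten` and `collected` are made up for illustration; only `defaultdict(list)` and the key come from the question.

```python
from collections import defaultdict

# Plain assignment: each write replaces whatever was stored under the key,
# so the list factory is never used.
overwritten = defaultdict(list)
for base in ["=", "A", "C"]:
    overwritten["31022550"] = {"base": base}

# Appending: the first lookup of a missing key calls list() to create [],
# and every record is then added to that same list.
collected = defaultdict(list)
for base in ["=", "A", "C"]:
    collected["31022550"].append({"base": base})

print(overwritten["31022550"])      # {'base': 'C'}
print(len(collected["31022550"]))   # 3
```

With assignment only the last record survives, which is what the output in the question shows.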
I'd simplify the code somewhat using the csv module to handle splitting of the lines:
import csv
from collections import defaultdict

ids = defaultdict(list)
with open('~/file.tsv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        pos, key = row[:2]
        for elems in row[4:]:
            elems = elems.split(':')
            ids[key].append({
                'pos': pos,
                'base': elems[0],
                'count': elems[1],
                '_pos': elems[5],
                '_neg': elems[6]
            })

for key, rows in ids.iteritems():
    for row in rows:
        print '{}\t{}'.format(key, row)
This produces:
31022550 {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550 {'count': '2', 'base': 'A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022550 {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550 {'count': '1391', 'base': 'G', 'pos': '20', '_neg': '672', '_pos': '719'}
31022550 {'count': '1', 'base': 'T', 'pos': '20', '_neg': '1', '_pos': '0'}
31022550 {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '960', 'base': 'A', 'pos': '20', '_neg': '464', '_pos': '496'}
31022551 {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '13', 'base': 'G', 'pos': '20', '_neg': '9', '_pos': '4'}
31022551 {'count': '0', 'base': 'T', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '288', 'base': '+G', 'pos': '20', '_neg': '117', '_pos': '171'}
31022551 {'count': '9', 'base': '+GG', 'pos': '20', '_neg': '4', '_pos': '5'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}
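The code above is Python 2 (`print` statement, `iteritems()`, binary mode for `csv`). For readers on Python 3, the same approach might look roughly like the sketch below. The function name `parse_counts` and the in-memory `sample` are assumptions made for illustration: the sample reuses two of the colon-delimited fields from the first line of the question, joined with tabs, so the block runs without a file on disk.

```python
import csv
import io
from collections import defaultdict


def parse_counts(handle):
    """Group per-base records by the value in the second column."""
    ids = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        pos, key = row[:2]
        for elem in row[4:]:
            fields = elem.split(":")
            ids[key].append({
                "pos": pos,
                "base": fields[0],
                "count": fields[1],
                "_pos": fields[5],
                "_neg": fields[6],
            })
    return ids


# Hypothetical stand-in for the real file: one row with two base fields.
sample = "\t".join([
    "20", "31022550", "G", "1396",
    "A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98",
    "G:1391:60.00:36.08:36.97:719:672:0.51:0.01:7.59:719:0.49:126.00:0.50",
]) + "\n"

ids = parse_counts(io.StringIO(sample))
for key, rows in ids.items():
    for row in rows:
        print("{}\t{}".format(key, row))
```

The two printed records carry the same values as the A and G entries for 31022550 listed above, although the keys print in insertion order on Python 3.7 and later. To read a real file, pass `open(path, newline="")` instead of the `StringIO` object; note that `open()` does not expand `~`, so a path such as `~/file.tsv` would first need `os.path.expanduser`.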
Source: https://stackoverflow.com/questions/46264408/using-defaultdict-to-parse-multi-delimiter-file