Question
I have a .csv file that contains 5 columns: a_id, b_id, var, lo, up. I would like to create different combinations / patterns between two variables based on a_id, b_id, and var.
First, I would like to delete the records that have no duplicate on a_id and b_id, because without a duplicate no combination or match can be created. As a result, the first record in dataFile.csv is deleted, because it has no duplicate.
For the combinations / patterns between two variables, I would first like to create a single-variable combination for each record of each a_id and b_id. In this case, the values of the second variable are null, as shown in resultFile.csv. For example, creating the combinations / patterns from records 2 to 5, that is, where a_id = 103 and b_id = 195, gives the first rows of resultFile.csv. The other combinations / patterns based on a_id, b_id and var are created in the same way. In the result file, the numbers (1, 2, 3, and so on) appended to the variable names are used only to identify the variables; they are not required in resultFile.csv. I also used a blank row between patterns just to see them clearly; it is not required either. I have shown the different combinations of two variables based on a_id and b_id. The real data contains many different a_id and b_id values.
Any advice or suggestions would be appreciated.
dataFile.csv.
a_id b_id var lo up
103 190 dwel 0 236
103 195 ses 1 3
103 195 ses 4 113
103 195 pv 1 5
103 195 pv 6 29
103 266 dwl 15 92
103 266 dwl 93 144
103 266 dwl 145 521
103 266 ses 1 2
103 266 ses 3 6
103 266 pv 1 2
103 266 pv 3 9
103 266 pv 10 23
103 266 pv 24 33
103 266 Elp 142 711
103 266 Elp 712 885
107 272 dwl 15 95
107 272 dwl 96 624
107 272 ses 1 2
107 272 ses 3 6
107 272 pv 1 2
107 272 pv 3 9
. . . . .
. . . . .
resultFile.csv.
The resultFile.csv should be as follows:
a_id b_id var1 lo up var2 lo up
103 195 ses1 1 3 null null null
103 195 ses2 4 113 null null null
103 195 pv1 1 5 null null null
103 195 pv2 6 29 null null null
103 195 ses1 1 3 pv1 1 5
103 195 ses1 1 3 pv2 6 29
103 195 ses2 4 113 pv1 1 5
103 195 ses2 4 113 pv2 6 29
103 266 dwl1 15 92 null null null
103 266 dwl2 93 144 null null null
103 266 dwl3 145 521 null null null
103 266 ses1 1 2 null null null
103 266 ses2 3 6 null null null
103 266 pv1 1 2 null null null
103 266 pv2 3 9 null null null
103 266 pv3 10 23 null null null
103 266 pv4 24 33 null null null
103 266 elp1 142 711 null null null
103 266 elp2 712 885 null null null
103 266 dwl1 15 92 ses1 1 2
103 266 dwl1 15 92 ses2 3 6
103 266 dwl2 93 144 ses1 1 2
103 266 dwl2 93 144 ses2 3 6
103 266 dwl3 145 521 ses1 1 2
103 266 dwl3 145 521 ses2 3 6
103 266 dwl1 15 92 pv1 1 2
103 266 dwl1 15 92 pv2 3 9
103 266 dwl1 15 92 pv3 10 23
103 266 dwl1 15 92 pv4 24 33
103 266 dwl2 93 144 pv1 1 2
103 266 dwl2 93 144 pv2 3 9
103 266 dwl2 93 144 pv3 10 23
103 266 dwl2 93 144 pv4 24 33
103 266 dwl3 145 521 pv1 1 2
103 266 dwl3 145 521 pv2 3 9
103 266 dwl3 145 521 pv3 10 23
103 266 dwl3 145 521 pv4 24 33
103 266 dwl1 15 92 elp1 142 711
103 266 dwl1 15 92 elp2 712 885
103 266 dwl2 93 144 elp1 142 711
103 266 dwl2 93 144 elp2 712 885
103 266 dwl3 145 521 elp1 142 711
103 266 dwl3 145 521 elp2 712 885
103 266 ses1 1 2 pv1 1 2
103 266 ses1 1 2 pv2 3 9
103 266 ses1 1 2 pv3 10 23
103 266 ses1 1 2 pv4 24 33
103 266 ses2 3 6 pv1 1 2
103 266 ses2 3 6 pv2 3 9
103 266 ses2 3 6 pv3 10 23
103 266 ses2 3 6 pv4 24 33
103 266 ses1 1 2 dwl1 15 92
103 266 ses1 1 2 dwl2 93 144
103 266 ses1 1 2 dwl3 145 521
103 266 ses2 3 6 dwl1 15 92
103 266 ses2 3 6 dwl2 93 144
103 266 ses2 3 6 dwl3 145 521
103 266 ses1 1 2 elp1 142 711
103 266 ses1 1 2 elp2 712 885
103 266 ses2 3 6 elp1 142 711
103 266 ses2 3 6 elp2 712 885
103 266 elp1 142 711 pv1 1 2
103 266 elp1 142 711 pv2 3 9
103 266 elp1 142 711 pv3 10 23
103 266 elp1 142 711 pv4 24 33
103 266 elp2 712 885 pv1 1 2
103 266 elp2 712 885 pv2 3 9
103 266 elp2 712 885 pv3 10 23
103 266 elp2 712 885 pv4 24 33
103 266 elp1 142 711 ses1 1 2
103 266 elp1 142 711 ses2 3 6
103 266 elp2 712 885 ses1 1 2
103 266 elp2 712 885 ses2 3 6
103 266 elp1 142 711 dwl1 15 92
103 266 elp1 142 711 dwl2 93 144
103 266 elp1 142 711 dwl3 145 521
103 266 elp2 712 885 dwl1 15 92
103 266 elp2 712 885 dwl2 93 144
103 266 elp2 712 885 dwl3 145 521
103 266 pv1 1 2 dwl1 15 92
103 266 pv1 1 2 dwl2 93 144
103 266 pv1 1 2 dwl3 145 521
103 266 pv2 3 9 dwl1 15 92
103 266 pv2 3 9 dwl2 93 144
103 266 pv2 3 9 dwl3 145 521
103 266 pv3 10 23 dwl1 15 92
103 266 pv3 10 23 dwl2 93 144
103 266 pv3 10 23 dwl3 145 521
103 266 pv4 24 33 dwl1 15 92
103 266 pv4 24 33 dwl2 93 144
103 266 pv4 24 33 dwl3 145 521
103 266 pv1 1 2 ses1 1 2
103 266 pv1 1 2 ses2 3 6
103 266 pv2 3 9 ses1 1 2
103 266 pv2 3 9 ses2 3 6
103 266 pv3 10 23 ses1 1 2
103 266 pv3 10 23 ses2 3 6
103 266 pv4 24 33 ses1 1 2
103 266 pv4 24 33 ses2 3 6
103 266 pv1 1 2 elp1 142 711
103 266 pv1 1 2 elp2 712 885
103 266 pv2 3 9 elp1 142 711
103 266 pv2 3 9 elp2 712 885
103 266 pv3 10 23 elp1 142 711
103 266 pv3 10 23 elp2 712 885
103 266 pv4 24 33 elp1 142 711
103 266 pv4 24 33 elp2 712 885
Answer 1:
The following Python solution should get you started:
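from itertools import groupby, product
import csv

output_header = ["a_id", "b_id", "var1", "lo", "up", "var2", "lo", "up"]

# Binary modes ('rb'/'wb') are what the csv module expects on Python 2
f_input = open('dataFile.csv', 'rb')
csv_input = csv.reader(f_input)
input_header = next(csv_input)  # skip the header row

f_output = open('resultFile.csv', 'wb')
csv_output = csv.writer(f_output)
csv_output.writerow(output_header)

# Group consecutive rows sharing the same (a_id, b_id)
for k1, g1 in groupby(csv_input, key=lambda x: (x[0], x[1])):
    group1 = list(g1)
    if len(group1) > 1:  # drop records that have no duplicate
        # Single-variable patterns: the second variable is null
        for row in group1:
            csv_output.writerow(row + ['null'] * 3)
        # Sub-group by var, then take one row from each var group
        p = [list(g2) for k2, g2 in groupby(group1, key=lambda x: x[2])]
        # Note: only the first two var groups are paired here
        for pairs in product(*p):
            if len(pairs) > 1:
                csv_output.writerow(pairs[0] + pairs[1][2:])

f_input.close()
f_output.close()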
This will give you a resultFile.csv file starting as follows:
a_id,b_id,var1,lo,up,var2,lo,up
103,195,ses,1,3,null,null,null
103,195,ses,4,113,null,null,null
103,195,pv,1,5,null,null,null
103,195,pv,6,29,null,null,null
103,195,ses,1,3,pv,1,5
103,195,ses,1,3,pv,6,29
103,195,ses,4,113,pv,1,5
103,195,ses,4,113,pv,6,29
103,266,dwl,15,92,null,null,null
103,266,dwl,93,144,null,null,null
103,266,dwl,145,521,null,null,null
...
Tested using Python 2.6.6 (which I believe the OP is using)
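For groups with more than two distinct var values (such as a_id = 103, b_id = 266), every pair of var groups needs to be combined. The following is only a rough sketch of one way to do that, written for Python 3 rather than 2.6. The function name build_patterns and the both_orders flag are made up for illustration, and it assumes the input is comma-separated and that rows with the same (a_id, b_id), and the same var within them, are adjacent, as in the sample data.

```python
import csv
import io
from itertools import combinations, groupby, permutations, product


def build_patterns(rows, both_orders=False):
    """Return the output rows for an iterable of [a_id, b_id, var, lo, up] lists.

    Hypothetical helper: assumes rows with equal (a_id, b_id), and equal var
    within them, are adjacent, because groupby only merges consecutive rows.
    """
    out = []
    for _, id_rows in groupby(rows, key=lambda r: (r[0], r[1])):
        id_rows = list(id_rows)
        if len(id_rows) < 2:
            continue  # no duplicate on (a_id, b_id): drop the record
        # Single-variable patterns: the second variable is null
        for row in id_rows:
            out.append(row + ["null"] * 3)
        # One list of rows per distinct var
        var_groups = [list(g) for _, g in groupby(id_rows, key=lambda r: r[2])]
        # combinations gives each pair of vars once; permutations gives both orders
        pick = permutations if both_orders else combinations
        for first, second in pick(var_groups, 2):
            for left, right in product(first, second):
                out.append(left + right[2:])
    return out


# Small in-memory stand-in for dataFile.csv
sample = io.StringIO(
    "a_id,b_id,var,lo,up\n"
    "103,190,dwel,0,236\n"
    "103,195,ses,1,3\n"
    "103,195,ses,4,113\n"
    "103,195,pv,1,5\n"
    "103,195,pv,6,29\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
result = build_patterns(reader)
```

With both_orders=False each pair of vars appears once (ses with pv), which matches the 103 / 195 block in the question; both_orders=True also emits the reversed pairs (pv with ses), which is what the 103 / 266 block in the question appears to contain.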
Source: https://stackoverflow.com/questions/34309176/create-different-combination-patterns-between-the-data-of-two-columns-of-a-csv