问题
This question follows the problem in question: How to read two lines from a file and create dynamics keys in a for-loop?
But, the nature of the problem has evolved to certain complexity that I want to address.
Below is the structure of my data separated by space.
chr pos M1 M2 Mk Mg1 F1_hybrid F1_PG F1_block S1 Sk1 S2 Sj
2 16229767 T/T T/T T/T G/T C|T 1|0 726 . T/C T/C T/C
2 16229783 C/C C/C C/C A/C G|C 0|1 726 G/C G/C G/C C|G
2 16229992 A/A A/A A/A G/A G|A 1|0 726 A/A A/A A/A A|G
2 16230007 T/T T/T T/T A/T A|T 1|0 726 A|T A|T A|T A|T
2 16230011 G/G G/G G/G G/G C|G 1|0 726 G/C C|G C|G G/C
2 16230049 A/A A/A A/A A/A T|A 1|0 726 A|T . A/T A/T
2 16230174 . . . C/C T|C 1|0 726 C|T T|C T|C C|T
2 16230190 A/A A/A A/A A/A T|A 1|0 726 T|G G|T T|G T|G
2 16230260 A/A A/A A/A A/A G|A 1|0 726 G/G G/G G/G G/G
Explanation:
there are two major categories of data in the above file. Data from
Group M
have sample name starting with M, and similarlygroup S
that has several columns names starting with S.And there is a hybrid column (represented by F1_hybrid).
the data is the string along the position line. The F1_hybrid is phased with pipe (|) distinguishing the two letters. So, the two strings values from F1 are
C-G-G-A-C-T-T-T-G
, while another string value is T-C-A-T-G-A-C-A-A. One of this string is from M-group while the other is from S-group but I need to do some statistical analyses to do so. However, I can tell that visually that T-C-A-T-G-A-C-A-A string most likely came from M-group.
Procedure:
I read the first line and create a unique keys using the column information.
Then I read the second and 3rd line and the values in F1_hybrid, which is C|T with G|C. Now, I need to calculate how many GgC (explained as G given C) vs. CgT (C given T) exist between M-group vs. S group.
Then read 3rd (G|C) with 4th (G|A) line in F1_hybrid. So, the states are GgG and AgC. Similarly, I now count have many GcG vs. AgC exist in M vs. S group.
Therefore, I am trying to build a Markov-model
which counts the number of state for a phased string from F1 and taking the observed counts in group M
vs group S
.
I am now explaining, how to count the number of any XgY (X given Y) based on F1_hyrbid:
- It important to note the conditions before doing the count. The existing condition may be phased (which is represented by having pipe) vs. unphased (if the if two line have at least one slash (/).
Condition 01:
The M1
sample has state as (T/T with C/C) for 2nd and 3rd line. since the separator is a slash (/) and not pipe (|) we cannot tell which exact state M1-sample is in. But, we can create combination matrix (for previous state with present state)
T T
C CgT CgT
C CgT CgT
Now, we can tell that there are 4 total CgT
and we keep doing the same matrix if this condition meets.
Condition 02
Same is the case for other samples from Group M, except for Mg1 where the G/T is preceeding A/C. So, the matrix is:
G T
A AgG AgT
C CgG CgT
So, here we observed 1 count of CgT.
Condition 03:
But, if the earlier state - present state are phased by pipe in both states (like A|T at position 16230007 with C|G at position 16230011 for sample Sk1) we can do a direct count of phase state of observed state at that position, that there are only CgA and GgT, so count of CgT is 0.
Condition 04: If one of the state has pipe (|) but other has slash (/), the condition will be same as both state having slash.
Condition 05: If any of the previous_state or present_state is period(.) the observation count is automatically zero (0) for the state expected from F1_hybrid.
So, the expected output should be something like this:
pos M1 M2 Mk Mg1 H0 H1 S1 Sk1 S2 Sj
16..9783 4-CgT 4-CgT 4-CgT 1-CgT GgC CgT 0 1-CgT 1-CgT 1-CgT
16..9992 4-AgC 4-AgC 4-AgC 2-AgC GgG AgC 1-AgC 1-AgC 1-AgC 1-AgC,1-GgG
16..0007 4-TgA 4-TgA 4-TgA 1-AgG,1-TgA AgG TgA 2-TgA 2-TgA 2-TgA1 1-TgA
..................contd
Or, the values in dictionary format for each column would equally work. Something like ['4-CgT','4-CgT','4-CgT','4-CgT']
for first M1 at position 16..9783 and same for other.
回答1:
The question is a bit old, but interesting because you have a very clear specification and you need help to write the code. I will expose a solution following a top-down approach, which is a very well known method, using plain old python. It shouldn't be difficult to adapt to pandas.
The top-down approach means to me: if you don't know how to write it, just name it!
You have a file (or a string) as input, and you want to output a file (or a string). It seems quite simple, but you want to merge pairs of rows to build every new row. The idea is:
- get the rows of the input, as dictionaries
- take them by two
- build a new row for each pair
- output the result
You don't know for now how to write the generator of rows. You don't know either how to build a new row for each pair. Don't stay blocked by the difficulties, just name the solutions. Imagine you have a function get_rows
and a function build_new_row
. Let's write this:
def build_new_rows(f):
"""generate the new rows. Output may be redirected to a file"""
rows = get_rows(f) # get a generator on rows = dictionaries.
r1 = next(rows) # store the first row
for r2 in rows: # for every following row
yield build_new_row(r1, r2) # yield a new row built of the previous stored row and the current row.
r1 = r2 # store the current row, which becomes the previous row
Now, examine the two "missing" functions: get_rows
and build_new_row
.
The function get_rows
is quite easy to write. Here's the main part:
header = process_line(next(f))
for line in f:
yield {k:v for k,v in zip(header, process_line(line))}
where process_line
just splits the line on space, e.g. with a re.split("\s+", line.strip())
.
The second part is build_new_row
. Still the top-down approach: you need to build H0 and H1 from your expected table, and then to build the count of H1 for every M and S according to the conditions you exposed. Pretend you have a pipe_compute
function that compute H0 and H1, and a build_count
function that builds the count of H1 for every M and S:
def build_new_row(r1, r2):
"""build a row"""
h0, h1 = pipe_compute(r1["F1_hybrid"], r2["F1_hybrid"])
# initialize the dict whith the pos, H0 and H1
new_row = {"pos":r2["pos"], "H0":h0, "H1":h1}
for key in r1.keys():
if key[0] in ("M", "S"):
new_row[key] = build_count(r1[key], r2[key], h1)
return new_row
You have almost everything now. Take a look at pipe_compute
: it's exactly what you have written in your condition 03.
def pipe_compute(v1, v2):
"""build H0 H1 according to condition 03"""
xs = v1.split("|")
ys = v2.split("|")
return [ys[0]+"g"+xs[0], ys[1]+"g"+xs[1]]
And for buid_count
, stick to the top-down approach:
def build_count(v1, v2, to_count):
"""nothing funny here: just follow the conditions"""
if is_slash_count(v1, v2): # are conditions 01, 02, 04 true ?
c = slash_count(v1, v2)[to_count] # count how many "to_count" we find in the 2 x 2 table of condtions 01 or 02.
elif "|" in v1 and "|" in v2: # condition 03
c = pipe_count(v1, v2)[to_count]
elif "." in v1 or "." in v2: # condition 05
return '0'
else:
raise Exception(v1, v2)
return "{}-{}".format(c, to_count) # n-XgY
We are still going down. When do we have is_slash_count
? Two slashes (conditions 01 and 02) or one slash and one pipe (condition 04):
def is_slash_count(v1, v2):
"""conditions 01, 02, 04"""
return "/" in v1 and "/" in v2 or "/" in v1 and "|" in v2 or "|" in v1 and "/" in v2
The function slash_count
is simply the 2 x 2 table of conditions 01 and 02:
def slash_count(v1, v2):
"""count according to conditions 01, 02, 04"""
cnt = collections.Counter()
for x in re.split("[|/]", v1): # cartesian product
for y in re.split("[|/]", v2): # cartesian product
cnt[y+"g"+x] += 1
return cnt # a dictionary XgY -> count(XgY)
The function pipe_count
is even simpler, because you just have to count the result of pipe_compute
:
def pipe_count(v1, v2):
"""count according to condition 03"""
return collections.Counter(pipe_compute(v1, v2))
Now you're done (and down). I get this result, which is slightly different from your expectation, but you certainly have already seen my mistake(s?):
pos M1 M2 Mk Mg1 H0 H1 S1 Sk1 S2 Sj
16229783 4-CgT 4-CgT 4-CgT 1-CgT GgC CgT 0 1-CgT 1-CgT 1-CgT
16229992 4-AgC 4-AgC 4-AgC 1-AgC GgG AgC 2-AgC 2-AgC 2-AgC 1-AgC
16230007 4-TgA 4-TgA 4-TgA 1-TgA AgG TgA 2-TgA 2-TgA 2-TgA 0-TgA
16230011 4-GgT 4-GgT 4-GgT 2-GgT CgA GgT 1-GgT 1-GgT 1-GgT 1-GgT
16230049 4-AgG 4-AgG 4-AgG 4-AgG TgC AgG 1-AgG 0 1-AgG 1-AgG
16230174 0 0 0 4-CgA TgT CgA 1-CgA 0 1-CgA 1-CgA
16230190 0 0 0 4-AgC TgT AgC 0-AgC 0-AgC 0-AgC 0-AgC
16230260 4-AgA 4-AgA 4-AgA 4-AgA GgT AgA 0-AgA 0-AgA 0-AgA 0-AgA
Bonus: Try it online!
What is important is, beyond the solution to this specific problem, the method I used and which is widely used in software development. The code may be improved a lot.
来源:https://stackoverflow.com/questions/42012730/how-to-read-two-lines-from-a-file-and-create-dynamics-keys-in-a-for-loop-a-foll