Question
I have a matrix like this (the first column holds names, the rest are values; the separator is a tab):
name1 A1 B1 C1 D1
name2 A2 B2 C2 D2
The matrix could be huge (meaning hundreds of rows and columns). It is always the same size. I can expect zero values.
I need output like this:
name1 A1 B1 C1 D1 A1/B1 A1/C1 A1/D1
name2 A2 B2 C2 D2 A2/B2 A2/C2 A2/D2
Save this combination to a new file. Then make another combination:
name1 A1 B1 C1 D1 B1/A1 B1/C1 B1/D1
name2 A2 B2 C2 D2 B2/A2 B2/C2 B2/D2
and so on: divide each column by each of the other columns in the matrix and save each result as TSV to a new file. The quotients should also be rounded to three decimal places.
I can do this manually with a script:
awk '{OFS="\t"}{$6=$2/($3+0.001); $7=$2/($4+0.001); $8=$2/($5+0.001)}1' input_file.tsv
The reason I add 0.001 is that division by zero is impossible. I could write a shell script with a while loop, but it takes a long time.
I would be very grateful for any way to automate this process.
Answer 1:
Since you tagged the question with python-3.x, here is a script to achieve what you want (it requires Python 3.6+ though, because of f-strings):
from pathlib import Path
import csv

source = Path('input.tsv')

with source.open() as src:
    csvreader = csv.reader(src, dialect='excel-tab')
    # get number of columns and rewind
    cols = len(next(csvreader)[1:])
    src.seek(0)
    csvwriters = []
    # create a csv.writer for each column
    for i in range(cols):
        # output_col_01.tsv, output_col_02.tsv ...
        csvwriters.append(
            csv.writer(
                Path(f'output_col_{i + 1:02d}.tsv').open('w'),
                dialect='excel-tab'
            )
        )
    nan = float('nan')
    for name, *cols in csvreader:
        for i, a in enumerate(cols):
            row = [name]
            for j, b in enumerate(cols):
                # skip the quotient of a col by itself
                if i != j:
                    a = float(a)
                    b = float(b)
                    # nan if division by zero
                    row.append(round(a / b, 4) if b else nan)
            csvwriters[i].writerow(row)
Instead of adding 0.001 for operations where the divisor is 0, I opted to return float('nan').
It will not divide a column by itself and will round the quotients to 4 decimal places.
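The per-row logic can also be pulled out into a small function, which makes the number of decimal places easy to change (the question asked for three). This is only a minimal sketch: the name quotient_row and its digits parameter are hypothetical and not part of the script above.

```python
import math


def quotient_row(values, i, digits=3):
    """Divide values[i] by every other value; NaN where the divisor is zero."""
    a = float(values[i])
    out = []
    for j, b in enumerate(values):
        if j == i:
            continue  # skip the quotient of a column by itself
        b = float(b)
        # digits=3 is an assumption taken from the question, not from the script
        out.append(round(a / b, digits) if b else math.nan)
    return out


print(quotient_row(['1', '2', '4', '0'], 0))  # prints [0.5, 0.25, nan]
```

Passing digits=4 should reproduce the rounding used in the script.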
Finally, if you are using a Python version earlier than 3.6 (you will still need Python 3.4+, because of pathlib.Path()), then replace the following line:
Path(f'output_col_{i + 1:02d}.tsv').open('w'),
with:
Path('output_col_%02d.tsv' % (i + 1)).open('w'),
That's needed because f-strings were introduced in Python 3.6.
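As a quick check that the two spellings give the same file name, here is a small sketch. The helper names name_fstring and name_percent are made up for illustration.

```python
def name_fstring(i):
    # Python 3.6+ spelling, as used in the script
    return f'output_col_{i + 1:02d}.tsv'


def name_percent(i):
    # %-formatting spelling, which also works on older Python 3 versions
    return 'output_col_%02d.tsv' % (i + 1)


print(name_fstring(0))  # prints output_col_01.tsv
print(name_percent(0))  # prints output_col_01.tsv
```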
Answer 2:
Could you please try the following. From your attempt I am assuming that your Input_file is delimited by spaces, NOT by commas; if it uses any other delimiter, add BEGIN{FS=","} (comma as an example) to the following code too. Thanks to @accdias for adding the logic that removes Control-M (carriage return) characters too.
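awk '
{
  gsub(/\r/,"")                           # strip Control-M (carriage return) characters
}
{
  nf=NF                                   # remember the original number of fields
  for(k=2;k<=nf;k++){
    for(i=2;i<=nf;i++){
      if($i!=0){
        $(NF+1)=sprintf("%.03f",$k/$i)    # append the quotient with 3 decimals
      }
      else{
        $(NF+1)=sprintf("%s","NaN")       # divisor is zero
      }
    }
    out_file=k"field_out_file"
    print >> (out_file)
    close(out_file)                       # close right away so only one output file stays open
    NF=nf                                 # drop the appended fields before the next k
  }
}' Input_file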
What does the code take care of:
- It creates output file names from the field number, e.g. 2field_out_file means the 2nd field is divided by all fields throughout Input_file.
- In the back end all output files would be left open, so the close function is used to avoid errors like "too many files opened".
- It checks for 0 values: if anything would be divided by zero, it prints NaN in the output.
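For comparison, the same idea can be sketched in Python so that only one output file is open at a time. This assumes the whole matrix fits in memory, which seems reasonable for the hundreds of rows and columns mentioned in the question. The function name write_quotient_files and the out_dir parameter are hypothetical. Unlike the awk version, this sketch writes tab-separated output and overwrites existing files instead of appending to them.

```python
import csv
from pathlib import Path


def write_quotient_files(source, out_dir):
    """Write one TSV per column k: the original row plus row[k] / row[i] for every i."""
    with Path(source).open(newline='') as src:
        rows = [row for row in csv.reader(src, dialect='excel-tab') if row]
    if not rows:
        return []
    n_fields = len(rows[0])
    written = []
    # Fields are numbered from 1 as in awk; field 1 holds the name.
    for k in range(2, n_fields + 1):
        target = Path(out_dir) / f'{k}field_out_file'
        # Only this one output file is open at any time.
        with target.open('w', newline='') as dst:
            writer = csv.writer(dst, dialect='excel-tab', lineterminator='\n')
            for row in rows:
                numerator = float(row[k - 1])
                quotients = []
                for cell in row[1:]:
                    divisor = float(cell)
                    # "NaN" instead of dividing by zero, as in the awk code
                    quotients.append(f'{numerator / divisor:.3f}' if divisor else 'NaN')
                writer.writerow(row + quotients)
        written.append(target)
    return written
```

Calling write_quotient_files('Input_file', '.') should then produce 2field_out_file, 3field_out_file and so on in the current directory.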
Source: https://stackoverflow.com/questions/59878040/how-to-divide-specific-column-with-rest-of-columns