How to divide a specific column by the rest of the columns

拈花ヽ惹草 submitted on 2021-02-11 12:17:29

Question


I have a matrix like this (the first column holds names, the rest are values, and the separator is a tab):

name1 A1 B1 C1 D1
name2 A2 B2 C2 D2

The matrix can be huge (by which I mean hundreds of rows and columns). Every row always has the same number of columns. Zero values can occur.

I need output like this:

name1 A1 B1 C1 D1 A1/B1 A1/C1 A1/D1
name2 A2 B2 C2 D2 A2/B2 A2/C2 A2/D2

This combination should be saved to a new file. Then I need the next combination:

name1 A1 B1 C1 D1 B1/A1 B1/C1 B1/D1
name2 A2 B2 C2 D2 B2/A2 B2/C2 B2/D2

and so on: divide each column by every other column in the matrix and save each result as TSV to a new file. The quotients should also be rounded to three decimal places.

I can do this manually with a script:

awk '{OFS="\t"}{$6=$2/($3+0.001); $7=$2/($4+0.001); $8=$2/($5+0.001)}1' input_file.tsv

The reason I add 0.001 is to avoid division by zero. I could write a shell script with a while loop, but it takes a long time.

I would be very grateful for any way to automate this process.


Answer 1:


Since you tagged the question with python-3.x, here is a script to achieve what you want (it requires Python 3.6+ though, because of f-strings):

from pathlib import Path
import csv

source = Path('input.tsv')

with source.open() as src:
    csvreader = csv.reader(src, dialect='excel-tab')

    # get number of columns and rewind
    cols = len(next(csvreader)[1:])
    src.seek(0)

    csvwriters = []

    # create a csv.writer for each column
    for i in range(cols):
        # output_col_01.tsv, output_col_02.tsv ...
        csvwriters.append(
            csv.writer(
                Path(f'output_col_{i + 1:02d}.tsv').open('w'),
                dialect='excel-tab'
            )
        )

    nan = float('nan')

    for name, *values in csvreader:
        # convert each value once instead of on every inner iteration
        values = [float(v) for v in values]
        for i, a in enumerate(values):
            row = [name]
            for j, b in enumerate(values):
                # skip the quotient of a col by itself
                if i != j:
                    # nan if division by zero
                    row.append(round(a / b, 3) if b else nan)

            csvwriters[i].writerow(row)

Instead of adding 0.001 for operations where the divisor is 0, I opted to return float('nan').

It will not divide a column by itself and will round the quotients to 3 decimal places, as the question asks.
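The sample output in the question keeps the original values in front of the quotients, whereas the script above writes only the name and the quotients. If the original columns are wanted as well, one row could be built roughly as follows. This is only a sketch: `ratio_row` is a hypothetical helper that is not part of the answer, and it assumes the values have already been converted to floats.

```python
import math


def ratio_row(name, values, i, ndigits=3):
    """Build [name, original values..., values[i] / values[j] for every j != i]."""
    numerator = values[i]
    quotients = [
        # NaN instead of a quotient when the divisor is zero
        round(numerator / b, ndigits) if b else math.nan
        for j, b in enumerate(values)
        if j != i  # skip the quotient of a column by itself
    ]
    return [name, *values, *quotients]


# Hypothetical sample row: column 0 divided by columns 1, 2 and 3
row = ratio_row('name1', [1.0, 2.0, 0.0, 4.0], i=0)
print(row)
```

Writing `ratio_row(name, values, i)` instead of the bare `row` list in the inner loop would give each output file the layout shown in the question.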

Finally, if you are using a Python version earlier than 3.6 (you still need 3.4 or later because of pathlib.Path()), replace this line of the script:

Path(f'output_col_{i + 1:02d}.tsv').open('w'),

with:

Path('output_col_%02d.tsv' % (i + 1)).open('w'),

That's needed because f-strings were introduced in Python 3.6.
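As a quick check that the two spellings produce the same file names, here is a small sketch (the count of 12 columns is an arbitrary assumption chosen to cover both one-digit and two-digit numbers):

```python
# f-string version (Python 3.6+)
names_fstring = [f'output_col_{i + 1:02d}.tsv' for i in range(12)]

# %-formatting version (works on older Python 3 releases too)
names_percent = ['output_col_%02d.tsv' % (i + 1) for i in range(12)]

# Both zero-pad the 1-based column number to two digits
print(names_fstring == names_percent)
print(names_percent[0], names_percent[-1])
```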




Answer 2:


Could you please try the following? Judging by your attempt, I am assuming that your Input_file is delimited by whitespace, not commas; if it uses a different delimiter, add a BEGIN block that sets FS to the code below, e.g. BEGIN{FS=","} for commas. Thanks to @accdias for suggesting the logic that removes control-M (carriage return) characters.

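awk '
{
   gsub(/\r/,"")          # strip carriage returns (Windows line endings)
}
{
  nf=NF                   # remember the original number of fields
  for(k=2;k<=nf;k++){
    for(i=2;i<=nf;i++){
      if($i!=0){
         $(NF+1)=sprintf("%.03f",$k/$i)
      }
      else{
         $(NF+1)="NaN"
      }
    }
    out_file=k"field_out_file"
    print >> (out_file)
    close(out_file)       # close right away so only one output file is open at a time
    NF=nf                 # drop the appended quotients before the next pass
  }
}'  Input_file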

What the code takes care of:

  • It names each output file after the field number: for example, 2field_out_file holds the 2nd field divided by each value field, for every row of Input_file.
  • Many output files get written, so the close function is used to avoid errors such as "too many open files".
  • It checks for zero values: whenever the divisor is 0, it prints NaN in the output.
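The open, append, close pattern from the second bullet is not specific to awk. For comparison, a minimal Python sketch of the same idea follows; `append_row` is a hypothetical helper, and the temporary directory and sample rows are made up purely for illustration.

```python
import csv
import os
import tempfile


def append_row(path, row):
    """Append one tab-separated row and close the file straight away."""
    # newline='' is what the csv module documentation recommends for csv files
    with open(path, 'a', newline='') as handle:
        csv.writer(handle, dialect='excel-tab').writerow(row)


# Illustrative output location and data (assumptions, not from the answer)
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, '2field_out_file')
append_row(path, ['name1', '0.500', 'NaN'])
append_row(path, ['name2', '2.000', '0.250'])

# Read the file back to confirm both rows were appended
with open(path, newline='') as handle:
    rows = list(csv.reader(handle, dialect='excel-tab'))
print(rows)
```

Reopening a file for every row costs some speed, but it keeps the number of simultaneously open files at one regardless of how many columns the matrix has.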


Source: https://stackoverflow.com/questions/59878040/how-to-divide-specific-column-with-rest-of-columns
