Memory efficient transpose - Awk

南笙 2021-01-20 09:48

I am trying to transpose a table (10K rows x 10K columns) using the following script.

A simple data example

$ cat rm1

        t1      t2      t         


        
2 answers
  • 2021-01-20 10:10

    Here's one way to do it in chunks, as I mentioned in my comments. Here I show the mechanics on a tiny 12-row x 10-column file, but I also ran a chunk of 1000 rows on a 10K x 10K file in not much more than a minute (Mac Powerbook).

    EDIT: The following was updated to handle an M x N matrix with unequal numbers of rows and columns. The previous version only worked for an N x N matrix.

    $ cat et.awk
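    BEGIN {
        # First and last input column (output row) handled by this run
        start = chunk_start
        limit = chunk_start + chunk_size - 1
    }
    
    {
        # Store only this chunk's columns, clamped to the fields present
        n = (limit > NF) ? NF : limit
        for (f = start; f <= n; f++) {
            a[NR, f] = $f
        }
    }
    
    END {
        # NF here is the field count of the last record read
        n = (limit > NF) ? NF : limit
    
        # Each stored column becomes one output row; an explicit format
        # string keeps a "%" in the data from being read as a format
        for (f = start; f <= n; f++)
          for (r = 1; r <= NR; r++)
            printf "%s%s", a[r, f], (r == NR ? ORS : OFS)
    }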
    
    
    $ cat t.txt
    10 11 12 13 14 15 16 17 18 19
    20 21 22 23 24 25 26 27 28 29 
    30 31 32 33 34 35 36 37 38 39 
    40 41 42 43 44 45 46 47 48 49 
    50 51 52 53 54 55 56 57 58 59 
    60 61 62 63 64 65 66 67 68 69 
    70 71 72 73 74 75 76 77 78 79 
    80 81 82 83 84 85 86 87 88 89 
    90 91 92 93 94 95 96 97 98 99 
    A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 
    B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 
    C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 
    
    
    $ cat et.sh
    inf=$1
    outf=$2
    
    rm -f "$outf"
    # t.txt has only 10 columns, so the last chunk (11-12) emits nothing
    for i in $(seq 1 2 12); do
        echo chunk for rows $i $((i + 1))
        awk -v chunk_start="$i" -v chunk_size=2 -f et.awk "$inf" >> "$outf"
    done
    
    
    
    $ sh et.sh t.txt t-transpose.txt
    chunk for rows 1 2
    chunk for rows 3 4
    chunk for rows 5 6
    chunk for rows 7 8
    chunk for rows 9 10
    chunk for rows 11 12
    
    
    $ cat t-transpose.txt 
    10 20 30 40 50 60 70 80 90 A0 B0 C0
    11 21 31 41 51 61 71 81 91 A1 B1 C1
    12 22 32 42 52 62 72 82 92 A2 B2 C2
    13 23 33 43 53 63 73 83 93 A3 B3 C3
    14 24 34 44 54 64 74 84 94 A4 B4 C4
    15 25 35 45 55 65 75 85 95 A5 B5 C5
    16 26 36 46 56 66 76 86 96 A6 B6 C6
    17 27 37 47 57 67 77 87 97 A7 B7 C7
    18 28 38 48 58 68 78 88 98 A8 B8 C8
    19 29 39 49 59 69 79 89 99 A9 B9 C9
    

    And then running the first chunk on the huge file looks like:

    $ time awk -v chunk_start=1 -v chunk_size=1000 -f et.awk tenk.txt  > tenk-transpose.txt
    
    real    1m7.899s
    user    1m5.173s
    sys     0m2.552s
    

    Doing that ten times with the next chunk_start set to 1001, etc. (and appending with >> to the output, of course) should finally give you the full transposed result.
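
    The repeated runs can be driven by a loop rather than by hand. The following is a minimal sketch, assuming a rectangular, whitespace-separated input. transpose_chunked is a hypothetical helper name, and the awk program is inlined (with its own variable names start and size) so that the function is self-contained instead of calling et.awk:

```shell
#!/bin/sh
# Sketch: transpose a matrix in column chunks so that only chunk_size
# input columns are held in the awk array during any one pass.
# transpose_chunked is a hypothetical name used for illustration.
transpose_chunked() {
    inf=$1
    outf=$2
    chunk_size=$3

    # Column count from the first line; assumes every row has the same NF.
    ncols=$(awk 'NR == 1 { print NF; exit }' "$inf")
    ncols=${ncols:-0}    # empty input: nothing to transpose

    : > "$outf"          # start from an empty output file
    start=1
    while [ "$start" -le "$ncols" ]; do
        awk -v start="$start" -v size="$chunk_size" '
            {
                # Clamp the chunk to the fields actually present
                last = start + size - 1
                if (last > NF) last = NF
                for (f = start; f <= last; f++)
                    a[NR, f] = $f
            }
            END {
                # last still holds the bound computed for the final row
                for (f = start; f <= last; f++)
                    for (r = 1; r <= NR; r++)
                        printf "%s%s", a[r, f], (r == NR ? ORS : OFS)
            }
        ' "$inf" >> "$outf"
        start=$((start + chunk_size))
    done
}
```

    For the 10K x 10K case this would be called as transpose_chunked tenk.txt tenk-transpose.txt 1000. Larger chunks mean fewer passes over the input but more cells held in memory at once.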

  • 2021-01-20 10:28

    There is a simple and quick algorithm based on sorting:

    1) Make a pass through the input, prepending the column number and row number to each field. The output is a three-tuple of column, row, value for each cell in the matrix. Write the output to a temporary file.

    2) Sort the temporary file by column, then row.

    3) Make a pass through the sorted temporary file, reconstructing the transposed matrix.

    The two outer passes are done by awk. The sort is done by the system sort. Here's the code:

    $ echo '1 2 3
    2 3 44
    1 1 1' |
    awk '{ for (i=1; i<=NF; i++) print i, NR, $i}' |
    sort -k1,1n -k2,2n |
    awk ' NR>1 && $2==1 { print "" }; { printf "%s ", $3 }; END { print "" }'
    1 2 1
    2 3 1
    3 44 1
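
    The three steps can also be wrapped in a function that reads from a file. This is a sketch under assumptions: transpose_sorted is a hypothetical name, the input is rectangular and whitespace-separated, and the two numeric sort keys (column, then row) are spelled out so that the ordering still holds once the indices reach two digits. The second awk pass starts a new output line whenever the column index changes:

```shell
#!/bin/sh
# Sketch: sort-based transpose. Memory use in awk stays constant; the
# heavy lifting is left to sort (implementations such as GNU sort spill
# to temporary files for large inputs).
# transpose_sorted is a hypothetical name used for illustration.
transpose_sorted() {
    # Pass 1: emit "column row value" for every cell
    awk '{ for (i = 1; i <= NF; i++) print i, NR, $i }' "$1" |
    # Order by column, then by row, both compared numerically
    LC_ALL=C sort -k1,1n -k2,2n |
    # Pass 2: rebuild rows; a change in column index starts a new line
    awk '
        NR > 1 { printf "%s", ($1 == prev ? " " : "\n") }
        { printf "%s", $3; prev = $1 }
        END { if (NR > 0) print "" }
    '
}
```

    Called as transpose_sorted t.txt > t-transpose.txt, this writes the transposed matrix without trailing spaces on each line.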
    