Why does awk seem to randomize the array?

忘了有多久 2020-11-29 12:36

If you look at the output of this awk test, you see that the array in awk seems to be printed in some random pattern. It seems to be in same o

4 answers
  • 2020-11-29 13:12

    Awk implements associative arrays with hash tables, and the unpredictable order is an inherent property of that data structure. Where a particular element is stored depends on the hash of its key. The hash table implementation matters too: if it is memory efficient, it limits the range of slots each key can land in, using the modulus operation or some other method. Different keys can also produce clashing hash values, in which case chaining occurs, and the order then also depends on which key was inserted first.

    The construct for (key in array) is perfectly fine when used appropriately to loop over every key, but you cannot count on the order, and you should not update the array while in the loop, as you may end up processing array[key] multiple times by mistake.
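
    One way to stay safe, sketched here under the assumption that you need to add elements while walking the array, is to snapshot the keys first and then loop over the snapshot. The names `seen`, `keys`, and the `_copy` suffix are just illustrative.

```shell
total=$(printf '%s\n' a b c | awk '
{ seen[$1] = 1 }
END {
    # First pass: copy the current keys into a numerically indexed list.
    n = 0
    for (k in seen) keys[++n] = k
    # Second pass: iterate over the snapshot, so adding elements is safe.
    for (i = 1; i <= n; i++) seen[keys[i] "_copy"] = 1
    count = 0
    for (k in seen) count++
    print count
}')
printf '%s\n' "$total"
```

    Each original key is processed exactly once, because the second loop never looks at the elements added along the way.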

    There is a good description of hash tables in the book Think Complexity.

  • 2020-11-29 13:18

    The issue is the operator you use to get the array indices, not the fact that the array is stored in a hash table.

    The in operator provides the array indices in a random(-looking) order (which IS by default related to the hash table but that's an implementation choice and can be modified).

    A for loop that explicitly provides the array indices in numerically increasing order operates on the same hash table that the in operator works on, but it produces output in a specific order regardless.

    It's just 2 different ways of getting the array indices, both of which work on a hash table.

    Run man awk and look up the in operator.

    If you want to control the output order using the in operator, you can do so with GNU awk (from release 4.0 on) by populating PROCINFO["sorted_in"]. See http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal for details.
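
    A minimal sketch of that, assuming gawk 4.0 or later is installed as `gawk`. The value "@ind_num_asc" is one of the predefined traversal orders described on the manual page linked above; the sample input and the fallback message are made up for the example.

```shell
if command -v gawk >/dev/null 2>&1; then
    sorted=$(printf '10 a\n2 b\n3 c\n' | gawk '
    BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }  # traverse indices in ascending numeric order
    { a[$1] = $2 }
    END { for (i in a) print i, a[i] }')
else
    sorted="gawk not found"   # the feature is gawk-specific
fi
printf '%s\n' "$sorted"
```

    Other awk implementations ignore the PROCINFO setting, so this only makes sense in scripts that are known to run under gawk.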

    Some common ways to access array indices:

    To print array elements in an order you don't care about:

    {a[$1]=$0} END{for (i in a) print i, a[i]}
    

    To print array elements in numeric order of indices if the indices are numeric and contiguous starting at 1:

    {a[++i]=$0} END{for (i=1;i in a;i++) print i, a[i]}
    

    To print array elements in numeric order of indices if the indices are numeric but non-contiguous:

    {a[$1]=$0; min=(NR==1||$1+0<min?$1+0:min); max=(NR==1||$1+0>max?$1+0:max)} END{for (i=min;i<=max;i++) if (i in a) print i, a[i]}
    

    To print array elements in the order they were seen in the input:

    {if (!($1 in a)) b[++max]=$1; a[$1]=$0} END{for (i=1;i <= max;i++) print b[i], a[b[i]]}
    

    To print array elements in a specific order of indices using gawk 4.0+:

    BEGIN{PROCINFO["sorted_in"]=whatever} {a[$1]=$0} END{for (i in a) print i, a[i]}
    

    For anything else, write your own code and/or see gawk asort() and asorti().
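
    If the script has to run on any awk, one portable option (a sketch, not part of the list above) is to leave the for (i in a) order alone and pipe the output through the standard sort utility. The sample input here is made up.

```shell
ordered=$(printf '10 a\n2 b\n3 c\n' | awk '
{ a[$1] = $2 }
END { for (i in a) print i, a[i] }' | sort -n)
printf '%s\n' "$ordered"
```

    Here sort -n orders the lines by their leading numeric index, whatever order awk happened to print them in.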

  • 2020-11-29 13:27

    If you are using gawk or mawk, you can also set the environment variable WHINY_USERS, which makes awk sort the indices before iterating.

    Example:

    $ echo "one two three four five six" | WHINY_USERS=true awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j in a) print j,a[j]}'
    1 one
    2 two
    3 three
    4 four
    5 five
    6 six
    

    From mawk's manual:

    WHINY_USERS

    This is an undocumented gawk feature. It tells mawk to sort array indices before it starts to iterate over the elements of an array.

  • 2020-11-29 13:31

    From section 8.5, Scanning All Elements of an Array, in chapter 8, Arrays in awk, of the GNU Awk User's Guide, referring to the for (var in array) syntax:

    The order in which elements of the array are accessed by this statement is determined by the internal arrangement of the array elements within awk and cannot be controlled or changed. This can lead to problems if new elements are added to array by statements in the loop body; it is not predictable whether or not the for loop will reach them. Similarly, changing var inside the loop may produce strange results. It is best to avoid such things.


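    So if you want to print the array in the order you stored it, you have to use a classic counting for loop. Note that in the END block NF holds the field count of the last record read, so this form suits single-line input: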

    for (j=1; j<=NF; j++) print j,a[j]
    

    Example:

    $ awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j=1; j<=NF; j++) print j,a[j]}' <<< "P04637 1A1U 1AIE 1C26 1DT7 1GZH 1H26 1HS5 1JSP 1KZY 1MA3 1OLG 1OLH 1PES 1PET 1SAE 1SAF 1SAK 1SAL 1TSR 1TUP 1UOL 1XQH 1YC5 1YCQ"
    1 P04637
    2 1A1U
    3 1AIE
    4 1C26
    5 1DT7
    6 1GZH
    7 1H26
    8 1HS5
    9 1JSP
    10 1KZY
    11 1MA3
    12 1OLG
    13 1OLH
    14 1PES
    15 1PET
    16 1SAE
    17 1SAF
    18 1SAK
    19 1SAL
    20 1TSR
    21 1TUP
    22 1UOL
    23 1XQH
    24 1YC5
    25 1YCQ
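
    For input that spans several lines, NF no longer matches the number of stored elements. A small variation, sketched here with a hypothetical counter `n` and made-up input, keeps its own count instead:

```shell
result=$(printf 'a b c\nd e\n' | awk '
{ for (i = 1; i <= NF; i++) a[++n] = $i }   # n counts every stored element
END { for (j = 1; j <= n; j++) print j, a[j] }')
printf '%s\n' "$result"
```

    The counting loop then runs from 1 to n and prints the elements in exactly the order they were stored.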
    