Question
How can I use AWK to compute the median of a column of numerical data?
I can think of a simple algorithm, but I can't seem to program it.
What I have so far is:
sort | awk 'END{print NR}'
This gives me the number of elements in the column. I'd like to use it to print a particular row, NR/2. If NR/2 is not an integer, I round up to the nearest integer and that row is the median; otherwise I take the average of rows (NR/2)+1 and (NR/2)-1.
Answer 1:
This awk program assumes one column of numerically sorted data:
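# median.awk: print the median of column 1 (run with: awk -f median.awk).
# Input must already be sorted numerically, e.g. with sort -n.
{
    count[NR] = $1
}
END {
    if (NR % 2) {
        # Odd count: the single middle value.
        print count[(NR + 1) / 2]
    } else {
        # Even count: mean of the two middle values.
        print (count[NR / 2] + count[NR / 2 + 1]) / 2.0
    }
}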
Sample usage:
sort -n data_file | awk -f median.awk
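As a quick sanity check of the script above, here is a minimal sketch that inlines the same logic in a shell function and feeds it an odd and an even number of values. The function name median_col1 is made up for illustration, and the empty-input guard is an addition that is not part of the original script.

```shell
# median_col1 is a hypothetical helper name, not part of the original answer.
# It sorts column 1 numerically, then applies the same odd/even logic.
median_col1() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit                      # added guard: no input, print nothing
      if (NR % 2) print v[(NR + 1) / 2]      # odd count: middle value
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2   # even count: mean of the two middle values
    }'
}

printf '%s\n' 3 1 2   | median_col1   # prints 2
printf '%s\n' 4 1 3 2 | median_col1   # prints 2.5
```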
Answer 2:
With awk you have to store the values in an array and compute the median at the end. Assuming the values are in the first column:
sort -n file | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
That prints the upper of the two middle values when the count is even. For a true median, average the two middle values in that case:
sort -n file | awk ' { a[i++]=$1; }
END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
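If the numbers are not in the first column, one possible approach is to extract the column first and then reuse the same logic. This is only a sketch: the function name median_of_column and its argument are hypothetical, and it assumes whitespace-separated fields with a number in the chosen column on every line.

```shell
# median_of_column is a hypothetical helper; its argument is the
# 1-based column number and defaults to 1.
median_of_column() {
  col="${1:-1}"
  # Pull out the chosen column, sort it numerically, then take the median.
  awk -v col="$col" '{ print $col }' | sort -n | awk '
    { a[NR] = $1 }
    END {
      if (NR == 0) exit                  # no input: print nothing
      mid = int((NR + 1) / 2)
      if (NR % 2) print a[mid]           # odd count: middle value
      else        print (a[mid] + a[mid + 1]) / 2   # even count: mean of the two middle values
    }'
}

printf '%s\n' 'x 30' 'y 10' 'z 20' 'w 40' | median_of_column 2   # prints 25
```

Extracting the column before sorting avoids having to make sort and awk agree on how fields are split.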
Answer 3:
I just saw this topic and thought I'd add my two cents, since I looked for something similar in the past. Even though the title says awk, all the answers also use sort. Calculating the median of a column of data is easy with datamash:
> seq 10 | datamash median 1
5.5
Note that sort is not needed, even if the column is unsorted (gshuf here is GNU shuf as commonly installed on macOS; on Linux it is plain shuf):
> seq 10 | gshuf | datamash median 1
5.5
The documentation lists all the functions it can perform, with good examples for files with many columns. It has nothing to do with awk, but I think datamash is a great help in cases like this, and it can also be used in conjunction with awk. Hope it helps somebody!
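Since the awk answers above all pipe through sort, here is a rough sketch of doing it in awk alone, keeping the array sorted with an insertion sort as lines arrive. The function name median_awk_only is invented for illustration, it assumes every line has a number in column 1, and the sort is quadratic in the number of lines, so it is only reasonable for small inputs.

```shell
# median_awk_only is a hypothetical helper: no external sort is used.
median_awk_only() {
  awk '
    {
      x = $1 + 0                         # force numeric comparison
      # Shift larger values up by one slot, then drop x into the gap.
      for (j = NR - 1; j >= 1 && v[j] > x; j--) v[j + 1] = v[j]
      v[j + 1] = x
    }
    END {
      if (NR == 0) exit                  # no input: print nothing
      if (NR % 2) print v[(NR + 1) / 2]  # odd count: middle value
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2   # even count: mean of the two middle values
    }'
}

printf '%s\n' 5 6 4 2 7 9 3 1 8 | median_awk_only   # prints 5
printf '%s\n' 10 1 3 2          | median_awk_only   # prints 2.5
```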
Answer 4:
This AWK-based answer to a similar question on unix.stackexchange.com gives the same results as Excel for calculating the median.
Answer 5:
If the values are in a Bash array (this uses a one-liner version of Johnsyweb's solution):
array=(5 6 4 2 7 9 3 1 8) # numbers 1-9
IFS=$'\n'
median=$(sort -n <<< "${array[*]}" | awk '{arr[NR]=$1} END {if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}')
unset IFS
Source: https://stackoverflow.com/questions/6166375/median-of-column-with-awk