How do I calculate the standard deviation in my shell script?

Submitted by 非 Y 不嫁゛ on 2019-12-08 09:13:15

Question


I have a shell script:

dir=$1
cd "$dir"
grep -P -o '(?<=<rating>).*' * |
awk -F: '{A[$1]+=$2;L[$1]++;next}
END{for(i in A){print i, A[i]/L[i]}}' | sort -nr -k2 |
awk '{ sub(/\.dat/, " "); print }'

which sums up all of the numbers that follow the <rating> field in each file of my folder and prints the average per file. Now I need to calculate the standard deviation of the numbers rather than the average, by summing the squared difference of each rating from the mean and then dividing by the sample size minus 1. I do not need to do this for every file in the folder, only for 2 specific files, hotel_188937.dat and hotel_203921.dat. Here is an example of the contents of one of these files:

<Overall Rating>
<Avg. Price>$155
<URL>

<Author>Jeter5
<Content>I hope we're not disappointed! We enjoyed New Orleans...
<Date>Dec 19, 2008
<No. Reader>-1
<No. Helpful>-1
<rating>4
<Value>-1
<Rooms>3
<Location>5
<Cleanliness>3
<Check in / front desk>5
<Service>5
<Business service>5

<Author>...
repeat fields again...

The sample size of the first file is 127 with a mean of 4.78, compared with a sample size of 324 and a mean of 4.78 for the second file. Is there any way that I can alter my script to calculate the standard deviation for these two specific files rather than calculating the average for every file in my directory? Thanks for your time.


Answer 1:


You can do it all in one awk script:

$ awk -F'>' '
    $1=="<rating" {k=FILENAME;sub(/\.dat$/,"",k);
                   s[k]+=$2;ss[k]+=$2^2;c[k]++}
               END{for(i in s) {m=s[i]/c[i];
                   print i,m,sqrt(ss[i]/c[i]-m^2)}}' r1.dat r2.dat

r1 2.5 1.11803
r2 3 1.41421

Here s is the sum, ss the sum of squares, c the count, and m the mean. Note that this computes the population standard deviation, not the sample standard deviation. For the latter, scale the variance by count/(count-1) before taking the square root.
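
The (count-1) adjustment could look roughly like the sketch below. It is not part of the original answer: the function name rating_sd and the guards for fewer than two ratings and for negative rounding error are assumptions added for illustration. It prints the file name (without .dat), the mean, and the sample standard deviation for each file passed in.

```shell
# Hypothetical helper (name is an assumption): prints "name mean sample_sd"
# for every file given as an argument.
rating_sd() {
    awk -F'>' '
        $1 == "<rating" {
            k = FILENAME
            sub(/\.dat$/, "", k)   # strip the .dat suffix for display
            s[k]  += $2            # running sum of ratings
            ss[k] += $2 * $2       # running sum of squared ratings
            c[k]++                 # number of ratings seen
        }
        END {
            for (i in s) {
                if (c[i] < 2) continue   # sample SD is undefined for fewer than 2 values
                m = s[i] / c[i]
                # sample variance: sum of squared deviations divided by (count - 1)
                v = (ss[i] - c[i] * m * m) / (c[i] - 1)
                if (v < 0) v = 0         # guard against tiny negative rounding error
                print i, m, sqrt(v)
            }
        }' "$@"
}
```

For the two files in the question, this would be called as rating_sd hotel_188937.dat hotel_203921.dat from inside the data directory.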




Answer 2:


Yes.

The * in the grep line tells it to search in all the files.

Change the line

grep -P -o '(?<=<rating>).*' * | 

to

grep -P -o '(?<=<rating>).*' hotel_188937.dat hotel_203921.dat | 


Source: https://stackoverflow.com/questions/35628103/how-do-i-calculate-the-standard-deviation-in-my-shell-script
