shell: select unique row of flat file


Question


I have a flat file that looks like this:

cat file

ID1, VALUE1_1
ID1, VALUE1_2
ID1, VALUE1_3
ID2, VALUE2_1
ID2, VALUE2_1
ID3, VALUE3_1
ID3...

As you can see from the sample data, each ID has several values, and those values may or may not be the same. I don't care which value gets picked up; any value works for me.

So I only want one value for each ID. I don't really care which one, but if I had to choose, I would take the longest row.

ID1, VALUE1_2
ID2, VALUE2_1
ID3, VALUE3_1

It could be done in Python, but is there an easy way to do it in the shell itself? I am open to using sed or awk, but please don't write a whole paragraph of awk code.

It might look like:

# Pseudo code
# sort -k 1 file | uniq (max(length) by id)  

Thanks a lot!


Answer 1:


This will find the first line for each ID:

awk -F, '!seen[$1]++' file

Explained:

  • awk associative arrays do not have to be pre-declared, so the first time an ID is encountered, seen[$1] has the value zero (in numeric context).
  • seen[$1]++ post-increments the associative array element, so that expression evaluates to zero the first time an ID is seen, and evaluates to some positive integer any other time.
  • awk treats zero as false and any other number as true, so we negate the post-increment expression with the ! operator. Now we have a true expression only when an ID is seen for the first time: !seen[$1]++
  • awk programs look like condition1 {body1} condition2 {body2} ....
    • The body will be executed only when its corresponding condition evaluates to true.
    • If the condition is present but the body is omitted, the default action is {print}
    • To be complete: when the body is present but the condition is omitted, the default condition evaluates to true and the action is performed for every record.

To sum up, this awk program will print the current record whenever the expression evaluates to true, which will only be the first time an ID is seen.
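For example, run against the sample file from the question, it keeps the first line seen for each ID:

$ awk -F, '!seen[$1]++' file
ID1, VALUE1_1
ID2, VALUE2_1
ID3, VALUE3_1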


If you really want the longest line for each ID:

awk '
    length($2) > max[$1] {max[$1] = length($2); line[$1] = $0}
    END {for (id in line) {print line[id]}}
' file

This may shuffle the order of the IDs (associative arrays are unordered collections). You can always pipe the output into sort if that's a problem.
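For example, to get the output back in ID order, one option (a minimal sketch, assuming the IDs sort the way you want lexically) is to key the sort on the first comma-separated field:

awk '
    length($2) > max[$1] {max[$1] = length($2); line[$1] = $0}
    END {for (id in line) {print line[id]}}
' file | sort -t, -k1,1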




Answer 2:


EDIT:

Hey user84771,

So I reworked my answer completely based on what you said. It has a couple more lines in it, but hopefully this is what you're looking for:

In order to find the largest row for each ID, similar to a GROUP BY in MySQL, I would do the following.

Given the following text file:

[root@dev7 ~]# cat stackoverflow2.log 
ID1, fdsgfdsggfdsgsdfg
ID1, fdsgsdfg
ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID1, fdsgsdfg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID2, fsfgsdgf
ID3, fdgfdgdgfdggfdg
[root@dev7 ~]# 

I'd do the following:

_DATAFILE=stackoverflow2.log
# Collect the unique IDs from the first column, stripping the trailing comma,
# and put them all on one line.
_KEYS=$(awk '{ $1=$1; print $1}' ${_DATAFILE} | uniq | sed "s,\,,,g" | xargs)
_LARGEST_PER_KEY=""
echo $_KEYS
for i in ${_KEYS}; do
  # For each ID: grab its lines (anchored so ID1 does not also match ID10),
  # prefix each with its length, sort numerically, keep the longest line,
  # then strip the length prefix again.
  _LARGEST_PER_KEY="${_LARGEST_PER_KEY}\n$(grep "^$i," ${_DATAFILE} | uniq | awk '{ print length ":", $0 }' | sort -n -u | tail -1 | cut -d ":" -f2 | awk '{ $1=$1; print}')"
done;
echo -e ${_LARGEST_PER_KEY}

To explain what's happening:

  • _DATAFILE - This variable is your input file.
  • _KEYS - This variable holds the keys from the first column, deduplicated and with the comma stripped (no associated data); note that uniq only removes adjacent duplicates, so this relies on identical IDs being grouped together, as in the sample file. I used xargs to put all of the keys on a single line for the next step.

[root@dev7 ~]# _KEYS=$(awk '{ $1=$1; print $1}' ${_DATAFILE} | uniq | sed "s,\,,,g" | xargs )

[root@dev7 ~]# echo $_KEYS

ID1 ID2 ID3

  • _LARGEST_PER_KEY - This variable is going to be used for your result when we're done. We define it here before the for loop.

  • The for loop greps for the key in question (e.g. ID1), then runs the one-liner shown above to figure out which line contains the longest data value, using a numeric/unique sort so the largest ends up last. We grab that value using tail and append it to our _LARGEST_PER_KEY string. (Note: we add \n characters as delimiters.)

  • Once the for loop finishes, we echo out the results using echo -e so that the newline characters are interpreted correctly on the screen:

[root@dev7 ~]# echo -e ${_LARGEST_PER_KEY}

ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID3, fdgfdgdgfdggfdg

Note: since we sorted everything in the beginning, there should be no reason to sort again.

Clarification notes:

awk '{ $1=$1; print}' - Removes extra whitespace (at the beginning / end of the line)

uniq - Gets rid of the duplicates

awk '{ print length ":", $0 }' - Gets the length of each line and prints it as "length of line": "line text"

sort -n -u - numeric sort ( largest number is the last item ). Also ensures that the entire file is sorted uniquely if the datafile arrives unsorted. Thanks for the tip Glenn.

tail -1 - Grabs the last line, since it's the largest

cut -d ":" -f2 - If you only want the exact line, get rid of the length of the line simply return the line

awk '{ $1=$1; print}' - Removes the leading/trailing whitespace left over from the previous step
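To make the length-prefix trick concrete, here is roughly what the inner pipeline produces for ID1 on the sample log above (the leading numbers are each line's character count, assuming no trailing whitespace in the file):

[root@dev7 ~]# grep "^ID1," stackoverflow2.log | uniq | awk '{ print length ":", $0 }' | sort -n -u
13: ID1, fdsgsdfg
22: ID1, fdsgfdsggfdsgsdfg
34: ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg

tail -1 then keeps the 34-character line, and the final cut and awk strip the "34:" prefix, leaving just the original line.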

Again, I'm sure there's a way to do this that is a bit more efficient, but this is what I was able to come up with. Hope this helps!




Answer 3:


This awk script should do what you want, assuming the file is sorted:

 awk 'prev!=$1{print}{prev=$1}' datafile

Test:

$ cat datafile
ID1, VALUE1_1
ID1, VALUE1_2
ID1, VALUE1_3
ID2, VALUE2_1
ID2, VALUE2_1
ID3, VALUE3_1
$  awk 'prev!=$1{print}{prev=$1}' datafile
ID1, VALUE1_1
ID2, VALUE2_1
ID3, VALUE3_1

Explanation:

  • The prev!=$1{print} part means: if the variable prev has a different value than the first field in the record, then print the line
  • The {prev=$1} part means: Set the variable prev to the value of the first field in the record.

By default the fields are separated by whitespace (unless the -F option is used), and the records are separated by newlines.
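With this data the comma stays attached to the first field under default splitting (so $1 is "ID1,"), which is still unique per ID, so the comparison works as-is. If you prefer to compare the bare ID, an equivalent variation (a small sketch, not from the original answer) splits on the comma and space explicitly:

awk -F', ' 'prev!=$1{print}{prev=$1}' datafile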



Source: https://stackoverflow.com/questions/18110351/shell-select-unique-row-of-flat-file
