Question
I have a flat file that looks like this:
cat file
ID1, VALUE1_1
ID1, VALUE1_2
ID1, VALUE1_3
ID2, VALUE2_1
ID2, VALUE2_1
ID3, VALUE3_1
ID3...
As you can see from the sample, each ID has several values, which may or may not be the same. I don't care which value is picked; any value works for me.
So I only want one value for each ID. I don't really care which one, but if I had to choose, I would take the row with the longest length. The expected output would be:
ID1, VALUE1_2
ID2, VALUE2_1
ID3, VALUE3_1
This could be done in Python, but is there an easy way to do it in the shell itself? I'm open to using sed or awk, but please don't write a whole paragraph of awk code.
It might look like:
# Pseudo code
# sort -k 1 file | uniq (max(length) by id)
Thanks a lot!
Answer 1:
This will find the first line for each ID:
awk -F, '!seen[$1]++' file
Explained:
- awk associative arrays do not have to be pre-declared, so the first time an ID is encountered, seen[$1] has the value zero (in numeric context). seen[$1]++ post-increments the array element, so the expression evaluates to zero the first time an ID is seen, and to some positive integer every time after that.
- awk treats zero as false and any other number as true, so we negate the post-increment expression with the ! operator. Now we have an expression that is true only when an ID is seen for the first time: !seen[$1]++
- awk programs look like condition1 {body1} condition2 {body2} ...
- A body is executed only when its corresponding condition evaluates to true.
- If a condition is present but its body is omitted, the default action is {print}.
- To be complete: if a body is present but its condition is omitted, the default condition evaluates to true and the action is performed for every record.
To sum up, this awk program prints the current record whenever the expression evaluates to true, which is only the first time each ID is seen.
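The same condition-only idiom works on the whole record: replacing $1 with $0 deduplicates complete lines while preserving input order. A minimal demonstration with made-up data:

```shell
# Print each distinct line only the first time it appears, keeping input order.
printf 'a\nb\na\nb\nc\n' | awk '!seen[$0]++'
# a
# b
# c
```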
If you really want the longest line for each ID:
awk '
length($2) > max[$1] {max[$1] = length($2); line[$1] = $0}
END {for (id in line) {print line[id]}}
' file
This may shuffle the order of the IDs (awk associative arrays are unordered collections). You can always pipe the output into sort if that's a problem.
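For example, assuming the comma-separated layout from the question (the file name "file" is just the question's placeholder), piping into sort with the comma as field separator restores ID order:

```shell
# Sample data in the question's layout.
cat > file <<'EOF'
ID1, VALUE1_1
ID1, VALUE1_22
ID2, VALUE2_1
ID3, VALUE3_1
EOF

# Keep the longest value per ID, then restore ID order with sort.
awk '
length($2) > max[$1] {max[$1] = length($2); line[$1] = $0}
END {for (id in line) print line[id]}
' file | sort -t, -k1,1
# ID1, VALUE1_22
# ID2, VALUE2_1
# ID3, VALUE3_1
```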
Answer 2:
EDIT:
Hey user84771,
I reworked my answer completely based on what you said. It has a couple more lines, but hopefully this is what you're looking for:
To find the largest row for each ID, similar to a GROUP BY in MySQL, I would do the following.
Given the following text file:
[root@dev7 ~]# cat stackoverflow2.log
ID1, fdsgfdsggfdsgsdfg
ID1, fdsgsdfg
ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID1, fdsgsdfg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID2, fsfgsdgf
ID3, fdgfdgdgfdggfdg
[root@dev7 ~]#
I'd do the following:
_DATAFILE=stackoverflow2.log
_KEYS=$(awk '{ $1=$1; print $1 }' "${_DATAFILE}" | uniq | sed "s,\,,,g" | xargs)
_LARGEST_PER_KEY=""
echo "$_KEYS"
for i in ${_KEYS}; do
_LARGEST_PER_KEY="${_LARGEST_PER_KEY}\n$(grep "^$i," "${_DATAFILE}" | uniq | awk '{ print length ":", $0 }' | sort -n -u | tail -1 | cut -d ":" -f2 | awk '{ $1=$1; print }')"
done
echo -e "${_LARGEST_PER_KEY}"
To explain what's happening:
- _DATAFILE - This variable is your input file.
- _KEYS - This variable collects all of the keys from the first column (uniqued, without the associated data). I used xargs to make sure all of the keys end up on a single line for the next step.
[root@dev7 ~]# _KEYS=$(awk '{ $1=$1; print $1}' ${_DATAFILE} | uniq | sed "s,\,,,g" | xargs )
[root@dev7 ~]# echo $_KEYS
ID1 ID2 ID3
- _LARGEST_PER_KEY - This variable will hold the result when we're done. We define it here, before the for loop.
The for loop greps for the key in question (e.g. ID1), then runs the one-liner above to work out which matching row contains the longest value, using a numeric sort so the largest ends up last. We grab that line with tail and append it to our _LARGEST_PER_KEY string. (Note: we add \n characters as delimiters.)
Once the for loop finishes, we echo the results with echo -e so that the newline characters are rendered correctly on screen:
[root@dev7 ~]# echo -e ${_LARGEST_PER_KEY}
ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID3, fdgfdgdgfdggfdg
Note: since we sorted everything in the beginning, there should be no reason to sort again.
Clarification notes:
awk '{ $1=$1; print }' - normalizes whitespace (trims leading/trailing spaces)
uniq - gets rid of adjacent duplicates
awk '{ print length ":", $0 }' - prints each line prefixed with its length, as "length of line": "line text"
sort -n -u - numeric sort (the largest number ends up last). The -u also ensures unique, sorted output in case the data file arrives unsorted. Thanks for the tip, Glenn.
tail -1 - grabs the last line, since it's the longest
cut -d ":" -f2 - strips the length prefix and returns just the line
awk '{ $1=$1; print }' - normalizes whitespace again (trims the leading space left by cut)
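The core of that pipeline can be seen end-to-end on a single key; a sketch with made-up data for one ID:

```shell
# Rank the lines for one ID by length and keep only the longest.
printf 'ID1, short\nID1, a much longer value\n' \
  | awk '{ print length ":", $0 }' \
  | sort -n -u \
  | tail -1 \
  | cut -d ":" -f2 \
  | awk '{ $1=$1; print }'
# ID1, a much longer value
```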
Again, I'm sure there's a more efficient way to do this, but this is what I was able to come up with. Hope this helps!
Answer 3:
This awk script should do what you want, assuming the file is sorted:
awk 'prev!=$1{print}{prev=$1}' datafile
Test:
$ cat datafile
ID1, VALUE1_1
ID1, VALUE1_2
ID1, VALUE1_3
ID2, VALUE2_1
ID2, VALUE2_1
ID3, VALUE3_1
$ awk 'prev!=$1{print}{prev=$1}' datafile
ID1, VALUE1_1
ID2, VALUE2_1
ID3, VALUE3_1
Explanation:
- The prev!=$1{print} part means: if the variable prev has a different value than the first field of the record, print the line.
- The {prev=$1} part means: set the variable prev to the value of the first field of the record.
By default the fields are separated by whitespace (unless the -F option is used), and the records are separated by newlines.
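With -F you can make $1 the bare ID rather than "ID1," with a trailing comma; a sketch of the same one-liner with an explicit separator (the file name "datafile" is just this answer's placeholder):

```shell
# Sample data in the question's layout.
cat > datafile <<'EOF'
ID1, VALUE1_1
ID1, VALUE1_2
ID2, VALUE2_1
EOF

# Same first-line-per-ID logic, but split on ", " so $1 is "ID1" rather than "ID1,".
awk -F', ' 'prev != $1 {print} {prev = $1}' datafile
# ID1, VALUE1_1
# ID2, VALUE2_1
```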
Source: https://stackoverflow.com/questions/18110351/shell-select-unique-row-of-flat-file