Question
I need to grep for a term over thousands of files in S3 and list the matching file names in an output file. I'm quite new to the CLI, so I've been testing both locally and on a small subset in S3.
So far I've got this:
aws s3 cp s3://mybucket/path/to/file.csv - | grep -iln searchterm > output.txt
The problem is the hyphen: since I'm copying the file to standard output, the -l switch in grep returns (standard input) instead of file.csv.
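You can reproduce this locally without S3; grep -l prints the placeholder (standard input) whenever it reads from a pipe rather than a named file:
echo searchterm | grep -il searchterm
# prints: (standard input)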
My desired output is
file.csv
Eventually, I'll need to iterate this over the whole bucket, and then all buckets, to get
file1.csv
file2.csv
file3.csv
But I need to get over this hurdle first. Thanks!
Answer 1:
Because you print the file to STDOUT and pipe that into grep's STDIN, grep has no idea that the original file was file.csv. If you have a long list of files, I would do:
while read -r file; do aws s3 cp "s3://mybucket/path/to/${file}" - | grep -q searchterm && echo "${file}" >> output.txt; done < files_list.txt
I cannot try it because I do not have access to an AWS S3 instance, but the trick is to run grep quietly (-q): it returns true if it finds at least one match and false otherwise, so you can then print the name of the file yourself.
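If you do not already have files_list.txt, it could be built from the bucket listing; a sketch, assuming the default aws s3 ls output (date, time, size, key name in the fourth column) and keys without spaces:
aws s3 ls s3://mybucket/path/to/ | awk '{print $4}' > files_list.txt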
EDIT: Explanation
- The while loop iterates over each line of files_list.txt.
- The aws command prints that file to stdout.
- stdout is piped to grep in quiet mode (-q), which acts as a pattern matcher, returning true if a match was found and false otherwise.
- If grep returns true, we append the name of the file (${file}) to our output file.
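To eventually cover the whole bucket, the same loop can consume a recursive listing instead of a prepared file; a sketch, again assuming keys without spaces (with --recursive the fourth column is the full key, so the prefix is dropped from the cp path):
aws s3 ls s3://mybucket/ --recursive | awk '{print $4}' | while read -r file; do aws s3 cp "s3://mybucket/${file}" - | grep -q searchterm && echo "${file}" >> output.txt; done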
EDIT2: Another solution
while read -r file; do aws s3 cp "s3://mybucket/path/to/${file}" - | sed -n '/searchpattern/{F;q}' >> output.txt; done < files_list.txt
Explanation
Steps 1 and 2 are the same; then stdout is piped to sed, which reads the input line by line until it finds the first match of the search pattern, prints the input file name (F), and quits (q), with the result redirected to the output file.
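One caveat: F is a GNU sed extension that prints the name of the current input file, and because the data arrives on a pipe here, it would print the literal - rather than the S3 key. A variant that substitutes the key itself; a sketch, assuming GNU sed and keys containing no sed metacharacters:
while read -r file; do aws s3 cp "s3://mybucket/path/to/${file}" - | sed -n "/searchpattern/{s|.*|${file}|p;q}" >> output.txt; done < files_list.txt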
Source: https://stackoverflow.com/questions/42707646/how-to-grep-a-term-from-s3-and-output-object-name