Does anybody know how to grep S3 files directly in the bucket with the aws CLI? For example, I have FILE1.csv and FILE2.csv with many rows and want to find the rows that match a given string.
You can also use the Glue/Athena combo, which lets you run the query directly within AWS. Depending on data volumes, queries can be costly and take time.
Basically
Use Athena to query, e.g.
select "$path", line from <your_table> where line like '%some%fancy%string%'
and get something like
$path                         line
s3://mybucket/mydir/my.csv    "some I did find some,yes, "fancy, yes, string"
Saves you from having to run any external infrastructure.
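If you want to drive the Athena step from the CLI as well, here is a minimal sketch. It assumes you have already defined a table (my_logs below is a hypothetical name) in the Glue Data Catalog with a single string column named line pointing at your CSV prefix, plus an S3 location for Athena query results; adjust the names to your setup.
# Submit the query (database, table and result location are assumptions)
QUERY_ID=$(aws athena start-query-execution \
  --query-string "select \"\$path\", line from my_logs where line like '%some%fancy%string%'" \
  --query-execution-context Database=my_database \
  --result-configuration OutputLocation=s3://mybucket/athena-results/ \
  --query QueryExecutionId --output text)
# Check the state, then fetch the matching rows once it reports SUCCEEDED
aws athena get-query-execution --query-execution-id "$QUERY_ID" --query QueryExecution.Status.State --output text
aws athena get-query-results --query-execution-id "$QUERY_ID" --output text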
The aws s3 cp command can send output to stdout:
aws s3 cp s3://mybucket/foo.csv - | grep 'JZZ'
The dash (-) signals the command to send the output to stdout.
See: How to use AWS S3 CLI to dump files to stdout in BASH?
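As a further usage sketch, if the object happens to be gzip-compressed (foo.csv.gz here is a hypothetical key), you can still grep it without writing anything to disk by piping through gunzip:
# Stream a compressed object, decompress on the fly, then grep
aws s3 cp s3://mybucket/foo.csv.gz - | gunzip -c | grep 'JZZ'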
You can do it locally with the following command:
aws s3 ls --recursive s3://<bucket_name>/<path>/ | awk '{print $4}' | xargs -I FNAME sh -c "echo FNAME; aws s3 cp s3://<bucket_name>/FNAME - | grep --color=always '<regex_pattern>'"
Explanation: the ls command generates a list of objects, awk extracts the key (file name) from the output, and for each key xargs downloads the object from S3 and greps its contents.
I don't recommend this approach if you have to download a lot of data from S3 (due to transfer costs). You can avoid the internet transfer costs, though, by running the command on an EC2 instance in a VPC that has an S3 VPC endpoint attached to it.
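As a rough sketch of that setup, a gateway VPC endpoint for S3 can be created like this (the VPC ID, route table ID, and region below are placeholders, not real values):
# Create a gateway endpoint so traffic to S3 stays inside AWS
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
Instances in subnets using that route table then reach S3 through the endpoint, so same-region downloads don't go over the internet.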