问题
I have a large file s3://my-bucket/in.tsv.gz
that I would like to load and process, write back its processed version to an s3 output file s3://my-bucket/out.tsv.gz
.
- How do I streamline the
in.tsv.gz
directly from s3 without loading all the file to memory (it cannot fit the memory) - How do I write the processed gzipped stream directly to s3?
In the following code, I show how I was thinking to load the input gzipped dataframe from s3, and how I would write the .tsv
if it were located locally bucket_dir_local = ./
.
import pandas as pd
import s3fs
import os
import gzip
import csv
import io
bucket_dir = 's3://my-bucket/annotations/'
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")
bucket_dir_local='./'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
with io.TextIOWrapper(f, encoding='utf-8') as wrapper:
w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'], extrasaction="ignore")
w.writeheader()
for index, row in df.iterrows():
my_dict = {"test": index, "testing": row[6]}
w.writerow(my_dict)
Edit: smart_open looks like the way to go.
回答1:
For downloading the file you can stream the S3 object directly in python. I'd recommend reading that entire post but some key lines from it
import boto3
s3 = boto3.client('s3', aws_access_key_id='mykey', aws_secret_access_key='mysecret') # your authentication may vary
obj = s3.get_object(Bucket='my-bucket', Key='my/precious/object')
import gzip
body = obj['Body']
with gzip.open(body, 'rt') as gf:
for ln in gf:
process(ln)
Unfortunately S3 doesn't support true streaming input but this SO answer has an implementation that chunks out the file and sends each chunk up to S3. While not a "true stream" it will let you upload large files without needing to keep the entire thing in memory
回答2:
Here is a dummy example to read a file from s3 and write it back to s3 using smart_open
from smart_open import open
import os
bucket_dir = "s3://my-bucket/annotations/"
with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
with open(
os.path.join(bucket_dir, "out.tsv.gz"), "wb"
) as fout:
for line in fin:
l = [i.strip() for i in line.decode().split("\t")]
string = "\t".join(l) + "\n"
fout.write(string.encode())
来源:https://stackoverflow.com/questions/65067909/how-to-stream-a-large-gzipped-tsv-file-from-s3-process-it-and-write-back-to-a