S3: How to do a partial read / seek without downloading the complete file?

Front-end · Unresolved · 3 answers · 1826 views

不思量自难忘° · asked 2020-12-02 17:12

Although they resemble files, objects in Amazon S3 aren't really "files", just like S3 buckets aren't really directories. On a Unix system I can use head to preview the first part of a file.

3 Answers
  • 2020-12-02 17:49

    The AWS .NET SDK appears to allow only fixed-ended ranges (see public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at 1000 and read to the end", but I do not believe the .NET library allows for this.
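    If you construct the HTTP request yourself, an open-ended range is easy to send. A minimal sketch in Python (the `byte_range_header` helper is my own illustration, not part of any SDK):

    ```python
    def byte_range_header(start, end=None):
        """Build an HTTP Range header value.

        end=None produces an open-ended range ("start here, read to
        the end of the object"), which HTTP allows even though a
        fixed-ended ByteRange(long, long) constructor cannot express it.
        """
        if end is None:
            return "bytes=%d-" % start
        return "bytes=%d-%d" % (start, end)

    # Open-ended: start at byte 1000 and read to the end.
    print(byte_range_header(1000))    # bytes=1000-
    # Fixed-ended: bytes 0 through 999 inclusive.
    print(byte_range_header(0, 999))  # bytes=0-999
    ```

    With boto3 you could pass the resulting string as the Range parameter of get_object; per the answer above, the .NET SDK's two-argument ByteRange forces you to supply an explicit end offset instead.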

  • 2020-12-02 17:58

    Using Python you can preview the first records of a compressed file.

    Connect using boto (the classic boto library, not boto3):

    # Connect:
    import csv
    import io
    import boto
    from boto.s3.key import Key
    from gzip import GzipFile

    s3 = boto.connect_s3()
    bucket = s3.get_bucket('my_bucket', validate=False)
    

    Read the first 20 lines from the gzip-compressed file:

    # Read the first 20 records
    limit = 20
    k = Key(bucket)
    k.key = 'my_file.gz'
    k.open()
    gzipped = GzipFile(None, 'rb', fileobj=k)
    reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
    for i, line in enumerate(reader):
        if i >= limit:
            break
        print(i, line)
    

    So it's equivalent to the following Unix command:

    zcat my_file.gz|head -20
    
  • 2020-12-02 17:59

    S3 objects can be huge, but you don't have to fetch the entire object just to read the first few bytes. The S3 API supports the HTTP Range header (see RFC 2616), which takes a byte-range argument.

    Just add a Range: bytes=0-NN header to your S3 GET request, where NN is the offset of the last byte you want (byte ranges are inclusive, so bytes=0-NN returns NN+1 bytes), and you'll fetch only those bytes rather than the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. Read the full GET Object docs in Amazon's developer documentation.
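    A sketch of the same idea with boto3 (the current AWS SDK for Python); the bucket and key names are placeholders, the helper names are mine, and you need AWS credentials configured for the request to succeed:

    ```python
    def first_n_range(nbytes):
        """Range header value for the first nbytes of an object.

        HTTP byte ranges are inclusive, so the first N bytes are
        bytes=0-(N-1).
        """
        return "bytes=0-%d" % (nbytes - 1)

    def preview_first_bytes(bucket, key, nbytes):
        """Fetch only the first nbytes bytes of an S3 object."""
        import boto3  # imported here so the helper above stays standalone
        s3 = boto3.client("s3")
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=first_n_range(nbytes))
        return resp["Body"].read()

    # head = preview_first_bytes("my_bucket", "my_file.csv", 1024)
    ```

    S3 replies with 206 Partial Content and transfers only the requested bytes, so the preview is fast regardless of the object's total size.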
