How to unload a table on RedShift to a single CSV file?

醉话见心 2021-02-02 11:36

I want to migrate a table from Amazon Redshift to MySQL, but using UNLOAD generates multiple data files, which are hard to import into MySQL directly.

Is there

5 Answers
  • 2021-02-02 12:08

    There is no way to guarantee that Redshift generates only a single output file.

    Under a standard UNLOAD, the number of output files equals the number of slices in the system, i.e. a system with 8 slices will create 8 files for a single UNLOAD command. (This is the fastest way to unload.)

    If you add the clause PARALLEL OFF to the UNLOAD command, the output is written as a single file as long as the extracted data does not exceed 6.2 GB (the maximum file size given in the UNLOAD documentation), after which Redshift automatically starts a new file.

    The same holds if you produce compressed output files. (In that case you have a better chance of getting a single output file, since each file can hold more records.)
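To make the size rule concrete, here is a minimal Python sketch that estimates how many files a PARALLEL OFF unload would produce for a given output size. The function name `estimate_serial_file_count` is hypothetical, and the cap is an assumption: it takes the 6.2 GB per-file limit from the AWS UNLOAD documentation and interprets it as 6.2 * 1024**3 bytes. Whether the cap applies before or after compression is not settled in this thread, so treat the result as a rough guide only.

```python
import math

# Assumption: 6.2 GB per-file cap, interpreted here as 6.2 * 1024**3 bytes.
MAX_FILE_BYTES = int(6.2 * 1024**3)


def estimate_serial_file_count(total_bytes: int, max_file_bytes: int = MAX_FILE_BYTES) -> int:
    """Estimate how many files a PARALLEL OFF unload writes for total_bytes of output."""
    if total_bytes <= 0 or max_file_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Files are written one after another; a new file starts once the cap is reached.
    return math.ceil(total_bytes / max_file_bytes)


if __name__ == "__main__":
    for size_gib in (1, 5, 7, 20):
        files = estimate_serial_file_count(size_gib * 1024**3)
        print(f"{size_gib} GiB of output: about {files} file(s)")
```

If the estimate is greater than 1, a more restrictive WHERE clause or a compressed unload is needed before a single file becomes likely.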

  • 2021-02-02 12:17

    This is an old question at this point, but I feel like all the existing answers are slightly misleading. If your question is, "Can I absolutely 100% guarantee that Redshift will ALWAYS unload to a SINGLE file in S3?", the answer is simply NO.

    That being said, for most cases, you can generally limit your query in such a way that you'll end up with a single file. Per the documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html), the main factor in limiting the number of files you generate is the actual raw size in bytes of your export (NOT the number of rows). The limit on the size of an output file generated by the Redshift UNLOAD command is 6.2GB.

    So if you want to try to guarantee that you get a single output file from UNLOAD, here's what you should try:

    • Specify PARALLEL OFF. Parallel is "ON" by default and will generally write to multiple files unless you have a tiny cluster (The number of output files with "PARALLEL ON" set is proportional to the number of slices in your cluster). PARALLEL OFF will write files serially to S3 instead of in parallel and will only spill over to using multiple files if you exceed the size limit.
    • Limit the size of your output. The raw size of the data must be less than 6.2GB if you want a single file, so you need a more restrictive WHERE clause or a LIMIT clause to keep the number of records down. Unfortunately, neither of these techniques is perfect, since rows can be of variable size. It's also not clear to me whether the GZIP option affects the spillover limit (that is, whether 6.2GB is the pre-GZIP or the post-GZIP size limit).

    For me, the UNLOAD command that ended up generating a single CSV file in most cases was:

    UNLOAD
    ('SELECT <fields> FROM <table> WHERE <restrict_query>')
    TO 's3://<bucket_name>/<filename_prefix>'
    CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
    DELIMITER AS ','
    ADDQUOTES
    NULL AS ''
    PARALLEL OFF;
    

    The other nice side effect of PARALLEL OFF is that it will respect your ORDER BY clause if you have one and generate the files in an order that keeps all the records ordered, even across multiple output files.
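If you issue the statement from a script, the UNLOAD command above can be assembled as a string. The following is a minimal Python sketch, not an official client API: `build_unload_sql` and its parameters are hypothetical names introduced for illustration. It relies on one documented detail, namely that single quotes inside the quoted query must be doubled. It only builds text, so pass it trusted input.

```python
def build_unload_sql(query: str, s3_prefix: str, credentials: str) -> str:
    """Return an UNLOAD statement that writes serially (PARALLEL OFF)."""
    # UNLOAD takes the query as a quoted string, so inner single quotes are doubled.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"CREDENTIALS '{credentials}'\n"
        "DELIMITER AS ','\n"
        "ADDQUOTES\n"
        "NULL AS ''\n"
        "PARALLEL OFF;"
    )


if __name__ == "__main__":
    print(
        build_unload_sql(
            "SELECT * FROM venue WHERE venuestate = 'NV' ORDER BY venueid",
            "s3://<bucket_name>/<filename_prefix>",
            "aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>",
        )
    )
```

The resulting string can then be executed through whichever SQL client you already use to connect to the cluster.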

    Addendum: There seems to be some folkloric knowledge around using LIMIT 2147483647 to force the leader node to do all the processing and generate a single output file, but this doesn't seem to be actually documented anywhere in the Redshift documentation and as such, relying on it seems like a bad idea since it could change at any time.

  • 2021-02-02 12:18

    Nope. My previous answer (that you can use a manifest to tell Redshift to direct all output to a single file) was wrong; I had used manifests for loading but not for unloading.

    There appear to be two possible ways to get a single file:

    1. Easier: Wrap a SELECT … LIMIT query around your actual output query, as per this SO answer, but this is limited to ~2 billion rows.
    2. Harder: Use the Unix cat utility to join the files together: cat File1.txt File2.txt > union.txt. This requires you to download the files from S3 first.
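For the second option, the same concatenation can be done without the cat utility. This is a minimal Python sketch that assumes the part files have already been downloaded from S3 to a local directory; `concatenate_parts` and the file names are hypothetical. It sorts the parts by name so that numbered suffixes stay in order.

```python
import shutil
from pathlib import Path
from typing import Iterable


def concatenate_parts(part_paths: Iterable[Path], output_path: Path) -> None:
    """Append each downloaded part file to output_path, in file-name order."""
    with open(output_path, "wb") as out:
        for part in sorted(part_paths):
            with open(part, "rb") as src:
                # Stream in chunks so large parts are never held in memory at once.
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # Hypothetical layout: parts were downloaded into ./unload/ beforehand.
    parts = list(Path("unload").glob("File*.txt"))
    concatenate_parts(parts, Path("union.txt"))
```

Sorting by name only preserves row order if the parts were written in order, for example by a PARALLEL OFF unload with an ORDER BY clause.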
  • 2021-02-02 12:21

    It is a bit of a workaround, but you need to make your query a subquery and include a limit. It will then output to one file. E.g.

    select * from (select * from bizdata LIMIT 2147483647);
    

    So basically you are selecting all from a limited set. That is the only way it works. 2147483647 is your max limit, as it is the largest value a signed 32-bit integer can hold (2^31 - 1).

    So the following will unload to one file:

    unload ('select * from (
        select bizid, data
        from biztable
        limit 2147483647
    )')
    to 's3://.......'
    CREDENTIALS 'aws_access_key_id=<<aws_access_key_id>>;aws_secret_access_key=<<aws_secret_access_key>>'
    csv;
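The wrapping step can be sketched in Python as a small helper that produces the inner query text for UNLOAD. `wrap_with_limit` is a hypothetical name. Note that the single-file effect of this LIMIT trick is reported by users in this thread rather than documented by AWS, so this is only an illustration of the workaround.

```python
# Largest signed 32-bit integer, the limit value used in the workaround.
MAX_LIMIT = 2**31 - 1  # 2147483647


def wrap_with_limit(inner_query: str, limit: int = MAX_LIMIT) -> str:
    """Wrap inner_query in an outer SELECT over a LIMIT-ed subquery."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    # Drop a trailing semicolon so the query can be nested inside parentheses.
    inner = inner_query.strip().rstrip(";").rstrip()
    return f"SELECT * FROM ({inner} LIMIT {limit})"


if __name__ == "__main__":
    print(wrap_with_limit("select bizid, data from biztable;"))
```

Because the limit caps the row count, tables with more than 2147483647 rows cannot be exported in full this way.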
    
  • 2021-02-02 12:30

    To unload to a single file, use PARALLEL OFF:

    unload ('select * from venue')
    to 's3://mybucket/tickit/unload/venue_' credentials 
    'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    parallel off;
    

    I also recommend using GZIP to make the file smaller to download:

    unload ('select * from venue')
    to 's3://mybucket/tickit/unload/venue_' credentials 
    'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    parallel off
    gzip;
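Once the compressed file has been downloaded, it can be read without unpacking it first. This is a minimal Python sketch; `read_unloaded_gzip` and the file name `venue_000.gz` are hypothetical. It assumes the file was unloaded with the default pipe delimiter (the commands above specify neither DELIMITER nor CSV) and that no field contains the delimiter or a line break.

```python
import csv
import gzip
from typing import Iterator, List


def read_unloaded_gzip(path: str, delimiter: str = "|") -> Iterator[List[str]]:
    """Yield rows from a gzip-compressed, delimited unload file."""
    # Text mode decompresses on the fly, so no separate gunzip step is needed.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle, delimiter=delimiter)


if __name__ == "__main__":
    # Hypothetical local file name for the downloaded unload output.
    for row in read_unloaded_gzip("venue_000.gz"):
        print(row)
```

Pass `delimiter=","` if the file was unloaded with DELIMITER AS ',' or the CSV option.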
    