Amazon Redshift: bulk insert vs COPYing from S3

小鲜肉 2021-01-30 03:06

I have a Redshift cluster that I use for an analytics application. I have incoming data that I would like to add to a clicks table. Let's say I have ~10 new 'cl

5 answers
  •  借酒劲吻你
    2021-01-30 04:03

    Redshift is an analytical database, optimized for querying millions or billions of records. It is also optimized to ingest those records very quickly through the COPY command.

    The COPY command is designed to load multiple files in parallel into the multiple nodes of the cluster. For example, on a cluster of 5 small nodes (dw2.xl), you can copy data 10 times faster if your data is split into multiple files (20, for example). There is a balance between the number of files and the number of records in each file, as each file carries a small overhead.
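
As a rough sketch of the splitting step described above, the Python below distributes rows round-robin into a fixed number of gzip files that share one prefix, and builds a single COPY statement for that prefix. The file prefix, table name, bucket and IAM role are hypothetical, the rows are assumed to be tab-delimited text, and uploading the files to S3 is left out.

```python
import gzip
import os
import tempfile


def split_into_parts(lines, num_parts):
    """Distribute lines round-robin so every part ends up a similar size."""
    parts = [[] for _ in range(num_parts)]
    for i, line in enumerate(lines):
        parts[i % num_parts].append(line)
    return parts


def write_parts(parts, directory, prefix="clicks.part"):
    """Write each non-empty part as a gzip file sharing one name prefix."""
    paths = []
    for index, part in enumerate(parts):
        if not part:
            continue
        path = os.path.join(directory, f"{prefix}.{index:04d}.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write("\n".join(part) + "\n")
        paths.append(path)
    return paths


def copy_statement(table, s3_prefix, iam_role):
    """Build one COPY that loads every object under the shared prefix."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' GZIP DELIMITER '\\t';"
    )


if __name__ == "__main__":
    rows = [f"{i}\thttp://example.com/page{i}" for i in range(1000)]
    with tempfile.TemporaryDirectory() as workdir:
        files = write_parts(split_into_parts(rows, 20), workdir)
        print(len(files), "files written")
    # Hypothetical bucket and role, for illustration only.
    print(copy_statement("clicks", "s3://example-bucket/clicks.part",
                         "arn:aws:iam::123456789012:role/example-role"))
```

The part count of 20 simply follows the example in the answer; in practice it is usually chosen as a multiple of the number of slices in the cluster so that every slice gets a similar share of the work.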

    This should lead you to a balance between the frequency of the COPY (for example, every 5 or 15 minutes rather than every 30 seconds) and the size and number of the event files.
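
One way to get that cadence is to buffer events in the application and only hand a batch to the loader when it is old enough or large enough. The class below is a minimal sketch of that idea; `MicroBatcher`, its thresholds and the `flush` callback are hypothetical names, and the callback is where writing the files and issuing the COPY would happen.

```python
import time


class MicroBatcher:
    """Collect records and flush them by age or by count, whichever comes first."""

    def __init__(self, flush, max_age_seconds=300, max_records=100_000,
                 clock=time.monotonic):
        self._flush = flush              # called with the list of buffered records
        self._max_age = max_age_seconds  # e.g. 300 s = one COPY every 5 minutes
        self._max_records = max_records
        self._clock = clock              # injectable so the class can be tested
        self._buffer = []
        self._opened_at = None

    def add(self, record):
        if not self._buffer:
            # The age of a batch is measured from its first record.
            self._opened_at = self._clock()
        self._buffer.append(record)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if the batch is old or big enough; return True if it flushed."""
        if not self._buffer:
            return False
        too_old = self._clock() - self._opened_at >= self._max_age
        too_big = len(self._buffer) >= self._max_records
        if too_old or too_big:
            batch, self._buffer = self._buffer, []
            self._flush(batch)
            return True
        return False
```

`maybe_flush` also needs to be called from a timer or the main loop, otherwise a quiet period leaves the last records sitting in the buffer.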

    Another point to consider is the 2 types of Redshift nodes available: the SSD ones (dw2.xl and dw2.8xl) and the magnetic ones (dw1.xl and dw1.8xl). The SSD ones are also faster in terms of ingestion. Since you are looking for very fresh data, you would probably prefer to run on the SSD ones, which are usually lower cost for less than 500GB of compressed data. If over time you have more than 500GB of compressed data, you can consider running 2 different clusters: one for "hot" data on SSD, holding the last week or month, and one for "cold" data on magnetic disks, holding all your historical data.

    Lastly, you don't really need to upload the data into S3, which takes up most of your ingestion time. You can copy the data directly from your servers using the SSH COPY option. See more information about it here: http://docs.aws.amazon.com/redshift/latest/dg/loading-data-from-remote-hosts.html
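
For the SSH option, COPY reads a small JSON manifest that lists the hosts to connect to and the command to run on each one, and loads whatever that command prints. The sketch below builds such a manifest and the matching COPY statement. The host addresses, command, user name, bucket and IAM role are all hypothetical. Note that the manifest file itself is still stored in S3 (only the data skips it), and each host must accept the cluster's public key.

```python
import json


def build_ssh_manifest(hosts, command, username, mandatory=True):
    """Return the manifest JSON: one entry per remote host."""
    entries = [
        {
            "endpoint": host,        # host name or IP the cluster connects to
            "command": command,      # its standard output is what gets loaded
            "mandatory": mandatory,  # fail the COPY if this host is unreachable
            "username": username,
        }
        for host in hosts
    ]
    return json.dumps({"entries": entries}, indent=2)


def ssh_copy_statement(table, manifest_s3_url, iam_role):
    """Build a COPY that reads from the hosts listed in the manifest."""
    return (
        f"COPY {table} FROM '{manifest_s3_url}' "
        f"IAM_ROLE '{iam_role}' DELIMITER '\\t' SSH;"
    )


if __name__ == "__main__":
    # Hypothetical hosts and log path, for illustration only.
    print(build_ssh_manifest(["10.0.0.11", "10.0.0.12"],
                             "cat /var/log/clicks/current.tsv", "loader"))
    print(ssh_copy_statement("clicks", "s3://example-bucket/ssh_manifest",
                             "arn:aws:iam::123456789012:role/example-role"))
```

Listing several hosts in one manifest lets the cluster pull from them in parallel, which ties in with the multiple-files advice earlier in this answer.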

    If you are able to split your Redis queues across multiple servers, or at least into multiple queues with different log files, you can probably achieve a very good ingestion speed in records per second.

    Another pattern that you may want to consider for near-real-time analytics is Amazon Kinesis, the streaming service. It allows you to run analytics on data with a delay of seconds, and at the same time prepare the data to be copied into Redshift in a more optimized way.
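
A minimal sketch of the producer side of that pattern, assuming click events are dictionaries with a `user_id` field (a hypothetical schema) and that `client` is a boto3 Kinesis client. Records are sent with `put_records` in chunks of at most 500, which is the per-request limit of that call.

```python
import json


def to_kinesis_records(clicks, key_field="user_id"):
    """Serialize click dicts into the record shape that put_records expects."""
    return [
        {
            "Data": json.dumps(click).encode("utf-8"),
            "PartitionKey": str(click[key_field]),  # decides which shard gets it
        }
        for click in clicks
    ]


def send_in_batches(client, stream_name, records, batch_size=500):
    """Send records in chunks and return how many the service rejected."""
    failed = 0
    for start in range(0, len(records), batch_size):
        response = client.put_records(
            StreamName=stream_name,
            Records=records[start:start + batch_size],
        )
        failed += response.get("FailedRecordCount", 0)
    return failed


# Usage, assuming boto3 is installed and a stream named "clicks" exists:
#   import boto3
#   client = boto3.client("kinesis")
#   send_in_batches(client, "clicks", to_kinesis_records(events))
```

A consumer on the other end of the stream can then aggregate the events into larger files and load them with COPY on the slower cadence discussed above. Rejected records are only counted here; a real producer would retry them.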
