In near-real-time analytics, why is Lambda --> Firehose --> S3 preferred over Lambda --> S3?

Submitted by 眉间皱痕 on 2021-01-04 06:38:18

Question


Many AWS reference architectures for serverless real-time analytics suggest pushing processed data from Lambda to S3 through Kinesis Firehose.

e.g. https://aws.amazon.com/blogs/big-data/create-real-time-clickstream-sessions-and-run-analytics-with-amazon-kinesis-data-analytics-aws-glue-and-amazon-athena/

Why can't we push data from Lambda to S3 directly? Isn't it better to avoid the complexity and additional cost by skipping the intermediate Kinesis Firehose component? Is there any problem with having Lambda write real-time data directly to S3?


Answer 1:


Mainly because Firehose lets you batch the data. For example, it can write only gzipped files of about 128 MB to S3: it collects incoming data until a size (or time) threshold is reached, writes the batch to S3, and then waits for the next data. If you let the Lambda write to S3 directly, you would have to do the batching yourself, which is pretty difficult when all you have are stateless Lambdas.
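To make the batching idea concrete, here is a minimal Python sketch of a size-or-age threshold buffer. It is only an illustration of the concept, not Firehose's actual implementation; the class name `ThresholdBuffer`, its parameters, and the default limits are all hypothetical. Note that it keeps state in memory, which is exactly what a stateless Lambda cannot rely on between invocations.

```python
import time


class ThresholdBuffer:
    """Collects records and flushes them as one batch when a size or age limit is hit.

    Hypothetical illustration only; this is not how Firehose is implemented.
    """

    def __init__(self, flush, max_bytes=128 * 1024 * 1024,
                 max_age_seconds=300, clock=time.monotonic):
        self._flush = flush            # callable that receives one bytes object per batch
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock
        self._records = []
        self._size = 0
        self._started = None           # when the first record of the current batch arrived

    def add(self, record: bytes) -> None:
        if self._started is None:
            self._started = self._clock()
        self._records.append(record)
        self._size += len(record)
        # The age check only runs when a record arrives; a real service
        # would also flush on a timer when no new data shows up.
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if not self._records:
            return
        self._flush(b"".join(self._records))
        self._records, self._size, self._started = [], 0, None
```

Something has to own this buffer for minutes at a time and survive across many small incoming events. With Firehose in the middle, that long-lived state is the service's problem rather than yours.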

That being said, this mainly applies if your data consists of MANY records / rows. If, on the other hand, you are basically dealing with blobs of, let's say, 50 MB of data that your Lambda outputs, then you can / should write to S3 directly, because batching may not be possible or useful in your case.
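For the large-blob case, the direct write is short. The sketch below assumes a boto3-style S3 client is passed in (in a Lambda you would typically create it with `boto3.client("s3")`); the function name `write_blob_to_s3` and the choice to gzip the payload are illustrative assumptions, not something the question prescribes.

```python
import gzip


def write_blob_to_s3(s3_client, bucket: str, key: str, payload: bytes) -> int:
    """Gzip one large payload and store it as a single S3 object.

    `s3_client` is assumed to expose boto3's `put_object` call.
    Returns the compressed size in bytes.
    """
    body = gzip.compress(payload)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentEncoding="gzip",
    )
    return len(body)
```

Each invocation already produces one reasonably sized object here, so there is nothing left for a batching layer to do.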

Whether or not you should use Firehose simply depends on what data / throughput you have and what requirements there may be.

One problem with writing real-time data to S3 directly is that if you want to query it with, e.g., Athena, you will run into a lot of trouble if you have millions of files that are a few bytes each instead of hundreds of files that are tens of MB each.
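A rough back-of-the-envelope calculation shows the scale of the difference. The workload numbers below (1,000 records per second, 200 bytes each, 64 MiB target objects) are made up for illustration, and compression is ignored.

```python
import math

SECONDS_PER_DAY = 86_400


def objects_per_day(records_per_second, bytes_per_record, target_object_bytes=None):
    """Estimate how many S3 objects one day of traffic produces.

    With no target size, every record becomes its own object (direct writes).
    Otherwise records are packed into objects of roughly `target_object_bytes`.
    """
    records = records_per_second * SECONDS_PER_DAY
    if target_object_bytes is None:
        return records
    total_bytes = records * bytes_per_record
    return math.ceil(total_bytes / target_object_bytes)


# Hypothetical workload: 1,000 records/s at 200 bytes each.
direct = objects_per_day(1_000, 200)                      # 86_400_000 objects
batched = objects_per_day(1_000, 200, 64 * 1024 * 1024)   # 258 objects
print(direct, batched)
```

A query engine that has to list and open tens of millions of tiny objects spends most of its time on per-file overhead rather than on reading data.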



Source: https://stackoverflow.com/questions/65458856/in-near-real-time-analytics-why-is-lambda-firehose-s3-preferred-over-lambda
