Read Parquet file stored in S3 with AWS Lambda (Python 3)


星月不相逢

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:

https://github.com/lambci/docker-lambda as a


4 Answers

深忆病人

2021-01-02 03:59
One can also achieve this through the AWS SAM CLI and Docker (we'll explain this requirement later).

1. Create a directory and initialize SAM
```
mkdir some_module_layer
cd some_module_layer
sam init
```
Typing the last command prompts a series of questions. One could answer as follows (I'm assuming Python 3.7, but other options are possible):

1 - AWS Quick Start Templates

8 - Python 3.7

Project name [sam-app]: some_module_layer

1 - Hello World Example

2. Modify requirements.txt file
```
cd some_module_layer
vim hello_world/requirements.txt
```
This opens requirements.txt in vim. On Windows, you could instead type code hello_world/requirements.txt to edit the file in Visual Studio Code.

3. Add pyarrow to requirements.txt

Alongside pyarrow, it also works to include pandas and s3fs. Installing pandas in the same layer keeps it from failing to recognize pyarrow as an engine for reading Parquet files.
```
pandas
pyarrow
s3fs
```
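With these three packages in the layer, a handler could read a Parquet object straight from S3 through pandas. This is only a minimal sketch, assuming the function is triggered by an S3 event notification; the helper name `s3_uri_from_event` and the returned dictionary are illustrative choices, not part of the original setup.

```python
from urllib.parse import unquote_plus


def s3_uri_from_event(event):
    """Build an s3:// URI from the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 events (spaces become '+').
    key = unquote_plus(record["object"]["key"])
    return f"s3://{bucket}/{key}"


def lambda_handler(event, context):
    # Imported lazily so the helper above stays importable without the layer.
    import pandas as pd

    uri = s3_uri_from_event(event)
    # pandas hands s3:// paths to s3fs and decodes the file with pyarrow.
    df = pd.read_parquet(uri, engine="pyarrow")
    return {"rows": len(df), "columns": list(df.columns)}
```

Passing `engine="pyarrow"` explicitly makes a missing pyarrow install fail loudly instead of silently falling back to another engine.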
4. Build with a container

Docker is required to use the option --use-container when running the sam build command. If it's the first time, it will pull the lambci/lambda:build-python3.7 Docker image.
```
sam build --use-container
rm .aws-sam/build/HelloWorldFunction/app.py
rm .aws-sam/build/HelloWorldFunction/__init__.py
rm .aws-sam/build/HelloWorldFunction/requirements.txt
```
Notice that we're keeping only the Python libraries.

5. Zip files
```
cp -r .aws-sam/build/HelloWorldFunction/ python/
zip -r some_module_layer.zip python/
```
On Windows, it would work to run Compress-Archive python/ some_module_layer.zip.
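If a single cross-platform step is preferable to cp/zip or Compress-Archive, the same archive could be produced with Python's standard zipfile module. This is a sketch under the assumption that the build output lives in .aws-sam/build/HelloWorldFunction; the function name `build_layer_zip` and its `skip` parameter are made up for illustration.

```python
import zipfile
from pathlib import Path


def build_layer_zip(build_dir, zip_path, skip=("app.py", "__init__.py", "requirements.txt")):
    """Zip build_dir so that every entry sits under python/ in the archive."""
    build_dir = Path(build_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(build_dir.rglob("*")):
            if not path.is_file():
                continue
            relative = path.relative_to(build_dir)
            # Skip only the top-level Hello World files, not package internals.
            if len(relative.parts) == 1 and relative.name in skip:
                continue
            archive.write(path, (Path("python") / relative).as_posix())
    return zip_path
```

For example, `build_layer_zip(".aws-sam/build/HelloWorldFunction", "some_module_layer.zip")` would replace the rm, cp and zip commands above. The python/ prefix matters because Lambda adds that folder of a layer to the import path.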

6. Upload zip file to AWS

The following link is useful for this.
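As an alternative to uploading through the console, the layer could be published with boto3. This is a hedged sketch, assuming AWS credentials are already configured; the function name `publish_layer` and the optional `client` argument (handy for testing) are my own additions.

```python
def publish_layer(zip_path, layer_name, runtime="python3.7", client=None):
    """Publish zip_path as a new version of a Lambda layer and return its ARN."""
    if client is None:
        import boto3  # imported lazily; requires configured AWS credentials

        client = boto3.client("lambda")
    with open(zip_path, "rb") as handle:
        content = handle.read()
    response = client.publish_layer_version(
        LayerName=layer_name,
        Content={"ZipFile": content},
        CompatibleRuntimes=[runtime],
    )
    return response["LayerVersionArn"]
```

A call such as `publish_layer("some_module_layer.zip", "some_module_layer")` would then return the ARN to attach to the function. If the archive is too large for a direct upload, `Content` also accepts `S3Bucket` and `S3Key`, so the zip can be staged in S3 first.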
后悔当初

2021-01-02 04:05

This was an environment issue (Lambda in VPC not getting access to the bucket). Pyarrow is now working.
Hopefully the question itself will give a good-enough overview on how to make all that work.
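For anyone hitting the same symptom: a Lambda inside a VPC with no route to S3 (for example, no S3 gateway endpoint or NAT) typically just hangs until the function times out. A quick check like the one below could help tell a network problem apart from a pyarrow problem. It is only a sketch; `can_reach_bucket` and its parameters are hypothetical names, and the three-second timeout is an arbitrary choice.

```python
def can_reach_bucket(bucket, client=None, timeout=3):
    """Return (ok, detail) for a HEAD request against the bucket."""
    if client is None:
        import boto3
        from botocore.config import Config

        # Short timeouts so a blocked network path fails fast instead of
        # running until the Lambda timeout.
        config = Config(connect_timeout=timeout, read_timeout=timeout,
                        retries={"max_attempts": 1})
        client = boto3.client("s3", config=config)
    try:
        client.head_bucket(Bucket=bucket)
    except Exception as exc:  # e.g. ConnectTimeoutError, EndpointConnectionError, ClientError
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

A timeout-style error here points at networking, while a `ClientError` with a 403 points at IAM or bucket policy instead.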


暗喜

2021-01-02 04:08

AWS has a project, AWS Data Wrangler, that handles this and comes with full Lambda Layers support.

The docs include a step-by-step guide for setting it up.

Code example:

import awswrangler as wr

# Write (df is an existing pandas DataFrame)
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...")

Reference
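Put together inside a Lambda function, the two calls could look roughly like this. It is a sketch, not the project's recommended handler: the event keys `source` and `destination` are hypothetical, the `dropna` step stands in for whatever processing is needed, and the optional `wr` argument exists only so the function can be exercised without the layer installed.

```python
def lambda_handler(event, context, wr=None):
    """Read a Parquet dataset, drop fully empty rows, and write it back."""
    if wr is None:
        import awswrangler as wr  # provided by the AWS Data Wrangler layer

    source = event["source"]            # hypothetical key holding an s3:// path
    destination = event["destination"]  # hypothetical key holding an s3:// path

    df = wr.s3.read_parquet(path=source)
    df = df.dropna(how="all")
    wr.s3.to_parquet(df=df, path=destination, dataset=True)
    return {"rows_written": len(df)}
```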


野趣味

2021-01-02 04:15
I was able to accomplish writing parquet files into S3 using fastparquet. It's a little tricky but my breakthrough came when I realized that to put together all the dependencies, I had to use the same exact Linux that Lambda is using.

Here's how I did it:

1. Spin up an EC2 instance using the Amazon Linux image that is used with Lambda

Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2

Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I looked for packages:
```
sudo yum list | grep python3
```
I installed:
```
python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64
```
2. Followed the instructions from here to build a zip file containing all the dependencies my script uses, by dumping them into one folder and zipping it with these commands:
```
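mkdir parquet
cd parquet
pip install -t . fastparquet
# Repeat for any other dependencies the script needs:
# pip install -t . <other-package>
# Copy the handler script in (my_script.py is a placeholder name)
cp ../my_script.py .
# Zip the folder contents, then upload the archive to Lambda
zip -r ../parquet.zip .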
```
Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, and the unzipped package cannot exceed 250 MB. If anyone knows a better way to get dependencies into Lambda, please do share.
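Checking the archive before uploading can save a failed deploy. The sketch below assumes the documented Lambda quotas of 50 MB for a directly uploaded zip and 250 MB unzipped; the function name `check_package_size` and the returned keys are illustrative.

```python
import os
import zipfile

MAX_ZIPPED_BYTES = 50 * 1024 * 1024      # assumed direct-upload limit for a zip
MAX_UNZIPPED_BYTES = 250 * 1024 * 1024   # assumed unzipped limit, layers included


def check_package_size(zip_path):
    """Return the zipped and unzipped sizes and whether each fits its limit."""
    zipped = os.path.getsize(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        unzipped = sum(info.file_size for info in archive.infolist())
    return {
        "zipped_bytes": zipped,
        "unzipped_bytes": unzipped,
        "zipped_ok": zipped <= MAX_ZIPPED_BYTES,
        "unzipped_ok": unzipped <= MAX_UNZIPPED_BYTES,
    }
```

If only the zipped size is over the limit, staging the archive in S3 and pointing Lambda at it is one way around the direct-upload cap.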

Source: Write parquet from AWS Kinesis firehose to AWS S3
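For completeness, the write step inside the function could look something like the following. This is a sketch rather than the exact script from this answer: it assumes fastparquet's `write(filename, data)` and boto3's `upload_file`, and the function name `write_parquet_to_s3` plus the injectable `write` and `s3_client` arguments are hypothetical conveniences for testing.

```python
import os
import tempfile


def write_parquet_to_s3(df, bucket, key, write=None, s3_client=None):
    """Write df as Parquet to local scratch space, then upload it to S3."""
    if write is None:
        from fastparquet import write  # bundled in the deployment zip
    if s3_client is None:
        import boto3

        s3_client = boto3.client("s3")

    # gettempdir() resolves to /tmp in Lambda, the writable scratch space.
    local_path = os.path.join(tempfile.gettempdir(), os.path.basename(key))
    write(local_path, df)
    s3_client.upload_file(local_path, bucket, key)
    return local_path
```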