How can I use an external python library in AWS Glue?

Posted by 末鹿安然 on 2021-02-07 03:59:42

Question


First Stack Overflow question here. I hope I'm doing this correctly:

I need to use an external Python library in AWS Glue. The library is called openpyxl.

I followed these directions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

However, after saving my zip file in the correct S3 location and pointing my Glue job to that location, I'm not sure what to actually write in the script.

I tried the usual import openpyxl, but that just returns the following error:

ImportError: No module named openpyxl

Obviously I don't know what to do here - also relatively new to programming so I'm not sure if this is a noob question or what. Thanks in advance!


Answer 1:


It depends on whether the job is a Spark job or a Python Shell job. For Spark, you just need to zip the library; once you point the job at the library's S3 path, the job can import it. Just make sure that the zip contains the package's __init__.py file.

For example, for the library you are trying to import, you can download openpyxl-3.0.0.tar.gz from https://pypi.org/project/openpyxl/#files, zip the openpyxl folder inside it, and store the zip in S3.
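The zipping step for a Spark job could be scripted roughly as follows. This is only a sketch: the helper name zip_package and the example paths are hypothetical, and it assumes the tarball has already been extracted so that the openpyxl folder exists locally.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip a package folder so that the folder itself sits at the archive root.

    zip_package is a hypothetical helper, not part of AWS Glue or openpyxl.
    """
    package_dir = Path(package_dir).resolve()
    # The answer above requires __init__.py to be present in the zip.
    if not (package_dir / "__init__.py").is_file():
        raise ValueError(f"{package_dir} has no __init__.py")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file() and "__pycache__" not in path.parts:
                # Store entries as <package>/..., relative to the parent folder.
                arcname = path.relative_to(package_dir.parent).as_posix()
                archive.write(path, arcname)
    return zip_path
```

With the extracted tarball in the current directory, a call such as zip_package("openpyxl-3.0.0/openpyxl", "openpyxl.zip") would produce the file to upload to S3 (both paths are assumptions about where you extracted the download).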


On the other hand, if it is a Python Shell job, a zip file will not work. You will need to create an egg file from the library. If you are using openpyxl-3.0.0, you can download it from that same website, extract everything, and run the command python setup.py bdist_egg (use python3 instead of python if you are on Python 3).

This will generate an egg file inside a dist folder, which the command also creates. You just need to put that egg file in S3 and point the Glue job's Python Libraries to that path.

If you already have the library but for some reason it has no setup.py, you must create one before you can run the command that generates the egg file. See http://www.blog.pythonlibrary.org/2012/07/12/python-101-easy_install-or-how-to-create-eggs/ for an example.
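For the case where setup.py is missing, a minimal one might look like the template in the sketch below. The helper write_minimal_setup_py and the template are illustrative assumptions, not taken from the linked article; the generated file only uses setup and find_packages from setuptools, and a real library may need more metadata than this.

```python
from pathlib import Path

# Minimal setup.py contents; name and version are filled in by the helper.
SETUP_PY_TEMPLATE = '''\
from setuptools import setup, find_packages

setup(
    name="{name}",
    version="{version}",
    packages=find_packages(),
)
'''


def write_minimal_setup_py(project_dir, name, version):
    """Write a minimal setup.py into project_dir and return its path.

    Hypothetical helper: project_dir is the folder that contains the
    package folder (for example the folder holding openpyxl/).
    """
    setup_path = Path(project_dir) / "setup.py"
    if setup_path.exists():
        raise FileExistsError(f"{setup_path} already exists; not overwriting it")
    setup_path.write_text(SETUP_PY_TEMPLATE.format(name=name, version=version))
    return setup_path
```

After writing the file, running python setup.py bdist_egg from that folder should produce the egg in dist, as described above.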




Answer 2:


As of Glue version 2.0, you can add external libraries directly using the --additional-python-modules job parameter.

For example, to update or add the scikit-learn module, use the following key/value pair:

"--additional-python-modules", "scikit-learn==0.21.3"

More details can be found in the docs.
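If you need more than one module, the value is, as far as I understand the Glue docs, a comma-separated list of pip-style requirements. A small sketch of building that key/value pair in Python follows; the helper name additional_modules_argument is hypothetical, and the openpyxl pin is just the version mentioned in the other answer.

```python
def additional_modules_argument(*requirements):
    """Build the key/value pair for --additional-python-modules.

    Hypothetical helper: joins pip-style requirement strings with commas,
    assuming Glue accepts a comma-separated list for this parameter.
    """
    cleaned = [req.strip() for req in requirements if req.strip()]
    if not cleaned:
        raise ValueError("at least one module requirement is needed")
    return {"--additional-python-modules": ",".join(cleaned)}


default_arguments = additional_modules_argument(
    "scikit-learn==0.21.3",
    "openpyxl==3.0.0",
)
print(default_arguments)
```

The resulting dictionary can be entered as a job parameter in the console or passed wherever your deployment tooling takes the job's default arguments.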




Answer 3:


You can use the following boilerplate code to include extra files as well as external libraries: https://github.com/fatangare/aws-python-shell-deploy



Source: https://stackoverflow.com/questions/58205999/how-can-i-use-an-external-python-library-in-aws-glue
