nltk dependencies in Dataflow

Submitted by 六月ゝ 毕业季﹏ on 2019-12-11 12:48:24

Question


I know that external Python dependencies can be fed into Dataflow via the requirements.txt file, and I can successfully load nltk in my Dataflow script. However, nltk often needs additional data files to be downloaded (e.g. stopwords or punkt). On a local run of the script, I can just run

import nltk

nltk.download('stopwords')
nltk.download('punkt')

and these files will be available to the script. How do I make these files available to the worker scripts as well? It seems like it would be extremely inefficient to place those commands into a DoFn/CombineFn if they only need to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.
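For illustration, newer versions of the Beam Python SDK expose a per-instance hook, DoFn.setup, which runs once per DoFn instance rather than once per element. A minimal sketch of guarding the downloads there (the DoFn name and tokenizing logic are hypothetical, and the availability of setup() depends on the SDK version):

import apache_beam as beam
import nltk

class TokenizeDoFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance on the worker, so the
        # download cost is not paid for every element.
        # Assumption: workers have outbound network access.
        nltk.download('stopwords', quiet=True)
        nltk.download('punkt', quiet=True)

    def process(self, element):
        from nltk.tokenize import word_tokenize
        yield word_tokenize(element)

This only amortizes the cost per DoFn instance; the setup.py approach in the answer below avoids doing the download at pipeline run time entirely.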

According to this, Java allows the staging of resources via the classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional Python resources; I just need nltk to find its data files.


Answer 1:


You can probably use '--setup_file setup.py' to run these custom commands; see https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies. Does this work in your case?
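The pattern on that linked page installs non-Python dependencies by hooking custom shell commands into the package build that Dataflow runs on each worker. A minimal sketch adapted to the nltk data files (the package name is hypothetical, and the exact download commands are an assumption, not part of the original answer):

# setup.py
import subprocess
from distutils.command.build import build as _build
import setuptools

# Shell commands to run on each worker while the package is installed.
CUSTOM_COMMANDS = [
    ['python', '-m', 'nltk.downloader', 'stopwords'],
    ['python', '-m', 'nltk.downloader', 'punkt'],
]

class CustomCommands(setuptools.Command):
    """Runs CUSTOM_COMMANDS as part of the build."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

class build(_build):
    # Make the standard build step trigger CustomCommands first.
    sub_commands = [('CustomCommands', None)] + _build.sub_commands

setuptools.setup(
    name='my-dataflow-job',  # hypothetical package name
    version='0.0.1',
    install_requires=['nltk'],
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)

Launching the pipeline with --setup_file ./setup.py stages this file, and each worker runs the download commands once as it is provisioned, so nltk can find stopwords and punkt without any per-element work.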



Source: https://stackoverflow.com/questions/48653847/nltk-dependencies-in-dataflow
