I am interested in using Google Cloud Dataflow to process videos in parallel. My job uses both OpenCV and TensorFlow. Is it possible to just run the workers inside a Docker instance?
One solution is to issue the pip install commands through the setup.py option listed under Non-Python Dependencies.
Doing this downloads the manylinux wheel instead of the source distribution that requirements-file processing would stage.
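A minimal sketch of that setup.py, following the custom-commands pattern from the Non-Python Dependencies documentation. The package names in CUSTOM_COMMANDS (opencv-python, tensorflow) and the project name are assumptions; substitute your job's actual dependencies:

```python
# setup.py -- sketch of the custom-install-commands pattern.
import subprocess

import setuptools
from distutils.command.build import build as _build

# Commands run once on each worker while the job is staged.
# pip fetches the prebuilt manylinux wheels, avoiding source builds.
# These package names are illustrative; use your real dependencies.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'opencv-python'],
    ['pip', 'install', 'tensorflow'],
]


class build(_build):
    """Insert the custom install step into the normal build chain."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    """Run each command in CUSTOM_COMMANDS, raising if any fails."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            print('Running command: %s' % command)
            subprocess.check_call(command)


if __name__ == '__main__':  # guard keeps the module importable
    setuptools.setup(
        name='video-processing-job',  # hypothetical package name
        version='0.0.1',
        packages=setuptools.find_packages(),
        cmdclass={'build': build, 'CustomCommands': CustomCommands},
    )
```

You then point the pipeline at this file with the `--setup_file ./setup.py` option when launching the job.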
If you have a large number of videos, you will incur the large startup cost regardless; such is the nature of grid computing in general.
The other side of this is that you could use larger machines for the job than the default n1-standard-1 workers, amortizing the download cost across fewer machines that, if the processing is coded correctly, could each handle more videos at once.
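For example, a larger machine type can be requested at launch time with standard Dataflow pipeline options. The script name, project, and bucket below are placeholders; the machine type and worker count are illustrative:

```shell
python process_videos.py \
  --runner DataflowRunner \
  --project my-project \
  --temp_location gs://my-bucket/tmp \
  --setup_file ./setup.py \
  --worker_machine_type n1-standard-8 \
  --num_workers 4
```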
It is not possible to modify or switch the default Dataflow worker container. You need to install the dependencies according to the documentation.