How can this be done with the Google Vision API, please?
Two years later, it's still the same. I am facing similar challenges and am thinking of opting for other solutions. I think custom solutions like the TensorFlow Object Detection API or the DarkNet YOLO object API would handle this job quite easily.