BigQuery streaming 'insertAll' performance with PHP

半阙折子戏 2021-01-14 08:14

We're streaming a high volume of data server-side into BigQuery using the google-api-php-client library. The streaming works fine apart from the performance.

Our l

1 answer
  • 2021-01-14 09:16

    Having read all your comments and side notes: the approach you've chosen does not scale, and won't. You need to rethink it around asynchronous processing.

    Processing I/O-bound or CPU-bound tasks in the background is now common practice in most web applications. There is plenty of software to help build background jobs, some of it based on a messaging system such as Beanstalkd.

    Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. That is exactly what Beanstalkd provides.

    Beanstalkd lets you organize jobs into tubes, with each tube corresponding to a job type.

    You need an API (the producer) that can put jobs on a tube, for example a JSON representation of each row. This was a killer feature for our use case. We have an API that receives the rows and places them on a tube; this takes just a few milliseconds, so you can achieve fast response times.
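
    To make the producer side concrete, here is a minimal sketch in Python of what "putting a row on a tube" looks like at the protocol level. It only builds the bytes for Beanstalkd's text commands (`use <tube>` and `put <pri> <delay> <ttr> <bytes>`); it does not open a socket. The tube name `bigquery_rows`, the sample row, and the default priority/TTR values are assumptions chosen for illustration, and in practice you would use an existing client library rather than hand-building commands.

```python
import json


def build_use_command(tube: str) -> bytes:
    # "use <tube>" selects the tube that later "put" commands write to.
    return f"use {tube}\r\n".encode("ascii")


def build_put_command(row: dict, priority: int = 1024, delay: int = 0, ttr: int = 60) -> bytes:
    # Serialize the row as compact JSON; this is the job body.
    body = json.dumps(row, separators=(",", ":"), sort_keys=True).encode("utf-8")
    # Header format: put <pri> <delay> <ttr> <bytes>, then the body, each CRLF-terminated.
    header = f"put {priority} {delay} {ttr} {len(body)}\r\n".encode("ascii")
    return header + body + b"\r\n"


# Hypothetical tube name and row, for illustration only.
use_cmd = build_use_command("bigquery_rows")
put_cmd = build_put_command({"user_id": 42, "event": "click"})
```

    The API endpoint would send these two commands and return as soon as Beanstalkd acknowledges the job, which is why the request itself stays fast: the slow BigQuery call happens later, in a consumer.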

    On the other side, you now have a bunch of jobs sitting on some tubes, so you need an agent. An agent (consumer) can reserve a job.

    It also helps with job management and retries. When a job is processed successfully, the consumer can delete it from the tube. If processing fails, the consumer can bury the job: it will not be pushed back to the tube, but it remains available for further inspection.

    A consumer can also release a job; Beanstalkd then pushes it back onto the tube and makes it available to another client.
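
    The reserve / delete / bury / release lifecycle described above can be sketched with a toy in-memory tube. This is not a real Beanstalkd client: `InMemoryTube`, `RetryableError`, `consume_one` and the handler are hypothetical names made up for this example, the real `release` also takes a priority and a delay, and the real `reserve` blocks until a job is available. The split between "retryable" and "permanent" failures is likewise an assumption about how you might decide between releasing and burying.

```python
from collections import deque


class RetryableError(Exception):
    """Hypothetical marker for transient failures worth retrying."""


class InMemoryTube:
    """Toy stand-in for one Beanstalkd tube; not a real client."""

    def __init__(self):
        self.ready = deque()   # jobs waiting for a consumer
        self.reserved = {}     # job id -> body, currently held by a consumer
        self.buried = {}       # job id -> body, parked for inspection
        self._next_id = 1

    def put(self, body):
        job_id = self._next_id
        self._next_id += 1
        self.ready.append((job_id, body))
        return job_id

    def reserve(self):
        job_id, body = self.ready.popleft()
        self.reserved[job_id] = body
        return job_id, body

    def delete(self, job_id):
        del self.reserved[job_id]

    def release(self, job_id):
        self.ready.append((job_id, self.reserved.pop(job_id)))

    def bury(self, job_id):
        self.buried[job_id] = self.reserved.pop(job_id)


def consume_one(tube, handler):
    """Reserve one job and settle it based on how the handler behaves."""
    job_id, body = tube.reserve()
    try:
        handler(body)
    except RetryableError:
        tube.release(job_id)   # back on the tube for another consumer
        return "released"
    except Exception:
        tube.bury(job_id)      # set aside, not retried automatically
        return "buried"
    tube.delete(job_id)        # processed, remove it for good
    return "deleted"


def handler(body):
    # Stand-in for the real work, e.g. sending the row to BigQuery.
    if body == "bad":
        raise ValueError("malformed row")
    if body == "flaky":
        raise RetryableError("temporary outage")


tube = InMemoryTube()
for body in ("ok", "flaky", "bad"):
    tube.put(body)
results = [consume_one(tube, handler) for _ in range(3)]
```

    After this run, the "ok" job is gone, the "flaky" job is back in the ready queue for another attempt, and the "bad" job sits in the buried set until someone looks at it.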

    Beanstalkd clients exist for most common languages, and a web interface can be useful for debugging.
