Python - How to - BigQuery asynchronous tasks

轻奢々 2021-02-10 13:35

This may be a dumb question, but I cannot seem to run the Python google-cloud-bigquery client asynchronously.

My goal is to run multiple queries concurrently and wait for all of them to finish.

3 Answers
  • 2021-02-10 13:54

    If you are working inside a coroutine and want to run different queries without blocking the event loop, you can use the run_in_executor function, which runs your queries in background threads without blocking the loop; a minimal sketch of that pattern follows.
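
    The sketch below is an illustration only: the client setup, the queries, and the main() entry point are assumptions, not part of the original answer.

    import asyncio
    from google.cloud import bigquery

    # Client setup assumed; use your own credentials/project.
    client = bigquery.Client()

    async def run_query(query):
        loop = asyncio.get_running_loop()
        # client.query() submits the job; only job.result() blocks,
        # so hand the blocking call to the default thread pool.
        job = client.query(query)
        return await loop.run_in_executor(None, job.result)

    async def main():
        rows1, rows2 = await asyncio.gather(run_query('SELECT 1'), run_query('SELECT 2'))
        print(list(rows1), list(rows2))

    # asyncio.run(main())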

    Make sure, though, that this is exactly what you need; jobs created to run queries in the Python API are already asynchronous and only block when you call job.result(). This means you don't need asyncio unless you are inside a coroutine.

    Here's a quick possible example of retrieving results as soon as the jobs are finished:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    import google.cloud.bigquery as bq
    
    
    client = bq.Client.from_service_account_json('path/to/key.json')
    query1 = 'SELECT 1'
    query2 = 'SELECT 2'
    
    threads = []
    results = []
    
    executor = ThreadPoolExecutor(5)
    
    # client.query() submits each job immediately; only job.result() blocks,
    # so push the blocking call onto the thread pool.
    for job in [client.query(query1), client.query(query2)]:
        threads.append(executor.submit(job.result))
    
    # Here you can run any code you like. The interpreter is free
    
    # Collect each job's rows as soon as it finishes.
    for future in as_completed(threads):
        results.append(list(future.result()))
    

    results will be:

    [[Row((2,), {'f0_': 0})], [Row((1,), {'f0_': 0})]]
    
  • 2021-02-10 13:54

    In fact, I found a way to wrap my query in an async call quite easily thanks to the asyncio.create_task() function. I just needed to wrap the job.result() call in a coroutine; here is the implementation. It does run asynchronously now.

    import asyncio
    
    from google.cloud import bigquery
    
    
    class BQApi(object):
        def __init__(self):
            self.api = bigquery.Client.from_service_account_json(BQ_CONFIG["credentials"])
    
        async def exec_query(self, query, **kwargs) -> bigquery.table.RowIterator:
            # Submitting the query does not block; the job runs server-side.
            job = self.api.query(query, **kwargs)
            task = asyncio.create_task(self.coroutine_job(job))
            return await task
    
        @staticmethod
        async def coroutine_job(job):
            # job.result() waits for the job to finish and returns its rows.
            return job.result()
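
    A possible way to call it (the queries and the asyncio.run entry point are illustrative assumptions, not from the original answer): since each job is submitted before its result is awaited, both queries can run on the BigQuery side at the same time.

    async def main():
        api = BQApi()
        rows1, rows2 = await asyncio.gather(
            api.exec_query('SELECT 1'),
            api.exec_query('SELECT 2'),
        )
        return list(rows1), list(rows2)

    # asyncio.run(main())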
    
  • 2021-02-10 14:00

    Just to share a different solution:

    import numpy as np
    from time import sleep
    
    from google.cloud import bigquery
    
    # Client setup assumed; the original snippet uses `bq` without showing it.
    bq = bigquery.Client()
    
    
    query1 = """
    SELECT
      language.name,
      AVG(language.bytes)
    FROM `bigquery-public-data.github_repos.languages`
    , UNNEST(language) AS language
    GROUP BY language.name"""
    query2 = 'SELECT 2'
    
    
    def dummy_callback(future):
        # The callback receives the finished job (a future carrying a job_id).
        global jobs_done
        jobs_done[future.job_id] = True
    
    
    # Submit both queries; BigQuery runs them concurrently server-side.
    jobs = [bq.query(query1), bq.query(query2)]
    jobs_done = {job.job_id: False for job in jobs}
    # add_done_callback fires dummy_callback once each job completes.
    for job in jobs:
        job.add_done_callback(dummy_callback)
    
    # blocking loop to wait for jobs to finish
    while not np.all(list(jobs_done.values())):
        print('waiting for jobs to finish ... sleeping for 1s')
        sleep(1)
    
    print('all jobs done, do your stuff')

    Rather than using as_completed, I prefer to use the built-in async functionality of the BigQuery jobs themselves. This also makes it possible to decompose the data pipeline into separate Cloud Functions, without having to keep the main ThreadPoolExecutor alive for the duration of the whole pipeline. Incidentally, this was the reason I was looking into this: my pipelines run longer than the maximum timeout of 9 minutes for Cloud Functions (or even 15 minutes for Cloud Run).

    The downside is that I need to keep track of all the job_ids across the various functions, but that is relatively easy to solve when configuring the pipeline by specifying inputs and outputs so that they form a directed acyclic graph; a downstream function can then re-attach to a job by its ID, as sketched below.
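
    A minimal sketch of that hand-off (fetch_results, the job_id argument, and the US location default are illustrative assumptions, not part of the original answer):

    from google.cloud import bigquery

    client = bigquery.Client()  # credentials/project assumed

    def fetch_results(job_id, location='US'):
        # Re-attach to a job submitted by an earlier function in the pipeline.
        job = client.get_job(job_id, location=location)
        if job.done():
            return list(job.result())  # finished: result() returns immediately
        return None  # still running; check again later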
