I have a big dataset in a BigQuery table (~45M rows, 13 GB of data). I would like to process that data in my Google Datalab notebook to compute some basic statistics with pandas and visualise them later with matplotlib in a Datalab cell. I don't think it is a good idea to try to load the whole dataset into a pandas DataFrame (at the very least I will run into RAM issues).
Is it possible to read data from BigQuery in batches (say 10K rows) and consume them in Datalab?
Thanks in advance!
If your purpose is to visualize the data, sampling may serve you better than loading a small batch.
You can sample your data, for example:
import google.datalab.bigquery as bq
df = bq.Query(sql='SELECT image_url, label FROM coast.train WHERE rand() < 0.01').execute().result().to_dataframe()
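Here, the WHERE rand() < 0.01 predicate keeps roughly 1% of the rows; adjust the threshold to control the sample size. The table coast.train and the columns image_url and label are just the names from this example, so substitute your own table and columns.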
Or use a convenient class:
from google.datalab.ml import BigQueryDataSet
sampled_df = BigQueryDataSet(table='myds.mytable').sample(1000)
Have you tried just iterating over the table? The Table object is an iterable that uses a paged fetcher to get data from the BigQuery table, so it is effectively streaming. The page size is 1024 rows by default.
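For reference, a minimal sketch of combining that iteration with batching, assuming a placeholder table named myds.mytable and a 10K chunk size (both are just examples); each row should come back as a mapping of column names to values, so chunks can be turned into DataFrames for partial statistics:

import google.datalab.bigquery as bq
import pandas as pd

table = bq.Table('myds.mytable')  # placeholder table name

batch = []
for row in table:  # rows are fetched lazily, one page (1024 rows by default) at a time
    batch.append(row)
    if len(batch) == 10000:
        chunk = pd.DataFrame(batch)
        # ... update your running statistics with this chunk ...
        batch = []

if batch:  # don't forget the final, partial batch
    chunk = pd.DataFrame(batch)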
Source: https://stackoverflow.com/questions/43328039/reading-batches-of-data-from-bigquery-into-datalab