Question
I have a big dataset in a BigQuery table (~45M rows, 13 GB of data). I would like to process that data in my Google Datalab notebook to compute some basic statistics with pandas and later visualise the data with matplotlib in a Datalab cell. I think it is not a good idea to try to load the whole dataset into a pandas DataFrame (at the very least, I will run into RAM issues).
Is it possible to read data from BigQuery in batches (say, 10K rows at a time) and consume it in Datalab?
Thanks in advance!
Answer 1:
If your purpose is to visualize the data, would sampling be better than loading a small batch?
You can sample your data, for example:
import google.datalab.bigquery as bq
df = bq.Query(sql='SELECT image_url, label FROM coast.train WHERE rand() < 0.01').execute().result().to_dataframe()
Or use a convenience class:
from google.datalab.ml import BigQueryDataSet
sampled_df = BigQueryDataSet(table='myds.mytable').sample(1000)
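For the visualisation step mentioned in the question, the sampled DataFrame can then be plotted directly in a Datalab cell. A minimal sketch follows; the column name label is only an illustrative placeholder (it matches the query above, not necessarily your table):
import matplotlib.pyplot as plt

# Bar chart of how often each label occurs in the 1,000-row sample
# ('label' is a placeholder column name).
sampled_df['label'].value_counts().plot(kind='bar')
plt.title('Label distribution in the sample')
plt.show()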
Answer 2:
Have you tried simply iterating over the table? The Table object is iterable: it uses a paged fetcher to pull data from the BigQuery table, so it effectively streams the rows. The page size is 1024 by default.
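Building on that, here is a minimal sketch of consuming the table in batches of roughly 10K rows, as the question asks. It assumes each iterated row comes back as a dict of column values, as this answer describes; the table name myds.mytable and the process function are placeholders:
import google.datalab.bigquery as bq
import pandas as pd

table = bq.Table('myds.mytable')  # placeholder table name

batch_size = 10000  # rows to accumulate per pandas DataFrame
batch = []

def process(df):
    # Placeholder for the per-batch statistics; here we just print the shape.
    print(df.shape)

# The iterator fetches pages lazily (1024 rows per page by default),
# so memory use is bounded by the batch size rather than the full table.
for row in table:
    batch.append(row)
    if len(batch) >= batch_size:
        process(pd.DataFrame(batch))
        batch = []

if batch:  # handle the final partial batch
    process(pd.DataFrame(batch))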
Source: https://stackoverflow.com/questions/43328039/reading-batches-of-data-from-bigquery-into-datalab