Loading text files (.txt) in Cloud Storage into a BigQuery table

♀尐吖头ヾ submitted on 2021-02-05 12:17:37

Question


I have a set of text files that are uploaded to Google Cloud Storage every 5 minutes. I want to load them into BigQuery every 5 minutes as well (since that is how often the files arrive in Cloud Storage). I know text files can't be uploaded into BigQuery directly. What is the best approach for this?

Sample of a text file

Thanks in advance.


Answer 1:


Here is an alternative approach that uses an event-based Cloud Function to load data into BigQuery. Create a Cloud Function with "Trigger Type" set to Cloud Storage. As soon as a file is uploaded to the Cloud Storage bucket, the event triggers the function, which loads the data from Cloud Storage into BigQuery.

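import pandas as pd
from google.cloud import bigquery

# Destination in "dataset.table" form; the client's default project is used.
# (Client.dataset() is deprecated, so a table ID string is passed instead.)
TABLE_ID = "bq-dataset-name.bq-table-name"

def bqDataLoad(event, context):
    """Triggered by a Cloud Storage event; loads the new file into BigQuery."""
    # The event payload carries the bucket and object name of the uploaded file.
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName

    # Reading gs:// paths with pandas requires the gcsfs package.
    dataFrame = pd.read_csv(fileName)

    bigqueryClient = bigquery.Client()
    # load_table_from_dataframe needs pyarrow to serialize the DataFrame.
    bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, TABLE_ID)
    bigqueryJob.result()  # wait for the load job to finish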
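
The function above reads only two fields from the event payload. The short sketch below shows, with a made-up event, how those fields map to the gs:// path that gets loaded. The helper names, the sample bucket and object name, and the .txt suffix filter are assumptions added for illustration; they are not part of the original answer.

```python
def gcs_uri_from_event(event):
    """Build the gs:// URI of the uploaded object from the event payload."""
    return f"gs://{event['bucket']}/{event['name']}"


def should_load(event, suffix=".txt"):
    """Hypothetical filter: only handle objects whose name ends with suffix."""
    return event["name"].lower().endswith(suffix)


# Hypothetical payload with the two keys the Cloud Function reads.
sample_event = {"bucket": "yourbucket", "name": "yourfolder/text1.txt"}

if should_load(sample_event):
    print(gcs_uri_from_event(sample_event))
```

Filtering on the object name is one way to keep the function from trying to load unrelated files that land in the same bucket.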



Answer 2:


You can take advantage of BigQuery transfers.

  1. Create an empty BigQuery table with the schema (using "Edit as text") Text:STRING
  2. Transform your .txt files into .csv files
  3. Create the BigQuery transfer from Google Cloud Storage
  4. Upload your .csv files into the GCS bucket
  5. Check whether your transfer was successful

For now, this service transfers newly added files every hour, with a minimum file age of 1 hour, a limitation that is expected to be removed soon.

The service checks for new files that were uploaded to the bucket more than 1 hour before the run. For example:

  • text1.csv was uploaded at 4:46
  • text2.csv was uploaded at 5:01
  • text3.csv was uploaded at 5:06
Results:
  • The transfer run of 5:00 will not transfer any file
  • The transfer run of 6:00 will transfer text1.csv
  • The transfer run of 7:00 will transfer text2.csv and text3.csv
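
The timing in this example can be checked with a few lines of arithmetic. The sketch below assumes hourly runs on the hour and the 1-hour minimum file age described above; the calendar date and the function name are arbitrary choices made for illustration.

```python
from datetime import datetime, timedelta

MIN_FILE_AGE = timedelta(hours=1)  # minimum file age described above
RUN_INTERVAL = timedelta(hours=1)  # hourly schedule


def first_eligible_run(uploaded_at, first_run):
    """Return the earliest scheduled run at which the file is old enough."""
    run = first_run
    while run - uploaded_at < MIN_FILE_AGE:
        run += RUN_INTERVAL
    return run


day = datetime(2021, 1, 1)  # arbitrary date, only the times matter
uploads = {
    "text1.csv": day.replace(hour=4, minute=46),
    "text2.csv": day.replace(hour=5, minute=1),
    "text3.csv": day.replace(hour=5, minute=6),
}
first_run = day.replace(hour=5)

picked_up = {
    name: first_eligible_run(uploaded_at, first_run).strftime("%H:%M")
    for name, uploaded_at in uploads.items()
}
print(picked_up)
# {'text1.csv': '06:00', 'text2.csv': '07:00', 'text3.csv': '07:00'}
```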

For step 2, you need to process your text files so that BigQuery accepts them. I think the easiest way is to use .csv files. Edit your .txt file as follows:

  • add the " character at the beginning and at the end of the text, e.g. "I am going to the market to buy vegetables."
  • 'save as' the file as text1.csv
  • give the files names that start with the same characters, e.g. text[...].csv, so that you can use wildcards
  • repeat this for your next files (text2.csv, text3.csv, text4.csv ...)

You also need to make sure of the following:

  • your text doesn't contain " characters inside the text; replace them with the ' character
  • your whole text is on a single line, as newlines (line breaks) are not supported
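
If you would rather not edit the files by hand, the same preparation can be scripted. This is a minimal sketch, assuming each .txt file holds one piece of text that should become a single row in the Text column; the function name is hypothetical, and collapsing all runs of whitespace (not just line breaks) into single spaces is a simplification.

```python
import csv
import io


def text_to_csv_row(text):
    """Turn free text into one quoted, single-line CSV row."""
    # Collapse line breaks (and any other runs of whitespace) into single spaces.
    single_line = " ".join(text.split())
    # Replace inner double quotes with single quotes, as suggested above.
    cleaned = single_line.replace('"', "'")
    buffer = io.StringIO()
    # QUOTE_ALL wraps the field in " characters at the beginning and end.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([cleaned])
    return buffer.getvalue()


row = text_to_csv_row('I am going to the market\nto buy "fresh" vegetables.')
print(row)
```

The returned string can then be written to a file named text1.csv, text2.csv, and so on, so that the wildcard in the transfer configuration matches it.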

For step 3, find below the suggested transfer configurations:

  • Schedule options: Custom --> every 1 hours
  • Cloud Storage URI: yourbucket/yourfolder/text* (the transfer will pick up all the files whose names start with text)
  • Write preference: APPEND
  • File format: CSV
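
If you create the transfer from a script or with the bq CLI (bq mk --transfer_config with --data_source=google_cloud_storage) instead of the console, the settings above are passed as a JSON parameter object. The sketch below only builds that object. The key names reflect my understanding of the Cloud Storage transfer parameters and should be treated as assumptions to check against the current documentation; the function name and the table name are placeholders.

```python
import json


def build_gcs_transfer_params(data_path, table_name):
    """Assemble the parameter object for a Cloud Storage transfer (assumed keys)."""
    return {
        "data_path_template": data_path,                # Cloud Storage URI with wildcard
        "destination_table_name_template": table_name,  # table created in step 1
        "file_format": "CSV",                           # File format
        "write_disposition": "APPEND",                  # Write preference
    }


params = build_gcs_transfer_params(
    "gs://yourbucket/yourfolder/text*",  # placeholder bucket and folder
    "bq-table-name",                     # placeholder table name
)
print(json.dumps(params, indent=2))
```

The schedule itself ("every 1 hours") is not part of this object; it is set separately on the transfer configuration.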

For step 5, check the Transfer details page each hour to verify that the transfer was successful. If there are errors, the whole batch of files will not be transferred. Use the CLI (see the command below) to find out which file has issues and the nature of the error. You will need to delete the offending file from the bucket, correct it, and re-upload it.

bq --format=prettyjson show -j [bqts_...]

Also preview your BigQuery table to see your transferred texts.



Source: https://stackoverflow.com/questions/63186809/loading-a-text-files-txt-in-cloud-storage-into-big-query-table
