data-pipeline

Copying and Extracting Zipped XML files from an HTTP Link Source to Azure Blob Storage using Azure Data Factory

徘徊边缘 submitted on 2021-02-19 08:48:05
Question: I am trying to set up an Azure Data Factory copy-data pipeline. The source is an open HTTP linked source (URL: https://clinicaltrials.gov/AllPublicXML.zip), i.e. a zipped archive containing many XML files. I want to unzip it and save the extracted XML files to Azure Blob Storage using Azure Data Factory. I tried to follow the configuration described in "How to decompress a zip file in Azure Data Factory v2", but I am getting the following error:
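
For reference, the same download-unzip-upload step can be sketched outside Data Factory in a few lines of Python. This is only an illustrative sketch, assuming the azure-storage-blob v12 SDK: the connection string and container name are placeholders rather than anything from the question, and a multi-gigabyte archive like this one would normally be streamed instead of held in memory.

    import io
    import zipfile

    import requests
    from azure.storage.blob import BlobServiceClient

    SOURCE_URL = "https://clinicaltrials.gov/AllPublicXML.zip"
    CONNECTION_STRING = "<storage-account-connection-string>"  # placeholder
    CONTAINER_NAME = "xml-files"                               # placeholder

    # Download the zipped archive from the open HTTP source.
    response = requests.get(SOURCE_URL, timeout=600)
    response.raise_for_status()

    container = BlobServiceClient.from_connection_string(
        CONNECTION_STRING
    ).get_container_client(CONTAINER_NAME)

    # Extract each XML file and upload it as its own blob.
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                container.upload_blob(name=name, data=archive.read(name), overwrite=True)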

Dataflow with python flex template - launcher timeout

£可爱£侵袭症+ submitted on 2021-02-10 05:22:50
Question: I'm trying to run my Python Dataflow job with a flex template. The job works fine locally when I run it with the direct runner (without the flex template); however, when I run it with the flex template, the job is stuck in "Queued" status for a while and then fails with a timeout. Here are some of the logs I found in the GCE console: INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template
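
For context, a flex template ultimately launches an ordinary Python Beam pipeline; the minimal sketch below shows the kind of main module the launcher runs. It is a generic Apache Beam example, not the asker's actual job, and every name in it is illustrative.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run(argv=None):
        # The flex template launcher forwards the pipeline options
        # (runner, project, region, temp_location, ...) to this entry point.
        options = PipelineOptions(argv, save_main_session=True)
        with beam.Pipeline(options=options) as p:
            (
                p
                | "Create" >> beam.Create(["hello", "flex", "template"])
                | "Upper" >> beam.Map(str.upper)
                | "Log" >> beam.Map(print)
            )

    if __name__ == "__main__":
        run()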

Google data fusion Execution error “INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0.”

梦想与她 submitted on 2020-02-24 12:20:29
Question: I am trying to load a simple CSV file from GCS to BQ using the free version of Google Data Fusion. The pipeline fails with an error that reads: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0. at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) ~[na:na] at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
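
As an aside, the 'DISKS_TOTAL_GB' figure in the error refers to the regional Compute Engine persistent-disk quota consumed by the cluster that Data Fusion provisions. A small sketch of how the current limit and usage could be inspected with the google-api-python-client; the project and region are placeholders and Application Default Credentials are assumed.

    from googleapiclient import discovery

    PROJECT = "my-project"  # placeholder
    REGION = "us-central1"  # placeholder

    # Uses Application Default Credentials.
    compute = discovery.build("compute", "v1")
    region_info = compute.regions().get(project=PROJECT, region=REGION).execute()

    # Each quota entry carries a metric name, a limit, and the current usage.
    for quota in region_info.get("quotas", []):
        if quota["metric"] == "DISKS_TOTAL_GB":
            print("DISKS_TOTAL_GB: usage=%s, limit=%s" % (quota["usage"], quota["limit"]))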

Truncate DynamoDb or rewrite data via Data Pipeline

老子叫甜甜 submitted on 2019-12-23 20:42:29
Question: It is possible to dump DynamoDB via Data Pipeline and also to import data into DynamoDB. The import works fine, but the data is always appended to the data that already exists in DynamoDB. So far I have found working examples that scan DynamoDB and delete items one by one or in batches, but for a large amount of data that is not a good option. It is also possible to delete the table entirely and recreate it, but then the indexes are lost. So the best way would be to overwrite the DynamoDB data via the import by
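
For reference, a sketch of the scan-and-batch-delete approach the question mentions (and rightly calls slow for large tables), assuming boto3 and a hypothetical table whose only key is a partition key named id:

    import boto3

    TABLE_NAME = "my-table"  # placeholder
    KEY_NAME = "id"          # placeholder partition key

    table = boto3.resource("dynamodb").Table(TABLE_NAME)

    # Scan only the key attribute and delete page by page;
    # batch_writer groups the deletes into batches of 25 behind the scenes.
    scan_kwargs = {
        "ProjectionExpression": "#k",
        "ExpressionAttributeNames": {"#k": KEY_NAME},
    }
    with table.batch_writer() as batch:
        while True:
            page = table.scan(**scan_kwargs)
            for item in page["Items"]:
                batch.delete_item(Key={KEY_NAME: item[KEY_NAME]})
            if "LastEvaluatedKey" not in page:
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]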

Is there a way to continuously pipe data from Azure Blob into BigQuery?

Deadly submitted on 2019-12-23 04:59:42
Question: I have a bunch of files in Azure Blob Storage, and new ones arrive constantly. I was wondering if there is a way for me to first take all the data I have in Blob and move it over to BigQuery, and then keep a script or some job running so that all new data in there gets sent over to BigQuery?

Answer 1: BigQuery offers support for querying data directly from these external data sources: Google Cloud Bigtable, Google Cloud Storage, Google Drive. Azure Blob Storage is not included. As Adam Lydick
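
Since the answer points at Google Cloud Storage as a supported source, the BigQuery half of that route can be sketched with the google-cloud-bigquery client; the bucket, dataset, and table names below are placeholders and the files are assumed to be CSV.

    from google.cloud import bigquery

    GCS_URI = "gs://my-bucket/incoming/*.csv"    # placeholder
    TABLE_ID = "my-project.my_dataset.my_table"  # placeholder

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # infer the schema from the files
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Start a load job from GCS and wait for it to finish.
    load_job = client.load_table_from_uri(GCS_URI, TABLE_ID, job_config=job_config)
    load_job.result()

    print("Loaded %d rows into %s" % (client.get_table(TABLE_ID).num_rows, TABLE_ID))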

How to access the response from Airflow SimpleHttpOperator GET request

别等时光非礼了梦想. submitted on 2019-12-06 00:43:30
Question: I'm learning Airflow and have a simple question. Below is my DAG, called dog_retriever:

    import airflow
    from airflow import DAG
    from airflow.operators.http_operator import SimpleHttpOperator
    from airflow.operators.sensors import HttpSensor
    from datetime import datetime, timedelta
    import json

    default_args = {
        'owner': 'Loftium',
        'depends_on_past': False,
        'start_date': datetime(2017, 10, 9),
        'email': 'rachel@loftium.com',
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 3,
        'retry
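
The usual way to get at the response of a SimpleHttpOperator is to have it push the response body to XCom and pull it in a downstream task. The sketch below follows the Airflow 1.x style the question uses; in that era the flag was xcom_push=True (newer versions use do_xcom_push), and the connection id and endpoint are made up for illustration.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.http_operator import SimpleHttpOperator
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('dog_retriever_example', start_date=datetime(2017, 10, 9),
              schedule_interval=None)

    # xcom_push=True makes the operator store the response text in XCom.
    get_dog = SimpleHttpOperator(
        task_id='get_dog',
        http_conn_id='http_default',          # assumed connection
        endpoint='api/breeds/image/random',   # illustrative endpoint
        method='GET',
        xcom_push=True,
        dag=dag)

    def print_response(**context):
        # Pull whatever get_dog pushed to XCom (the raw response body).
        print(context['ti'].xcom_pull(task_ids='get_dog'))

    show_response = PythonOperator(
        task_id='show_response',
        python_callable=print_response,
        provide_context=True,  # Airflow 1.x way to receive the context kwargs
        dag=dag)

    get_dog >> show_response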

Feeding .npy (numpy files) into tensorflow data pipeline

十年热恋 submitted on 2019-11-27 23:55:38
Question: TensorFlow seems to lack a reader for ".npy" files. How can I read my data files into the new tensorflow.data.Dataset pipeline? My data doesn't fit in memory. Each object is saved in a separate ".npy" file; each file contains two different ndarrays as features and a scalar as their label.

Answer 1: Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs: Consuming NumPy arrays. If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().
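
A minimal sketch of the in-memory case the answer quotes, assuming two hypothetical files, features.npy and labels.npy, that hold the stacked features and their labels:

    import numpy as np
    import tensorflow as tf

    # Hypothetical files: features.npy holds one row of features per example,
    # labels.npy holds one scalar label per example.
    features = np.load("features.npy")
    labels = np.load("labels.npy")

    # from_tensor_slices turns the in-memory arrays into a tf.data.Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.shuffle(buffer_size=1024).batch(32)

    for batch_features, batch_labels in dataset.take(1):
        print(batch_features.shape, batch_labels.shape)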