data-ingestion

Apache Kudu slow insert, high queuing time

两盒软妹~` Submitted on 2021-02-07 10:15:38
Question: I have been using the Spark data source to write to Kudu from Parquet, and the write performance is terrible: about 12,000 rows/second, each row roughly 160 bytes. We have 7 Kudu nodes, each with 24 cores + 64 GB RAM + 12 SATA disks. None of the resources seem to be the bottleneck: tserver CPU usage ~3-4 cores, RAM ~10 GB, no disk congestion. Still, I see that most of the time write requests are stuck queuing. Any ideas are appreciated. W0811 12:34:03.526340 7753 rpcz_store.cc:251] Call kudu.tserver
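For reference, a minimal PySpark sketch of the Parquet-to-Kudu write path described above, assuming the kudu-spark integration jar is on the classpath; the master address, table name, input path, and partition count are placeholders. Repartitioning before the write is one knob to try when the tservers look idle but requests sit in the RPC queue, since it spreads the inserts over more concurrent writer tasks.

```python
# Minimal sketch of the Parquet -> Kudu write path; all names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-kudu").getOrCreate()

df = spark.read.parquet("hdfs:///data/source.parquet")  # placeholder input path

# More write partitions -> more concurrent Kudu client sessions, which can help
# when the tablet servers are underutilized but requests queue at the RPC layer.
df = df.repartition(48)

(df.write
   .format("kudu")                              # "org.apache.kudu.spark.kudu" on older releases
   .option("kudu.master", "kudu-master:7051")   # placeholder master address
   .option("kudu.table", "impala::db.target")   # placeholder table name
   .mode("append")
   .save())
```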

Google Data Fusion execution error "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0."

梦想与她 Submitted on 2020-02-24 12:20:29
Question: I am trying to load a simple CSV file from GCS to BigQuery using the free version of Google Data Fusion. The pipeline is failing with an error that reads: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0. at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) ~[na:na] at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
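The error suggests that the ephemeral Dataproc cluster Data Fusion provisions for the run requests 3000 GB of persistent disk while the region only has 2048 GB of DISKS_TOTAL_GB quota left; shrinking the worker disk size in the compute profile or requesting a quota increase are the usual ways out. As a quick way to confirm the quota, here is a hedged sketch using the Compute Engine API via google-api-python-client; the project ID and region are placeholders and application-default credentials are assumed.

```python
# Print the regional DISKS_TOTAL_GB quota the error refers to.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")  # uses application-default credentials
region = compute.regions().get(project="my-project", region="us-central1").execute()

for quota in region.get("quotas", []):
    if quota["metric"] == "DISKS_TOTAL_GB":
        print(f"DISKS_TOTAL_GB: usage={quota['usage']}, limit={quota['limit']}")
```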

Suggested Hadoop-based Design / Component for Ingestion of Periodic REST API Calls

僤鯓⒐⒋嵵緔 Submitted on 2020-01-04 01:58:27
Question: We are planning to use REST API calls to ingest data from an endpoint and store the data in HDFS. The REST calls are made periodically (daily, or maybe hourly). I've already done Twitter ingestion using Flume, but I don't think Flume would suit my current use case, because I am not consuming a continuous data firehose like the Twitter one, but rather making discrete, regular, time-bound invocations. The idea I have right now is to use custom Java code that takes care of the REST API calls and
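The question mentions custom Java code; purely as an illustration of the same flow, here is a hedged Python sketch that pulls one response from a placeholder endpoint with requests and lands it in HDFS through the WebHDFS-based hdfs package. The endpoint, NameNode address, user, and path layout are all assumptions, and scheduling (cron, Oozie, Airflow, etc.) is left outside the script.

```python
# One periodic run: call the REST endpoint, write the raw response to HDFS.
from datetime import datetime, timezone

import requests
from hdfs import InsecureClient

API_URL = "https://api.example.com/v1/records"   # placeholder endpoint
HDFS_URL = "http://namenode:9870"                # placeholder WebHDFS address

def ingest_once():
    resp = requests.get(API_URL, timeout=60)
    resp.raise_for_status()

    # Partition the landing directory by ingestion time so each run writes its own file.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d/%H%M%S")
    target = f"/data/raw/api/{ts}.json"

    client = InsecureClient(HDFS_URL, user="hdfs")
    client.write(target, data=resp.content, overwrite=True)

if __name__ == "__main__":
    ingest_once()
```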

Python Multiprocessing Loop

非 Y 不嫁゛ Submitted on 2019-12-22 18:30:29
Question: I'm hoping to use multiprocessing to speed up a sluggish loop. However, from what I've seen of multiprocessing examples, I'm not sure whether this sort of implementation is good practice, feasible, or even possible. There are broadly two parts to the loop: data ingestion and data processing. I would like the next round of data ingestion to start while processing is going on, so the data is available as soon as possible. Pseudo code: d = get_data(n) for n in range(N): p = process_data(d) d = get
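One way to get the overlap the question asks for is to prefetch the next chunk in a worker process while the current chunk is processed in the parent. The sketch below is a minimal illustration, with get_data and process_data reduced to stubs standing in for the real ingestion and processing functions.

```python
# Overlap ingestion and processing: fetch chunk n+1 asynchronously while processing chunk n.
from multiprocessing import Pool

N = 10

def get_data(n):
    return list(range(n * 1000, (n + 1) * 1000))   # stub ingestion

def process_data(d):
    return sum(d)                                   # stub processing

if __name__ == "__main__":
    with Pool(processes=1) as pool:
        pending = pool.apply_async(get_data, (0,))  # start fetching chunk 0
        for n in range(N):
            d = pending.get()                       # wait for the current chunk
            if n + 1 < N:
                pending = pool.apply_async(get_data, (n + 1,))  # prefetch the next chunk
            p = process_data(d)                     # runs while the worker downloads
            print(n, p)
```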

Best way to validate ingested data

会有一股神秘感。 Submitted on 2019-12-13 05:19:19
Question: I am ingesting data daily from various external sources like GA, scrapers, Google BQ, etc. I store the created CSV file in HDFS, create a staging table from it, and then append it to a historical table in Hadoop. Can you share some best practices for validating new data against the historical data? For example, comparing the row count of the current load with the average of the last 10 days, or something like that. Is there any ready-made solution in Spark or similar? Thanks for any advice. Source: https://stackoverflow.com/questions
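As a sketch of the row-count check the question describes, the snippet below compares today's staged row count against the average of the last 10 daily loads kept in a small metrics table; the table names, paths, and 20% tolerance are assumptions, not a standard. Ready-made libraries such as Deequ or Great Expectations cover this kind of check as well.

```python
# Compare today's staged row count with the trailing 10-day average.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-validation").getOrCreate()

staged = spark.read.csv("hdfs:///staging/today.csv", header=True)   # placeholder path
today_count = staged.count()

history = spark.table("etl_metrics.daily_row_counts")               # placeholder table: load_date, row_count
avg_count = (history
             .orderBy(F.col("load_date").desc())
             .limit(10)
             .agg(F.avg("row_count"))
             .first()[0])

if avg_count and abs(today_count - avg_count) / avg_count > 0.20:
    raise ValueError(
        f"Row count {today_count} deviates more than 20% from the "
        f"10-day average {avg_count:.0f}"
    )
```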

MarkLogic Cluster - Configure Forest with all documents

喜你入骨 Submitted on 2019-12-11 20:08:45
Question: We are working on MarkLogic 9.0.8.2. We are setting up a MarkLogic cluster (3 VMs) on Azure and, as per the failover design, want to have 3 forests (one per node) in Azure Blob. I have finished the setup, and when I started ingestion I found that documents are distributed across the 3 forests rather than all being stored in each forest. For example, I ingested 30000 records and each forest contains 10000 records. What I need is for every forest to hold all 30000 records. Is there any configuration (at the DB or forest level) I need

How do you ingest Spring Boot logs directly into Elasticsearch

不羁岁月 Submitted on 2019-11-29 15:50:26
Question: I'm investigating the feasibility of sending Spring Boot application logs directly into Elasticsearch, without using Filebeat or Logstash. I believe the Ingest plugin may help with this. My initial thought is to do this using Logback over TCP. https://github.com/logstash/logstash-logback-encoder <?xml version="1.0" encoding="UTF-8"?> <configuration> <appender name="stash" class="net.logstash.logback.appender.LogstashTcpSocketAppender"> <destination>127.0.0.1:4560</destination> <encoder class=
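The Logback configuration itself stays in XML; as a separate, hedged illustration of the Elasticsearch ingest-pipeline side the question alludes to, the Python sketch below defines a pipeline that groks a plain Spring Boot log line and indexes one document through it. The cluster URL, pipeline name, index, and grok pattern are placeholders, and the TCP appender part of the question is not covered here.

```python
# Define an ingest pipeline, then index one log line through it.
import requests

ES = "http://localhost:9200"          # placeholder cluster address
PIPELINE = "springboot-logs"          # placeholder pipeline name

pipeline_body = {
    "description": "Parse Spring Boot console-style log lines",
    "processors": [
        {
            "grok": {
                "field": "message",
                "patterns": [
                    "%{TIMESTAMP_ISO8601:timestamp}\\s+%{LOGLEVEL:level}\\s+%{GREEDYDATA:msg}"
                ],
            }
        }
    ],
}
requests.put(f"{ES}/_ingest/pipeline/{PIPELINE}", json=pipeline_body).raise_for_status()

doc = {"message": "2019-11-29 15:50:26.123  INFO Started DemoApplication in 3.2 seconds"}
resp = requests.post(f"{ES}/app-logs/_doc?pipeline={PIPELINE}", json=doc)
resp.raise_for_status()
print(resp.json())
```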