hortonworks-data-platform

Apache NiFi - OutOfMemory Error: GC overhead limit exceeded on SplitText processor

妖精的绣舞 submitted on 2019-12-04 10:24:22
I am trying to use NiFi to process large CSV files (potentially billions of records each) using HDF 1.2. I've implemented my flow, and everything works fine for small files. The problem is that if I push the file size to 100MB (1M records), I get a java.lang.OutOfMemoryError: GC overhead limit exceeded from the SplitText processor responsible for splitting the file into single records. I've searched for that, and it basically means that the garbage collector runs for too long without reclaiming much heap space. I expect this means that too many flow files are being generated
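
A common way to avoid the flood of flowfiles produced by a single-pass split is to chain two SplitText processors so the split happens in stages. A minimal sketch, with illustrative counts rather than tuned values:

    SplitText (stage 1) -> Line Split Count = 10000   # break the file into ~10k-line chunks
    SplitText (stage 2) -> Line Split Count = 1       # break each chunk into single-record flowfiles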

How to load SQL data into Hortonworks?

不打扰是莪最后的温柔 submitted on 2019-12-04 09:46:50
I have installed the Hortonworks Sandbox on my PC. I also tried it with a CSV file and it loads in a structured, table-like manner, so that part is OK (Hive + Hadoop). Now I want to migrate my current SQL database (MS SQL 2008 R2) into the Sandbox. How will I do this? I also want to connect it to my project (VS 2010, C#). Is it possible to connect through ODBC? I heard Sqoop is used for transferring data from SQL to Hadoop, so how can I do this migration with Sqoop? You could write your own job to migrate the data, but Sqoop would be more convenient. To do that you have to download Sqoop and the appropriate connector, Microsoft SQL
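
For reference, a Sqoop import from SQL Server into Hive usually looks like the sketch below; the host, database, table, and credentials are placeholders, and the Microsoft SQL Server JDBC driver jar has to be on Sqoop's classpath:

    sqoop import \
      --connect "jdbc:sqlserver://<sql-server-host>:1433;databaseName=<database>" \
      --username <user> --password <password> \
      --table <table> \
      --hive-import -m 1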

Install error: ftheader.h: No such file or directory

烂漫一生 submitted on 2019-12-04 01:09:44
When I try to build matplotlib-1.3.1, I get the FreeType header errors below. It is probably not finding ftheader.h. Any idea how to solve this problem? NOTE: I just installed Freetype-2.5.0.1 following the instructions in FreeType Install, because manually building Matplotlib-1.3.1 from source was failing due to the required package 'freetype' not being found initially. In file included from src/ft2font.h:16, from src/ft2font.cpp:3: /usr/include/ft2build.h:56:38: error: freetype/config/ftheader.h: No such file or directory In file included from src/ft2font
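
The error means the compiler's include path does not contain the directory that holds freetype/config/ftheader.h. One hedged workaround, assuming pkg-config can see the newly installed FreeType, is to pass its include flags to the matplotlib build:

    export CFLAGS="$(pkg-config --cflags freetype2) $CFLAGS"
    export CXXFLAGS="$(pkg-config --cflags freetype2) $CXXFLAGS"
    python setup.py build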

How to delete files from HDFS?

纵然是瞬间 submitted on 2019-12-04 00:03:03
I just downloaded the Hortonworks sandbox VM; inside it is Hadoop version 2.7.1. I added some files using the hadoop fs -put /hw1/* /hw1 command. After that I deleted the added files with the hadoop fs -rm /hw1/* command, and then emptied the trash with the hadoop fs -expunge command. But the DFS Remaining space did not change after the trash was emptied, even though I can see that the data was truly deleted from /hw1/ and from the trash. I have the fs.trash.interval parameter set to 1. Actually, I can still find all my data split into chunks in /hadoop/hdfs/data/current/BP
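
Two checks that may help, sketched with standard HDFS commands: -skipTrash bypasses the trash entirely, and dfsadmin -report shows the space the NameNode currently accounts for (block files on the DataNodes are removed asynchronously, so the reported free space can lag behind the delete):

    hdfs dfs -rm -r -skipTrash /hw1
    hdfs dfsadmin -report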

How to install libraries for Python in zeppelin-spark2 in HDP

左心房为你撑大大i submitted on 2019-12-03 21:47:04
I am using HDP version 2.6.4. Can you provide step-by-step instructions on how to install libraries into the following Python directory under spark2? sc.version (the Spark version) returns res0: String = 2.2.0.2.6.4.0-91. The spark2 interpreter property name and value are as follows: zeppelin.pyspark.python: /usr/local/Python-3.4.8/bin/python3.4. The Python version and current libraries are: %spark2.pyspark import pip import sys sorted(["%s==%s" % (i.key, i.version) for i in pip.get_installed_distributions()]) print("--") print(sys.version) print("--") print(installed_packages_list) -- 3.4.8 (default,
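
Assuming the interpreter really runs the binary configured in zeppelin.pyspark.python, one common approach is to install packages with that exact interpreter through pip's module interface (the package names here are only examples), then restart the spark2 interpreter in Zeppelin; if the libraries are used inside Spark tasks, the same install is needed on every node that runs executors:

    /usr/local/Python-3.4.8/bin/python3.4 -m pip install numpy pandas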

NiFi ConvertRecord JSON to CSV getting only a single record?

自古美人都是妖i submitted on 2019-12-03 10:15:03
I have the flow below set up to read JSON data and convert it to CSV using the ConvertRecord processor. However, the output flowfile is populated with only a single record (I am assuming only the first record) instead of all the records. Can someone help provide the correct configuration? Source JSON data: {"creation_Date": "2018-08-19", "Hour_of_day": 7, "log_count": 2136} {"creation_Date": "2018-08-19", "Hour_of_day": 17, "log_count": 606} {"creation_Date": "2018-08-19", "Hour_of_day": 14, "log_count": 1328} {"creation_Date": "2018-08-19", "Hour_of_day": 20, "log_count": 363} flow:
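
The source shown is newline-delimited JSON, i.e. several top-level objects rather than one JSON document. If the configured record reader only parses the first top-level object, one hedged workaround is to reshape the input into a proper JSON array before ConvertRecord, for example:

    [
      {"creation_Date": "2018-08-19", "Hour_of_day": 7,  "log_count": 2136},
      {"creation_Date": "2018-08-19", "Hour_of_day": 17, "log_count": 606},
      {"creation_Date": "2018-08-19", "Hour_of_day": 14, "log_count": 1328},
      {"creation_Date": "2018-08-19", "Hour_of_day": 20, "log_count": 363}
    ]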

Issue connecting to Kafka from outside

限于喜欢 submitted on 2019-12-03 09:12:17
I am using the Hortonworks Sandbox as the Kafka server and am trying to connect to Kafka from Eclipse with Java code. I use this configuration for the producer to send the message: metadata.broker.list=sandbox.hortonworks.com:45000 serializer.class=kafka.serializer.DefaultEncoder zk.connect=sandbox.hortonworks.com:2181 request.required.acks=0 producer.type=sync where sandbox.hortonworks.com is the sandbox host name I connect to. In the Kafka server.properties I changed this configuration: host.name=sandbox.hortonworks.com advertised.host.name=System IP (on which my Eclipse is running) advertised.port=45000 did the port
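
For reference, the Kafka broker shipped with the HDP sandbox typically listens on port 6667, not 45000, and reaching it from outside the VM generally needs three things to line up: a VirtualBox/NAT port forward for the broker port, host-side name resolution for sandbox.hortonworks.com (e.g. an /etc/hosts entry pointing at the VM address), and a producer pointed at that port. A minimal sketch under those assumptions:

    # /etc/hosts on the machine running Eclipse (adjust the IP to your VM)
    127.0.0.1   sandbox.hortonworks.com

    # producer configuration
    metadata.broker.list=sandbox.hortonworks.com:6667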

Find port number where HDFS is listening

谁说我不能喝 submitted on 2019-12-02 17:10:37
I want to access HDFS with fully qualified names such as: hadoop fs -ls hdfs://machine-name:8020/user I could also simply access HDFS with hadoop fs -ls /user However, I am writing test cases that should work on different distributions (HDP, Cloudera, MapR, etc.), which involves accessing HDFS files with qualified names. I understand that hdfs://machine-name:8020 is defined in core-site.xml as fs.default.name, but this seems to differ across distributions. For example, hdfs is maprfs on MapR. IBM BigInsights doesn't even have core-site.xml in $HADOOP_HOME/conf. There doesn't seem to
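
One distribution-agnostic way to recover the configured filesystem URI at runtime is the hdfs getconf utility (the same value is available programmatically as fs.defaultFS through the Hadoop Configuration API), for example:

    hdfs getconf -confKey fs.defaultFS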

Distcp - Container is running beyond physical memory limits

别来无恙 submitted on 2019-12-02 02:34:41
I've been struggling with distcp for several days and I swear I have googled enough. Here is my use case: USE CASE: I have a main folder in a certain location, say /hdfs/root, with a lot of subdirs (depth is not fixed) and files. Volume: 200,000 files ~= 30 GB. I need to copy only a subset of /hdfs/root to another location for a client, say /hdfs/dest. This subset is defined by a list of absolute paths that can be updated over time. Volume: 50,000 files ~= 5 GB. You understand that I can't use a
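
For the record, distcp can take the changing file list directly via -f instead of a source directory, and the per-container memory of the distcp job itself can be raised with generic options; the paths and sizes below are placeholders:

    hadoop distcp \
      -Dmapreduce.map.memory.mb=4096 \
      -Dmapreduce.map.java.opts=-Xmx3276m \
      -f hdfs:///tmp/files_to_copy.lst \
      hdfs:///hdfs/dest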