hortonworks-data-platform

Java - MySQL to Hive Import where MySQL Is Running on Windows and Hive Is Running on CentOS (Hortonworks Sandbox)

徘徊边缘 submitted on 2019-11-30 23:25:35
Before any answers and comments: I tried several options I found on Stack Overflow but ended up with a failure. These are the links - How can I execute Sqoop in Java? How to use Sqoop in Java Program? How to import table from MySQL to Hive using Java? How to load SQL data into the Hortonworks? I tried it in the Hortonworks Sandbox through the command line and succeeded: sqoop import --connect jdbc:mysql://192.168.56.101:3316/database_name --username=user --password=pwd --table table_name --hive-import -m 1 -- --schema default where 192.168.56.101 is the Windows machine and 192.168.56.102 is the Hortonworks Sandbox 2.6. Now I
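A minimal sketch of one way to run the same import from Java, assuming Sqoop 1.4.x and the Hadoop client libraries are on the classpath: call Sqoop's own entry point, org.apache.sqoop.Sqoop.runTool, with the arguments that already work on the command line. The connection details simply mirror the command above, and the fs.defaultFS value is a placeholder for the sandbox address.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopMySQLToHive {
    public static void main(String[] args) throws Exception {
        // Mirrors the command-line call that worked in the sandbox;
        // host, port, credentials, and table name are placeholders.
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://192.168.56.101:3316/database_name",
            "--username", "user",
            "--password", "pwd",
            "--table", "table_name",
            "--hive-import",
            "-m", "1"
        };

        Configuration conf = new Configuration();
        // Point the client at the sandbox cluster (assumed value; it must
        // match the HDP sandbox's core-site.xml).
        conf.set("fs.defaultFS", "hdfs://192.168.56.102:8020");

        int exitCode = Sqoop.runTool(sqoopArgs, conf);
        System.exit(exitCode);
    }
}
```

The trailing "-- --schema default" pass-through from the command line is omitted here; if it is needed, it can be appended to the argument array after a lone "--" entry, just as on the command line.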

Kafka Java Producer with Kerberos

有些话、适合烂在心里 submitted on 2019-11-30 19:11:52
Question: I am getting an error while sending messages to a Kafka topic in a Kerberized environment. We have a cluster on HDP 2.3. I followed this: http://henning.kropponline.de/2016/02/21/secure-kafka-java-producer-with-kerberos/ But to send messages, I have to do kinit explicitly first; only then am I able to send a message to the Kafka topic. I tried to do kinit through a Java class but that also doesn't work. The code is below: package com.ct.test.kafka; import java.util.Date; import java.util.Properties; import java.util.Random
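A minimal sketch of a producer that authenticates from the JVM itself instead of relying on a prior kinit, assuming a Kafka 0.9+ client on a Kerberized HDP cluster; the broker address, topic name, and file paths are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SecureProducerSketch {
    public static void main(String[] args) {
        // Point the JVM at the JAAS and krb5 configuration so login happens
        // in-process rather than via an external kinit (paths are assumptions).
        System.setProperty("java.security.auth.login.config", "/etc/kafka/kafka_client_jaas.conf");
        System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:6667");   // assumed broker address
        props.put("security.protocol", "SASL_PLAINTEXT");              // PLAINTEXTSASL on some older HDP builds
        props.put("sasl.kerberos.service.name", "kafka");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "hello from a Kerberized client"));
        }
    }
}
```

The JAAS file referenced above would typically contain a KafkaClient section with useKeyTab=true plus the client principal and keytab, which is what removes the need to run kinit by hand.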

How to set the data block size in Hadoop? Is it advantageous to change it?

痴心易碎 submitted on 2019-11-30 16:21:34
If we can change the data block size in Hadoop, please let me know how to do that. Is it advantageous to change the block size? If yes, then let me know why and how; if no, then also let me know why. There seems to be much confusion about this topic and also wrong advice going around. To lift the confusion it helps to think about how HDFS is actually implemented: HDFS is an abstraction over distributed disk-based file systems, so the words "block" and "blocksize" have a different meaning than generally understood. For HDFS a "file" is just a collection of blocks, and each "block" in turn is
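For the "how" part, a minimal sketch of the two usual client-side options, assuming the Hadoop client libraries are available: override dfs.blocksize in the job configuration, or pass a block size when a file is created. The paths and sizes below are placeholders, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cluster-wide default comes from hdfs-site.xml (dfs.blocksize);
        // a client may override it for the files it writes.
        conf.set("dfs.blocksize", "268435456"); // 256 MB, client-side override

        FileSystem fs = FileSystem.get(conf);

        // The block size can also be set per file at create time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/tmp/big-file.dat"), true, 4096, (short) 3, 512L * 1024 * 1024)) {
            out.writeUTF("block size is a per-file attribute, not a disk format");
        }
    }
}
```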

Spark on YARN resource manager: Relation between YARN Containers and Spark Executors

别来无恙 submitted on 2019-11-30 08:35:52
I'm new to Spark on YARN and don't understand the relation between the YARN containers and the Spark executors. I tried out the following configuration, based on the results of the yarn-utils.py script, which can be used to find an optimal cluster configuration. The Hadoop cluster (HDP 2.4) I'm working on: 1 master node: CPU: 2 CPUs with 6 cores each = 12 cores, RAM: 64 GB, SSD: 2 x 512 GB; 5 slave nodes: CPU: 2 CPUs with 6 cores each = 12 cores, RAM: 64 GB, HDD: 4 x 3 TB = 12 TB. HBase is installed (this is one of the parameters for the script below). So I ran python yarn-utils.py -c 12 -m 64 -d 4 -k
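As background for the container/executor relation: each Spark executor runs inside exactly one YARN container, so the container YARN allocates must hold the executor heap plus the off-heap overhead. A minimal sketch of how those knobs are set on the Spark side, assuming Spark 1.x property names and submission via spark-submit --master yarn; all of the numbers are illustrative, not a recommendation for this cluster.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnSizingSketch {
    public static void main(String[] args) {
        // Each Spark executor occupies exactly one YARN container, so the
        // container request is roughly executor memory + off-heap overhead.
        SparkConf conf = new SparkConf()
                .setAppName("yarn-sizing-sketch")
                .set("spark.executor.instances", "15")              // e.g. 3 executors per slave node
                .set("spark.executor.cores", "4")                    // cores per executor/container
                .set("spark.executor.memory", "14g")                 // heap inside the container
                .set("spark.yarn.executor.memoryOverhead", "2048");  // MB added on top (Spark 1.x name)

        // The resulting container size (~16 GB here) must fit within
        // yarn.scheduler.maximum-allocation-mb on the cluster.
        // The master ("yarn") is supplied by spark-submit, not set in code.
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}
```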

How to load CSVs with timestamps in a custom format?

与世无争的帅哥 submitted on 2019-11-30 08:33:48
Question: I have a timestamp field in a CSV file that I load into a DataFrame using the Spark CSV library. The same piece of code works on my local machine with Spark 2.0 but throws an error on Azure Hortonworks HDP 3.5 and 3.6. I have checked, and Azure HDInsight 3.5 is also using the same Spark version, so I don't think it's a problem with the Spark version. import org.apache.spark.sql.types._ val sourceFile = "C:\\2017\\datetest" val sourceSchemaStruct = new StructType() .add("EventDate",DataTypes
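A minimal sketch of the usual approach, assuming Spark 2.x: declare the column as TimestampType in the schema and tell the CSV reader the source pattern via the timestampFormat option. The path, column names, and pattern are placeholders and must match the actual file.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CsvTimestampSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-timestamp-sketch")
                .getOrCreate();

        // Declare the column as a timestamp up front...
        StructType schema = new StructType()
                .add("EventDate", DataTypes.TimestampType)
                .add("Value", DataTypes.DoubleType);

        // ...and tell the CSV reader how the source text is formatted.
        Dataset<Row> df = spark.read()
                .schema(schema)
                .option("header", "true")
                .option("timestampFormat", "MM/dd/yyyy HH:mm")
                .csv("hdfs:///data/datetest");

        df.show(5, false);
    }
}
```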

Hive tables not found when running in YARN-Cluster mode

白昼怎懂夜的黑 submitted on 2019-11-28 12:41:25
I have a Spark (version 1.4.1) application on HDP 2.3. It works fine when running in YARN-client mode. However, when running in YARN-cluster mode, none of my Hive tables can be found by the application. I submit the application like so: ./bin/spark-submit --class com.myCompany.Main --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 10g --executor-cores 1 --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar /home/spark/apps/YarnClusterTest.jar --files /etc/hive/conf/hive-site.xml Here's an excerpt from the logs: 5
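A minimal sketch of the Spark 1.4-era pattern for reaching Hive from a yarn-cluster driver. It is not a confirmed diagnosis of this job, but note that spark-submit treats everything after the application jar as arguments to the main class, so a --files option placed after the jar (as in the command above) is never seen by spark-submit and hive-site.xml is not shipped to the driver.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class YarnClusterHiveSketch {
    public static void main(String[] args) {
        // In yarn-cluster mode the driver runs on a worker node, so it can
        // only see the metastore if hive-site.xml was shipped with the job,
        // e.g. with --files placed BEFORE the application jar:
        //   spark-submit ... --files /etc/hive/conf/hive-site.xml \
        //       /home/spark/apps/YarnClusterTest.jar
        SparkConf conf = new SparkConf().setAppName("yarn-cluster-hive-sketch");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Spark 1.4 style: HiveContext picks up hive-site.xml from the
        // driver's classpath and talks to the metastore it points at.
        HiveContext hiveContext = new HiveContext(jsc.sc());
        DataFrame tables = hiveContext.sql("SHOW TABLES");
        tables.show();

        jsc.stop();
    }
}
```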

Requests hang when using Hiveserver2 Thrift Java client

做~自己de王妃 submitted on 2019-11-28 09:18:56
This is a follow-up to this question, where I ask what the HiveServer2 Thrift Java client API is. This question should be able to stand alone without that background if you don't need any more context. Unable to find any documentation on how to use the HiveServer2 Thrift API, I put this together. The best reference I could find was the Apache JDBC implementation. TSocket transport = new TSocket("hive.example.com", 10002); transport.setTimeout(999999999); TBinaryProtocol protocol = new TBinaryProtocol(transport); TCLIService.Client client = new TCLIService.Client(protocol); transport
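A minimal sketch of completing that client, assuming the Hive 1.x generated classes in org.apache.hive.service.cli.thrift. One common cause of the hang is that HiveServer2 defaults to SASL authentication, which a bare TSocket/TBinaryProtocol never negotiates, so the socket is wrapped in a plain SASL transport here (host, port, and credentials are placeholders; alternatively the server can be set to hive.server2.authentication=NOSASL).

```java
import org.apache.hive.service.auth.PlainSaslHelper;
import org.apache.hive.service.cli.thrift.TCLIService;
import org.apache.hive.service.cli.thrift.TExecuteStatementReq;
import org.apache.hive.service.cli.thrift.TExecuteStatementResp;
import org.apache.hive.service.cli.thrift.TOpenSessionReq;
import org.apache.hive.service.cli.thrift.TOpenSessionResp;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class Hs2ThriftSketch {
    public static void main(String[] args) throws Exception {
        // Raw socket to HiveServer2; host and port are placeholders.
        TSocket socket = new TSocket("hive.example.com", 10000);
        socket.setTimeout(60000);

        // Wrap the socket so the client speaks SASL (PLAIN), matching the
        // HiveServer2 default; without this the request appears to hang.
        TTransport transport = PlainSaslHelper.getPlainTransport("hive", "password", socket);
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        TCLIService.Client client = new TCLIService.Client(protocol);
        transport.open();

        // Open a session, then run a statement against it.
        TOpenSessionResp sessionResp = client.OpenSession(new TOpenSessionReq());
        TExecuteStatementReq execReq =
                new TExecuteStatementReq(sessionResp.getSessionHandle(), "SHOW TABLES");
        TExecuteStatementResp execResp = client.ExecuteStatement(execReq);
        System.out.println("Operation status: " + execResp.getStatus().getStatusCode());

        transport.close();
    }
}
```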