MapR

Why is Parquet slower for me than the text file format in Hive?

我们两清 submitted on 2019-12-07 05:12:20
Question: OK! So I decided to use Parquet as the storage format for Hive tables, and before actually rolling it out on my cluster I decided to run some tests. Surprisingly, Parquet was slower in my tests, against the general notion that it is faster than plain text files. Please note that I am using Hive 0.13 on MapR. Here is the flow of my operations: Table A - text format, 2.5 GB. Table B - Parquet, 1.9 GB [Create table B stored as parquet as select * from A]. Table C - Parquet with Snappy compression, 1.9 GB [Create table C stored as parquet …
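
A minimal sketch of the kind of comparison described above, assuming a Hive CLI is available and a text-format table named a already exists; the table names, the parquet.compression setting, and the probe query are illustrative choices, not taken from the question:

    # Hypothetical benchmark: copy an existing text-format table into Parquet,
    # once uncompressed and once with Snappy, then time the same query.
    hive -e "CREATE TABLE b STORED AS PARQUET AS SELECT * FROM a;"

    # parquet.compression is one commonly used knob for Parquet compression in
    # Hive; verify the exact setting for your Hive version.
    hive -e "SET parquet.compression=SNAPPY;
             CREATE TABLE c STORED AS PARQUET AS SELECT * FROM a;"

    # Time an identical query against each storage format.
    time hive -e "SELECT COUNT(*) FROM a;"
    time hive -e "SELECT COUNT(*) FROM b;"
    time hive -e "SELECT COUNT(*) FROM c;"

A simple count or full select exercises the whole table; Parquet typically pays off when queries read only a few columns of a wide table, which is one common reason a test like this can come out slower than expected.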

Configure Druid to connect to Zookeeper on port 5181

十年热恋 submitted on 2019-12-06 07:54:25
I'm running a MapR cluster and want to do some time-series analysis with Druid. MapR uses a non-standard port for Zookeeper (port 5181 instead of the conventional port 2181). When I start the Druid coordinator service, it attempts to connect on the conventional Zookeeper port and fails: 2015-03-03T17:46:49,614 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. 2015-03-03T17:46:49,617 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, …
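
A minimal sketch of the usual fix, assuming Druid takes its Zookeeper connection string from the druid.zk.service.host property in its common runtime properties; the exact file path varies by Druid version and install layout, so the path below is an assumption:

    # Point Druid at MapR's Zookeeper port (5181) instead of the default 2181.
    # Adjust the properties file path for your install.
    echo 'druid.zk.service.host=localhost:5181' >> conf/druid/_common/common.runtime.properties

Because the coordinator, broker, historical and other Druid services all load this shared configuration, restarting them after the change should make the whole cluster talk to Zookeeper on 5181.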

Does HBase impose a maximum size per row?

こ雲淡風輕ζ submitted on 2019-12-05 02:59:08
Question: High-level question: Does HBase impose a maximum size per row which is common to all distributions (and thus not an artifact of implementation), either in terms of bytes stored or in terms of number of cells? If so: What is the limit? Why does the limit exist? Where is it documented? If not: Is documentation (or a test result) available demonstrating HBase's ability to handle rows in excess of 2 GB? 4 GB? Is there a practical or "best practice" maximum under which HBase API users should keep row sizes in order to avoid severe performance degradation? If so, what kind …
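
For context, a hedged sketch of two size-related, client-side settings that recent HBase releases are understood to provide; the property names, defaults, and behaviour in the comments are stated from memory and should be checked against the hbase-default.xml and reference guide for the version in use, and the conf path is also an assumption:

    # Illustrative probe of two settings that relate to cell and row size.
    #   hbase.client.keyvalue.maxsize - client-side cap on a single cell
    #   hbase.table.max.rowsize       - cap on bytes returned for one row by a
    #                                   Get/Scan before RowTooBigException
    grep -A 2 -E "hbase.client.keyvalue.maxsize|hbase.table.max.rowsize" \
        "${HBASE_CONF_DIR:-/etc/hbase/conf}/hbase-site.xml" || true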

Difference Between typical Hadoop Architecture and MapR architecture

二次信任 submitted on 2019-12-04 01:48:41
Question: I know that Hadoop is based on a master/slave architecture: HDFS works with NameNodes and DataNodes, and MapReduce works with JobTrackers and TaskTrackers. But I can't find any of these services on MapR; I found out that it has its own architecture with its own services. I'm a little confused, could anyone please tell me what the difference is between using Hadoop on its own and using it with MapR? Answer 1: MapR and Apache Hadoop DO NOT have the same architecture at the storage level. MapR uses its own filesystem …
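
A small illustration of how that storage-level difference shows up at the command line; the hostnames and paths are placeholders, and it assumes access to both a stock HDFS client and a MapR client for comparison:

    # On Apache Hadoop (and HDP/CDH), the default filesystem is HDFS, served
    # by a NameNode:
    hadoop fs -ls hdfs://namenode-host:8020/user

    # On MapR, the default filesystem is MapR-FS (scheme maprfs), so there is
    # no NameNode to address; the cluster itself serves the namespace:
    hadoop fs -ls maprfs:///user

    # The distribution-neutral form simply uses whatever the client's
    # fs.defaultFS (formerly fs.default.name) points to:
    hadoop fs -ls /user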

Find port number where HDFS is listening

六月ゝ 毕业季﹏ submitted on 2019-12-03 03:10:25
Question: I want to access HDFS with fully qualified names such as: hadoop fs -ls hdfs://machine-name:8020/user. I could also simply access HDFS with hadoop fs -ls /user. However, I am writing test cases that should work on different distributions (HDP, Cloudera, MapR, etc.), which involves accessing HDFS files with qualified names. I understand that hdfs://machine-name:8020 is defined in core-site.xml as fs.default.name, but this seems to differ across distributions. For example, hdfs is maprfs on MapR, and IBM BigInsights doesn't even have core-site.xml in $HADOOP_HOME/conf. There doesn't seem to …
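
One distribution-agnostic sketch for discovering the configured default filesystem at runtime, assuming the hdfs client script is on the PATH (which is itself an assumption on some distributions):

    # Ask the client configuration for the default filesystem URI; this
    # reflects fs.defaultFS, falling back to the deprecated fs.default.name.
    hdfs getconf -confKey fs.defaultFS

    # The NameNode RPC address(es) HDFS is listening on can also be queried:
    hdfs getconf -nnRpcAddresses

On MapR the returned URI is typically maprfs:///, so test code that builds fully qualified paths from this value rather than hard-coding hdfs://host:8020 stays portable across distributions.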

Standard practices for logging in MapReduce jobs

穿精又带淫゛_ submitted on 2019-11-30 15:55:21
Question: I'm trying to find the best approach to logging in MapReduce jobs. I'm using slf4j with a log4j appender, as in my other Java applications, but since a MapReduce job runs in a distributed manner across the cluster, I don't know where I should set the log file location, given that it is a shared cluster with limited access privileges. Are there any standard practices for logging in MapReduce jobs, so that you can easily look at the logs across the cluster after the job completes? Answer 1 (by Ashrith): You could use log4j, which is the default logging framework that Hadoop uses. So, from your MapReduce application …
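
A sketch of the usual way to read those per-task logs after a run, assuming a YARN cluster with log aggregation enabled; the application id is a placeholder, and MapR's log locations and tooling can differ, so treat this as one possible workflow rather than the distribution's documented procedure:

    # Each map/reduce task attempt writes its log4j output (stdout, stderr,
    # syslog) on the node it ran on. With YARN log aggregation enabled, those
    # files are collected after the job finishes and can be fetched centrally:
    yarn logs -applicationId application_1234567890123_0001 | less

    # The same task-attempt logs are also browsable through the
    # JobHistory / ResourceManager web UI.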
