MapR

Why is Parquet slower for me than the text file format in Hive?

我们两清 submitted on 2019-12-07 05:12:20
Question: OK! So I decided to use Parquet as the storage format for Hive tables, and before actually rolling it out on my cluster I decided to run some tests. Surprisingly, Parquet was slower in my tests, against the general notion that it is faster than plain text files. Please note that I am using Hive 0.13 on MapR. Here is the flow of my operations: Table A - text format, 2.5 GB. Table B - Parquet, 1.9 GB [Create table B stored as parquet as select * from A]. Table C - Parquet with Snappy compression, 1.9 GB [Create table C stored as parquet …
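
A minimal sketch of the kind of comparison described above, assuming a Hive CLI is available and a text-format table named a already exists; the table names, the parquet.compression setting, and the probe query are illustrative choices, not taken from the question:

    # Hypothetical benchmark: copy an existing text-format table into Parquet,
    # once uncompressed and once with Snappy, then time the same query.
    hive -e "CREATE TABLE b STORED AS PARQUET AS SELECT * FROM a;"

    # parquet.compression is one commonly used knob for Parquet compression in
    # Hive; verify the exact setting for your Hive version.
    hive -e "SET parquet.compression=SNAPPY;
             CREATE TABLE c STORED AS PARQUET AS SELECT * FROM a;"

    # Time an identical query against each storage format.
    time hive -e "SELECT COUNT(*) FROM a;"
    time hive -e "SELECT COUNT(*) FROM b;"
    time hive -e "SELECT COUNT(*) FROM c;"

A simple count or full select exercises the whole table; Parquet typically pays off when queries read only a few columns of a wide table, which is one common reason a test like this can come out slower than expected.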

Configure Druid to connect to Zookeeper on port 5181

十年热恋 submitted on 2019-12-06 07:54:25
I'm running a MapR cluster and want to do some time-series analysis with Druid. MapR uses a non-standard port for Zookeeper (port 5181 instead of the conventional port 2181). When I start the Druid coordinator service, it attempts to connect on the conventional Zookeeper port and fails: 2015-03-03T17:46:49,614 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. 2015-03-03T17:46:49,617 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, …
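
A minimal sketch of the usual fix, assuming Druid takes its Zookeeper connection string from the druid.zk.service.host property in its common runtime properties; the exact file path varies by Druid version and install layout, so the path below is an assumption:

    # Point Druid at MapR's Zookeeper port (5181) instead of the default 2181.
    # Adjust the properties file path for your install.
    echo 'druid.zk.service.host=localhost:5181' >> conf/druid/_common/common.runtime.properties

Because the coordinator, broker, historical and other Druid services all load this shared configuration, restarting them after the change should make the whole cluster talk to Zookeeper on 5181.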

Does HBase impose a maximum size per row?

こ雲淡風輕ζ submitted on 2019-12-05 02:59:08
Question: High-level question: Does HBase impose a maximum size per row which is common to all distributions (and thus not an artifact of implementation), either in terms of bytes stored or in terms of number of cells? If so: What is the limit? Why does the limit exist? Where is it documented? If not: Is documentation (or a test result) available demonstrating HBase's ability to handle rows in excess of 2 GB? 4 GB? Is there a practical or "best practice" maximum under which HBase API users should keep row sizes in order to avoid severe performance degradation? If so, what kind …
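
For context, a hedged sketch of two size-related, client-side settings that recent HBase releases are understood to provide; the property names, defaults, and behaviour in the comments are stated from memory and should be checked against the hbase-default.xml and reference guide for the version in use, and the conf path is also an assumption:

    # Illustrative probe of two settings that relate to cell and row size.
    #   hbase.client.keyvalue.maxsize - client-side cap on a single cell
    #   hbase.table.max.rowsize       - cap on bytes returned for one row by a
    #                                   Get/Scan before RowTooBigException
    grep -A 2 -E "hbase.client.keyvalue.maxsize|hbase.table.max.rowsize" \
        "${HBASE_CONF_DIR:-/etc/hbase/conf}/hbase-site.xml" || true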

Difference Between typical Hadoop Architecture and MapR architecture

二次信任 submitted on 2019-12-04 01:48:41
Question: I know that Hadoop is based on a master/slave architecture: HDFS works with NameNodes and DataNodes, and MapReduce works with JobTrackers and TaskTrackers. But I can't find any of these services on MapR; I found out that it has its own architecture with its own services. I'm a little confused, could anyone please tell me what the difference is between using Hadoop on its own and using it with MapR? Answer 1: MapR and Apache Hadoop DO NOT have the same architecture at the storage level. MapR uses its own filesystem …
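
A small illustration of how that storage-level difference shows up at the command line; the hostnames and paths are placeholders, and it assumes access to both a stock HDFS client and a MapR client for comparison:

    # On Apache Hadoop (and HDP/CDH), the default filesystem is HDFS, served
    # by a NameNode:
    hadoop fs -ls hdfs://namenode-host:8020/user

    # On MapR, the default filesystem is MapR-FS (scheme maprfs), so there is
    # no NameNode to address; the cluster itself serves the namespace:
    hadoop fs -ls maprfs:///user

    # The distribution-neutral form simply uses whatever the client's
    # fs.defaultFS (formerly fs.default.name) points to:
    hadoop fs -ls /user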

Find port number where HDFS is listening

六月ゝ 毕业季﹏ submitted on 2019-12-03 03:10:25
Question: I want to access HDFS with fully qualified names such as: hadoop fs -ls hdfs://machine-name:8020/user. I could also simply access HDFS with hadoop fs -ls /user. However, I am writing test cases that should work on different distributions (HDP, Cloudera, MapR, etc.), which involves accessing HDFS files with qualified names. I understand that hdfs://machine-name:8020 is defined in core-site.xml as fs.default.name, but this seems to differ across distributions. For example, hdfs is maprfs on MapR, and IBM BigInsights doesn't even have core-site.xml in $HADOOP_HOME/conf. There doesn't seem to …
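
One distribution-agnostic sketch for discovering the configured default filesystem at runtime, assuming the hdfs client script is on the PATH (which is itself an assumption on some distributions):

    # Ask the client configuration for the default filesystem URI; this
    # reflects fs.defaultFS, falling back to the deprecated fs.default.name.
    hdfs getconf -confKey fs.defaultFS

    # The NameNode RPC address(es) HDFS is listening on can also be queried:
    hdfs getconf -nnRpcAddresses

On MapR the returned URI is typically maprfs:///, so test code that builds fully qualified paths from this value rather than hard-coding hdfs://host:8020 stays portable across distributions.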

Standard practices for logging in MapReduce jobs

穿精又带淫゛_ submitted on 2019-11-30 15:55:21
Question: I'm trying to find the best approach to logging in MapReduce jobs. I'm using slf4j with a log4j appender, as in my other Java applications, but since a MapReduce job runs in a distributed manner across the cluster, I don't know where I should set the log file location, given that it is a shared cluster with limited access privileges. Are there any standard practices for logging in MapReduce jobs, so that you can easily look at the logs across the cluster after the job completes? Answer 1 (by Ashrith): You could use log4j, which is the default logging framework that Hadoop uses. So, from your MapReduce application …
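
A sketch of the usual way to read those per-task logs after a run, assuming a YARN cluster with log aggregation enabled; the application id is a placeholder, and MapR's log locations and tooling can differ, so treat this as one possible workflow rather than the distribution's documented procedure:

    # Each map/reduce task attempt writes its log4j output (stdout, stderr,
    # syslog) on the node it ran on. With YARN log aggregation enabled, those
    # files are collected after the job finishes and can be fetched centrally:
    yarn logs -applicationId application_1234567890123_0001 | less

    # The same task-attempt logs are also browsable through the
    # JobHistory / ResourceManager web UI.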
