orc

Sorted Table in Hive (ORC file format)

Submitted by 我与影子孤独终老i on 2019-12-12 14:49:55
Question: I'm having some difficulty making sure I'm leveraging sorted data within a Hive table (using the ORC file format). I understand we can affect how the data is read from a Hive table by declaring a DISTRIBUTE BY clause in the create DDL.

CREATE TABLE trades (
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (trade_id) SORTED BY (trade_id, time) INTO 8 BUCKETS
STORED AS ORC;

This will mean that every time I make a query to this table, the data …
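A quick sanity check (not part of the original question) is to look at the physical plan for a point lookup on the bucketing/sort column and see whether any pruning happens. A minimal sketch in Spark/Scala, assuming a Hive-enabled SparkSession and the trades table above:

import org.apache.spark.sql.SparkSession

// Minimal sketch: inspect the plan for a filter on the clustered/sorted column.
// Whether bucket pruning or sort-aware reading actually happens depends on the
// engine and its version; this only shows how one might check.
val spark = SparkSession.builder()
  .appName("sorted-orc-check")
  .enableHiveSupport()
  .getOrCreate()

spark.sql(
  "EXPLAIN SELECT * FROM trades WHERE dt = '2019-01-01' AND trade_id = 123"
).show(truncate = false)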

Spark job that use hive context failing in oozie

Submitted by 亡梦爱人 on 2019-12-12 02:13:14
Question: In one of our pipelines we do aggregation using Spark (Java), orchestrated with Oozie. The pipeline writes the aggregated data to an ORC file using the following lines:

HiveContext hc = new HiveContext(sc);
DataFrame modifiedFrame = hc.createDataFrame(aggregateddatainrdd, schema);
modifiedFrame.write()
    .format("org.apache.spark.sql.hive.orc")
    .partitionBy("partition_column_name")
    .save(output);

When the Spark action in the Oozie job gets triggered, it throws the following …
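For reference only (not the original pipeline code): the same write expressed in Scala with the short format name "orc", which on Spark 1.x with Hive support resolves to the Hive ORC source named above. The SparkContext, RDD, schema, partition column and output path are placeholders from the question:

import org.apache.spark.sql.hive.HiveContext

// Sketch only; `sc`, `aggregateddatainrdd`, `schema` and `output` stand in for the
// objects from the question.
val hc = new HiveContext(sc)
val modifiedFrame = hc.createDataFrame(aggregateddatainrdd, schema)

modifiedFrame.write
  .format("orc")                          // short name for the Hive ORC source
  .partitionBy("partition_column_name")
  .save(output)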

HIVE very long field gives OOM Heap

Submitted by 冷暖自知 on 2019-12-11 15:27:12
Question: We are storing string fields that vary in length from small (a few kB) to very long (<400 MB) in a Hive table. We are now facing OOM errors when copying data from one table to another (without any conditions or joins). This is not exactly what we run in production, but it is the simplest use case where the problem occurs. The HQL is basically just:

INSERT INTO new_table SELECT * FROM old_table;

The container and Java heap were set to 16 GB, and we had tried different file formats …

How can I convert local ORC files to CSV?

Submitted by 给你一囗甜甜゛ on 2019-12-11 09:37:56
Question: I have an ORC file on my local machine and I need any reasonable format from it (e.g. CSV, JSON, YAML, ...). How can I convert ORC to CSV?

Answer 1: Download the ORC source, extract the files, go to the java folder and run Maven:

mvn install

Then use the ORC tools. This is how I use them - you will likely need to adjust the paths:

java -jar ~/.m2/repository/org/apache/orc/orc-tools/1.5.4/orc-tools-1.5.4-uber.jar data ~/your_file.orc > output.json

The output is JSON Lines, which is easy to convert to CSV. First I needed …
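An alternative, not from the answer above: if a local Spark installation is acceptable, the conversion can be done in a few lines of Scala. The paths are placeholders:

import org.apache.spark.sql.SparkSession

// Sketch: read the local ORC file and write it out as CSV with a header row.
// Spark writes a directory of part files rather than a single CSV file, and
// columns with nested types (maps, structs, arrays) would need flattening first.
val spark = SparkSession.builder()
  .appName("orc-to-csv")
  .master("local[*]")
  .getOrCreate()

spark.read.orc("/path/to/your_file.orc")
  .write
  .option("header", "true")
  .csv("/path/to/output_csv_dir")

spark.stop()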

R read ORC file from S3

Submitted by 我怕爱的太早我们不能终老 on 2019-12-11 03:06:08
Question: We will be hosting an EMR cluster (with spot instances) on AWS, running on top of an S3 bucket. Data will be stored in this bucket in ORC format. However, we also want to use R, as a kind of sandbox environment, to read the same data. I've got the package aws.s3 (cloudyr) running correctly: I can read CSV files without a problem, but it does not seem to let me turn the ORC files into something readable. The two options I found online were:
- SparkR
- dataconnector (Vertica)
Since …
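For orientation (not from the question): the operation that SparkR ultimately wraps is Spark's ORC reader. The equivalent call in Scala looks like the sketch below, with a placeholder bucket path and without the S3 credential/endpoint configuration:

import org.apache.spark.sql.SparkSession

// Sketch: read ORC data from S3 with Spark. On EMR the s3:// scheme is typically
// available via EMRFS; elsewhere s3a:// plus credentials configuration is needed.
val spark = SparkSession.builder()
  .appName("read-orc-from-s3")
  .getOrCreate()

val df = spark.read.orc("s3://your-bucket/path/to/orc/")
df.printSchema()
df.show(10)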

How to read an ORC transactional Hive table in Spark?

Submitted by 老子叫甜甜 on 2019-12-10 19:06:23
Question: How do I read an ORC transactional Hive table in Spark? I am facing an issue while reading an ORC transactional table through Spark: I get the schema of the Hive table, but I am not able to read the actual data. Complete scenario:

hive> create table default.Hello (id int, name string) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.hello values (10, 'abc');

Now I am trying to access the Hive ORC data from Spark SQL, but it shows only the schema: spark.sql("select * from …
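One workaround that is often suggested for this situation (not part of the question, and its success depends on the Spark and Hive versions in play) is to trigger a major compaction in Hive so the ACID delta files are merged into base files, and only then query from Spark:

// Sketch only. First, in Hive (not Spark), compact the table and wait for the
// compaction to finish:
//   hive> ALTER TABLE default.hello COMPACT 'major';
//   hive> SHOW COMPACTIONS;
// Then, assuming a Hive-enabled SparkSession named `spark`, retry the read.
// This is an illustration, not a guaranteed fix for Hive ACID tables.
val rows = spark.sql("select * from default.hello")
rows.show()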

Spark DataFrame ORC Hive table reading issue

Submitted by て烟熏妆下的殇ゞ on 2019-12-09 03:40:30
I am trying to read a Hive table in Spark. Below is the Hive table format:

# Storage Information
SerDe Library:            org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:              org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:             org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:               No
Num Buckets:              -1
Bucket Columns:           []
Sort Columns:             []
Storage Desc Params:
    field.delim               \u0001
    serialization.format      \u0001

When I try to read it using Spark SQL with the command below:

val c = hiveContext.sql("""select a from c_db.c cs where dt >= '2016-05-12' """)
c.show

I am getting the below …
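A fallback that is sometimes useful when the metastore-based read misbehaves (not from the question; the warehouse path below is a placeholder) is to point Spark's ORC reader directly at the partition's files:

// Sketch: bypass the metastore and read the partition's ORC files directly.
// The real location would come from DESCRIBE FORMATTED c_db.c.
val direct = hiveContext.read
  .format("orc")
  .load("/apps/hive/warehouse/c_db.db/c/dt=2016-05-12")

direct.select("a").show()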

Why is Apache Orc RecordReader.searchArgument() not filtering correctly?

Submitted by 妖精的绣舞 on 2019-12-08 20:25:59
Question: Here is a simple program that:
- writes records into an ORC file, and
- then tries to read the file using predicate pushdown (searchArgument).

Questions:
- Is this the right way to use predicate pushdown in ORC?
- The read(..) method seems to return all the records, completely ignoring the searchArgument. Why is that?

Notes: I have not been able to find any useful unit test that demonstrates how predicate pushdown works in ORC (ORC on GitHub), nor am I able to find any clear documentation on this …
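For context, a sketch of how a search argument is typically attached to an ORC row reader with the org.apache.orc core API. This is not the asker's program; the file path, column name and ORC version are assumptions, and the exact builder/option signatures vary between ORC releases. Note also that ORC predicate pushdown only skips whole stripes or row groups based on their statistics, so rows inside a surviving row group are still returned and the predicate must be re-applied per row:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.sarg.{PredicateLeaf, SearchArgumentFactory}
import org.apache.orc.{OrcFile, Reader}

// Sketch only: open a reader and hand it a SearchArgument on column "id".
val conf = new Configuration()
val reader: Reader = OrcFile.createReader(new Path("/tmp/example.orc"),
  OrcFile.readerOptions(conf))

val sarg = SearchArgumentFactory.newBuilder()
  .startAnd()
  .equals("id", PredicateLeaf.Type.LONG, java.lang.Long.valueOf(42L))
  .end()
  .build()

val options = new Reader.Options(conf)
  .searchArgument(sarg, Array("id"))

// Rows are delivered in VectorizedRowBatch objects; matching row groups may still
// contain non-matching rows, which the caller has to filter out itself.
val rows = reader.rows(options)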

Java: Read JSON from a file, convert to ORC and write to a file

Submitted by 断了今生、忘了曾经 on 2019-12-08 11:52:01
Question: I need to automate a JSON-to-ORC conversion process. I was able to almost get there by using Apache's ORC tools package, except that JsonReader doesn't handle the Map type and throws an exception. So the following works, but doesn't handle the Map type:

Path hadoopInputPath = new Path(input);
try (RecordReader recordReader = new JsonReader(hadoopInputPath, schema, hadoopConf)) { // throws when schema contains Map type
    try (Writer writer = OrcFile.createWriter(new Path(output), OrcFile.writerOptions …
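An alternative sketch, not from the question: if pulling in Spark is an option, the conversion can be done with its JSON reader by declaring the Map-typed column explicitly in the schema (Spark's JSON schema inference produces structs, not maps). Paths and field names are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Sketch: JSON to ORC via Spark, with an explicit schema for the map column.
val spark = SparkSession.builder()
  .appName("json-to-orc")
  .master("local[*]")
  .getOrCreate()

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("attributes", MapType(StringType, StringType))   // the Map-typed field
))

spark.read.schema(schema).json("/path/to/input.json")
  .write
  .orc("/path/to/output_orc_dir")

spark.stop()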

How to set ORC stripe size in Spark

Submitted by 旧巷老猫 on 2019-12-08 03:45:15
Question: I am trying to generate a dataset in Spark (2.3) and write it in the ORC file format. I'm trying to set some properties for the ORC stripe size and compress size. I took hints from this SO post, but Spark is not honoring those properties, and the stripe size in the resulting ORC files is much lower than what I've set.

val conf: SparkConf = new SparkConf().setAppName("App")
  .set("spark.sql.orc.impl", "native")
  .set("spark.sql.hive.convertMetastoreOrc", "true")
  .set("spark.sql.orc.stripe.size", "67108864" …
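For illustration only (not the asker's code, and not a guaranteed fix): the ORC writer's own configuration keys are "orc.stripe.size" and "orc.compress.size", and one way to get them to the writer is through the Hadoop configuration or per-write options, which Spark copies into the write job's configuration. Even then, ORC's memory manager can flush smaller stripes when heap is tight, so the configured size acts as an upper bound. The DataFrame and output path below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-stripe-size")
  .config("spark.sql.orc.impl", "native")
  .getOrCreate()

// Option 1: set the ORC keys on the Hadoop configuration used for writes.
spark.sparkContext.hadoopConfiguration.set("orc.stripe.size", "67108864")
spark.sparkContext.hadoopConfiguration.set("orc.compress.size", "262144")

// Option 2: pass the same keys as per-write options.
val df = spark.range(0, 1000000L).toDF("id")
df.write
  .option("orc.stripe.size", "67108864")
  .option("orc.compress.size", "262144")
  .orc("/tmp/orc_stripe_size_demo")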