For a requirement I want to convert a delimited text file to ORC (Optimized Row Columnar) format, and I have to run the conversion at regular intervals.
You can insert text data into an ORC table with a command like this:
insert overwrite table orcTable select * from textTable;
The first table, orcTable, is created by the following command:
create table orcTable(name string, city string) stored as orc;
And textTable has the same structure as orcTable.
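For completeness, textTable could be defined as a plain delimited table along the following lines; the tab delimiter and the load path are assumptions here, so adjust them to your actual file:

create table textTable(name string, city string)
row format delimited fields terminated by '\t'
stored as textfile;

-- hypothetical path; point it at your delimited file
load data local inpath '/path/to/input.txt' into table textTable;

Since this has to run at regular intervals, you can simply re-run the load and the insert statements for each new file (use insert into instead of insert overwrite if you want to append rather than replace the existing rows).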
You can use Spark DataFrames to convert a delimited file to ORC format very easily. You can also impose a schema and keep only specific columns (see the sketch after the dependency note below).
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class OrcConvert {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("OrcConvert");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(jsc);

        String inputPath = args[0];
        String outputPath = args[1];

        // Read the delimited input with the spark-csv package;
        // \001 (Ctrl-A) is the field delimiter here.
        DataFrame inputDf = hiveContext.read().format("com.databricks.spark.csv")
                .option("quote", "'").option("delimiter", "\001")
                .load(inputPath);

        // Write the same data back out in ORC format.
        inputDf.write().orc(outputPath);
    }
}
Make sure all dependencies are met and that Hive is running, since the example uses HiveContext; at the time of writing, Spark supports the ORC format only through HiveContext.
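As mentioned above, you can also impose a schema and keep only specific columns. Here is a minimal sketch of that variant; the class name OrcConvertWithSchema and the columns name, city, and age are just assumptions for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class OrcConvertWithSchema {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("OrcConvertWithSchema");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(jsc);

        // Assumed column names and types; replace them with those of your file.
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("name", DataTypes.StringType, true),
                DataTypes.createStructField("city", DataTypes.StringType, true),
                DataTypes.createStructField("age", DataTypes.IntegerType, true)
        });

        DataFrame inputDf = hiveContext.read().format("com.databricks.spark.csv")
                .schema(schema) // impose the schema instead of inferring it
                .option("delimiter", "\001")
                .load(args[0]);

        // Keep only the columns you need before writing the ORC output.
        inputDf.select("name", "city").write().orc(args[1]);

        jsc.stop();
    }
}

Imposing the schema gives you typed columns without an extra pass over the data for inference, and dropping unused columns before the write keeps the ORC output smaller.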