Is there any option in Sqoop to import data from an RDBMS and store it in ORC file format in HDFS?
Alternative tried: imported in text format and used a temp table to read the data into an ORC table.
Sqoop import supports only the formats below:
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data as Parquet Files (from Sqoop version 1.4.6)
Currently there is no option to import RDBMS table data directly as an ORC file using Sqoop. We can achieve the same result in two steps.
Example:
Step 1: Import the table data as a text file.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/text \
--as-textfile
Step 2: Run spark-shell from the command prompt to get a Scala REPL.
scala> val sqlHiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlHiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@638a9d61
scala> val textDF = sqlHiveContext.read.text("/user/cloudera/text")
textDF: org.apache.spark.sql.DataFrame = [value: string]
scala> textDF.write.orc("/user/cloudera/orc/")
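Note that read.text loads each line into a single string column named value, so the ORC files written above hold one string column rather than the original table schema. If typed columns are needed, a minimal sketch of splitting the lines before writing is below (assuming the default comma delimiter and the standard retail_db orders columns; the output path orc_typed is just an illustrative name):
// sketch only: adjust column names/types to match your table
import org.apache.spark.sql.functions.split

val cols = split(textDF("value"), ",")
val ordersDF = textDF.select(
  cols.getItem(0).cast("int").as("order_id"),
  cols.getItem(1).as("order_date"),
  cols.getItem(2).cast("int").as("order_customer_id"),
  cols.getItem(3).as("order_status"))

ordersDF.write.orc("/user/cloudera/orc_typed/")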
Step 3: Check the output.
[root@quickstart exercises]# hadoop fs -ls /user/cloudera/orc/
Found 5 items
-rw-r--r-- 1 cloudera cloudera 0 2018-02-13 05:59 /user/cloudera/orc/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 153598 2018-02-13 05:59 /user/cloudera/orc/part-r-00000-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153466 2018-02-13 05:59 /user/cloudera/orc/part-r-00001-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153725 2018-02-13 05:59 /user/cloudera/orc/part-r-00002-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 160907 2018-02-13 05:59 /user/cloudera/orc/part-r-00003-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
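As a quick sanity check (not part of the steps above), the same spark-shell session can read the ORC files back:
scala> val orcDF = sqlHiveContext.read.orc("/user/cloudera/orc/")
scala> orcDF.count()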
In the current version of Sqoop, it is not possible to import data from an RDBMS to HDFS in ORC format with a single command. This is a known issue in Sqoop; see the JIRA raised for it: https://issues.apache.org/jira/browse/SQOOP-2192
I think the only alternative available for now is the one you mentioned. I came across a similar use case and used the same two-step approach.
At least in Sqoop 1.4.5 there is HCatalog integration that supports the ORC file format (amongst others).
For example, you have the option
--hcatalog-storage-stanza
which can be set to
stored as orc tblproperties ("orc.compress"="SNAPPY")
Example:
sqoop import \
--connect jdbc:postgresql://foobar:5432/my_db \
--driver org.postgresql.Driver \
--connection-manager org.apache.sqoop.manager.GenericJdbcManager \
--username foo \
--password-file hdfs:///user/foobar/foo.txt \
--table fact \
--hcatalog-home /usr/hdp/current/hive-webhcat \
--hcatalog-database my_hcat_db \
--hcatalog-table fact \
--create-hcatalog-table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'
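Once the import completes, the HCatalog table is a regular Hive table, so it can be queried to verify the ORC data landed correctly. A quick check from spark-shell (assuming a HiveContext named sqlHiveContext as in the earlier answer; a plain hive or beeline session works just as well):
scala> sqlHiveContext.sql("SELECT COUNT(*) FROM my_hcat_db.fact").show()
scala> sqlHiveContext.sql("DESCRIBE FORMATTED my_hcat_db.fact").show(100, false)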