Sqoop import as OrC file

前端 未结 4 2049
野趣味
野趣味 2021-01-19 01:01

Is there any option in sqoop to import data from RDMS and store it as ORC file format in HDFS?

Alternatives tried: imported as text format and used a temp table to r

相关标签:
4条回答
  • 2021-01-19 01:27

    Sqoop import supports only below formats.

    --as-avrodatafile   Imports data to Avro Data Files
    
    --as-sequencefile   Imports data to SequenceFiles
    
    --as-textfile   Imports data as plain text (default)
    
    --as-parquetfile    Imports data as parquet file (from sqoop 1.4.6 version)
    
    0 讨论(0)
  • 2021-01-19 01:28

    Currently there is no option to import the rdms table data directly as ORC file using sqoop. We can achieve the same using two steps.

    1. Import the data in any available format (say text).
    2. Read the data using Spark SQL and save it as an orc file.

    Example: Step 1: Import the table data as a text file.

    sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
    --table orders \
    --target-dir /user/cloudera/text \
    --as-textfile
    

    Step 2: Use spark-shell on command prompt to get scala REPL command shell.

    scala> val sqlHiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlHiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@638a9d61
    
    scala> val textDF = sqlHiveContext.read.text("/user/cloudera/text")
    textDF: org.apache.spark.sql.DataFrame = [value: string]
    
    scala> textDF.write.orc("/user/cloudera/orc/")
    

    Step 3: Check the output.

    [root@quickstart exercises]# hadoop fs -ls /user/cloudera/orc/
    Found 5 items
    -rw-r--r--   1 cloudera cloudera          0 2018-02-13 05:59 /user/cloudera/orc/_SUCCESS
    -rw-r--r--   1 cloudera cloudera     153598 2018-02-13 05:59 /user/cloudera/orc/part-r-00000-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
    -rw-r--r--   1 cloudera cloudera     153466 2018-02-13 05:59 /user/cloudera/orc/part-r-00001-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
    -rw-r--r--   1 cloudera cloudera     153725 2018-02-13 05:59 /user/cloudera/orc/part-r-00002-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
    -rw-r--r--   1 cloudera cloudera     160907 2018-02-13 05:59 /user/cloudera/orc/part-r-00003-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
    
    0 讨论(0)
  • 2021-01-19 01:30

    In current version of sqoop available, it is not possible to import data from RDBS to HDFS in ORC format in a single shoot command. This is something known issue in sqoop. Reference link for this issue raised: https://issues.apache.org/jira/browse/SQOOP-2192

    I think the only alternative available for now, is the same as you mentioned. I also came across the similar use case, and have used the alternative two step approach.

    0 讨论(0)
  • 2021-01-19 01:50

    At least in Sqoop 1.4.5 there exists hcatalog integration that support orc file format (amongst others).

    For example you have the option

    --hcatalog-storage-stanza
    

    which can be set to

    stored as orc tblproperties ("orc.compress"="SNAPPY")
    

    Example:

    sqoop import 
     --connect jdbc:postgresql://foobar:5432/my_db 
     --driver org.postgresql.Driver 
     --connection-manager org.apache.sqoop.manager.GenericJdbcManager 
     --username foo 
     --password-file hdfs:///user/foobar/foo.txt 
     --table fact 
     --hcatalog-home /usr/hdp/current/hive-webhcat 
     --hcatalog-database my_hcat_db 
     --hcatalog-table fact 
     --create-hcatalog-table 
     --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'
    
    0 讨论(0)
提交回复
热议问题