Question
I have a table in Hive
CREATE TABLE tab_data (
rec_id INT,
rec_name STRING,
rec_value DECIMAL(3,1),
rec_created TIMESTAMP
) STORED AS PARQUET;
and I want to populate it with data from .csv files like these:
10|customer1|10.0|2016-09-07 08:38:00.0
20|customer2|24.0|2016-09-08 10:45:00.0
30|customer3|35.0|2016-09-10 03:26:00.0
40|customer1|46.0|2016-09-11 08:38:00.0
50|customer2|55.0|2016-09-12 10:45:00.0
60|customer3|62.0|2016-09-13 03:26:00.0
70|customer1|72.0|2016-09-14 08:38:00.0
80|customer2|23.0|2016-09-15 10:45:00.0
90|customer3|30.0|2016-09-16 03:26:00.0
using Spark and Scala, with the code below:
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.types.{DataTypes, IntegerType, StringType, StructField, StructType, TimestampType}

object MainApp {

  val spark = SparkSession
    .builder()
    .appName("MainApp")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()

  val sc = spark.sparkContext

  val inputPath = "hdfs://host.hdfs:8020/..../tab_data.csv"
  val outputPath = "hdfs://host.hdfs:8020/...../warehouse/test.db/tab_data"

  def main(args: Array[String]): Unit = {
    try {
      val DecimalType = DataTypes.createDecimalType(3, 1)

      /**
        * schema
        */
      val schema = StructType(List(
        StructField("rec_id", IntegerType, true),
        StructField("rec_name", StringType, true),
        StructField("rec_value", DecimalType),
        StructField("rec_created", TimestampType, true)))

      /**
        * Reading the data from HDFS
        */
      val data = spark
        .read
        .option("sep", "|")
        .schema(schema)
        .csv(inputPath)

      data.show(truncate = false)
      data.schema.printTreeString()

      /**
        * Writing the data as Parquet
        */
      data
        .write
        .mode(SaveMode.Append)
        .parquet(outputPath)
    } finally {
      sc.stop()
      spark.stop()
    }
  }
}
The problem is that I am getting this output
+------+--------+---------+-----------+
|rec_id|rec_name|rec_value|rec_created|
+------+--------+---------+-----------+
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
+------+--------+---------+-----------+
root
|-- rec_id: integer (nullable = true)
|-- rec_name: string (nullable = true)
|-- rec_value: decimal(3,1) (nullable = true)
|-- rec_created: timestamp (nullable = true)
The schema is fine, but the data is not loading properly into the table:
SELECT * FROM tab_data;
+------------------+--------------------+---------------------+-----------------------+--+
| tab_data.rec_id | tab_data.rec_name | tab_data.rec_value | tab_data.rec_created |
+------------------+--------------------+---------------------+-----------------------+--+
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
+------------------+--------------------+---------------------+-----------------------+--+
What am I doing wrong?
I'm new to Spark and some help would be appreciated.
Answer 1:
To deal with issues between Spark, Hive, and Parquet, set up your SparkSession as follows:
val spark = SparkSession
  .builder()
  .appName("CsvToParquet")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "200")        // change to a more reasonable number of partitions for our data
  .config("spark.sql.parquet.writeLegacyFormat", true)  // to avoid data type issues between Spark and Hive
  .getOrCreate()

// The convention used by Spark to write Parquet data is configurable.
// It is controlled by the property spark.sql.parquet.writeLegacyFormat,
// whose default value is false. If set to true, Spark will use the same
// convention as Hive for writing the Parquet data.
Afterwards, read the .csv data as follows:
val data = spark
  .read
  .option("sep", "|")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.S") // to read timestamp fields
  .option("inferSchema", false)                       // false by default
  .schema(schema)
  .csv(inputPath)
Then write the data as Parquet with no compression (by default the data is compressed), as follows:
data
  .write
  .mode(SaveMode.Append)
  .option("compression", "none") // assuming no data compression
  .parquet(outputPath)
Note: the likely reason Hive cannot query the data is that Spark compresses the data in Snappy format by default, while your CREATE TABLE statement stores the data as Parquet without compression.
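As a quick sanity check (a minimal sketch, reusing the spark session and outputPath from the code above), you can read the freshly written Parquet files back with Spark to confirm that the files themselves are readable and that the problem is only on the Hive side:

// Read the Parquet files Spark just wrote and inspect them.
val written = spark.read.parquet(outputPath)
written.printSchema()           // should match the schema of the DataFrame you wrote
written.show(truncate = false)  // rows should no longer be null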
Answer 2:
You are getting null values in all columns because one of the columns, read as String, cannot be converted to Timestamp type.
To convert the string to a timestamp, specify the timestamp format with the option("timestampFormat","yyyy-MM-dd HH:mm:ss.S") option while loading the csv data.
Check the code below.
Schema
scala> val schema = StructType(List(
         StructField("rec_id", IntegerType, true),
         StructField("rec_name", StringType, true),
         StructField("rec_value", DecimalType(3,1)),
         StructField("rec_created", TimestampType, true))
       )
Loading CSV Data
scala> val df = spark
         .read
         .option("sep", "|")
         .option("inferSchema", "true")
         .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.S")
         .schema(schema)
         .csv("/tmp/sample")
scala> df.show(false)
+------+---------+---------+-------------------+
|rec_id|rec_name |rec_value|rec_created |
+------+---------+---------+-------------------+
|10 |customer1|10.0 |2016-09-07 08:38:00|
|20 |customer2|24.0 |2016-09-08 10:45:00|
|30 |customer3|35.0 |2016-09-10 03:26:00|
|40 |customer1|46.0 |2016-09-11 08:38:00|
|50 |customer2|55.0 |2016-09-12 10:45:00|
|60 |customer3|62.0 |2016-09-13 03:26:00|
|70 |customer1|72.0 |2016-09-14 08:38:00|
|80 |customer2|23.0 |2016-09-15 10:45:00|
|90 |customer3|30.0 |2016-09-16 03:26:00|
+------+---------+---------+-------------------+
Updated
Since the table is a managed table, you don't need to set all those parameters; you can use the insertInto function to insert the data into the table.
df.write.mode("append").insertInto("tab_data")
Source: https://stackoverflow.com/questions/62997718/csv-data-is-not-loading-properly-as-parquet-using-spark