I have saved a remote DB table in Hive using the saveAsTable
method. Now, when I try to access the Hive table data from the CLI with select * from table_name, the query fails with a ParquetDecodingException (Can not read value at 0 in block -1).
One more way to catch a possible discrepancy is to eyeball the difference between the schemas of the Parquet files produced by the two sources, say Hive and Spark. You can dump a schema with parquet-tools (brew install parquet-tools
on macOS):
λ $ parquet-tools schema /usr/local/Cellar/apache-drill/1.16.0/libexec/sample-data/nation.parquet
message root {
required int64 N_NATIONKEY;
required binary N_NAME (UTF8);
required int64 N_REGIONKEY;
required binary N_COMMENT (UTF8);
}
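For comparison, you can also dump the schema Spark infers for the suspect file and put it next to the parquet-tools output, paying attention to the physical types and to which fields are required versus optional. A minimal PySpark sketch, with a hypothetical file path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-check").getOrCreate()
# Print the Spark-side view of the Parquet schema, including nullability.
spark.read.parquet("/path/to/suspect.parquet").printSchema()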
I had a similar error (though at a positive index in a non-negative block), and it came from the fact that I had created the Parquet data with some Spark DataFrame fields marked as non-nullable while the data actually contained nulls.
In my case, I therefore interpret the error as Spark attempting to read a column declared non-nullable and stumbling across an unexpected null value.
To add to the confusion, after reading the Parquet file, Spark reports with printSchema()
that all the fields are nullable, whether they are or not. However, in my case, making them really nullable in the original Parquet file solved the problem.
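If you are in the same situation, one way out is to relax the schema before writing by rebuilding it with every field marked nullable. A minimal sketch, assuming spark is an active SparkSession and df is the DataFrame about to be written; the helper name and output path are made up, and it only touches top-level fields, not nested structs:

from pyspark.sql.types import StructType, StructField

def make_nullable(schema):
    # Copy the schema with every top-level field flagged as nullable.
    return StructType([StructField(f.name, f.dataType, nullable=True) for f in schema.fields])

relaxed_df = spark.createDataFrame(df.rdd, make_nullable(df.schema))
relaxed_df.write.parquet("/tmp/relaxed_output")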
Now, the fact that the error happens at "0 in block -1" is suspicious: it almost looks as if the data was not found at all, since block -1 suggests Spark has not even started reading anything (just a guess).
Problem: Facing the issue below while querying data through Impyla (the data was written by a Spark job):
ERROR: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1521667682013_4868_1_00, diagnostics=[Task failed, taskId=task_1521667682013_4868_1_00_000082, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://shastina/sys/datalake_dev/venmo/data/managed_zone/integration/ACCOUNT_20180305/part-r-00082-bc0c080c-4080-4f6b-9b94-f5bafb5234db.snappy.parquet
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Root Cause:
This issue is caused by the different Parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as a fixed-length byte array (FIXED_LEN_BYTE_ARRAY). In Spark 1.4 and later, the default convention is the standard Parquet representation for the decimal datatype, in which the underlying physical type changes depending on the precision of the column.
For example, per the Parquet format specification, DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with datatypes that have different representations in the two Parquet conventions. For example, a DECIMAL(10,3) column is written as int64 under the standard convention but as a fixed-length byte array under the Hive convention, so a reader expecting one layout cannot decode the other. If you are not aware of the internal representation of your datatypes, it is safest to read with the same convention that was used for writing. With Hive, you do not have the flexibility to choose the Parquet convention, but with Spark, you do.
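A quick way to see the two conventions side by side is to write the same decimal column twice, once per convention, and dump both files with parquet-tools as shown earlier. A sketch, with made-up output paths:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.range(3).select(expr("cast(id as decimal(10,3)) as amount"))

# Standard convention: decimal(10,3) is stored as int64.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
df.write.mode("overwrite").parquet("/tmp/decimal_standard")

# Legacy (Hive-compatible) convention: the same column is stored as a fixed-length byte array.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.mode("overwrite").parquet("/tmp/decimal_legacy")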
Solution: The convention Spark uses to write Parquet data is configurable via the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If it is set to "true", Spark will use the same convention as Hive for writing the Parquet data, which resolves the issue:
--conf "spark.sql.parquet.writeLegacyFormat=true"
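The same setting can also be applied when building the session in code. A minimal PySpark sketch; the app name and the database/table names are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("write-legacy-parquet")                          # hypothetical app name
         .config("spark.sql.parquet.writeLegacyFormat", "true")    # Hive-compatible decimals
         .enableHiveSupport()
         .getOrCreate())

# Re-write the table so that Hive/Impala can decode its decimal columns.
spark.table("staging_db.account_staging") \
     .write.mode("overwrite").format("parquet").saveAsTable("managed_zone.account")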
It looks like a schema mismatch problem here. If you set your schema to be not nullable and create your DataFrame with a None value, Spark will throw ValueError: This field is not nullable, but got None.
[Pyspark]
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=False)]))

# It will throw "ValueError".
df = spark.createDataFrame([[[None]], [[2]]], schema=schema)
df.show()
But that is not the case if you use a udf.
With the same schema, a udf transformation won't throw ValueError even if the udf returns None (or a value of the wrong type), and that is where the data/schema mismatch happens.
For example:
df = spark.createDataFrame([[[1]], [[2]]], schema=schema)

def throw_none():
    def _throw_none(x):
        if x[0][0] == 1:
            return [['I AM ONE']]
        else:
            return x
    return udf(_throw_none, schema)

# Since column A only accepts IntegerType, the string "I AM ONE" in the first
# row is silently turned into null. But Spark did not throw ValueError this
# time! This is where the data/schema type mismatch happens!
df = df.select(throw_none()(col("value")).name('value'))
df.show()
Then the following Parquet write and read will throw the parquet.io.ParquetDecodingException error.
df.write.parquet("tmp")
spark.read.parquet("tmp").collect()
So be very careful about null values if you are using a udf: return the right data type from your udf. And unless it is really necessary, please don't set nullable=False in your StructField; setting nullable=True will avoid the whole problem.
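For completeness, this is roughly what "returning the right data type" means for the example above; a sketch where the replacement value 100 is arbitrary:

def fix_one():
    def _fix_one(x):
        if x[0][0] == 1:
            return [[100]]   # an integer, matching StructField('A', IntegerType())
        else:
            return x
    return udf(_fix_one, schema)

df = spark.createDataFrame([[[1]], [[2]]], schema=schema)
df = df.select(fix_one()(col("value")).name('value'))

# With no type mismatch (and no unexpected nulls), the round trip succeeds.
df.write.parquet("tmp_fixed")
spark.read.parquet("tmp_fixed").show()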
I had a similar error; in my case, I was missing the default constructor.
Are you able to use Avro instead of Parquet to store your Hive table? I ran into this issue because I was using Hive's DECIMAL datatype, and Parquet written from Spark doesn't play nicely with DECIMAL. If you post your table schema and some data samples, debugging will be easier.
Another possible option, from the Databricks forum, is to use a DOUBLE instead of a DECIMAL, but that was not an option for my data, so I can't report on whether it works.
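If the precision loss is acceptable for your data, the DOUBLE workaround is just a cast before writing. A minimal sketch; the table and column names are hypothetical:

from pyspark.sql.functions import col

df = spark.table("staging_db.account_staging")                # hypothetical source table
df = df.withColumn("balance", col("balance").cast("double"))  # column was decimal(18,2)
df.write.mode("overwrite").format("parquet").saveAsTable("managed_zone.account_double")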