How do I stop messages like the following from appearing on my spark-shell console?
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading n
The solution from a SPARK-8118 issue comment seems to work:
You can disable the chatty output by creating a properties file with these contents:
org.apache.parquet.handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=SEVERE
Then pass the path of the file to Spark when the application is submitted. Assuming the file lives at /tmp/parquet.logging.properties (which, of course, needs to be available on all worker nodes):
spark-submit \
--conf spark.driver.extraJavaOptions="-Djava.util.logging.config.file=/tmp/parquet.logging.properties" \
--conf spark.executor.extraJavaOptions="-Djava.util.logging.config.file=/tmp/parquet.logging.properties" \
...
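Since the question is about the spark-shell console, the same conf options should work when launching the shell directly; a minimal sketch, assuming the same /tmp/parquet.logging.properties path as above:
spark-shell \
--conf spark.driver.extraJavaOptions="-Djava.util.logging.config.file=/tmp/parquet.logging.properties" \
--conf spark.executor.extraJavaOptions="-Djava.util.logging.config.file=/tmp/parquet.logging.properties"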
Credits go to Justin Bailey.
Not a solution, but if you build your own Spark, this file: https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java contains most of the log message generation, which you can comment out for now.
I believe this regressed; there are some large merges/changes being made to the Parquet integration: https://issues.apache.org/jira/browse/SPARK-4412
To turn off all messages except ERROR, you should edit your conf/log4j.properties file, changing the following line:
log4j.rootCategory=INFO, console
into
log4j.rootCategory=ERROR, console
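If you prefer not to edit the file, the root level can also be raised at runtime from inside spark-shell (a minimal sketch; note this only adjusts log4j, so it won't silence Parquet messages emitted through java.util.logging):
// inside spark-shell: raise the log4j root logger to ERROR at runtime
sc.setLogLevel("ERROR")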
Hope it helps!
I know this question was WRT Spark, but I recently had this issue when using Parquet with Hive in CDH 5.x and found a work-around. Details are here: https://issues.apache.org/jira/browse/SPARK-4412?focusedCommentId=16118403
Contents of my comment from that JIRA ticket below:
This is also an issue in the version of Parquet distributed in CDH 5.x. In this case, I am using parquet-1.5.0-cdh5.8.4 (sources available here: http://archive.cloudera.com/cdh5/cdh/5).
However, I've found a work-around for mapreduce jobs submitted via Hive. I'm sure this can be adapted for use with Spark as well.
- Add the following properties to your job's configuration (in my case, I added them to hive-site.xml since adding them to mapred-site.xml didn't work):
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Djava.util.logging.config.file=parquet-logging.properties</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Djava.util.logging.config.file=parquet-logging.properties</value>
</property>
<property>
  <name>mapreduce.child.java.opts</name>
  <value>-Djava.util.logging.config.file=parquet-logging.properties</value>
</property>
- Create a file named parquet-logging.properties with the following contents:
# Note: I'm certain not every line here is necessary. I just added them to cover all possible
# class/facility names. You will want to tailor this as per your needs.
.level=WARNING
java.util.logging.ConsoleHandler.level=WARNING
parquet.handlers=java.util.logging.ConsoleHandler
parquet.hadoop.handlers=java.util.logging.ConsoleHandler
org.apache.parquet.handlers=java.util.logging.ConsoleHandler
org.apache.parquet.hadoop.handlers=java.util.logging.ConsoleHandler
parquet.level=WARNING
parquet.hadoop.level=WARNING
org.apache.parquet.level=WARNING
org.apache.parquet.hadoop.level=WARNING
- Add the file to the job. In Hive, this is most easily done like so:
ADD FILE /path/to/parquet-logging.properties;
With this done, when you run your Hive queries, Parquet should only log WARNING (and higher) level messages to the stdout container logs.
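As a sketch of how the same work-around might be adapted for Spark (the --files distribution and the relative path on executors are assumptions here, not something I verified on CDH):
spark-submit \
--files /path/to/parquet-logging.properties \
--conf spark.driver.extraJavaOptions="-Djava.util.logging.config.file=/path/to/parquet-logging.properties" \
--conf spark.executor.extraJavaOptions="-Djava.util.logging.config.file=parquet-logging.properties" \
...
The driver reads the full local path, while --files ships the file into each executor's working directory so the bare file name resolves there.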
This will work for Spark 2.0. Edit the file spark/log4j.properties and add:
log4j.logger.org.apache.spark.sql.execution.datasources.parquet=ERROR
log4j.logger.org.apache.spark.sql.execution.datasources.FileScanRDD=ERROR
log4j.logger.org.apache.hadoop.io.compress.CodecPool=ERROR
The lines for FileScanRDD and CodecPool will help with a couple of logs that are very verbose as well.
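If you'd rather not touch the properties file, the equivalent can usually be set programmatically from spark-shell or at the top of your job; a minimal sketch using the log4j 1.x API bundled with Spark 2.0:
import org.apache.log4j.{Level, Logger}
// silence the same three loggers listed above at runtime
Logger.getLogger("org.apache.spark.sql.execution.datasources.parquet").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.sql.execution.datasources.FileScanRDD").setLevel(Level.ERROR)
Logger.getLogger("org.apache.hadoop.io.compress.CodecPool").setLevel(Level.ERROR)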