How do I inspect the content of a Parquet file from the command line?
The only option I can see at the moment is
$ hadoop fs -get my-path local-file
$ parquet-tool
On Windows 10 x64, try Parq:
choco install parq
This installs everything into the current directory. You will either have to add this directory to the PATH manually, or run parq.exe
from within it.
My other answer builds parquet-reader
from source. This utility looks like it does much the same job.
By default, parquet-tools looks for files on the local filesystem, so to point it at HDFS you need to prefix the file path with hdfs://. In your case, you can do something like this:
parquet-tools head hdfs://localhost/<hdfs-path> | less
I had the same issue, and this worked fine for me. There is no need to download the file locally first.
I recommend just building and running the parquet-tools.jar for your Hadoop distribution.
Check out the GitHub project: https://github.com/apache/parquet-mr/tree/master/parquet-tools
hadoop jar ./parquet-tools-<VERSION>.jar <command>
I've found this program really useful: https://github.com/chhantyal/parquet-cli
It lets you view Parquet files without having the whole infrastructure installed.
Just type:
pip install parquet-cli
parq input.parquet --head 10
If you're using HDFS, the following commands are very useful as they are frequently used (left here for future reference):
hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet
hadoop jar parquet-tools-1.9.0.jar head -n5 hdfs://path/to/file.snappy.parquet
You can use parquet-tools with the cat command and the --json option to view the files in JSON format without making a local copy.
Here is an example:
parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet
This prints out the data in JSON format:
{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}
Disclaimer: this was tested on Cloudera CDH 5.12.0.