Inspect Parquet from command line

再見小時候 2020-12-07 20:26

How do I inspect the content of a Parquet file from the command line?

The only option I see now is:

$ hadoop fs -get my-path local-file
$ parquet-tools
9 answers
  • 2020-12-07 20:42

    On Windows 10 x64, try Parq:

    choco install parq
    

    This installs everything into the current directory. You will have to add this directory to your PATH manually, or run parq.exe from within this directory.

    My other answer builds parquet-reader from source. This utility looks like it does much the same job.

  • 2020-12-07 20:43

    By default, parquet-tools looks in the local filesystem, so to point it at HDFS you need to prefix the file path with hdfs://. In your case, you can do something like this:

    parquet-tools head hdfs://localhost/<hdfs-path> | less
    

    I had the same issue and it worked fine for me. There is no need to download the file locally first.
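    The same hdfs:// prefix works with the other parquet-tools subcommands too. A quick sketch (the paths are placeholders; check `parquet-tools --help` for the exact options in your version):

    ```shell
    # Print the file's schema (column names, types, repetition levels)
    parquet-tools schema hdfs://localhost/<hdfs-path>

    # Print row-group and column-chunk metadata (sizes, encodings, statistics)
    parquet-tools meta hdfs://localhost/<hdfs-path>

    # Dump all rows (can be large; pipe through less)
    parquet-tools cat hdfs://localhost/<hdfs-path> | less
    ```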

  • 2020-12-07 20:44

    I recommend just building and running the parquet-tools.jar for your Hadoop distribution.

    Check out the GitHub project: https://github.com/apache/parquet-mr/tree/master/parquet-tools

    hadoop jar ./parquet-tools-<VERSION>.jar <command>
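    For reference, a hedged sketch of the build steps (the -Plocal profile, which historically bundled the Hadoop dependencies into the jar for standalone use, may vary by version; check the module's README):

    ```shell
    # Clone the Apache parquet-mr repo and build only the tools module
    git clone https://github.com/apache/parquet-mr.git
    cd parquet-mr/parquet-tools
    mvn clean package -Plocal

    # Then run it through hadoop against HDFS or local files
    hadoop jar target/parquet-tools-<VERSION>.jar schema /path/to/file.parquet
    ```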

  • 2020-12-07 20:45

    I've found this program really useful: https://github.com/chhantyal/parquet-cli

    It lets you view Parquet files without having the whole infrastructure installed.

    Just type:

    pip install parquet-cli
    parq input.parquet --head 10
    
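    Besides --head, the parq CLI has a few other options. This sketch assumes the current chhantyal/parquet-cli interface, so double-check against `parq --help`:

    ```shell
    parq input.parquet            # summary metadata (row count, row groups, ...)
    parq input.parquet --schema   # column names and types
    parq input.parquet --count    # number of rows
    parq input.parquet --tail 10  # last 10 rows
    ```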
  • 2020-12-07 20:45

    If you're using HDFS, the following frequently used commands are handy (left here for future reference):

    hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet
    hadoop jar parquet-tools-1.9.0.jar head -n5 hdfs://path/to/file.snappy.parquet
    
  • 2020-12-07 20:46

    You can use parquet-tools with the cat command and the --json option to view files as JSON, without making a local copy.

    Here is an example:

    parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

    This prints out the data in JSON format:

    {"name":"gil","age":48,"city":"london"}
    {"name":"jane","age":30,"city":"new york"}
    {"name":"jordan","age":18,"city":"toronto"}
    

    Disclaimer: this was tested in Cloudera CDH 5.12.0
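    If none of these tools are at hand, a quick sanity check (a sketch only, not a real inspector) is to look at the magic number: every Parquet file begins and ends with the 4-byte ASCII sequence PAR1. This verifies only the container format, not that the metadata or data pages are valid.

    ```shell
    # Returns success (0) iff the file starts and ends with the Parquet
    # magic bytes "PAR1". Works on a local copy, not on hdfs:// paths.
    is_parquet() {
      [ "$(head -c 4 "$1")" = "PAR1" ] && [ "$(tail -c 4 "$1")" = "PAR1" ]
    }

    # usage: is_parquet part-r-00000.parquet && echo "looks like Parquet"
    ```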
