How do I inspect the content of a Parquet file from the command line?
The only option I can see at the moment is
$ hadoop fs -get my-path local-file
$ parquet-tool
On Windows 10 x64, try Parq:
choco install parq
This installs everything into the current directory. You will either have to add this directory to the PATH manually, or run parq.exe
from within it.
My other answer builds parquet-reader
from source. This utility looks like it does much the same job.
By default, parquet-tools looks for files on the local filesystem, so to point it at HDFS you need to prefix the file path with hdfs://. In your case, you can do something like this:
parquet-tools head hdfs://localhost/<hdfs-path> | less
I had the same issue, and this worked fine for me. There is no need to download the file locally first.
I recommend just building and running the parquet-tools.jar for your Hadoop distribution.
Check out the GitHub project: https://github.com/apache/parquet-mr/tree/master/parquet-tools
hadoop jar ./parquet-tools-<VERSION>.jar <command>
I've found this program really useful: https://github.com/chhantyal/parquet-cli
It lets you view Parquet files without having the whole infrastructure installed.
Just type:
pip install parquet-cli
parq input.parquet --head 10
If you're using HDFS, the following commands are very useful as they are frequently used (left here for future reference):
hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet
hadoop jar parquet-tools-1.9.0.jar head -n5 hdfs://path/to/file.snappy.parquet
You can use parquet-tools with the cat command and the --json option to view the files in JSON format without making a local copy.
Here is an example:
parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet
This prints out the data in JSON format:
{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}
Disclaimer: this was tested on Cloudera CDH 5.12.0.