Get a few lines of HDFS data

前端未结


I have 2 GB of data in HDFS.

Is it possible to get a few lines of that data at random, as we would on the Unix command line?

cat iris2.cs


                      
9 answers

攒了一身酷, 2021-02-04 02:50

I was using tail and cat on an Avro file on an HDFS cluster, but the output was not printed in the correct encoding. The following worked well for me:
hdfs dfs -text hdfs://<path_of_directory>/part-m-00000.avro | head -n 1

Change 1 to a higher integer to print more records from the Avro file.
                                                                        
                                                        
            
            
              
                
感情败类, 2021-02-04 03:08

hdfs dfs -cat yourFile | shuf -n <number_of_line>


This will do the trick, although shuf is not available on macOS by default. You can get it by installing GNU coreutils (Homebrew installs it as gshuf).
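
Where shuf is not available, a similar sample can be drawn with awk, which ships with macOS. The following is a minimal sketch using reservoir sampling; the function name sample_lines is made up for illustration. The seed comes from srand() with the current time, so two runs within the same second may return the same sample, and the sampled lines come back in reservoir order rather than fully shuffled.

```shell
# sample_lines N: read lines on stdin, print N of them chosen uniformly at
# random (or all of them if the input has fewer than N lines).
# Hypothetical helper; not part of Hadoop or coreutils.
sample_lines() {
  awk -v n="$1" '
    BEGIN { srand() }
    NR <= n { keep[NR] = $0; next }   # fill the reservoir with the first n lines
    {
      j = int(rand() * NR) + 1        # uniform integer in 1..NR
      if (j <= n) keep[j] = $0        # replace a kept line with probability n/NR
    }
    END {
      m = (NR < n) ? NR : n
      for (i = 1; i <= m; i++) print keep[i]
    }
  '
}

# Usage (yourFile is a placeholder, as above):
#   hdfs dfs -cat yourFile | sample_lines 50
```

Only N lines are held in memory at any time, so this also works when the file is much larger than RAM.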
                                                                        
                                                        
            
            
              
                
难免孤独, 2021-02-04 03:10

My suggestion would be to load the data into a Hive table; then you can do something like this:

SELECT column1, column2 FROM (
    SELECT iris2.column1, iris2.column2, rand() AS r
    FROM iris2
    ORDER BY r
) t
LIMIT 50;


EDIT:
This is a simpler version of that query:

SELECT iris2.column1, iris2.column2
FROM iris2
ORDER BY rand()
LIMIT 50;

                                                                        
                                                        
            
            
              
                
独厮守ぢ, 2021-02-04 03:11

Piping the output into the native head command,

hadoop fs -cat /your/file | head

is efficient here, because cat closes the stream as soon as head has read the lines it needs.

To get the tail, Hadoop has a dedicated, efficient command:

hadoop fs -tail /your/file


Unfortunately, it returns the last kilobyte of the data, not a given number of lines.
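
When only the last few lines are needed and they fit within that final kilobyte, the output of -tail can be trimmed with the local tail. A minimal sketch, assuming reasonably short lines; hdfs_last_lines is a hypothetical helper name, not a Hadoop command.

```shell
# hdfs_last_lines PATH N: print up to the last N lines of an HDFS file,
# reading only its final kilobyte. Hypothetical helper name.
# Caveat: if N lines do not fit in 1 KB, fewer lines come back and the
# first one may be cut mid-line.
hdfs_last_lines() {
  hadoop fs -tail "$1" | tail -n "$2"
}

# Usage (path is a placeholder):
#   hdfs_last_lines /your/file 5
```

For more lines than fit in one kilobyte, the whole file has to be streamed instead, for example with hadoop fs -cat piped into tail -n.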
                                                                        
                                                        
            
            
              
                
清歌不尽, 2021-02-04 03:12

The head and tail commands on Linux display the first 10 and last 10 lines respectively by default. However, their output is not randomly sampled: the lines appear in the same order as in the file itself.

The Linux shuf (shuffle) command generates random permutations of its input lines, and it can be combined with the Hadoop commands like so:

$ hadoop fs -cat <file_path_on_hdfs> | shuf -n <N>

So, if iris2.csv is a file on HDFS and you want 50 lines randomly sampled from it:

$ hadoop fs -cat /file_path_on_hdfs/iris2.csv | shuf -n 50

Note: The Linux sort command could also be used (for example, GNU sort -R), but shuf is faster and samples better: sort -R orders lines by a random hash, so identical lines always end up next to each other.
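
If iris2.csv starts with a header row (an assumption; the question does not say), piping the whole file through shuf treats the header as just another line. A minimal sketch that passes the first line through unchanged and samples only the rest; sample_keep_header is a hypothetical helper name.

```shell
# sample_keep_header N: copy the first stdin line (assumed to be a CSV
# header) to stdout, then print N randomly chosen lines from the rest.
# Hypothetical helper; relies on GNU shuf.
sample_keep_header() {
  # read consumes exactly one line from the pipe; keep a final line that
  # lacks a trailing newline, and stop quietly on empty input
  IFS= read -r header || [ -n "$header" ] || return 0
  printf '%s\n' "$header"
  shuf -n "$1"
}

# Usage (path as in the example above):
#   hadoop fs -cat /file_path_on_hdfs/iris2.csv | sample_keep_header 50
```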
                                                                        
                                                        
            
            
              
                
有刺的猬, 2021-02-04 03:12

Working code:
hadoop fs -cat /tmp/a/b/20200630.xls | head -n 10

hadoop fs -cat /tmp/a/b/20200630.xls | tail -n 3

                                                                        
                                                        
            
            
              
                