How to extract table data from PDF as CSV from the command line?

前端未结

关注

 5  2002

生来不讨喜 2021-02-02 12:13

I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices.

pdftotext -layout DAC06


      
      
        
          5条回答        

        
                    
            
            
                         
                
              
              
                
                   心在旅途
                                             
                
                
                (楼主)
            
              
              
                2021-02-02 12:50
              

            
            
                        
I'll offer you another solution as well.

While in this case the pdftotext method works with reasonable effort, there may be cases where not each page has the same column widths (as your rather benign PDF shows).

Here the not-so-well-known, but pretty cool Free and OpenSource Software Tabula-Extractor is the best choice.

I myself am using the direct GitHub checkout:

$ cd $HOME ; mkdir svn-stuff ; cd svn-stuff
$ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor


I wrote myself a pretty simple wrapper script like this:

$ cat ~/bin/tabulaextr

 #!/bin/bash
 cd ${HOME}/svn-stuff/git.tabula-extractor/bin
 ./tabula $@


Since ~/bin/ is in my $PATH, I just run

$ tabulaextr --pages all                                 \
         $(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
        | tee my.csv


to extract all the tables from all pages and convert them to a single CSV file.

The first ten (out of a total of 8727) lines of the CVS look like this:

$ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv 

 Retail Branding,Marketing Name,Device,Model
 "","",AD681H,Smartfren Andromax AD681H
 "","",FJL21,FJL21
 "","",Luno,Luno
 "","",T31,Panasonic T31
 "","",hws7721g,MediaPad 7 Youth 2
 3Q,OC1020A,OC1020A,OC1020A
 7Eleven,IN265,IN265,IN265
 A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1
 AG Mobile,Status,Status,Status


which in the original PDF look like this:



It even got these lines on the last page, 293, right:

 nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A
 nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A


which look on the PDF page like this:



TabulaPDF and Tabula-Extractor are really, really cool for jobs like this! 



Update

Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:


    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它5个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复