How to extract table data from PDF as CSV from the command line?

前端未结

关注

 5  2007

生来不讨喜 2021-02-02 12:13

I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices.

pdftotext -layout DAC06


      
      
        
          5条回答        

        
                    
            
            
                         
                
              
              
                
                   孤街浪徒
                                             
                
                
                (楼主)
            
              
              
                2021-02-02 12:37
              

            
            
                        
What you want is rather easy, but you're having a different problem also (I'm not sure you are aware of it...).

First, you should add -nopgbrk for ("No pagebreaks, please!") to your command. Because these pesky ^L characters which otherwise appear in the output then need not be filtered out later.

Adding a grep -vE '(Supported Devices|^$)' will then filter out all the lines you do not want, including empty lines, or lines with only spaces:

pdftotext -layout -nopgbrk                           \
   DAC06E7D1302B790429AF6E84696FCFAB20B.pdf -        \
 | grep -vE '(Supported Devices|^$|Marketing Name)'  \
 | gsed '$d'                                         \
 | gsed -r 's# +#,#g'                                \
 | gsed '# ##g'                                      \
 > output2.csv


However, your other problem is this:


Some of the table fields are empty.
Empty fields appear with the -layout option as a series of space characters, sometimes even two in the same row.
However, the text columns are not spaced identically from page to page.
Therefor you will not know from line to line how many spaces you need to regard as a an "empty CSV field" (where you'd need an extra , separator).
As a consequence, your current code will show only one, two or three (instead of four) fields for some lines, and these fields end up in the wrong columns!


There is a workaround for this:


Add the -x ... -y ... -W ... -H ... parameters to pdftotext to crop the PDF column-wise.
Then append the columns with a combination of utilities like paste and column.


The following command extracts the first columns:

pdftotext -layout -x  38 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 1st-columns.txt


These are for second, third and fourth columns:

pdftotext -layout -x 214 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 2nd-columns.txt

pdftotext -layout -x 390 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 3rd-columns.txt

pdftotext -layout -x 567 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 4th-columns.txt


BTW, I cheated a bit: in order to get a clue about what values to use for -x, -y, -W and -H I did first run this command in order to find the exact coordinates of the column header words:

pdftotext -f 1 -l 1 -layout -bbox \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | head -n 10


It's always good if you know how to read and make use of pdftotext -h.   :-)

Anyway, how to append the four text files as columns side by side, with the proper CVS separator in between, you should find out yourself. Or ask a new question :-)
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它5个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复