How to extract table data from PDF as CSV from the command line?

生来不讨喜 2021-02-02 12:13

I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices.

pdftotext -layout DAC06         


        
5 Answers
  •  孤街浪徒
    2021-02-02 12:37

    What you want is rather easy, but you also have a second problem (I'm not sure you are aware of it...).

    First, you should add -nopgbrk ("No page breaks, please!") to your command. That way the pesky ^L (form feed) characters, which would otherwise appear in the output, do not need to be filtered out later.

    Adding a grep -vE '(Supported Devices|^$)' will then filter out the lines you do not want: the page headers and the empty lines. Note that ^$ only matches completely empty lines; use ^ *$ if lines containing only spaces should go as well. The full command also excludes the Marketing Name column-header line:

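    # -nopgbrk suppresses the ^L page breaks; "-" sends the text to stdout.
    # grep drops page headers, column headers and empty lines; gsed (GNU sed)
    # deletes the last line, turns runs of spaces into commas, removes leftover spaces.
    pdftotext -layout -nopgbrk                           \
       DAC06E7D1302B790429AF6E84696FCFAB20B.pdf -        \
     | grep -vE '(Supported Devices|^$|Marketing Name)'  \
     | gsed '$d'                                         \
     | gsed -r 's# +#,#g'                                \
     | gsed 's# ##g'                                     \
     > output2.csv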
    

    However, your other problem is this:

    1. Some of the table fields are empty.
    2. With the -layout option, empty fields appear as a run of space characters, sometimes even two empty fields in the same row.
    3. However, the text columns are not spaced identically from page to page.
    4. Therefore you will not know from line to line how many spaces you need to regard as an "empty CSV field" (where you'd need an extra , separator).
    5. As a consequence, your current code will show only one, two or three (instead of four) fields for some lines, and these fields end up in the wrong columns!
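One way to see this problem in an actual output file is to count the comma-separated fields on each line. The following is only a minimal sketch: the helper name count_fields is made up for illustration, and it assumes that no field contains a comma of its own.

```shell
# Hypothetical helper: read CSV on stdin and print "<field count> <number of lines>".
# Assumes plain comma-separated text without quoted commas inside fields.
count_fields() {
    awk -F, '{ n[NF]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Usage, assuming output2.csv was produced by the pipeline shown earlier:
#   count_fields < output2.csv
```

If every row really had four fields, the only output line would start with 4; any other count points to rows where an empty field was swallowed.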

    There is a workaround for this:

    1. Add the -x ... -y ... -W ... -H ... parameters to pdftotext to crop the PDF column-wise.
    2. Then append the columns with a combination of utilities like paste and column.

    The following command extracts the first column:

    pdftotext -layout -x  38 -y 77 -W 176 -H 500  \
              DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 1st-columns.txt
    

    These are for the second, third and fourth columns:

    pdftotext -layout -x 214 -y 77 -W 176 -H 500  \
              DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 2nd-columns.txt
    
    pdftotext -layout -x 390 -y 77 -W 176 -H 500  \
              DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 3rd-columns.txt
    
    pdftotext -layout -x 567 -y 77 -W 176 -H 500  \
              DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 4th-columns.txt
    

    BTW, I cheated a bit: to get a clue about what values to use for -x, -y, -W and -H, I first ran this command to find the exact coordinates of the column header words:

    pdftotext -f 1 -l 1 -layout -bbox \
              DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | head -n 10
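The -bbox output is XHTML, so the coordinates are easier to read once the markup is stripped. The sketch below assumes that each word sits on its own line in the form <word xMin="..." yMin="..." xMax="..." yMax="...">text</word>; the helper name bbox_words is hypothetical. At the default resolution of 72 DPI the printed values should be usable directly for -x and -y, but treat that as an assumption to check against your own file.

```shell
# Hypothetical helper: reduce pdftotext -bbox output (on stdin) to
# "xMin yMin xMax yMax word", one word per line. Lines without a <word>
# element (page tags, XHTML header) are dropped.
bbox_words() {
    sed -n 's#.*<word xMin="\([^"]*\)" yMin="\([^"]*\)" xMax="\([^"]*\)" yMax="\([^"]*\)">\(.*\)</word>.*#\1 \2 \3 \4 \5#p'
}

# Usage, with the file name from the commands above:
#   pdftotext -f 1 -l 1 -bbox DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | bbox_words | head -n 10
```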
    

    It's always good if you know how to read and make use of pdftotext -h. :-)

    Anyway, how to append the four text files as columns side by side, with the proper CSV separator in between, is something you should find out yourself. Or ask a new question :-)
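As a possible starting point for that last step, paste can join the files line by line. This is a sketch under a strong assumption: every column file must have the same number of lines in the same row order, which the cropped pdftotext output does not guarantee (blank lines and page breaks may differ between columns). The helper name join_columns and the output file name are made up for illustration.

```shell
# Hypothetical helper: join the given text files side by side with commas,
# then trim the padding spaces that -layout leaves around each field.
# Assumes the files are line-aligned and that no field contains a comma.
join_columns() {
    paste -d, "$@" | sed -e 's/^ *//' -e 's/ *$//' -e 's/ *, */,/g'
}

# Usage, with the file names from the commands above:
#   join_columns 1st-columns.txt 2nd-columns.txt 3rd-columns.txt 4th-columns.txt > output.csv
```

Because each file contributes exactly one field per line, an empty table cell stays an empty CSV field instead of shifting the remaining values into the wrong column.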
