How to extract table data from PDF as CSV from the command line?

生来不讨喜 2021-02-02 12:13

I want to extract all rows from this PDF while ignoring the column headers and all page headers (i.e. "Supported Devices").

pdftotext -layout DAC06         
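
A minimal sketch of the post-processing step, assuming the text produced by `pdftotext -layout` separates columns with runs of two or more spaces and that each page header line contains only the title "Supported Devices". The function names and the `column_header_first_cell` parameter are hypothetical, introduced only for illustration; the real column header text has to be supplied by the caller.

```python
import csv
import io
import re

# Assumption: every page header line consists solely of this title.
PAGE_HEADER = "Supported Devices"


def layout_text_to_rows(text, column_header_first_cell):
    """Turn `pdftotext -layout` output into a list of table rows.

    column_header_first_cell (hypothetical parameter) is the text of the
    first cell of the column-header row, used to recognise and skip it.
    """
    rows = []
    # splitlines() also breaks on the form feed that marks a page break.
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == PAGE_HEADER:
            continue  # skip blank lines and page headers
        # Assumption: columns are separated by two or more whitespace characters.
        cells = re.split(r"\s{2,}", stripped)
        if cells[0] == column_header_first_cell:
            continue  # skip the repeated column-header row
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialise rows as CSV text, quoting cells that contain commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

This heuristic misaligns rows that contain empty cells, and it splits any cell that itself contains two consecutive spaces, so the result should be checked against the PDF.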


        
5 Answers
  •  清歌不尽
    2021-02-02 12:35

    If you want to extract tabular data from PDFs whose creation you control (for example, timesheets or contracts that your employees have to sign), the following solution will be cleaner:

    1. Create a PDF form with field IDs.

    2. Let people fill and save the PDF forms.

    3. Use Apache PDFBox, an open-source tool that can extract form data from a PDF. It includes a command-line example tool, PrintFields, which you would call as follows (with the PDFBox jars on the classpath) to print the desired field information:

      java org.apache.pdfbox.examples.interactive.form.PrintFields file.pdf
      

      For other options, see this question.

    As an alternative to the above workflow, you could also use a digital signature web service that supports filling in PDF forms and exporting the data to tables. One example is SignRequest, which lets you create templates and later export the data of signed documents. (Not affiliated, just found this myself.)
