Looking for recommendations on how to convert a PDF into a structured format

予麋鹿 2021-02-06 00:34

I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured f

2 Answers
  •  清歌不尽
    2021-02-06 01:01

    Convert the PDF to text with Xpdf's pdftotext command.

    I converted your file with the following:

    pdftotext.exe -layout -f 23 -l 510 AuctionBook2013.pdf AuctionBook2013.txt
    

    This conversion preserves the original physical layout of the text as closely as possible (that is what the -layout option does). The -f and -l options give the first and last page numbers of the range to extract.

    From there, parsing should be simple: a number in column 8 marks the first line of a record, and a blank line ends it. Follow the guide for the exact positioning of elements within a record.
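
The record-splitting rule described above could be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: "column 8" is taken to be counted from 1 (so index 7 in a Python string), and the names `starts_record` and `split_records` are made up for illustration. Slicing individual fields out of a record is left out, since the exact positions are not given here.

```python
from typing import Iterable, Iterator, List, Optional

# Assumption: "column 8" is 1-based, i.e. index 7 in a Python string.
RECORD_START_COLUMN = 8


def starts_record(line: str, column: int = RECORD_START_COLUMN) -> bool:
    """Return True if the character at the given 1-based column is a digit."""
    index = column - 1
    return len(line) > index and line[index].isdigit()


def split_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines of the text file into records.

    A record opens on a line with a digit in the start column and closes
    at the next blank line. Lines outside any record (page headers, for
    example) are skipped.
    """
    current: Optional[List[str]] = None
    for raw in lines:
        # Drop the line ending, plus the form feed that pdftotext emits
        # at page breaks, so the column positions are not shifted.
        line = raw.rstrip("\r\n").lstrip("\f")
        if current is None:
            if starts_record(line):
                current = [line]
        elif not line.strip():
            # A blank line ends the record that is currently open.
            yield current
            current = None
        else:
            current.append(line)
    # Emit a final record if the input ends without a trailing blank line.
    if current is not None:
        yield current


# Made-up sample lines, only to show the shape of the input.
sample = [
    "Header line",
    "       101 123 MAIN ST",
    "           Lot 5",
    "",
    "       102 456 OAK AVE",
]
records = list(split_records(sample))
print(len(records))
```

To run it on the converted file, pass an open file object for AuctionBook2013.txt to `split_records` instead of the sample list. Each record comes back as a list of lines with the leading spaces intact, so fixed-position slicing still works afterwards.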
