Looking for recommendation on how to convert PDF into structured format

你离开我真会死。 提交于 2019-12-03 06:55:07

After mucking around with this for 3 hours, I was able to create a parseable XML document from the data. Unfortunately, I was unsuccessful with putting together a completely reusable set of steps that I can use for future auctions publications.

As an aside, I did attempt to call and ask Los Angeles County if they could provide an alternative format of the properties up for auction (excel, etc) and the answer was no. That's government for you.

Here's a high-level view of my approach:

I used http://xmlbeautifier.com/ as my XML beautifier / validator because it was fast and it gave accurate error reporting, including line numbers.

Use Homebrew to install Poppler for Mac:

brew install poppler

After Poppler is installed, you should have access to the pdftotext utility to convert the PDF:

pdftotext -layout -f 24 -l 687 AuctionBook2013.pdf auction_book.txt

Here's a preview of the XML (Click here for full XML):

<?xml version="1.0" encoding="UTF-8"?>
<listings>
   <item id="1">
      <nsb>536</nsb>
      <minbid>3,422</minbid>
      <apn>2006 003 001</apn>
      <delinquent_year>03</delinquent_year>
      <apn_old>2006 003 001</apn_old>
      <description>LICENSED SURVEYOR'S MAP
          AS PER BK 25 PG 28 OF L S LOT 1              
          BLK 1 ASSESSED TO    J   AND   S
          LIMITED LLC C/O DUNA CSARDAS -
          JULIUS JANCSO LOCATION COUNTY OF
          LOS ANGELES</description>
      <address>VACANT LOT</address>
   </item>

Edit: Adding the Ruby I wrote to convert the XML to a CSV.

require 'rexml/document'
require 'CSV'

class Auction

  def initialize

    f = File.new('AuctionBook2013.xml', 'r')
    doc = REXML::Document.new(f)

    CSV.open("auction.csv", "w+b") do |csv|
      csv << ['id', 'minbid', 'apn', 'delinquent_year', 'apn_old', 'description', 'address']

      doc.elements.each('/listings/item') do |item|
        csv << [item.attributes['id'],
                item.elements['minbid'].text,
                item.elements['apn'].text,
                item.elements['delinquent_year'].text,
                item.elements['apn_old'].text,
                item.elements['description'].text,
                item.elements['address'].text]
      end
    end
  end
end

a = Auction.new()

Link to Final CSV

Convert to text with Xpdf using command pdftotext.

I converted your file with the following:

pdftottext.exe -layout -f 23 -l 510 AuctionBook2013.pdf AuctionBook2013.txt

This conversion leaves text exactly in its original layout (due to -layout option). Options -f and -l indicate the first and last page numbers of the range of pages to extract.

From there, parsing should be simple -- a number in column 8 indicates the first line of a record, a blank line ends the record. Follow the guide for the exact positioning of elements within a record.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!