How to convert PDF to Excel or CSV in Rails 4

醉梦人生 2021-01-06 15:43

I have searched a lot and found nothing, so I am asking here. Does anyone know of an online converter with an API, or a gem, that can convert a PDF to an Excel or CSV file?

3 Answers
  • 2021-01-06 16:03

    Check the Tabula-Extractor project, and look at how it is used in projects such as the NYPD Moving Summonses Parser and the CompStat criminal complaints parser.

  • 2021-01-06 16:05

    OK, after a lot of research I couldn't find an API or even proper software that does this directly. Here is how I did it.

    First I extract the table from the PDF as an HTML table with the pdftables API. It is cheap.

    Then I convert the HTML table to CSV.

    (This is not ideal, but it works.)

    Here is the code:

    require 'httmultiparty'
    require 'nokogiri'
    require 'csv'

    class PageTextReceiver
      include HTTMultiParty
      base_uri 'http://localhost:3000' # unused below: post is given an absolute URL

      # Upload the PDF to pdftables and save the HTML response to disk.
      def run
        response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })

        File.open('/path/to/save/as/html/response.html', 'w') do |f|
          f.puts response.body
        end
      end

      # Parse the HTML file written by run and emit one CSV row per table row.
      def convert
        doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
        # The block form closes the CSV file even if parsing raises.
        CSV.open("path/to/csv/t.csv", 'w', :col_sep => ",", :quote_char => '\'', :force_quotes => true) do |csv|
          # '//table//tr' also matches rows wrapped in <tbody>.
          doc.xpath('//table//tr').each do |row|
            csv << row.xpath('td').map { |cell| cell.text }
          end
        end
      end
    end
    

    Now run it like this:

    #> page = PageTextReceiver.new
    #> page.run
    #> page.convert
    

    It is not refactored; it is just a proof of concept. You need to consider performance.

    I might use the Sidekiq gem to run it in the background and hand the result back to the main thread.
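
    Sidekiq needs its gem and a Redis server, so as a stand-in the sketch below shows the same idea with only a plain Ruby thread and a Queue: the slow work runs off the main thread and the result is handed back. The convert_pdf method is a hypothetical placeholder for the run and convert steps, not part of the code above.

```ruby
require 'thread'

# Hypothetical placeholder for the slow PDF -> HTML -> CSV work.
def convert_pdf(path)
  "converted:#{path}"
end

results = Queue.new # thread-safe hand-off between the worker and the main thread

worker = Thread.new do
  results << convert_pdf('/path/to/pdf/uploaded_pdf.pdf')
end

# The main thread is free to do other work here.
# pop blocks until the worker has pushed its result.
outcome = results.pop
worker.join
puts outcome
```

    With Sidekiq the worker body would move into a job class and the result would usually be stored (for example in the database) rather than passed through an in-process queue.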

  • 2021-01-06 16:18

    Ryan Bates covers CSV exports in this RailsCasts episode: http://railscasts.com/episodes/362-exporting-csv-and-excel. It might give you some pointers.
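
    A minimal sketch of that kind of export, using only Ruby's standard CSV library. It is not the episode's code: rows_to_csv and the sample rows are hypothetical, and plain hashes stand in for model records.

```ruby
require 'csv'

# Build a CSV string from an array of hashes.
# columns fixes both the header row and the order of the values.
def rows_to_csv(rows, columns)
  CSV.generate do |csv|
    csv << columns
    rows.each { |row| csv << row.values_at(*columns) }
  end
end

# Hypothetical sample data; in a Rails app this would come from a model.
rows = [
  { 'name' => 'Widget', 'price' => '9.99' },
  { 'name' => 'Gadget, large', 'price' => '19.50' }
]

puts rows_to_csv(rows, %w[name price])
```

    Fields that contain the separator, such as 'Gadget, large', are quoted automatically, so the output stays valid CSV.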

    Edit: since you now mention that you need the raw data from an uploaded PDF, you could use JavaScript to read the PDF file and then feed the data into Ryan Bates' export method. Reading PDFs in JavaScript is covered well in the following question:

    extract text from pdf in Javascript

    I would imagine the flow would be something like:

    PDF new action
        user uploads PDF 
    
    PDF show action
        PDF is displayed
        JavaScript reads PDF
        JavaScript populates Ryan's raw data
        Raw data is exported with PDF data included 
    