I have searched a lot and have no choice but to ask here. Does anyone know of an online converter with an API, or a gem, that can convert a PDF to an Excel or CSV file?
Check the Tabula-Extractor project, and also look at how it is used in projects like the NYPD Moving Summonses Parser and the CompStat criminal complaints parser.
OK, after lots of research I couldn't find an API or even proper software that does it. Here is how I did it.
I first extract the table out of the PDF into an HTML table using the pdftables API. It is cheap.
Then I convert the HTML table to CSV.
(This is not ideal but it works)
Here is the code:
require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty
  base_uri 'http://localhost:3000' # not used by run, which posts to an absolute URL

  # Upload the PDF to pdftables and save the HTML it returns.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })
    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response.body
    end
  end

  # Turn each <tr> of the saved HTML table into one CSV row.
  def convert
    doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
    # Options are passed as keywords; a positional hash fails on Ruby 3.
    CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: "'", force_quotes: true) do |csv|
      doc.xpath('//table/tr').each do |row|
        csv << row.xpath('td').map(&:text)
      end
    end
  end
end
Now run it like this:
#> page = PageTextReceiver.new
#> page.run
#> page.convert
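To see what the CSV options in convert actually produce, here is a small sketch using only the standard csv library. The sample rows are made up for illustration; the options are the same ones passed to CSV.open above.

```ruby
require 'csv'

# Made-up sample rows standing in for the cells scraped from the HTML table.
rows = [["Name", "Note"], ["Alice", "it's fine"], ["Bob", "a, b"]]

# Same options as in convert: comma separator, single-quote as the quote
# character, and every field quoted whether it needs it or not.
csv_text = CSV.generate(col_sep: ",", quote_char: "'", force_quotes: true) do |csv|
  rows.each { |row| csv << row }
end
puts csv_text

# Reading it back needs the same quote_char, otherwise the single quotes
# end up as part of the field values.
parsed = CSV.parse(csv_text, quote_char: "'")
```

Note that a single quote inside a cell is written doubled, so anything that reads t.csv later has to be told that the quote character is a single quote rather than the default double quote.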
It is not refactored; it is just a proof of concept. You need to consider performance.
I might use the Sidekiq gem to run it in the background and hand the result back to the main thread.
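I have not written the Sidekiq version, so this is not Sidekiq code. As a stand-in, here is a rough sketch of the same idea with plain Ruby threads: the slow conversion runs in a background thread and the result comes back to the main thread through a Queue. The slow_convert method is hypothetical and only pretends to do the work.

```ruby
# Hypothetical stand-in for the slow upload + convert steps.
# It only sleeps and returns the path the CSV would be written to.
def slow_convert(pdf_path)
  sleep 0.1
  pdf_path.sub(/\.pdf\z/, '.csv')
end

results = Queue.new

# Background thread does the work and pushes the result onto the queue.
worker = Thread.new do
  results << slow_convert('/path/to/pdf/uploaded_pdf.pdf')
end

# The main thread is free to do other things here.

csv_path = results.pop # blocks until the worker has pushed its result
worker.join
puts csv_path
```

With Sidekiq the job would run in a separate process backed by Redis, so the result would more likely be stored somewhere (a file or a database row) than handed over through a queue like this.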
Ryan Bates covers CSV exports in his RailsCasts episode: http://railscasts.com/episodes/362-exporting-csv-and-excel. This might give you some pointers.
Edit: as you now mention you need the raw data from an uploaded PDF, you could use JavaScript to read the PDF file and then populate the data into Ryan Bates' export method. Reading PDFs was covered excellently in the following question:
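For a rough idea of what such an export looks like, here is a minimal sketch in plain Ruby, without Rails. It is not the code from the episode: the Product struct and the to_csv helper are hypothetical stand-ins for a model and its export method.

```ruby
require 'csv'

# Hypothetical record type standing in for an ActiveRecord model.
Product = Struct.new(:id, :name, :price)

# Build CSV text: one header row, then one row per record.
def to_csv(records, columns)
  CSV.generate do |csv|
    csv << columns.map(&:to_s)
    records.each do |record|
      csv << columns.map { |column| record[column] }
    end
  end
end

products = [
  Product.new(1, 'Widget', 9.99),
  Product.new(2, 'Gadget, large', 19.5)
]

csv_text = to_csv(products, %i[id name price])
puts csv_text
```

In a Rails controller the resulting string would typically be sent back with send_data, but that part depends on your app.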
extract text from pdf in Javascript
I would imagine the flow would be something like:
PDF new action
  User uploads PDF
PDF show action
  PDF is displayed
  JavaScript reads PDF
  JavaScript populates Ryan's raw data
  Raw data is exported with PDF data included