Convert data from PDFform to CSV

Posted by 泄露秘密 on 2019-12-12 13:32:01

Question


I am trying to convert the data entered in multiple fillable PDF forms into one CSV file.
My code consists of a few steps:

  1. Open new .csv file (header row)
  2. Open multiple pdf-forms with "for...in" loop
  3. Convert data entered in form-fields to csv

However, when running the command I receive the error:

fc-int01-generateAppearances: None
Traceback (most recent call last):
    File "C:\Python27\Scripts\test3.py", line 31, in <module>
        writer.writerow(value)
    _csv.Error: sequence expected

If I just print the value (the form data) in Python, it works, but writing the data to the CSV file does not. There may also be a problem with turning the values, which arrive one per line, into the columns of a single row. I hope this is clear.

Here is my code:

import glob
import os
import sys
import csv
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

#input file path for specific file
#filename = "C:\Python27\Scripts\MH_1.pdf"
#fp = open(filename, 'rb')

#open new csv file
out_file=open('C:\Users\Wonen\Downloads\Test\output.csv', 'w+')
writer = csv.writer(out_file)
#header row
writer.writerow(('Name coordinator', 'Date', 'Address', 'District',
                 'City', 'Complaintnr'))

#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print '{0}: {1}'.format(name, value)
        writer.writerow(value)

The output with a text pdf (including all output) using print (repr(value)):

None
'Crip Gang'
None
None
None
/Ja
None
/1
/1
None
None
/Ja
/Ja
None
None
None
'wfwf'
'sd'
'dfwf'
'ffasf'
'tsdbd'
'dfadfasdf'
None
'df'
None
'asdff'
None
'wff'
None
'ffs'
None
None
None
None
None
None
None
None
None
None
None
'1'
'2'
'7'
/0
'Ja'
'Two unlimited'
'Captain Jack'
None
'www.kijkbijmij.nl'
'Onderverhuur'
/Ja

And so on. "None" stands for an empty text box, and "1" and "0" stand for "yes" and "no" outputs.
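The value conventions described above could be normalized before anything is written to the CSV. This is a minimal Python 3 sketch, not part of the original code. It assumes that entries printed as /Ja, /1 or /0 are pdfminer name objects exposing a `.name` attribute (worth verifying against your pdfminer version). `normalize_value` and the stand-in class `FakeLiteral` are hypothetical names used only for illustration.

```python
def normalize_value(value):
    """Turn a raw form value into text suitable for one CSV cell."""
    if value is None:
        return ''                       # empty text box
    name = getattr(value, 'name', None)
    if name is not None:                # name object such as /Ja or /1
        value = name
    if isinstance(value, bytes):        # some versions may return bytes
        value = value.decode('utf-8', 'replace')
    return str(value)


class FakeLiteral(object):
    """Hypothetical stand-in for a pdfminer name object like /Ja."""
    def __init__(self, name):
        self.name = name


raw_values = (None, 'Crip Gang', FakeLiteral('Ja'), FakeLiteral('1'))
cells = [normalize_value(v) for v in raw_values]
print(cells)  # ['', 'Crip Gang', 'Ja', '1']
```

With real forms, `raw_values` would be the `field.get('V')` results collected for one PDF.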


Answer 1:


Try changing the last part of your code as shown:

    .
    .
    .
#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    row = []
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        row.append(value)
    writer.writerow(row)

out_file.close()

It's not clear this will work, but it may provide you with the information you need to solve your problem.
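The following sketch illustrates why the original loop most likely failed and why collecting the values in a list helps. It is written for Python 3 (the question uses Python 2.7, where the error text is "sequence expected") and uses an in-memory buffer instead of a real file. `writerow` expects an iterable of cells, so passing None is rejected, while a bare string is split into one cell per character.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')

# Passing None (an empty form field) is rejected: writerow needs an iterable.
try:
    writer.writerow(None)
    failed = False
except csv.Error:
    failed = True

# A bare string is iterable, so each character becomes its own cell.
writer.writerow('sd')

# Collecting the values in a list yields one row with one cell per field.
values = [None, 'Crip Gang', 'wfwf']
writer.writerow(['' if v is None else v for v in values])

print(failed)
print(buf.getvalue())
```

The second written line starts with an empty cell because the first value was None.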

One confusing thing is the first (header) row of the csv:

writer.writerow(('Name coordinator', 'Date', 'Address','District','City', 'Complaintnr'))

It sets how many values each subsequent row is expected to contain (csv.writer itself does not check this). This means that each row written should be a list consisting of data for those 6 items, in that order.
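One way to keep every row aligned with that header is `csv.DictWriter`, which takes the column names up front. This is a sketch under assumptions rather than the approach of the answer above: the sample record is made up, and `restval` and `extrasaction` are set so that missing columns are left blank and unrelated form fields are ignored. It targets Python 3 and writes to an in-memory buffer.

```python
import csv
import io

HEADER = ['Name coordinator', 'Date', 'Address', 'District',
          'City', 'Complaintnr']

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=HEADER, restval='',
                        extrasaction='ignore', lineterminator='\n')
writer.writeheader()

# Made-up record: two known columns plus one key that is not in HEADER.
record = {'Name coordinator': 'Captain Jack', 'City': 'wff',
          'Unrelated field': 'x'}
writer.writerow(record)

print(buf.getvalue())
```

Each written row then always has six cells, whatever subset of keys the record contains.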

You need to figure out how to translate what's in each group of fields into a row list of 6 data items. That is what the code in my answer tries to do, I think, but I can't test it.
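A sketch of that translation step, assuming each PDF field name (the 'T' entry) can be mapped to one of the six columns. The keys in `FIELD_TO_COLUMN` are invented placeholders, since the real field names in the forms are not shown here, and `build_row` is a hypothetical helper.

```python
HEADER = ['Name coordinator', 'Date', 'Address', 'District',
          'City', 'Complaintnr']

# Placeholder mapping: replace the keys with the field names in your forms.
FIELD_TO_COLUMN = {
    'coordinator': 'Name coordinator',
    'date': 'Date',
    'address': 'Address',
    'district': 'District',
    'city': 'City',
    'complaint_nr': 'Complaintnr',
}


def build_row(pairs):
    """Build one CSV row, in HEADER order, from (field name, value) pairs."""
    by_column = {}
    for name, value in pairs:
        column = FIELD_TO_COLUMN.get(name)
        if column is not None and value is not None:
            by_column[column] = value
    # Columns with no matching field stay empty.
    return [by_column.get(column, '') for column in HEADER]


pairs = [('coordinator', 'Captain Jack'), ('unused_checkbox', None),
         ('city', 'wff'), ('complaint_nr', '7')]
row = build_row(pairs)
print(row)  # ['Captain Jack', '', '', '', 'wff', '7']
```

In the loop from the answer, `pairs` would be the (name, value) tuples gathered from fields for one PDF, and the result would be passed to writer.writerow.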



Source: https://stackoverflow.com/questions/31521403/convert-data-from-pdfform-to-csv
