Convert data from PDF form to CSV

Question

I am trying to convert the data entered in multiple fillable PDF forms into one CSV file.
The code consists of a few steps:

Open new .csv file (header row)
Open multiple pdf-forms with "for...in" loop
Convert data entered in form-fields to csv

However, when running the command I receive the error:

fc-int01-generateAppearances: None
Traceback (most recent call last):
    File "C:\Python27\Scripts\test3.py", line 31, in <module>
        writer.writerow(value)
    _csv.Error: sequence expected

If I just print the value (form data) in Python, it works, but importing the data into the CSV does not. There may also be a problem with going from row to column with value. I hope I am clear.

Here is my code:

import glob
import os
import sys
import csv
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

#input file path for specific file
#filename = "C:\Python27\Scripts\MH_1.pdf"
#fp = open(filename, 'rb')

#open new csv file
out_file=open('C:\Users\Wonen\Downloads\Test\output.csv', 'w+')
writer = csv.writer(out_file)
#header row
writer.writerow(('Name coordinator', 'Date', 'Address', 'District',
                 'City', 'Complaintnr'))

#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print '{0}: {1}'.format(name, value)
        writer.writerow(value)

The output with a text pdf (including all output) using print (repr(value)):

None
'Crip Gang'
None
None
None
/Ja
None
/1
/1
None
None
/Ja
/Ja
None
None
None
'wfwf'
'sd'
'dfwf'
'ffasf'
'tsdbd'
'dfadfasdf'
None
'df'
None
'asdff'
None
'wff'
None
'ffs'
None
None
None
None
None
None
None
None
None
None
None
'1'
'2'
'7'
/0
'Ja'
'Two unlimited'
'Captain Jack'
None
'www.kijkbijmij.nl'
'Onderverhuur'
/Ja

etc. "None" stands for an empty text box, and "1" and "0" stand for "yes" and "no" outputs.

Answer 1:

Try changing the last part of your code as shown:

    .
    .
    .
#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
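    #collect the values of all fields in this PDF into a single row
    row = []
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        row.append(value)
    #write one row per PDF instead of calling writerow once per field
    writer.writerow(row)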

out_file.close()

It's not clear this will work, but it may provide you with the information you need to solve your problem.
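
For background on the original error: writer.writerow() expects one whole row, i.e. a sequence of field values, not a single value. The sketch below is only an illustration with made-up sample values and an in-memory buffer instead of a file. It is written for Python 3 (on Python 2.7 you would swap io.StringIO for StringIO.StringIO, and the error text reads "sequence expected").

```python
import csv
import io

# In-memory stand-in for the output file, so nothing is written to disk.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')

# A list (or tuple) is written as one row, one cell per item.
writer.writerow(['Crip Gang', '1', '2'])

# A bare string is itself a sequence, so each character becomes a cell.
writer.writerow('abc')

# None is not a sequence at all, so the csv module rejects it.
try:
    writer.writerow(None)
    error_raised = False
except csv.Error:
    error_raised = True

output = buf.getvalue()
print(output)
print('csv.Error raised for None:', error_raised)
```

This is why calling writerow(value) once per field fails as soon as a field is empty (None), and why collecting the values into a list first avoids the error.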

One confusing thing is the header row of the csv:

writer.writerow(('Name coordinator', 'Date', 'Address','District','City', 'Complaintnr'))

This defines how many field values each written row is expected to contain, which means that each row should be a list consisting of data for those 6 items, in that order.

You need to figure out how to translate what's in each group of fields into a row list of 6 data items. That is what the code in my answer attempts to do, I think, but I can't test it.
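
One possible way to do that translation is to pick fields by name rather than by position. The following is a rough sketch only, not tested against the real forms: the field names in FIELD_TO_COLUMN, the FakeLiteral stand-in class, and the sample data are all hypothetical. It assumes the (name, value) pairs have already been read from the PDF as text, and that checkbox values such as /Ja arrive as objects with a name attribute (as pdfminer's PSLiteral has). Written for Python 3.

```python
import csv
import io

HEADER = ['Name coordinator', 'Date', 'Address', 'District',
          'City', 'Complaintnr']

# Hypothetical mapping from PDF field names (the 'T' entries) to CSV columns.
# The real names have to be looked up in the actual forms.
FIELD_TO_COLUMN = {
    'name_coordinator': 'Name coordinator',
    'date': 'Date',
    'address': 'Address',
    'district': 'District',
    'city': 'City',
    'complaint_nr': 'Complaintnr',
}


class FakeLiteral(object):
    """Stand-in for a PDF name object such as /Ja (assumption for this demo)."""
    def __init__(self, name):
        self.name = name


def to_cell(value):
    """Turn a raw field value into text: None becomes an empty cell."""
    if value is None:
        return ''
    # Objects with a .name attribute are unwrapped; plain values pass through.
    return str(getattr(value, 'name', value))


def build_row(pairs):
    """Build one row in HEADER order from (field name, value) pairs."""
    row = dict.fromkeys(HEADER, '')
    for name, value in pairs:
        column = FIELD_TO_COLUMN.get(name)
        if column is not None:  # fields without a column are skipped
            row[column] = to_cell(value)
    return [row[column] for column in HEADER]


# Made-up sample data standing in for the fields of one PDF.
pairs = [
    ('name_coordinator', 'Crip Gang'),
    ('date', None),
    ('address', 'wfwf'),
    ('district', 'sd'),
    ('city', 'dfwf'),
    ('complaint_nr', '7'),
    ('some_checkbox', FakeLiteral('Ja')),
]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')
writer.writerow(HEADER)
writer.writerow(build_row(pairs))
output = buf.getvalue()
print(output)
```

Selecting by name keeps each row at exactly 6 items even though the form contains many more fields, and empty fields end up as empty cells instead of the text "None".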

Source: https://stackoverflow.com/questions/31521403/convert-data-from-pdfform-to-csv

Tags

python

python-2.7

csv

pdf

pdf-form