I want to get a list of all the colleges in the USA from this PDF file and put it into a CSV file. I will then import the CSV file into SQL Server (so that I can run queries easily).
For this kind of PDF extraction you can use IntelliGet (http://akribiatech.com/intelliget). A simple script like the one below would serve your purpose here:
userVariables = school, campus;
{ start = IsNumeric(Substring(Line(0),112,115));
school = ""; campus = "";
{ start = 1;
maxCount = 2;
school = Concat(school, " ", Trim(Substring(Line(0),1,52)));
campus = Concat(campus, " ", Trim(Substring(Line(0),53,82)));
}
output = Concat(Trim(school), "|", Trim(campus));
}
As already claimed in a comment to the question,
Considering the fairly straightforward page content stream style, the data should be extractable using a not too complicated custom text extractor.
In detail:
Regular table entry content is drawn entry by entry, each entry field by field in reading order. Thus, while going through the content stream we do not have to try and re-arrange the content to establish that order. This makes this task fairly easy.
So the main work will be to ignore non-entries, i.e. the header on the first page, the bars indicating where a new first letter starts, and the page numbers.
We do so by ignoring any text whose fill color is not plain black, which filters out the header, the letter bars, and the page numbers (cf. the fill color check in renderText below).
(Other approaches would also have worked, e.g. ignoring everything in a bottom page area to take care of the page numbers.)
Now we merely have to split the entries into their fields.
Again the document structure helps: it is a very uniform document, and the table columns have identical positions and dimensions on each page. So we merely have to dissect the text at fixed x values.
There is just one stumbling block: in some entries an atomic text chunk contains content from different columns. E.g. sometimes the contents of the F and M columns are drawn as a single string like "YN", and the optical distance is introduced via character spacing.
So we have to process the text chunks character by character, not as a whole.
I use Java and the PDF library iText (current version: a 5.5.7 development snapshot) here. This does not at all mean that it cannot be done using a different setup; this merely is the setup I am most accustomed to.
As separator I use the tab character because other likely candidates also occur as part of the text and I did not want to have to cope with escaping them.
This is the custom RenderListener implementation introduced to cope with the content as explained above:
import java.io.IOException;

import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.GrayColor;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

import static com.itextpdf.text.pdf.parser.Vector.I1;
import static com.itextpdf.text.pdf.parser.Vector.I2;

public class CertifiedSchoolListExtractionStrategy implements RenderListener
{
    public CertifiedSchoolListExtractionStrategy(Appendable data, Appendable nonData)
    {
        this.data = data;
        this.nonData = nonData;
    }

    //
    // RenderListener implementation
    //
    @Override
    public void beginTextBlock() { }

    @Override
    public void endTextBlock() { }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        try
        {
            Vector startPoint = renderInfo.getBaseline().getStartPoint();
            BaseColor fillColor = renderInfo.getFillColor();
            if (fillColor instanceof GrayColor && ((GrayColor)fillColor).getGray() == 0)
            {
                // black text is table data; process it glyph by glyph
                if (debug)
                    data.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
                {
                    renderCharacter(info);
                }
            }
            else
            {
                // any other fill color marks header texts, letter bars, or page numbers
                if (debug)
                    nonData.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                if (currentField > -1)
                    finishEntry();
                entryBuilder.append(renderInfo.getText());
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            chunk++;
        }
    }

    public void renderCharacter(TextRenderInfo renderInfo) throws IOException
    {
        Vector startPoint = renderInfo.getBaseline().getStartPoint();
        float x = startPoint.get(I1);
        if (currentField > -1)
        {
            if (isInCurrentField(x))
            {
                entryBuilder.append(renderInfo.getText());
                return;
            }
            if (isInNextField(x))
            {
                currentField++;
                entryBuilder.append('\t').append(renderInfo.getText());
                return;
            }
            finishEntry();
        }
        if (isInNextField(x))
        {
            finishEntry();
            currentField = 0;
        }
        entryBuilder.append(renderInfo.getText());
    }

    public void close() throws IOException
    {
        finishEntry();
    }

    boolean isInCurrentField(float x)
    {
        if (currentField == -1)
            return false;
        if (x < fieldstarts[currentField])
            return false;
        if (currentField == fieldstarts.length - 1)
            return true;
        return x <= fieldstarts[currentField + 1];
    }

    boolean isInNextField(float x)
    {
        if (currentField == fieldstarts.length - 1)
            return false;
        if (x < fieldstarts[currentField + 1])
            return false;
        if (currentField == fieldstarts.length - 2)
            return true;
        return x <= fieldstarts[currentField + 2];
    }

    void finishEntry() throws IOException
    {
        if (entryBuilder.length() > 0)
        {
            if (currentField == fieldstarts.length - 1)
            {
                data.append(entryBuilder).append('\n');
            }
            else
            {
                nonData.append(entryBuilder).append('\n');
            }
            entryBuilder.setLength(0);
        }
        currentField = -1;
    }

    //
    // hidden members
    //
    final Appendable data, nonData;
    boolean debug = false;
    int chunk = 0;
    int currentField = -1;
    StringBuilder entryBuilder = new StringBuilder();
    final int[] fieldstarts = {20, 254, 404, 415, 431, 508, 534};
}
(CertifiedSchoolListExtractionStrategy.java)
We can use it like this:
@Test
public void testCertifiedSchoolList_9_16_2015() throws IOException
{
    try (   Writer data = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "data.txt")), "UTF-8");
            Writer nonData = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "non-data.txt")), "UTF-8") )
    {
        CertifiedSchoolListExtractionStrategy strategy = new CertifiedSchoolListExtractionStrategy(data, nonData);
        PdfReader reader = new PdfReader("certified-school-list-9-16-2015.pdf");
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++)
            parser.processContent(page, strategy);
        strategy.close();
    }
}
(ExtractCertifiedSchoolList.java)
Now data.txt contains all the entries as tab-separated lines, and non-data.txt everything that was ignored.
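Since the final goal is a CSV file that can be imported into SQL Server, the tab-separated data.txt still has to be converted. Here is a minimal sketch of that conversion; the output file name schools.csv is just an assumption:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: turn the tab-separated data.txt into a quoted CSV file.
public class TsvToCsv
{
    public static void main(String[] args) throws IOException
    {
        List<String> csvLines = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("data.txt"), StandardCharsets.UTF_8))
        {
            StringBuilder csvLine = new StringBuilder();
            for (String field : line.split("\t", -1))
            {
                if (csvLine.length() > 0)
                    csvLine.append(',');
                // quote every field and escape embedded quotes so commas in school names stay intact
                csvLine.append('"').append(field.replace("\"", "\"\"")).append('"');
            }
            csvLines.add(csvLine.toString());
        }
        Files.write(Paths.get("schools.csv"), csvLines, StandardCharsets.UTF_8);
    }
}
Quoting every field keeps commas inside school names from breaking the CSV import later on.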
To understand what is happening here, one first has to know how page content in PDFs is organized and how (for the sample code given) iText operates on it.
PDF documents are structures built from a number of base object types, some primitive types (numbers, strings, ...) and some more complex ones (arrays or dictionaries of other objects or streams).
A page in a PDF document is represented by such a dictionary object containing entries defining some page properties (like page dimensions) and other entries referencing objects that define what is drawn on the page: the content streams.
Content streams essentially contain a sequence of operations which either draw content (text, images, vector graphics) onto the page or change the graphics state (the current font, colors, transformations, and so on).
For the question at hand we are mostly interested in the operations involved in drawing text. In contrast to word processors, the operations are not "take this long string of text and arrange it as a paragraph" but, more primitively, "move the text position here, draw this short string, move the text position again, and draw another string there".
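If you want to look at these operations yourself, you can dump the decoded page content streams, for instance with a small iText sketch like the following (same sample file name as in the test above; PdfReader.getPageContent returns the decompressed stream bytes):
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;

// Small helper: print the decoded page content streams (this is how a dump like the one below is obtained)
public class DumpContentStreams
{
    public static void main(String[] args) throws IOException
    {
        PdfReader reader = new PdfReader("certified-school-list-9-16-2015.pdf");
        for (int page = 1; page <= reader.getNumberOfPages(); page++)
        {
            System.out.println("% --- page " + page + " ---");
            // getPageContent returns the page's content stream already decompressed
            System.out.println(new String(reader.getPageContent(page)));
        }
        reader.close();
    }
}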
E.g. in the sample PDF the operations for drawing the table header and the first entry line are these:
/TT2 1 Tf
Select font TT2 at size 1.
9.72 0 0 9.72 20.16 687.36 Tm
Set the text matrix to move the text insertion coordinates to 20.16, 687.36 and scale everything following by a factor of 9.72.
0 g
Select the grayscale fill color black.
0 Tc
0 Tw
Set the additional character and word spacing to 0.
(SCHOOL)Tj
Draw "SCHOOL" here.
/TT1 1 Tf
Select font TT1.
3.4082 0 TD
Move text insertion point by 3.4082 in x direction.
<0003>Tj
Draw a space character (the current font uses a different encoding which uses 16 bits per character instead of 8, so the string is written hexadecimally here).
/TT2 1 Tf
.2261 0 TD
[(NAME)-17887.4(CAMPUS)]TJ
Select font, move text insertion point, and draw the string "NAME", then a gap of 17887.4 text units, then draw "CAMPUS".
/TT1 1 Tf
24.1809 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
[(NAME)-8986.6(F)-923.7(M)-459.3(CITY)-6349.9(ST)-1390.2(CAMPUS)]TJ
/TT1 1 Tf
28.5147 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
(ID)Tj
Draw the rest of the header line.
/TT4 1 Tf
-56.782 -1.3086 TD
Move left by 56.782 and down by 1.3086, i.e. to the start of the first entry line.
("I)Tj
/TT3 1 Tf
.6528 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-16894.2("I)]TJ
/TT3 1 Tf
18.9997 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-8239.9(Y)-1018.9(N)-576.7(Mount)]TJ
/TT3 1 Tf
15.189 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Shasta)-2423.3(CA)-2443.7(41789)]TJ
And draw the first entry line.
As you can see, and as I mentioned above, the table content is drawn in reading order. Even multi-line column entries come in the needed order, e.g. the campus name "A F International of Westlake Village":
[(Inc.)-7228.7(A)]TJ
/TT3 1 Tf
9.26 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(F)Tj
/TT3 1 Tf
.4595 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(International)Tj
/TT3 1 Tf
5.2886 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(of)Tj
/TT3 1 Tf
.8325 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Westlake)Tj
/TT3 1 Tf
3.7739 0 TD
<0003>Tj
/TT4 1 Tf
-11.8374 -1.3086 TD
Move down to the second line of the column.
(Village)Tj
15.4938 1.3086 TD
Move up again to the main line of the entry.
[(Y)-1018.9(N)-576.7(Westlake)]TJ
So we can digest the text as it comes; there is no need for sorting (even though the content could, in principle, have been ordered in a completely different way).
But we also see that there are no obvious column start and end points. To associate the text with a column, therefore, we have to calculate the positions of each character and compare them with externally given column start positions.
PDF libraries usually provide some mechanism to help parse such content streams.
There are two basic architectures for this: a library may parse the content stream completely into a list of operation objects which the caller then walks through, or it may process the stream in an event-driven manner, informing registered listeners about each operation as it is encountered.
The former variant at first seems easier to handle but may have big resource requirements (I have come across multi-MB content streams), while the second one seems a bit more difficult to handle but has smaller memory requirements.
The library I used (iText) follows the latter approach but your problem could also have been solved using a library following the former one.
RenderListener is the listener interface to implement here; its renderText method receives the individual text chunks together with their positions.
In the implementation above (CertifiedSchoolListExtractionStrategy) the renderText method first checks the fill color associated with the chunk and only forwards black text for further processing to renderCharacter. That method (and some helpers) in turn checks which field the text is in (by hard-coded position boundaries) and exports tab-separated values accordingly. This logic could similarly have been implemented using other libraries.
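To illustrate that last point, here is a rough sketch of the same fixed-column dissection implemented with Apache PDFBox (2.x) instead of iText. It is only an outline under several assumptions: it drops the fill color filter, so the header lines, letter bars and page numbers would still appear in the output and need to be removed afterwards, and the x boundaries may need re-tuning because PDFBox reports glyph coordinates in its own text-extraction space:
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical PDFBox variant of the fixed-column dissection above (no color filtering).
public class ColumnSplittingStripper extends PDFTextStripper
{
    // same x boundaries as in the iText strategy; may need adjusting for PDFBox coordinates
    static final float[] FIELD_STARTS = {20, 254, 404, 415, 431, 508, 534};

    int field = 0;

    public ColumnSplittingStripper() throws IOException
    {
        super();
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();
        field = 0; // a new output line starts in the first column again
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder chunk = new StringBuilder();
        for (TextPosition glyph : textPositions)
        {
            // emit a tab whenever this glyph has crossed into a later column
            while (field < FIELD_STARTS.length - 1 && glyph.getXDirAdj() >= FIELD_STARTS[field + 1])
            {
                chunk.append('\t');
                field++;
            }
            chunk.append(glyph.getUnicode());
        }
        writeString(chunk.toString());
    }

    public static void main(String[] args) throws IOException
    {
        try (PDDocument document = PDDocument.load(new File("certified-school-list-9-16-2015.pdf")))
        {
            ColumnSplittingStripper stripper = new ColumnSplittingStripper();
            stripper.setSortByPosition(true);
            StringWriter result = new StringWriter();
            stripper.writeText(document, result);
            System.out.print(result);
        }
    }
}
The idea is identical to the iText strategy: look at the x coordinate of every glyph and emit a tab whenever the next column boundary is crossed.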
Another solution that needs hardly any code to get the PDF read: there is a Linux tool with a very useful -layout flag, as already mentioned on askubuntu. It's called pdftotext:
$ pdftotext -layout <input.pdf> <output.txt>
It worked very promisingly on the PDF file you provided. Sure, it's not a complete solution to your problem, but all you have to do afterwards is clean up the text output. This could be less time-consuming than the other solutions.
Here is a sample:
$ head -30 test.txt
Updated
SEVP Certified Schools September 16, 2015
SCHOOL NAME CAMPUS NAME F M CITY ST CAMPUS ID
"I Am" School Inc. "I Am" School Inc. Y N Mount Shasta CA 41789
424 Aviation 424 Aviation N Y Miami FL 103705
‐ A ‐
A F International School of Languages Inc. A F International College Y N Los Angeles CA 9538
A F International School of Languages Inc. A F International of Westlake Y N Westlake Village CA 57589
Village
A. T. Still University of Health Sciences Kirksville Coll of Osteopathic Y N Kirksville MO 3606
Medicine
Aaron School Aaron School ‐ 30th Street Y N New York NY 159091
Aaron School Aaron School Y N New York NY 114558
ABC Beauty Academy, INC. ABC Beauty Academy, INC. N Y Flushing NY 95879
ABC Beauty Academy, LLC ABC Beauty Academy N Y Garland TX 50677
Abcott Institute Abcott Institute N Y Southfield MI 197890
Aberdeen Catholic School System Roncalli Primary Y N Aberdeen SD 180510
Aberdeen Catholic School System Roncalli Y N Aberdeen SD 21405
Aberdeen Catholic School System Roncalli Elementary Y N Aberdeen SD 180511
Aberdeen School District 6‐1 Aberdeen Central High School Y N Aberdeen SD 36568
Abiding Savior Lutheran School Abiding Savior Lutheran School Y N Lake Forest CA 9920
Abilene Christian Schools Abilene Christian Schools Y N Abilene TX 8973
Abilene Christian University Abilene Christian University Y N Abilene TX 7498
Abington Friends School Abington Friends School Y N Jenkintown PA 20191
Above It All, Inc Benchmark Flight /Hawaii Flight N Y Kailua‐Kona HI 24353
Academy
Abraham Baldwin Agricultural College Tifton Campus Y N Tifton GA 6931
Abraham Joshua Heschel School Abraham Joshua Heschel School Y N New York NY 106824
ABT Jacqueline Kennedy Onassis School ABT Jacqueline Kennedy Onassis Y Y New York NY 52401
So this turns your problem into transforming that text output into a database-readable CSV file. Maybe you or someone else will prefer this way of doing it.
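If you go this route, a small fixed-width slicer is enough to produce the CSV. The sketch below makes several assumptions: the input file name test.txt, the output name schools.csv, and above all the character offsets of the columns, which are pure guesses and have to be measured against the real pdftotext output:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: slice the fixed-width `pdftotext -layout` output into CSV columns.
// The character offsets below are guesses and have to be adjusted to the real output.
public class LayoutTextToCsv
{
    static final int[] COLUMN_STARTS = {0, 55, 93, 96, 99, 125, 129}; // hypothetical offsets: school, campus, F, M, city, state, id

    public static void main(String[] args) throws IOException
    {
        List<String> csv = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("test.txt"), StandardCharsets.UTF_8))
        {
            // keep only lines ending in "<state> <campus id>"; headers, letter bars and
            // wrapped campus name fragments are dropped (the latter would need merging)
            if (!line.matches(".*[A-Z]{2}\\s+\\d+\\s*$"))
                continue;
            StringBuilder row = new StringBuilder();
            for (int i = 0; i < COLUMN_STARTS.length; i++)
            {
                int from = Math.min(COLUMN_STARTS[i], line.length());
                int to = i + 1 < COLUMN_STARTS.length ? Math.min(COLUMN_STARTS[i + 1], line.length()) : line.length();
                if (i > 0)
                    row.append(',');
                // quote each field so commas inside school names survive the CSV import
                row.append('"').append(line.substring(from, to).trim().replace("\"", "\"\"")).append('"');
            }
            csv.add(row.toString());
        }
        Files.write(Paths.get("schools.csv"), csv, StandardCharsets.UTF_8);
    }
}
Note that wrapped campus names (the lone "Village", "Medicine" and "Academy" lines in the sample above) are silently dropped by the filter; to keep them you would have to append such fragments to the campus column of the preceding row.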
I once had a Ruby project that did this kind of work. I used the gem pdf/reader and eventually it did work, but I advise against this approach: the contents of your PDF have no markers where the fields start and stop. Instead you have to measure the position of each piece of text (and there are many pieces per field). Here is an example of the first field:
"I
NUL ETX
Am"
NUL ETX
School
NUL ETX
Inc.
and compare it with boundaries you have to find by experimenting, like "if the position is > 2.54cm from the left margin and < 5.78cm from the left margin", etc. It is tedious and error-prone.
The easiest solution is to somehow read the entire text contents of your second URL, either by manually scrolling, selecting and copying the contents into an editor and removing the extra material at the head and tail, or by using a web-scraping gem like mechanize, and then convert that text to CSV. The last part is easy since the structure is fixed:
"I Am" School
118 Siskiyou Avenue
Mount Shasta , CA , 96067
5309266263 <--end of first record
424 Aviation
13230 SW 132 Ave.
Miami , FL , 33186
7862424848 <--end of second record
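For illustration (in Java, to stay with the language used elsewhere on this page; a Ruby version would look much the same), grouping every four lines of such a dump into one CSV record could look like the sketch below. The file names schools.txt and schools.csv are assumptions, and the code expects every record to consist of exactly the four lines shown above (without the "<--" annotations):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch: assumes clean four-line records (name, street, "City , ST , ZIP", phone) with no blank lines in between.
public class ScrapedTextToCsv
{
    public static void main(String[] args) throws IOException
    {
        List<String> lines = Files.readAllLines(Paths.get("schools.txt"), StandardCharsets.UTF_8);
        List<String> csv = new ArrayList<>();
        for (int i = 0; i + 3 < lines.size(); i += 4)
        {
            // "Mount Shasta , CA , 96067" -> city, state, zip (assumes well-formed third lines)
            String[] cityStateZip = lines.get(i + 2).split("\\s*,\\s*");
            csv.add(String.join(",",
                    quote(lines.get(i)),          // school name
                    quote(lines.get(i + 1)),      // street address
                    quote(cityStateZip[0]),       // city
                    quote(cityStateZip[1]),       // state
                    quote(cityStateZip[2]),       // zip
                    quote(lines.get(i + 3))));    // phone
        }
        Files.write(Paths.get("schools.csv"), csv, StandardCharsets.UTF_8);
    }

    static String quote(String value)
    {
        return '"' + value.trim().replace("\"", "\"\"") + '"';
    }
}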
If you need help with this last part, no problem.
If this is a one-off operation you could also use a tool like able2extract (if you're on Windows): it reads PDF and saves to Excel. The times I used it the result was decent and the layout was intact.