Scraping large PDF tables that span multiple pages

野的像风 2021-02-04 07:14

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is t

7 answers
  • 2021-02-04 07:48

    OK, I took a shot at this and I think it will help, although I'm not sure what you want your final output to look like. I'm happy to keep working on it, so let me know which parts you need help with.


    I started by downloading a PDF to Text application from CNET.

    After installing, I checked these settings:

    [Screenshot: PDF to text conversion settings]

    The important part here is we're using the physical layout option.
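
    The same layout-preserving mode is what the pdftotext -layout command from the question provides, so the conversion could also be scripted. The sketch below only builds the command line; the PDF file name is an assumption, and the system call is left commented out because it needs pdftotext installed.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext; -layout asks it to keep
# the physical layout of the page as far as it can.
sub pdftotext_command {
    my ($pdf, $txt) = @_;
    return ('pdftotext', '-layout', $pdf, $txt);
}

# Hypothetical file names; adjust them to your own paths.
my @cmd = pdftotext_command('EMAtaules2012.pdf', 'EMAtaules2012.txt');
print "@cmd\n";

# Uncomment to run the conversion (requires pdftotext on the PATH):
# system(@cmd) == 0 or die "pdftotext failed: $?";
```

    Either route should leave you with a text file whose columns are padded with spaces.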

    This gave us output that looks like this:

    Taules de Dades de la Xarxa d’Estacions
        Meteorològiques Automàtiques
                2                                                                                                   Anuari de dades meteorològiques 2012 / Servei Meteorològic de Catalunya
                2                                                           TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    
    COMARCA          CODI i NOM EMA                    GEN    FEB    MAR         ABR       MAI      JUN      JUL          AGO        SET        OCT        NOV         DES         ANY
    
    Alt Camp         VY   Nulles                        7,5    5,5   10,9         12,3     16,7     21,6     22,3         24,4       20,1        15,9       11,0        8,5         14,8
    Alt Camp         DQ   Vila-rodona                   7,9    5,6   11,0         12,0     16,6     21,6     22,0         24,3       19,9        15,8       11,0        8,6         14,7
    Alt Empordà      U1   Cabanes                       8,2    6,5   11,7         12,6     17,5     22,0     23,1         24,4       20,4        16,6       11,8        8,3         15,3
    Alt Empordà      W1   Castelló d'Empúries           8,1    6,4   11,6         12,9     17,0     21,1     22,0         23,4       20,1        16,4       12,1        8,5         15,0
    Alt Empordà      VZ   Espolla                       9,0    6,7   12,4         12,7     17,8     22,0     23,3         24,8       20,9        16,7       12,0        8,9         15,6
    
    [......]
    
                 3                                                                                                           Anuari de dades meteorològiques 2012 / Servei Meteorològic de Catalunya
                 2                                                                   TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    
    COMARCA          CODI i NOM EMA                             GEN    FEB    MAR         ABR       MAI      JUN      JUL          AGO        SET        OCT        NOV         DES         ANY
    
    Baix Empordà     DF   la Bisbal d'Empordà                    6,6    5,3   10,9         12,6     17,2     21,9     22,9         24,6       20,3        16,6       11,9        7,6         14,9
    Baix Empordà     UB   la Tallada d'Empordà                   6,1    5,2   10,7         12,3     16,6     21,3     22,2         23,8       19,7        15,8       11,7        7,6         14,4
    Baix Empordà     UC   Monells                                6,1    4,6    9,9         11,4     16,5     21,7     23,0         24,5       19,6        15,7       11,7        7,2         14,3
    Baix Empordà     UD   Serra de Daró                          6,3    5,3   10,6         12,3     16,8     21,6     22,7         24,3       20,3        16,6       12,2        7,7         14,8
    
    [......]
    
                 4                                                                                                              Anuari de dades meteorològiques 2012 / Servei Meteorològic de Catalunya
                 2                                                                      TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    
    COMARCA           CODI i NOM EMA                               GEN    FEB    MAR         ABR       MAI      JUN      JUL          AGO        SET        OCT        NOV         DES         ANY
    
    Maresme           UQ   Dosrius - PN Montnegre Corredor          7,2    4,6   10,8         10,7     15,8     20,4     20,8         23,4       18,6        15,1       10,7        7,8         13,9
    Maresme           WT   Malgrat de Mar                           7,4    5,4   11,0         13,0     16,7     21,5     22,8         24,6       20,9        17,2       12,9        8,8         15,2
    Maresme           DD   Vilassar de Mar                         10,1    7,5   12,6         13,9     17,9     22,4     23,7         25,7       22,1        18,4       13,8       10,8         16,6
    Montsià           US   Alcanar                                 10,0    7,6   11,8         14,2     17,9     22,7     24,0         25,8       22,0        18,2       13,7       10,7         16,6
    Montsià           UU   Amposta                                  9,6    7,5   12,1         14,3     18,3     22,8     23,5         25,3       21,6        18,0       13,1       10,8         16,4
    
    [......]
    

    You can see the columns line up much better, but we also have headers and page numbers. Also, the COMARCA and CODI i NOM EMA columns vary in width from page to page. We want to normalize them to fixed-width columns.
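
    The core of that normalization can be sketched in a few lines: measure the first two column widths from a page's header line, then cut each data row of that page at those offsets with unpack. This is only an illustration; the sample strings are built with sprintf, and the widths 17 and 34 are made up rather than taken from the real file.

```perl
use strict;
use warnings;

# Measure the widths of the first two columns from a header line.
sub header_widths {
    my ($header) = @_;
    my ($first, $second) = split /(CODI +i NOM EMA *)/, $header;
    return (length $first, length $second);
}

# Cut a data row at those widths; the "A" format strips trailing padding.
sub split_row {
    my ($row, $w1, $w2) = @_;
    return unpack "A${w1}A${w2}A*", $row;
}

# Hypothetical header and row with column widths of 17 and 34 characters.
my $header = sprintf "%-17s%-34s%s", 'COMARCA', 'CODI i NOM EMA', 'GEN    FEB';
my $row    = sprintf "%-17s%-34s%s", 'Alt Camp', 'VY   Nulles', ' 7,5    5,5';

my ($w1, $w2) = header_widths($header);
my @fields    = split_row($row, $w1, $w2);

# Re-pad the first two fields to a fixed width of 50 characters each.
printf "%-50s %-50s %s\n", @fields;
```

    Because the widths are re-measured on every header line, each page can use its own spacing and the output still comes out aligned.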

    I wrote a Perl program to normalize it. It also combines tables that share a title and prints the column headers only once, at the top. It writes one file per title into an output folder, using the title as the file name.

    Here's the code:

    #!/bin/perl
    
    use strict;
    use warnings;
    use open qw(:std :utf8);
    use utf8;
    
    my $comarca;
    my $nom;
    my $print_headers;
    my $title = "";
    my $fh;
    
    while(<>) {
    
        if (    !/Xarxa d’Estacions/
            and !/Meteorològiques Automàtiques/
            and !/Servei/
            and !/^\s*\d+\s*$/
            and !/^\s*$/ ) {
    
            chomp($_);
    
    
            if ( /^\s*2/ ) { #title
                s/^\s*2\s*//;
                if ( $title ne $_ ) {
                    $title = $_;
                    $print_headers = 1;
                }
    
            } elsif ( /COMARCA/ ) { #column headers
    
                my ($first_col, $second_col, @the_rest) = split(/(CODI +i NOM EMA *)/, $_);
    
    
                $comarca = length $first_col;
                $nom = length $second_col;
    
                if ( $print_headers ) {
                    my $str = sprintf "%-50s %-50s %s\n", $first_col, $second_col, join("", @the_rest);
                    write_string($str);
                    $print_headers = 0;
                }
    
            } else { #data
    
                my ($one, $two, $three) = unpack("A${comarca}A${nom}A*", $_);
                # pass $three as an argument so a stray % in the data is not read as a format
                my $str = sprintf "%-50s %-50s %s\n", $one, $two, $three;
                write_string($str);
            }
    
        }
    }
    
    sub write_string {
    
        my $string = shift;
        my $file_name = $title;
        $file_name =~ s/[\/\\]//g;
    
        mkdir "output_folder" unless -d "output_folder";    # create the folder on first use
        open ($fh, '>>', "./output_folder/${file_name}.txt") or die "Couldn't open: $!";
        print $fh $string;
        close ($fh);
    }
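
    Reopening the output file for every line works, but it gets slow on a large input. One possible variation, sketched here under my own assumptions (the %handle_for cache, the extra $dir parameter and close_all are not part of the script above), keeps one open handle per title:

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

my %handle_for;    # file name => open filehandle

# Append $string to "<dir>/<title>.txt", opening each file only once.
sub write_string {
    my ($dir, $title, $string) = @_;
    (my $file_name = $title) =~ s{[/\\]}{}g;    # strip slashes, as in the script
    my $fh = $handle_for{$file_name};
    unless ($fh) {
        make_path($dir);                        # does nothing if it already exists
        open $fh, '>>:encoding(UTF-8)', "$dir/$file_name.txt"
            or die "Couldn't open $file_name.txt: $!";
        $handle_for{$file_name} = $fh;
    }
    print {$fh} $string;
}

# Close every cached handle once all input has been processed.
sub close_all {
    close $_ for values %handle_for;
    %handle_for = ();
}

# Demo in a throwaway directory so nothing is left behind.
my $dir = tempdir(CLEANUP => 1);
write_string($dir, 'TEMPERATURA MITJANA MENSUAL - 2012', "row 1\n");
write_string($dir, 'TEMPERATURA MITJANA MENSUAL - 2012', "row 2\n");
close_all();
```

    The trade-off is that you have to remember to call close_all at the end, and a PDF with a very large number of distinct titles could run into the per-process limit on open files.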
    

    There are still a few imperfections in the output (you'll see these when you run this), but I wanted to get some feedback on what output would work best for you. There is definitely more we can do to improve the code! The output directory tree looks like this:

    Matt@MattPC ~/perl/pdftotext
    $ find .
    .
    ./convert.pl
    ./EMAtaules2012.txt
    ./output.txt
    ./output_folder
    ./output_folder/AMPLITUD TÈRMICA MITJANA MENSUAL ( ºC ) - 2012?.txt
    ./output_folder/AMPLITUD TÈRMICA MÀXIMA MENSUAL ( ºC ) - 2012?.txt
    ./output_folder/DIRECCIÓ DOMINANT DEL VENT - 2012?.txt
    ./output_folder/GRUIX MÀXIM MENSUAL DE NEU AL TERRA ( cm ) - 2012?.txt
    ./output_folder/HUMITAT RELATIVA MITJANA MENSUAL ( % ) - 2012?.txt
    ./output_folder/MITJANA MENSUAL DE LA HUMITAT RELATIVA MÀXIMA DIÀRIA ( % ) - 2012?.txt
    ./output_folder/MITJANA MENSUAL DE LA HUMITAT RELATIVA MÍNIMA DIÀRIA ( % ) - 2012?.txt
    [......]
    

    Where a file might look like this:

    COMARCA                                            CODI i NOM EMA                                     GEN    FEB    MAR         ABR       MAI      JUN      JUL          AGO        SET        OCT        NOV         DES         ANY
    Alt Camp                                           VY   Nulles                                         7,5    5,5   10,9         12,3     16,7     21,6     22,3         24,4       20,1        15,9       11,0        8,5         14,8
    Alt Camp                                           DQ   Vila-rodona                                    7,9    5,6   11,0         12,0     16,6     21,6     22,0         24,3       19,9        15,8       11,0        8,6         14,7
    Alt Empordà                                        U1   Cabanes                                        8,2    6,5   11,7         12,6     17,5     22,0     23,1         24,4       20,4        16,6       11,8        8,3         15,3
    Alt Empordà                                        W1   Castelló d'Empúries                            8,1    6,4   11,6         12,9     17,0     21,1     22,0         23,4       20,1        16,4       12,1        8,5         15,0
    Alt Empordà                                        VZ   Espolla                                        9,0    6,7   12,4         12,7     17,8     22,0     23,3         24,8       20,9        16,7       12,0        8,9         15,6
    Alt Empordà                                        D6   Portbou                                        9,6    5,5   12,7         12,5     17,4     21,5     22,9         24,4       19,8        17,0       12,3       10,1         15,5
    [......]
    

    Headers are only at the top and all the columns line up. This one is TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012.

    I've been thinking of uploading more of the output to a file hosting site, but I don't know which one would be good. Any suggestions?

    Hope this helps you Tomas!

    EDIT: Example of missing entries from AMPLITUD TÈRMICA MÀXIMA MENSUAL ( ºC ) - 2012:

    Solsonès                                           VP   Pinós                          1              3,1   26   16,9   13   16,7   15   16,6   17   19,2   11   19,6   24   20,4    17      19,1   01   17,5   16   16,5   06   13,1   08   13,9   24   20,4    17/07
    Solsonès                                           XT   Solsona                                                                                                              22,2    25      22,2   09   20,1   16   18,6   06   15,3   07   18,2   23   22,2    09/08
    Tarragonès                                         VQ   Constantí                      1              6,4   19   21,9   23   19,7   11   12,9   07   17,4   23   17,2   21   15,1    18      14,2   18   18,0   15   15,1   02   14,9   07   16,0   10   21,9    23/02
    

    Update

    Updated scripts for processing the input file:

    #!/bin/perl
    
    use strict;
    use warnings;
    use open qw(:std :utf8);
    use utf8;
    use charnames ':full';
    
    my @column_lengths;
    my $print_headers;
    my $title = "";
    my $fh;
    
    while(<>) {
    
        if (    !/Xarxa d’Estacions/
            and !/Meteorològiques Automàtiques/
            and !/Servei/
            and !/^\s*\d+\s*$/
            and !/^\s*$/ ) {
    
            s/[\r\n]+//g;
            s/ +\d+$//;
            if ( /^\s*2/ ) { #title
                s/^\s*2\s*//;
                if ( $title ne $_ ) {
                    $title = $_;
                    $print_headers = 1;
                }
    
            } elsif ( /COMARCA/ ) { #column headers
    
                my $comarca = (split(/(COMARCA *)/, $_))[1];
                my $codi = (split(/(CODI *)/, $_))[1];
                my $inomema = (split(/(i NOM EMA *)  /, $_))[1];
    
                my $the_rest = (split(/(i NOM EMA *)  /, $_))[2];
    
                my @rest = split(/( \w+ *)/, $the_rest);
    
                undef @column_lengths;
    
                push @column_lengths, length $comarca;
                push @column_lengths, length $codi;
                push @column_lengths, length $inomema;
    
                for (@rest) {
                    if ( $_ ) {
                        push @column_lengths, length $_;
                    }
                }
    
                $column_lengths[-1] = "*";
    
                if ( $print_headers ) {
                    $print_headers = 0;
                    write_string(join(";", unpack( "A" . join("A", @column_lengths), $_)) . "\n");
                }
    
            } else { #data
    
                write_string(join(";", unpack( "A" . join("A", @column_lengths), $_)) . "\n");
    
            }
    
        }
    }
    
    sub write_string {
    
        my $string = shift;
        my $file_name = $title;
        $file_name =~ s/[º]//g;
        $file_name =~ s/[^\w ]//g;
        $file_name =~ s/ +/ /g;
        $file_name =~ s/È/E/g;
        $file_name =~ s/À/A/g;
        $file_name =~ s/Ó/O/g;
        $file_name =~ s/Í/I/g;
        $file_name =~ s/Ç/C/g;
    
        mkdir "output_folder" unless -d "output_folder";    # create the folder on first use
        open ($fh, '>>', "./output_folder/${file_name}.txt") or die "Couldn't open: $!";
        print $fh $string;
        close ($fh);
    }
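
    The key trick in this version is turning the measured column widths into an unpack template. Here is that step in isolation, as a small sketch; the widths and the sample row are invented for the example, and unpack_template is a helper name I made up.

```perl
use strict;
use warnings;

# Turn a list of column widths into an unpack template.
# The last column is left open-ended ("*") so nothing is cut off.
sub unpack_template {
    my @widths = @_;
    $widths[-1] = '*';
    return 'A' . join('A', @widths);
}

# Hypothetical widths for COMARCA, CODI, NOM and two value columns.
my @widths   = (17, 5, 29, 7, 7);
my $template = unpack_template(@widths);

# A made-up row padded to those widths.
my $row = sprintf "%-17s%-5s%-29s%-7s%s", 'Alt Camp', 'VY', 'Nulles', '7,5', '5,5';

# "A" strips the trailing padding, so the fields join cleanly.
my $csv = join ';', unpack($template, $row);
print "$template\n$csv\n";
```

    Note that "A" removes trailing spaces only, so right-aligned values keep their leading spaces in the output.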
    

    This second script edits the output files in place and merges a row with the line after it when that line carries the row's d.i. entries:

    #!/bin/perl -i
    
    use strict;
    use warnings;
    
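    my $last;    # most recent row, held back until we know what follows it
    
    while(<>) {
    
        if ( defined $last and /^;+[ \t]+.d\.i\./ ) {
    
            # Continuation row: copy its d.i. fields into the held row.
            s/[\r\n]+//g;
            $last =~ s/[\r\n]+//g;
    
            # A limit of -1 keeps trailing empty fields, so columns stay aligned.
            my @current_array = split(";", $_, -1);
            my @last_array = split(";", $last, -1);
    
            for my $i (0 .. $#current_array) {
                if ( $current_array[$i] =~ /d\.i\./ ) {
                    $last_array[$i] = $current_array[$i];
                }
            }
            $last = join(";", @last_array) . "\n";
    
        } else {
    
            # Ordinary row: the held row is complete, so print it.
            print $last if defined $last;
            $last = $_;
        }
    
        # With -i, print only writes to the file while it is being read,
        # so flush the held row on the last line of each input file.
        if ( eof ) {
            print $last;
            undef $last;
        }
    }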
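
    To make the merge step easier to follow, here it is on its own with two in-memory rows. Both rows are invented for the example, and merge_di is a hypothetical helper name; the idea is simply that a d.i. marker in the continuation row replaces the field at the same position in the row before it.

```perl
use strict;
use warnings;

# Merge a continuation row into the row before it: wherever the
# continuation carries a "d.i." marker, it replaces the earlier field.
sub merge_di {
    my ($prev, $cont) = @_;
    my @prev = split /;/, $prev, -1;    # -1 keeps trailing empty fields
    my @cont = split /;/, $cont, -1;
    for my $i (0 .. $#cont) {
        $prev[$i] = $cont[$i] if $cont[$i] =~ /d\.i\./;
    }
    return join ';', @prev;
}

# Made-up rows: the first is missing its fourth field,
# the second carries only the d.i. marker for that column.
my $row          = 'Alt Camp;VY;Nulles;;5,5;10,9';
my $continuation = ';;;  d.i.;;';

my $merged = merge_di($row, $continuation);
print "$merged\n";
```

    All other fields of the first row are left untouched, so a continuation row can never blank out a value that was already there.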
    

    The output is in CSV format with semicolon delimiters.
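
    If you want to read those files back into another Perl script, something along these lines might do. It is a rough sketch that assumes the first three fields are the comarca, the station code and the station name, and that values use a decimal comma as in the tables above; parse_row and the sample line are my own inventions.

```perl
use strict;
use warnings;

# Parse one semicolon-delimited row and turn decimal commas into dots.
# Cells that are empty or hold a marker such as "d.i." become undef.
sub parse_row {
    my ($line) = @_;
    $line =~ s/[\r\n]+//g;
    my ($comarca, $codi, $nom, @values) = split /;/, $line, -1;

    my @numbers;
    for my $v (@values) {
        $v =~ s/^\s+|\s+$//g;               # trim the column padding
        if ($v =~ /^-?\d+(?:,\d+)?$/) {
            $v =~ tr/,/./;                  # 7,5 becomes 7.5
            push @numbers, $v + 0;
        } else {
            push @numbers, undef;
        }
    }
    return { comarca => $comarca, codi => $codi, nom => $nom, numbers => \@numbers };
}

# Made-up line in the same shape as the script output.
my $record = parse_row("Alt Camp;VY;Nulles;    7,5;    5,5;  d.i.\n");
printf "%s (%s): %s\n", $record->{nom}, $record->{codi},
    join ', ', map { defined $_ ? $_ : 'n/a' } @{ $record->{numbers} };
```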
