I have been asked to automate a particular task at work today which takes up a lot of our time! The following is what needs to be done and I would appreciate any help on how I can do this (Implementation Advice) within the realms of my knowledge, if possible.
Problem
I have a PowerPoint document (.ppt). I would like to extract text from there (the text is in bullet point format). I would like to insert these bullets points into an Excel sheet, each bullet point should be a row. I would also like to put in the adjacent column the page this bullet point text was taken from.
So, basically: Extract from ppt --> Insert into Excel sheet each row being a bullet point.
Technologies available to me
Perl, PHP and Java.
I would prefer PHP to be honest as this is my primary language, but I am happy to consider anything else you guys/gals think is best. Second would be Perl and then Java. I do not want to be compiling classes and installing the JDK just for this! :)
Key Questions
- How do you reference a bullet point?
- Am I likely to end up with just a load of unstructured text in the Excel sheet?
- Are there any barriers to reading from a ppt file?
Update
I would consider MS technologies (VB, etc.) if it makes life easier but I have never used it and I despise MS technology! Hope I don't get flamed by the evangelists! :)
Here is a sample script using Win32::OLE.
By the way, once you have converted the slides into a format you can process, you can use Spreadsheet::WriteExcel on non-MS systems to write the output. Therefore, I would recommend two programs: One to transform the PowerPoint documents and another to generate the Excel files.
Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools → Macro → Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.
#!/usr/bin/perl
use strict;
use warnings;
use FindBin qw( $Bin );
use File::Spec::Functions qw( catfile );
use Win32::OLE;
use Win32::OLE::Const 'Microsoft PowerPoint';
$Win32::OLE::Warn = 3;
my $ppt = get_ppt();
$ppt->{Visible} = 1;
my $ppt_file = catfile $Bin, 'test.ppt';
my $doc = $ppt->Presentations->open( $ppt_file );
my $slides = $doc->Slides;
my $num_slides = $slides->Count;
for my $slide_idx (1 .. $num_slides) {
print "=== Begin Slide $slide_idx ===\n";
my $slide = $doc->Slides->Item( $slide_idx );
my $shapes = $slide->Shapes;
my $num_shapes = $shapes->Count;
for my $shape_idx (1 .. $num_shapes) {
my $shape = $shapes->Item($shape_idx);
next unless $shape->HasTextFrame;
my $pars = $shape->TextFrame->TextRange->Paragraphs;
my $num_pars = $pars->Count;
for my $par_idx (1 .. $num_pars) {
my $par = $pars->Paragraphs($par_idx,1);
print_par( $par );
}
}
print "=== End Slide $slide_idx ===\n\n";
}
sub print_par {
my ($par) = @_;
my @bullets = qw( - * > + = @ );
my $bullet_format = $par->ParagraphFormat->Bullet;
my $bullet_type = $bullet_format->Type;
my $bullet_char = '';
if ($bullet_type == ppBulletNumbered) {
$bullet_char = $bullet_format->Number . "\t";
}
elsif( $bullet_type == ppBulletUnnumbered ) {
# Need a Unicode => ASCII mapping if you want to use
# $bullet_format->Character
my $indent = $par->IndentLevel % scalar @bullets;
$bullet_char = $bullets[$indent] . "\t";
}
my $text = $par->Text;
$text =~ s/\s+$//;
print $bullet_char, $text,"\n";
}
sub get_ppt {
my $app;
eval {
$app = Win32::OLE->GetActiveObject('PowerPoint.Application');
};
die "$@\n" if $@;
unless($app) {
$app = Win32::OLE->new(
'PowerPoint.Application', sub { $_[0]->Quit }
) or die "Oops, cannot start PowerPoint: ",
Win32::OLE->LastError, "\n";
}
return $app;
}
It can be done with Perl. Pretty much anything you can do with VBA can be done with Perl via Win32::OLE. I've used the Win32::OLE module to work with MS-Office documents, both extracting and creating content. It's been awhile though. Start here, http://win32.perl.org/wiki/index.php?title=Win32_Perl_Modules about the middle of the page.
The VBA documentation on each of the objects is useful for reference, finding what objects there are and the methods and properties available on them.
Puh...you'll have a hard time in dealing with MS Office files using Perl, PHP or Java. When I had to do automation for MS Office files I most often used VBA (Visual Basic for Applications). Take a look at it. For a lot of stuff you can just record a macro, take a look at the produced code and learn how things are being referenced. Then take pieces of the generated code and create your own VBA modules and tools. I could imagine that such a functionality could perfectly fit as an Add-in for Power Point.
Is Visual Basic for Applications not available to you? That should be built-in to Office, and since you're going Office-to-Office it might be easier.
You could use OpenOffice.org's presentation app (Impress) to import the Powerpoint file. You could then export it in native OpenOffice.org format, which is XML. You should then be able to parse the plain-text XML with the language of your choice.
As other people have pointed out, if you want to work with the Powerpoint format directly, you really need to use a Microsoft language (VB, VBA, C#, etc.).
I'm thinking you may want to look into programs that convert from PPT to a CSV file possibly with PDF in the middle? Once the data is in CSV format, you may be able to process it by php/perl much more easily.
Doing this from scratch will be very time consuming because the Office document formats are very complicated in general.
This sounds a lot like what I do at work, though I work mostly in Excel and Word. Your best bet would be use VBA in PowerPoint to look at each page and find the bullets. Then write to a file in CSV format, each bullet on a different line, which will open in Excel, each bullet on it's own row.
Finding what is a bullet and what isn't might be trick. Try recording some macros selecting, adding and deleting bullets, and maybe changing the level of a few. That should give you enough info about what objects to be looking for, and how they can be worked with.
来源:https://stackoverflow.com/questions/1214452/automating-a-job-at-work-importing-powerpoint-bullet-text-into-an-excel-sheet