Convert Word doc or docx files into text files?

难免孤独 2020-12-05 01:28

I need a way to convert .doc or .docx files to .txt without installing anything. I also don't want to have to manually open Wor

11 answers
  • 2020-12-05 01:44

    Note that you can also use OpenOffice to convert documents, drawings, spreadsheets, and so on, on both Windows and *nix platforms.

    You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.

    On the OpenOffice::UNO page you will also find a sample Perl scriptlet which opens a document. All you then need to do is export it to txt by using the document.storeToURL() method. See a Python example which can be easily adapted to your Perl needs.

  • 2020-12-05 01:46

    I strongly recommend Aspose.Words if you can use Java or .NET. It can convert between all major text file types without Word installed.

  • 2020-12-05 01:48

    A simple Perl-only solution for docx:

    1. Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)

    2. Use XML::LibXML to parse it.

    3. Then use XML::LibXSLT to transform it into text or HTML format. Search the web to find a nice docx2txt.xsl file :)
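
For comparison, here is a minimal Python sketch of the same three steps using only the standard library. It is not the Perl/XSLT approach itself: in place of an XSLT stylesheet it simply joins the `w:t` text runs of each `w:p` paragraph. The function names `docx_to_text` and `convert` are made up for illustration, and the sketch assumes a plain document (text boxes nested inside paragraphs may be emitted twice).

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one line per paragraph."""
    # Step 1: a .docx is a zip archive; read the main document part.
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Step 2: parse the XML.
    root = ET.fromstring(xml_bytes)

    # Step 3 (XSLT replaced by a loop): join the text runs of each paragraph.
    lines = []
    for para in root.iter("{%s}p" % W_NS):
        runs = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        lines.append("".join(runs))
    return "\n".join(lines)


def convert(docx_path, txt_path):
    """Write the extracted text to txt_path as UTF-8."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(docx_to_text(docx_path) + "\n")
```

Calling `convert("some.docx", "some.txt")` would write one line of text per paragraph; headers, footers, and footnotes live in other parts of the archive and are not read here.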

    Cheers !

    J.

  • 2020-12-05 01:48

    For .doc, I've had some success with the Linux command-line tool antiword. It extracts the text from a .doc very quickly and renders indentation well. You can then redirect its output to a text file in bash.

    For .docx, I've used the OOXML SDK, as some other users mentioned. It is just a .NET library that makes it easier to work with the OOXML zipped up inside an OOXML file. There is a lot of metadata that you will want to discard if you are only interested in the text. I see that other people have already written the code: DocXToText.

    I have also found that Aspose.Words has a very simple API and great support.

    There is also this bash command from commandlinefu.com which works by unzipping the .docx:

    unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
    
  • 2020-12-05 01:49

    Both .doc files saved as WordprocessingML and .docx files store their content as XML, so you can parse that XML to retrieve the actual text of the document. You'll have to read their specifications to figure out which tags contain readable text.
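
As a rough illustration of which tags matter, the Python sketch below collects text from `w:t` elements inside runs (`w:r`), and maps `w:tab` and `w:br` inside a run to a tab and a newline. It assumes the two namespaces listed in the code: the .docx one and the Word 2003 XML one. The function name `wordml_to_text` is hypothetical, and the sketch ignores everything else the specifications define (fields, footnotes, deleted text, and so on).

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: .docx (Word 2007+) and Word 2003 XML documents.
NAMESPACES = (
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "http://schemas.microsoft.com/office/word/2003/wordml",
)


def _tag(ns, name):
    """Build the '{namespace}name' form that ElementTree uses for tags."""
    return "{%s}%s" % (ns, name)


def wordml_to_text(xml_source):
    """Extract readable text from a WordprocessingML XML string or bytes."""
    root = ET.fromstring(xml_source)
    lines = []
    for ns in NAMESPACES:
        for para in root.iter(_tag(ns, "p")):
            parts = []
            # Only look at direct children of runs, so tab-stop
            # definitions in paragraph properties are not counted.
            for run in para.iter(_tag(ns, "r")):
                for node in run:
                    if node.tag == _tag(ns, "t"):
                        parts.append(node.text or "")
                    elif node.tag == _tag(ns, "tab"):
                        parts.append("\t")
                    elif node.tag == _tag(ns, "br"):
                        parts.append("\n")
            lines.append("".join(parts))
    return "\n".join(lines)
```

For a .docx, the input would be the contents of word/document.xml from the archive; for a Word 2003 XML file, the file itself.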

  • 2020-12-05 01:53

    Sinan Ünür's method works well.
    However, it crashed on some of the files I was converting.

    Another method is to use Win32::OLE and Win32::Clipboard as follows:

    • Open the Word document
    • Select all the text
    • Copy it to the clipboard
    • Write the clipboard contents to a txt file
    • Empty the clipboard and close the Word document

    Based on the script given by Sigvald Refsu in http://computer-programming-forum.com/53-perl/c44063de8613483b.htm, I came up with the following script.

    Note: I chose to save the txt file with the same basename as the .docx file and in the same folder, but this can easily be changed.

    ########################################### 
    use strict; 
    use File::Spec::Functions qw( catfile );
    use FindBin '$Bin';
    use Win32::OLE qw(in with); 
    use Win32::OLE::Const 'Microsoft Word'; 
    use Win32::Clipboard; 
    
    my $monitor_word=0; #set 1 to watch MS Word being opened and closed
    
    sub docx2txt {
        ##Note: the path shall be in the form "C:\dir\ with\ space\file.docx"; 
        my $docx_file=shift; 
    
        #MS Word object
        my $Word = Win32::OLE->new('Word.Application', 'Quit') or die "Couldn't run Word"; 
        #Monitor what happens in MS Word 
        $Word->{Visible} = 1 if $monitor_word; 
    
        #Open file 
        my $Doc = $Word->Documents->Open($docx_file); 
        with ($Doc, ShowRevisions => 0); #Turn off revision marks 
    
        #Select the complete document
        $Doc->Select(); 
        my $Range = $Word->Selection();
        with ($Range, ExtendMode => 1);
        $Range->SelectAll(); 
    
        #Copy selection to clipboard 
        $Range->Copy();
    
        #Create txt file 
        my $txt_file=$docx_file; 
        $txt_file =~ s/\.docx$/.txt/;
        #Three-argument open with a lexical filehandle; $! holds the OS error message
        open(my $txt_fh, '>', $txt_file) or die "Error while trying to write in $txt_file ($!)"; 
        print {$txt_fh} Win32::Clipboard::Get(), "\n"; 
        close $txt_fh; 
    
        #Empty the Clipboard (to prevent warning about "huge amount of data in clipboard")
        Win32::Clipboard::Set("");
    
        #Close Word file without saving 
        $Doc->Close({SaveChanges => wdDoNotSaveChanges});
    
        # Disconnect OLE 
        undef $Word; 
    }
    

    Hope this helps.
