Question
I am trying to extract images from Word documents using the ActiveXObject in JavaScript (IE only).
I was unable to find any API reference for the Word object, only a few hints from around the Internet:
var filename = 'path/to/word/doc.docx'
var word = new ActiveXObject('Word.Application')
var doc = word.Documents.Open(filename)
// Gets the main document content (a Range object; its text is in doc.Content.Text)
var docText = doc.Content
How would I access images in the Word doc using something like doc.Content?
Also, if anyone has a definitive source (preferably from Microsoft) for the API that'd be extremely helpful.
Answer 1:
So after a few weeks of research, I found it easiest to extract the images with the SaveAs function of the document object that Word exposes through ActiveX. If the file is saved as an HTML document, Word creates a folder containing the images.
From there, you can use XMLHttp to grab the HTML file and create new IMG tags that the browser can display (I'm using IE 9 because the ActiveXObject only works in Internet Explorer).
Let's begin with the SaveAs portion:
// Define the path to the file
var filepath = 'path/to/the/word/doc.docx'
// Create a new Word application instance through ActiveX
var word = new ActiveXObject('Word.Application')
// Open the document
var doc = word.Documents.Open(filepath)
// Save the DOCX as an HTML file (the 8 specifies you want to save it as an HTML document)
doc.SaveAs(filepath + '.htm', 8)
Now we should have a folder in the same directory with the image files in it.
Note: In the Word HTML, the images use <v:imagedata> tags, which are nested inside a <v:shape> tag; for example:
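As a rough sketch of the step above, the open/save sequence could be wrapped in a helper that also closes the document and reports where the output should land. The names `saveDocAsHtml` and `WD_FORMAT_HTML` and the injected `word` parameter are made up for illustration. The `_files` folder name is only an assumption based on the image paths shown in this answer (e.g. doc.docx_files/image001.png), so check it against what Word actually writes on your machine.

```javascript
// Numeric value of Word's wdFormatHTML save format (the "8" used above).
var WD_FORMAT_HTML = 8;

// Hypothetical helper: `word` is a Word.Application object, e.g. the result
// of new ActiveXObject('Word.Application') in Internet Explorer.
function saveDocAsHtml(word, filepath) {
  var htmlPath = filepath + '.htm';
  var doc = word.Documents.Open(filepath);
  try {
    doc.SaveAs(htmlPath, WD_FORMAT_HTML);
  } finally {
    // 0 = wdDoNotSaveChanges: close the document without prompting to save.
    doc.Close(0);
  }
  return {
    htmlPath: htmlPath,
    // Assumption: the companion folder is named "<original file>_files".
    imageFolder: filepath + '_files'
  };
}
```

In IE this might be called as `var out = saveDocAsHtml(new ActiveXObject('Word.Application'), filepath)`. When processing many documents, reusing one application object and calling its `Quit()` method at the end avoids leaving Word processes running in the background.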
<v:shape style="width: 241.5pt; height: 71.25pt;">
<v:imagedata src="path/to/the/word/doc.docx_files/image001.png">
...
</v:imagedata>
</v:shape>
I've removed the extraneous attributes and tags that Word saves.
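Given markup of this shape, one possible alternative to walking the DOM is to pull the src values straight out of the HTML text with a regular expression. This is only a sketch, not the approach used in the rest of this answer. The function name `extractImageSources` is hypothetical, and the pattern assumes the src value is always quoted, as in the sample above.

```javascript
// Hypothetical helper: collect the src value of every <v:imagedata> tag
// found in the HTML text that Word saved.
function extractImageSources(htmlText) {
  // Matches <v:imagedata ... src="value"> or src='value' (quoted values only).
  var pattern = /<v:imagedata\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  var sources = [];
  var match;
  while ((match = pattern.exec(htmlText)) !== null) {
    // Group 1 holds a double-quoted value, group 2 a single-quoted one.
    sources.push(match[1] || match[2] || '');
  }
  return sources;
}
```

Because this works on the raw response text, it sidesteps whatever the browser does to attribute values when the markup is parsed, at the cost of being tied to the exact tag shape Word emits.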
To access the HTML using JavaScript, use an XMLHttpRequest object.
var xmlhttp = new XMLHttpRequest()
var html_text = ""
Because I am accessing hundreds of Word docs, I've found it is best to define the XMLHttp's onreadystatechange callback before sending the call.
// Define the onreadystatechange callback function
xmlhttp.onreadystatechange = function() {
  // Check to make sure the response has fully loaded
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
    // Grab the response text
    var html_text = xmlhttp.responseText
    // Load the HTML into the innerHTML of a DIV to add the HTML to the DOM
    document.getElementById('doc_html').innerHTML = html_text.replace("<html>", "").replace("</html>", "")
    // Collect all HTML elements with the "v:imagedata" tag
    var images = document.getElementById('doc_html').getElementsByTagName("v:imagedata")
    // Loop through each image (j is declared with var so it does not leak into the global scope)
    for (var j = 0; j < images.length; j++) {
      // Grab the source attribute to get the image name
      var src = images[j].getAttribute('src')
      // Check to make sure the image has a 'src' attribute
      if (src != undefined) {
        ...
I've had many issues loading the correct src attribute because of the way IE escapes its HTML attributes when they are loaded into the innerHTML of the doc_html div, so in the example below I am using a pseudo-path and src.split('/')[1] to grab the image name (this method won't work if the path contains more than one forward slash!):
        ...
        images[j].setAttribute('src', '/path/to/the/folder/containing/the/images/' + src.split('/')[1])
        ...
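To get around the single-slash limitation, a small helper could take whatever follows the last separator instead of relying on a fixed index. This is a minimal sketch, and the name `imageFileName` is made up for illustration. It treats backslashes as separators too, on the assumption that Windows-style paths may show up in the src value.

```javascript
// Hypothetical helper: return the file name at the end of an image path,
// however many "/" or "\" separators the path contains.
function imageFileName(src) {
  // Normalize backslashes so only one kind of separator has to be handled.
  var normalized = String(src).replace(/\\/g, '/');
  // lastIndexOf returns -1 when there is no separator, so the whole string is kept.
  return normalized.substring(normalized.lastIndexOf('/') + 1);
}
```

The line above could then read `images[j].setAttribute('src', '/path/to/the/folder/containing/the/images/' + imageFileName(src))`.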
Here is where we add a new img tag to the HTML div. The parent of the v:imagedata element is the v:shape object, and that element's parent happens to be a p object. We append the new img tag to the p's innerHTML, taking the src attribute from the image and the style information from the v:shape element:
        ...
        images[j].parentElement.parentElement.innerHTML += "<img src='" + images[j].getAttribute('src') + "' style='" + images[j].parentElement.getAttribute('style') + "'>"
      }
    }
  }
}
// Read the HTML document using XMLHttpRequest (the third argument, false, makes the request synchronous)
xmlhttp.open("POST", filepath + '.htm', false)
xmlhttp.send()
Although it is a bit specific, the above method successfully added img tags to the HTML at the positions the images had in the original document.
Source: https://stackoverflow.com/questions/15212895/how-to-extract-images-from-word-documents-using-javascript
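One fragile spot in the string concatenation above is that a quote character inside the src or style value would break the generated markup. A hedged sketch of a safer builder follows. Both `escapeAttribute` and `buildImgTag` are hypothetical names, not part of any browser or Word API.

```javascript
// Escape characters that could break out of a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: build the <img> markup from an image src and the
// style text copied from the enclosing v:shape element.
function buildImgTag(src, style) {
  var tag = '<img src="' + escapeAttribute(src) + '"';
  // Only emit a style attribute when the v:shape actually had one.
  if (style) {
    tag += ' style="' + escapeAttribute(style) + '"';
  }
  return tag + '>';
}
```

The append step could then become `innerHTML += buildImgTag(images[j].getAttribute('src'), images[j].parentElement.getAttribute('style'))`, which also avoids writing the text "null" into the style attribute when a shape has no style.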