Question
I'm trying to get the plain text from a Word document. Specifically, the XPath is giving me trouble. How do you select the <w:t> tags? Here's the code I have:
    public static string TextDump(Package package)
    {
        StringBuilder builder = new StringBuilder();
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.Load(package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream());
        foreach (XmlNode node in xmlDoc.SelectNodes("/descendant::w:t"))
        {
            builder.AppendLine(node.InnerText);
        }
        return builder.ToString();
    }
Answer 1:
Your problem is the XML namespaces. SelectNodes doesn't know how to translate the w prefix in <w:t/> to the full namespace URI. Therefore, you need to use the overload that takes an XmlNamespaceManager as the second argument. I modified your code a bit, and it seems to work:
    public static string TextDump(Package package)
    {
        StringBuilder builder = new StringBuilder();
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.Load(package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream());

        // Map the "w" prefix used in the XPath to the WordprocessingML namespace URI.
        XmlNamespaceManager mgr = new XmlNamespaceManager(xmlDoc.NameTable);
        mgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

        // Pass the manager so SelectNodes can resolve "w:t".
        foreach (XmlNode node in xmlDoc.SelectNodes("/descendant::w:t", mgr))
        {
            builder.AppendLine(node.InnerText);
        }
        return builder.ToString();
    }
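To see the prefix mapping in isolation, here is a minimal sketch that runs against an in-memory XML string instead of a real .docx package. The class name NamespaceDemo, the method ExtractRuns, and the sample XML are made up for illustration. It also registers the namespace under the prefix "x" rather than "w", to show that the prefix in the XPath only has to match what you gave the XmlNamespaceManager, not what the document itself uses.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class NamespaceDemo
{
    // The WordprocessingML main namespace, as used in the answer above.
    private const string WordMl =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the text of every <w:t> element, in document order.
    public static IList<string> ExtractRuns(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // The prefix "x" is arbitrary; only the namespace URI has to match.
        XmlNamespaceManager mgr = new XmlNamespaceManager(doc.NameTable);
        mgr.AddNamespace("x", WordMl);

        List<string> runs = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//x:t", mgr))
        {
            runs.Add(node.InnerText);
        }
        return runs;
    }

    public static void Main()
    {
        // Made-up sample resembling the structure of /word/document.xml.
        string xml =
            "<w:document xmlns:w=\"" + WordMl + "\">" +
            "<w:body><w:p>" +
            "<w:r><w:t>Hello</w:t></w:r>" +
            "<w:r><w:t>world</w:t></w:r>" +
            "</w:p></w:body></w:document>";

        foreach (string run in ExtractRuns(xml))
        {
            Console.WriteLine(run);
        }
    }
}
```

Note that a single paragraph can be split across several <w:t> runs, so appending a line break after each run, as TextDump does, may break text in the middle of a sentence.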
Answer 2:
Take a look at the Open XML Format SDK 2.0. There are some examples of how to process documents.
Although I have not used it, there is also this Open Office XML C# Library that you can take a look at.
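For comparison, a rough sketch of the same text dump with the Open XML SDK might look like the following. It assumes the DocumentFormat.OpenXml package is referenced; the class name SdkTextDump and the choice to emit one line per paragraph are my own, not taken from the SDK samples.

```csharp
using System.IO;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SdkTextDump
{
    // Hypothetical helper: writes one output line per paragraph of the main document part.
    public static string TextDump(Stream docxStream)
    {
        StringBuilder builder = new StringBuilder();

        // Open read-only; the SDK resolves the WordprocessingML namespace itself.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxStream, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            foreach (Paragraph paragraph in body.Descendants<Paragraph>())
            {
                // InnerText concatenates the text of all runs in the paragraph.
                builder.AppendLine(paragraph.InnerText);
            }
        }
        return builder.ToString();
    }
}
```

With the typed classes there is no XPath and no namespace manager to set up, at the cost of an extra dependency.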
Source: https://stackoverflow.com/questions/1099458/how-to-grab-text-from-word-docx-document-in-c