I have large batches of XHTML files that are manually updated. During the review phase of the updates I would like to programmatically check the well-formedness of the files. I am currently using an XmlReader, but the time required on an average CPU is much longer than I expected.
The XHTML files range in size from 4KB to 40KB, and verifying takes several seconds per file. Checking is essential, but I would like to keep the time as short as possible because the check is performed while files are being read into the next process step.
Is there a faster way of doing a simple XML well-formedness check? Maybe using external XML libraries?
I can confirm that validating "regular" XML-based content is lightning fast using the XmlReader, and as suggested the problem seems to be related to the fact that the XHTML DTD is read each time a file is validated.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Note that in addition to the DTD, corresponding .ent files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) are also downloaded.
Ignoring the DTD completely is not really an option for XHTML, because well-formedness is closely linked to the allowed HTML entities (e.g., an entity such as &nbsp; will promptly introduce validation errors when the DTD is ignored).
The problem was solved by using a custom XmlResolver as suggested, in combination with local (embedded) copies of both the DTD and entity files.
I will post the solution here once I have cleaned up the code.
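This is not the original poster's code, only a minimal sketch of the approach described above. It assumes the DTD and the three .ent files have been saved into a local folder (the post mentions embedded copies; a folder is used here to keep the example short). The names LocalDtdResolver, XhtmlChecker and dtdFolder are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical resolver: serves DTD and .ent files from a local folder
// instead of downloading them from www.w3.org on every check.
public sealed class LocalDtdResolver : XmlResolver
{
    private readonly string _dtdFolder;

    public LocalDtdResolver(string dtdFolder)
    {
        _dtdFolder = dtdFolder;
    }

    // Nothing is fetched over the network, so credentials are ignored.
    public override ICredentials Credentials
    {
        set { }
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Map e.g. http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
        // to <dtdFolder>/xhtml-lat1.ent (assumes plain file names).
        string fileName = Path.GetFileName(absoluteUri.AbsolutePath);
        string localPath = Path.Combine(_dtdFolder, fileName);
        if (!File.Exists(localPath))
        {
            throw new XmlException("No local copy available for " + absoluteUri);
        }
        return File.OpenRead(localPath);
    }
}

public static class XhtmlChecker
{
    // Returns true if the file can be read to the end without an XmlException.
    public static bool IsWellFormed(string xhtmlPath, string dtdFolder)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse; // .NET 4.0 and later
        settings.XmlResolver = new LocalDtdResolver(dtdFolder);
        try
        {
            // Open the file as a stream so that only the DTD and entity
            // references go through the custom resolver, not the document itself.
            using (FileStream input = File.OpenRead(xhtmlPath))
            using (XmlReader reader = XmlReader.Create(input, settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

A call such as XhtmlChecker.IsWellFormed(path, dtdFolder) then never touches the network. A reference to a file that is missing from the folder is reported as a failed check rather than triggering a download.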
I would expect that XmlReader with while (reader.Read()) {} would be the fastest managed approach. It certainly shouldn't take seconds to read 40KB... what is the input approach you are using?
Do you perhaps have some external (schema etc.) entities to resolve? If so, you might be able to write a custom XmlResolver (set via XmlReaderSettings) that uses locally cached schemas rather than a remote fetch...
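For XHTML 1.0 specifically, .NET 4.0 and later already ship such a resolver: System.Xml.Resolvers.XmlPreloadedResolver carries the XHTML 1.0 DTDs and entity files internally. A minimal sketch, assuming that framework version is available (the class name PreloadedXhtmlChecker is made up for this example):

```csharp
using System.IO;
using System.Xml;
using System.Xml.Resolvers;

public static class PreloadedXhtmlChecker
{
    // Returns true if the XHTML text can be read to the end without an XmlException.
    public static bool IsWellFormed(TextReader xhtml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;
        // Resolves the XHTML 1.0 DTDs and .ent files from preloaded copies;
        // with no fallback resolver, other external references are not fetched.
        settings.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);
        try
        {
            using (XmlReader reader = XmlReader.Create(xhtml, settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Usage would be along the lines of PreloadedXhtmlChecker.IsWellFormed(new StringReader(text)). On older framework versions a hand-written resolver is still needed.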
The following does ~300KB virtually instantly:
using System;
using System.Diagnostics;
using System.IO;
using System.Xml;

using (MemoryStream ms = new MemoryStream())
{
    // Build a test document of roughly 300KB in memory.
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.CloseOutput = false; // keep the stream open after the writer is disposed
    using (XmlWriter writer = XmlWriter.Create(ms, settings))
    {
        writer.WriteStartElement("xml");
        for (int i = 0; i < 15000; i++)
        {
            writer.WriteElementString("value", i.ToString());
        }
        writer.WriteEndElement();
    }
    Console.WriteLine(ms.Length + " bytes");
    ms.Position = 0;

    // Time a full pass over every node.
    int nodes = 0;
    Stopwatch watch = Stopwatch.StartNew();
    using (XmlReader reader = XmlReader.Create(ms))
    {
        while (reader.Read()) { nodes++; }
    }
    watch.Stop();
    Console.WriteLine("{0} nodes in {1}ms", nodes,
        watch.ElapsedMilliseconds);
}
Create an XmlReader object by passing in an XmlReaderSettings object that has its ConformanceLevel set to ConformanceLevel.Document. This will validate well-formedness.
This MSDN article should explain the details.
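As a rough illustration of that suggestion, here is a minimal sketch (the class and method names are made up for the example). ConformanceLevel.Document is already the default for XmlReaderSettings, so setting it only makes the intent explicit. Well-formedness errors surface as an XmlException while reading.

```csharp
using System.IO;
using System.Xml;

public static class WellFormedness
{
    // Returns true if the text is a well-formed XML document
    // (a single root element, properly nested and closed tags).
    public static bool IsWellFormedDocument(string xml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Document;
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                // Reading to the end forces the parser over the whole input.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Note that with default settings a document containing a DOCTYPE is rejected outright, so for XHTML the DtdProcessing setting and the resolver still have to be configured separately.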
On my fairly ordinary laptop, reading a 250K XML document from start to finish with an XmlReader takes 6 milliseconds. Something else besides just parsing XML is the culprit.
I know I'm necro-posting, but I think this could be a solution:
- Use HTML Tidy to clean up your XML. Set the option to remove the doctype.
- Then read the XHTML/XML generated by Tidy.
Here's some sample code:
public void GetDocumentStructure(int documentID)
{
    string scmRepoPath = ConfigurationManager.AppSettings["SCMRepositoryFolder"];
    string docFilePath = scmRepoPath + "\\" + documentID.ToString() + ".xml";
    string docFilePath2 = scmRepoPath + "\\" + documentID.ToString() + "_clean.xml"; // not used below

    Tidy tidy = new Tidy();
    tidy.Options.MakeClean = true;
    tidy.Options.NumEntities = true;
    tidy.Options.Xhtml = true;
    // this option removes the DTD on the generated output of Tidy
    tidy.Options.DocType = DocType.Omit;

    // Dispose the streams and the reader once the check is done.
    using (FileStream input = new FileStream(docFilePath, FileMode.Open))
    using (MemoryStream output = new MemoryStream())
    {
        TidyMessageCollection msgs = new TidyMessageCollection();
        tidy.Parse(input, output, msgs);
        output.Seek(0, SeekOrigin.Begin);

        int node = 0;
        System.Diagnostics.Stopwatch watch = System.Diagnostics.Stopwatch.StartNew();
        using (XmlReader rd = XmlReader.Create(output))
        {
            while (rd.Read())
            {
                ++node;
            }
        }
        watch.Stop();
        Console.WriteLine("Duration was : " + watch.Elapsed.ToString());
    }
}
As others mentioned, the bottleneck is most likely not the XmlReader.
Check whether you happen to be doing a lot of string concatenation without a StringBuilder. That can really nuke your performance.
Personally, I'm pretty lazy ... so I look for .NET libraries that already solve the problem. Try using the DataSet.ReadXml() function and catch the exceptions. It does a pretty amazing job of explaining the XML format errors.
I'm using this function for verifying strings/fragments:
' Requires Imports System.Xml.Schema. XMLValidationCallBack is a separate handler that is not shown here.
<Runtime.CompilerServices.Extension()>
Public Function IsValidXMLFragment(ByVal xmlFragment As String, Optional Strict As Boolean = False) As Boolean
    IsValidXMLFragment = True
    Dim NameTable As New Xml.NameTable
    Dim XmlNamespaceManager As New Xml.XmlNamespaceManager(NameTable)
    XmlNamespaceManager.AddNamespace("xsd", "http://www.w3.org/2001/XMLSchema")
    XmlNamespaceManager.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")
    Dim XmlParserContext As New Xml.XmlParserContext(Nothing, XmlNamespaceManager, Nothing, Xml.XmlSpace.None)
    Dim XmlReaderSettings As New Xml.XmlReaderSettings
    XmlReaderSettings.ConformanceLevel = Xml.ConformanceLevel.Fragment
    XmlReaderSettings.ValidationType = Xml.ValidationType.Schema
    If Strict Then
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ProcessInlineSchema)
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ReportValidationWarnings)
    Else
        XmlReaderSettings.ValidationFlags = XmlSchemaValidationFlags.None
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.AllowXmlAttributes)
    End If
    AddHandler XmlReaderSettings.ValidationEventHandler, Sub() IsValidXMLFragment = False
    AddHandler XmlReaderSettings.ValidationEventHandler, AddressOf XMLValidationCallBack
    Try
        Using XmlReader As Xml.XmlReader = Xml.XmlReader.Create(New IO.StringReader(xmlFragment), XmlReaderSettings, XmlParserContext)
            While XmlReader.Read
                'Read entire XML
            End While
        End Using
    Catch ex As Xml.XmlException
        'Well-formedness errors are thrown instead of being reported to the validation handler
        IsValidXMLFragment = False
    End Try
End Function
I'm using this function for verifying files:
Public Function IsValidXMLDocument(ByVal Path As String, Optional Strict As Boolean = False) As Boolean
    IsValidXMLDocument = IO.File.Exists(Path)
    If Not IsValidXMLDocument Then Exit Function
    Dim XmlReaderSettings As New Xml.XmlReaderSettings
    XmlReaderSettings.ConformanceLevel = Xml.ConformanceLevel.Document
    XmlReaderSettings.ValidationType = Xml.ValidationType.Schema
    If Strict Then
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ProcessInlineSchema)
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ReportValidationWarnings)
    Else
        XmlReaderSettings.ValidationFlags = XmlSchemaValidationFlags.None
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.AllowXmlAttributes)
    End If
    XmlReaderSettings.CloseInput = True
    AddHandler XmlReaderSettings.ValidationEventHandler, Sub() IsValidXMLDocument = False
    AddHandler XmlReaderSettings.ValidationEventHandler, AddressOf XMLValidationCallBack
    Try
        Using FileStream As New IO.FileStream(Path, IO.FileMode.Open)
            Using XmlReader As Xml.XmlReader = Xml.XmlReader.Create(FileStream, XmlReaderSettings)
                While XmlReader.Read
                    'Read entire XML
                End While
            End Using
        End Using
    Catch ex As Xml.XmlException
        'Well-formedness errors are thrown instead of being reported to the validation handler
        IsValidXMLDocument = False
    End Try
End Function
Source: https://stackoverflow.com/questions/527415/what-is-the-fastest-way-to-programatically-check-the-well-formedness-of-xml-file