Currently, I am trying to clean up an HTML file using JTidy, convert it to XHTML and provide the results to a DOM parser. The following code is the result of these efforts:<
HTML dtd's are huge, using includes. They take forever. Use an XML catalog. There one can store the dtds locally and map them by their system ID.
If you use a tool, like maven, you will find sufficient pointers.
The advantage i.o. intercepting entities as the accepted answer suggests, is that you receive the correct characters.
Even when not validating, a XML parser needs to fetch the DTD, for example to support named character entities. You should look into implementing an EntityResolver that resolves the request for the DTD to a local copy.