I'm looking for a machine-readable version of the HTML5 specs, akin to a DTD, although any format would do as long as it's parsable.
The HTML5 specs don't seem to
Trawling through W3's site I can only see two things of interest on this:
I've looked as hard as you probably have and found nothing concrete. I think validator.nu's approach is the best, as the working draft is likely to change several times before a specification is ever agreed upon. If someone did publish an unofficial DTD, it would need constant maintenance.
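For anyone who wants to lean on validator.nu from a script instead of hunting for a DTD, here is a minimal sketch of how that could look. It assumes the checker's JSON output mode (`?out=json`) and a POST of the raw document with an HTML content type; the helper names `build_request` and `summarize` are mine, and the sample response is a hand-written illustration, not real checker output.

```python
import json
import urllib.request

# Assumption: the checker accepts a POSTed document and returns JSON with ?out=json.
VALIDATOR_URL = "https://validator.nu/?out=json"


def build_request(html_text):
    """Build (but do not send) a POST request carrying the document to check."""
    return urllib.request.Request(
        VALIDATOR_URL,
        data=html_text.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )


def summarize(response_text):
    """Reduce a JSON response to (type, lastLine, message) tuples."""
    messages = json.loads(response_text).get("messages", [])
    return [(m.get("type"), m.get("lastLine"), m.get("message")) for m in messages]


request = build_request("<!DOCTYPE html><title>t</title><p>hi")

# Hand-written sample in the assumed response shape, for illustration only.
sample = '{"messages": [{"type": "error", "lastLine": 1, "message": "Example message"}]}'
summary = summarize(sample)
```

Sending is left out on purpose so the sketch runs offline; passing `request` to `urllib.request.urlopen` and feeding the decoded body to `summarize` would be the obvious next step.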
+1 great question, I wish I could find a concrete answer. I hope someone else can!
NEW as of April 2019: the WHATWG HTML5 spec as JSON, although very incomplete and a work in progress.
Uses Python to parse the multipage standard.
Full disclosure: I made this.
See also
HTML5 RelaxNG schemas
I've read this question and its answers and decided to start a new project: WHATWG HTML5 Standard Parser. Currently, it parses the single-page version of the standard's HTML page and provides the elements together with their allowed attributes.
Hope to get something started... Pull requests are welcome!!!
Since 2014-10-28, HTML5 has been a W3C Recommendation (!)... But this question is not obsolete (validators are now more complex than a simple DTD).
There is no simple parser, as @ruediste's clues show... Today, perhaps the best parser is at https://validator.nu/ ... so,
There isn't a BNF/CFG for HTML5 because HTML5 is partially about progressive enhancement and fixing errors silently. If a page features broken markup, it's the browser's duty to display the page as well as it can and not complain to the user.
More about this history can be read at Dive Into HTML5 / How Did We Get Here?:
As you might expect, the fact that “broken” HTML markup still worked in web browsers led authors to create broken HTML pages. A lot of broken pages. By some estimates, over 99% of HTML pages on the web today have at least one error in them. But because these errors don’t cause browsers to display visible error messages, nobody ever fixes them.
I guess this isn't particularly helpful, so my apologies. You could try looking at the XHTML 1.1 DTD or SGML DTD as starting points. Or, if you want a heuristic-based best-attempt approach, check out an HTML parser such as Beautiful Soup.
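To illustrate the "best-attempt" idea without installing anything, here is a small sketch using Python's standard-library `html.parser` rather than Beautiful Soup (a swapped-in stand-in, not the same tool). It only reports the tags it sees and does not implement the HTML5 tree-building or error-recovery rules; the point is just that malformed input is consumed without an exception. The names `TagCollector` and `collect_tags` are made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_tags(markup):
    """Feed markup to the lenient parser and return the tag events it saw."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Deliberately broken: <p> and <i> are never closed, and </b> is misnested.
broken = "<p>Unclosed paragraph<b>bold <i>nested</b> text"
events = collect_tags(broken)
```

No error is raised for the unclosed or misnested tags; the parser just hands back whatever it found, which is the behaviour a DTD-style grammar can't describe.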