I'm looking at the output of a tool that dumps a database table to XML. One of the columns is named 64kbit, the tool encodes that as _x0036_4kbit, and I need to understand what that representation means.
An XML name cannot start with a digit, so some other representation must be used that can be understood to mean '6'.
The tool has chosen to write the hexadecimal representation of the character instead, surrounded by underscores. The code _x0036_ contains the hexadecimal code for the character '6', which is 54 in decimal. Underscores are valid characters at the start of an XML name, so this works.
This same technique could be used to escape other characters which are invalid in XML names. This technique is used for example by Microsoft's XmlConvert, as described here, but I'm sure there are other tools which use the same technique too.
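For illustration, here is a rough Python sketch of the same idea. It is not Microsoft's actual implementation, just an approximation of the scheme; the validity test is deliberately simplified, and it ignores the edge case of a name that already contains a literal _xNNNN_ sequence (which, if I recall correctly, the real XmlConvert handles by escaping the underscore itself).

    import re

    def encode_name(name):
        # Escape any character that can't appear at this position in an XML name
        # as _xNNNN_ (simplified rule: letters and '_' anywhere, digits only
        # after the first character).
        out = []
        for i, ch in enumerate(name):
            if ch.isalpha() or ch == '_' or (i > 0 and ch.isdigit()):
                out.append(ch)
            else:
                out.append('_x%04X_' % ord(ch))
        return ''.join(out)

    def decode_name(name):
        # Reverse the escaping: turn each _xNNNN_ back into its character.
        return re.sub(r'_x([0-9A-Fa-f]{4})_',
                      lambda m: chr(int(m.group(1), 16)), name)

    print(encode_name('64kbit'))        # _x0036_4kbit
    print(decode_name('_x0036_4kbit'))  # 64kbit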
Well, it doesn't seem to be particularly standard, but XML explicitly disallows digits (and some other characters) as the first character of an element name:
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
[#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
[#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
[#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
[#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
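To make that concrete, here is a rough Python check of the rule above. It is a sketch only: it covers the listed ranges within the Basic Multilingual Plane and leaves out #x10000-#xEFFFF.

    import re

    # Simplified translation of the NameStartChar production (BMP ranges only).
    NAME_START_CHAR = re.compile(
        r'[:A-Z_a-z'
        r'\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF'
        r'\u0370-\u037D\u037F-\u1FFF\u200C-\u200D'
        r'\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
        r'\uF900-\uFDCF\uFDF0-\uFFFD]'
    )

    print(bool(NAME_START_CHAR.match('64kbit')))        # False: '6' may not start a name
    print(bool(NAME_START_CHAR.match('_x0036_4kbit')))  # True: '_' may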
This encoding just escapes the first character if it doesn't fit that requirement, using the hexadecimal value of the character: _x0036_ corresponds to hexadecimal 0x36, which is 54 in decimal and represents the digit 6.
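You can verify that correspondence quickly, e.g. in a Python shell:

    >>> hex(ord('6'))
    '0x36'
    >>> int('36', 16)
    54
    >>> chr(0x36)
    '6'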
IIRC (I was there, but it was a long time ago) the thinking was that it would be very common to map XML elements & attributes to programming-language constructs, which are represented by variables, and very few (any?) programming languages allow variable names that begin with digits. So, the idea is that XML element/attribute names should fit nicely into most languages’ variable-naming rules. Do I still believe this? If we were doing XML again, would I be OK with this? Dunno; it’d be an interesting discussion though.
That encoding isn't part of XML itself, but seems to be required by your tool, since element names must start with a character from a limited set.
That _x0036_ sequence represents the hexadecimal number 36 (decimal 54), which is your 6 character in the ASCII table.
The official word is that the restrictions imposed on Xml naming conventions are inherited from Xml's parent set SGML, with one exception only: in Xml, as an additional option, names may begin with an underscore '_' character.
SGML was developed by IBM in the 1960s, by a group of minds that were thinking '1960s style'.
As a result, the brain-storm that led to the creation of SGML was likely to have been distracted by the overwhelming notion that space-ships, time-travel and flares made of kitchen foil to protect against 'them aliens' and their foolhardy attempts at thought-provocation and mind-control were justified thought processes.
So. The question still remains. Why doesn't SGML allow numbers? Furthermore, why would there be any sort of restriction imposed on the use of any character other than the control characters: <, >, & and empty space? It would be madness, surely, to present the computer geek with so many keys for so many different characters, only to prevent him or her from using them.
The most significant reason is the 1960s-thinking parser, and its following of the complexity rule to a degree of outright pedantry:
'The simpler the parser is, the faster it will perform'
The alphabet is 26 uppercase + 26 lowercase characters, 52 in total. Allowing numbers means an additional ten digits, which is nearly a fifth more!
In human terms, this would be like having to wash six hideously filth-encrusted pots, each one taking an hour to clean, and then hidden underneath the last pot is an extra bonus pot to wash, and you must wash it! You have to repeat this routine every single day for the rest of your life, and that's exactly what it's like. Precisely!
Mark-up language documents have a tendency to bulge in content. So, fewer jobs for the parser mean a direct increase in performance. The benefits then trickle down through the ranks until they metamorphose into pure lucrative performance.
In the 'Ye olde days of horse, carriage and a Commodore 64' it was far more the user's responsibility to count their bits and bytes manually, in order for the kilobytes to take care of themselves. However, as the modern CPU is more able to cope than its ancient predecessor, the restrictions imposed by the parser have become more significant than the performance issues.
If it's any consolation, if I were to design a mark-up language myself (which, for argument's sake, we will call NAM-LIT-MAML, because Nicholas' awesome mark-up language is the most awesome mark-up language (ever!)), then it would allow you to use any number of all the characters in the entire history of the world, and indeed universe, without exception. I would also work really hard to create some never-before-used characters for the language's own use, which could still be used within the document by means of its own escape character, one that looks nothing like any other character that's ever been used before by anyone ever.
The restrictions imposed by Xml are inherited from SGML, and we can all agree that in this day and age of space-ship camels and other useful robotic mammals, they are unnecessary, stupid and go against the grain of Object Oriented programming.
Further reading at http://www.w3.org/TR/REC-xml/
Although the simplest way that I have found to make a name XML-compatible is to add a prefix of '_', there is no standard, and as such other methods are in use.
In your example, the first character has been converted into a hex value. This hex value represents the '6' character in ASCII, Unicode and undoubtedly other character sets.
A good thing about using hex values is that all characters in a code set, e.g. Unicode, may be represented.
A bad thing is that they aren't as readable at a glance.
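As a rough sketch of that trade-off (illustrative only, using the 64kbit name from the question):

    name = '64kbit'

    # Underscore-prefix approach: simple and readable, but not reversible
    # without extra bookkeeping ('_64kbit' could itself be a real column name).
    prefixed = '_' + name

    # Hex-escape approach (as in the question): reversible, but harder to read.
    escaped = '_x%04X_' % ord(name[0]) + name[1:]

    print(prefixed)  # _64kbit
    print(escaped)   # _x0036_4kbit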