Question
In what 8-bit ASCII-like character set for English is 0x9d meaningful?
I'm cleaning up some old data files, and occasionally finding a 0x9d in otherwise-ASCII text. (No, it's not UTF-8.)
It's not valid in Windows-1252. The Python "latin-1" codec translates it to Unicode U+009D, which is the control character "Operating System Command". That makes little sense. Rendered as Unicode, you get a box with [009d]. (In Python, you can decode any byte sequence as Latin-1 without errors being raised, but that doesn't mean it's meaningful to do so.)
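The behaviour described above can be checked with a minimal sketch using only the standard library. The variable names (raw, as_latin1, cp1252_ok, utf8_ok) are made up for illustration and are not from the original data.

```python
import unicodedata

raw = b"\x9d"

# Latin-1 maps every byte to the code point with the same number,
# so decoding never fails, whether or not the result is meaningful.
as_latin1 = raw.decode("latin-1")
print(hex(ord(as_latin1)))               # 0x9d
print(unicodedata.category(as_latin1))   # Cc (a control character)

# Python's cp1252 codec leaves 0x9D undefined and raises on it.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# A lone 0x9D is a UTF-8 continuation byte, so it is not valid UTF-8 either.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(cp1252_ok, utf8_ok)
```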
Examples, with Python-type escapes, from a messy database I'm cleaning up that combines text from many sources:
Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"
for example \\"I\\\'ve seen the bull run in Pamplona, Spain\x9d.\\" Everything
Netwise Depot is a \\"One Stop Web Shop\\"\x9d that provides sustainable \\"green\\"\x9d living
are looking for a \\"Do It for Me\\"\x9d solution
From the context, I'd suspect ™ or ®. But what 8-bit code had those?
Answer 1:
Here's a completely wild hypothesis:
Some prior (really broken) system working on this data attempted to write each character as UTF-8, but actually only wrote the last byte of each sequence (maybe it had a weird one-byte-long buffer somewhere). Alternatively, it was in UTF-8 in the past, but somebody viewing it in a different encoding did a search-and-replace to remove bytes 0xE2 0x80 because they clearly "didn't belong" and didn't realize that the remaining "special character" wasn't the one they wanted either.
ASCII would, of course, be passed through unchanged, since its UTF-8 encoding is one byte long.
The 'RIGHT SINGLE QUOTATION MARK' (U+2019) ’ is encoded in UTF-8 with bytes 0xE2 0x80 0x99. The places where you have \x99s are what made me go down this path, since the apostrophe before an s would often be translated to a right curly quotation mark in popular word processing software. If only the last byte of the character was saved, you'd just have the 0x99 there.
The 'RIGHT DOUBLE QUOTATION MARK' (U+201D) ” is encoded in UTF-8 with bytes 0xE2 0x80 0x9D. The 0x9D that you have in your text is often at the end of a double-quoted string. And, it's often right next to a regular straight " double-quote. I wonder if somebody had tried to do some sort of prior clean-up pass on the data, and managed to put back in the closing quote, but left the "weird" 0x9D in there.
As I said, it's a wild hypothesis, but if this is a conglomeration of data from a variety of old systems, it's hard to know what exactly may have happened to it. The last byte of UTF-8 was just the closest "normal" English encoding I could find that would have something reasonable in English text and included the bytes you were looking for.
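To make the hypothesis concrete, here is a rough sketch of what undoing that damage might look like. It assumes every stray byte in the range 0x80 to 0xBF is the surviving last byte of a three-byte sequence that began with 0xE2 0x80, which is only a guess about this data. The helper name restore_truncated is hypothetical.

```python
RIGHT_SINGLE = "\u2019"   # right single quotation mark
RIGHT_DOUBLE = "\u201d"   # right double quotation mark

# Both characters share the UTF-8 prefix E2 80 and differ only in the last byte.
print(RIGHT_SINGLE.encode("utf-8"))   # b'\xe2\x80\x99'
print(RIGHT_DOUBLE.encode("utf-8"))   # b'\xe2\x80\x9d'


def restore_truncated(data: bytes) -> str:
    """Re-insert an assumed E2 80 prefix before each stray 0x80..0xBF byte.

    Hypothetical repair: it only makes sense if the lost prefix really was
    E2 80 (code points U+2000..U+203F). Any other non-ASCII byte is copied
    as-is and will make the final UTF-8 decode raise, flagging the record.
    """
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0xBF:
            out += b"\xe2\x80"
        out.append(b)
    return out.decode("utf-8")


sample = b"Doppleganger\x99s The Lounge\x9d"
print(restore_truncated(sample))
```

On the sample fragment this yields curly quotes in the places the hypothesis predicts. Whether that is the right repair for the whole database is a separate judgment call, since the same bytes could have come from a different source.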
Answer 2:
In Windows-1256, used for Arabic locales, \x99 is a trademark sign and \x9d is a zero width non-joiner. That would seem to be plausible in the listed positions, though likely redundant. There's certainly no shortage of character sets to try though.
One tool to attempt the guess automatically is chardet.
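Since chardet is a third-party package, a manual survey with the standard library may be a useful complement. The sketch below decodes a single byte under a handful of codecs and reports the resulting code point. The CANDIDATES list and the survey helper are illustrative choices, not an exhaustive or recommended set.

```python
# Illustrative shortlist of 8-bit codecs shipped with Python.
CANDIDATES = ["cp1252", "latin-1", "cp1256", "cp850", "cp437", "mac_roman"]


def survey(byte_value: int, codecs=CANDIDATES) -> dict:
    """Return {codec name: decoded character, or None if undefined}."""
    results = {}
    for name in codecs:
        try:
            results[name] = bytes([byte_value]).decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


for b in (0x99, 0x9D):
    print(f"byte 0x{b:02X}")
    for name, ch in survey(b).items():
        # Show the code point rather than the glyph, since some results
        # (such as a zero width non-joiner) are invisible when printed.
        shown = "undefined" if ch is None else f"U+{ord(ch):04X}"
        print(f"  {name:10s} {shown}")
```

A survey like this only shows what each codec would produce. Deciding which result is plausible still depends on the surrounding text.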
Answer 3:
Maybe the data comes from a DOS file (CP850).
In my experience, in that case the character 0x9D (Ø in CP850) was used as a "diameter" sign when referring to pipes or tubes.
Answer 4:
I'm going to close this out, because, after asking in several places, it's clear that there's no common extended ASCII 8-bit data encoding that uses 0x9D in a way that makes sense here.
This may be the result of long-ago munging on the data. There are other Stack Overflow questions about Python charset conversions failing on 0x9D specifically, so it's not unique to this data. Somewhere, there's something that sticks in a 0x9D once in a while, usually after quotes. Maybe some old word processor. Thanks, everyone.
Source: https://stackoverflow.com/questions/45749093/in-what-8-bit-character-set-is-0x9d-meaningful