Question
In a previous question I was told that Google passes UTF-8 encoded responses to queries. This solved a problem with non-breaking spaces (A0) being muddled after being passed by curl to my terminal. This was solved by piping the curl output to iconv and converting to UTF-8. However, even with this solution in place, I am still getting some strange output.
Consider the following conversion of 2 m to feet:
http://www.google.com/ig/calculator?hl=en&q=2%20m%20in%20feet
This is the output I'm seeing in my browser and elsewhere:
{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches)",error: "",icc: false}
The expected output is:
{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6 47/64 inches)",error: "",icc: false}
I could just do a text replace using regular expressions or some other solution, but I would like to know what's happening here. Any insight?
I am running Mac OS X Mountain Lion 10.8.2
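Answer 1:
Google Calculator, as accessed via curl, returns a JSON-like object literal. The \xHH notation Google uses is a JavaScript string escape; strict JSON only defines \uHHHH, and it also requires quoted keys, so a strict JSON parser will reject this response as-is. If the output were being sent to a browser (or anything else that evaluates JavaScript and parses HTML) instead of standard output, only a good, lenient JSON decoder would be necessary.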
Let's see what we can do from the command line to parse the JSON.
echo -en $(curl -s 'http://www.google.com/ig/calculator?hl=en&q=4^22') > ~/temp.html
This gets us valid HTML which we can view via a browser, but we need to reduce everything to something that can display via standard output.
echo -en "$(curl -s --connect-timeout 10 "http://www.google.com/ig/calculator?hl=en&q=2%20m%20in%20feet")" | sed -e 's/<sup>/ &/g' -e :a -e 's/<[^>]*>//g;/</N;//ba' | perl -MHTML::Entities -ne 'print decode_entities($_)' | iconv -f ISO-8859-1 -t UTF-8
For the echo command, the -e interprets escapes such as \x3c, \x3e, and \x26 (<, >, and & respectively), while the -n suppresses the newline that echo would normally add.
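The same escape decoding can be sketched in Python instead of relying on echo -e, whose handling of \xHH varies between shells. This is only an illustration: the function name decode_hex_escapes is made up for this example, and the input is the rhs fragment from the question.

```python
import re

def decode_hex_escapes(text: str) -> str:
    """Replace each \\xHH escape with the character it encodes."""
    return re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),  # e.g. "3c" -> "<"
        text,
    )

# Fragment of the rhs value from the question (raw string, so the
# backslashes stay literal, just as curl would deliver them).
raw = r"6 feet 6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches"
decoded = decode_hex_escapes(raw)
print(decoded)  # prints: 6 feet 6<sup>47</sup>&#8260;<sub>64</sub> inches
```

After this step the text is ordinary HTML markup plus numeric character references, which is what the later stages of the pipeline operate on.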
The pipe to sed adds a space before all <sup> (superscript) tags and then removes all HTML tags.
The pipe to perl then decodes all the HTML entities, such as &#8260; to ⁄ (fraction slash). See http://en.wikipedia.org/wiki/Html_special_characters#Character_entity_references_in_HTML
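For comparison, Python's standard library offers html.unescape, which plays roughly the same role as decode_entities from HTML::Entities. A minimal sketch, using entity numbers that appear in this thread:

```python
import html

# Numeric character references as they appear once the \x26 escape
# has been turned back into "&".
fragment = "47&#8260;64, 3 &#215; 10, &#189;"
decoded = html.unescape(fragment)
print(decoded)  # 47⁄64, 3 × 10, ½
```

Here &#8260; becomes U+2044 (fraction slash), &#215; becomes U+00D7 (multiplication sign), and &#189; becomes U+00BD (one half).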
The pipe to iconv converts the ISO-8859-1 output to the expected UTF-8. This is done last since the perl line can produce UTF-8 entities that will need to be properly converted.
This is still going to have issues with distinguishing between fractions and exponents (47/64 where 47 is wrapped in superscript tags and 64 is wrapped in subscript tags, and 10^13 where 13 is wrapped in superscript tags).
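One way to keep fractions and exponents apart is to rewrite the fraction pattern first and only then treat any remaining superscript as an exponent. The sketch below assumes the \xHH escapes have already been decoded and that the markup always looks like the two cases described above; flatten_markup is a hypothetical name, and other shapes of markup are not handled.

```python
import re

# <sup>47</sup> + fraction slash + <sub>64</sub>  ->  " 47/64"
FRACTION = re.compile(r"<sup>(\d+)</sup>(?:&#8260;|\u2044)<sub>(\d+)</sub>")
# Any superscript left over is assumed to be an exponent  ->  "^13"
EXPONENT = re.compile(r"<sup>(-?\d+)</sup>")

def flatten_markup(text: str) -> str:
    # Order matters: fractions go first so that their numerator is not
    # mistaken for an exponent by the second substitution.
    text = FRACTION.sub(r" \1/\2", text)
    return EXPONENT.sub(r"^\1", text)

print(flatten_markup("6 feet 6<sup>47</sup>&#8260;<sub>64</sub> inches"))
print(flatten_markup("1.0 x 10<sup>13</sup>"))
```

The first call yields "6 feet 6 47/64 inches" and the second "1.0 x 10^13".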
We could get super silly and make a really long sed line to parse all the special characters (the following is in AppleScript so you can see just how ridiculous the syntax gets):
set jsonResponse to do shell script "curl " & queryURL & " | sed -e 's/[†]/,/g' -e 's/\\\\x26#215;/*/g' -e 's/\\\\x26#188;/ 1\\/4/g' -e 's/\\\\x26#189;/ 1\\/2/g' -e 's/\\\\x26#190;/ 3\\/4/g' -e 's/\\\\x26#8539;/ 1\\/8/g' -e 's/\\\\x26#8540;/ 3\\/8/g' -e 's/\\\\x26#8541;/ 5\\/8/g' -e 's/\\\\x26#8542;/ 7\\/8/g' -e 's/\\\\x3csup\\\\x3e\\([0-9]*\\)\\\\x3c\\/sup\\\\x3e\\\\x26#8260;\\\\x3csub\\\\x3e\\([0-9]*\\)\\\\x3c\\/sub\\\\x3e/ \\1\\/\\2/g' -e 's/\\\\x3csup\\\\x3e\\([0-9]*\\)\\\\x3c\\/sup\\\\x3e/^\\1/' -e 's/( /(/g'"
The † (dagger) character is 160 in decimal within the MacRoman set (Macintosh encoding). In hexadecimal this is 0xA0 or \xA0, the same byte value that represents the non-breaking space in ISO-8859-1 (Unicode code point U+00A0), which is what Google is passing. So in AppleScript, in order to replace that non-breaking space, we have to use the † (dagger) due to the Macintosh encoding.
- http://en.wikipedia.org/wiki/Mac_Roman#Codepage_layout
- http://en.wikipedia.org/wiki/UTF-8
- http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement
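The relationship between the byte 0xA0, the dagger, and the non-breaking space can be checked with Python's built-in codecs. This is a small illustrative sketch; the last line mirrors what the iconv -f ISO-8859-1 -t UTF-8 step does to that byte.

```python
NBSP_BYTE = b"\xa0"

as_mac_roman = NBSP_BYTE.decode("mac_roman")   # how a MacRoman reader sees it
as_latin1 = NBSP_BYTE.decode("iso-8859-1")     # how an ISO-8859-1 reader sees it

print(repr(as_mac_roman))          # '†'
print(repr(as_latin1))             # '\xa0' (U+00A0, non-breaking space)
print(as_latin1.encode("utf-8"))   # b'\xc2\xa0'
```

The same byte is therefore a dagger under MacRoman and a non-breaking space under ISO-8859-1, and re-encoding the latter as UTF-8 produces the two bytes C2 A0.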
There are also several special fraction symbols that the sed line deals with: http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html#fractions
The moral of the story is that when dealing with JSON, just use a good JSON parser.
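As a rough illustration of the parser route: Python's json module rejects both \xHH escapes and unquoted keys, so this sketch normalizes the response text before handing it to json.loads. The function name parse_calculator_response is hypothetical, the key-quoting regex is a simplification that could misfire if a value contained something like ", word:", and the input is the sample response from the question rather than a live request.

```python
import json
import re

def parse_calculator_response(raw: str) -> dict:
    # Rewrite \xHH as \u00HH, which json.loads understands.
    text = re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: "\\u00" + m.group(1),
        raw,
    )
    # Quote the bare keys (lhs, rhs, error, icc in the sample).
    text = re.sub(r'([{,])\s*([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    return json.loads(text)

raw = (
    r'{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6'
    r'\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e'
    r' inches)",error: "",icc: false}'
)
data = parse_calculator_response(raw)
print(data["lhs"])
print(data["rhs"])
```

The parsed rhs value still contains the HTML markup and the &#8260; reference, so the tag and entity handling discussed earlier applies to it afterwards.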
A sub-moral is: don't use AppleScript to deal with JSON.
Answer 2:
The accepted answer to the question "Is there an official API for Google calculator?" is negative, so it seems that you just have to try to reverse-engineer its functionality. Here it seems to represent the fraction 47/64 so that the numerator 47 is inside <sup> markup and the denominator 64 is inside <sub> markup, and then the < and > have been escaped using \xnn notation, with nn being the hex code of the character. This does not seem to make much sense, since the stylistic superscripting and subscripting is pointless, doing it in HTML markup is odd, and escaping the tag delimiters is weird. The main problem, however, is that at times <sup> might mean superscripting to make an expression an exponent, so just removing such information could distort the information.
Source: https://stackoverflow.com/questions/12867450/special-characters-in-google-calculator