Python minidom and UTF-8 encoded XML with hash references

混江龙づ霸主 提交于 2019-12-10 11:19:40

问题


I am experiencing some difficulty in my home project where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the danish letters "æøå".

gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special chatacters in raw format (ie. bytes C3A6 for the special character "æ") it sends what I think is called character hash references (ie. æ).

I don't completely understand why gSOAP does it this way as I can see that it has marked the incomming payload as being UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is besides the question (I think).

Anyway I guess gSOAP probably is obeying transport rules, or what?

When I parse the request from gSOAP in python with xml.dom.minidom.parseString() I get element values as unicode objects which is fine, but the character hash references are not decoded as UTF-8 character codes. It unescapes the character hash references, but does not decode the string afterwards. In the end I have a unicode string object with UTF-8 encoding:

So if the string "æble" is contained in the XML, it comes like this in the request:

"æble"

After parsing the XML the unicode string in the DOM Text Node's data member looks like this:

u'\xc3\xa6ble'

I would expect it to look like this:

u'\xe6ble'

What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?

Thanks in advance.

Best regards Jakob Simon-Gaarde


回答1:


Here's how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html

However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ...

Your example character is LATIN SMALL LIGATURE AE (U+00E6). As you say, encoded in UTF-8, this is \xc3\xa6. 0xc3 == 195 and 0xa6 == 166. 0xe6 == 230. Escaping your character should produce 'æ', not 'æ'.

However it appears that it is encoding to UTF-8 first and then doing the escaping.

What you need to do is to show us in fine detail the code that you are using together with diagnostic prints (using the repr() function so that we can see the type and unambiguously-represented contents) of each str and unicode object involved in the process. Also provide the docs for the gSOAP API(s) that you are using.

On the receiving end, please show us the repr() of the raw XML that you receive.

Edit in response to this comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""

It (and any other XML parser) {does not, cannot in generality, and must not} unescape numerical character references or predefined character entities BEFORE decoding.

(1) unescaping "&#60;" to "<" would blow up

(2) what would you unescape "&#256" to? "\xc4\x80"?

(3) how could it unescape at all if the encoding was UTF-16xx?




回答2:


&#195;&#166;ble is actually æble.

To get the expected Unicode string u'\xe6ble' after parsing, the string in the request should be &#230;ble.




回答3:


Some more detail about my problem. The project I am creating uses wsgi. The SOAP request is extracted using environ['wsgi.input'].read(). It always seems to return a raw string. I created a function that unescapes the character hashes:

def unescape_hash_char(req):
  pat = re.compile('&#(\d+);',re.M)
  parts = pat.split(req)
  a=0
  ret = ''
  for p in parts:
    if a%2:
      n = chr(int(p))
    else:
      n = p
    ret += n
    a+=1
  return ret

After doing this I parse the XML and I get the expected reslut.

Still I would like to know what you think, and if it is a good solution. Also I wrote the function because I couldn't find a function to do the job in the standard python modules, does such a function exist?

Best regards Jakob Simon-Gaarde




回答4:


Note that

In [5]: 'æ'.encode('utf-8')
Out[5]: '\xc3\xa6'

So we have is the unicode object u'\xc3\xa6' and we really want the string object'\xc3\xa6'. This transformation can be performed with the raw-unicode-escape codec:

In [1]: text=u'\xc3\xa6'
In [2]: text.encode('raw-unicode-escape')
Out[2]: '\xc3\xa6ble'

In [3]: text.encode('raw-unicode-escape').decode('utf-8')
Out[3]: u'\xe6'

In [4]: print(text.encode('raw-unicode-escape').decode('utf-8'))
æ



回答5:


Unless someone can tell me that gSOAP is not producing valid encoded SOAP XML: (see http://pastebin.com/raw.php?i=9NS7vCMB or the codeblock below) I see no other solution than to unescape character hash references before parsing the XML.

Of course as John Machin has pointed out, I cannot unescape XML control characters like "<" and ">".

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns1="urn:ShopService"><SOAP-ENV:Body SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><ns1:createCompany><company-code>DK-123</company-code><name>&#195;&#166;ble</name></ns1:createCompany></SOAP-ENV:Body></SOAP-ENV:Envelope>

/ Jakob



来源:https://stackoverflow.com/questions/4663250/python-minidom-and-utf-8-encoded-xml-with-hash-references

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!