Filtering out certain bytes in Python

鱼传尺愫 2020-12-31 04:12

I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

This

4 answers
  • 2020-12-31 04:44

    As the answer to the linked question said, the XML standard defines a valid character as:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    

    Translating that into Python:

    def valid_xml_char_ordinal(c):
        codepoint = ord(c)
        # conditions ordered by presumed frequency
        return (
            0x20 <= codepoint <= 0xD7FF or
            codepoint in (0x9, 0xA, 0xD) or
            0xE000 <= codepoint <= 0xFFFD or
            0x10000 <= codepoint <= 0x10FFFF
            )
    

    You can then use that function however you need to, e.g.

    cleaned_string = ''.join(c for c in input_string if valid_xml_char_ordinal(c))
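
As a worked example of the filter above, here is a minimal sketch applied to a made-up input. The helper name `clean_for_xml` and the sample string are assumptions added for illustration; only `valid_xml_char_ordinal` comes from the answer.

```python
def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    # conditions ordered by presumed frequency
    return (
        0x20 <= codepoint <= 0xD7FF
        or codepoint in (0x9, 0xA, 0xD)
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def clean_for_xml(input_string):
    """Drop every character outside the XML 1.0 Char production (hypothetical helper)."""
    return ''.join(c for c in input_string if valid_xml_char_ordinal(c))


# Hypothetical input: NUL, vertical tab and U+FFFE are invalid in XML;
# the tab, the newline and the emoji are valid and should survive.
sample = 'a\x00b\x0bc\td\n\ufffe\U0001F600'
cleaned = clean_for_xml(sample)
print(repr(cleaned))
```

Note that this silently drops the offending characters; whether dropping or replacing them is appropriate depends on the data.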
    
  • 2020-12-31 04:44

    You may refer to the solution at this link:

    https://mailman-mail5.webfaction.com/pipermail/lxml/2011-July/006090.html

    That solution works for me. You may also want to consider John Machin's solution.

    Good luck!

  • 2020-12-31 04:51

    Another approach that's much faster than the answer above is to use regular expressions, like so:

    re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
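
If the substitution runs many times, one option is to compile the pattern once and wrap it in a function. The sketch below assumes Python 3, where every build handles the full Unicode range; the names `INVALID_XML_CHARS` and `strip_invalid_xml_chars` are made up for illustration.

```python
import re

# Negated character class: matches runs of anything outside the XML Char ranges.
INVALID_XML_CHARS = re.compile(
    '[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)


def strip_invalid_xml_chars(text):
    """Remove every run of characters that XML 1.0 does not allow (hypothetical helper)."""
    return INVALID_XML_CHARS.sub('', text)


# NUL, backspace and U+FFFF are removed; the space and the emoji are kept.
print(repr(strip_invalid_xml_chars('ok\x00\x08 \U0001F600\uffff')))
```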
    

    Compared with the answer above, it is more than 10 times faster in my testing:

    import timeit

    # Statement timed for the function-based approach.
    # r.text is the decoded body, so ord() receives one character at a time.
    func_test = """
    def valid_xml_char_ordinal(c):
        codepoint = ord(c)
        # conditions ordered by presumed frequency
        return (
            0x20 <= codepoint <= 0xD7FF or
            codepoint in (0x9, 0xA, 0xD) or
            0xE000 <= codepoint <= 0xFFFD or
            0x10000 <= codepoint <= 0x10FFFF
        )
    ''.join(c for c in r.text if valid_xml_char_ordinal(c))
    """

    func_setup = """
    import requests
    r = requests.get("https://stackoverflow.com/questions/8733233/")
    """

    # Raw string, so the \u escapes reach the timed statement unexpanded.
    regex_test = r"""re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', r.text)"""
    regex_setup = """
    import re
    import requests
    r = requests.get("https://stackoverflow.com/questions/8733233/")
    """

    func_timer = timeit.Timer(func_test, setup=func_setup)
    regex_timer = timeit.Timer(regex_test, setup=regex_setup)

    # Run each statement 100 times and print the total seconds.
    print(func_timer.timeit(100))
    print(regex_timer.timeit(100))
    

    Output:

    > 2.63773989677
    > 0.221401929855
    

    To make sense of that: the benchmark downloads this webpage once (the page you're currently reading), then runs the function-based technique and the regex technique over its contents 100 times each.

    The function-based method takes about 2.6 seconds.
    The regex method takes about 0.2 seconds.
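
For a comparison that does not depend on the network, the same measurement can be sketched against a synthetic string. This assumes Python 3; the sample text, the repeat count, and the helper names `clean_with_function` and `clean_with_regex` are made up for illustration, and the absolute timings will differ from the figures quoted above.

```python
import re
import timeit

PATTERN = re.compile(
    '[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)


def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    return (
        0x20 <= codepoint <= 0xD7FF
        or codepoint in (0x9, 0xA, 0xD)
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def clean_with_function(text):
    return ''.join(c for c in text if valid_xml_char_ordinal(c))


def clean_with_regex(text):
    return PATTERN.sub('', text)


# Synthetic stand-in for the downloaded page: mostly valid text with a
# NUL and a vertical tab mixed into every repetition.
sample = '<p>hello world</p>\x00\x0b\n' * 2000

# Both techniques should agree before their speed is compared.
assert clean_with_function(sample) == clean_with_regex(sample)

func_seconds = timeit.timeit(lambda: clean_with_function(sample), number=20)
regex_seconds = timeit.timeit(lambda: clean_with_regex(sample), number=20)
print('function: %.4f s' % func_seconds)
print('regex:    %.4f s' % regex_seconds)
```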


    Update: As identified in the comments, the regex in this answer previously deleted some characters that should have been allowed in XML. These include everything in the Supplementary Multilingual Plane, which includes ancient scripts like cuneiform, hieroglyphics, and (weirdly) emojis.

    The correct regex is now above. A quick way to check it in the future is re.DEBUG, which prints:

    In [52]: re.compile(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', re.DEBUG)
    max_repeat 1 4294967295
      in
        negate None
        range (32, 55295)
        literal 9
        literal 10
        literal 13
        range (57344, 65533)
        range (65536, 1114111)
    Out[52]: re.compile(ur'[^ -\ud7ff\t\n\r\ue000-\ufffd\U00010000-\U0010ffff]+', re.DEBUG)
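
Besides reading the re.DEBUG dump, the pattern can be checked against code points at the edges of each range in the Char production. The sketch below assumes Python 3; the `BOUNDARY_CASES` list and the `survives` helper are made up for illustration.

```python
import re

PATTERN = re.compile(
    '[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)

# (code point, should it survive?) pairs taken from the edges of the Char production.
BOUNDARY_CASES = [
    (0x00, False), (0x08, False), (0x09, True), (0x0A, True),
    (0x0B, False), (0x0D, True), (0x1F, False), (0x20, True),
    (0xD7FF, True), (0xD800, False), (0xDFFF, False), (0xE000, True),
    (0xFFFD, True), (0xFFFE, False), (0xFFFF, False),
    (0x10000, True), (0x1F600, True), (0x10FFFF, True),
]


def survives(codepoint):
    """Return True if the pattern leaves this single character in place."""
    return PATTERN.sub('', chr(codepoint)) != ''


# Collect every code point where the regex disagrees with the expectation.
mismatches = [hex(cp) for cp, expected in BOUNDARY_CASES if survives(cp) != expected]
print(mismatches or 'all boundary cases behave as expected')
```

A check like this would have caught the earlier version of the regex, since U+1F600 (an emoji in the Supplementary Multilingual Plane) is one of the cases.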
    

    My apologies for the error. I can only offer that I found this answer elsewhere and put it in here. It was somebody else's error, but I propagated it. My sincere apologies to anybody this affected.

    Update 2, 2017-12-12: I've learned from some OSX users that this code won't work on so-called narrow builds of Python, which apparently OSX sometimes has. You can check this by running import sys; sys.maxunicode. If it prints 65535, the code here won't work until you install a "wide build". See more about this here.
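
The narrow-build check described above can be sketched as a small function. The name `is_wide_build` is made up for illustration. On Python 3.3 and later, sys.maxunicode is always 1114111, so the narrow/wide distinction only matters on older interpreters.

```python
import sys


def is_wide_build():
    """Return True if this interpreter can hold code points above U+FFFF in one unit."""
    # A narrow build reports 65535 (0xFFFF); a wide build reports 1114111 (0x10FFFF).
    return sys.maxunicode > 0xFFFF


print(sys.maxunicode, is_wide_build())
```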

  • 2020-12-31 05:02

    I think this is harsh and overkill, and it seems painfully slow, but my program is still quick. After struggling to understand what was going wrong (even after I tried @John's cleaned_string implementation), I adapted his answer to replace every non-printable ASCII character with '?', using the following (Python 2.7):

    from curses import ascii

    def clean(text):
        # Keep printable ASCII (codes 32-126); replace everything else with '?'.
        return str(''.join(
            c if ascii.isprint(c) else '?' for c in text
        ))
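
Where the curses module is not available, a similar replacement can be sketched with a plain range check. This is an assumption-laden variant, not the answer's code: the name `clean_portable` is made up, and "printable" is taken to mean code points 32 through 126, which is the range curses.ascii.isprint tests.

```python
def clean_portable(text, replacement='?'):
    """Replace every character outside printable ASCII (hypothetical helper)."""
    # Printable ASCII is assumed to be code points 32 through 126 inclusive.
    return ''.join(c if 32 <= ord(c) <= 126 else replacement for c in text)


# The accented letter, the NUL and the newline are each replaced by '?'.
print(clean_portable('caf\xe9\x00 ok\n'))
```

Like the original, this also replaces tabs, newlines, and all non-ASCII text, which is far stricter than what XML requires.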
    

    I'm not sure what I did wrong with the better option, but I just wanted to move on...
