Generate random UTF-8 string in Python

清酒与你 2020-12-09 08:38

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module?

8 Answers
  • 2020-12-09 08:56

    Here is an example function that should produce a random well-formed UTF-8 sequence, as defined in Table 3–7 of the Unicode Standard 5.0.0:

    #!/usr/bin/env python3.1
    
    # From Table 3–7 of the Unicode Standard 5.0.0
    
    import random
    
    def byte_range(first, last):
        return list(range(first, last+1))
    
    first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
    trailing_values = byte_range(0x80, 0xBF)
    
    def random_utf8_seq():
        first = random.choice(first_values)
        if first <= 0x7F:
            return bytes([first])
        elif first <= 0xDF:
            return bytes([first, random.choice(trailing_values)])
        elif first == 0xE0:
            return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
        elif first == 0xED:
            return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
        elif first <= 0xEF:
            return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
        elif first == 0xF0:
            return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
        elif first <= 0xF3:
            return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
        elif first == 0xF4:
            return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])
    
    print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))
    

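    Because the Unicode standard is so vast, I cannot test this thoroughly. Also note that the resulting characters are not uniformly distributed: each byte is chosen uniformly from its allowed range, so a character with a short encoding is far more likely to appear than one with a long encoding.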
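One way to gain some confidence without decoding every possible output is to count the sequences that the table allows and compare that with the number of Unicode scalar values. The sketch below is an illustration only: `TABLE_3_7`, `count_sequences`, and `matches_table` are hypothetical names, and the rows are transcribed from the byte ranges used in the function above.

```python
# Rows of Table 3-7 (well-formed UTF-8 byte sequences). Each row lists the
# inclusive (low, high) range allowed at each byte position.
TABLE_3_7 = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]

# Every code point U+0000..U+10FFFF except the 2048 surrogates.
SCALAR_VALUES = 0x110000 - 0x800


def count_sequences(table):
    """Count the distinct byte sequences the table permits."""
    total = 0
    for row in table:
        n = 1
        for low, high in row:
            n *= high - low + 1
        total += n
    return total


def matches_table(seq, table=TABLE_3_7):
    """Return True if the byte sequence fits exactly one row shape of the table."""
    return any(
        len(seq) == len(row)
        and all(low <= b <= high for b, (low, high) in zip(seq, row))
        for row in table
    )


if __name__ == "__main__":
    print(count_sequences(TABLE_3_7) == SCALAR_VALUES)
```

If the two counts agree, the table admits exactly one byte sequence per scalar value, which suggests the ranges were transcribed correctly. It does not prove that the generator function itself has no bugs.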

  • 2020-12-09 08:58

    There is a UTF-8 stress test from Markus Kuhn you could use.

    See also Really Good, Bad UTF-8 example test data.

  • 2020-12-09 08:58

    It depends on how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for custom characters (the private-use areas), and the surrogate range U+D800 .. U+DFFF cannot appear in well-formed UTF-8 at all. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

    The basic approach I'd adopt is to randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character? can it appear here in the string?), and encode the valid code points in UTF-8.
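A minimal sketch of that generate, validate, encode loop, using only the standard library. The function names and the specific validity rules (reject surrogates, private-use, unassigned, and control code points; no combining mark in first position) are assumptions chosen for illustration, and which code points count as assigned depends on the Unicode database bundled with your Python version.

```python
import random
import unicodedata

# General categories rejected outright (an assumption for this sketch):
# Cs = surrogate, Co = private use, Cn = unassigned, Cc = control.
REJECTED_CATEGORIES = {"Cs", "Co", "Cn", "Cc"}


def random_valid_char(rng=random):
    """Draw random code points until one passes the category filter."""
    while True:
        ch = chr(rng.randrange(0x110000))
        if unicodedata.category(ch) not in REJECTED_CATEGORIES:
            return ch


def random_text(length, rng=random):
    """Build a string of `length` code points with a simple context check."""
    chars = []
    while len(chars) < length:
        ch = random_valid_char(rng)
        # Context check: a combining mark (category M*) may not start the string.
        if not chars and unicodedata.category(ch).startswith("M"):
            continue
        chars.append(ch)
    return "".join(chars)


if __name__ == "__main__":
    text = random_text(10)
    print(text.encode("utf-8"))
```

Rejection sampling keeps the code short at the cost of discarding most draws, since the majority of the 21-bit range is unassigned or reserved.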

    If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.
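One such simpler scheme, sketched under the assumption that uniformly random bytes are good enough as malformed input and that Python's own strict decoder can serve as the reference for what counts as valid. The helper names are hypothetical.

```python
import random


def random_bytes(length, rng=random):
    """Return `length` uniformly random bytes; most such strings are not valid UTF-8."""
    return bytes(rng.randrange(256) for _ in range(length))


def is_valid_utf8(data):
    """Use the strict built-in decoder as the oracle for well-formedness."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = random_bytes(16)
    print(sample, is_valid_utf8(sample))
```

Pairing each input with the oracle's verdict gives you a known expected outcome to compare your own code against.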

    Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.

  • 2020-12-09 08:59

    You could download a website written in Greek or German that uses Unicode and feed that to your code.

  • 2020-12-09 09:09

    People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, extend the include_ranges list in the example with the code point ranges that you want.

    import random
    
    def get_random_unicode(length):
    
        try:
            get_char = unichr
        except NameError:
            get_char = chr
    
        # Update this to include code point ranges to be sampled
        include_ranges = [
            ( 0x0021, 0x0021 ),
            ( 0x0023, 0x0026 ),
            ( 0x0028, 0x007E ),
            ( 0x00A1, 0x00AC ),
            ( 0x00AE, 0x00FF ),
            ( 0x0100, 0x017F ),
            ( 0x0180, 0x024F ),
            ( 0x2C60, 0x2C7F ),
            ( 0x16A0, 0x16F0 ),
            ( 0x0370, 0x0377 ),
            ( 0x037A, 0x037E ),
            ( 0x0384, 0x038A ),
            ( 0x038C, 0x038C ),
        ]
    
        alphabet = [
            get_char(code_point) for current_range in include_ranges
                for code_point in range(current_range[0], current_range[1] + 1)
        ]
        return ''.join(random.choice(alphabet) for i in range(length))
    
    if __name__ == '__main__':
        print('A random string: ' + get_random_unicode(10))
    
  • 2020-12-09 09:09

    Answering revised question:

    Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?
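To see the point concretely, here is a small sketch that assumes "control characters" means Unicode general category Cc. The `strip_controls` helper and its `keep` parameter are hypothetical, added only to show how CR, LF, and TAB could be kept if that is what you want.

```python
import unicodedata

# TAB, LF, and CR all report general category "Cc", so a strict
# control-character filter removes them; a plain space is "Zs".
for name, ch in [("TAB", "\t"), ("LF", "\n"), ("CR", "\r"), ("SPACE", " ")]:
    print(name, unicodedata.category(ch))


def strip_controls(text, keep="\t\n\r"):
    """Drop category-Cc characters except those explicitly listed in `keep`."""
    return "".join(
        c for c in text if c in keep or unicodedata.category(c) != "Cc"
    )
```

Passing `keep=""` gives the strict behaviour described above.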

    Please consider responding to my earlier invitation to tell us what you are really trying to do.
