How to keep only printable characters in a file using Bash (Linux) or Python?

挽巷 2021-01-18 23:45

I have a file that contains non-printable characters, and I want to convert it so that it contains only printable characters. I think this problem is related to ASCII control codes, but I could not

4 Answers
  • 2021-01-18 23:55

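    See the built-in string module.

    import string
    # text is the string to clean; keep only the characters found in string.printable
    printable_str = ''.join(ch for ch in text if ch in string.printable)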
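Since the question is about a file rather than a string already in memory, the same idea can be applied to the file's bytes. This is only a minimal sketch and not part of the original answer: the names `keep_printable` and `filter_file` are made up for illustration, and it assumes the file is small enough to read into memory in one go.

```python
import string

# Byte values of every character in string.printable
# (ASCII letters, digits, punctuation and whitespace).
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))


def keep_printable(data: bytes) -> bytes:
    """Return data with every byte outside string.printable removed."""
    return bytes(b for b in data if b in PRINTABLE_BYTES)


def filter_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, dropping non-printable bytes.

    Hypothetical helper: reads the whole file into memory.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(keep_printable(data))
```

Note that this removes only the non-printable bytes themselves. An escape sequence such as `\x1b[16D` loses its leading escape byte, but the printable remainder `[16D` stays in the output.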
    
  • 2021-01-18 23:56

    You can try this sed command to remove all non-printable characters from a file (-i.bak edits the file in place and keeps the original as file.bak):

    sed -i.bak 's/[^[:print:]]//g' file
    
  • 2021-01-19 00:04

    The hexdump shows that the dot in .[16D is actually an escape character, \x1b.
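    Esc[nD is the ANSI escape code that moves the cursor back (left) by n columns; it does not delete anything by itself. So Esc[16D tells the terminal to move the cursor 16 columns to the left, and the spaces that follow overwrite the text that was there, which explains the cat output.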

    There are various ways to remove ANSI escape codes from a file, either using Bash commands (e.g. using sed, as in Anubhava's answer) or Python.
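As one possible Python variant of that approach, a regular expression can strip whole escape sequences instead of just the escape byte. The sketch below is an illustration, not code from the original answer: `CSI_RE` and `strip_csi` are made-up names, and the pattern covers only CSI sequences (those starting with Esc[), not every kind of terminal control code.

```python
import re

# A CSI sequence is ESC [ followed by optional parameter bytes (0x30-0x3F),
# optional intermediate bytes (0x20-0x2F) and one final byte (0x40-0x7E).
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text: str) -> str:
    """Remove every CSI escape sequence from text."""
    return CSI_RE.sub("", text)


# The data from the question: text, cursor-back, 16 spaces, cursor-back, 2 spaces.
data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D  "
cleaned = strip_csi(data)
```

Stripping the sequences leaves all of the visible text in place, so `cleaned` still starts with `HELLO THIS IS THE TEST`. That differs from what a terminal shows when it acts on the sequences.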

    However, in cases like this, it may be better to run the file through a terminal emulator to interpret any existing editing control sequences in the file, so you get the result the file's author intended after they applied those editing sequences.

    One way to do that in Python is to use pyte, a Python module that implements a simple VTxxx-compatible terminal emulator. You can easily install it using pip, and its docs are on readthedocs.

    Here's a simple demo program that interprets the data given in the question. It's written for Python 2, but it's easy to adapt to Python 3. pyte is Unicode-aware, and its standard Stream class expects Unicode strings, but this example uses a ByteStream, so I can pass it a plain byte string.

    #!/usr/bin/env python
    
    ''' pyte VTxxx terminal emulator demo
    
        Interpret a byte string containing text and ANSI / VTxxx control sequences
    
        Code adapted from the demo script in the pyte tutorial at
        http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial
    
        Posted to http://stackoverflow.com/a/30571342/4014959 
    
        Written by PM 2Ring 2015.06.02
    '''
    
    import pyte
    
    
    #hex dump of data
    #00000000  48 45 4c 4c 4f 20 54 48  49 53 20 49 53 20 54 48  |HELLO THIS IS TH|
    #00000010  45 20 54 45 53 54 1b 5b  31 36 44 20 20 20 20 20  |E TEST.[16D     |
    #00000020  20 20 20 20 20 20 20 20  20 20 20 1b 5b 31 36 44  |           .[16D|
    #00000030  20 20                                             |  |
    
    data = 'HELLO THIS IS THE TEST\x1b[16D                \x1b[16D  '
    
    #Create a default sized screen that tracks changed lines
    screen = pyte.DiffScreen(80, 24)
    screen.dirty.clear()
    stream = pyte.ByteStream()
    stream.attach(screen)
    stream.feed(data)
    
    #Get index of last line containing text
    last = max(screen.dirty)
    
    #Gather lines, stripping trailing whitespace
    lines = [screen.display[i].rstrip() for i in range(last + 1)]
    
    print '\n'.join(lines)
    

    output

    HELLO
    

    hex dump of output

    00000000  48 45 4c 4c 4f 0a                                 |HELLO.|
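To make the overwrite behaviour concrete without installing anything, here is a toy sketch in plain Python 3. It is not pyte and not a real terminal emulator: `render_line` is a made-up function that models a single line and understands only printable characters and the Esc[nD (cursor back) sequence, which happens to be all that the data in the question uses.

```python
import re

# One token is either a "cursor back" sequence (ESC [ n D) or a single character.
TOKEN_RE = re.compile(r"\x1b\[(\d*)D|(.)", re.DOTALL)


def render_line(text: str) -> str:
    """Replay text on one imaginary terminal line and return what is visible.

    Toy model: understands only printable characters and ESC[nD.
    """
    cells = []   # characters currently shown on the line
    cursor = 0   # column where the next character will be written
    for match in TOKEN_RE.finditer(text):
        count, char = match.groups()
        if char is None:
            # ESC[nD: move the cursor left n columns (a missing or zero n
            # counts as 1), but never past the first column.
            steps = max(1, int(count or 1))
            cursor = max(0, cursor - steps)
        elif char.isprintable():
            # Writing a character overwrites whatever was in that cell.
            if cursor < len(cells):
                cells[cursor] = char
            else:
                cells.append(char)
            cursor += 1
        # Any other non-printable character is silently dropped.
    return "".join(cells).rstrip()


data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D  "
result = render_line(data)
```

For the data from the question, `result` is `HELLO`: the first Esc[16D moves the cursor back to just after `HELLO`, and the spaces that follow overwrite the rest of the line.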
    
  • 2021-01-19 00:14

    A minimalistic solution that comes to mind is

    import string
    # In Python 3, filter() returns an iterator, so join the result back into a string
    printable_string = ''.join(filter(lambda x: x in string.printable, your_string))
    ## TODO: substitute your string in the place of "your_string"
    

    If this still doesn't help, also try the [curses.ascii] module, which provides ASCII character classification functions such as isprint().
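One caveat: string.printable contains only ASCII characters, so the filter above also drops letters such as é or ö. If the text is Unicode, a possible alternative in Python 3 is the built-in str.isprintable() method. The sketch below is an assumption-laden illustration rather than part of the original answer: `keep_printable_unicode` and its `keep` parameter are made-up names, and newline and tab are kept explicitly because isprintable() reports them as non-printable.

```python
def keep_printable_unicode(text: str, keep: str = "\n\t") -> str:
    """Keep characters that str.isprintable() accepts, plus those in keep."""
    return "".join(ch for ch in text if ch.isprintable() or ch in keep)


sample = "héllo\x1b[16D wörld\n\x00"
cleaned = keep_printable_unicode(sample)
```

As with the ASCII filter, only the individual non-printable characters are removed, so the printable tail of an escape sequence (here `[16D`) remains in `cleaned`.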
