Python regex replacing \u2022

死守一世寂寞 2021-01-26 09:53

This is my string:

raw_list = u'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-grow


        
4 Answers
  • 2021-01-26 10:20

    The key is to put the unicode prefix u in front of the literal that contains the character you're trying to find, in this case \u2022, the Unicode code point for a bullet. In Python 2, text that contains Unicode characters is a unicode object rather than a plain (byte) string; you can confirm this by calling type() on it or by looking for the u prefix in its repr. In the example below, I search for a Unicode bullet character using regular expressions (RegEx) on both a plain string and unicode text:

    Import the regular expressions package:
    import re
    
    Unicode text:
    my_unicode = u"""\u2022 Here\'s a string of data.\n<br/>\u2022 There are new 
    line characters \n, HTML line break tags <br/>, and bullets \u2022 together in 
    a sequence.\n<br/>\u2022 Our goal is to use RegEx to identify the sequences."""
    
    type(my_unicode)  # <type 'unicode'>
    
    Plain (byte) string:
    my_string = """\u2022 Here\'s a string of data. \n<br/>\u2022There are new 
    line characters \n, HTML line break tags <br/>, and bullets \u2022 together in 
    a sequence.\n<br/>\u2022 Our goal is to use RegEx to identify the sequences."""
    
    type(my_string)  # <type 'str'>
    
    We successfully find the first piece of text we're looking for, which doesn't yet contain the Unicode character:
    re.findall('\n<br/>', my_unicode)
    
    re.findall('\n<br/>', my_string)
    
    With the addition of the Unicode character, neither substring can be found:
    re.findall('\n<br/>\u2022', my_unicode)
    
    re.findall('\n<br/>\u2022', my_string)
    
    Adding four backslashes works for the string, but it does not work for the unicode text:
    re.findall('\n<br/>\\\\u', my_unicode)
    
    re.findall('\n<br/>\\\\u', my_string)
    
    Solution: Include the unicode u in front of the unicode character:
    re.findall('\n<br/>' u'\u2022', my_unicode)
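
    For readers on Python 3, where every str is already Unicode text and the u prefix is optional, the same search can be sketched as follows. The sample text is an assumption modeled on the example above, not the asker's data:

```python
import re

# Sample text modeled on the answer's example; in Python 3 every str is Unicode.
text = (
    "\u2022 Here's a string of data.\n<br/>"
    "\u2022 There are newline characters, HTML line break tags, and bullets.\n<br/>"
    "\u2022 Our goal is to use RegEx to identify the sequences."
)

# A non-raw literal lets Python turn \u2022 into the bullet before re sees it.
pattern = "\n<br/>\u2022"
matches = re.findall(pattern, text)

print(len(matches))  # number of newline + <br/> + bullet sequences found
```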
    
  • 2021-01-26 10:24

    You're using a raw string (the r prefix). That tells Python to take backslashes literally instead of processing escape sequences (such as \n).

    >>> r'\u2022'
    '\\u2022'
    

    The repr shows a doubled backslash, which means the string contains a literal backslash followed by u2022 rather than the bullet character. Use u'\u2022' instead and it will work.
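
    A quick way to see the difference is to compare lengths. This is a minimal sketch written for Python 3, where plain literals behave like Python 2's u'' literals; the sample string is made up for illustration:

```python
import re

raw = r'\u2022'     # six characters: backslash, u, 2, 0, 2, 2
bullet = '\u2022'   # one character: the bullet itself

print(len(raw), len(bullet))

sample = 'one \u2022 two'

# str.replace compares characters literally, so the raw form finds nothing.
unchanged = sample.replace(raw, ' ')
replaced = sample.replace(bullet, ' ')

# The re module (Python 3.3+) parses \uXXXX itself, so the raw pattern matches here.
via_regex = re.sub(raw, ' ', sample)
```

    On Python 3.3 and later the re module interprets \uXXXX on its own, so the raw pattern does match the bullet when passed to re.sub; plain string methods such as str.replace still see six literal characters.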

    Note that since you're doing a simple replacement you can just use the str.replace method:

    x = raw_list.replace(u'\u2022', ' ')
    

    You only need a regex replace for complicated pattern matching.
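
    As a rough illustration of where that line falls (a Python 3 sketch; the sample text and variable names are hypothetical): a fixed character is a job for str.replace, while "a bullet plus whatever whitespace surrounds it" is a pattern and needs re.sub:

```python
import re

BULLET = '\u2022'
text = 'Python \u2022  Ruby\u2022Go \u2022 Rust'

# Fixed substring: str.replace is enough.
simple = text.replace(BULLET, ' ')

# Pattern: a bullet together with any surrounding whitespace becomes ", ".
collapsed = re.sub(r'\s*' + re.escape(BULLET) + r'\s*', ', ', text)

print(collapsed)
```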

  • 2021-01-26 10:26

    Unless you use a Unicode string literal, the \uhhhh escape sequence has no meaning. Not to Python, and not to the re module. Add the u prefix:

    re.sub(ur'\u2022', ' ', raw_list)
    

    Note the ur prefix: that's a raw unicode string literal, which still interprets \uhhhh Unicode escape sequences (but is otherwise identical to the standard raw string literal mode). The re module doesn't support such escape sequences itself (though it does support most other Python string escape sequences).

    Not that you need a regular expression here; a simple unicode.replace() would suffice:

    raw_list.replace(u'\u2022', u' ')
    

    or you can use unicode.translate():

    raw_list.translate({0x2022: u' '})
    
  • 2021-01-26 10:31

    My approach is to change the regex pattern so that it removes every non-ASCII character; you might try:

    re.sub(r'[^\x00-\x7F]+','',raw_list)
    

    Out[1]: u'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-growing start-ups. Specialties Languages and Frameworks: JavaScript (Nodejs, React), Android, Ruby on Rails 4, iOS (Swift) Databases: Mongodb, Postgresql, MySQL, Redis Testing Frameworks: Mocha, Rspec xxxx Others: Sphinx, MemCached, Chef.'
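
    The pattern matches any run of non-ASCII characters, so accented letters disappear along with the bullets. A small Python 3 sketch with a made-up sample string shows the difference from a targeted replacement:

```python
import re

sample = 'caf\u00e9 \u2022 na\u00efve \u2022 r\u00e9sum\u00e9'

# Broad pattern from the answer: every run of non-ASCII characters is removed.
ascii_only = re.sub(r'[^\x00-\x7F]+', '', sample)

# Narrow alternative: only the bullet is touched.
bullets_only = sample.replace('\u2022', '')

print(ascii_only)
print(bullets_only)
```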
