This is my string:
raw_list = u'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-grow
The key is to add the unicode prefix u in front of the Unicode character that you're trying to find, in this case \u2022, which is the Unicode character for a bullet. If your text contains Unicode characters, then in Python 2 it is actually unicode text as opposed to a plain string (you can confirm this by checking its type, or by displaying its repr and looking for the u at the beginning). See the example below, where I search for a Unicode bullet character using regular expressions (RegEx) on both a string and unicode text:
import re
Unicode text:
my_unicode = u"""\u2022 Here\'s a string of data.\n<br />\u2022 There are new
line characters \n, HTML line break tags <br />, and bullets \u2022 together in
a sequence.\n<br />\u2022 Our goal is to use RegEx to identify the sequences."""
type(my_unicode)  # unicode
String:
my_string = """\u2022 Here\'s a string of data. \n<br />\u2022There are new
line characters \n, HTML line break tags <br />, and bullets \u2022 together in
a sequence.\n<br />\u2022 Our goal is to use RegEx to identify the sequences."""
type(my_string)  # str
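As a quick sanity check on the escapes used in these samples, the sketch below looks up the official names of two code points and compares a real bullet with its escaped spelling. It assumes Python 3 syntax (where every str is already Unicode), and the variable names are illustrative rather than taken from the example above.

```python
import unicodedata

bullet = "\u2022"    # one character: the bullet itself
escaped = r"\u2022"  # six characters: backslash, 'u', '2', '0', '2', '2'

# Look up the official Unicode names of the code points.
bullet_name = unicodedata.name(bullet)
lookalike_name = unicodedata.name("\u2002")  # easy to mistype for \u2022

print(bullet_name, len(bullet))
print(lookalike_name)
print(len(escaped), escaped == bullet)
```

The length difference is the whole story: a pattern written for the one-character bullet cannot match text that holds the six-character escaped spelling, and vice versa.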
We successfully find the first piece of text that we're looking for, which doesn't yet contain the Unicode character:
re.findall('\n<br />', my_unicode)
re.findall('\n<br />', my_string)
With the addition of the Unicode character, neither substring can be found:
re.findall('\n<br />\u2022', my_unicode)
re.findall('\n<br />\u2022', my_string)
Adding four backslashes works for the string, but it does not work for the unicode text:
re.findall('\n<br />\\\\u', my_unicode)
re.findall('\n<br />\\\\u', my_string)
Solution: include the unicode prefix u in front of the Unicode character:
re.findall('\n<br />' u'\u2022', my_unicode)
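For readers on Python 3, where the separate unicode type no longer exists, the same experiment can be approximated as follows. This is only a sketch under assumptions: the sample text is shortened, `<br />` stands in for the HTML line break tag, and a raw literal is used to imitate text that holds a literal backslash-u sequence instead of a real bullet. All names here are hypothetical.

```python
import re

# Python 3: a normal literal turns \u2022 into the bullet character itself.
text = "\u2022 First item.\n<br />\u2022 Second item.\n<br />\u2022 Third item."

# A raw literal keeps the backslash, imitating the Python 2 byte string.
escaped_text = r"\u2022 First item." + "\n<br />" + r"\u2022 Second item."

# Real bullet in the pattern: matches the real bullets in `text`.
bullet_hits = re.findall("\n<br />\u2022", text)

# Four backslashes in the literal become one escaped backslash in the regex,
# so this matches the literal backslash-u spelling only.
literal_hits = re.findall("\n<br />\\\\u2022", escaped_text)
cross_hits = re.findall("\n<br />\\\\u2022", text)

print(len(bullet_hits), len(literal_hits), len(cross_hits))
```

The pattern has to be spelled the same way as the text it is searched against: a real bullet for text containing real bullets, and an escaped backslash for text containing the literal escape sequence.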