Question
We are running into a problem when parsing emails from Outlook with Python. Some emails contain characters that cannot be appended to an Excel worksheet using openpyxl. The only error it raises is IllegalCharacterError.
I am trying to force this to print out the actual characters that are considered "Illegal".
That said, while digging through the openpyxl source, I found this line in cell.py that raises the error:
if next(ILLEGAL_CHARACTERS_RE.finditer(value), None):
    raise IllegalCharacterError
So navigating to where ILLEGAL_CHARACTERS_RE is defined, we find:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
So I tried print(ILLEGAL_CHARACTERS_RE) in the hope that it would print the values it represents. I am not very skilled with regex or re.compile, so I was not sure what would happen, but all that was printed to the console was re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]').
Can someone help me figure out how to print these values or at the very least understand how to find what these values represent?
Answer 1:
In a regular expression (regex for short), what you are seeing is a pattern that matches characters in certain ranges. The escapes such as \010 are octal character codes. Taking the pattern one part at a time:
First part of RE:
[\000-\010]
This set matches any character from octal 000 to 010, that is char codes 0 to 8, all of which are control characters. You could be getting any character from NUL (null) to BS (backspace).
Second part of RE:
[\013-\014]
Again, these are control characters: octal 013 to 014, that is char codes 11 and 12, which are VT (vertical tab) and FF (form feed). Neither is printable. Note that char codes 9 (horizontal tab) and 10 (line feed) fall between the first and second ranges and are not matched.
Third part of RE:
[\016-\037]
This range covers octal 016 to 037, that is char codes 14 to 31. These are all control characters as well, from SO (shift out) to US (unit separator). Char code 13 (carriage return) sits just before this range and is not matched.
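To see exactly which characters the three ranges cover, one option is to test every ASCII code point against the pattern. This is only a sketch: the pattern is copied from the question rather than imported from openpyxl, and illegal_codes is a name made up for the example.

```python
import re

# Pattern copied from the question (as defined in openpyxl's cell.py).
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

# Test every ASCII code point against the pattern and keep the ones that match.
# illegal_codes is a hypothetical name used only for this example.
illegal_codes = [
    code for code in range(128)
    if ILLEGAL_CHARACTERS_RE.fullmatch(chr(code))
]

for code in illegal_codes:
    # repr() prints an escape such as '\x0b' instead of the invisible character.
    print(f"decimal {code:3d}  octal {code:03o}  {chr(code)!r}")
```

This lists 29 code points: 0 to 8, 11, 12, and 14 to 31.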
So the reason you cannot see the illegal characters when printing is that the pattern matches only non-printable control characters. Printable ASCII characters run from 32 (the space character) to 126, while the pattern covers everything from \000 to \037 (0 to 31) except tab (9), line feed (10), and carriage return (13). To see which character triggered the error, print its character code or its repr() rather than the character itself.
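For the original goal of finding out which characters in a given email are being rejected, a minimal sketch along these lines may help. The helper names find_illegal_characters and strip_illegal_characters and the sample string are hypothetical, and the pattern is again copied from the question instead of imported.

```python
import re

# Pattern copied from the question (as defined in openpyxl's cell.py).
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

def find_illegal_characters(value):
    """Return (index, escaped character) pairs for every match in value."""
    return [
        (match.start(), repr(match.group()))
        for match in ILLEGAL_CHARACTERS_RE.finditer(value)
    ]

def strip_illegal_characters(value):
    """Return value with every matched control character removed."""
    return ILLEGAL_CHARACTERS_RE.sub('', value)

# Hypothetical email text containing a vertical tab and a unit separator.
sample = "Hello\x0bWorld\x1f\tok"

print(find_illegal_characters(sample))
print(repr(strip_illegal_characters(sample)))
```

The tab in the sample is left alone because char code 9 is not in any of the three ranges. Whether to strip the offending characters or replace them with something visible depends on what the worksheet is used for.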
Here is an ASCII table for reference: https://www.w3schools.com/charsets/ref_html_ascii.asp
I hope this helps!
Source: https://stackoverflow.com/questions/62478974/what-are-all-the-illegal-characters-from-openpyxl