Question
I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").
The file looks like this in a hex viewer: EF BB BF 43 6F 6E 66 65 72 65 6E 63 65
This translates to "Conference" when printed. I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it messes up a string comparison that I undertake later).
So I tried to remove it using the following code, but I fail miserably:
$line =~ s/^\xEF\xBB\xBF//;
Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?
Thanks!
Answer 1:
EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:
s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//; # Convenient alias
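To make the difference between the encoded and decoded forms concrete, here is a minimal sketch that decodes the bytes from the question in memory and tries both patterns. The variable names are illustrative only, and Encode's decode stands in for the :encoding(UTF-8) layer that the question applies on the file handle.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as shown in the hex dump: EF BB BF followed by "Conference".
my $bytes = "\xEF\xBB\xBF" . "Conference";

# Decoding turns the three BOM bytes into the single character U+FEFF.
my $line = decode('UTF-8', $bytes);

# The byte-oriented pattern from the question no longer matches...
(my $byte_attempt = $line) =~ s/^\xEF\xBB\xBF//;

# ...whereas the character-oriented pattern does.
(my $char_attempt = $line) =~ s/^\x{FEFF}//;

printf "decoded length:     %d\n", length $line;          # 11 (BOM + 10 letters)
printf "after byte pattern: %d\n", length $byte_attempt;  # 11 (unchanged)
printf "after char pattern: %d\n", length $char_attempt;  # 10
```

The byte pattern would only match if the string had been read without any decoding layer.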
You wrote: "I understand the 'wide character' which I am being warned about is the BOM. I want to get rid of it."
You're getting the "Wide character" warning because you forgot to add an :encoding layer to your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT, and STDERR, and makes it the default for open().
use open ':std', ':encoding(UTF-8)';
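Putting the two pieces together, the following is a rough end-to-end sketch: decode on input, strip the decoded BOM from the first line, and encode on output. It sets the layers explicitly on each handle rather than through the open pragma, and the temporary file is hypothetical test data standing in for the asker's file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway file containing the bytes from the question
# (hypothetical test data, not the asker's real file).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xEF\xBB\xBF" . "Conference\n";
close $out or die "Cannot close $path: $!";

# Decode on input, encode on output.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
binmode STDOUT, ':encoding(UTF-8)';

my $first = <$in>;
close $in;
chomp $first;
$first =~ s/^\x{FEFF}//;    # drop the decoded BOM, if present

# The comparison that failed in the question now succeeds.
print $first eq 'Conference' ? "match\n" : "no match\n";
```

With the output layer in place, printing a string that still contained U+FEFF would not trigger the "Wide character" warning either, but the BOM would still break the string comparison, so both steps are needed.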
Answer 2:
To defuse the BOM, you have to know that once the input is decoded it is not three characters but one, U+FEFF:
s/^\x{FEFF}//;
Answer 3:
If you open the file using File::BOM, it will remove the BOM for you.
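use File::BOM qw(open_bom); # open_bom is not exported by default
# ':utf8' is the layer to use when the file has no BOM
open_bom(my $fh, $path, ':utf8');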
Answer 4:
Ideally, your filehandle should be doing this for you automatically. But if you're not in an ideal situation, this worked for me:
use Encode;
my $value = decode('UTF-8', $originalvalue);
$value =~ s/^\N{U+FEFF}//; # anchored, so only a leading BOM is removed
Source: https://stackoverflow.com/questions/24390034/remove-bom-from-string-with-perl