Before anyone tells me to RTFM, I must say - I have dug through:
is_utf8 returns information about which internal storage format was used, period.
Now on to your questions.
The whole utf8 pragma is a mystery for me.
use utf8; tells perl your source code is encoded using UTF-8. If you don't tell it so, perl effectively assumes it's iso-8859-1 (as a side-effect of internal mechanisms).
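To make the effect of the pragma concrete, here is a minimal sketch. It assumes the script file itself is saved as UTF-8; the variable names are only for illustration. The same literal is compiled once without the pragma and once with it, and only the character count differs.

```perl
use strict;
use warnings;

my ($without, $with);
{
    no utf8;            # the default: each source byte becomes one character
    $without = "é";
}
{
    use utf8;           # lexically scoped: source literals are decoded from UTF-8
    $with = "é";
}

# Assuming the file is saved as UTF-8, "é" occupies two bytes on disk.
printf "without: %d, with: %d\n", length($without), length($with);
# prints "without: 2, with: 1"
```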
The functions in the utf8:: namespace are unrelated to the pragma, and they serve a variety of purposes.
utf8::encode and utf8::decode: Useful encoding and decoding functions. Similar to Encode's encode_utf8 and decode_utf8, but they work in-place.
utf8::upgrade and utf8::downgrade: Rarely used, but useful for working around bugs in XS modules. More on this below.
utf8::is_utf8: I don't know why someone would ever use that.
HOW i can ensure (test it), than any $other_data contains valid unicode string?
What does "valid Unicode string" mean to you? Unicode has different definitions of valid for different circumstances.
for what purpose is the utf8::is_utf8($data)?
Debugging. It peeks at Perl guts.
In the above example utf8::is_utf8($data) will print OK - but don't understand WHY.
Because NFD happens to have chosen to return a scalar containing a string in the UTF8=1 format.
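Perl has two formats for storing strings: UTF8=0, in which each character is stored as a single 8-bit value, and UTF8=1, in which characters are stored in a variable-length, UTF-8-like encoding.
The first format uses less memory and is faster when it comes to accessing a specific position in the string, but it's limited in what it can contain. (For example, it can't store arbitrary Unicode code points, since they can require up to 21 bits.) Perl can freely switch between the two.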
use utf8;
use feature qw( say );
my $d = my $u = "abcdé";
utf8::downgrade($d); # Switch to using the UTF8=0 format for $d.
utf8::upgrade($u); # Switch to using the UTF8=1 format for $u.
say utf8::is_utf8($d) ?1:0; # 0
say utf8::is_utf8($u) ?1:0; # 1
say $d eq $u ?1:0; # 1
One normally doesn't have to worry about this, but there are buggy modules. There are even buggy corners of Perl remaining despite use feature qw( unicode_strings ). One can use utf8::upgrade and utf8::downgrade for changing the format of a scalar to that expected by the XS function.
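As a rough sketch of that workaround: the wrapper below downgrades a copy of the string before handing it to code that expects the UTF8=0 format. Both call_with_bytes and $legacy are hypothetical names invented for this illustration; $legacy merely stands in for a buggy XS function.

```perl
use strict;
use warnings;

# Hypothetical wrapper: pass a UTF8=0 copy of $string to $code.
sub call_with_bytes {
    my ($code, $string) = @_;
    my $copy = $string;
    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string contains characters above 255.
    utf8::downgrade($copy, 1) or return;
    return $code->($copy);
}

# Stand-in for an XS function; it only reports the format it received.
my $legacy = sub { utf8::is_utf8($_[0]) ? 'UTF8=1' : 'UTF8=0' };

my $word = "caf\x{E9}";
utf8::upgrade($word);                               # force the UTF8=1 format
my $ok  = call_with_bytes($legacy, $word);          # 'UTF8=0'
my $bad = call_with_bytes($legacy, "\x{263A}");     # undef: cannot be downgraded

print "ok=$ok bad=", (defined $bad ? $bad : 'undef'), "\n";
# prints "ok=UTF8=0 bad=undef"
```

The copy keeps the caller's scalar untouched, so the format change stays local to the call.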
Or it is miss-named and the function should be named as uni::is_unicode($data)???
That's no better. Perl has no way to know whether a string is a Unicode string or not. If you need to track that, you need to track it yourself.
Strings in the UTF8=0 format may contain Unicode code points.
my $s = "abc"; # U+0061,0062,0063
Strings in the UTF8=1 format may contain values that aren't Unicode code points.
my $s = pack('W*', @temperature_measurements);
HOW i can ensure (test it), than any $other_data contains valid unicode string?
You cannot determine ex post facto whether a string has character semantics or byte semantics. Perl does not track this for you. You have to track it by careful programming: encode and decode at the boundaries; :raw layer for byte semantics, :encoding(foo) for character semantics. Employ naming conventions for your variables and functions to clearly differentiate between the semantics and make wrong code look wrong.
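A minimal sketch of "decode at the boundary, work on characters, encode on the way out", using the Encode module. The $octets_/$text naming is just one possible convention, and the input bytes are made up for the example.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw(decode encode);

# Boundary in: bytes as they might arrive from a file or socket.
my $octets_in = "na\xC3\xAFve";                     # 6 bytes

# LEAVE_SRC keeps decode() from modifying $octets_in when a CHECK is given;
# FB_CROAK makes malformed input die instead of being silently replaced.
my $text = decode('UTF-8', $octets_in, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Inside the program: character semantics.
my $text_upper = uc $text;                          # 5 characters

# Boundary out: back to bytes.
my $octets_out = encode('UTF-8', $text_upper);

printf "in: %d bytes, text: %d chars, out: %d bytes\n",
    length($octets_in), length($text), length($octets_out);
# prints "in: 6 bytes, text: 5 chars, out: 6 bytes"
```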
for what purpose is the utf8::is_utf8($data)?
It tells you whether the SvUTF8 flag is present, nothing more. This is almost entirely useless for most developers, because it is an internals thing. The flag does not mean that a string has character semantics, and its absence does not mean that a string has byte semantics.
The whole utf8 pragma is a mystery for me.
Probably because it is overdocumented, and therefore confusing. Most developers can stop reading after the part where it says that its purpose is to enable Unicode literals in the source code.
In the above example utf8::is_utf8($data) will print OK - but don't understand WHY.
Because of uni::perl, which enables use open qw(:utf8 :std). Any input read from STDIN with <> will be decoded. The normalisation step afterwards does not change that.
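The decoding done by an I/O layer can be sketched without touching STDIN. The example below is an approximation under stated assumptions: an in-memory filehandle stands in for STDIN, and the layer is given explicitly on open rather than through use open. The same bytes are read once raw and once decoded.

```perl
use strict;
use warnings;

my $octets = "caf\xC3\xA9\n";       # "café" plus newline, as UTF-8 bytes

# Byte semantics: the layer hands the bytes through unchanged.
open(my $raw, '<:raw', \$octets) or die "open :raw failed: $!";
my $as_bytes = <$raw>;
close $raw;

# Character semantics: the layer decodes while reading.
open(my $dec, '<:encoding(UTF-8)', \$octets) or die "open :encoding failed: $!";
my $as_chars = <$dec>;
close $dec;

chomp($as_bytes, $as_chars);
printf "raw: %d, decoded: %d\n", length($as_bytes), length($as_chars);
# prints "raw: 5, decoded: 4"
```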