I've just installed a website & legacy CMS onto our server and I'm getting a POSIX compilation error. Luckily it's only appearing in the backend however the client's
Your error message that “POSIX collating elements are not supported” deserves some explanation. After all, what in the world is a POSIX collating element anyway, and how can I avoid it?
The short answer is that you have an equals sign inside your square brackets in a place where its use is reserved for future use, assuming we ever get around to implementing it, which is anything but certain. You can tickle this in Perl on the command line this way, which gives a much better error message than PHP is providing:
% perl -le 'print "abc" =~ /[=foo=]/ || "Fail"'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[=foo=] <-- HERE / at -e line 1.
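If the brackets and equals signs were meant as literal characters, one way out is to escape the inner brackets. The following is only a sketch: `try_compile` is a hypothetical helper name, and it compiles the pattern at run time so the fatal error can be trapped with `eval` instead of aborting the program.

```perl
use strict;
use warnings;

# Compile a pattern at run time so that a fatal regex error can be trapped.
# try_compile() is a hypothetical helper, not part of any library.
sub try_compile {
    my ($source) = @_;
    my $re = eval { qr/$source/ };
    return defined $re ? "ok" : "error: $@";
}

print try_compile('[[=foo=]]'),   "\n";   # reserved POSIX form inside a class
print try_compile('[\[=foo=\]]'), "\n";   # escaped brackets are plain literals
```

The same escaping should carry over to PCRE patterns in PHP, though that is an assumption I have not checked against the CMS in question.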
That’s the short answer; the longer answer follows.
Inside a square-bracketed character class, POSIX admits three different nested bracketed forms, each marked by a pair of extra symbols just inside the brackets:
[:PROPERTY:], as in [:alpha:].
[=ELEMENTS=], as in [=eéèëê=] in English or French, and [=vw=] in Swedish.
[.DIGRAPH.], as in [.ch.] or [.ll.] per the traditional Spanish alphabet. These are sometimes known as contractions because two or more code points count as though that sequence were a single code point.
Perl supports only the first of these, not the second and third.
They are all awkward to use, because they must be nested inside an extra set of brackets, as in [[:punct:]] to mean roughly \pP or \p{punct}. With Unicode properties you need the extra brackets only when you are selecting one of many, as in [\pL\pN\pM\p{Pc}].
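As a small illustration of the difference in notation, here is a sketch that matches the same word with a POSIX class and with a Unicode property, then combines several properties in one bracketed class. The sample strings are my own, and the /u flag assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my $word = "ni\x{F1}o";    # "niño", written with an escape to stay ASCII-only

# The /u flag asks for Unicode rules, so the POSIX class also covers U+00F1.
my $posix_ok    = $word =~ /^[[:alpha:]]+$/u    ? 1 : 0;
my $property_ok = $word =~ /^\p{Alphabetic}+$/u ? 1 : 0;

# Several properties combined in one bracketed class.
my $mixed_ok    = "caf\x{E9}_42" =~ /^[\pL\pN\p{Pc}]+$/u ? 1 : 0;

print "posix=$posix_ok property=$property_ok mixed=$mixed_ok\n";
```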
The other two were an attempt to support locale-specific linguistic elements in a pre‐Unicode environment under legacy 8‑bit locales. For example, to express the traditional Spanish alphabet, which counts acute accents over vowels and diaereses over u’s as the same letter yet which counts a tilde over an n as a different letter altogether, and which furthermore has two digraphs each counting as a distinct letter, you would have to write this in POSIX:
[[=aá=]bc[.ch.]d[=eé=]fgh[=ií=]jkl[.ll.]mnñ[=oó=]pqrst[=uúü=]vwxyz]
You can and sometimes must combine these. For example, German phonebooks allow the three i‑mutated vowels to be spelt without diacritics by inserting a following e:
[a[=ä[.ae.]=]bcdefghijklmno[=ö[.oe.]=]pqrs[=ß[.ss.]=]tu[=ü[.ue.]=]vwxyz]
That way, assuming $ES and $DE are those languages’ respective alphabets, you could say something like [$ES]{4} and have it match words like guía, niño, llave, and choco in Spanish; or in German have [$DE]{6} and have it match words like tschüß or its uppercase undiacriticked equivalent, TSCHUESS.
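Since Perl does not accept the POSIX forms, the closest you can get today is an approximation built from alternation. The sketch below is my own stand-in, not an equivalent of the POSIX class: the pattern in $ES and the helper `is_four_spanish_letters` are hypothetical, and ordinary backtracking means a sequence such as "ll" can still be read as two separate letters when that makes the match succeed.

```perl
use strict;
use warnings;
use utf8;                                   # this source has literal non-ASCII letters
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical stand-in for the POSIX class: digraphs first, then single letters.
my $ES = qr/(?:ch|ll|[a-zñáéíóúü])/i;

# Note (?:$ES){4}: writing $ES{4} would be parsed as a hash element.
sub is_four_spanish_letters {
    my ($word) = @_;
    return $word =~ /^(?:$ES){4}$/ ? 1 : 0;
}

for my $word (qw(guía niño llave choco perro)) {
    printf "%s: %s\n", $word,
        is_four_spanish_letters($word) ? "match" : "no match";
}
```

The first four words count as four letters each under this scheme, while perro has five because rr is not treated as a digraph here.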
This is awkward for various reasons, and not just those that are obvious from the two alphabets listed above. It does not admit the notion of combining characters, so you have to add those explicitly for non-normalized text, as in [=e\xE9[.e\x{301}.]=].
Unicode has taken another path in how to implement linguistic elements like this. Fortunately, Unicode regular expressions per UTS#18 do not need to support language features tailored for specific languages or locales until Level 3. This is something no one has yet implemented.
Note that having SS and ß have the same casefold is not considered a locale tailoring. It is the full casefold for that code point no matter the linguistic context. So those are the same when case is ignored. Strange but true. Given that ß is code point U+00DF, we see that these are the same no matter the locale:
$ perl5.14.0 -E 'say "SS" =~ /^\xDF$/i ? "Pass" : "Fail"'
Pass
$ perl5.14.0 -E 'say "\xDF" =~ /^SS$/i ? "Pass" : "Fail"'
Pass
Although locale tailoring for patterns is still beyond us, collation has been implemented, including with locale support, and you can access it from Perl just fine.
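As a rough sketch of what that looks like, the following compares default collation with the traditional Spanish tailoring. It assumes the Unicode::Collate and Unicode::Collate::Locale modules are installed (both ship with recent Perls) and that the locale name es__traditional is available in your version; the word list is my own.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

my @words = qw(luz llave cosa chico);

# Default Unicode collation versus the traditional Spanish tailoring,
# in which "ch" and "ll" are treated as letters of their own.
my $default     = Unicode::Collate->new;
my $traditional = Unicode::Collate::Locale->new(locale => 'es__traditional');

my @by_default     = $default->sort(@words);
my @by_traditional = $traditional->sort(@words);

print "default:     @by_default\n";
print "traditional: @by_traditional\n";
```

Under the traditional tailoring, words beginning with ch sort after all other c words, and words beginning with ll after all other l words.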
However, PHP does not yet support Unicode collation.
References for Unicode collation include: