Question
I've found that the PHP functions basename() and pathinfo() behave strangely with multibyte UTF-8 names. They strip all non-Latin characters up to the first Latin character or punctuation mark. After that point, however, subsequent non-Latin characters are preserved.
basename("àxà"); // returns "xà", I would expect "àxà" or just "x" instead
pathinfo("àyà/àxà", PATHINFO_BASENAME); // returns "xà", same as above
Curiously, the dirname part of pathinfo() works fine:
pathinfo("àyà/àxà", PATHINFO_DIRNAME); // returns "àyà"
The PHP documentation warns that basename() and pathinfo() are locale-aware, but this does not justify the inconsistency between pathinfo(..., PATHINFO_BASENAME) and pathinfo(..., PATHINFO_DIRNAME), not to mention the fact that identical non-Latin characters are either discarded or kept depending on their position relative to Latin characters.
It sounds like a PHP bug.
Since "basename" checks matter for security, in particular to prevent directory traversal, is there any reliable basename filter that works decently with Unicode input?
Answer 1:
I've found that changing the locale fixes everything.
While Apache runs with the "C" locale by default, CLI scripts run by default with a UTF-8 locale such as "en_US.UTF-8" (or, in my case, "it_IT.UTF-8"). Under those conditions the problem does not occur.
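To see which locale a given environment is actually running under, the current setting can be queried without changing it. This is only a small diagnostic sketch; the variable name is my own, and the exact string returned depends on the system:

```php
<?php
// Passing "0" as the locale asks setlocale() to return the current
// setting without modifying it.
$current = setlocale(LC_ALL, "0");

// PHP_SAPI is e.g. "cli" or "apache2handler". The locale may be a single
// name such as "C", or a semicolon-separated list of categories.
echo PHP_SAPI, ': ', $current, "\n";
```

Running this once from the command line and once through the web server should show whether the two environments differ.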
Therefore, the workaround on Apache consists in changing the locale from "C" to "C.UTF-8" before calling these functions.
setlocale(LC_ALL,'C.UTF-8');
basename("àxà"); // now returns "àxà", which is correct
pathinfo("àyà/àxà", PATHINFO_BASENAME); // now returns "àxà", which is correct
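One caveat: setlocale() returns false when the requested locale is not installed, and "C.UTF-8" is not available on every system. Below is a minimal sketch of a guarded version. The helper name useUtf8Locale() and the fallback list are my own assumptions, not something from the PHP manual:

```php
<?php
// Hypothetical helper: try several UTF-8 locales and report which one
// was applied. The candidate list is an assumption; adjust it to the
// locales installed on your server.
function useUtf8Locale(array $candidates = ['C.UTF-8', 'en_US.UTF-8'])
{
    // setlocale() accepts an array and applies the first locale that
    // exists. It returns the applied locale name, or false if none of
    // the candidates is available.
    return setlocale(LC_ALL, $candidates);
}

$applied = useUtf8Locale();
if ($applied === false) {
    // No UTF-8 locale installed: basename() may still mangle names.
    error_log('No UTF-8 locale available; basename() results are unreliable');
} else {
    echo basename("àyà/àxà"), "\n";
}
```

Checking the return value avoids silently staying on the "C" locale when the workaround did not take effect.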
Or even better, if you want to back up the current locale and restore it once done:
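$lc = new LocaleManager();
$lc->doBackup();
$lc->fixLocale();
$name = basename("àxà/àyà"); // keep the result; multibyte characters are now preserved
$lc->doRestore();

class LocaleManager
{
    /** @var array Saved locale settings, keyed by category name (e.g. "LC_ALL") */
    private $backup = array();

    /** Remember the current locale settings so they can be restored later. */
    public function doBackup()
    {
        $this->backup = array();
        // "0" returns the current locale without changing it
        $localeSettings = setlocale(LC_ALL, "0");
        if (strpos($localeSettings, ";") === false)
        {
            $this->backup["LC_ALL"] = $localeSettings;
        }
        // If any of the locales differs, then setlocale() returns all the locales separated by semicolon
        // Eg: LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=C;...
        else
        {
            $locales = explode(";", $localeSettings);
            foreach ($locales as $locale)
            {
                // Limit to 2 parts in case the value itself contains "="
                list ($key, $value) = explode("=", $locale, 2);
                $this->backup[$key] = $value;
            }
        }
    }

    /** Re-apply the settings saved by doBackup(). */
    public function doRestore()
    {
        foreach ($this->backup as $key => $value)
        {
            // Skip categories that have no matching LC_* constant on this platform
            if (defined($key))
            {
                setlocale(constant($key), $value);
            }
        }
    }

    /** Switch to a UTF-8 locale so basename() and pathinfo() keep multibyte characters. */
    public function fixLocale()
    {
        setlocale(LC_ALL, "C.UTF-8");
    }
}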
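If changing the locale is not an option, a locale-independent alternative is to split the path on the separator bytes directly. This works for UTF-8 input because the bytes for "/" and "\" never occur inside a multibyte UTF-8 sequence. The following is only a sketch: utf8Basename() is a hypothetical helper, not a PHP built-in, and treating "\" as a separator on every platform is my own assumption, chosen to be strict about traversal attempts.

```php
<?php
// Hypothetical helper, not part of PHP: a byte-wise basename that does
// not depend on the current locale. Assumes the input is UTF-8.
function utf8Basename(string $path): string
{
    // Treat both "/" and "\" as separators, then drop trailing ones.
    $path = rtrim(str_replace('\\', '/', $path), '/');

    // Keep everything after the last separator, or the whole string
    // if there is no separator at all.
    $pos = strrpos($path, '/');
    return $pos === false ? $path : substr($path, $pos + 1);
}

echo utf8Basename("àyà/àxà"), "\n"; // prints "àxà"
echo utf8Basename("àxà"), "\n";     // prints "àxà"
```

Like basename(), this still returns "." or ".." when given those names, so a directory-traversal check should reject them explicitly.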
Source: https://stackoverflow.com/questions/45268499/php-basename-and-pathinfo-with-multibytes-utf-8-file-names