PHP basename() and pathinfo() with Multibyte UTF-8 File Names

Submitted by 一笑奈何 on 2019-12-10 13:13:52

Question


I've found that the PHP functions basename() and pathinfo() behave strangely with multibyte UTF-8 names. They strip all leading non-Latin characters up to the first Latin character or punctuation sign, but preserve any non-Latin characters that come after it.

basename("àxà"); // returns "xà", I would expect "àxà" or just "x" instead
pathinfo("àyà/àxà", PATHINFO_BASENAME); // returns "xà", same as above

Curiously, however, the dirname part of pathinfo() works fine:

pathinfo("àyà/àxà", PATHINFO_DIRNAME); // returns "àyà"

The PHP documentation warns that basename() and pathinfo() are locale-aware, but this does not justify the inconsistency between pathinfo(..., PATHINFO_BASENAME) and pathinfo(..., PATHINFO_DIRNAME), let alone the fact that identical non-Latin characters are either discarded or kept depending on their position relative to Latin characters.

This looks like a PHP bug.

Since basename checks are important for security, namely to prevent directory traversal, is there a reliable basename filter that handles Unicode input properly?


Answer 1:


I've found that changing the locale fixes the problem.

While Apache runs with the "C" locale by default, CLI scripts by default run with a UTF-8 locale such as "en_US.UTF-8" (or, in my case, "it_IT.UTF-8"). Under a UTF-8 locale, the problem does not occur.

Therefore, the workaround on Apache is to change the locale from "C" to "C.UTF-8" before calling these functions.

setlocale(LC_ALL,'C.UTF-8');
basename("àxà"); // now returns "àxà", which is correct
pathinfo("àyà/àxà", PATHINFO_BASENAME); // now returns "àxà", which is correct
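
Note that "C.UTF-8" may not be installed on every system, and setlocale() returns false when it cannot apply a requested locale. The following is a minimal sketch, not part of the original answer, of checking that return value before relying on the result. The helper name applyUtf8Ctype() and the list of candidate locale names are assumptions chosen for illustration. It sets only LC_CTYPE, the category that multibyte character handling depends on.

```php
<?php
// Hypothetical helper (not part of the original answer): tries several UTF-8
// locale names for the LC_CTYPE category, which governs character handling.
function applyUtf8Ctype(array $candidates)
{
    // setlocale() accepts an array of locale names, applies the first one the
    // system knows, and returns its name; it returns false if none can be set.
    return setlocale(LC_CTYPE, $candidates);
}

// Assumed candidate names; which of them exist depends on the operating system.
$applied = applyUtf8Ctype(array('C.UTF-8', 'en_US.UTF-8', 'en_US.utf8'));

if ($applied === false) {
    // No candidate is installed, so basename() would still run under the old locale.
    echo "No UTF-8 locale available\n";
} else {
    echo "Using locale: ", $applied, "\n";
    echo basename("àyà/àxà"), "\n";
}
```

If none of the candidates is available, the locale is left unchanged and the stripped-prefix behaviour described in the question can still occur, so the failure branch should not be ignored.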

Or even better, if you want to back up the current locale and restore it once done:

$lc = new LocaleManager();
$lc->doBackup();
$lc->fixLocale();
basename("àxà/àyà");
$lc->doRestore();


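class LocaleManager
{
    /** @var array Saved locale settings, keyed by category constant name (e.g. "LC_CTYPE") */
    private $backup = array();


    /**
     * Saves the current locale of every category so that doRestore() can reapply it.
     */
    public function doBackup()
    {
        $this->backup = array();
        // Passing "0" makes setlocale() return the current setting without changing it
        $localeSettings = setlocale(LC_ALL, "0");
        if (strpos($localeSettings, ";") === false)
        {
            $this->backup["LC_ALL"] = $localeSettings;
        }
        // If any of the locales differs, then setlocale() returns all the locales separated by semicolons
        // E.g.: LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=C;...
        else
        {
            $locales = explode(";", $localeSettings);
            foreach ($locales as $locale)
            {
                // Limit the split to 2 parts so that the value is kept whole
                list ($key, $value) = explode("=", $locale, 2);
                $this->backup[$key] = $value;
            }
        }
    }


    /**
     * Reapplies the settings saved by doBackup(), one category at a time.
     */
    public function doRestore()
    {
        foreach ($this->backup as $key => $value)
        {
            // $key is a constant name such as "LC_CTYPE"; constant() resolves it to its value
            setlocale(constant($key), $value);
        }
    }


    /**
     * Switches every category to "C.UTF-8" so that multibyte names are handled correctly.
     */
    public function fixLocale()
    {
        setlocale(LC_ALL, "C.UTF-8");
    }
}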
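
As an alternative that does not depend on the process locale at all, the last path component can be extracted with byte-oriented string functions. This is safe for UTF-8 input because the "/" byte (0x2F) never occurs inside a multibyte UTF-8 sequence. The following is a minimal sketch, not part of the original answer: the function name utf8Basename() is hypothetical, only "/" is treated as a separator, and it is not a complete directory traversal defence on its own (for example, it still returns ".." for the input "..").

```php
<?php
// Hypothetical helper: returns the last component of a "/"-separated path
// without consulting the locale.
function utf8Basename($path)
{
    // Drop trailing slashes so that "dir/name/" yields "name".
    $trimmed = rtrim($path, '/');
    if ($trimmed === '') {
        // The path was empty or consisted only of slashes.
        return '';
    }
    // Find the last remaining slash; everything after it is the basename.
    $pos = strrpos($trimmed, '/');
    return $pos === false ? $trimmed : substr($trimmed, $pos + 1);
}

// Force the "C" locale to show that the result does not depend on it.
setlocale(LC_ALL, 'C');
echo utf8Basename("àyà/àxà"), "\n";   // prints "àxà"
echo utf8Basename("àyà/àxà/"), "\n";  // prints "àxà"
```

Because it works on raw bytes, the helper gives the same result under Apache and the CLI. Unlike basename() on Windows, it does not treat backslashes as separators, so input that may contain them would need to be normalised first.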


Source: https://stackoverflow.com/questions/45268499/php-basename-and-pathinfo-with-multibytes-utf-8-file-names
