Search And Replace Special Characters PHP

庸人自扰 2021-01-24 23:01

I am trying to search and replace special characters in strings that I am parsing from a CSV file. When I open the text file with vim, it shows me the character as <95>. I

2 Answers
  • 2021-01-24 23:09

    Following Bobince's suggestion, this worked for me.

    analyse_file() comes from this comment in the PHP manual: http://www.php.net/manual/en/function.fgetcsv.php#101238

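    // Read a file and convert it to UTF-8, guessing between UTF-8 and ISO-8859-1.
    // Note: ISO-8859-1 maps byte 0x95 to a control character, not to the bullet;
    // if the file is really Windows-1252, name that encoding instead.
    function file_get_contents_utf8($fn) {
        $content = file_get_contents($fn);
        return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
    }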
    
    
    if ($_FILES['file']['error'] != UPLOAD_ERR_NO_FILE) { // a file was actually uploaded
        foreach ($_FILES as $file) {
            $n = $file['name'];
            $s = $file['size'];
            $filename = $file['tmp_name'];
            ini_set('auto_detect_line_endings', TRUE); // in case of a Mac CSV (setting is deprecated since PHP 8.1)
            // dealing with fgetcsv() special chars:
            // read the file into a string, do your pre-processing changes,
            // write it back out to a temporary file, and have fgetcsv() read that.
            $contents = file_get_contents_utf8($filename); // separate name so $file is not overwritten
            $tempFile = tempnam(sys_get_temp_dir(), '');
            $handle = fopen($tempFile, "w+");
            fwrite($handle, $contents);
            rewind($handle);
            $filename = $tempFile;
            // END -- dealing with fgetcsv() special chars
            $Array = analyse_file($filename, 10);
            $csvDelim = $Array['delimiter']['value'];
            while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
                // process the csv file
            }
            fclose($handle);
            unlink($tempFile); // remove the temporary copy
        } // end foreach
    }
    
  • 2021-01-24 23:28

    0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:

    $line= str_replace("\x95", '', $line);
    

    Or you can use iconv() to convert the character set of the data from cp1252 to UTF-8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, e.g. with:

    $line= preg_replace("/[\x80-\xFF]/", '', $line);
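
    To put the two options side by side, here is a minimal sketch, assuming the data really is Windows-1252 and that the iconv extension is available. The helper names cp1252_to_utf8() and strip_non_ascii() are made up for illustration and are not part of PHP.

    ```php
    <?php
    // Hypothetical helpers for illustration; the names are not from the answer.

    // Convert a Windows-1252 byte string to UTF-8, so that byte 0x95
    // becomes the three-byte UTF-8 encoding of U+2022 (bullet).
    function cp1252_to_utf8(string $line): string
    {
        $converted = iconv('CP1252', 'UTF-8', $line);
        // iconv() returns false on failure (for example on a byte that
        // CP1252 leaves undefined); fall back to the untouched input.
        return $converted === false ? $line : $converted;
    }

    // Drop every byte outside 7-bit ASCII, as the preg_replace() call does.
    function strip_non_ascii(string $line): string
    {
        return preg_replace('/[\x80-\xFF]/', '', $line);
    }
    ```

    Converting keeps the bullet as a real character; stripping throws it away, which is the safer choice only when whatever reads the string next cannot cope with non-ASCII bytes.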
    

    If your CSV parser is fgetcsv(), you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3+) instead. Unfortunately this also means you have to read the file and split it row by row yourself, which is not trivial given that quoted CSV values may contain newlines. By the time you've written the code to handle that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
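
    The read, pre-process, write, re-read workaround could look roughly like this. It is only a sketch: read_cp1252_csv() is a hypothetical name, the input is assumed to be Windows-1252, and tmpfile() is used here instead of a named temporary file so that cleanup happens when the handle is closed.

    ```php
    <?php
    // Hypothetical helper: parse a CP1252-encoded CSV file into rows of UTF-8 strings.
    function read_cp1252_csv(string $path, string $delimiter = ','): array
    {
        $raw = file_get_contents($path);
        if ($raw === false) {
            throw new RuntimeException("Cannot read $path");
        }

        // Pre-process the whole file as one string, so quoted fields that
        // contain newlines are never split by hand.
        $utf8 = iconv('CP1252', 'UTF-8', $raw);
        if ($utf8 === false) {
            throw new RuntimeException("Cannot convert $path from CP1252");
        }

        // tmpfile() opens a temporary file that is deleted when it is closed.
        $handle = tmpfile();
        if ($handle === false) {
            throw new RuntimeException('Cannot create a temporary file');
        }
        fwrite($handle, $utf8);
        rewind($handle);

        $rows = [];
        // Length 0 means "no line-length limit"; enclosure and escape are explicit.
        while (($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
            $rows[] = $row;
        }
        fclose($handle);

        return $rows;
    }
    ```

    Because fgetcsv() still does the parsing, a quoted value that spans several lines comes back as a single field.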

    The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
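
    For completeness, post-processing each field might be sketched as below, with the caveat above still applying: it only helps if fgetcsv() handed the bytes back unmodified. The helpers field_to_utf8() and row_to_utf8() are hypothetical, and treating every field that is not valid UTF-8 as Windows-1252 is an assumption.

    ```php
    <?php
    // Hypothetical helper: convert one parsed field to UTF-8 unless it already is.
    function field_to_utf8(string $field): string
    {
        // With the /u modifier, preg_match() succeeds only on valid UTF-8 input.
        if (preg_match('//u', $field) === 1) {
            return $field;
        }
        // Assumption: anything that is not valid UTF-8 is Windows-1252.
        $converted = iconv('CP1252', 'UTF-8', $field);
        return $converted === false ? $field : $converted;
    }

    // Apply the conversion to every field of a row returned by fgetcsv().
    function row_to_utf8(array $row): array
    {
        // fgetcsv() yields a single null field for a blank line; leave it alone.
        return array_map(
            fn($field) => $field === null ? null : field_to_utf8($field),
            $row
        );
    }
    ```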

    In summary, PHP's built-in CSV parsing stuff is pretty crap.
