How to Check if File is ASCII or Binary in PHP

Asked by 说谎, 2020-12-03 14:23

Is there a quick, simple way to check if a file is ASCII or binary with PHP?

5 answers
  • 2020-12-03 14:53

    This only works for PHP>=5.3.0, and isn't 100% reliable, but hey, it's pretty darn close.

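    // Returns true when Fileinfo reports a MIME type that starts with "text"
    // (the wrapper name is illustrative)
    function is_text_file($filename) {
        $finfo = finfo_open(FILEINFO_MIME);
        $mime = finfo_file($finfo, $filename);
        // finfo_file() returns false on failure, so guard before comparing
        return is_string($mime) && substr($mime, 0, 4) === 'text';
    }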
    

    http://us.php.net/manual/en/ref.fileinfo.php

  • 2020-12-03 14:56

    Since ASCII is just an encoding for text, with binary representation, not really. You could check that all bytes are less than 128, but even this wouldn't guarantee that it was intended to be decoded as ASCII. For all you know it's some crazy image format, or an entirely different text encoding that also has no use of all eight bits. It might suffice for your use, though. If you just want to check if a file is valid ASCII, even if it's not a "text file", it will definitely suffice.
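
    A minimal sketch of the "all bytes are less than 128" check described above. The function name `is_seven_bit` is made up for illustration, and the type declarations assume PHP 7 or later. Note that control bytes pass the check, which is exactly why it says nothing about whether the data was meant as text.

    ```php
    <?php
    // Hypothetical helper: true when every byte of $data is below 128.
    function is_seven_bit(string $data): bool {
        $length = strlen($data);
        for ($i = 0; $i < $length; ++$i) {
            if (ord($data[$i]) > 127) {
                return false;   // high bit set, so not ASCII
            }
        }
        return true;            // also true for an empty string
    }

    var_dump(is_seven_bit("plain text\n"));   // bool(true)
    var_dump(is_seven_bit("\x00\x01\x02"));   // bool(true): valid ASCII, hardly "text"
    var_dump(is_seven_bit("caf\xC3\xA9"));    // bool(false): UTF-8 bytes above 127
    ```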

  • 2020-12-03 15:04

    This approach seems to work fine in my project:

    function probably_binary($stringa) {
        $is_binary=false;
        $stringa=str_ireplace("\t","",$stringa);
        $stringa=str_ireplace("\n","",$stringa);
        $stringa=str_ireplace("\r","",$stringa);
        if(is_string($stringa) && ctype_print($stringa) === false){
            $is_binary=true;
        }
        return $is_binary;
    }
    

    PS: sorry, my first post, I wanted to add a comment to previous one :)

  • 2020-12-03 15:05

    You should probably check the file's mimetype, but if you're willing to load the file into memory, maybe you could check to see if the buffer consists of all-printable-characters using something like:

    <?php
    $probably_binary = (is_string($var) === true && ctype_print($var) === false);
    

    Not perfect, but might be helpful in some cases.
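
    For a file on disk, the one-liner could be wrapped roughly as follows. This is only a sketch: `file_probably_binary` is a hypothetical name, the nullable return type assumes PHP 7.1 or later, and the whole file is read into memory. It also shows one of the imperfections: a plain newline is enough to make the result "binary".

    ```php
    <?php
    // Hypothetical wrapper around the one-liner above; the name is illustrative.
    function file_probably_binary(string $path): ?bool {
        $contents = @file_get_contents($path);   // loads the whole file into memory
        if ($contents === false) {
            return null;                          // file could not be read
        }
        return ctype_print($contents) === false;
    }

    // Demo against a temporary file.
    $tmp = tempnam(sys_get_temp_dir(), 'chk');
    file_put_contents($tmp, 'hello world');
    var_dump(file_probably_binary($tmp));   // bool(false)
    file_put_contents($tmp, "hello\nworld");
    var_dump(file_probably_binary($tmp));   // bool(true): "\n" is not printable
    unlink($tmp);
    ```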

  • 2020-12-03 15:06

    In one of my older PHP projects I use ASCII / Binary compression. When the user uploads their file, they are required to specify if that file is ASCII or Binary. I decided to modify my code to have the server automatically decide what the file mode is, as relying on the user's decision could result in a failed compression. I decided my code has to be absolute, and not use tricks that would potentially cause my program to fail. I quickly whipped up some code, ran some speed tests and then decided to search the internet to see if there is a faster code example to complete this task.


    Devin's very vague answer relates to the first code I wrote to complete this task. The results were so-so. I found that searching byte for byte was in many cases faster for binary files. If you find a byte larger than 127, the rest of the file could be ignored and the entire file is considered a binary file. That being said, you would have to read every last byte of a file to determine if the file is ASCII. It appears faster for many binary files because a binary byte will likely come earlier than the very last byte of the file, sometimes even the very first byte would be binary.

    <?php
    $filemodes = array(
        -2 => 'Unreadable',
        -1 => 'Missing',
        0 => 'Empty',
        1 => 'ASCII',
        2 => 'Binary'
    );
    
    function filemode($filename) {
        if(is_file($filename)) {
            if(is_readable($filename)) {
                $size = filesize($filename);
                if($size === 0)
                    return 0; // Empty
                $handle = fopen($filename, 'rb');
                for($i = 0; $i < $size; ++$i) {
                    $byte = fread($handle, 1);
                    if(ord($byte) > 127) {
                        fclose($handle);
                        return 2; // Binary
                    }
                }
                fclose($handle);
                return 1; // ASCII
            }
            else
                return -2; // Unreadable
        }
        else
            return -1; // Missing
    }
    
    // ==========
    
    $filename = 'e:\test.txt';
    
    $loops = 1;
    $x = 0;
    $i = 0;
    $start = microtime(true);
    
    for($i = 0; $i < $loops; ++$i)
        $x = filemode($filename);
    
    $stop = microtime(true);
    $duration = $stop - $start;
    
    echo
        'Filename: ', $filename, "\n",
        'Filemode: ', $filemodes[filemode($filename)], "\n",
        'Duration: ', $duration;
    

    My processor isn't exactly modern, but I found that a 600 KB ASCII file took about 0.25 seconds to process. If I were to use this on hundreds or thousands of large files, it might take a very long time. I decided to try to speed things up by making my buffer larger than a single byte, reading the file in chunks instead of one byte at a time. Using chunks lets me process more of the file at once without loading too much into memory. If the file we're testing is huge and we load the entire file into memory, it could use up far too much memory and cause the program to fail.

    <?php
    $filemodes = array(
        -2 => 'Unreadable',
        -1 => 'Missing',
        0 => 'Empty',
        1 => 'ASCII',
        2 => 'Binary'
    );
    
    function filemode($filename) {
        if(is_file($filename)) {
            if(is_readable($filename)) {
                $size = filesize($filename);
                if($size === 0)
                    return 0; // Empty
                $buffer_size = 256;
                $chunks = ceil($size / $buffer_size);
                $handle = fopen($filename, 'rb');
                for($chunk = 0; $chunk < $chunks; ++$chunk) {
                    $buffer = fread($handle, $buffer_size);
                    $buffer_length = strlen($buffer);
                    for($byte = 0; $byte < $buffer_length; ++$byte) {
                        if(ord($buffer[$byte]) > 127) {
                            fclose($handle);
                            return 2; // Binary
                        }
                    }
                }
                fclose($handle);
                return 1; // ASCII
            }
            else
                return -2; // Unreadable
        }
        else
            return -1; // Missing
    }
    
    // ==========
    
    $filename = 'e:\test.txt';
    
    $loops = 1;
    $x = 0;
    $i = 0;
    $start = microtime(true);
    
    for($i = 0; $i < $loops; ++$i)
        $x = filemode($filename);
    
    $stop = microtime(true);
    $duration = $stop - $start;
    
    echo
        'Filename: ', $filename, "\n",
        'Filemode: ', $filemodes[filemode($filename)], "\n",
        'Duration: ', $duration;
    

    The difference in speed was fairly significant: 0.15 seconds instead of the 0.25 seconds of the previous function, a tenth of a second faster to read my 600 KB ASCII file.


    Now that I have my file in chunks, I thought it would be a good idea to find alternative ways to test my chunks for binary characters. My first thought was to use a regular expression to find non-ASCII characters.

    <?php
    $filemodes = array(
        -2 => 'Unreadable',
        -1 => 'Missing',
        0 => 'Empty',
        1 => 'ASCII',
        2 => 'Binary'
    );
    
    function filemode($filename) {
        if(is_file($filename)) {
            if(is_readable($filename)) {
                $size = filesize($filename);
                if($size === 0)
                    return 0; // Empty
                $buffer_size = 256;
                $chunks = ceil($size / $buffer_size);
                $handle = fopen($filename, 'rb');
                for($chunk = 0; $chunk < $chunks; ++$chunk) {
                    $buffer = fread($handle, $buffer_size);
                    if(preg_match('/[\x80-\xFF]/', $buffer) === 1) {
                        fclose($handle);
                        return 2; // Binary
                    }
                }
                fclose($handle);
                return 1; // ASCII
            }
            else
                return -2; // Unreadable
        }
        else
            return -1; // Missing
    }
    
    // ==========
    
    $filename = 'e:\test.txt';
    
    $loops = 1;
    $x = 0;
    $i = 0;
    $start = microtime(true);
    
    for($i = 0; $i < $loops; ++$i)
        $x = filemode($filename);
    
    $stop = microtime(true);
    $duration = $stop - $start;
    
    echo
        'Filename: ', $filename, "\n",
        'Filemode: ', $filemodes[filemode($filename)], "\n",
        'Duration: ', $duration;
    

    Amazing! 0.02 seconds to consider my 600 KB file an ASCII file, and this code appears to be 100% reliable.
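
    It may help to see what "ASCII" and "Binary" mean under this test. The sketch below runs the same byte-class pattern over a few hand-made sample strings (the samples are mine, not from the benchmark file). Because the pattern has no `u` modifier it works on raw bytes, so any text in UTF-8 or Latin-1 that contains a non-ASCII character counts as "Binary", while raw control bytes still count as "ASCII".

    ```php
    <?php
    // Same byte-class pattern as in the function above.
    $pattern = '/[\x80-\xFF]/';

    $samples = array(
        'plain ASCII'   => "Hello, world!\r\n",
        'control bytes' => "\x00\x07\x1B",
        'UTF-8 text'    => "caf\xC3\xA9",   // "café" encoded as UTF-8
        'Latin-1 text'  => "caf\xE9",       // "café" encoded as ISO-8859-1
    );

    foreach ($samples as $label => $bytes) {
        $has_high_byte = preg_match($pattern, $bytes) === 1;
        echo $label, ': ', $has_high_byte ? 'Binary' : 'ASCII', "\n";
    }
    // plain ASCII: ASCII
    // control bytes: ASCII
    // UTF-8 text: Binary
    // Latin-1 text: Binary
    ```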


    Now that I have arrived here, I have the opportunity to inspect several other methods deployed by other users.

    The most accepted answer today, written by davethegr8, uses the Fileinfo extension (finfo_open). First, I had to enable this extension in the php.ini file. Next, I tested the code against an actual ASCII file that has no file extension and a binary file that has no file extension.

    Here is how I created my two test files.

    <?php
    $handle = fopen('E:\ASCII', 'wb');
    for($i = 0; $i < 128; ++$i) {
        fwrite($handle, chr($i));
    }
    fclose($handle);
    
    $handle = fopen('E:\Binary', 'wb');
    for($i = 0; $i < 256; ++$i) {
        fwrite($handle, chr($i));
    }
    fclose($handle);
    

    Here is how I tested both files...

    <?php
    $filename = 'E:\ASCII';
    $finfo = finfo_open(FILEINFO_MIME);
    echo (substr(finfo_file($finfo, $filename), 0, 4) == 'text') ? 'ASCII' : 'Binary';
    

    Which outputs:

    Binary

    and...

    <?php
    $filename = 'E:\Binary';
    $finfo = finfo_open(FILEINFO_MIME);
    echo (substr(finfo_file($finfo, $filename), 0, 4) == 'text') ? 'ASCII' : 'Binary';
    

    Which outputs:

    Binary

    This code reports both my ASCII and binary files as binary, which is obviously incorrect, so I had to find out what causes the mimetype to be "text". My guess was that "text" means printable ASCII characters only, so I limited the range of my ASCII file.

    <?php
    $handle = fopen('E:\ASCII', 'wb');
    for($i = 32; $i < 127; ++$i) {
        fwrite($handle, chr($i));
    }
    fclose($handle);
    

    And tested it again.

    <?php
    $filename = 'E:\ASCII';
    $finfo = finfo_open(FILEINFO_MIME);
    echo (substr(finfo_file($finfo, $filename), 0, 4) == 'text') ? 'ASCII' : 'Binary';
    

    Which outputs:

    ASCII

    If I lower the range, it treats it as binary. If I increase the range, once again, it treats it as binary.

    So the most accepted answer does not tell you whether your file is ASCII, but rather whether it contains only readable text.
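
    To see what Fileinfo actually reports, rather than just the first four characters, one could print the full string for each byte range. This is a rough sketch that assumes the Fileinfo extension is enabled; the sample labels and the temporary file are my own choices, and the exact strings depend on the libmagic database bundled with the PHP build, so no expected output is shown.

    ```php
    <?php
    // Illustrative only: results vary with the bundled libmagic database.
    $finfo = finfo_open(FILEINFO_MIME);   // MIME type plus charset

    $tmp = tempnam(sys_get_temp_dir(), 'mime');
    $samples = array(
        'printable' => implode('', array_map('chr', range(32, 126))),
        'all ASCII' => implode('', array_map('chr', range(0, 127))),
        'all bytes' => implode('', array_map('chr', range(0, 255))),
    );

    foreach ($samples as $label => $bytes) {
        file_put_contents($tmp, $bytes);
        echo $label, ': ', finfo_file($finfo, $tmp), "\n";
    }
    unlink($tmp);
    ```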


    Lastly, I need to test the other answer, which uses ctype_print, against my files. I decided the easiest way to do this was to take the code I made and substitute in MarcoA's code.

    <?php
    $filemodes = array(
        -2 => 'Unreadable',
        -1 => 'Missing',
        0 => 'Empty',
        1 => 'ASCII',
        2 => 'Binary'
    );
    
    function filemode($filename) {
        if(is_file($filename)) {
            if(is_readable($filename)) {
                $size = filesize($filename);
                if($size === 0)
                    return 0; // Empty
                $buffer_size = 256;
                $chunks = ceil($size / $buffer_size);
                $handle = fopen($filename, 'rb');
                for($chunk = 0; $chunk < $chunks; ++$chunk) {
                    $buffer = fread($handle, $buffer_size);
                    $buffer = str_ireplace("\t", '', $buffer);
                    $buffer = str_ireplace("\n", '', $buffer);
                    $buffer = str_ireplace("\r", '', $buffer);
                    if(ctype_print($buffer) === false) {
                        fclose($handle);
                        return 2; // Binary
                    }
                }
                fclose($handle);
                return 1; // ASCII
            }
            else
                return -2; // Unreadable
        }
        else
            return -1; // Missing
    }
    
    // ==========
    
    $filename = 'e:\test.txt';
    
    $loops = 1;
    $x = 0;
    $i = 0;
    $start = microtime(true);
    
    for($i = 0; $i < $loops; ++$i)
        $x = filemode($filename);
    
    $stop = microtime(true);
    $duration = $stop - $start;
    
    echo
        'Filename: ', $filename, "\n",
        'Filemode: ', $filemodes[filemode($filename)], "\n",
        'Duration: ', $duration;
    

    Ouch! 0.2 seconds to tell me that my 600 KB file is ASCII. My large ASCII file, I know, contains visible ASCII characters only. It does seem to know that my binary files are binary. And my pure ASCII file... Binary!

    I decided to read the documentation for ctype_print and its return value is defined as:

    Returns TRUE if every character in text will actually create output (including blanks). Returns FALSE if text contains control characters or characters that do not have any output or control function at all.
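
    The documented behaviour is easy to check with a few literal strings. A small sketch, with sample strings of my own choosing:

    ```php
    <?php
    var_dump(ctype_print('Hello, world!'));   // bool(true)
    var_dump(ctype_print("Hello\tworld"));    // bool(false): tab is a control character
    var_dump(ctype_print("line\r\n"));        // bool(false): CR and LF are control characters
    var_dump(ctype_print(''));                // bool(false): an empty string is never printable
    ```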

    This function, like davethegr8's answer, only tells you if your text contains printable ASCII characters and does not tell you if your text is actually ASCII or not. That doesn't necessarily mean MarcoA is completely wrong; they are just not completely right. str_ireplace is slow compared to str_replace, and only replacing those three control characters to test ctype_print isn't enough to know if the string is ASCII or not. To make this example work for ASCII, we must replace every control character!

    <?php
    $filemodes = array(
        -2 => 'Unreadable',
        -1 => 'Missing',
        0 => 'Empty',
        1 => 'ASCII',
        2 => 'Binary'
    );
    
    function filemode($filename) {
        if(is_file($filename)) {
            if(is_readable($filename)) {
                $size = filesize($filename);
                if($size === 0)
                    return 0; // Empty
                $buffer_size = 256;
                $chunks = ceil($size / $buffer_size);
                $replace = array(
                    "\x00", "\x01", "\x02", "\x03",
                    "\x04", "\x05", "\x06", "\x07",
                    "\x08", "\x09", "\x0A", "\x0B",
                    "\x0C", "\x0D", "\x0E", "\x0F",
                    "\x10", "\x11", "\x12", "\x13",
                    "\x14", "\x15", "\x16", "\x17",
                    "\x18", "\x19", "\x1A", "\x1B",
                    "\x1C", "\x1D", "\x1E", "\x1F",
                    "\x7F"
                );
                $handle = fopen($filename, 'rb');
                for($chunk = 0; $chunk < $chunks; ++$chunk) {
                    $buffer = fread($handle, $buffer_size);
                    $buffer = str_replace($replace, '', $buffer);
                    // ctype_print('') is false, so skip chunks that held only control characters
                    if($buffer !== '' && ctype_print($buffer) === false) {
                        fclose($handle);
                        return 2; // Binary
                    }
                }
                fclose($handle);
                return 1; // ASCII
            }
            else
                return -2; // Unreadable
        }
        else
            return -1; // Missing
    }
    

    This took 0.04 seconds to tell me that my 600 KB file is ASCII.


    I believe all of this testing hasn't been completely useless, as it did give me one more idea. Why not add a printable filemode to my original function? While it does seem to be 0.018 seconds slower on my 600 KB printable ASCII file, here it is.

    <?php
    $filemodes = array(
        -2 => 'Unreadable',
        -1 => 'Missing',
        0 => 'Empty',
        1 => 'Printable',
        2 => 'ASCII',
        3 => 'Binary'
    );
    
    function filemode($filename) {
        if(is_file($filename)) {
            if(is_readable($filename)) {
                $size = filesize($filename);
                if($size === 0)
                    return 0; // Empty
                $printable = true;
                $buffer_size = 256;
                $chunks = ceil($size / $buffer_size);
                $handle = fopen($filename, 'rb');
                for($chunk = 0; $chunk < $chunks; ++$chunk) {
                    $buffer = fread($handle, $buffer_size);
                    if(preg_match('/[\x80-\xFF]/', $buffer) === 1) {
                        fclose($handle);
                        return 3; // Binary
                    }
                    else
                        if($printable === true)
                            $printable = ctype_print($buffer);
                }
                fclose($handle);
                return $printable === true ? 1 : 2; // Printable or ASCII
            }
            else
                return -2; // Unreadable
        }
        else
            return -1; // Missing
    }
    
    // ==========
    
    $filename = 'e:\test.txt';
    
    $loops = 1;
    $x = 0;
    $i = 0;
    $start = microtime(true);
    
    for($i = 0; $i < $loops; ++$i)
        $x = filemode($filename);
    
    $stop = microtime(true);
    $duration = $stop - $start;
    
    echo
        'Filename: ', $filename, "\n",
        'Filemode: ', $filemodes[filemode($filename)], "\n",
        'Duration: ', $duration;
    

    I also tested ctype_print against a regular expression and found ctype_print to be a bit faster.

    $printable = preg_match('/[^\x20-\x7E]/', $buffer) === 0;
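
    A comparison like this could be timed with a small harness along the following lines. It is only a sketch: `time_check`, the sample buffer and the iteration count are all made up for illustration, and the timings will differ from machine to machine, so none are shown.

    ```php
    <?php
    // Hypothetical timing helper; runs $check on $buffer $iterations times.
    function time_check(callable $check, string $buffer, int $iterations): float {
        $start = microtime(true);
        for ($i = 0; $i < $iterations; ++$i) {
            $check($buffer);
        }
        return microtime(true) - $start;
    }

    // Arbitrary printable sample, roughly 4.5 KB.
    $buffer = str_repeat('The quick brown fox jumps over the lazy dog. ', 100);

    $checks = array(
        'ctype_print' => function ($b) { return ctype_print($b); },
        'preg_match'  => function ($b) { return preg_match('/[^\x20-\x7E]/', $b) === 0; },
    );

    foreach ($checks as $name => $check) {
        printf("%-12s %.4f s\n", $name, time_check($check, $buffer, 10000));
    }
    ```

    One difference to keep in mind when swapping one check for the other: on an empty buffer the pattern finds no offending byte and reports printable, whereas ctype_print returns false.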
    

    Here is my final function where finding printable text is optional, as is the buffer size.

    <?php
    const filemodes = array(
        -2 => 'Unreadable',
        -1 => 'Missing',
        0 => 'Empty',
        1 => 'Printable',
        2 => 'ASCII',
        3 => 'Binary'
    );
    
    function filemode($filename, $printable = false, $buffer_size = 256) {
        if(is_bool($printable) === false || is_int($buffer_size) === false)
            return false;
        $buffer_size = floor($buffer_size);
        if($buffer_size <= 0)
            return false;
        if(is_file($filename)) {
            if(is_readable($filename)) {
                $size = filesize($filename);
                if($size === 0)
                    return 0; // Empty
                if($buffer_size > $size)
                    $buffer_size = $size;
                $chunks = ceil($size / $buffer_size);
                $handle = fopen($filename, 'rb');
                for($chunk = 0; $chunk < $chunks; ++$chunk) {
                    $buffer = fread($handle, $buffer_size);
                    if(preg_match('/[\x80-\xFF]/', $buffer) === 1) {
                        fclose($handle);
                        return 3; // Binary
                    }
                    else
                        if($printable === true)
                            $printable = ctype_print($buffer);
                }
                fclose($handle);
                return $printable === true ? 1 : 2; // Printable or ASCII
            }
            else
                return -2; // Unreadable
        }
        else
            return -1; // Missing
    }
    
    // ==========
    
    $filename = 'e:\test.txt';
    echo
        'Filename: ', $filename, "\n",
        'Filemode: ', filemodes[filemode($filename, true)], "\n";
    