PHP and the million array baby

一个人的身影 2021-02-19 19:37

Imagine you have the following array of integers:

array(1, 2, 1, 0, 0, 1, 2, 4, 3, 2, [...] );

The integers go on up to one million entries; on

4 Answers
  • 2021-02-19 20:12

    "I can't randomly generate it every time because it should be consistent and always have the same values at the same indexes."

    Have you ever read up on pseudo-random numbers? There's this little thing called a seed which addresses this issue.
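
    To make the seed idea concrete, here is a minimal sketch. The function name generate_ints(), the seed 12345, and the value range 0..4 are assumptions for illustration, not part of the question. Note that the sequence mt_rand() produces for a given seed can differ between PHP versions, so this is only consistent as long as the runtime stays the same.

    ```php
    <?php
    // Sketch: regenerate the same pseudo-random array on every run by fixing the seed.
    // generate_ints(), the seed, and the 0..4 range are hypothetical choices.
    function generate_ints(int $count, int $seed): array
    {
        mt_srand($seed);            // same seed => same sequence from mt_rand()
        $ints = [];
        for ($i = 0; $i < $count; ++$i) {
            $ints[] = mt_rand(0, 4);
        }
        return $ints;
    }

    $a = generate_ints(1000, 12345);
    $b = generate_ints(1000, 12345);
    var_dump($a === $b); // bool(true)
    ```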

    Also, benchmark your options and claims. Have you timed file_get_contents() against json_decode()? There is a trade-off here between storage cost and access cost. E.g. if your numbers are 0..9 (or 0..255), it may be easier to store them one byte each in a string of about 1 MB and read them through an access function. A string that small will load faster, whether from the filesystem or from APC.
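
    A rough way to measure this could look like the sketch below. The function name benchmark_formats(), the temporary file locations, and the placeholder data (values 0..9) are all assumptions; actual timings depend on the machine and PHP build, so none are shown.

    ```php
    <?php
    // Sketch: time reading a JSON file, decoding it, and reading a plain digit string.
    // benchmark_formats() and the temp files are hypothetical, for illustration only.
    function benchmark_formats(int $count): array
    {
        $ints = [];
        for ($i = 0; $i < $count; ++$i) {
            $ints[] = $i % 10;                  // placeholder data, values 0..9
        }

        $jsonFile = tempnam(sys_get_temp_dir(), 'ints_json_');
        $rawFile  = tempnam(sys_get_temp_dir(), 'ints_raw_');
        file_put_contents($jsonFile, json_encode($ints));
        file_put_contents($rawFile, implode('', $ints)); // one ASCII digit per entry

        $t0 = microtime(true);
        $json = file_get_contents($jsonFile);
        $t1 = microtime(true);
        $decoded = json_decode($json);
        $t2 = microtime(true);
        $digits = file_get_contents($rawFile);
        $t3 = microtime(true);

        unlink($jsonFile);
        unlink($rawFile);

        return [
            'read_json'   => $t1 - $t0,
            'decode_json' => $t2 - $t1,
            'read_string' => $t3 - $t2,
            // The "access function" for the string form is just a cast of one character.
            'same_value'  => $decoded[7] === (int) $digits[7],
        ];
    }

    print_r(benchmark_formats(100000));
    ```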

  • 2021-02-19 20:14

    Say the integers are all 0-15. Then you can store 2 per byte:

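    <?php
    // Build a 500,000-byte string; each byte holds two 4-bit integers (0-15).
    $data = '';
    for ($i = 0; $i < 500000; ++$i) {
      $data .= chr(mt_rand(0, 255)); // one random byte = two random nibbles
    }

    // Write the serialized string to stdout so it can be redirected to a file.
    echo serialize($data);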
    

    To run: php ints.php > ints.ser

    Now you have a file with a 500000 byte string containing 1,000,000 random integers from 0 to 15.

    To load:

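    <?php
    $data = unserialize(file_get_contents('ints.ser'));

    // Return the 4-bit integer stored at logical index $i.
    function get_data_at($data, $i)
    {
      $byte = ord($data[$i >> 1]); // two integers per byte, so the byte index is $i / 2

      // Odd indexes live in the low nibble, even indexes in the high nibble.
      return ($i & 1) ? $byte & 0xf : $byte >> 4;
    }

    // Print the first 1,000 integers.
    for ($i = 0; $i < 1000; ++$i)
      echo get_data_at($data, $i), "\n";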
    

    The loading time on my machine is about 0.002 seconds.

    Of course this might not be directly applicable to your situation, but it will be much faster than a bloated PHP array of a million entries. Quite frankly, having an array that large in PHP is never the proper solution.

    I'm not saying this is the proper solution either, but it definitely is workable if it fits your parameters.
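
    The generator above fills the string with random bytes. To pack an existing array instead, something like the following sketch could work. pack_nibbles() and unpack_nibble() are hypothetical names; the code assumes a 0-indexed list of integers in the 0-15 range and uses the same layout as get_data_at() (even index in the high nibble).

    ```php
    <?php
    // Sketch: pack an existing list of 0..15 integers, two per byte.
    // Assumes consecutive 0-based keys; values are masked to 4 bits.
    function pack_nibbles(array $ints): string
    {
        $count = count($ints);
        $data = '';
        for ($i = 0; $i < $count; $i += 2) {
            $high = $ints[$i] & 0xf;
            $low  = ($i + 1 < $count) ? ($ints[$i + 1] & 0xf) : 0; // pad an odd tail with 0
            $data .= chr(($high << 4) | $low);
        }
        return $data;
    }

    // Read one integer back; same logic as get_data_at().
    function unpack_nibble(string $data, int $i): int
    {
        $byte = ord($data[$i >> 1]);
        return ($i & 1) ? $byte & 0xf : $byte >> 4;
    }

    $packed = pack_nibbles([1, 2, 1, 0, 0, 1, 2, 4, 3, 2]);
    echo strlen($packed), "\n";           // 5
    echo unpack_nibble($packed, 7), "\n"; // 4
    ```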

    Note that if your array had integers in the 0-255 range, you could get rid of the packing and just access the data as ord($data[$i]). In that case, your string would be 1M bytes long.

    Finally, according to the documentation of file_get_contents(), PHP will use memory mapping for the file if the OS supports it. If so, your best performance would be to dump raw bytes to a file, and use it like:

    $ints = file_get_contents('ints.raw');
    echo ord($ints[25]);
    

    This assumes that ints.raw is exactly one million bytes long.
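
    Producing such a file from an array could be sketched as follows. write_raw_ints() and the temporary path are assumptions for illustration; the values must fit in 0-255.

    ```php
    <?php
    // Sketch: write a list of 0..255 integers as raw bytes, one byte per value.
    // write_raw_ints() is a hypothetical helper, not an existing PHP function.
    function write_raw_ints(array $ints, string $path): void
    {
        $bytes = '';
        foreach ($ints as $value) {
            if ($value < 0 || $value > 255) {
                throw new InvalidArgumentException("Value out of byte range: $value");
            }
            $bytes .= chr($value);
        }
        if (file_put_contents($path, $bytes) === false) {
            throw new RuntimeException("Could not write $path");
        }
    }

    $path = tempnam(sys_get_temp_dir(), 'ints_'); // stand-in for ints.raw
    write_raw_ints([1, 2, 1, 0, 0, 1, 2, 4, 3, 2], $path);

    $ints = file_get_contents($path);
    echo ord($ints[7]), "\n"; // 4
    unlink($path);
    ```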

  • 2021-02-19 20:20

    APC stores the data serialized, so it has to be unserialized as it is loaded back from APC. That's where your overhead is.

    The most efficient way of loading it is to write to file as PHP and include(), but you're never going to have any level of efficiency with an array containing a million elements... it takes an enormous amount of memory, and it takes time to load. This is why databases were invented, so what is your problem with a database?
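
    For reference, the write-as-PHP-and-include() approach might look like this sketch. export_array() and the temporary path are hypothetical; an opcode cache, if one is enabled, can also keep the compiled file around between requests.

    ```php
    <?php
    // Sketch: dump the array as PHP source that returns it, then include() the file.
    // export_array() is a hypothetical helper name.
    function export_array(array $ints, string $path): void
    {
        $code = '<?php return ' . var_export($ints, true) . ';';
        if (file_put_contents($path, $code) === false) {
            throw new RuntimeException("Could not write $path");
        }
    }

    $path = tempnam(sys_get_temp_dir(), 'ints_export_'); // assumed location
    export_array([1, 2, 1, 0, 0, 1, 2, 4, 3, 2], $path);

    $ints = include $path; // the included file returns the array
    echo $ints[7], "\n";   // 4
    unlink($path);
    ```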

    EDIT

    If you want to speed up serialize/deserialize, take a look at the igbinary extension.
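
    A minimal sketch of swapping it in, assuming the igbinary extension may or may not be installed: the code falls back to the built-in serialize()/unserialize() when it is not loaded, so the printed size depends on the environment.

    ```php
    <?php
    // Sketch: use igbinary when the extension is loaded, else the built-in serializer.
    $useIgbinary = extension_loaded('igbinary');
    $encode = $useIgbinary ? 'igbinary_serialize' : 'serialize';
    $decode = $useIgbinary ? 'igbinary_unserialize' : 'unserialize';

    $ints = [1, 2, 1, 0, 0, 1, 2, 4, 3, 2];
    $blob = $encode($ints);   // binary string suitable for a file or a cache
    $copy = $decode($blob);

    var_dump($copy === $ints); // bool(true)
    echo ($useIgbinary ? 'igbinary' : 'serialize'), ': ', strlen($blob), " bytes\n";
    ```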

  • 2021-02-19 20:21

    As Mark said, this is why databases were created: to allow you to search (and manipulate, though you might not need that) data effectively based on your regular usage patterns. A database might also be faster than implementing your own search over the array. I'm guessing we're talking about somewhere close to 200-300 MB of data (before serialization) being serialized and unserialized each time you access the array.

    If you want to speed it up, try to assign each element of the array separately - you might trade function call overhead for time spent in serialization. You could also extend this with your own extension, wrapping your dataset in a small retrieval interface.

    I'm guessing the reason you can't store the zvals directly is that they contain internal state, so you can't simply point the variable symbol table at the previous table.
