How to get the number of pages in a Word document on Linux?

醉话见心 2020-12-06 11:42

I saw the question "PHP - Get number of pages in a Word document". I also need to determine the page count of a given Word file (doc/docx). I tried to investigate phplivedo

4 answers
  • 2020-12-06 12:03

    Getting the number of pages for docx files is very easy:

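    function get_num_pages_docx($filename)
    {
        $zip = new ZipArchive();

        // A .docx file is a zip archive; open() returns true on success
        if($zip->open($filename) === true)
        {
            // docProps/app.xml holds the extended properties, including <Pages>
            if(($index = $zip->locateName('docProps/app.xml')) !== false)
            {
                $data = $zip->getFromIndex($index);
                $zip->close();

                $xml = new SimpleXMLElement($data);

                // <Pages> is the count saved by the application that last
                // wrote the file; cast it so callers get an integer
                return (int) $xml->Pages;
            }

            $zip->close();
        }

        return false;
    }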
    

    For the 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation stream of the document, but the OLE format of the files makes it a pain to find. The structure is defined extremely thoroughly (though badly, imo) here and more simply here. I looked at this for an hour today but didn't get very far (it's not a level of abstraction I'm used to), so I output the hex to better understand the structure:

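    function get_num_pages_doc($filename)
    {
        // Despite the name, this only prints a hex dump of the file and exits
        $data = file_get_contents($filename);
        if($data === false)
        {
            return false;
        }

        echo '<div style="font-family: courier new;">';

            // Split the hex string into groups of 4 hex digits (2 bytes)
            $hex_array = str_split(bin2hex($data), 4);
            $i = 0;
            $line = 0;
            $collection = '';
            foreach($hex_array as $key => $string)
            {
                $collection .= hex_ascii($string);
                $i++;

                // Start of a row: print the offset (16 bytes per row)
                if($i == 1)
                {
                    echo '<b>'.sprintf('%05X', $line).'0:</b> ';
                }

                echo strtoupper($string).' ';

                // End of a row: print the ASCII column
                if($i == 8)
                {
                    echo ' '.$collection.' <br />'."\n";
                    $collection = '';
                    $i = 0;

                    $line += 1;
                }
            }

            // Flush a final, partially filled row
            if($i > 0)
            {
                echo ' '.$collection.' <br />'."\n";
            }

        echo '</div>';

        exit();
    }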
    
    function hex_ascii($string, $html_safe = true)
    {
        $return = '';
    
        $conv = array($string);
        if(strlen($string) > 2)
        {
            $conv = str_split($string, 2);
        }
    
        foreach($conv as $string)
        {
            $num = hexdec($string);
    
            $ascii = '.';
            if($num > 32)
            {   
                $ascii = unichr($num);
            }
    
            if($html_safe AND ($num == 62 OR $num == 60 OR $num == 38)) // escape >, < and &
            {
                $return .= htmlentities($ascii);
            }
            else
            {
                $return .= $ascii;
            }
        }
    
        return $return;
    }
    
    function unichr($intval)
    {
        return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
    }
    

    This outputs a hex dump in which you can find sections such as:

    007000: 0500 5300 7500 6D00 6D00 6100 7200 7900 ..S.u.m.m.a.r.y.
    007010: 4900 6E00 6600 6F00 7200 6D00 6100 7400 I.n.f.o.r.m.a.t.
    007020: 6900 6F00 6E00 0000 0000 0000 0000 0000 i.o.n...........
    007030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 
    

    Right after the name you can see the rest of the directory entry:

    007040: 2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF (...ÿÿÿÿÿÿÿÿÿÿÿÿ
    007050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
    007060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
    007070: 0000 0000 2500 0000 0010 0000 0000 0000 ....%...........
    

    From these bytes you can work out the directory entry fields (multi-byte values are little-endian in the dump):

    _ab = ("SummaryInformation") 
    _cb = 0028
    _mse = 02 (STGTY_STREAM) 
    _bflags = 01 (DE_BLACK) 
    _sidLeftSib = FFFF FFFF 
    _sidRightSib = FFFF FFFF (none) 
    _sidChild = FFFF FFFF (n/a for STGTY_STREAM) 
    _clsid = 0000 0000 0000 0000 0000 0000 0000 0000 (n/a) 
    _dwUserFlags = 0000 0000 (n/a) 
    _time[0] = CreateTime = 0000 0000 0000 0000 (n/a) 
    _time[1] = ModifyTime = 0000 0000 0000 0000 (n/a)
    _startSect = 0000 0025 
    _ulSize = 0000 1000 
    _dptPropType = 0000 (n/a)
    

    The _startSect and _ulSize fields tell you where the stream starts and how long it is, so you can find the relevant section of the file, unpack it and get the page count. Of course this is the hard bit that I just don't have time for, but it should set you in the right direction.
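
    A rough sketch of that last step, under loud assumptions: instead of walking the OLE sector chain, it searches the raw bytes for the SummaryInformation FMTID and assumes the whole property stream is stored contiguously, which is not guaranteed. The function names find_doc_page_count and read_u32 are made up for illustration; the FMTID, the page-count property id (0x0E) and the VT_I4 type code come from the OLE property set format.

```php
<?php
// FMTID_SummaryInformation {F29F85E0-4FF9-1068-AB91-08002B27B3D9} as stored on disk.
const SUMMARY_FMTID = "\xE0\x85\x9F\xF2\xF9\x4F\x68\x10\xAB\x91\x08\x00\x2B\x27\xB3\xD9";
const PIDSI_PAGECOUNT = 0x0E; // property id of the page count
const VT_I4 = 3;              // 32-bit integer property type

// Read a little-endian uint32 at $pos, or null when out of range.
function read_u32(string $data, int $pos): ?int
{
    if ($pos < 0 || $pos + 4 > strlen($data)) {
        return null;
    }
    return unpack('V', substr($data, $pos, 4))[1];
}

// Heuristic: returns the stored page count, or false if it cannot be found.
// ASSUMPTION: the property stream is contiguous in the file.
function find_doc_page_count(string $data)
{
    $fmtidPos = strpos($data, SUMMARY_FMTID);
    if ($fmtidPos === false || $fmtidPos < 28) {
        return false;
    }

    // The FMTID follows a 28-byte stream header that starts with FE FF.
    $streamStart = $fmtidPos - 28;
    if (substr($data, $streamStart, 2) !== "\xFE\xFF") {
        return false;
    }

    // The section offset (relative to the stream start) follows the FMTID.
    $sectionOffset = read_u32($data, $fmtidPos + 16);
    if ($sectionOffset === null) {
        return false;
    }
    $section = $streamStart + $sectionOffset;

    // Section layout: size, property count, then (id, offset) pairs.
    $count = read_u32($data, $section + 4);
    if ($count === null) {
        return false;
    }

    for ($i = 0; $i < $count; $i++) {
        $entry  = $section + 8 + 8 * $i;
        $id     = read_u32($data, $entry);
        $offset = read_u32($data, $entry + 4);
        if ($id === null || $offset === null) {
            return false;
        }
        if ($id !== PIDSI_PAGECOUNT) {
            continue;
        }

        // Each property is a type code followed by its value.
        $type  = read_u32($data, $section + $offset);
        $value = read_u32($data, $section + $offset + 4);
        if ($type === null || $value === null || ($type & 0xFFFF) !== VT_I4) {
            return false;
        }
        return $value;
    }

    return false;
}
```

    You would call it as find_doc_page_count(file_get_contents('test.doc')), where test.doc is a placeholder file name. A robust reader would follow the FAT chain from _startSect rather than scanning for the FMTID.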

    M$ don't make it easy!

  • 2020-12-06 12:18

    Excluding Abiword or OpenOffice? Impossible - the number of pages depends on the number of words/letters, fonts used, justification and kerning, margin size, line spacing, paragraph spacing, number of paragraphs, columns, size of graphics / embedded objects, and page / column breaks.

    You need something which can understand all of these.

    Even if you use OpenOffice or Abiword, reflowing the text may change the number of pages. Indeed, in some cases opening the same document on a different instance of MSWord may result in a difference.

    The best you could probably manage would be a statistical approach based on a representation of the document - but you'll still see huge variance.

  • 2020-12-06 12:20

    Have a look at PhpWord on Microsoft CodePlex: http://phpword.codeplex.com/

    It will allow you to open and read the word formatted file in PHP and do whatever processing you require.

  • 2020-12-06 12:20

    To get metadata properties of doc, docx, ppt and pptx files, such as the number of pages or slides, using PHP, I followed the process below and it worked like a charm. I hope it helps someone.

    Download and configure Apache Tika.
    

    Once that's done, you can try executing the following commands; each prints all the metadata about the given file:

    java -jar tika-app-1.5.jar -m test.docx
    java -jar tika-app-1.5.jar -m test.doc
    java -jar tika-app-1.5.jar -m test.pptx
    java -jar tika-app-1.5.jar -m test.ppt
    

    Once tested, you can execute this command from a PHP script. Thanks.
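
    A minimal sketch of running that command from PHP, assuming java is on the PATH and the jar sits in the working directory. The function names and the metadata keys checked (xmpTPg:NPages, Page-Count) are assumptions; look at the real output of -m for your Tika version and adjust the key list.

```php
<?php
// Parse "key: value" lines, as printed by the -m option, into an array.
function parse_tika_metadata(string $output): array
{
    $meta = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first ": " so keys such as "xmpTPg:NPages" stay intact.
        $parts = explode(': ', $line, 2);
        if (count($parts) === 2) {
            $meta[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $meta;
}

// Run Tika on $file and return the page count, or false if none is reported.
// ASSUMPTION: the key names in $keys match what your Tika version prints.
function tika_page_count(
    string $file,
    string $jar = 'tika-app-1.5.jar',
    array $keys = ['xmpTPg:NPages', 'Page-Count']
) {
    $cmd = 'java -jar ' . escapeshellarg($jar) . ' -m ' . escapeshellarg($file);
    $output = shell_exec($cmd);
    if (!is_string($output)) {
        return false;
    }

    $meta = parse_tika_metadata($output);
    foreach ($keys as $key) {
        if (isset($meta[$key]) && ctype_digit($meta[$key])) {
            return (int) $meta[$key];
        }
    }
    return false;
}
```

    For example, tika_page_count('test.docx') runs the first command above. Starting a JVM per file is slow, so for many files it may be worth batching.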
