What are some practical uses of the PHP tokenizer?

Submitted by 孤者浪人 on 2019-12-02 20:29:06

I use PHP_CodeSniffer for coding style compliance, which is built on the tokeniser. Also, some frameworks (e.g. Symfony 2) use the tokeniser to generate cache files or intermediate class files of PHP code. It's also possible to use the tokeniser to build a source code formatter or syntax highlighter.

Basically, anywhere you use PHP code as data you can use the tokeniser. It's much more reliable than trying to parse PHP code with regular expressions or other string processing functions.
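To illustrate the reliability point, here is a minimal sketch that collects the variables used in a snippet. The sample input and the function name `findVariables` are made up for illustration. A naive regular expression would also report `$fake`, which only appears inside a comment and a single-quoted string; the tokenizer classifies those as a comment and a string literal, so they are skipped.

```php
<?php
// Hypothetical input: "$fake" appears only in a comment and in a string.
$code = <<<'CODE'
<?php
// $fake is only mentioned in a comment
$real = 'not a $fake variable';
echo $real;
CODE;

/**
 * Returns the distinct variable names that appear as real variables.
 */
function findVariables(string $code): array
{
    $names = [];
    foreach (token_get_all($code) as $token) {
        // Multi-character tokens are arrays of [id, text, line];
        // single characters such as ";" are plain strings.
        if (is_array($token) && $token[0] === T_VARIABLE) {
            $names[$token[1]] = true;
        }
    }
    return array_keys($names);
}

print_r(findVariables($code));
```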

NikiC

I personally have already used it to build a PHP sandbox, which tries to create a more secure environment for executing PHP scripts.

Furthermore, I did loads of experiments with preprocessing PHP; e.g. I have an (incomplete) PHP 5.3 emulator for PHP 5.2 called prephp.

And many other similar tools, like source code analyzers (for security auditing, code style analysis, ...) use the Tokenizer as well.

But the Tokenizer can be handy for smaller things too, not just large-scale code analyzers. For example, if you are accepting a PHP array and want to check that it's not malicious, you can do so using the Tokenizer.
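A minimal sketch of that kind of check, assuming the input is supposed to be a literal written in short array syntax (`[...]`) that contains only strings, numbers and `true`/`false`/`null`. The function name `containsOnlyLiteralTokens` is made up for illustration. It whitelists token types and rejects everything else, parentheses included, so a function call cannot get through. Treat it as a starting point rather than a vetted security control.

```php
<?php
/**
 * Returns true if $input tokenizes to nothing but brackets, commas,
 * "=>", minus signs, scalar literals and true/false/null.
 */
function containsOnlyLiteralTokens(string $input): bool
{
    $allowedIds = [
        T_OPEN_TAG, T_WHITESPACE, T_CONSTANT_ENCAPSED_STRING,
        T_LNUMBER, T_DNUMBER, T_DOUBLE_ARROW,
    ];
    $allowedChars = ['[', ']', ',', '-'];
    $allowedWords = ['true', 'false', 'null'];

    // Prefix an open tag so the input is tokenized as PHP, not inline HTML.
    foreach (token_get_all('<?php ' . $input) as $token) {
        if (is_string($token)) {
            // Single-character token: "(" and friends are not whitelisted.
            if (!in_array($token, $allowedChars, true)) {
                return false;
            }
        } elseif ($token[0] === T_STRING) {
            // Bare words are only accepted if they are true, false or null.
            if (!in_array(strtolower($token[1]), $allowedWords, true)) {
                return false;
            }
        } elseif (!in_array($token[0], $allowedIds, true)) {
            return false;
        }
    }
    return true;
}

var_dump(containsOnlyLiteralTokens("['a' => 1, 'b' => [true, -2.5]]"));
var_dump(containsOnlyLiteralTokens("['a' => system('ls')]"));
```

Note that the check says nothing about whether the brackets are balanced or whether the input is an array at all; it only guarantees that no token outside the whitelist is present.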

PS: Currently I am switching to actually parsing the PHP, instead of just tokenizing it, using a PHP parser written in PHP I recently published (it works, but isn't really practically usable yet).

Pekka supports GoFundMonica

Interesting question.

I have not used the tokenizer in any production projects myself yet, but there are several questions on Stack Overflow to which the tokenizer is the (or at least, one) correct answer.

A pretty basic use is for syntax highlighting.

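foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        // Named token: [id, text, line]. Use the token name as the CSS class.
        $class = token_name($token[0]);
        $text = $token[1];
    } else {
        // Single-character token such as ";" or "{".
        $class = 'T_RAW';
        $text = $token;
    }
    // Escape the text so "<", ">" and "&" in the source do not break the HTML.
    echo '<span class="' . $class . '">' . htmlspecialchars($text) . '</span>';
}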

The token numbers are usually converted into nicer CSS class names of course, but you could just craft a stylesheet with only .T_COMMENT, .T_ARRAY, .T_ELSEIF, .T_FUNCTION ... classes.

I've used the tokenizer to find the cyclomatic complexity number and some other code metrics of a callback:

if ((isset($reflection) === true) && ($reflection->getFileName() !== false))
{
    if (($source = file($reflection->getFileName(), FILE_IGNORE_NEW_LINES)) !== false)
    {
        $source = implode("\n", array_slice($source, $reflection->getStartLine() - 1, $reflection->getEndLine() - ($reflection->getStartLine() - 1)));
        $result[$key]['source'] = array
        (
            'ccn' => 1,
            'statements' => 0,
            'lines' => array
            (
                'logical' => array(),
                'physical' => substr_count($source, "\n"),
            ),
        );

        if (is_array($tokens = token_get_all(sprintf('<?php %s ?>', $source))) === true)
        {
            $points = array_map('constant', array_filter(array
            (
                'T_BOOLEAN_AND',
                'T_BOOLEAN_OR',
                'T_CASE',
                'T_CATCH',
                'T_ELSEIF',
                'T_FINALLY',
                'T_FOR',
                'T_FOREACH',
                'T_GOTO',
                'T_IF',
                'T_LOGICAL_AND',
                'T_LOGICAL_OR',
                'T_LOGICAL_XOR',
                'T_WHILE',
            ), 'defined'));

            foreach ($tokens as $token)
            {
                if (is_array($token) === true)
                {
                    if ((in_array($token[0], array(T_CLOSE_TAG, T_COMMENT, T_DOC_COMMENT, T_INLINE_HTML, T_OPEN_TAG), true) !== true) && (strlen(trim($token[1])) > 0))
                    {
                        if (in_array($token[0], $points, true) === true)
                        {
                            ++$result[$key]['source']['ccn'];
                        }

                        array_push($result[$key]['source']['lines']['logical'], $token[2]);
                    }
                }

                else if (strncmp($token, '?', 1) === 0)
                {
                    ++$result[$key]['source']['ccn'];
                }

                else if (strncmp($token, ';', 1) === 0)
                {
                    ++$result[$key]['source']['statements'];
                }
            }

            $result[$key]['source']['lines']['logical'] = max(0, count(array_unique($result[$key]['source']['lines']['logical'])) - 1);
        }
    }
}

A friend of mine has written Überloader (a brute-force autoloader for PHP 5), which uses the tokenizer when it indexes class files. Its _check_file() method will be of particular interest to you.

Überloader is designed for legacy projects that have not planned or thought about their class naming conventions or file structures.

I use the class every day in legacy projects that I am fixing up or renovating.
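To give an idea of what indexing class files with the tokenizer can look like, here is a rough sketch. It is not Überloader's actual code, and the function name `findDeclaredClasses` is made up for illustration. It assumes you only care about named classes and ignores namespaces, interfaces and traits. It skips `Foo::class` constants and anonymous classes, which also contain the `class` keyword.

```php
<?php
/**
 * Returns the names of the classes declared in $code (namespaces ignored).
 */
function findDeclaredClasses(string $code): array
{
    $classes = [];
    $previous = null;     // id of the last significant token seen
    $expectName = false;  // true right after a "class" keyword

    foreach (token_get_all($code) as $token) {
        if (!is_array($token)) {
            // Single-character token such as "{": no class name follows,
            // e.g. because this is an anonymous class.
            $expectName = false;
            $previous = null;
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_WHITESPACE || $id === T_COMMENT || $id === T_DOC_COMMENT) {
            continue;
        }
        if ($expectName) {
            if ($id === T_STRING) {
                $classes[] = $text;
            }
            $expectName = false;
        } elseif ($id === T_CLASS && $previous !== T_DOUBLE_COLON) {
            // A real declaration, not the Foo::class constant.
            $expectName = true;
        }
        $previous = $id;
    }
    return $classes;
}
```

An autoloader could run this over every file in a project once and store a map from class name to file path.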

KillerX

From a comment in the PHP manual:

The tokenizer functions are quite powerful. For example, you can retrieve all of the methods in a given class using an algorithm like:

for each token:
    if token is T_FUNCTION then start buffer
    if buffer is started then add the current string to the buffer
    if token is ( stop buffer

And the great thing is that the class methods will have the right case, so it's a good way to get around the limitation of get_class_methods returning lowercase method names. Also, since a similar algorithm lets you read the arguments of a function, you can implement Reflection-like functionality in PHP4.
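The algorithm above could be sketched roughly like this in current PHP. The function name `findFunctionNames` is made up for illustration, and instead of a buffer it simply takes the first T_STRING after T_FUNCTION. It does not handle `use function` imports or methods named after reserved keywords.

```php
<?php
/**
 * Returns the names of functions and methods declared in $code,
 * with the case they were written in.
 */
function findFunctionNames(string $code): array
{
    $names = [];
    $afterFunction = false; // true between "function" and its name

    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            if ($token[0] === T_FUNCTION) {
                $afterFunction = true;
            } elseif ($afterFunction && $token[0] === T_STRING) {
                $names[] = $token[1];
                $afterFunction = false;
            }
        } elseif ($token === '(' || $token === ';') {
            // Reached "(" without seeing a name: this was a closure.
            $afterFunction = false;
        }
    }
    return $names;
}

$code = <<<'CODE'
<?php
class User {
    public function getUserName() {}
    public function &refMethod() {}
}
$closure = function () {};
CODE;

print_r(findFunctionNames($code));
```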

Finally you can use it as a simpler method of extracting Javadoc out of a class file to generate documentation. The util/MethodTable.php class in AMFPHP (http://www.amfphp.org) uses the tokenizer functions to create a method table with all of the arguments, a description, return type, etc. and from that method table it can generate ActionScript that matches the PHP, but it could also be fitted to generate JavaScript, documentation files, or basically anything you put your mind to. I can also see that this could be the base for a class -> WSDL file generator.
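As a small sketch of the documentation use case (not AMFPHP's actual code; the function name `findDocComments` is made up for illustration): doc blocks come out of the tokenizer as T_DOC_COMMENT tokens, separate from ordinary comments, and each one carries the line it starts on.

```php
<?php
/**
 * Returns every doc comment in $code with the line it starts on.
 */
function findDocComments(string $code): array
{
    $docs = [];
    foreach (token_get_all($code) as $token) {
        // Array tokens are [id, text, line]; plain /* ... */ and // comments
        // are T_COMMENT and are skipped here.
        if (is_array($token) && $token[0] === T_DOC_COMMENT) {
            $docs[] = ['line' => $token[2], 'text' => $token[1]];
        }
    }
    return $docs;
}
```

A documentation generator would then pair each doc comment with the function or class declaration that follows it.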

You can use it to gather all kinds of information about PHP code, for example all defined classes, methods and variables, and for generating documentation and similar tasks.

B4rb4ross4

I'm working on a Symfony 1.2 legacy application and I use the tokenizer to get all calls of sfConfig::get() and sfConfig::set().

So basically I document all configuration parameters of my application.
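Something along these lines would do it. This is a sketch, not the code from my application: the function name `findConfigKeys` is made up for illustration, and it assumes the key is passed as a plain string literal in the first argument. Calls with a dynamic key are skipped, and so are calls inside comments.

```php
<?php
/**
 * Returns the distinct keys passed as string literals to
 * sfConfig::get() and sfConfig::set() in $code.
 */
function findConfigKeys(string $code): array
{
    // Drop whitespace and comments so the interesting tokens are adjacent.
    $tokens = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)
            && in_array($token[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true)) {
            continue;
        }
        $tokens[] = $token;
    }

    $keys = [];
    $count = count($tokens);
    for ($i = 0; $i + 4 < $count; $i++) {
        // Look at a window of five tokens: sfConfig :: get|set ( 'key'
        [$class, $colon, $method, $paren, $arg] = array_slice($tokens, $i, 5);
        if (is_array($class) && $class[0] === T_STRING && $class[1] === 'sfConfig'
            && is_array($colon) && $colon[0] === T_DOUBLE_COLON
            && is_array($method) && $method[0] === T_STRING
            && in_array($method[1], ['get', 'set'], true)
            && $paren === '('
            && is_array($arg) && $arg[0] === T_CONSTANT_ENCAPSED_STRING) {
            // The token text still includes the surrounding quotes.
            $keys[] = trim($arg[1], '\'"');
        }
    }
    return array_values(array_unique($keys));
}
```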
