HTML Purifier: Removing an element conditionally based on its attributes

南笙 2020-12-03 23:46

As per the HTML Purifier smoketest, 'malformed' URIs are occasionally discarded, leaving behind an attribute-less anchor tag.

3 answers
  • 2020-12-04 00:23

    Success! Thanks to Ambush Commander and mcgrailm in another question, I am now using a hilariously simple solution:

    // a bit of context
    $htmlDef = $this->configuration->getHTMLDefinition(true);
    $anchor  = $htmlDef->addBlankElement('a');
    
    // HTMLPurifier_AttrTransform_RemoveLoneHttp strips 'href="http:/"' from
    // all anchor tags (see first post for class detail)
    $anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_RemoveLoneHttp();
    
    // this is the magic! We're making 'href' a required attribute (note the
    // asterisk) - now HTML Purifier removes <a></a>, as well as
    // <a href="http:/"></a> after HTMLPurifier_AttrTransform_RemoveLoneHttp
    // is through with it!
    $htmlDef->addAttribute('a', 'href*', new HTMLPurifier_AttrDef_URI());
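
    The class referenced above is not reproduced in this thread, so the following is only a guess at what such a transform might look like. It assumes HTML Purifier is loadable through its standalone autoloader (`HTMLPurifier.auto.php`; a Composer setup would load `vendor/autoload.php` instead), and the body of `transform()` is an assumption rather than the original author's code.

```php
<?php
// Assumption: HTML Purifier's standalone autoloader is on the include path.
require_once 'HTMLPurifier.auto.php';

/**
 * Sketch of an attribute transform that drops href="http:/" from a tag,
 * leaving the tag without an href so that the "required attribute" rule
 * can then remove the tag itself.
 */
class HTMLPurifier_AttrTransform_RemoveLoneHttp extends HTMLPurifier_AttrTransform
{
    /**
     * @param array $attr Attributes of the current tag, keyed by name
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return array The (possibly reduced) attribute array
     */
    public function transform($attr, $config, $context)
    {
        if (isset($attr['href']) && $attr['href'] === 'http:/') {
            unset($attr['href']);
        }
        return $attr;
    }
}
```

    Since an attribute transform can only change attributes and cannot delete the tag, the removal itself is left to the required-`href` definition.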
    

    It works, it works, bahahahaHAHAHAHA...! * manic laughter, gurgling noises, keels over with a smile on her face *

  • 2020-12-04 00:29

    The fact that you can't remove elements with a TagTransform appears to have been an implementation detail. The classic mechanism for removing nodes (a smidge higher-level than just tags) is to use an Injector though.

    Anyway, the particular piece of functionality you're looking for is already implemented as %AutoFormat.RemoveEmpty.
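
    For reference, a minimal sketch of switching that directive on. The include path and the sample input are assumptions for illustration. As far as I understand it, the directive targets elements with no content, so it is worth checking whether it also covers an attribute-less anchor that still wraps text.

```php
<?php
// Assumption: HTML Purifier's standalone autoloader is on the include path;
// with Composer you would require 'vendor/autoload.php' instead.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Enable the built-in handling that strips empty elements.
$config->set('AutoFormat.RemoveEmpty', true);

$purifier = new HTMLPurifier($config);

// Hypothetical input, for illustration only.
$dirty = '<p>before <a href="javascript:void(0)"></a> after</p>';
echo $purifier->purify($dirty);
```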

  • 2020-12-04 00:31

    For perusal, this is my current solution. It works, but bypasses HTML Purifier entirely.

    /**
     * Removes <a></a> and <a href="http:/"></a> tags from the purified
     * HTML.
     * @todo solve this with an injector?
     * @param string $purified The purified HTML
     * @return string The purified HTML, sans pointless anchors.
     */
    private function anchorCull($purified)
    {
        // note: empty() would also treat the string "0" as empty
        if ($purified === null || $purified === '') return '';
        // re-parse HTML
        $domTree = new DOMDocument();
        $domTree->loadHTML($purified);
        // find all anchors (even good ones)
        $anchors = $domTree->getElementsByTagName('a');
        // collect bad anchors (destroying them in this loop breaks the DOM)
        $destroyNodes = array();
        for ($i = 0; ($i < $anchors->length); $i++) {
            $anchor = $anchors->item($i);
            $href   = $anchor->attributes->getNamedItem('href');
            // <a></a>
            if (is_null($href)) {
                $destroyNodes[] = $anchor;
            // <a href="http:/"></a>
            } else if ($href->nodeValue == 'http:/') {
                $destroyNodes[] = $anchor;
            }
        }
        // destroy the collected nodes
        foreach ($destroyNodes as $node) {
            // preserve content: childNodes is a live list, so keep moving the
            // first child (indexing into it would skip every other node)
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            // actually destroy the node
            $node->parentNode->removeChild($node);
        }
        // extract the contents of <body> from the serialized document
        $html = $domTree->saveHTML();
        $begin = strpos($html, '<body>') + strlen('<body>');
        $end   = strpos($html, '</body>');
        return substr($html, $begin, $end - $begin);
    }
    

    I'd still much rather have a good HTML Purifier solution to this, so, just as a heads-up, this answer won't end up self-accepted. But in case no better answer ends up coming around, at least it might help those with similar issues. :)
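
    The same idea can be tried outside the class as a standalone function. This is only a sketch under stated assumptions: the name `unwrapBareAnchors` is made up for illustration, it relies on PHP's DOM extension alone, and it assumes the input is plain ASCII or otherwise safe for `loadHTML()`'s default encoding handling. It snapshots the anchor list before modifying the tree and serializes the children of `<body>` directly instead of slicing the output string.

```php
<?php
/**
 * Hypothetical standalone helper (not part of HTML Purifier): unwraps
 * <a> elements that have no href, or whose href is the lone "http:/".
 */
function unwrapBareAnchors($html)
{
    if ($html === '') {
        return '';
    }
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Snapshot the live node list so removals do not disturb iteration.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $anchor) {
        if ($anchor->hasAttribute('href') && $anchor->getAttribute('href') !== 'http:/') {
            continue; // a usable link, keep it
        }
        // Move the children out in front of the anchor, then drop the shell.
        while ($anchor->firstChild) {
            $anchor->parentNode->insertBefore($anchor->firstChild, $anchor);
        }
        $anchor->parentNode->removeChild($anchor);
    }

    // Serialize only the children of <body>.
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

// The bare anchor is unwrapped, the valid link is kept.
echo unwrapBareAnchors('<p>see <a>this <b>bold</b> link</a> and <a href="http://example.com/">that</a></p>');
```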
