Close open HTML tags in a string

南方客 2020-11-28 11:21

I have a string that ends up looking something like this:

<p>This is some text and here is a <strong>bold text then the post stop here....</p>

9 answers
  • 2020-11-28 11:46

    Here is a function I've used before, which works pretty well:

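    function closetags($html) {
        // Collect opening tags, skipping void elements and self-closing tags.
        preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)([a-z][a-z0-9]*)(?:\s[^>]*)?(?<!/)>#i', $html, $result);
        $openedtags = array_map('strtolower', $result[1]);
        // Collect closing tags.
        preg_match_all('#</([a-z][a-z0-9]*)\s*>#i', $html, $result);
        $closedtags = array_map('strtolower', $result[1]);
        $len_opened = count($openedtags);
        // Same number of openers and closers: assume nothing is left open.
        if (count($closedtags) == $len_opened) {
            return $html;
        }
        // Walk the openers innermost-first and append a closer for each unmatched one.
        $openedtags = array_reverse($openedtags);
        for ($i=0; $i < $len_opened; $i++) {
            if (!in_array($openedtags[$i], $closedtags)) {
                $html .= '</'.$openedtags[$i].'>';
            } else {
                unset($closedtags[array_search($openedtags[$i], $closedtags)]);
            }
        }
        return $html;
    }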
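
    The function above appends every missing closer at the end of the string, so for the question's input the </strong> lands after the </p>. As a rough sketch (not part of the original answer), a stack can track nesting so that each closer goes where the nesting requires. The name closeTagsWithStack and the void-element list are assumptions made for illustration, and the regex will still misread comments and attribute values that contain ">":

    ```php
    <?php
    // Hypothetical helper: closes unclosed tags in nesting order using a stack.
    function closeTagsWithStack(string $html): string
    {
        // Void elements never take a closing tag.
        $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                 'input', 'link', 'meta', 'source', 'track', 'wbr'];
        $stack = [];

        $out = preg_replace_callback(
            '#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i',
            function (array $m) use (&$stack, $void): string {
                $name = strtolower($m[2]);
                // Void or self-closing tags are passed through untouched.
                if (in_array($name, $void, true) || substr($m[0], -2) === '/>') {
                    return $m[0];
                }
                if ($m[1] === '') {
                    $stack[] = $name; // opening tag: remember it
                    return $m[0];
                }
                // Closing tag: find the most recent matching opener.
                $pos = array_search($name, array_reverse($stack, true), true);
                if ($pos === false) {
                    return $m[0]; // stray closer: left as it is
                }
                // Close everything opened after that opener, innermost first.
                $prefix = '';
                while (count($stack) > $pos + 1) {
                    $prefix .= '</' . array_pop($stack) . '>';
                }
                array_pop($stack); // the matching opener itself
                return $prefix . $m[0];
            },
            $html
        ) ?? $html;

        // Close whatever is still open at the end of the string.
        while ($stack) {
            $out .= '</' . array_pop($stack) . '>';
        }
        return $out;
    }

    echo closeTagsWithStack('<p>This is some text and here is a <strong>bold text then the post stop here....</p>');
    ```

    With the question's string this inserts </strong> just before the existing </p> rather than after it.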
    

    Personally though, I would not do it using regexp but a library such as Tidy. This would be something like the following:

    $str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
    $tidy = new Tidy();
    $clean = $tidy->repairString($str, array(
        'output-xml' => true,
        'input-xml' => true
    ));
    echo $clean;
    
  • 2020-11-28 11:46

    A small modification to the original answer: while it closes tags correctly, I found that truncating the string first could leave a tag chopped in half. For example:

    This text has some <b>in it</b>
    

    Truncating to 20 characters results in:

    This text has some <
    

    The following code builds on the closetags() answer and fixes this.

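    function truncateHTML($html, $length)
    {
        $truncated = substr($html, 0, $length);

        // If the cut landed inside a tag (a '<' with no '>' after it),
        // extend the cut to that tag's closing '>' so the tag stays whole.
        $lastOpen = strrpos($truncated, '<');
        $lastClose = strrpos($truncated, '>');
        if ($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen))
        {
            $end = strpos($html, '>', $lastOpen);
            // No closing '>' anywhere: drop the broken tag instead.
            $truncated = ($end !== false) ? substr($html, 0, $end + 1) : substr($html, 0, $lastOpen);
        }
        $html = $truncated;

        // Collect opening tags, skipping void elements and self-closing tags.
        preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)([a-z][a-z0-9]*)(?:\s[^>]*)?(?<!/)>#i', $html, $result);
        $openedtags = array_map('strtolower', $result[1]);

        // Collect closing tags.
        preg_match_all('#</([a-z][a-z0-9]*)\s*>#i', $html, $result);
        $closedtags = array_map('strtolower', $result[1]);

        $len_opened = count($openedtags);

        if (count($closedtags) == $len_opened)
        {
            return $html;
        }

        // Walk the openers innermost-first and append a closer for each unmatched one.
        $openedtags = array_reverse($openedtags);
        for ($i=0; $i < $len_opened; $i++)
        {
            if (!in_array($openedtags[$i], $closedtags))
            {
                $html .= '</'.$openedtags[$i].'>';
            }
            else
            {
                unset($closedtags[array_search($openedtags[$i], $closedtags)]);
            }
        }

        return $html;
    }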
    
    
    $str = "This text has <b>bold</b> in it</b>";
    print "Test 1 - Truncate with no tag: " . truncateHTML($str, 5) . "<br>\n";
    print "Test 2 - Truncate at start of tag: " . truncateHTML($str, 15) . "<br>\n";
    print "Test 3 - Truncate in the middle of a tag: " . truncateHTML($str, 16) . "<br>\n";
    print "Test 4 - Truncate beyond the end of the text: " . truncateHTML($str, 300) . "<br>\n";
    

    Hope it helps someone out there.

  • 2020-11-28 11:48

    And what about using PHP's native DOMDocument class? It inherently parses HTML and corrects syntax errors... E.g.:

    $fragment = "<article><h3>Title</h3><p>Unclosed";
    $doc = new DOMDocument();
    $doc->loadHTML($fragment);
    $correctFragment = $doc->getElementsByTagName('body')->item(0)->C14N();
    echo $correctFragment;
    

    However, this approach has some disadvantages. First, it wraps the original fragment in a <body> tag. You can remove the wrapper with str_replace() or preg_replace(), or by replacing the C14N() call with a custom innerHTML() function, as suggested for example at http://php.net/manual/en/book.dom.php#89718. Second, PHP emits a 'Tag ... invalid in Entity' warning if HTML5 or custom tags are used (the document is still parsed correctly).
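
    Both pitfalls can be worked around in a few lines. The following is only a sketch: repairFragment is a hypothetical name, it serializes the children of <body> with saveHTML() instead of C14N(), and it assumes that silencing libxml warnings via libxml_use_internal_errors() is acceptable in your application:

    ```php
    <?php
    // Hypothetical helper: repairs a fragment and returns it without the <html>/<body> wrapper.
    function repairFragment(string $fragment): string
    {
        if (trim($fragment) === '') {
            return ''; // loadHTML() rejects empty input
        }

        // Collect libxml warnings (e.g. for HTML5 tags) instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($fragment);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $body = $doc->getElementsByTagName('body')->item(0);
        if ($body === null) {
            return '';
        }

        // Serialize each child of <body>, which drops the wrapper itself.
        $html = '';
        foreach ($body->childNodes as $child) {
            $html .= $doc->saveHTML($child);
        }
        return $html;
    }

    echo repairFragment('<p>Some <strong>bold text');
    ```

    The serializer may add line breaks between block-level elements, so the result is not guaranteed to match the input byte for byte apart from the added closers.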

  • 2020-11-28 11:49

    A regular expression isn't an ideal tool for this. You should use an HTML parser instead to build a valid document object model.

    As a second option, depending on what you want, you could remove all HTML tags from your string before you put it in the <p> tag.
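
    For that second option, PHP's built-in strip_tags() can stand in for a hand-written regex. A minimal sketch, with toPlainParagraph as a hypothetical helper name:

    ```php
    <?php
    // Hypothetical helper: drops all markup, then wraps the remaining text in a paragraph.
    function toPlainParagraph(string $html): string
    {
        // strip_tags() removes the tags and keeps the text between them.
        return '<p>' . strip_tags($html) . '</p>';
    }

    echo toPlainParagraph('<p>This is some text and here is a <strong>bold text');
    // prints: <p>This is some text and here is a bold text</p>
    ```

    This loses the bold formatting, which is the trade-off of not parsing the markup at all.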

  • 2020-11-28 11:53

    There are numerous other variables that need to be addressed to give a full solution, but are not covered by your question.

    However, I would suggest using something like HTML Tidy, in particular the repairFile or repairString methods.
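
    As a sketch of what that could look like for a fragment, assuming the tidy extension is installed: repairWithTidy is a hypothetical wrapper, and the show-body-only and wrap options are one possible configuration rather than a recommendation from this answer.

    ```php
    <?php
    // Hypothetical wrapper around the tidy extension (ext-tidy must be available).
    function repairWithTidy(string $fragment): string
    {
        $config = [
            'show-body-only' => true, // return only the fragment, no <html>/<body> wrapper
            'wrap'           => 0,    // do not re-wrap long lines
        ];
        $repaired = tidy_repair_string($fragment, $config, 'utf8');

        // Fall back to the input if Tidy could not repair it.
        return $repaired === false ? $fragment : trim($repaired);
    }

    if (function_exists('tidy_repair_string')) {
        echo repairWithTidy('<p>Some <strong>bold text');
    }
    ```

    Tidy may also normalize whitespace and other markup, so treat the output as repaired HTML rather than as the input plus closers.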

  • 2020-11-28 11:56

    This PHP function has always worked for me. It closes any unclosed HTML tags.

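    function closetags($html) {
        // Collect opening tags. Void elements such as <br> are skipped so they do not get a closing tag.
        preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)([a-z][a-z0-9]*)(?:\s[^>]*)?(?<!/)>#i', $html, $result);
        $openedtags = array_map('strtolower', $result[1]);

        // Collect closing tags.
        preg_match_all('#</([a-z][a-z0-9]*)\s*>#i', $html, $result);
        $closedtags = array_map('strtolower', $result[1]);
        $len_opened = count($openedtags);
        // Same number of openers and closers: assume nothing is left open.
        if (count($closedtags) == $len_opened) {
            return $html;
        }
        // Walk the openers innermost-first and append a closer for each unmatched one.
        $openedtags = array_reverse($openedtags);
        for ($i=0; $i < $len_opened; $i++) {
            if (!in_array($openedtags[$i], $closedtags)){
                $html .= '</'.$openedtags[$i].'>';
            } else {
                unset($closedtags[array_search($openedtags[$i], $closedtags)]);
            }
        }
        return $html;
    }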
    