I am trying to truncate text after 236 characters without cutting words in half, while preserving HTML tags. This is what I am using right now:
$shortdesc = $_helper-
The best solution I have come across for this is the truncate method from the CakePHP framework's TextHelper class.
Here is the method:
/**
* Truncates text.
*
* Cuts a string to the length of $length and replaces the last characters
* with the ending if the text is longer than length.
*
* ### Options:
*
* - `ending` Will be used as Ending and appended to the trimmed string
* - `exact` If false, $text will not be cut mid-word
* - `html` If true, HTML tags would be handled correctly
*
* @param string $text String to truncate.
* @param integer $length Length of returned string, including ellipsis.
* @param array $options An array of html attributes and options.
* @return string Trimmed string.
* @access public
* @link http://book.cakephp.org/view/1469/Text#truncate-1625
*/
function truncate($text, $length = 100, $options = array()) {
$default = array(
'ending' => '...', 'exact' => true, 'html' => false
);
$options = array_merge($default, $options);
extract($options);
if ($html) {
if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
return $text;
}
$totalLength = mb_strlen(strip_tags($ending));
$openTags = array();
$truncate = '';
preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
foreach ($tags as $tag) {
if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
array_unshift($openTags, $tag[2]);
} else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
$pos = array_search($closeTag[1], $openTags);
if ($pos !== false) {
array_splice($openTags, $pos, 1);
}
}
}
$truncate .= $tag[1];
$contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
if ($contentLength + $totalLength > $length) {
$left = $length - $totalLength;
$entitiesLength = 0;
if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
foreach ($entities[0] as $entity) {
if ($entity[1] + 1 - $entitiesLength <= $left) {
$left--;
$entitiesLength += mb_strlen($entity[0]);
} else {
break;
}
}
}
$truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
break;
} else {
$truncate .= $tag[3];
$totalLength += $contentLength;
}
if ($totalLength >= $length) {
break;
}
}
} else {
if (mb_strlen($text) <= $length) {
return $text;
} else {
$truncate = mb_substr($text, 0, $length - mb_strlen($ending));
}
}
if (!$exact) {
$spacepos = mb_strrpos($truncate, ' ');
if ($spacepos !== false) { // mb_strrpos() returns false when no space is found; isset() would still be true
if ($html) {
$bits = mb_substr($truncate, $spacepos);
preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
if (!empty($droppedTags)) {
foreach ($droppedTags as $closingTag) {
if (!in_array($closingTag[1], $openTags)) {
array_unshift($openTags, $closingTag[1]);
}
}
}
}
$truncate = mb_substr($truncate, 0, $spacepos);
}
}
$truncate .= $ending;
if ($html) {
foreach ($openTags as $tag) {
$truncate .= '</'.$tag.'>';
}
}
return $truncate;
}
Other frameworks may have similar (or different) solutions to this problem, so you could take a look at them too. I linked to Cake's solution because it is the framework I am familiar with.
Edit:
I just tested this method in an app I'm working on, using the OP's text:
<?php
echo truncate(
'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>',
236,
array('html' => true, 'ending' => ''));
?>
Output:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubegre</strong>
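Notice that the output stops just short of completing the last word, but the strong tags are still complete. To avoid cutting mid-word, set the `exact` option to false, as described in the method's doc comment.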
Can I offer a thought?
Sample text:
Lorem ipsum dolor sit amet, <i class="red">magna aliquyam erat</i>, duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong> hello
First, parse it into:
array(
'0' => array(
'tag' => '',
'text' => 'Lorem ipsum dolor sit amet, '
),
'1' => array(
'tag' => '<i class="red">',
'text' => 'magna aliquyam erat',
),
'2' => ......
'3' => ......
)
Then cut the text segments one by one, wrap each one in its tag after cutting,
and join them.
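The steps above can be sketched roughly as follows. This is only a minimal illustration, not a complete solution: `truncateSegments` is a hypothetical name, and the regular expression assumes flat, well-formed markup in which every tag is paired and nothing is nested. It cuts at a character count and does not look for word boundaries.

```php
<?php
// Hypothetical helper (not from any library): split flat HTML into
// tag/text segments, trim the text against a shared character budget,
// re-wrap each segment in its tag, and join the pieces.
function truncateSegments(string $html, int $limit): string
{
    // Alternative 1: <tag attrs>text</tag>   Alternative 2: bare text.
    preg_match_all('/<(\w+)([^>]*)>([^<]*)<\/\1>|([^<]+)/u', $html, $matches, PREG_SET_ORDER);

    $segments = [];
    foreach ($matches as $m) {
        if (($m[4] ?? '') !== '') {
            // Plain text outside any tag.
            $segments[] = ['open' => '', 'text' => $m[4], 'close' => ''];
        } else {
            // Text wrapped in a single element.
            $segments[] = [
                'open'  => "<{$m[1]}{$m[2]}>",
                'text'  => $m[3],
                'close' => "</{$m[1]}>",
            ];
        }
    }

    $out = '';
    $remaining = $limit;
    foreach ($segments as $segment) {
        if ($remaining <= 0) {
            break; // budget used up, drop the remaining segments
        }
        // Cut this segment's text to whatever budget is left.
        $text = mb_substr($segment['text'], 0, $remaining);
        $remaining -= mb_strlen($text);
        $out .= $segment['open'] . $text . $segment['close'];
    }

    return $out;
}

echo truncateSegments('Lorem ipsum, <i class="red">magna erat</i>, duo <strong>Stet</strong> hello', 18);
```

Because each segment carries its own opening and closing tag, the output stays balanced no matter where the cut falls. Nested markup would need a real parser or a tag stack instead.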
This version works with Unicode (adapted from @nice ass's answer):
class Html
{
protected
$reachedLimit = false,
$totalLen = 0,
$maxLen = 25,
$toRemove = [];
public static function trim($html, $maxLen = 25)
{
$dom = new \DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$instance = new static();
$toRemove = $instance->walk($dom, $maxLen);
// remove any nodes that exceed limit
foreach ($toRemove as $child) {
$child->parentNode->removeChild($child);
}
return $dom->saveHTML();
}
protected function walk(\DOMNode $node, $maxLen)
{
if ($this->reachedLimit) {
$this->toRemove[] = $node;
} else {
// only text nodes should have text,
// so do the splitting here
if ($node instanceof \DOMText) {
// use mb_strlen / mb_substr for UTF-8 support
$this->totalLen += $nodeLen = mb_strlen($node->nodeValue);
if ($this->totalLen > $maxLen) {
// keep only the part of this text node that still fits within the limit
$node->nodeValue = mb_substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
$this->reachedLimit = true;
}
}
// if node has children, walk its child elements
if (isset($node->childNodes)) {
foreach ($node->childNodes as $child) {
$this->walk($child, $maxLen);
}
}
}
return $this->toRemove;
}
}
function limitStrlen($input, $length, $ellipses = true, $strip_html = true, $skip_html = false)
{
// strip tags, if desired
if ($strip_html || !$skip_html)
{
$input = strip_tags($input);
// no need to trim, already shorter than trim length
if (strlen($input) <= $length)
{
return $input;
}
//find last space within length
$last_space = strrpos(substr($input, 0, $length), ' ');
if($last_space !== false)
{
$trimmed_text = substr($input, 0, $last_space);
}
else
{
$trimmed_text = substr($input, 0, $length);
}
}
else
{
if (strlen(strip_tags($input)) <= $length)
{
return $input;
}
$trimmed_text = $input;
$last_space = $length + 1;
while(true)
{
$last_space = strrpos($trimmed_text, ' ');
if($last_space !== false)
{
$trimmed_text = substr($trimmed_text, 0, $last_space);
if (strlen(strip_tags($trimmed_text)) <= $length)
{
break;
}
}
else
{
$trimmed_text = substr($trimmed_text, 0, $length);
break;
}
}
// close unclosed tags.
$doc = new DOMDocument();
$doc->loadHTML($trimmed_text);
$trimmed_text = $doc->saveHTML();
}
// add ellipses (...)
if ($ellipses)
{
$trimmed_text .= '...';
}
return $trimmed_text;
}
$str = "<h1><strong><span>Lorem</span></strong> <i>ipsum</i> <p class='some-class'>dolor</p> sit amet, consetetur.</h1>";
// view the HTML
echo htmlentities(limitStrlen($str, 22, false, false, true), ENT_COMPAT, 'UTF-8');
// view the result
echo limitStrlen($str, 22, false, false, true);
Note: There may be a better way to close tags than using DOMDocument. For example, we can put a p tag inside an h1 tag and it will still work, but in that case the heading tag will be closed before the p tag, because a p element is not allowed inside a heading. So be careful about strict HTML standards.
Here is a JS solution: trim-html
The idea is to split the HTML string into an array whose elements are each either an HTML tag (opening or closing) or plain text.
var arr = html.replace(/</g, "\n<")
.replace(/>/g, ">\n")
.replace(/\n\n/g, "\n")
.replace(/^\n/g, "")
.replace(/\n$/g, "")
.split("\n");
Then we can iterate through the array and count characters.
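For readers working in PHP, as in the question, the same split-and-count idea might look roughly like the sketch below. It is not the trim-html implementation: `truncateTokens`, the `$suffix` parameter, and the short `$void` list are assumptions made for illustration. It assumes valid UTF-8 input and counts entities such as `&amp;` as raw characters.

```php
<?php
// Hypothetical helper: tokenize HTML into tags and text, copy tokens until
// $limit visible characters are reached, then close any tags still open.
function truncateTokens(string $html, int $limit, string $suffix = '...'): string
{
    // Split on tags but keep them as tokens in the result.
    $tokens = preg_split('/(<[^>]+>)/u', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY) ?: [];
    $void = ['br', 'hr', 'img', 'input', 'link', 'meta']; // never need closing
    $open = [];  // currently open tags, innermost first
    $out = '';
    $count = 0;  // visible characters copied so far

    foreach ($tokens as $token) {
        if (preg_match('/^<(\/?)(\w+)[^>]*>$/', $token, $m)) {
            $name = strtolower($m[2]);
            if ($m[1] === '/') {
                // Closing tag: forget the matching open tag.
                $pos = array_search($name, $open, true);
                if ($pos !== false) {
                    array_splice($open, $pos, 1);
                }
            } elseif (!in_array($name, $void, true) && substr($token, -2) !== '/>') {
                array_unshift($open, $name);
            }
            $out .= $token; // tags do not count toward the limit
            continue;
        }

        $length = mb_strlen($token);
        if ($count + $length > $limit) {
            $cut = mb_substr($token, 0, $limit - $count);
            $next = mb_substr($token, $limit - $count, 1);
            $space = mb_strrpos($cut, ' ');
            // If the cut lands inside a word, back up to the previous space.
            if ($next !== ' ' && $space !== false) {
                $cut = mb_substr($cut, 0, $space);
            }
            $out .= $cut . $suffix;
            foreach ($open as $name) {
                $out .= "</$name>";
            }
            return $out;
        }

        $out .= $token;
        $count += $length;
    }

    return $out; // input was already within the limit
}

echo truncateTokens('<p>Hello <strong>brave new</strong> world</p>', 13);
```

Keeping a stack of open tags is what lets the function emit the missing closing tags after the cut. A word that spans several tokens, or a text token with no space before the cut, is still cut mid-word in this sketch.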
You can take an XML approach and push elements onto a string variable until the length of the string exceeds 236.
Example pseudocode:
for each node // text or tag
push to the string var
if string length > 236
break
endfor
For parsing HTML in PHP, see http://simplehtmldom.sourceforge.net/