I have a PHP website in which I can manage articles. On the "Add a new article" form, there is a rich-text box (allows HTML input) that I'd like to limit the character input
html_entity_decode only decodes HTML entities; it doesn't remove HTML tags. Try:
strlen(strip_tags(html_entity_decode($string)));
Or the multi-byte equivalent, assuming the input is UTF-8:
mb_strlen(strip_tags(html_entity_decode($string)), 'UTF-8');
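As a rough sketch of how this could be wired into a form handler, the helper below wraps the tag-stripping approach and compares the result against a limit. The function name visibleLength, the 500-character limit, and the UTF-8 assumption are illustrative and not part of the original question. It also strips tags before decoding entities, a small deviation from the one-liner above, so that an encoded angle bracket in the text is not treated as the start of a tag.

```php
<?php
// Hypothetical helper: counts characters in rich-text input, ignoring markup.
// Assumes the submitted HTML is UTF-8 encoded.
function visibleLength(string $html): int
{
    // Strip tags first, then decode entities, so that an encoded "&lt;"
    // in the text is not mistaken for the start of a tag.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return mb_strlen($text, 'UTF-8');
}

$maxChars = 500; // illustrative limit, not from the question
$input = '<p>Caf&eacute; <b>menu</b></p>';

if (visibleLength($input) > $maxChars) {
    echo "Article is too long.\n";
} else {
    printf("Length: %d character(s).\n", visibleLength($input));
}
```

For the sample input, the visible text is "Café menu", so the helper reports 9 characters.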
You want to get the number of characters, but you don't want to count HTML markup.
You can do that by using an HTML parser, like DOMDocument. You load the document (or fragment), obtain the body element, which represents the document's content, get its nodeValue, normalize its whitespace, and then use a UTF-8 compatible character counting function:
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');
$body = $doc->getElementsByTagName('body')->item(0);
$text = $body->nodeValue;
$text = trim(preg_replace('/\s{1,}/u', ' ', $text));
printf("Length: %d character(s).\n", mb_strlen($text, 'utf-8'));
Example input (test.html):
<body>
<div style='float:left'><img src='../../../../includes/ph1.jpg'></div>
<label style='width: 476px; height: 40px; position: absolute;top:100px; left: 40px; z-index: 2; background-color: rgb(255, 255, 255);; background-color: transparent' >
<font size="4">1a. Nice to meet you!</font>
</label>
<img src='ENG_L1_C1_P0_1.jpg' style='width: 700px; height: 540px; position: absolute;top:140px; left: 40px; z-index: 1;' />
<script type='text/javascript'>
swfobject.registerObject('FlashID');
</script>
<input type="image" id="nextPageBtn" src="../../../../includes/ph4.gif" style="position: absolute; top: 40px; left: 795px; ">
</body>
Example output:
Length: 58 character(s).
The normalized text is:
1a. Nice to meet you! swfobject.registerObject('FlashID');
Be aware that this count includes all text content, such as the text inside <script> tags.
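If the contents of <script> and <style> elements should be excluded, one option is to remove those elements from the DOM before reading the text. The following is a minimal sketch under that assumption. The function name countVisibleCharacters is hypothetical, the markup is loaded from a string with loadHTML rather than from a file, and the prepended meta tag is a common workaround to make the parser treat the input as UTF-8.

```php
<?php
// Hypothetical helper: counts text characters in an HTML fragment,
// skipping the contents of <script> and <style> elements.
// Assumes the input is UTF-8 encoded.
function countVisibleCharacters(string $html): int
{
    $doc = new DOMDocument();

    // Collect parser warnings about invalid markup instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();

    // Copy the matches into an array before removing them from the tree.
    $xpath = new DOMXPath($doc);
    $unwanted = iterator_to_array($xpath->query('//script | //style'));
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }

    // The parser wraps fragments in <html><body>; body may be missing
    // when the input contains no visible content at all.
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body === null ? '' : $body->textContent;

    // Collapse runs of whitespace, as in the answer above.
    $text = trim(preg_replace('/\s+/u', ' ', $text));

    return mb_strlen($text, 'UTF-8');
}

$html = "<label><font size=\"4\">1a. Nice to meet you!</font></label>\n"
      . "<script type='text/javascript'>swfobject.registerObject('FlashID');</script>";

printf("Length: %d character(s).\n", countVisibleCharacters($html));
```

With the script element removed, only "1a. Nice to meet you!" remains, which is 21 characters.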