I am scraping the DOM of a static site with PHP and pulling out specific bits of data so I can put them into a database.
For this example I am storing the inner H
The number in parentheses is the total byte count. Obviously, a 45-byte string cannot be identical to an 11-byte one.
You can use bin2hex() to inspect the exact bytes. I also suggest you don't look at the output as rendered HTML: in most browsers you can hit Ctrl+U to view the page source instead.
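To make that concrete, here is a minimal sketch of the comparison. The value of $scraped is made up for illustration (the real scraped string isn't shown in the question); the point is only that strlen() and bin2hex() expose bytes the browser hides.

```php
<?php
// Hypothetical values: the actual scraped string is not shown in the question.
$scraped  = "<td valign=\"top\">Description</td>";
$expected = "Description";

// The byte counts differ even though a browser renders the same word.
var_dump(strlen($scraped), strlen($expected));

// Hex dumps show exactly which extra bytes are present.
var_dump(bin2hex($scraped));
var_dump(bin2hex($expected));

// Strict comparison fails because the bytes differ.
var_dump($scraped === $expected);
```

Run this from the command line, or view the page source, so the markup isn't swallowed by the browser.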
Edit: asking why two given strings render the same words after being processed by a web browser is better answered by actually looking at the real raw data (as opposed to just looking at the output produced by the browser).
Edit #2:
var_dump( hex2bin('3c74642077696474683d223832222076616c69676e3d22746f70223e547970653c2f74643e') );
... prints this:
string(37) "<td width="82" valign="top">Type</td>"
Do you want to strip HTML tags or something? Did you see the raw HTML?
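If stripping the tags is indeed what you want, something along these lines might do. It is only a sketch: it assumes the cell content is plain text wrapped in markup, and the html_entity_decode() and trim() steps are my additions in case the text contains entities or padding.

```php
<?php
// The same bytes as in the hex dump above.
$cell = hex2bin('3c74642077696474683d223832222076616c69676e3d22746f70223e547970653c2f74643e');

// Drop the markup, decode entities such as &amp;, and trim surrounding whitespace.
$text = trim(html_entity_decode(strip_tags($cell), ENT_QUOTES | ENT_HTML5, 'UTF-8'));

var_dump($text); // string(4) "Type"
```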
You should ask why this happens:
string(45) "Description"
string(11) "Description"
The second one is 11 bytes, the first one is 45! Why? Because the first contains hidden characters or symbols that are not shown. That's why these strings are not equal.
Try this one: "Remove control characters from PHP string".
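As a rough illustration of that idea, the sketch below removes ASCII control characters with preg_replace(). The function name stripControlChars and the sample string are made up for the example; this is not necessarily the approach from the linked question, and it won't help if the extra bytes are HTML tags or non-breaking spaces.

```php
<?php
// Hypothetical helper: strips ASCII control characters (0x00-0x1F and 0x7F).
function stripControlChars(string $s): string
{
    // Fall back to the input if preg_replace() reports an error (returns null).
    return preg_replace('/[\x00-\x1F\x7F]/', '', $s) ?? $s;
}

// Made-up input: the visible word followed by CR, LF, TAB and NUL.
$dirty = "Description\r\n\t\0";
$clean = stripControlChars($dirty);

var_dump(strlen($dirty)); // int(15)
var_dump($clean);         // string(11) "Description"
```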
A solution is to use a regex like this:
function clean($string) {
    $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
    return preg_replace('/[^A-Za-z0-9\-\;\,\?\*\%\@\$\!\(\)\#\=\&]/', '', $string); // Removes special chars.
}
Adapt it to the special characters you need: anything listed inside the brackets is kept, so add the characters you want to keep, escaped like this \#
or else \=
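If you'd rather not edit the pattern by hand each time, one possible variation takes the extra characters to keep as a parameter. The name cleanKeeping and its $extra argument are made up for this sketch; preg_quote() does the escaping for you.

```php
<?php
// Hypothetical variant: keeps letters, digits, hyphens, plus any characters in $extra.
function cleanKeeping(string $string, string $extra = ''): string
{
    $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.

    // preg_quote() escapes regex metacharacters (and the "/" delimiter) in $extra.
    $pattern = '/[^A-Za-z0-9\-' . preg_quote($extra, '/') . ']/';

    // Fall back to the input if preg_replace() reports an error (returns null).
    return preg_replace($pattern, '', $string) ?? $string;
}

var_dump(cleanKeeping('a=b#c d', '=#')); // string(7) "a=b#c-d"
var_dump(cleanKeeping('a=b#c d'));       // string(5) "abc-d"
```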