Using PHP's DOMDocument::preserveWhiteSpace = false and still getting whitespace

Submitted by 本小妞迷上赌 on 2019-12-25 18:21:36

Question


I'm scraping this page:
http://kat.ph/search/example/?field=seeders&sorder=desc

Like this:

...
curl_setopt( $curl, CURLOPT_URL, $url );
$header = array (
    'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding:gzip,deflate,sdch',
    'Accept-Language:en-US,en;q=0.8',
    'Cache-Control:max-age=0',
    'Connection:keep-alive',
    'Host:kat.ph',
    'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
);
curl_setopt( $curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19'); 
curl_setopt( $curl, CURLOPT_HTTPHEADER, $header ); 
curl_setopt( $curl, CURLOPT_REFERER, 'http://kat.ph' ); 
curl_setopt( $curl, CURLOPT_ENCODING, 'gzip,deflate,sdch' ); 
curl_setopt( $curl, CURLOPT_AUTOREFERER, true ); 
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 ); 
curl_setopt( $curl, CURLOPT_TIMEOUT, 10 );

$html = curl_exec( $curl );
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
@$dom->loadHTML( $html );

(I had to mimic a browser for this to work, hence cURL.)

But I still get DOMNodes of type #text that consist of nothing but whitespace characters.

Any idea why this is happening and how to avoid it?


Answer 1:


It looks like the preserveWhiteSpace property simply sets the libxml2 XML_PARSE_NOBLANKS flag, which is not always reliable, as this thread suggests. Specifically, when parsing without a DTD, as in this case, the parser keeps whitespace-only text nodes under some circumstances (mainly when they are siblings of non-text nodes).

The thread may be a bit dated, but the behavior still exists as described.
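
One way to avoid the leftover nodes, regardless of what the parser decides to keep, is to strip them yourself after loading. The following is only a minimal sketch: removeWhitespaceTextNodes() is a hypothetical helper name, not part of the DOM extension, and the sample markup is made up for illustration. It selects every text node whose normalize-space() value is empty and removes it.

```php
<?php
// Hypothetical helper (not part of PHP's DOM extension): removes every
// text node that contains nothing but XPath whitespace (space, tab, CR, LF)
// and returns how many nodes were removed.
function removeWhitespaceTextNodes(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);
    // normalize-space() trims and collapses whitespace, so an empty
    // result means the text node held only whitespace.
    $nodes = $xpath->query('//text()[normalize-space(.) = ""]');

    $removed = 0;
    // Snapshot the result list before modifying the tree.
    foreach (iterator_to_array($nodes) as $node) {
        $node->parentNode->removeChild($node);
        $removed++;
    }
    return $removed;
}

// Made-up sample markup standing in for the scraped page.
$html = "<html><body><div>\n  <p>one</p>\n  <p>two</p>\n</div></body></html>";

libxml_use_internal_errors(true);   // collect parse warnings instead of using @
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
libxml_clear_errors();

$removed = removeWhitespaceTextNodes($dom);
$div = $dom->getElementsByTagName('div')->item(0);
```

Keep in mind that this also removes whitespace that may be significant for rendering, for example inside pre elements or between inline elements, and that normalize-space() does not treat a non-breaking space as whitespace.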



Source: https://stackoverflow.com/questions/9972112/using-phps-domdocumentpreservewhitespace-false-and-still-getting-whitespace
