domdocument

php spider breaks in middle (Domdocument, xpath, curl) - help needed

我与影子孤独终老i submitted on 2019-12-25 01:24:49
Question: I am a beginner programmer designing a spider that crawls pages. The logic goes like this: get $url with curl; create a DOM document; parse out the href tags using XPath; store the href attributes in $totalurls (those that aren't already there); update $url from $totalurls. The problem is that after the 10th crawled page the spider says it does not find ANY links on the page, none on the next one, and so on. But if I begin with the page that was 10th in the previous example, it finds all links with no problem but

how to print only one tag with curl

↘锁芯ラ submitted on 2019-12-24 18:44:49
Question: I have 2 or 3 <p> tags on my web page, but I only want to print the first and second <p>. How can I do that? Here is my code: <?php $url = "http://www.web.org/dorama/1401143633/momikeshite-fuyu--wagaya-no-mondai-nakatta-koto-ni"; $ch = curl_init(); $timeout = 5; curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); $html = curl_exec($ch); curl_close($ch); $dom = new DOMDocument(); @$dom->loadHTML($html); foreach($dom-
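
One possible approach, as a minimal sketch rather than a definitive answer: collect the <p> elements with getElementsByTagName() and stop after the first two. The helper name firstParagraphs and the sample markup are hypothetical; in the question the HTML would come from curl_exec().

```php
<?php
// Hypothetical helper: return the text of the first $limit <p> elements.
function firstParagraphs(string $html, int $limit = 2): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        if (count($texts) >= $limit) {
            break; // stop once we have enough paragraphs
        }
        $texts[] = trim($p->textContent);
    }
    return $texts;
}

// Stand-in markup (an assumption); replace with the curl response body.
$html = '<html><body><p>First</p><p>Second</p><p>Third</p></body></html>';
foreach (firstParagraphs($html) as $text) {
    echo $text, PHP_EOL;
}
```

Passing a different $limit returns more or fewer paragraphs without changing the loop.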

how to print only one tag with curl

ε祈祈猫儿з submitted on 2019-12-24 18:42:04
Question: I have 2 or 3 <p> tags on my web page, but I only want to print the first and second <p>. How can I do that? Here is my code: <?php $url = "http://www.web.org/dorama/1401143633/momikeshite-fuyu--wagaya-no-mondai-nakatta-koto-ni"; $ch = curl_init(); $timeout = 5; curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); $html = curl_exec($ch); curl_close($ch); $dom = new DOMDocument(); @$dom->loadHTML($html); foreach($dom-

Fetching all images src from specific div

故事扮演 submitted on 2019-12-24 16:56:02
Question: Suppose I have an HTML structure like: <div> <div class="content"> <p>This is dummy text</p> <p><img src="a.jpg"></p> <p>This is dummy text</p> <p><img src="b.jpg"></p> </div> </div> I want to fetch every image src from the .content div. I tried: <?php // a new dom object $dom = new domDocument; // load the html into the object $dom->loadHTML("example.com/article/2345"); // discard white space $dom->preserveWhiteSpace = false; //get element by class $finder = new DomXPath($dom); $classname =
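
A hedged sketch of one way to do this. Note that loadHTML() expects markup, not a URL, so the snippet in the question parses the literal string "example.com/article/2345"; loadHTMLFile() is the call that reads from a path or URL. The helper name imageSources is hypothetical, and the class name is assumed to be a plain token with no quotes in it.

```php
<?php
// Hypothetical helper: return the src of every <img> inside <div class="$class">.
function imageSources(string $html, string $class): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html); // markup string; use loadHTMLFile() for a URL
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match the class as a whole token, so "content" does not match "content-wide".
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]//img/@src',
        $class
    );

    $sources = [];
    foreach ($xpath->query($query) as $attr) {
        $sources[] = $attr->nodeValue;
    }
    return $sources;
}

// Markup taken from the question.
$html = '<div> <div class="content"> <p>This is dummy text</p> <p><img src="a.jpg"></p>'
      . ' <p>This is dummy text</p> <p><img src="b.jpg"></p> </div> </div>';
print_r(imageSources($html, 'content'));
```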

DOMElement replace HTML value

北慕城南 submitted on 2019-12-24 15:26:06
Question: I have this HTML string in a DOMElement: <h1>Home</h1> test{{test}} I want to replace this content so that only <h1>Home</h1> test remains (that is, I want to remove the {{test}}). At the moment, my code looks like this: $node->nodeValue = preg_replace( '/(?<replaceable>{{([a-z0-9_]+)}})/mi', '' , $node->nodeValue); This doesn't work because nodeValue doesn't contain the HTML value of the node. I can't figure out how to get the HTML string of the node other than using $node->C14N() , but
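
One way to sidestep the HTML string entirely, sketched under assumptions: apply the regular expression to each descendant text node rather than to the element's nodeValue, so child elements such as <h1> are left alone. The helper name stripPlaceholders and the wrapping <div> are hypothetical; the node's markup can then be read back with saveHTML($node).

```php
<?php
// Hypothetical helper: remove {{name}} placeholders from every text node under $node.
function stripPlaceholders(DOMNode $node): void
{
    $xpath = new DOMXPath($node->ownerDocument);
    // './/text()' selects all descendant text nodes relative to $node.
    foreach ($xpath->query('.//text()', $node) as $text) {
        $text->nodeValue = preg_replace('/\{\{[a-z0-9_]+\}\}/i', '', $text->nodeValue);
    }
}

$dom = new DOMDocument();
// The <div> wrapper is an assumption standing in for the question's DOMElement.
$dom->loadHTML('<div><h1>Home</h1> test{{test}}</div>');
$node = $dom->getElementsByTagName('div')->item(0);

stripPlaceholders($node);
$result = $dom->saveHTML($node); // serializes only this node, tags included
echo $result, PHP_EOL;
```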

PHP Split html string into array

大兔子大兔子 submitted on 2019-12-24 15:25:54
Question: I hope I can get some help from you guys. This is what I'm struggling with: I have a string of HTML that will look like this: <h4>Some title here</h4> <p>Lorem ipsum dolor</p> (some other HTML here) <h4>Some other title here</h4> <p>Lorem ipsum dolor</p> (some other HTML here) I need to split all the <h4> from the rest of the content, but the content after the first <h4> and before the second <h4>, for example, needs to be related to the first <h4> , something like this: Array { [0] => <h4>Some
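
A rough sketch of one possible grouping, not necessarily the array shape the asker wanted (the expected output above is cut off): walk the top-level nodes and start a new section at every <h4>. The helper name splitByHeading, the wrapper <div>, and the 'heading'/'content' keys are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: group top-level nodes under the heading that precedes them.
function splitByHeading(string $html, string $tag = 'h4'): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so all of its top-level nodes share one known parent.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $sections = [];
    foreach ($wrapper->childNodes as $child) {
        if ($child->nodeName === $tag) {
            // A heading opens a new section.
            $sections[] = ['heading' => trim($dom->saveHTML($child)), 'content' => ''];
        } elseif ($sections) {
            // Anything else belongs to the most recent heading;
            // nodes before the first heading are dropped.
            $sections[count($sections) - 1]['content'] .= $dom->saveHTML($child);
        }
    }
    foreach ($sections as $i => $section) {
        $sections[$i]['content'] = trim($section['content']);
    }
    return $sections;
}

$html = '<h4>One</h4><p>a</p><h4>Two</h4><p>b</p><p>c</p>';
$sections = splitByHeading($html);
print_r($sections);
```

This only splits on headings that sit at the top level of the fragment; an <h4> nested inside another element stays part of that element's section.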

PHP DOMDocument how to get that content of this tag?

非 Y 不嫁゛ submitted on 2019-12-24 15:25:11
Question: I am using DOMDocument, hoping to parse this small piece of HTML. I am looking for a specific span tag with a specific id . <span id="CPHCenter_lblOperandName">Hello world</span> My code: $dom = new domDocument; @$dom->loadHTML($html); // the @ is to silence errors and misconfigures of HTML $dom->preserveWhiteSpace = false; $nodes = $dom->getElementsByTagName('//span[@id="CPHCenter_lblOperandName"'); foreach($nodes as $node){ echo $node->nodeValue; } But for some reason I think something is wrong
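
A likely cause, offered as a sketch rather than a confirmed diagnosis: getElementsByTagName() takes a tag name such as 'span', not an XPath expression, and the expression in the question is also missing its closing ]. XPath queries go through DOMXPath::query() instead. The helper name textById is hypothetical.

```php
<?php
// Hypothetical helper: return the text of the <span> with the given id, or null.
function textById(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Note the closing bracket; $id is assumed to contain no double quotes.
    $nodes = $xpath->query(sprintf('//span[@id="%s"]', $id));

    return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
}

// Markup taken from the question.
$html = '<span id="CPHCenter_lblOperandName">Hello world</span>';
echo textById($html, 'CPHCenter_lblOperandName'), PHP_EOL;
```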

HTML DOM Document parsing

拜拜、爱过 submitted on 2019-12-24 12:54:04
Question: I am new to DOMDocument. I have this HTML: <tr class="calendar_row" data-eventid="39657"> <td class="alt1 eventDate smallfont" align="center">Sun<div class="eventday_multiple">Dec 9</div></td> <td class="alt1 smallfont" align="center">3:34am</td> <td class="alt1 smallfont" align="center">USD</td> </tr> <tr class="calendar_row" data-eventid="39658"> <td class="alt1 eventDate smallfont" align="center">Sun<div class="eventday_multiple">Dec 10</div></td> <td class="alt1 smallfont" align="center

WebKit2 and DomDocument/JavaScriptCore (Python3)

别等时光非礼了梦想. submitted on 2019-12-24 11:27:59
Question: I am converting a Python3 application to use WebKit2 instead of WebKit (which is no longer available in Debian Buster). In the application the user can (de)select check boxes, whose state I read from the Python3 application. In the original code I could simply get the DomDocument of the Webview and iterate through the child objects to return the value of the object with a given name (sample code below). In WebKit2 the get_dom_document function is no longer available and the WebKit2 documentation is

get a complete table with php domdocument and print it

独自空忆成欢 submitted on 2019-12-24 10:16:08
Question: I would like to get a complete HTML table with id = 'myid' from a given URL using PHP DOMDocument and print it on our web page. How can I do this? I am trying the code below to get the table, but I can't get the trs (table rows), tds (table data) and other inner HTML. $xml = new DOMDocument(); @$xml->loadHTMLFile($url); foreach($xml->getElementById('myid') as $table) { // now how to get tr and td and other element ? // i am getting other element like :- $links = $table->getElementsByTagName(
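
A sketch of one possible fix, under stated assumptions: getElementById() returns a single element (or null), not a list, so it should not be looped over with foreach. The rows and cells can then be read from that element. The helper name tableRows and the sample table are hypothetical; the sketch parses a string, whereas the question loads from a URL with loadHTMLFile().

```php
<?php
// Hypothetical helper: return the cell text of each row in the table with the given id.
function tableRows(DOMDocument $dom, string $id): array
{
    $table = $dom->getElementById($id); // one element or null, not a list
    if ($table === null) {
        return [];
    }

    $rows = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            // Keep only real cells; skip whitespace text nodes between them.
            if ($cell->nodeName === 'td' || $cell->nodeName === 'th') {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Stand-in markup (an assumption); with a URL, call $dom->loadHTMLFile($url) instead.
$html = '<table id="myid"><tr><th>Name</th><th>Qty</th></tr>'
      . '<tr><td>Apple</td><td>3</td></tr></table>';
$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$rows = tableRows($dom, 'myid');
print_r($rows);
```

To print the complete table with its inner markup rather than the extracted text, pass the element to saveHTML(), for example echo $dom->saveHTML($dom->getElementById('myid'));.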