html-parsing | 易学教程

Web Scraping in R with loop from data.frame

Question: I have the following code:

    library(rvest)

    df <- data.frame(Links = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"))

    for (i in 1:3) {
      webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
      data <- webpage %>%
        html_nodes(".specs") %>%
        .[[1]] %>%
        html_table(fill = TRUE)
    }

I want the loop to work for all three values in df$Links, but the code above keeps only the last one. The downloaded data should also be identifiable by the value it came from (perhaps via a new column holding that name).

Answer 1: The problem is in how
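
The excerpt ends before the answer's explanation, so the following is only a guess at the usual cause: `data` is reassigned on every pass, so only the last iteration survives. A minimal sketch of the accumulate-and-label pattern, written in Python rather than the question's R; `fetch_specs` is a hypothetical stand-in for the download-and-parse step and returns fixed rows so the example runs offline.

```python
def fetch_specs(link):
    """Hypothetical stand-in for downloading and parsing one page.

    A real version would request the page for `link` and extract the
    first ".specs" table; fixed rows are returned here instead.
    """
    return [
        {"field": "Model", "value": link},
        {"field": "Source", "value": "stub"},
    ]


def scrape_all(links):
    all_rows = []  # created once, outside the loop, so nothing is overwritten
    for link in links:
        for row in fetch_specs(link):
            # Tag each row with the link it came from (the "new column").
            all_rows.append({"Links": link, **row})
    return all_rows


links = ["Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"]
rows = scrape_all(links)
print(len(rows), "rows collected from", len(links), "links")
```

The same idea should carry over to R by storing each table in a list element and binding the list together after the loop.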

How can I retrieve and parse just the HTML returned from a URL?

Question: I want to be able to programmatically (without anything displaying in the browser) request a URL such as http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=platypi&sprefix=platypi%2Caps&rh=i%3Aaps%2Ck%3Aplatypi and get back, in a string (or some more appropriate data type), the HTML of the resulting page (the interesting part, anyway), so that I can parse it and reformat selected parts as matched text and images (which link to the appropriate page). I want to do
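
The excerpt does not name a language or include an answer, so this is only an illustrative sketch in Python using the standard library. `fetch_html`, `extract`, the collector class, and the User-Agent string are all made-up names for this example; the parsing step is demonstrated on an inline sample so that it runs without network access.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkAndImageCollector(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def fetch_html(url, timeout=10):
    """Download a page and return its body as text; no browser is involved."""
    request = Request(url, headers={"User-Agent": "example-fetcher/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Return (links, images) found in an HTML string."""
    collector = LinkAndImageCollector()
    collector.feed(html)
    return collector.links, collector.images


# Offline demonstration; in practice the string would come from fetch_html(url).
sample = '<p><a href="/item/1"><img src="/img/1.jpg" alt="x"></a></p>'
links, images = extract(sample)
print(links, images)
```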

file_get_html() returns garbage

Question: I am using the simple_html_dom parser. The following code returns garbage output:

    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
                        "Accept-Encoding: gzip, deflate\r\n" .
                        "Accept-language: en-US,en;q=0.5\r\n" .
                        "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n" .
                        "Cookie: foo=bar\r\n"
        )
    );
    $context = stream_context_create($opts);
    $html = file_get_html
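
No answer is included in the excerpt. One plausible cause, stated here as an assumption, is the `Accept-Encoding: gzip, deflate` header: a server that honors it may return a compressed body, which looks like garbage if it reaches the parser without being decompressed. The Python sketch below only illustrates that effect; `decode_body` is a hypothetical helper, not part of simple_html_dom.

```python
import gzip

page = "<html><body><h1>Hello</h1></body></html>"

# What a server may send when the client advertises gzip support.
compressed = gzip.compress(page.encode("utf-8"))

# Treating the compressed bytes as text yields unreadable output.
as_text = compressed.decode("latin-1")


def decode_body(body, content_encoding):
    """Return the body as text, decompressing first if the server used gzip."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return body.decode("utf-8")


restored = decode_body(compressed, "gzip")
print(restored)
```

If this is indeed the cause, either dropping the `Accept-Encoding` header or decompressing the response before parsing should give readable HTML.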

Replacing a HTML div InnerText tag using HTML Agility Pack

Question: I'm using the HTML Agility Pack to manipulate and edit an HTML document. I want to change the text in an element such as this:

    <div id="Div1"><b>Some text here.</b><br></div>

I want to update the text within this div to:

    <div id="Div1"><b>Some other text.</b><br></div>

I tried the following code, but it does not work because the InnerText property is read-only:

    HtmlTextNode hNode = null;
    hNode = hDoc.DocumentNode.SelectSingleNode("//div[@id='Div1']") as

php preg_replace for property inside html tags

Question: My problem is how to replace the src value of a <script> tag inside a string, as in this example (I actually need this for the more general case of properties inside tags):

    $data = <<<EOD
    <script language="javascript" src= "../tests/ajax-navigation.js"></script>
    ...
    <img src="../404.jpg" alt="404">
    ...
    EOD;

I used this function in PHP:

    class Search {
        public static function replaceProperty($data, $start, $end, $property, $alias, $limit = -1) {
            // get blocks formed as: $start $property = "..." $end
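
The question's function is cut off, so the following is not a reconstruction of it. It is a small Python sketch of the general idea: replace the quoted value of one attribute on one kind of tag with a regular expression. `replace_attribute` and the replacement path `/js/nav.js` are invented for the example, and a regex like this is fragile on unusual markup (for instance, it would also match an attribute named `data-src`), so a real HTML parser is usually the safer tool.

```python
import re


def replace_attribute(html, tag, attribute, new_value, count=0):
    """Replace the quoted value of `attribute` on every `<tag ...>` in `html`.

    count=0 means "replace all matches", mirroring re.sub.
    """
    pattern = re.compile(
        r"(<" + re.escape(tag) + r"\b[^>]*?\b" + re.escape(attribute) + r"\s*=\s*)"
        r"([\"'])(.*?)\2",
        re.IGNORECASE | re.DOTALL,
    )
    # A function replacement avoids backslash-escape surprises in new_value.
    return pattern.sub(
        lambda m: m.group(1) + m.group(2) + new_value + m.group(2),
        html,
        count=count,
    )


data = (
    '<script language="javascript" src= "../tests/ajax-navigation.js"></script>\n'
    '<img src="../404.jpg" alt="404">'
)
result = replace_attribute(data, "script", "src", "/js/nav.js")
print(result)
```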

The best way to parse HTML tags in JavaScript

Question: Can anybody advise whether there is a way to parse the HTML tags that appear inside the <body>...</body> tags?

Answer 1: I suppose you want to parse an HTML document using PHP. I suggest you read the DOM documentation at http://www.php.net/manual/en/book.dom.php. Here is an example provided by PHP Pro: <?php $html = ' <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" dir="ltr"> <head> <title>PHPRO.ORG<
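
The answer's PHP example is truncated, so it is not completed here. As a separate illustration of the same task, this Python sketch walks a document and records only the tags that appear between `<body>` and `</body>`. The class name and the sample document are assumptions made for the example.

```python
from html.parser import HTMLParser


class BodyTagCollector(HTMLParser):
    """Record (tag, attributes) pairs for start tags inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


document = (
    "<html><head><title>Sample</title></head>"
    '<body><h2>Heading</h2><p class="intro">Text</p></body></html>'
)
collector = BodyTagCollector()
collector.feed(document)
print(collector.tags)
```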

DOMDocument remove script tags from HTML source

Question: I used @Alex's approach here to remove script tags from an HTML document using the built-in DOMDocument. The problem is that if I have a script tag with inline JavaScript content followed by another script tag that links to an external JavaScript source file, not all script tags are removed from the HTML.

    $result = '
    <!doctype html>
    <html>
    <head>
    <meta charset="utf-8">
    <title> hey </title>
    <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
    <script>
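
The excerpt shows neither the removal loop nor an answer. A frequent cause of "some nodes survive" symptoms, offered here only as an assumption, is removing items from a collection while iterating over that same collection, which makes the iteration skip elements. The Python sketch below reproduces the effect with a plain list and shows the snapshot workaround; it is an analogy, not DOMDocument code.

```python
def remove_all_forward(items):
    """Buggy pattern: remove from the collection that is being iterated."""
    items = list(items)
    for item in items:
        items.remove(item)  # later items shift left, so the next one is skipped
    return items


def remove_all_snapshot(items):
    """Safer pattern: iterate over a copy and remove from the original."""
    items = list(items)
    for item in list(items):
        items.remove(item)
    return items


scripts = ["inline-script", "external-script", "another-script"]
print(remove_all_forward(scripts))   # one element is left behind
print(remove_all_snapshot(scripts))  # everything is removed
```

With a live node list the equivalent workarounds are typically to copy the nodes into an array first or to iterate from the last index down to the first.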