html-content-extraction

Extracting Information from websites

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-23 04:52:35
Question: Not every website exposes its data well with XML feeds, APIs, etc. How could I go about extracting information from a website? For example:

... <div> <div> <span id="important-data">information here</span> </div> </div> ...

I come from a background of Java programming and coding with Apache XMLBeans. Is there anything similar for parsing HTML, when I know the structure and the data is between a known tag? Thanks

Answer 1: There are several open-source HTML parsers out there for Java. I have used
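The answer is cut off before it names a parser. As a rough sketch of the select-by-known-id idea (shown with Python's BeautifulSoup purely for illustration, since the truncated answer names no library; Java parsers such as jsoup offer the same kind of lookup):

from bs4 import BeautifulSoup

html = """
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="important-data")   # look the element up by its known id
if span is not None:
    print(span.get_text(strip=True))            # -> information here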

Extracting the body text of an HTML document using PHP

∥☆過路亽.° submitted on 2019-12-22 08:34:05
Question: I know it's better to use DOM for this purpose, but let's try to extract the text this way:

<?php
$html = <<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
if (empty($matches))
    exit;
$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];
$index_of_body_end_tag = strpos($html, '</body>');
$body = substr( $html, $index_of_body_start_tag + strlen($matched_body_start_tag), $index
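The PHP snippet above is cut off. As a hedged sketch of the DOM-based route the asker concedes is the better tool (illustrated with Python's BeautifulSoup rather than PHP's DOMDocument, so treat it as an analogy, not the PHP answer):

from bs4 import BeautifulSoup

html = "<html><head></head><body><p>Some text</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
body = soup.body                          # the parsed <body> element
if body is not None:
    print(body.get_text(" ", strip=True))  # -> Some text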

Python HTML scraping

和自甴很熟 submitted on 2019-12-21 05:48:19
Question: It's not really scraping; I'm just trying to find the URLs in a web page where the class has a specific value. For example:

<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">

I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code? I'm guessing HTML scraping libs, such as BeautifulSoup, are a bit of overkill just for this... Huge thanks!

Answer 1: Regex is usually a bad idea; try using BeautifulSoup. Quick example:

html = #get html
soup
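The answer's example is truncated. A minimal sketch of how that BeautifulSoup approach could be completed (the code below is an assumption, not the original answer):

from bs4 import BeautifulSoup

html = '<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">link</a>'

soup = BeautifulSoup(html, "html.parser")
# collect the href of every <a> whose class is "myClass"
hrefs = [a["href"] for a in soup.find_all("a", class_="myClass") if a.has_attr("href")]
print(hrefs)   # ['/url/7df028f508c4685ddf65987a0bd6f22e']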

How do I save a web page, programmatically?

旧巷老猫 submitted on 2019-12-19 07:11:16
Question: I would like to save a web page programmatically. I don't mean merely save the HTML; I would also like to automatically store all associated files (images, CSS files, maybe embedded SWF, etc.) and, hopefully, rewrite the links for local browsing. The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.

Answer 1: Take a look at wget, specifically the -p flag:

-p, --page-requisites
    This option causes Wget to download all the files
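The man-page excerpt is cut off. As a small hedged note, -p (--page-requisites) together with -k (--convert-links) covers both halves of the question: downloading the associated files and rewriting links for local browsing. If the bookmarks application wants to drive this from code, one way (assuming wget is installed; the URL is hypothetical) is:

import subprocess

url = "https://example.com/some-article"   # hypothetical bookmarked URL

# -p : download page requisites (images, CSS, etc.)
# -k : convert links in the saved copy so it browses locally
subprocess.run(["wget", "-p", "-k", url], check=True)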

BeautifulSoup - easy way to obtain HTML-free contents

帅比萌擦擦* submitted on 2019-12-18 13:23:12
Question: I'm using this code to find all the interesting links in a page:

soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+'))

And it does its job pretty well. Unfortunately, inside that a tag there are a lot of nested tags, like font, b and other things... I'd like to get just the text content, without any other HTML tags. Example of a link:

<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:  <font color=green>CCS Ingegneria Elettronica
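The example link is truncated, but the usual fix is get_text() on the matched element, which returns only the text and drops the nested FONT/B tags. A minimal sketch (assumed, not taken from an answer; the closing tags of the example are filled in only to make it parseable):

import re
from bs4 import BeautifulSoup

html = ('<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();">'
        '<FONT CLASS="v12"><B>03-11-2009: <font color=green>CCS Ingegneria Elettronica'
        '</font></B></FONT></A>')

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=re.compile(r"^notizia\.php\?idn=\d+")):
    # get_text() flattens the nested tags and keeps only the visible text
    print(a.get_text(strip=True))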

Parse a .Net Page with Postbacks

和自甴很熟 submitted on 2019-12-18 07:26:07
Question: I need to read data from an online database that's displayed using an aspx page from the UN. I've done HTML parsing before, but it was always by manipulating query-string values. In this case, the site uses ASP.NET postbacks. So, you click on a value in box one, then box two shows; you click on a value in box two and click a button to get your results. Does anybody know how I could automate that process? Thanks, Mike

Answer 1: You may still only need to send one request, but that one request can be
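The answer is cut off, but its opening point, that one request may be enough, usually means reproducing the postback as a single POST that echoes the page's hidden ASP.NET fields. A hedged sketch in Python with requests (the URL, control names and values below are hypothetical; only the __VIEWSTATE/__EVENTVALIDATION/__EVENTTARGET field names are standard ASP.NET):

import requests
from bs4 import BeautifulSoup

URL = "https://example.un.org/data.aspx"        # hypothetical page

session = requests.Session()
soup = BeautifulSoup(session.get(URL).text, "html.parser")

def hidden(name):
    # read one of ASP.NET's hidden state fields out of the fetched form
    tag = soup.find("input", {"name": name})
    return tag.get("value", "") if tag else ""

payload = {
    "__VIEWSTATE": hidden("__VIEWSTATE"),
    "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
    "__EVENTTARGET": "",
    "ddlBoxOne": "some value",                  # hypothetical control names/values
    "ddlBoxTwo": "another value",
    "btnGetResults": "Get results",
}
results = session.post(URL, data=payload)       # the "postback", reduced to one POST
print(results.status_code)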

Using BeautifulSoup to find an HTML tag that contains certain text

丶灬走出姿态 submitted on 2019-12-17 06:38:07
Question: I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So, the previous would match by using:

soup('h2', text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I'm able to get all the text that matches (see the line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document
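A minimal sketch of one common way to get from the matched text back to its enclosing element (assumed, not from an answer): search on the text alone, then step up with .parent:

import re
from bs4 import BeautifulSoup

html = "<h2> this is cool #12345678901 </h2>"
soup = BeautifulSoup(html, "html.parser")

for match in soup.find_all(text=re.compile(r"#\S{11}")):
    parent = match.parent                     # the element wrapping the matched text
    print(parent.name, parent.get_text(strip=True))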

Extracting text fragment from an HTML body (in .NET)

不想你离开。 submitted on 2019-12-13 17:18:25
Question: I have HTML content which is entered by a user via a rich-text editor, so it can be almost anything (except for things that don't belong in a body, so no worries about "head" or doctype, etc.). An example of this content:

<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />

The trick is, I need to
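The question is cut off before it states the actual requirement, so only the title can be leaned on. A hedged sketch of extracting the plain text of such a fragment, shown with BeautifulSoup as a stand-in for a .NET HTML parser such as the HTML Agility Pack:

from bs4 import BeautifulSoup

fragment = """
<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />
"""

soup = BeautifulSoup(fragment, "html.parser")
# join the visible text of the fragment, dropping every tag
print(soup.get_text(" ", strip=True))   # -> Header 1 Some text here ...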

How do I extract HTML content using Regex in PHP

只愿长相守 submitted on 2019-12-13 09:54:26
Question: I know, I know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, and I can store regexes in the database for each website. I'm not sure how XML parsers would work with multiple websites; you'd need a separate function for each website. In any case, I don't know much about regexes, so bear with me. I've got an HTML page in a format similar to this:

<html>
<head>...</head>
<body>
<div class=nav>...</div><p id="someshit" />
<div class=body>....
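As a hedged aside on the "separate function for each website" worry: if a per-site CSS selector is stored instead of a per-site regex, one generic function can serve every site. A sketch (the site names and selectors are hypothetical, and it is written in Python rather than PHP):

from bs4 import BeautifulSoup

# hypothetical per-site selectors; in practice these would live in the database
SITE_SELECTORS = {
    "example.com": "div.body",
    "othersite.org": "div#article",
}

def extract_article(site, html):
    # one generic function: only the stored selector differs per website
    node = BeautifulSoup(html, "html.parser").select_one(SITE_SELECTORS[site])
    return node.get_text(" ", strip=True) if node else ""

print(extract_article("example.com",
                      "<html><body><div class=nav>menu</div>"
                      "<div class=body>The article text.</div></body></html>"))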

RCurl getURLContent detect content type through final redirect

萝らか妹 submitted on 2019-12-13 04:46:59
Question: This is a follow-up question to RCurl getURL with loop - link to a PDF kills looping: I have the following getURL command:

require(RCurl)
# set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent = "Firefox/23.0"
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  httpauth = 1L,  # "basic" http authorization version -- this seems to make a
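The option list above is truncated. As a hedged illustration of the goal named in the title, detecting the content type after the final redirect, here is the equivalent idea in Python with requests (not the RCurl answer; the URL is hypothetical):

import requests

url = "https://example.org/report"              # hypothetical link that may redirect to a PDF

resp = requests.get(url, allow_redirects=True)  # follow the redirect chain to the end
print(resp.url)                                 # final URL after all redirects
print(resp.headers.get("Content-Type"))         # e.g. application/pdf vs text/html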