C# scrape correct web content following jquery

Question

I've been using HtmlAgilityPack for a while, but the site I'm working with now seems to put the browser through some jQuery-driven step before the real page appears. I expect a product page to load, but what actually loads (verified with both a WebBrowser control and WebClient.DownloadString) is a redirect page asking the visitor to select a consultant and sign up with them.

In other words, using Chrome's Inspect >> Elements tool, I get:

<div data-v-1a7a6550="" class="product-extra-images">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">

But WebBrowser and HtmlAgilityPack only get:

<div class="container content">
  <div class="alert alert-danger " role="alert">
    <button type="button" class="close" data-dismiss="alert">
      <span aria-hidden="true">&times;</span>
    </button>
    <h2 style="text-align: center; background: none; padding-bottom: 0;">It looks like you haven't selected a Consultant yet!</h2>
    <p style="text-align: center;"><span>...were you just wanting to browse or were you looking to shop and pick a Consultant to shop under?</span></p>
      <div class="text-center">
        <form action="/just-browsing/" method="POST" class="form-inline">
   ...

After digging into the class definitions in the head, I found that the page uses jQuery both to load the content properly and to handle actions (scrolling, resizing, hovering over images, selecting other images, etc.) while the visitor browses the page. Here is the header comment of the jQuery file:

/*!
* jQuery JavaScript Library v2.1.4
* http://jquery.com/
*
* Includes Sizzle.js
* http://sizzlejs.com/
*
* Copyright 2005, 2014 jQuery Foundation, Inc. and other contributors
* Released under the MIT license
* http://jquery.org/license
*
* Date: 2015-04-28T16:01Z
*/

I tried ScrapySharp as described in "C# .NET: Scraping dynamic (JS) websites", but that just consumed all available memory and never produced anything.

I also tried the approach from "htmlagilitypack and dynamic content issue", but it loaded the same incorrect redirect page noted above.

I can provide more of the source I'm trying to extract from, including the complete jQuery if needed.

Answer 1:

Set CaptureRedirect = false to bypass the redirection page. This worked for me with the page you mentioned:

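// Requires: using System; using HtmlAgilityPack;
var web = new HtmlWeb();
// Do not capture the redirect to the "select a Consultant" page.
web.CaptureRedirect = false;
// Give the background browser up to 15 seconds to render the page.
web.BrowserTimeout = TimeSpan.FromSeconds(15);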

Then have LoadFromBrowser keep waiting until the text "Product Description" appears in the rendered page:

// The callback receives the current HTML; returning true means the page is ready.
var doc = web.LoadFromBrowser(url, html => html.Contains("Product Description"));

Recent versions of HtmlAgilityPack can run a browser in the background, so we don't really need another library such as ScrapySharp to scrape dynamic content.
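
Once the rendered document is available, the thumbnails shown in the question could be pulled out with an XPath query. The following is only a minimal sketch: the ProductImages class and ExtractImageUrls helper are hypothetical names, and the Main method parses a hard-coded copy of the markup from the question with LoadHtml as a stand-in for the document that LoadFromBrowser would return.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ProductImages
{
    // Hypothetical helper: returns the src of every <img> inside a div whose
    // class contains "product-extra-images", or an empty list if none is found.
    public static List<string> ExtractImageUrls(HtmlDocument doc)
    {
        var urls = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(@class, 'product-extra-images')]//img");
        if (nodes == null)
        {
            return urls;
        }

        foreach (var img in nodes)
        {
            var src = img.GetAttributeValue("src", "");
            if (src.Length > 0)
            {
                urls.Add(src);
            }
        }
        return urls;
    }

    public static void Main()
    {
        // Stand-in for the rendered page: markup copied from the question.
        var html =
            "<div class='product-extra-images'>" +
            "<img src='https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg' width='50'>" +
            "<img src='https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg' width='50'>" +
            "</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var url in ExtractImageUrls(doc))
        {
            Console.WriteLine(url);
        }
    }
}
```

In real use, the doc returned by web.LoadFromBrowser(...) would be passed to ExtractImageUrls instead of the hard-coded markup. The null check matters because the redirect page contains no product-extra-images div at all.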

Source: https://stackoverflow.com/questions/52971088/c-sharp-scrape-correct-web-content-following-jquery

Tags

jquery

html-agility-pack

scrapysharp