Question
I am learning to write a web crawler and have found some great examples to get me started, but since I am new to this, I have a few questions about the coding approach.
An example search result can be found here: Search Result
When I look at the HTML source for the result I can see the following:
<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B> Not applicable in this profession <BR>
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>
How can I use HtmlAgilityPack to scrape this data from the site?
I was trying to adapt the example shown below, but I am not sure what to change to get it to crawl this page:
private void btnCrawl_Click(object sender, EventArgs e)
{
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        filename = Path.GetFileNameWithoutExtension( ie.FullName ).ToLower();
        if ( filename.Equals( "iexplore" ) )
            txtURL.Text = "Now Crawling: " + ie.LocationURL.ToString();
    }
    string url = ie.LocationURL.ToString();
    string xmlns = "{http://www.w3.org/1999/xhtml}";
    Crawler cl = new Crawler(url);
    XDocument xdoc = cl.GetXDocument();
    var res = from item in xdoc.Descendants(xmlns + "div")
              where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
              && item.Element(xmlns + "a") != null
              //select item;
              select new
              {
                  Link = item.Element(xmlns + "a").Attribute("href").Value,
                  Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
                  Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
                  Desc = item.Elements(xmlns + "p").ElementAt(1).Value
              };
    foreach (var node in res)
    {
        MessageBox.Show(node.ToString());
        tb.Text = node + "\n";
    }
    //Console.ReadKey();
}
The Crawler helper class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;

namespace CrawlerWeb
{
    public class Crawler
    {
        public string Url
        {
            get;
            set;
        }

        public Crawler() { }

        public Crawler(string Url)
        {
            this.Url = Url;
        }

        public XDocument GetXDocument()
        {
            HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
            doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
            HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
            doc2.OptionOutputAsXml = true;
            doc2.OptionAutoCloseOnEnd = true;
            doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
            XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
            return xdoc;
        }
    }
}
tb is a multiline textbox, so I would like it to display the following:
Name                              WILLIAMS AJAYA L
Address                           NEW YORK NY
Profession                        ATHLETIC TRAINER
License No                        001475
Date of Licensure                 1/12/07
Additional Qualification          Not applicable in this profession
Status                            REGISTERED
Registered through last day of    08/15
I would like the second value of each pair to be added to an array, because the next step would be to write it to a SQL database.
I am able to get the URL of the search result from IE, but how can I code the extraction in my script?
Answer 1:
This little snippet should get you started:
HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=WIL");
doc.LoadHtml(html);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
You basically use the WebClient class to download the HTML file and then load that HTML into the HtmlDocument object. Then you use XPath to query the DOM tree and search for nodes. In the above example, "nodes" will include all the div elements in the document.
Here's a quick reference about the XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
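For the markup shown in the question, the labels sit in b elements and each value is the plain text that follows the closing tag, so a query for //b is probably more useful than //div. Below is a minimal sketch of that idea, assuming the live page matches the sample HTML in the question. LicenseParser and Parse are hypothetical names chosen for illustration, not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper (name chosen for illustration only).
public static class LicenseParser
{
    // Returns (label, value) pairs in document order.
    // Assumes each label is a <b> element directly followed by a text node.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//b");
        if (labels == null)
            return fields;

        foreach (HtmlNode label in labels)
        {
            // The value is the text node that follows the closing </b>.
            HtmlNode valueNode = label.NextSibling;
            if (valueNode == null || valueNode.NodeType != HtmlNodeType.Text)
                continue;

            // InnerText also covers nested tags, e.g. the <a> inside the Status label.
            string name = HtmlEntity.DeEntitize(label.InnerText).Trim().TrimEnd(':').Trim();
            string value = HtmlEntity.DeEntitize(valueNode.InnerText).Trim();

            fields.Add(new KeyValuePair<string, string>(name, value));
        }

        return fields;
    }
}
```

Passing the downloaded html string to LicenseParser.Parse should give pairs such as Name / WILLIAMS AJAYA L. To fill the textbox you could append each Key and Value on its own line, and with System.Linq the values alone can be collected via fields.Select(f => f.Value).ToArray() before writing them to the database.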
Source: https://stackoverflow.com/questions/24018750/how-to-use-htmlagilitypack-to-extract-html-data