Question
We are moving an e-commerce website to a new platform. Because all of their pages are static HTML and they do not have their product information in a database, we must scrape the current website for the product descriptions.
Here is one of the pages: http://www.cabinplace.com/accrugsbathblackbear.htm
What is the best way to get the description into a string? Should I use the HTML Agility Pack, and if so, how would this be done? I am new to the HTML Agility Pack and to XHTML in general.
Thanks
Answer 1:
The HTML Agility Pack is a good library to use for this kind of work.
You did not indicate whether all of the content is structured this way, or whether you have already extracted the kind of fragment you posted from the HTML files, so it is difficult to advise further.
In general, if all pages are structured similarly, I would use an XPath expression to extract the paragraph and take its InnerHtml or InnerText from each page.
Something like the following:
var description = htmlDoc.DocumentNode.SelectNodes("//p[@class='content_txt']")[0].InnerText;
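Putting that together, a minimal sketch might look like the following. It assumes the description lives in a <p class="content_txt"> element, as in the snippet above; adjust the XPath to match the actual markup of the pages you are scraping.

using System;
using HtmlAgilityPack;

class DescriptionScraper
{
    static string GetDescription(string url)
    {
        // HtmlWeb downloads and parses the page in one step.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(url);

        // Select the first paragraph with the assumed "content_txt" class.
        HtmlNode node = doc.DocumentNode
            .SelectSingleNode("//p[@class='content_txt']");

        // SelectSingleNode returns null when nothing matches, so guard against it.
        return node != null ? node.InnerText.Trim() : string.Empty;
    }

    static void Main()
    {
        string description = GetDescription("http://www.cabinplace.com/accrugsbathblackbear.htm");
        Console.WriteLine(description);
    }
}

If the pages are already saved locally, you can load them with doc.Load(filePath) on an HtmlDocument instead of using HtmlWeb.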
Answer 2:
Also, if you need a good tool for testing or finding the XPath for the HAP, you can use this one: HTML-Agility-xpath-finder. It is built with the same library, so if an XPath works in this tool you can safely use it in your code.
Source: https://stackoverflow.com/questions/6143621/c-sharp-html-agility-pack