Pulling price from Amazon RSS feed embedded in description

Submitted by 喜夏-厌秋 on 2020-01-06 02:37:04

Question


I am working on an RSS reader that pulls data from an Amazon RSS feed of books. I am using C# with the .NET Compact Framework 3.5. I can get the title of the book, the publication date, and so on from the nodes in the RSS feed. However, the price of the book is embedded in a whole heap of HTML in the description node. How would I go about extracting only the price rather than the whole load of HTML?

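// Runs inside a loop over the child nodes of the channel element.
if (nodeChannel.ChildNodes[i].Name == "item")
{
    nodeItem = nodeChannel.ChildNodes[i];
    row = new ListViewItem();
    row.Text = nodeItem["title"].InnerText;
    // The description holds the full HTML blob, not just the price.
    row.SubItems.Add(nodeItem["description"].InnerText);
    listBooks.Items.Add(row);
}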

An example of the price in the middle of the description node:

<description><![CDATA[    <div class="hreview" style="clear:both;">  <div class="item">        <div style="float:left;" class="tgRssImage"><a class="url" href="https://rads.stackoverflow.com/amzn/click/com/B0013FDM7E" rel="nofollow noreferrer"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0" /></a></div>    <span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span>  </div>  <div class="description">    <br />    <span style="display: block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a href="https://rads.stackoverflow.com/amzn/click/com/B0013FDM7E" rel="nofollow noreferrer">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">285 used and new</a> from <span class="tgProductPrice">$1.00</span></span><br /></span>    <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0" /><br /></span>    <br />    <span class="tgRssProductTag"></span>    <span class="tgRssAllTags">Customer tags: <a href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(92), <a href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(79), <a href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(51), <a href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(43), <a href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(34), <a 
href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(14), <a href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(6), <a href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a href="http://www.amazon.com/tag/mutants/ref=tag_rss_rs_itdp_item_at">mutants</a>(4)<br /></span>  </div></div>]]></description>

The price, $5.49, is in that mess somewhere.


Answer 1:


It could be a dumb idea, but how about doing a string search for class="tgProductPrice"> and then extracting the following characters until you hit the end tag </span>?

You don't need to load any HTML; you already have it in the description.

Will that work for you?
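
The string search described in this answer could look something like the sketch below. It is only an illustration, not a definitive implementation: PriceExtractor and ExtractFirstPrice are hypothetical names, the marker text is copied from the sample description in the question, and the code assumes the feed keeps emitting exactly that markup. It returns only the first tgProductPrice span (the "Buy new" price in the sample).

```csharp
using System;

static class PriceExtractor
{
    // Marker copied from the sample description; an assumption about the feed's markup.
    private const string Marker = "class=\"tgProductPrice\">";

    // Hypothetical helper: returns the text of the first tgProductPrice span,
    // or null if the marker or the closing </span> cannot be found.
    public static string ExtractFirstPrice(string descriptionHtml)
    {
        if (descriptionHtml == null) return null;

        // Find the marker, then move to the first character after it.
        int start = descriptionHtml.IndexOf(Marker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += Marker.Length;

        // The price runs up to the next closing span tag.
        int end = descriptionHtml.IndexOf("</span>", start, StringComparison.Ordinal);
        if (end < 0) return null;

        return descriptionHtml.Substring(start, end - start).Trim();
    }
}
```

In the loop from the question, the call would be something like PriceExtractor.ExtractFirstPrice(nodeItem["description"].InnerText), with a null check before adding the result to the row.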




Answer 2:


That description looks really bad, and if you have no way of getting a different version of that RSS feed, I think the only solution is to parse the HTML that you have in the description.

For that, you could use the HTML Agility Pack (I haven't used it, but it's the recommended solution for HTML parsing from .NET), or use a regular expression or text search to find that tag and extract the price. The latter feels a bit hacky to me, and could force many changes if the RSS feed changes.

Edit: I did the string search combined with regex a while back and it was a nightmare to maintain, but considering your case and that it's for only one value, it might be OK.
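
For the regular expression route, a minimal sketch might look like the following. PriceRegex and ExtractAllPrices are hypothetical names, and the pattern assumes the span is written exactly as in the sample description (a single class attribute with the value tgProductPrice), so it would need updating if Amazon changes the markup. It collects every matching span in document order, which for the sample means the new price followed by the used price.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PriceRegex
{
    // Captures the text inside each <span class="tgProductPrice">...</span>.
    // The pattern is an assumption based on the sample markup in the question.
    private static readonly Regex PricePattern = new Regex(
        "<span\\s+class=\"tgProductPrice\">\\s*([^<]+?)\\s*</span>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns all prices found, in document order.
    // An empty list means no tgProductPrice span was present.
    public static List<string> ExtractAllPrices(string descriptionHtml)
    {
        var prices = new List<string>();
        foreach (Match m in PricePattern.Matches(descriptionHtml ?? string.Empty))
        {
            prices.Add(m.Groups[1].Value);
        }
        return prices;
    }
}
```

As the answer warns, this is brittle: it is tied to one class name and one attribute layout, which is tolerable for a single value but not a substitute for a real HTML parser.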




Answer 3:


using CsQuery; // install CsQuery from NuGet

// Load the page whose URL was typed into textBox1.
string path = textBox1.Text;
var dom = CQ.CreateFromUrl(path);
// priceblock_ourprice is the id of the span where the price is written.
string price = dom.Select("#priceblock_ourprice").Text();
label1.Text = price;


Source: https://stackoverflow.com/questions/4285430/pulling-price-from-amazon-rss-feed-embedded-in-description
