How do I read a secure RSS feed into a SyndicationFeed without providing credentials?

深忆病人 2021-01-06 16:30

For whatever reason, IBM uses HTTPS (without requiring credentials) for their RSS feeds. I'm trying to consume https://www.ibm.com/developerworks/mydeveloperworks/blogs/rol

1 answer
  • 2021-01-06 16:43

    I don't think it has anything to do with security. A 500 is a server-side error: something in the request generated by XmlReader.Create(url) is confusing the IBM website. If it were simply a security issue, as your question suggests, you'd expect a 401 (Unauthorized) or 403 (Forbidden) response. But you got a 500, which is an application error.

    Even so, maybe there's something the client app can do, to avoid confusing the server.

    I looked at the outgoing HTTP request headers, using Fiddler. For a request generated by IE, the headers look like this:

    GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
    Accept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-silverlight, application/x-shockwave-flash, application/x-silverlight-2-b2, */*
    Accept-Language: en-us
    User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; .NET CLR 3.5.30729;)
    Accept-Encoding: gzip, deflate
    Host: www.ibm.com
    Connection: Keep-Alive
    Cookie: UnicaNIODID=Ww06gyvyPpZ-WPl6K7y; conxnsCookie=en; IBMPOLLCOOKIE=""; UnicaNIODID=QridYHCNf7M-WYM8Usr
    

    For a request from XmlReader.Create(url), the headers look like this:

    GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
    Host: www.ibm.com
    Connection: Keep-Alive
    

    Quite a difference. Also, the 500 response to the latter request carried a Set-Cookie header, which wasn't present in the response to IE.

    Based on that I theorized that it was the difference in request headers, in particular the cookie, that was confusing ibm.com.


    I don't know how to convince XmlReader.Create() to embed all the request headers I wanted, including the cookie. But I know how to do that with an HttpWebRequest. So I used that.

    There were a few hurdles I had to clear.

    1. I needed the persistent cookie for ibm.com. For that I had to resort to a p/invoke of the Win32 InternetGetCookie. See the PersistentCookies class attached in the user-contributed content at the bottom of the doc page for WebRequest, for how to do that. After attaching the cookie, I was no longer getting 500 errors. Hooray!

    2. But the resulting stream could not be read by XmlReader.Create(). It looked binary to me. I realized I needed to decompress the gzip or deflate content. One way is to wrap a GZipStream or DeflateStream around the received response stream and hand the decompressing stream to XmlReader. The simpler way is to set the AutomaticDecompression property on HttpWebRequest, which is what I ended up doing. I could have avoided the need for this by not including "gzip, deflate" in the Accept-Encoding header of the outbound request. In fact, once AutomaticDecompression is set, that header is added implicitly to the outbound HTTP request.

    3. When I did that, I got actual text, but some of the characters were wrong. Next I needed to use the proper text encoding in the TextReader, as indicated by the HttpWebResponse (its CharacterSet property).

    4. After doing that, I got a sensible string, but the resulting decompressed rss stream caused the XmlReader to choke, with

      ReadElementString method can only be called on elements with simple or empty content. Line 11, position 25.

      I looked and found a small <script> block at that location, within the <copyright> element in the RSS document. It seems IBM is trying to get the browser to "localize" the copyright date by attaching logic that would run in the browser to format the date. Seems like overkill to me, or even a bug on IBM's side. The feed parser expects <copyright> to contain only text, so the nested <script> element made the XmlReader choke. I removed the script block with a Regex replace.
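
    Hurdles 2 and 3 can be sketched in isolation. This is only a rough illustration of the manual route (wrapping a GZipStream around the body and decoding with the declared charset), not the code used in the final answer. The names BodyDecoder, Decode and GzipUtf8 are hypothetical, and the sample is compressed in memory so the sketch runs without a network call. With a real HttpWebResponse the two string arguments would come from its ContentEncoding and CharacterSet properties. Only gzip is handled here; anything else is passed through untouched.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class BodyDecoder
{
    // Decompresses the body when it is gzip-encoded, then decodes it to text.
    public static string Decode(Stream body, string contentEncoding, string charset)
    {
        Stream source = body;
        if (string.Equals(contentEncoding, "gzip", StringComparison.OrdinalIgnoreCase))
        {
            source = new GZipStream(body, CompressionMode.Decompress);
        }

        // Same fallback as the answer's code: assume UTF-8 when no charset is declared.
        Encoding encoding = Encoding.GetEncoding(
            string.IsNullOrEmpty(charset) ? "UTF-8" : charset);

        using (var reader = new StreamReader(source, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: gzip a string so the demo needs no server.
    public static byte[] GzipUtf8(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(text);
                gzip.Write(bytes, 0, bytes.Length);
            }
            // ToArray still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        byte[] compressed = GzipUtf8("<rss version=\"2.0\"></rss>");
        Console.WriteLine(Decode(new MemoryStream(compressed), "gzip", "utf-8"));
    }
}
```

    Setting AutomaticDecompression on the request makes the decompression half of this unnecessary; the charset handling is still needed either way.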


    After clearing those hurdles, it worked. The .NET app was able to read the RSS stream from that HTTPS URL.
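
    The script-stripping step from hurdle 4 can also be tried on its own. The sketch below is an assumption-laden illustration: the sample markup is made up (it is not the actual IBM feed), ScriptStripper and its methods are hypothetical names, and the pattern is a looser one that removes every <script> block regardless of its attributes. ReadElementString is used directly because it is the call named in the error message; it throws an XmlException when the element has a child element and succeeds once the script block is gone.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public static class ScriptStripper
{
    // Removes every <script ...>...</script> block, whatever its attributes.
    public static string StripScripts(string xml)
    {
        return Regex.Replace(
            xml,
            @"<script\b[^>]*>.*?</script>",
            string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    // Reads the text of the first <copyright> element as a text-only element.
    public static string ReadCopyright(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.ReadToFollowing("copyright");
            return reader.ReadElementString();
        }
    }

    public static void Main()
    {
        // Made-up sample that mimics the shape described above.
        string sample =
            "<rss version=\"2.0\"><channel><title>demo</title>" +
            "<copyright>Copyright <script type='text/javascript'>" +
            "document.write(new Date().getFullYear());</script> Example Corp" +
            "</copyright></channel></rss>";

        try
        {
            ReadCopyright(sample);
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Before stripping: " + ex.Message);
        }

        Console.WriteLine("After stripping: " + ReadCopyright(StripScripts(sample)));
    }
}
```

    A regex is a blunt tool for XML. It is acceptable here only because the offending block has a fixed, known shape.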

    I didn't do any further testing - to see if varying the Accept header or the Accept-Encoding header would change the behavior. That's for you to figure out, if you care.

    The resulting code is below. It's much uglier than your simple 3-liner. I don't know how to make it any simpler.

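    // Namespaces this snippet relies on:
    using System;
    using System.IO;
    using System.Linq;
    using System.Net;
    using System.ServiceModel.Syndication;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Xml;

    // PersistentCookies is the helper class from the WebRequest doc page
    // mentioned in hurdle 1; it is not part of the framework.
    public void Run()
    {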
        string url;
        url = "https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en";
    
        HttpWebRequest hwr = (HttpWebRequest) WebRequest.Create(url);
        // attach persistent cookies
        hwr.CookieContainer =
            PersistentCookies.GetCookieContainerForUrl(url);
        hwr.Accept = "text/xml, */*";
        hwr.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us");
        hwr.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; .NET CLR 3.5.30729;)";
        hwr.KeepAlive = true;
        hwr.AutomaticDecompression = DecompressionMethods.Deflate |
                                     DecompressionMethods.GZip;
    
        using (var resp = (HttpWebResponse) hwr.GetResponse())
        {
            using(Stream s = resp.GetResponseStream())
            {            
                string cs = String.IsNullOrEmpty(resp.CharacterSet) ? "UTF-8" : resp.CharacterSet;
                Encoding e = Encoding.GetEncoding(cs);
    
                using (StreamReader sr = new StreamReader(s, e))
                {
                    var allXml = sr.ReadToEnd();
    
                    // remove all script blocks - they confuse the feed parser.
                    // The lazy .*? stops at the first closing tag, so every
                    // block is removed, not just the last one.
                    allXml = Regex.Replace( allXml,
                                            "<script\\b[^>]*>.*?</script>",
                                            "",
                                            RegexOptions.Singleline);
    
                    using (XmlReader xmlr = XmlReader.Create(new StringReader(allXml)))
                    {
                        var items = from item in SyndicationFeed.Load(xmlr).Items
                            select item;
                    }
                }
            }
        }
    }
    