Question
I have a site that requires login before it lets you download files. Currently I am using the BrowserSession class to log in and do all the scraping required (at least for the most part).
The BrowserSession class source is at the bottom of the post.
The download links show up in the document nodes, but I don't know how to add download functionality to that class, and if I try to download them with a WebClient it fails. I have already had to modify the BrowserSession class heavily (I should have modified it as a partial class but didn't), so I don't really want to move away from it.
I believe it uses HtmlAgilityPack.HtmlWeb to download and load the web pages.
If there is no easy way to modify BrowserSession, is there some way to use its CookieCollection with WebClient?
PS: I need to be logged in to download the file; otherwise the link redirects to the login screen. That is why I can't simply use WebClient, and need either to modify the BrowserSession class so it can download, or to modify WebClient so it sends cookies when requesting a page.
I will admit I do not understand cookies very well (I am not sure whether they are sent on every GET or only on POST), but so far BrowserSession has taken care of all that.
PPS: The BrowserSession I posted is not the one I added stuff to, but the core functions are all the same.
public class BrowserSession
{
    private bool _isPost;
    private HtmlDocument _htmlDoc;

    /// <summary>
    /// System.Net.CookieCollection. Provides a collection container for instances of Cookie class
    /// </summary>
    public CookieCollection Cookies { get; set; }

    /// <summary>
    /// Provide a key-value-pair collection of form elements
    /// </summary>
    public FormElementCollection FormElements { get; set; }

    /// <summary>
    /// Makes a HTTP GET request to the given URL
    /// </summary>
    public string Get(string url)
    {
        _isPost = false;
        CreateWebRequestObject().Load(url);
        return _htmlDoc.DocumentNode.InnerHtml;
    }

    /// <summary>
    /// Makes a HTTP POST request to the given URL
    /// </summary>
    public string Post(string url)
    {
        _isPost = true;
        CreateWebRequestObject().Load(url, "POST");
        return _htmlDoc.DocumentNode.InnerHtml;
    }

    /// <summary>
    /// Creates the HtmlWeb object and initializes all event handlers.
    /// </summary>
    private HtmlWeb CreateWebRequestObject()
    {
        HtmlWeb web = new HtmlWeb();
        web.UseCookies = true;
        web.PreRequest = new HtmlWeb.PreRequestHandler(OnPreRequest);
        web.PostResponse = new HtmlWeb.PostResponseHandler(OnAfterResponse);
        web.PreHandleDocument = new HtmlWeb.PreHandleDocumentHandler(OnPreHandleDocument);
        return web;
    }

    /// <summary>
    /// Event handler for HtmlWeb.PreRequestHandler. Occurs before an HTTP request is executed.
    /// </summary>
    protected bool OnPreRequest(HttpWebRequest request)
    {
        AddCookiesTo(request); // Add cookies that were saved from previous requests
        if (_isPost) AddPostDataTo(request); // We only need to add post data on a POST request
        return true;
    }

    /// <summary>
    /// Event handler for HtmlWeb.PostResponseHandler. Occurs after a HTTP response is received
    /// </summary>
    protected void OnAfterResponse(HttpWebRequest request, HttpWebResponse response)
    {
        SaveCookiesFrom(response); // Save cookies for subsequent requests
    }

    /// <summary>
    /// Event handler for HtmlWeb.PreHandleDocumentHandler. Occurs before a HTML document is handled
    /// </summary>
    protected void OnPreHandleDocument(HtmlDocument document)
    {
        SaveHtmlDocument(document);
    }

    /// <summary>
    /// Assembles the Post data and attaches to the request object
    /// </summary>
    private void AddPostDataTo(HttpWebRequest request)
    {
        string payload = FormElements.AssemblePostPayload();
        byte[] buff = Encoding.UTF8.GetBytes(payload.ToCharArray());
        request.ContentLength = buff.Length;
        request.ContentType = "application/x-www-form-urlencoded";
        System.IO.Stream reqStream = request.GetRequestStream();
        reqStream.Write(buff, 0, buff.Length);
    }

    /// <summary>
    /// Add cookies to the request object
    /// </summary>
    private void AddCookiesTo(HttpWebRequest request)
    {
        if (Cookies != null && Cookies.Count > 0)
        {
            request.CookieContainer.Add(Cookies);
        }
    }

    /// <summary>
    /// Saves cookies from the response object to the local CookieCollection object
    /// </summary>
    private void SaveCookiesFrom(HttpWebResponse response)
    {
        if (response.Cookies.Count > 0)
        {
            if (Cookies == null) Cookies = new CookieCollection();
            Cookies.Add(response.Cookies);
        }
    }

    /// <summary>
    /// Saves the form elements collection by parsing the HTML document
    /// </summary>
    private void SaveHtmlDocument(HtmlDocument document)
    {
        _htmlDoc = document;
        FormElements = new FormElementCollection(_htmlDoc);
    }
}
FormElementCollection Class:
/// <summary>
/// Represents a combined list and collection of Form Elements.
/// </summary>
public class FormElementCollection : Dictionary<string, string>
{
    /// <summary>
    /// Constructor. Parses the HtmlDocument to get all form input elements.
    /// </summary>
    public FormElementCollection(HtmlDocument htmlDoc)
    {
        var inputs = htmlDoc.DocumentNode.Descendants("input");
        foreach (var element in inputs)
        {
            string name = element.GetAttributeValue("name", "undefined");
            string value = element.GetAttributeValue("value", "");
            if (!name.Equals("undefined")) Add(name, value);
        }
    }

    /// <summary>
    /// Assembles all form elements and values to POST. Also URL-encodes the values.
    /// </summary>
    public string AssemblePostPayload()
    {
        StringBuilder sb = new StringBuilder();
        foreach (var element in this)
        {
            string value = System.Web.HttpUtility.UrlEncode(element.Value);
            sb.Append("&" + element.Key + "=" + value);
        }
        return sb.ToString().Substring(1);
    }
}
Answer 1:
It is not easy to log in and download web pages. I recently had the same issue. If you find a solution aside from this one, please share it.
What I did was use Selenium with PhantomJS. With Selenium I can interact with the web browser of my choice.
Also, the Browser class does not use Html Agility Pack, which is a third-party library available through NuGet.
I want to refer you to this question, where I created a complete example of how to use Selenium, download the HtmlDocument, and filter out the necessary information using XPath.
Answer 2:
I managed to get it working using BrowserSession and a modified WebClient.
First, make _htmlDoc public so the document nodes are accessible:
public class BrowserSession
{
    private bool _isPost;
    public string previous_Response { get; private set; }
    public HtmlDocument _htmlDoc { get; private set; }
}
Second, add this method to BrowserSession:
public void DownloadCookieProtectedFile(string url, string Filename)
{
    using (CookieAwareWebClient wc = new CookieAwareWebClient())
    {
        wc.Cookies = Cookies;
        wc.DownloadFile(url, Filename);
    }
}
//rest of BrowserSession
Third, add this class somewhere. It allows passing the cookies from BrowserSession to the WebClient.
using System;
using System.Net;

public class CookieAwareWebClient : WebClient
{
    // Cookies to send with every request made through this client
    public CookieCollection Cookies = new CookieCollection();

    private void AddCookiesTo(HttpWebRequest request)
    {
        if (Cookies != null && Cookies.Count > 0)
        {
            request.CookieContainer.Add(Cookies);
        }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest webRequest = request as HttpWebRequest;
        if (webRequest != null)
        {
            // A fresh request has no container until one is assigned
            if (webRequest.CookieContainer == null) webRequest.CookieContainer = new CookieContainer();
            AddCookiesTo(webRequest);
        }
        return request;
    }
}
This lets you use BrowserSession as you normally would. When you need a file that is only accessible while you're logged in, call DownloadCookieProtectedFile() on your BrowserSession instance. Alternatively, use CookieAwareWebClient directly as you would a WebClient, setting its cookies first:
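// "session" is your logged-in BrowserSession instance; url and fileName are placeholders
using (var wc = new CookieAwareWebClient())
{
    wc.Cookies = session.Cookies;
    // Download with WebClient as normal
    wc.DownloadFile(url, fileName);
}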
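On the cookie question raised in the original post: as far as I understand it, a CookieContainer picks cookies by the request URI (domain and path), not by the HTTP method, so a stored session cookie goes out with GET and POST requests alike. Below is a minimal sketch of that matching. The cookie name and value, the example.com and example.org URLs, and the CookieDemo class are all made up for illustration; nothing here touches the network.

```csharp
using System;
using System.Net;

public static class CookieDemo
{
    // Builds a container holding one hypothetical session cookie,
    // roughly what a successful login response would leave behind.
    public static CookieContainer BuildContainer()
    {
        var cookies = new CookieCollection();
        cookies.Add(new Cookie("session", "abc123", "/", "example.com"));

        var container = new CookieContainer();
        container.Add(cookies); // same call CookieAwareWebClient makes per request
        return container;
    }

    public static void Main()
    {
        CookieContainer container = BuildContainer();

        // Matching host: the header value that would be sent, whatever the method
        string sameSite = container.GetCookieHeader(new Uri("http://example.com/files/report.zip"));
        Console.WriteLine("example.com -> \"" + sameSite + "\"");

        // Unrelated host: no cookie matches, so the header is empty
        string otherSite = container.GetCookieHeader(new Uri("http://other.example.org/"));
        Console.WriteLine("other.example.org -> \"" + otherSite + "\"");
    }
}
```

This is also why copying BrowserSession.Cookies into the WebClient's request is enough for the download: the file URL is on the same site as the login, so the session cookie matches it.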
Source: https://stackoverflow.com/questions/33679834/is-there-anyway-to-use-browsersession-to-download-files-c-sharp