Question
I have a question that seems to have been asked before, but is a bit different. I'm trying to scrape data from this website, but the problem is that it seems to be loaded with AJAX. Because of that my application is unable to find the IDs and classes in the HTML that I'm looking for.
You can reproduce this by inspecting an element or viewing the source. Whilst viewing the source I'm seeing a lot less than whilst inspecting an element.
I thought that I could track down the file that contains the AJAX to load this html by pressing F12, going to the network tab and selecting XHR, but I'm unable to find it.
My question is: how do I retrieve this data or find out what file is used to collect the data?
An example of my code (I'm unable to find the Timetable_toolbar_elementSelect_popup0):
private async Task GetHtmlDocument(string url)
{
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    //request.Credentials = new LoginCredentials().Credentials;
    try
    {
        WebResponse myResponse = await request.GetResponseAsync();
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.Load(myResponse.GetResponseStream());
        var test = htmlDoc.GetElementbyId("Timetable_toolbar_elementSelect_popup0");
    }
    catch (Exception e)
    {
    }
}
Answer 1:
Solution where you call the AJAX method using a web request.
So I got bored and figured most of it out. What is missing below is how to identify a Klase by id. The example below will fetch the Klase '1GLD'. The reason we need cookies is so the request knows which school we are fetching the Klase from. Also, the code below only returns JSON, not HTML, since it is an AJAX method we are calling.
CookieContainer cookies = new CookieContainer();
try
{
    string webAddr = "https://roosters.windesheim.nl/";
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
//According to my web debugger the cookie will last until the 10th of December. So need to fix a new cookie until then.
//I noticed the url used unixtimestamps at the end of the url. So we just add the unixtimestamp at the end for each request.
long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
//we are now ready to call the ajax method and get the JSON.
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache=" + unixTimeStamp.ToString();
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2";
        //The command below will return a JSON datastructure containing all the klases and their relevant ID.
        //string otherJson = "ajaxCommand=getPageConfig&type=1&filter=-2"
        streamWriter.Write(json);
        streamWriter.Flush();
    }
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        var responseText = streamReader.ReadToEnd();
        //THE RESULTS GETS PRINTED HERE.
        Console.Write(responseText);
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
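If you want to work with the response rather than just print it, you can parse it generically. A minimal sketch, assuming the Newtonsoft.Json NuGet package is available; the exact shape of the WebUntis payload is not shown in this answer, so any property path you drill into is an assumption you must verify against the printed output:
using Newtonsoft.Json.Linq; //assumption: Newtonsoft.Json is referenced

// Parse the raw JSON returned by the ajax call (the `responseText` string read above).
JObject payload = JObject.Parse(responseText);
Console.WriteLine(payload.ToString()); //indented, human-readable view of the whole document
//var rows = payload["result"]?["data"]; //hypothetical path - verify against the real payload first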
Other solution: Selenium with the Firefox driver.
This is way easier to do, but it also takes more time. It gives you HTML to work with instead, just like you requested. Not all of the thread sleeps are necessary, but I found them necessary in the last foreach loop.
public static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    //According to my web debugger the cookie will last until the 10th of December. So need to fix a new cookie until then.
    //I noticed the url used unixtimestamps at the end of the url. So we just add the unixtimestamp at the end for each request.
    long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache=" + unixTimeStamp.ToString();
    var ffOptions = new FirefoxOptions();
    ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe";
    ffOptions.LogLevel = FirefoxDriverLogLevel.Default;
    ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true };
    var service = FirefoxDriverService.CreateDefaultService();
    var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120));
    driver.Navigate().GoToUrl(webAddr);
    driver.FindElement(By.XPath("//input[@id='school']")).SendKeys("Windesheim" + Keys.Enter);
    Thread.Sleep(2000);
    driver.FindElement(By.XPath("//span[@id='dijit_PopupMenuBarItem_0_text' and text() ='Lesrooster']")).Click();
    driver.FindElement(By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']")).Click();
    Thread.Sleep(2000);
    driver.FindElement(By.XPath("//div[@id='widget_Timetable_toolbar_elementSelect']//input[@class='dijitReset dijitInputField dijitArrowButtonInner']")).Click();
    //we get all the options for Klase
    doc.LoadHtml(driver.PageSource);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@id='Timetable_toolbar_elementSelect_popup']/div[@item]");
    List<String> options = new List<String>();
    foreach (HtmlNode n in nodes)
    {
        options.Add(n.InnerText);
    }
    foreach (string s in options)
    {
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).Clear();
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).SendKeys(s);
        Thread.Sleep(2000);
        driver.FindElement(By.XPath("//body")).SendKeys(Keys.Enter);
        Thread.Sleep(2000);
        doc.LoadHtml(driver.PageSource);
        //Console.WriteLine(driver.Url); //Now we can see the id of the current Klase
    }
    Console.WriteLine(doc.DocumentNode.InnerHtml);
    Console.ReadKey();
}
Last update
Using the Selenium solution I was able to get the IDs for all courses. I have included the file here so you can use it with your AJAX and web requests.
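A minimal sketch of how those IDs could be fed back into the web-request approach above; the dictionary contents are hypothetical placeholders (only 1GLD = 13090 is confirmed in this answer), and SendTimetableRequest is a hypothetical helper standing in for the cookie-carrying POST to Timetable.do shown earlier:
using System.Collections.Generic;

// Hypothetical mapping from Klase name to elementId; fill it from the exported file.
var klasIds = new Dictionary<string, int> { ["1GLD"] = 13090 };
foreach (var klas in klasIds)
{
    string body = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=" + klas.Value +
                  "&date=20171126&formatId=7&departmentId=0&filterId=-2";
    //Reuse the cookie-carrying POST shown above with this body, e.g.:
    //string responseJson = SendTimetableRequest(body, cookies); //hypothetical helper
}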
Answer 2:
I was going to leave this as a comment. But it got too big and too badly formatted. So here we go.
Firstly, the site is updated dynamically using JavaScript that is called with an ajax command.
If you can open up a session and store the cookie containing the SESSIONID and the now "encrypted" school name, then you can call the ajax commands like this:
https://roosters.windesheim.nl/ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2
This does, however, require you to know what elementType and elementId are.
In this case elementId refers to the Klas (here, 1GLD), and formatId=7 refers to the Roosterformaat "Beknopt". You have to figure out what the remaining variables do. Even more important: if you succeed in making valid ajax commands to the server, you won't get HTML back as a response; you will receive the data as JSON.
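In other words, the request body is just a form-encoded parameter list. A small sketch of building it from named values; elementType=1, elementId=13090 (1GLD), formatId=7 ("Beknopt") and the date come from this answer, the rest you would substitute yourself:
//Build the form-encoded body for getWeeklyTimetable from named parameters.
int elementType = 1;        //1 selects a Klas in this answer
int elementId = 13090;      //13090 corresponds to Klas 1GLD here
string date = "20171126";   //yyyyMMdd
int formatId = 7;           //Roosterformaat "Beknopt"
string body = $"ajaxCommand=getWeeklyTimetable&elementType={elementType}&elementId={elementId}" +
              $"&date={date}&formatId={formatId}&departmentId=0&filterId=-2";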
The easiest way to do what you want is to have all the classes in a separate file and use that as a reference point. The same goes for the other options.
Then use a headless browser like PhantomJS (phantomjs.org) with Selenium. This way you can find and click on the classes you want to scrape, load the HTML into an HtmlAgilityPack.HtmlDocument and then do what you need to do. Selenium/PhantomJS will keep track of your cookies. This method is slower, but a lot easier to do.
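A minimal sketch of that headless flow, assuming the older Selenium WebDriver packages that still ship PhantomJSDriver and a phantomjs.exe on the PATH (PhantomJS support was dropped from newer Selenium releases); the navigation and click steps are elided and would mirror the Firefox example in Answer 1:
using OpenQA.Selenium.PhantomJS; //assumption: a Selenium version that still includes PhantomJS support
using HtmlAgilityPack;

var driver = new PhantomJSDriver(); //requires phantomjs.exe to be available
driver.Navigate().GoToUrl("https://roosters.windesheim.nl/");
//...find and click the school / Lesrooster / Klassen elements, as in the Firefox example...
var doc = new HtmlDocument();
doc.LoadHtml(driver.PageSource); //hand the rendered HTML to HtmlAgilityPack
driver.Quit();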
EDIT: Storing cookies from a web request - the easy way.
I am not too keen on this subject, but OP asked. If anybody has a better way of doing it, please edit.
HtmlDocument doc = new HtmlDocument(); //HtmlAgilityPack document that will hold the response
CookieContainer cookies = new CookieContainer();
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/";
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13092&date=20171126&formatId=7&departmentId=0&filterId=-2";
        streamWriter.Write(json);
        streamWriter.Flush();
    }
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
        //cookies.Add(httpResponse.Cookies);
        var responseText = streamReader.ReadToEnd();
        doc.LoadHtml(responseText);
        foreach (Cookie c in httpResponse.Cookies)
        {
            Console.WriteLine(c.ToString());
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
Console.WriteLine(doc.DocumentNode.InnerHtml);
Console.ReadKey();
Source: https://stackoverflow.com/questions/47491022/await-ajax-with-htmlagilitypack-in-xamarin