This is the HTML code from which I want to extract data, but whenever I run it I get seemingly random values. Can anyone please help me out with this?
I want to extract t
You haven't provided much detail about the issue at hand, such as the output you're getting or the website in question, but I'm willing to bet that the problem is the ranges you use when filling in your item fields...
Do the values you call random or incorrect come from running the spider across the site's entire directory of different institutes? In other words, is the HTML snippet just one of many pages you're scraping? If so...
Then your issue is almost certainly a range issue. You're using ranges to pick one tag out of its siblings in the same node, but what happens if the pages on the site are not all laid out the same? The range stays the same but the position of your content does not, so you end up with either None or the wrong values.
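To make the range problem concrete, here is a minimal sketch using only Python's standard library as a stand-in for Scrapy selectors. The two pages, their field names and their values are made up for illustration (the real site's markup isn't shown in the question), and xml.etree only works here because the snippets are well-formed.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages from the same site; the second has one extra row.
PAGE_A = """
<ul class="clg-info">
  <li>Established: <span>1995</span></li>
  <li>Ownership: <span>Private</span></li>
</ul>
"""
PAGE_B = """
<ul class="clg-info">
  <li>Established: <span>1980</span></li>
  <li>Approved by: <span>Some Board</span></li>
  <li>Ownership: <span>Public</span></li>
</ul>
"""

def second_item(page):
    """Positional lookup: always take the second <li>, whatever it holds."""
    root = ET.fromstring(page)  # root is the <ul> element
    return root.find("./li[2]/span").text

print(second_item(PAGE_A))  # Private    (happens to be the ownership)
print(second_item(PAGE_B))  # Some Board (not the ownership at all)
```

The same index returns the right field on one page and an unrelated one on the other, which is exactly what "random values" looks like from the outside.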
In the for loop..
    def parse(self, response):
        for students in response.css('div.topBlockInstituteInfoBottom'):
I can't verify whether the CSS selector you're using is correct, since I don't know the site in question, BUT going by the HTML snippet you showed, the CSS selector in your loop is wrong...
With XPath, instead of spelling out an absolute path such as
    //html/body/..../a[contains(., "next")]/@href
you can just go straight to using
    //a[contains(., "next")]/@href
But the more specific you are about the path to your content, or to the node, the less likely you are to run into any confusion in your parse.
In your case, do exactly this. I'm not going to take away all the fun of learning something new, lol, but here's what one should look like:
    response.xpath("//ul[@class='clg-info']/li[contains(., 'Ownership')]/span/text()").extract()
You don't need to be in the Scrapy shell to check what it outputs. Open any browser's dev tools and do a Ctrl+F search in there; it should accept XPath. The output is "Private". That's because I said that, at the level of the 'clg-info' node, I'm looking for the li that contains the plain-text word Ownership (it doesn't have to be the full word either), and then moved over to its span and took the text. You just have to look at the HTML while doing it and it's obvious.
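If you want to try the "find the li by its label" idea without Scrapy installed, here is a rough sketch with the standard library. xml.etree doesn't support contains(), so the text check is done in Python instead; the function name field and both HTML snippets are hypothetical, and this only works on well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: same fields, different order and row count.
PAGE_A = """
<div>
  <ul class="clg-info">
    <li>Established: <span>1995</span></li>
    <li>Ownership: <span>Private</span></li>
  </ul>
</div>
"""
PAGE_B = """
<div>
  <ul class="clg-info">
    <li>Approved by: <span>Some Board</span></li>
    <li>Established: <span>1980</span></li>
    <li>Ownership: <span>Public</span></li>
  </ul>
</div>
"""

def field(page, label):
    """Return the <span> text of the first <li> whose text contains label."""
    root = ET.fromstring(page)
    for li in root.findall(".//ul[@class='clg-info']/li"):
        # Joining itertext() mimics XPath's contains(., label) on the node text.
        if label in "".join(li.itertext()):
            span = li.find("span")
            return span.text if span is not None else None
    return None  # label not present on this page

print(field(PAGE_A, "Ownership"))  # Private
print(field(PAGE_B, "Owner"))      # Public (a partial word is enough)
```

Because the lookup keys on the label and not on the position, it keeps working when a page adds, drops or reorders rows.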
PRO TIP: remember the earlier example of finding a link, an a tag that contains the word next... can you think of how that could be useful? =) Navigating through web pages can be such a pain, but know your XPath and regex and there's no content you can't parse. Once you get good, you can start really understanding how to de-obfuscate JS in web pages BY HAND, none of that JSNice stuff, lol.
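One answer to that question is pagination. Below is a small sketch, again with the standard library only, of picking out the "next" link and turning it into an absolute URL. The pager markup, the example.com URLs and the name next_page_url are assumptions for illustration, not taken from the site in the question.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical pager markup from a listing page.
MIDDLE_PAGE = """
<div class="pager">
  <a href="/colleges?page=1">previous</a>
  <a href="/colleges?page=3">next</a>
</div>
"""
LAST_PAGE = """
<div class="pager">
  <a href="/colleges?page=2">previous</a>
</div>
"""

def next_page_url(page, base_url):
    """Return the absolute URL of the first <a> whose text contains 'next'."""
    root = ET.fromstring(page)
    for a in root.iter("a"):
        href = a.get("href")
        # Case-sensitive substring check, like contains(., "next") in XPath.
        if href and "next" in "".join(a.itertext()):
            return urljoin(base_url, href)
    return None  # no next link: we are on the last page

print(next_page_url(MIDDLE_PAGE, "https://example.com/colleges?page=2"))
# https://example.com/colleges?page=3
print(next_page_url(LAST_PAGE, "https://example.com/colleges?page=3"))
# None
```

In a real spider the same idea would usually be the //a[contains(., "next")]/@href XPath followed by something like response.follow(href, callback=self.parse), stopping when no link is found.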