How to get the job description using scrapy?

妖精的绣舞 submitted on 2019-12-08 04:23:41

Question


I'm new to Scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using Scrapy. As you can see, the email and phone number are provided as plain text inside a <p> tag, which makes them hard to extract.

My idea is to first get the text inside the Job Overview, or at least all the text about this job, and then use a regex to get the email, the phone number and, if possible, the name of the person.
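The regex step of this plan could look roughly like the sketch below. It is only an illustration: the sample text, the contact details in it, and both patterns are assumptions, not taken from the actual page, and real job postings will likely need looser or stricter patterns.

```python
import re

# Hypothetical text, standing in for whatever the scraper extracts from the page.
sample_text = "Please contact Jane Doe at jane.doe@example.com or +49 30 1234567."

# Simple, permissive patterns (assumptions; they will not cover every valid format).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")

emails = EMAIL_RE.findall(sample_text)
phones = PHONE_RE.findall(sample_text)

print(emails)
print(phones)
```

Extracting the person's name is much harder to do with a regex alone, since names have no fixed shape; that part would probably need a page-specific rule.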

So I fired up the Scrapy shell with the command scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.

Now I try to get all the text from the div job_description, but I get essentially nothing. I used

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

It returns [u'\t\t\t\n\t\t ']

How do I get all the text from the page mentioned? Obviously, extracting the attributes mentioned above comes afterwards, but first things first.

Update: the following selection only returns []:

response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()


Answer 1:


You were close with

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

The div tag itself does not contain any text other than what you got.

<div class="job_description" (...)>
    "This is the text you are getting"
    <p>"This is the text you want"</p>
</div>

As you can see, the text you get with response.xpath('//div[@class="job_description"]/text()').extract() is the text that sits directly inside the div tag, not the text inside the tags nested within it. To get the nested text you would need:

response.xpath('//div[@class="job_description"]//*/text()').extract()

This selects every element nested inside div[@class="job_description"], at any depth, and returns the text nodes of those elements.
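The difference between direct text and nested text can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, so it is an analogy rather than the same API, and the markup is a made-up stand-in for the real page: .text corresponds roughly to the first direct text node, while itertext() walks all text nodes inside the element, much like .//text() would.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the snippet above (not the real page).
markup = (
    '<div class="job_description">\n\t\t'
    '<p>Contact: jane.doe@example.com</p>\n'
    '</div>'
)
div = ET.fromstring(markup)

# Text directly inside the div, before its first child: whitespace only.
direct_text = div.text

# Every text node inside the div, including the text of nested tags.
all_text = list(div.itertext())

print(repr(direct_text))
print(all_text)
```

Note that //*/text() in the answer only covers text inside nested elements; writing //text() after the div would also include the div's own direct text.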

You will see that this also returns a lot of useless text, since you still get all the \n and similar whitespace. I suggest narrowing your XPath down to the element you want instead of taking such a broad approach.

For example, the entire job description would be in

response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
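Even with a narrower selector, the extracted list will probably still contain whitespace-only strings. A minimal sketch of cleaning it up follows; clean_fragments is a hypothetical helper name, and the input list is made-up data shaped like what .extract() returns.

```python
def clean_fragments(fragments):
    """Strip each fragment, drop empty ones, and join the rest with spaces."""
    stripped = (fragment.strip() for fragment in fragments)
    return " ".join(fragment for fragment in stripped if fragment)


# Hypothetical result of an .extract() call, including whitespace-only entries.
raw = ['\t\t\t\n\t\t ', 'Working Student', '\n', ' Offpage SEO ']

description = clean_fragments(raw)
print(description)
```

The cleaned string can then be passed to whatever regexes are used for the email and phone number.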


Source: https://stackoverflow.com/questions/41178659/how-to-get-the-job-description-using-scrapy
