How to get the job description using scrapy?

Question

I'm new to Scrapy and XPath, but I have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using Scrapy. As you can see, the email and phone number are provided as text inside the <p> tag, which makes them hard to extract.

My idea is to first get the text inside the Job Overview, or at least all the text about this job, and then use a regex to get the email, the phone number and, if possible, the name of the person.

So I fired up the Scrapy shell with the command scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.

Now I try to get all the text from the div job_description, but I actually get nothing useful. I used

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

It returns [u'\t\t\t\n\t\t ']

How do I get all the text from the page mentioned? Obviously, extracting the attributes mentioned above will come afterwards, but first things first.

Update: This selection only returns []:

response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()

Answer 1:

You were close with

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

The div tag itself does not contain any text besides the whitespace you are getting.

<div class="job_description" (...)>
    "This is the text you are getting"
    <p>"This is the text you want"</p>
</div>

As you can see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly inside the div tag, not the text inside the tags nested within it. For that you would need:

response.xpath('//div[@class="job_description"]//*/text()').extract()

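This selects every element nested at any depth inside div[@class="job_description"] (all descendants, not just direct children) and returns their text. Note that it skips the text directly inside the div itself; //div[@class="job_description"]//text() would include that as well.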
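
The difference between the text directly inside an element and the text of everything nested in it can be tried out without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree purely as a stand-in, on made-up markup rather than the real page; in Scrapy the equivalents are the XPath expressions shown above. One difference to keep in mind: div.text holds only the text before the first child, whereas /text() returns every direct text node.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup that mimics the structure of the page.
markup = (
    '<div class="job_description">\n\t'
    "<p>Contact: Jane</p>\n\t"
    "<p>Phone: 030 1234</p>\n"
    "</div>"
)
div = ET.fromstring(markup)

# Text directly inside the div, before its first child: whitespace only.
direct_text = div.text

# Text of the div and of everything nested inside it, in document order.
all_text = list(div.itertext())

print(repr(direct_text))
print(all_text)
```

The first value is only whitespace, which matches what the original /text() query returned, while the second list contains the contents of the <p> tags.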

You will see that this also returns a lot of useless text, since you are still getting all the \n and similar whitespace. I suggest narrowing your XPath down to the element you want instead of taking such a broad approach.

For example, the entire job description would be in
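
Whatever selector you end up with, the leftover whitespace can also be cleaned up in plain Python after extraction. This is a minimal sketch; clean_fragments is a hypothetical helper name and the input list is made up to resemble what .extract() returns.

```python
def clean_fragments(fragments):
    """Collapse runs of whitespace and drop fragments that end up empty."""
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    return [text for text in cleaned if text]


# Made-up list shaped like the output of .extract().
raw = ["\t\t\t\n\t\t ", "Working Student", "\n", "  Offpage SEO \n (French market) "]
print(clean_fragments(raw))
```

Alternatively, the filtering can be done in XPath itself: a predicate such as //text()[normalize-space()] keeps only text nodes that contain something other than whitespace.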

response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
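
Once the description text is extracted, the regex step mentioned in the question could look roughly like this. It is a sketch under assumptions: the patterns are deliberately loose and will not cover every email or phone format, find_contacts is a hypothetical helper, and the sample string is invented, not taken from the page.

```python
import re

# Loose patterns for illustration only; real-world data may need stricter ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")


def find_contacts(text):
    """Return (emails, phones) found in a block of plain text."""
    emails = EMAIL_RE.findall(text)
    phones = [match.strip() for match in PHONE_RE.findall(text)]
    return emails, phones


# Invented sample text; in the spider this would be the joined description text.
sample = "Send your CV to jobs@example.com or call +49 30 1234567 for details."
emails, phones = find_contacts(sample)
print(emails, phones)
```

In the Scrapy shell, the input would be something like " ".join(...) over the list returned by .extract(). Names of people are much harder to match with a regex and would probably need a page-specific rule.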

Source: https://stackoverflow.com/questions/41178659/how-to-get-the-job-description-using-scrapy

Tags

python

xpath

scrapy-spider