Question
I'm new to Scrapy and XPath, but I have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using Scrapy. As you can see, the email and phone number are provided as text inside a <p> tag, which makes them hard to extract.
My idea is to first get the text inside the Job Overview, or at least all the text describing this job, and then use a regex to get the email, the phone number and, if possible, the name of the person.
So I fired up the Scrapy shell using the command scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.
Now I am trying to get all the text from the div job_description, but I actually get nothing useful. I used
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
It returns [u'\t\t\t\n\t\t ']
How do I get all the text from the page mentioned? Obviously, extracting the attributes mentioned above comes afterwards, but first things first.
Update: This selection only returns []
response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()
Answer 1:
You were close with
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
The div tag itself does not contain any text besides what you are already getting.
<div class="job_description" (...)>
"This is the text you are getting"
<p>"This is the text you want"</p>
</div>
As you can see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly inside the div tag, not the text inside the tags nested in the div. For that you would need:
response.xpath('//div[@class="job_description"]//*/text()').extract()
This selects all the descendant elements of div[@class="job_description"] and returns their text (consult an XPath reference for what the different expressions do).
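To see the difference between direct text and nested text without hitting the site, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector. The markup and the contact line in it are made up for illustration. Roughly, div.text corresponds to the whitespace that /text() returned in the question, and itertext() corresponds to //text(), which is slightly broader than the //*/text() used above because it also includes the div's own text nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the page: the div only holds whitespace
# directly, while the interesting text sits inside a nested <p>.
html = (
    '<div class="job_description">\n\t\t'
    '<p>Contact: Jane Doe</p>\n\t'
    '</div>'
)

div = ET.fromstring(html)

# Text directly inside the div, before its first child element.
direct_text = div.text

# Every text fragment inside the div, nested elements included.
all_text = list(div.itertext())

print(repr(direct_text))
print(all_text)
```

In the Scrapy shell the same idea applies to response.xpath(...): selecting only the div's own text nodes yields whitespace, while descending into its children yields the paragraph text.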
You will see that this returns a lot of useless text as well, since you are still getting all the \n and similar whitespace. I suggest narrowing your XPath down to the element you want instead of taking such a broad approach.
For example, the entire job description would be in
response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
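Once you have the list of text fragments, the regex step from the question could look roughly like the sketch below. The fragments list is a hypothetical stand-in for what .extract() returns, the email address and phone number in it are invented, and both patterns are deliberately simple assumptions that will need tuning against the real page. The name is not attempted here, since names rarely follow a pattern a regex can rely on.

```python
import re

# Hypothetical sample standing in for the list returned by .extract().
fragments = [
    "\n\t\t",
    "Please send your application to ",
    "jobs@example.com",
    "\n",
    " or call Jane Doe at +49 30 1234567.",
    "\t",
]

# Join the fragments and collapse every run of whitespace to one space.
text = " ".join(" ".join(fragments).split())

# Simple illustrative patterns, not a complete email or phone grammar.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s/()-]{6,}\d")

emails = EMAIL_RE.findall(text)
phones = PHONE_RE.findall(text)

print(emails)
print(phones)
```

Collapsing the whitespace first keeps the \n and \t fragments from getting in the way of the patterns.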
Source: https://stackoverflow.com/questions/41178659/how-to-get-the-job-description-using-scrapy