Extracting p within h1 with Python/Scrapy

夕颜 2021-01-28 18:02

I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with uses (incorrectly, according to W3C - Is it valid to have par

2 Answers
  • 2021-01-28 18:15

    That was quite baffling, and to be frank I still do not understand why it happens. It turns out that the <p> tag, which should be contained within the <h1> tag, is not inside it in the parsed response. Fetching the page with curl shows markup of the form <h1><p> </p></h1>, whereas the response object shows it as:

    <h1 class="performance-title">\n</h1>
    <p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax
    </p>
    

    As I mentioned, I have my doubts about the cause but nothing concrete. In any case, the XPath for getting the text inside the <p> tag is therefore:

    response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()
    

    This uses <h1 class="performance-title"> as a landmark and selects the <p> tag that follows it as a sibling.

  • 2021-01-28 18:23
    //*[@id="content"]/section/article/section[2]/h1/p/text()
    