I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with uses (incorrectly, according to W3C - Is it valid to have par
That was quite baffling, and to be frank I still do not fully understand why it happens. It turns out that the <p> tag that should be contained within the <h1> tag is not. Fetching the page with curl shows markup of the form <h1><p> </p></h1>, whereas the response Scrapy obtains from the site looks like this:
<h1 class="performance-title">\n</h1> <p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax </p>
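As I mentioned, I have my suspicions but nothing concrete: most likely the HTML parser behind Scrapy's selectors (lxml) repairs the invalid nesting by closing the <h1> before the <p> begins, though I have not confirmed this. Anyway, the XPath for getting the text inside the <p> tag is therefore: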
response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()
This uses the <h1 class="performance-title"> element as a landmark and selects its sibling <p> tag.
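By contrast, an XPath that expects the <p> inside the <h1>, such as the following, would match nothing in the markup shown above:
//*[@id="content"]/section/article/section[2]/h1/p/text()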