I'm brand new to Scrapy. I have learned how to use response.css()
for reading specific aspects from a web page, and am avoiding learning the XPath system. It seems
You can try to extract text with this expression:
>>> txt = """<p>My sentence has a <a href="https://www.google.com">link to google</a> in it.</p>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.css('p ::text').extract()
[u'My sentence has a ', u'link to google', u' in it.']
>>> ''.join(sel.css('p ::text').extract())
u'My sentence has a link to google in it.'
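One detail worth checking with the `::text` approach is the join separator, because the text nodes already carry their own leading and trailing spaces. Below is a minimal sketch in plain Python (no Scrapy needed). The `fragments` list is copied from the selector output shown in the session, and the helper name `join_text_nodes` is hypothetical, introduced only for illustration.

```python
def join_text_nodes(fragments):
    """Concatenate text nodes, then collapse runs of whitespace."""
    joined = ''.join(fragments)
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so re-joining with a single space
    # normalizes the spacing.
    return ' '.join(joined.split())


# Text nodes as returned by sel.css('p ::text') for the sample paragraph.
fragments = ['My sentence has a ', 'link to google', ' in it.']

print(join_text_nodes(fragments))   # single spaces between words
print(repr(' '.join(fragments)))    # joining with ' ' doubles the spaces
```

The whitespace collapse is optional for this sample, but it can help when the paragraph's HTML contains line breaks or indentation between tags.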
Alternatively, use the w3lib.html library to strip the HTML tags from your response, like this:
from w3lib.html import remove_tags
with_tags = response.css("p").get()
clean_text = remove_tags(with_tags)
But the first variant is shorter and more readable.
Use html-text after extracting the whole paragraph:
from html_text import extract_text
for paragraph in response.css('p'):
    html = paragraph.get()
    text = extract_text(html)