I have an HTML line as follows:
<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>
I would lik
Instead of using regular expressions, you should use an HTML parser such as BeautifulSoup. For more complicated cases, you can also use an etree-style library (for example lxml) with XPath.
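As a rough sketch of the XPath route, here is one way it could look using the standard library's xml.etree.ElementTree as a stand-in. That choice is an assumption made to keep the example dependency-free: ElementTree supports only a limited XPath subset and needs well-formed markup, whereas lxml offers fuller XPath support and tolerant HTML parsing. The wrapping <div> and the added <em> are illustrative only.

```python
import xml.etree.ElementTree as ET

# Sample markup wrapped in a <div> so the <span> is a descendant of the root;
# ElementTree's ".//" search does not match the root element itself.
markup = (
    '<div><span class="cd__headline-text">'
    'Is this model <em>too thin</em> for Yves Saint Laurent? '
    '</span></div>'
)

root = ET.fromstring(markup)

# Limited XPath: find the first <span> whose class attribute matches exactly.
span = root.find(".//span[@class='cd__headline-text']")

# itertext() walks the element and its children, yielding every text fragment.
headline = "".join(span.itertext()).strip()
print(headline)
```

Note that the attribute predicate compares the whole class value, so an element carrying several classes would not match this expression.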
Still, if you want to use a regex, here is one way.
Regular expressions are a domain-specific language that makes string parsing and processing much easier. Some people may disagree, but regular expressions often give a more elegant solution to a problem than looping over the string ever could.
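import re
html_string = '<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>'
# Match everything between the first '>' and the last '<' (the match is greedy).
regex = re.compile(r'(?<=>).*(?=<)')
# findall() returns a list of matches; take the first one.
result = regex.findall(html_string)[0]
print(result)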
In this regex, I am using a look-behind, (?<=>), and a look-ahead, (?=<), so the surrounding '>' and '<' must be present but are not included in the match. Learning regular expressions takes a considerable amount of time; I recommend working through a good tutorial or book on the subject.
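To illustrate why a parser is the safer choice, the sketch below runs the same pattern on the sample and on a variant with a nested <em> tag (the variant and the tag-stripping second pass are assumptions added for illustration). Because .* is greedy, the match runs from the first '>' to the last '<', so nested tags end up inside the result.

```python
import re

# Same pattern as above: look-behind for '>', greedy body, look-ahead for '<'.
regex = re.compile(r'(?<=>).*(?=<)')

plain = '<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>'
nested = '<span class="cd__headline-text">Is this model <em>too thin</em> for Yves Saint Laurent? </span>'

plain_result = regex.findall(plain)[0]
nested_result = regex.findall(nested)[0]

print(repr(plain_result))   # 'Is this model too thin for Yves Saint Laurent? '
print(repr(nested_result))  # 'Is this model <em>too thin</em> for Yves Saint Laurent? '

# A second pass is needed to drop the inner tags; this naive pattern is only
# adequate for simple markup like the sample.
cleaned = re.sub(r'<[^>]+>', '', nested_result).strip()
print(cleaned)              # Is this model too thin for Yves Saint Laurent?
```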
If your element contains only text, use the .string attribute:
headline = soup.find(class_='cd__headline-text')
print(headline.string)
If the element contains other tags, you can either get all the text in the element and its descendants, or pick out only specific strings from the element itself.
The element.get_text() function recurses through the element and its child elements, gathers all the strings, and concatenates them with a separator of your choice (defaulting to the empty string), with or without whitespace stripping.
To get only specific strings, you can either iterate over the .strings or .stripped_strings generators, or use the element's .contents or .children to access all contained nodes and then pick out the instances of the NavigableString type.
Demo with your sample:
>>> from bs4 import BeautifulSoup
>>> markup = '<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> headline = soup.find(class_='cd__headline-text')
>>> print(headline.string)
Is this model too thin for Yves Saint Laurent?
>>> print(list(headline.strings))
['Is this model too thin for Yves Saint Laurent? ']
>>> print(list(headline.stripped_strings))
['Is this model too thin for Yves Saint Laurent?']
>>> print(headline.get_text())
Is this model too thin for Yves Saint Laurent?
>>> print(headline.get_text(strip=True))
Is this model too thin for Yves Saint Laurent?
and with an additional element added:
>>> markup = '<span class="cd__headline-text">Is this model <em>too thin</em> for Yves Saint Laurent? </span>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> headline = soup.find(class_='cd__headline-text')
>>> headline.string is None
True
>>> print(list(headline.strings))
['Is this model ', 'too thin', ' for Yves Saint Laurent? ']
>>> print(list(headline.stripped_strings))
['Is this model', 'too thin', 'for Yves Saint Laurent?']
>>> print(headline.get_text())
Is this model too thin for Yves Saint Laurent?
>>> print(headline.get_text(' - ', strip=True))
Is this model - too thin - for Yves Saint Laurent?
>>> headline.contents
['Is this model ', <em>too thin</em>, ' for Yves Saint Laurent? ']
>>> from bs4 import NavigableString
>>> [el for el in headline.children if isinstance(el, NavigableString)]
['Is this model ', ' for Yves Saint Laurent? ']
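The NavigableString filtering shown last can be wrapped in a small function. The following is a minimal sketch under a few assumptions: direct_text is a hypothetical helper name (it is not part of BeautifulSoup), the built-in 'html.parser' backend is used, and whitespace-only fragments are simply dropped.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(element, separator=" "):
    """Join only the strings that are direct children of element.

    direct_text is a hypothetical helper, not a BeautifulSoup API.
    """
    parts = (
        child.strip()
        for child in element.children
        if isinstance(child, NavigableString)
    )
    # Skip fragments that were whitespace only.
    return separator.join(part for part in parts if part)


markup = ('<span class="cd__headline-text">Is this model <em>too thin</em> '
          'for Yves Saint Laurent? </span>')
soup = BeautifulSoup(markup, "html.parser")
headline = soup.find(class_="cd__headline-text")

own_text = direct_text(headline)               # text outside the <em> only
all_text = headline.get_text(" ", strip=True)  # text of all descendants
print(own_text)  # Is this model for Yves Saint Laurent?
print(all_text)  # Is this model too thin for Yves Saint Laurent?
```

Comments in the markup are also NavigableString instances, so real pages may need an extra check if those should be excluded.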