Let's say I have this string, which contains an HTML a tag:
<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>
I have assumed that the string to be extracted consists of alphanumeric characters (including accented letters) and hyphens, and that it immediately follows the first instance of the character '>'.
string = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'
r = /
  (?<=\>)       # match '>' in a positive lookbehind
  [\p{Alnum}-]+ # match one or more alphanumeric characters or hyphens
/x              # extended (free-spacing) mode
string[r] #=> "Berlin-Treptow-Köpenick"
Note that /[A-Za-z0-9]/ does not match accented characters such as 'ö'.
Alternatively, one can use the POSIX syntax:
r = /(?<=\>)[[[:alnum:]]-]+/
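To make the difference between the character classes concrete, here is a small sketch that runs three variants against the same string. The variable names are made up for illustration, and the POSIX class is written in the shorter form `[[:alnum:]-]`, which I assume is equivalent for this input.

```ruby
string = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'

# ASCII-only class: the match stops at the first accented letter.
ascii_only = string[/(?<=>)[A-Za-z0-9-]+/]
# Unicode property class: accented letters are included.
unicode = string[/(?<=>)[\p{Alnum}-]+/]
# POSIX bracket class: also Unicode-aware for UTF-8 strings in Ruby.
posix = string[/(?<=>)[[:alnum:]-]+/]

puts ascii_only #=> Berlin-Treptow-K
puts unicode    #=> Berlin-Treptow-Köpenick
puts posix      #=> Berlin-Treptow-Köpenick
```

The truncated first result is why the ASCII ranges are not enough for place names like this one.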
You could use:
html = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'
html.match(/>(.*)</)[1]
#=> "Berlin-Treptow-Köpenick"
When your HTML fragments get more complex, I would recommend looking at a library like Nokogiri.
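As a rough illustration of what "more complex" can mean, the sketch below applies the same pattern to a string with two links. The input string and variable names are invented for this example. Because `.*` is greedy, the capture runs from the first '>' to the last '<'.

```ruby
# Hypothetical input with two links instead of one.
html = '<a href="a.html">Berlin</a> <a href="b.html">Hamburg</a>'

# Greedy: captures everything between the first '>' and the last '<'.
greedy = html.match(/>(.*)</)[1]
puts greedy #=> Berlin</a> <a href="b.html">Hamburg

# Lazy: stops at the first '<', so only the first link text is captured.
lazy = html.match(/>(.*?)</)[1]
puts lazy #=> Berlin
```

A lazy quantifier helps in this case, but nested tags or attributes containing '>' would still break it, which is where an HTML parser becomes the safer choice.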
string = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'
string.scan(/<[a][^>]*>(.+?)<\/[a]>/).flatten
#=> ["Berlin-Treptow-Köpenick"]