Python — Regex — How to find a string between two sets of strings

Question

Consider the following:

<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>

How would you extract the sitemap line with a regex in Python?

<a href="/sitemap">Sitemap</a>

The following can be used to pull out the anchor tags.

'/<a(.*?)a>/i'

However, there are multiple anchor tags. There are also multiple hotlink divs, so we can't really use those to single out the link either?

Answer 1:

Don't use a regex. Use BeautifulSoup, an HTML parser.

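from bs4 import BeautifulSoup  # BeautifulSoup 4: pip install beautifulsoup4

html = \
"""
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")

# The third hotlink div (index 2) holds the sitemap anchor.
print(soup.find_all("div", id="hotlink")[2].a)

# <a href="/sitemap">Sitemap</a>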
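Indexing with [2] depends on the order of the hotlink blocks. As a minimal sketch, assuming the real goal is "the link whose href is /sitemap" and that only the standard library is available, the same idea can be expressed with html.parser. LinkCollector is a hypothetical helper name introduced here, not part of any library.

```python
from html.parser import HTMLParser

html = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: records (href, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when an anchor opens.
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text is only kept while inside an anchor; comments never arrive here.
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Store the finished (href, text) pair when the anchor closes.
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text)))
            self._in_anchor = False


collector = LinkCollector()
collector.feed(html)
collector.close()

# Select by href instead of by position.
sitemap_text = [text for href, text in collector.links if href == "/sitemap"]
print(sitemap_text)
```

Selecting by href keeps working if the hotlink blocks are reordered, which positional indexing does not.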

Answer 2:

Parsing HTML with regular expressions is a bad idea!

Consider the following pieces of HTML:

<a></a > <!-- legal html, but won't pass your regex -->

<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>

There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
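To make the two failure cases concrete, here is a small sketch. It assumes "your regex" refers to the pattern from the question, with the PHP-style /.../i delimiters translated to re.IGNORECASE; the variable names are illustrative only.

```python
import re

# The question's pattern; the trailing /i flag becomes re.IGNORECASE (assumption).
pattern = re.compile(r'<a(.*?)a>', re.IGNORECASE)

spaced = '<a></a >'
commented = '<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>'

# Case 1: the space in "</a >" means the literal "a>" never occurs, so nothing matches.
no_match = pattern.search(spaced)
print(no_match)

# Case 2: the lazy match stops at the first "a>", which sits inside the comment.
truncated = pattern.search(commented).group(0)
print(truncated)
```

The first search finds nothing, and the second returns a fragment that ends in the middle of the comment rather than at the closing tag.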

You should consider using Beautiful Soup, a Python HTML parser.

Anyhow, an ad hoc solution using a regex is:

import re

data = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""

# Matches one anchor per line; the greedy .* relies on each anchor sitting on its own line.
e = re.compile(r'<a *[^>]*>.*</a *>')

print(e.findall(data))

Output:

>>> e.findall(data)
['<a href="foo1.com">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']

Answer 3:

To extract the contents of this tag:

    <a href="/sitemap">Sitemap</a>

... I would use:

    >>> import re
    >>> s = '''
    <div id=hotlinklist>
    <a href="foo1.com">Foo1</a>
      <div id=hotlink>
        <a href="/">Home</a>
      </div>
      <div id=hotlink>
        <a href="/extract">Extract</a>
      </div>
      <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
      </div>
    </div>'''
    >>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
    >>> m.group(1)
    'Sitemap'
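The pattern above hard-codes the href. As a hedged sketch, assuming the target link is identified by an exact, double-quoted href value, the same approach can be wrapped in a function. link_text is a hypothetical helper name, and re.escape keeps characters in the href from being read as regex syntax.

```python
import re

s = '''
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>'''


def link_text(html, href):
    """Return the text of the first <a> whose href equals `href`, or None."""
    pattern = re.compile(
        r'<a\s+href="' + re.escape(href) + r'"[^>]*>(.*?)</a\s*>',
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    return match.group(1) if match else None


print(link_text(s, "/sitemap"))
```

This is still a regex over HTML, so the caveats from the other answers apply: it only handles markup shaped like the sample.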

Answer 4:

Use BeautifulSoup or lxml if you need to parse HTML.

Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?

If you really have to use regular expressions, have a look at re.findall.
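As a rough sketch of how re.findall could serve each of those readings of the question: with two capturing groups, it returns a list of (href, text) tuples that can then be indexed or filtered. The pattern is an assumption that fits only markup shaped like the sample (double-quoted href as the first attribute).

```python
import re

html = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""

# Two groups: the href value and the link text.
anchor = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a\s*>', re.IGNORECASE | re.DOTALL)
links = anchor.findall(html)

last_link = links[-1]                                          # "find the last link"
third_link = links[2]                                          # "find the third link"
by_href = [pair for pair in links if pair[0] == "/sitemap"]    # "the link that points to /sitemap"

print(last_link, third_link, by_href)
```

Which of the three selections is right depends on the answer to the questions above.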

Source: https://stackoverflow.com/questions/849912/python-regex-how-to-find-a-string-between-two-sets-of-strings

Tags

python

regex

string