Replacing a custom “HTML” tag in a Python string

问题

I want to be able to include a custom "HTML" tag in a string, such as: "This is a <photo id="4" /> string".

In this case the custom tag is <photo id="4" />. I would also be fine changing this custom tag to be written differently if it makes it easier, ie [photo id:4] or something.

I want to be able to pass this string to a function that will extract the tag <photo id="4" />, and allow me to transform this to some more complicated template like <div class="photo"><img src="...." alt="..."></div>, which I can then use to replace the tag in the original string.

I'm imaging it work something like this:

>>> content = "This is a <photo id="4" /> string"
# Pass the string to a function that returns all the tags with the given name.
>>> tags = parse_tags('photo', string)
>>> print(tags)
[{'tag': 'photo', 'id': 4, 'raw': '<photo id="4" />'}]
# Now that I know I need to render a photo with ID 4, so I can pass that to some sort of template thing
>>> rendered = render_photo(id=tags[0]['id'])
>>> print(rendered)
<div class="photo"><img src="...." alt="..."></div>
>>> content = content.replace(tags[0]['raw'], rendered)
>>> print(content)
This is a <div class="photo"><img src="...." alt="..."></div> string

I think this is a fairly common pattern, for something like putting a photo in a blog post, so I'm wondering if there is a library out there that will do something similar to the example parse_tags function above. Or do I need to write it?

This example of the photo tag is just a single example. I would want to have tags with different names. As a different example, maybe I have a database of people and I want a tag like <person name="John Doe" />. In that case the output I want is something like {'tag': 'person', 'name': 'John Doe', 'raw': '<person name="John Doe" />'}. I can then use the name to look that person up and return a rendered template of the person's vcard or something.

回答1:

If you're working with HTML5, I would suggest looking into the xml module (etree). It will allow you to parse the whole document into a tree structure and manipulate tags individually (and then turn the resut bask into an html document).

You could also use regular expressions to perform text substitutions. This would likely be faster than loading a xml tree structure if you don't have too many changes to make.

    import re
    text = """<html><body>some text <photo> and tags <photo id="4"> more text <person name="John Doe"> yet more text"""
    tags = ["photo","person","abc"]
    patterns = "|".join([ f"(<{tag} .*?>)|(<{tag}>)" for tag in tags ])
    matches = list(re.finditer(patterns,text))
    for match in reversed(matches):
        tag = text[match.start():match.end()]
        print(match.start(),match.end(),tag)
        # substitute what you need for that tag
        text = text[:match.start()] + "***" + text[match.end():]
    print(text)

This will be printed:

    64 88 <person name="John Doe">
    39 53 <photo id="4">
    22 29 <photo>
    <html><body>some text *** and tags *** more text *** yet more text

Performing the replacements in reverse order ensures that the ranges found by finditer() remain valid as the text changes with the substitutions.

回答2:

For this kind of "surgical" parsing (where you want to isolate specific tags instead of creating a full hierarchical document), pyparsing's makeHTMLTags method can be very useful.

See the annotated script below, showing the creation of the parser, and using it for parseTag and replaceTag methods:

import pyparsing as pp

def make_tag_parser(tag):
    # makeHTMLTags returns 2 parsers, one for the opening tag and one for the
    # closing tag - we only need the opening tag; the parser will return parsed
    # fields of the tag itself
    tag_parser = pp.makeHTMLTags(tag)[0]

    # instead of returning parsed bits of the tag, use originalTextFor to
    # return the raw tag as token[0] (specifying asString=False will retain
    # the parsed attributes and tag name as attributes)
    parser = pp.originalTextFor(tag_parser, asString=False)

    # add one more callback to define the 'raw' attribute, copied from t[0]
    def add_raw_attr(t):
        t['raw'] = t[0]
    parser.addParseAction(add_raw_attr)

    return parser

# parseTag to find all the matches and report their attributes
def parseTag(tag, s):
    return make_tag_parser(tag).searchString(s)


content = """This is a <photo id="4" /> string"""

tag_matches = parseTag("photo", content)
for match in tag_matches:
    print(match.dump())
    print("raw: {!r}".format(match.raw))
    print("tag: {!r}".format(match.tag))
    print("id:  {!r}".format(match.id))


# transform tag to perform tag->div transforms
def replaceTag(tag, transform, s):
    parser = make_tag_parser(tag)

    # add one more parse action to do transform
    parser.addParseAction(lambda t: transform.format(**t))
    return parser.transformString(s)

print(replaceTag("photo", 
                   '<div class="{tag}"><img src="<src_path>/img_{id}.jpg." alt="{tag}_{id}"></div>', 
                   content))

Prints:

['<photo id="4" />']
- empty: True
- id: '4'
- raw: '<photo id="4" />'
- startPhoto: ['photo', ['id', '4'], True]
  [0]:
    photo
  [1]:
    ['id', '4']
  [2]:
    True
- tag: 'photo'
raw: '<photo id="4" />'
tag: 'photo'
id:  '4'
This is a <div class="photo"><img src="<src_path>/img_4.jpg." alt="photo_4"></div> string

来源：https://stackoverflow.com/questions/54992490/replacing-a-custom-html-tag-in-a-python-string

标签

python

html

parsing