RegEx for extracting HTML Image properties

走了就别回头了 2021-01-28 12:02

I need a RegEx pattern for extracting all the properties of an image tag.

As we all know, there are lots of malformed HTML out there, so the pattern has to cover those possibilities.

6 Answers
  • 2021-01-28 12:48

    As we all know, there are lots of malformed HTML out there, so the pattern has to cover those possibilities.

    It won't. Use an HTML parser if you have to parse "evil" HTML (that is, HTML from an unknown source).
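As a minimal sketch of the parser route, assuming Python and only its standard library: `html.parser.HTMLParser` reports each start tag together with its attributes, so collecting `<img>` properties needs no regex at all. The names `ImgAttrCollector` and `extract_img_attrs` are made up for this illustration.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names.
        # attrs is a list of (name, value) pairs; value is None
        # for attributes written without a value (e.g. "hidden").
        if tag == "img":
            self.images.append(dict(attrs))


def extract_img_attrs(html):
    """Return one {name: value} dict per <img> tag in html."""
    collector = ImgAttrCollector()
    collector.feed(html)
    collector.close()
    return collector.images


if __name__ == "__main__":
    sample = '<IMG SRC=a.png alt=\'it"s\' width="10"><p>text<img src="b.png" hidden />'
    for attrs in extract_img_attrs(sample):
        print(attrs)
```

The sample deliberately mixes an uppercase tag, an unquoted value, a quote character inside a value, and a self-closing tag, which are the cases that tend to break hand-written patterns.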

  • 2021-01-28 12:50

    Before committing yourself to regex, see what it can do: "RegEx match open tags except XHTML self-contained tags"

  • 2021-01-28 12:53
    /<img(\s+([a-z][\w-]*)=(["']([^"']*)["']|[^\s>]+))+\s*\/?>/i
    

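    Running a match-all on this returns the following groups (the format depends on your library, but the key indexes are as listed). Note that in most engines (PCRE, JavaScript, Python) a repeated group keeps only its last iteration, so groups 1 to 4 describe the last attribute of each tag: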

    0 -> image tag
    1 -> attribute
    2 -> attribute name
    3 -> attribute value (with enclosing quotes if present)
    4 -> attribute value (without enclosing quotes if it has them; otherwise empty, so use 3)
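Python's `re` module returns only the last iteration of a repeated group, so one way to get every attribute is to split the work into two passes: first find each `<img>` tag, then iterate over the attributes inside it. The following is a sketch under that assumption, not a robust HTML tokenizer; `IMG_TAG`, `ATTR`, and `img_attributes` are names invented for this example.

```python
import re

# Pass 1: find each <img ...> tag; group 1 is the raw attribute text.
# Assumption: no attribute value contains a literal ">".
IMG_TAG = re.compile(r"<img\b([^>]*)>", re.IGNORECASE)

# Pass 2: one attribute per match. The value may be double-quoted,
# single-quoted, or unquoted, and is optional (bare attributes).
ATTR = re.compile(
    r"""([^\s"'<>/=]+)            # group 1: attribute name
        (?:\s*=\s*
            (?:"([^"]*)"          # group 2: double-quoted value
              |'([^']*)'          # group 3: single-quoted value
              |([^\s"'=<>`]+)     # group 4: unquoted value
            )
        )?""",
    re.VERBOSE,
)


def img_attributes(html):
    """Return one {name: value} dict per <img> tag found in html."""
    results = []
    for tag in IMG_TAG.finditer(html):
        attrs = {}
        for m in ATTR.finditer(tag.group(1)):
            name = m.group(1).lower()
            # At most one of groups 2-4 is set; None means "no value".
            value = next((g for g in m.groups()[1:] if g is not None), None)
            attrs[name] = value
        results.append(attrs)
    return results


if __name__ == "__main__":
    print(img_attributes('<img src="a.png" alt=\'it"s\' width=10 hidden />'))
```

Separate alternatives for double and single quotes let a value contain the other quote character, which a shared `["']...["']` pair cannot handle. The sketch still fails on a `>` inside a quoted value, on comments, and on script content, which is where a real parser earns its keep.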
    
  • 2021-01-28 13:01

    If performance is not a big concern, I'd go with an HTML parser (like BeautifulSoup in Python) if you are doing this server-side, or jQuery or just plain JavaScript if you are doing it client-side. Granted, it is overkill, but it is a lot quicker to write, less likely to have bugs (since the parser authors have already thought of the corner cases), and it will handle potentially malformed markup.

  • 2021-01-28 13:02

    If you want all attribute values, might I suggest using the DOM? Something like element.attributes will work well.

    If you insist on a regex, /\b\w+="[^"]+"/ should get every double-quoted attribute.
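To see what the short pattern does and does not catch, here is a small Python check. The helper name `simple_attrs` and the sample tag are assumptions made for this illustration.

```python
import re

# The short pattern from the answer: a word, "=", and a non-empty
# double-quoted value.
SIMPLE_ATTR = re.compile(r'\b\w+="[^"]+"')


def simple_attrs(tag):
    """Return every name="value" substring the short pattern finds."""
    return SIMPLE_ATTR.findall(tag)


if __name__ == "__main__":
    tag = '<img src="a.png" alt=\'logo\' width=10 title="" data-id="7">'
    print(simple_attrs(tag))
```

On this sample the pattern finds `src="a.png"` and `id="7"` only: it skips the single-quoted, unquoted, and empty values, and it truncates the hyphenated name `data-id` because `\w` does not match `-`.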

  • 2021-01-28 13:08

    Your best bet is to use something like HTML Agility Pack instead of regex. It's designed to handle a lot of cases and can save you more than a few headaches from hammering out edge cases yourself.
