RegEx for extracting HTML Image properties

走了就别回头了

I need a RegEx pattern for extracting all the properties of an image tag.

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.

6 Answers

攒了一身酷

2021-01-28 12:48

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.

It won't. Use an HTML parser if you have to parse "evil" HTML (HTML from an unknown source).
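
As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser module. The class name ImgAttributeCollector and the helper extract_img_attributes are hypothetical names chosen for this example, and the choice of Python is an assumption; the question does not name a language.

```python
from html.parser import HTMLParser


class ImgAttributeCollector(HTMLParser):
    """Collects the attributes of every <img> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_img_attributes(html):
    """Return a list of attribute dicts, one per <img> tag in html."""
    collector = ImgAttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.images


if __name__ == "__main__":
    sample = (
        "<p><IMG SRC=\"a.png\" alt='A \"quoted\" title' width=100>"
        "<img data-id=\"7\" src='b.png'></p>"
    )
    for attributes in extract_img_attributes(sample):
        print(attributes)
```

The parser copes with upper-case tag names, single quotes, and unquoted values without any extra work, which is the main argument for preferring it over a hand-written pattern.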

一向

2021-01-28 12:50

Before committing yourself to regex, see what it can do: "RegEx match open tags except XHTML self-contained tags"

你的背包

2021-01-28 12:53

/<img(\s+([a-z][a-z0-9-]*)=(["']([^"']*)["']|[^\s"'>]+))+\s*\/?>/i

Running a match_all with this pattern returns the following groups (the exact format depends on your library, but the key indexes are):

0 -> image tag
1 -> attribute
2 -> attribute name
3 -> attribute value (with enclosing quotes, if present)
4 -> attribute value (without enclosing quotes; empty if the value is unquoted, in which case use 3)
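
One caveat with a single pattern like the one above: in many engines (Python's re among them) a repeated capturing group keeps only its last repetition, so groups 1 to 4 describe just the final attribute of each tag. A common workaround is to match in two passes. The sketch below is one way to do that in Python; the names IMG_TAG, ATTRIBUTE, and img_attributes are hypothetical, and the patterns are a simplification that assumes no ">" appears inside an attribute value.

```python
import re

# Pass 1: locate each <img ...> tag and capture everything between
# the tag name and the closing ">" (or "/>").
IMG_TAG = re.compile(r"<img\b([^>]*?)\s*/?>", re.IGNORECASE)

# Pass 2: pull name=value pairs out of that captured text.
ATTRIBUTE = re.compile(
    r"""([a-z][a-z0-9_:.-]*)   # group 1: attribute name
        \s*=\s*
        (?:"([^"]*)"           # group 2: double-quoted value
          |'([^']*)'           # group 3: single-quoted value
          |([^\s"'>]+))        # group 4: unquoted value
    """,
    re.IGNORECASE | re.VERBOSE,
)


def img_attributes(html):
    """Return one {name: value} dict per <img> tag found in html."""
    results = []
    for tag in IMG_TAG.finditer(html):
        attrs = {}
        for name, double_q, single_q, bare in ATTRIBUTE.findall(tag.group(1)):
            # Exactly one of the three value groups can be non-empty.
            attrs[name.lower()] = double_q or single_q or bare
        results.append(attrs)
    return results


if __name__ == "__main__":
    sample = (
        "<IMG SRC=\"a.png\" alt='it is \"ok\"' width=100>"
        "<img data-id=7 src=\"/img/b.png\" />"
    )
    for attributes in img_attributes(sample):
        print(attributes)
```

This still inherits the usual limits of regex on HTML: comments, script blocks, and attributes without a value are not handled.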

渐次进展

2021-01-28 13:01

If performance is not a big concern, I'd go with an HTML parser (like BeautifulSoup in Python) if you are doing this server-side, or jQuery or just plain JavaScript if you are doing it client-side. Granted, it is overkill, but it is a lot quicker to get working, less likely to have bugs (since the authors have already thought of the corner cases), and it will handle potentially malformed input.

北海茫月

2021-01-28 13:02

If you want all attribute values, might I suggest using the DOM? Something like element.attributes will work well.

If you insist on a regex, /\b\w+="[^"]+"/ should get every double-quoted attribute.
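
To see what that short pattern does and does not pick up, here is a small Python check. The name simple_attributes and the sample tag are made up for this illustration.

```python
import re

# The pattern from the answer: a word, an equals sign, and a
# non-empty double-quoted value.
SIMPLE_ATTRIBUTE = re.compile(r'\b\w+="[^"]+"')


def simple_attributes(tag):
    """Return every name="value" substring the simple pattern finds."""
    return SIMPLE_ATTRIBUTE.findall(tag)


if __name__ == "__main__":
    tag = '<img src="a.png" alt="A cat" width=100 title=\'hi\' data-id="7">'
    print(simple_attributes(tag))
    # → ['src="a.png"', 'alt="A cat"', 'id="7"']
```

The unquoted width and the single-quoted title are skipped, and the hyphenated data-id is reported as id, because \w does not match a hyphen. An empty value such as alt="" is also skipped because of the + quantifier.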

粉色の甜心

2021-01-28 13:08

Your best bet is to use something like HTML Agility Pack instead of regex. It's designed to handle a lot of cases and can save you more than a few of the headaches that come from hammering out edge cases yourself.
