What is a regex to find the first image in an image tag in an HTML document? My previous tries have not really worked, as they just matched based on .jpg"
and didn
For scraping HTML, a simple and very loose regex would be: /\<img.*?src="(.*?)"/
Using a real DOM parser is of course the preferred method.
This is a perfect example of a task that is tricky and unreliable with regex, and almost trivially easy with an HTML parser. Use a parser for this, not regex.
You haven't said which language you're using, but I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which are libraries for Python, PHP, and .NET, respectively. Trust me: save yourself some pain and use those instead.
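To show how little code the parser route takes, here is a minimal Python sketch. It uses the standard library's html.parser rather than any of the libraries named above, only so that it runs without installing anything. The names FirstImgSrc and first_img_src are made up for this illustration, and the sketch returns the src of the first img tag that actually has a src attribute.

```python
from html.parser import HTMLParser


class FirstImgSrc(HTMLParser):
    """Remembers the src of the first <img> tag that carries one.

    Hypothetical helper class for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the
        # quotes around values; attrs is a list of (name, value) pairs.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(html):
    """Return the first image URL found in html, or None if there is none."""
    parser = FirstImgSrc()
    parser.feed(html)
    parser.close()
    return parser.src


if __name__ == "__main__":
    print(first_img_src('<p>text</p><img alt="logo" src="logo.png"><img src="b.jpg">'))
```

Because the parser handles quoting, attribute order, and tag case itself, none of that has to be encoded in a pattern.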
Edit: If you must use a regex, go with @ridgerunner's pattern.
As anubhava correctly points out, regex is not 100% reliable for parsing HTML. However, for one-shot tasks (i.e. not production code), a regex solution can do a pretty good job (and is quite fast as well):
Capture the image URL filename (sans query or fragment) from the first IMG element into group $1:
<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)
Note that there are certainly edge cases where this does not work.
Edit: Added ">" to the negated SRC attribute value character class.
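For a quick one-off script, the pattern above could be applied like this in Python. This is only a sketch: the names IMG_SRC and first_img_url are invented here, and the re.IGNORECASE flag is my assumption (the pattern as posted does not say whether it is meant to be case-insensitive).

```python
import re

# The pattern from the answer, unchanged. re.IGNORECASE is an assumption
# added so that <IMG SRC=...> is matched as well.
IMG_SRC = re.compile(r"""<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)""", re.IGNORECASE)


def first_img_url(html):
    """Return group 1 of the first match (URL without query or fragment), or None."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = '<div><img class="a" src="http://example.com/pic.jpg?w=100#top"><img src="two.png"></div>'
    print(first_img_url(sample))
```

One edge case that is easy to hit: an attribute such as data-src placed before src is matched first, because the pattern only looks for the letters "src" followed by "=".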