My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S
If you really want efficient...
For starters, I would cut out the leading \S*? in the second regex. It serves no purpose apart from creating an opportunity for lots of backtracking. With that gone, the two patterns can be joined with a pipe (|):
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
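To show how the combined pattern behaves with re.findall, here is a minimal sketch. The sample doc and its URLs are made up for illustration, and the pattern is written as a raw string, so the quote needs no backslash. With two capture groups, findall returns one tuple per match, and the group that did not take part comes back as an empty string:

```python
import re

# Hypothetical sample input; the URLs are made up for illustration.
doc = (
    '<img src="http://24.media.tumblr.com/abc/tumblr_xyz_500.jpg"> '
    '<source src="http://www.tumblr.com/video_file/123/tumblr_AbC123" type="video/mp4">'
)

# Both patterns joined with a pipe, one capture group per alternative.
pattern = re.compile(
    r'src."(\S*?media.tumblr\S*?tumblr_\S*?jpg)'
    r'|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)

images, videos = [], []
# findall yields (image, video) tuples; the unused group is ''.
for image, video in pattern.findall(doc):
    if image:
        images.append(image)
    if video:
        videos.append(video)

print(images)
print(videos)
```

This way the document is scanned once, and the two result lists are rebuilt from the tuples afterwards.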
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first pattern, which lets you drop all parentheses and directly match what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
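A rough sketch of the lookbehind version in use, again on a made-up sample doc. Since the pattern has no capture groups, findall returns the matched strings directly. Telling images from videos by checking for 'video_file' is just one assumed way to split them again:

```python
import re

# Hypothetical sample input; the URLs are made up for illustration.
doc = (
    '<img src="http://24.media.tumblr.com/abc/tumblr_xyz_500.jpg"> '
    '<source src="http://www.tumblr.com/video_file/123/tumblr_AbC123" type="video/mp4">'
)

# No capture groups: the fixed-width lookbehind checks for src." without consuming it.
pattern = re.compile(
    r'(?<=src.")\S*?media.tumblr\S*?tumblr_\S*?jpg'
    r'|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*'
)

matches = pattern.findall(doc)  # plain list of strings, no tuples

# Assumed splitting rule: video URLs are the ones containing 'video_file'.
videos = [m for m in matches if 'video_file' in m]
images = [m for m in matches if 'video_file' not in m]

print(images)
print(videos)
```

Python's re module only accepts fixed-width lookbehinds, which is fine here because src." always spans five characters.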
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
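A quick illustration of the difference, using made-up test strings:

```python
import re

loose = re.compile(r'media.tumblr')    # '.' matches any character except a newline
strict = re.compile(r'media\.tumblr')  # '\.' matches only a literal period

# Both accept the real thing.
loose_hits_real = loose.search('media.tumblr') is not None
strict_hits_real = strict.search('media.tumblr') is not None

# Only the unescaped version accepts an arbitrary character in that spot.
loose_hits_other = loose.search('mediaXtumblr') is not None
strict_hits_other = strict.search('mediaXtumblr') is not None

print(loose_hits_real, strict_hits_real, loose_hits_other, strict_hits_other)
```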
You can use the re.IGNORECASE option and drop the uppercase range from the character class:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
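A small sketch of what the flag changes, using only the video half of the pattern and a made-up URL. Keep in mind that re.IGNORECASE applies to the whole pattern, so literals such as src, http and jpg become case-insensitive too. That may or may not be what you want.

```python
import re

# Hypothetical sample URL, made up for illustration.
text = 'http://www.tumblr.com/video_file/123/tumblr_AbC123 '

video_re = r'http\S*?video_file\S*?tumblr_[a-z0-9]*'

# Without the flag, [a-z0-9]* stops at the first uppercase letter.
without_flag = re.search(video_re, text).group(0)

# With the flag, [a-z0-9] also matches A-Z.
with_flag = re.search(video_re, text, re.IGNORECASE).group(0)

print(without_flag)
print(with_flag)
```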
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
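One thing to watch for with this variant: it has four capture groups (two outer, two inner), so re.findall returns 4-tuples. A minimal sketch on a made-up sample doc, picking the inner groups out by position:

```python
import re

# Hypothetical sample input; the URLs are made up for illustration.
doc = (
    '<img src="http://24.media.tumblr.com/abc/tumblr_xyz_500.jpg"> '
    '<source src="http://www.tumblr.com/video_file/123/tumblr_AbC123" type="video/mp4">'
)

pattern = re.compile(
    r'(src."(\S*?media.tumblr\S*?tumblr_\S*?jpg))'
    r'|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))'
)

rows = pattern.findall(doc)  # one 4-tuple per match, '' for unused groups

# Index 1 is the inner image group, index 3 the inner video group.
images = [row[1] for row in rows if row[1]]
videos = [row[3] for row in rows if row[3]]

print(images)
print(videos)
```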
Demo on Regex Tester