Does anyone know of a regular expression I could use to find URLs within a string? I've found a lot of regular expressions on Google for determining if an entire string is
I use the logic of finding the text between two dots (periods).
The regex below works fine with Python:
(?<=\.)[^}]*(?=\.)
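To see what this pattern actually returns, here is a minimal sketch using Python's `re` module. The helper name `between_dots` and the sample strings are my own, chosen only for illustration.

```python
import re

# Pattern from the answer: whatever sits after one dot and before a later dot.
BETWEEN_DOTS = re.compile(r"(?<=\.)[^}]*(?=\.)")

def between_dots(text):
    """Return every span the pattern finds in text (hypothetical helper)."""
    return BETWEEN_DOTS.findall(text)

print(between_dots("www.example.com"))   # ['example']
print(between_dots("www.google.co.in"))  # ['google.co']
```

Note that `[^}]*` is greedy and also matches spaces, so the match runs from the first dot to the last dot it can reach. That is fine for a bare host name, but in free text it will likely span more than one word.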
None of the solutions provided here solved the problems and use cases I had.
What I have provided here is the best I have found or made so far. I will update it when I find new edge cases that it doesn't handle.
\b
#Word cannot begin with special characters
(?<![@.,%&#-])
#Protocols are optional, but take them with us if they are present
(?<protocol>\w{2,10}:\/\/)?
#Domains have to be of a length of 1 chars or greater
((?:\w|\&\#\d{1,5};)[.-]?)+
#The domain ending has to be between 2 to 15 characters
(\.([a-z]{2,15})
#If no domain ending we want a port, only if a protocol is specified
|(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])
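The named group and the conditional in this pattern are written in a syntax that Python's `re` does not accept as-is, so here is a sketch of how it might be adapted. It assumes verbose mode for the comments, `(?P<protocol>...)` in place of `(?<protocol>...)`, and a case-insensitive flag; the flag and the helper name `find_urls` are my assumptions, not part of the answer.

```python
import re

# Python adaptation of the pattern above (assumed flags: VERBOSE, IGNORECASE).
URL_RE = re.compile(r"""
    \b
    (?<![@.,%&#-])                      # no special character right before
    (?P<protocol>\w{2,10}:\/\/)?        # optional protocol, Python group syntax
    ((?:\w|\&\#\d{1,5};)[.-]?)+         # domain labels
    (\.([a-z]{2,15})                    # domain ending of 2 to 15 letters
    |(?(protocol)(?:\:\d{1,6})|(?!)))   # or a port, only if a protocol matched
    \b
    (?![@])                             # not followed by @, to skip emails
    (\/)?
    (?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
    (?<![.,?!-])                        # do not end on trailing punctuation
""", re.VERBOSE | re.IGNORECASE)

def find_urls(text):
    """Return the full text of each match (hypothetical helper)."""
    return [m.group(0) for m in URL_RE.finditer(text)]

print(find_urls("Visit https://example.com/path?x=1, thanks"))
print(find_urls("mail me@example.com"))
```

`finditer` with `group(0)` is used instead of `findall` because the pattern has several capturing groups, and `findall` would return tuples of groups rather than whole matches.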
If you have the URL pattern, you should be able to search for it in your string. Just make sure that the pattern doesn't have ^ and $ marking the beginning and end of the URL string. So if P is the pattern for a URL, look for matches of P.
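A small sketch of the difference the anchors make, in Python. The pattern `P` below is a deliberately simple stand-in that I made up for the example; it is not a complete URL grammar.

```python
import re

# Hypothetical, simplified stand-in for "P".
P = r"https?://\S+"

text = "Docs are at https://example.com/guide and http://example.org today"

# With ^ and $ the whole string must be a URL, so nothing is found.
anchored = re.findall("^" + P + "$", text)

# Without anchors the same pattern is found anywhere in the string.
unanchored = re.findall(P, text)

print(anchored)    # []
print(unanchored)  # ['https://example.com/guide', 'http://example.org']
```

If you still need to validate that an entire string is a URL, `re.fullmatch(P, s)` gives the anchored behaviour without putting ^ and $ into the pattern itself.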
(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})|(?:mailto|)\:[\w\.]+\@[\w\.]+
If you want an explanation of each part, try it in regexr[.]com, where you will get a great explanation of every character.
The pattern is split by "|" (OR) because not all usable URIs have "//", so this is where you can list the schemes you are interested in matching as OR conditions.
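As a rough illustration, this is how the pattern could be used from Python. The sample text and the name `URI_RE` are mine; the pattern is copied unchanged and only split over two string literals for readability.

```python
import re

URI_RE = re.compile(
    # Schemes that are followed by "//", then host and optional port.
    r"(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})"
    # Schemes without "//", such as mailto.
    r"|(?:mailto|)\:[\w\.]+\@[\w\.]+"
)

text = "Connect to ssh://host.example.com:22 or write to mailto:me@example.com"

# The pattern has no capturing groups, so findall returns whole matches.
print(URI_RE.findall(text))
# ['ssh://host.example.com:22', 'mailto:me@example.com']
```

Keep in mind that the first alternative stops after the host and optional port, so any path or query string after it is not part of the match.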
It's simple.
Use this pattern: \b((ftp|https?)://)?([\w.-]+\.(com|net|org|gov|mil|int|edu|info|me)|(\d+\.\d+\.\d+\.\d+))(:\d+)?(\/[\w\/-]*(\?\w*(=\w+)*[&\w=-]*)*(#[\w-]+)*)?
It matches any link that contains:
Allowed Protocols: http, https and ftp
Allowed Domains: *.com, *.net, *.org, *.gov, *.mil, *.int, *.edu, *.info and *.me OR IP
Allowed Ports: true
Allowed Parameters: true
Allowed Hashes: true
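Here is a sketch of that pattern in Python. Each hyphen is written at the end of its character class, because Python's `re` rejects a range such as `[\w-\.]` with a "bad character range" error. The helper name `find_links` and the sample text are assumptions for the example.

```python
import re

LINK_RE = re.compile(
    r"\b((ftp|https?)://)?"                                   # optional protocol
    r"([\w.-]+\.(com|net|org|gov|mil|int|edu|info|me)"        # listed domains
    r"|(\d+\.\d+\.\d+\.\d+))"                                 # or a dotted IP
    r"(:\d+)?"                                                # optional port
    r"(\/[\w\/-]*(\?\w*(=\w+)*[&\w=-]*)*(#[\w-]+)*)?"         # path, query, hash
)

def find_links(text):
    """Return the full text of each match (hypothetical helper)."""
    return [m.group(0) for m in LINK_RE.finditer(text)]

print(find_links("see http://example.com:8080/a/b?x=1#top and 192.168.0.1"))
# ['http://example.com:8080/a/b?x=1#top', '192.168.0.1']
```

Because the domain endings are a fixed list, a link on any other ending is not matched as a whole, so you would need to extend the list for your own data.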
Here is a slightly more optimized regexp:
(?:(?:(https?|ftp|file):\/\/|www\.|ftp\.)|([\w\-_]+(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&:\/~\+#]*[A-Z\-\@?^=%&\/~\+#]){2,6}?
Here is a test with sample data: https://regex101.com/r/sFzzpY/6