Regular expression to find URLs within a string

被撕碎了的回忆 2020-11-22 14:18

Does anyone know of a regular expression I could use to find URLs within a string? I've found a lot of regular expressions on Google for determining if an entire string is

27 answers
  • 2020-11-22 14:36

    I found this which covers most sample links, including subdirectory parts.

    Regex is:

    (?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?
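
A minimal sketch of how this pattern might be applied from Python, assuming the standard `re` module. The helper name `find_urls` and the sample text are illustrative and not part of the answer.

```python
import re

# Pattern from the answer above; both nested-parenthesis groups use \( (a
# literal opening parenthesis). Triple quotes let the raw string hold ' and ".
URL_PATTERN = re.compile(
    r"""(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?"""
)


def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


print(find_urls("Docs at https://example.com/a/b and www.example.org/x_(y)."))
```

Note that the whole tail of the pattern is optional, so a bare prefix (lowercase letters or digits followed by a dot) can match on its own. Results may need filtering.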
    
  • 2020-11-22 14:36

    I use this Regex:

    /((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]/ig
    

    It works fine for many URLs, like: http://google.com, https://dev-site.io:8080/home?val=1&count=100, www.regexr.com, localhost:8080/path, ...
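
As a rough illustration, the same pattern can be tried from Python with the standard `re` module. The function name `find_urls` and the sample sentence are assumptions made for this sketch.

```python
import re

# The answer's /.../ig literal: "i" maps to re.IGNORECASE, and "g" (all
# matches) is what finditer() provides.
URL_PATTERN = re.compile(r"((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]", re.IGNORECASE)


def find_urls(text):
    """Return the full text of every match (group 0), ignoring sub-groups."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


sample = "see http://google.com, www.regexr.com and localhost:8080/path now"
print(find_urls(sample))
```

Using `finditer` with `group(0)` avoids the tuples that `re.findall` returns when a pattern contains capturing groups.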

  • 2020-11-22 14:38

    I use the C# Uri class, and it works well with IP addresses and localhost.

    public static bool CheckURLIsValid(string url)
    {
        Uri returnURL;

        // Accept only absolute URLs whose scheme is http or https.
        return Uri.TryCreate(url, UriKind.Absolute, out returnURL)
            && (returnURL.Scheme == Uri.UriSchemeHttp || returnURL.Scheme == Uri.UriSchemeHttps);
    }
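
The method above validates a single candidate string. To find URLs inside a longer string with this approach, one option is to split on whitespace and test each token. The sketch below shows that idea in Python using `urllib.parse.urlsplit`; it is an analogue written for illustration, not the C# `Uri` class, and the names `is_http_url` and `find_http_urls` are hypothetical.

```python
from urllib.parse import urlsplit


def is_http_url(candidate):
    """Rough analogue of the C# check: http/https scheme plus a host part."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 literal.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def find_http_urls(text):
    """Split on whitespace and keep the tokens that pass the check."""
    return [token for token in text.split() if is_http_url(token)]


print(find_http_urls("ping http://127.0.0.1:8080/x or https://localhost now ftp://a.b"))
```

Trailing punctuation stays attached to a token with this approach, so text such as "http://example.com," would need trimming before the check.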
    
  • 2020-11-22 14:38

    I liked Stefan Henze's solution, but it would pick up 34.56. It's too general, and I have unparsed HTML. There are four anchors for a URL:

    • www,
    • http:// (and co),
    • a . followed by letters and then a /,
    • or letters, a ., and one of these: https://ftp.isc.org/www/survey/reports/current/bynum.txt.

    I used lots of info from this thread. Thank you all.

    "(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
    

    The regex above handles just about everything except a string like "eurls:www.google.com,facebook.com,http://test.com/", which it returns as a single match. To be honest, I don't know why I added gopher etc. R code as proof:

    library(stringr)  # provides str_extract_all()
    if(T){
      wierdurl<-vector()
      wierdurl[1]<-"https://JP納豆.例.jp/dir1/納豆 "
      wierdurl[2]<-"xn--jp-cd2fp15c.xn--fsq.jp "
      wierdurl[3]<-"http://52.221.161.242/2018/11/23/biofourmis-collab"
      wierdurl[4]<-"https://12000.org/ "
      wierdurl[5]<-"  https://vg-1.com/?page_id=1002 "
      wierdurl[6]<-"https://3dnews.ru/822878"
      wierdurl[7]<-"The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
      Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
      The code below catches all urls in text and returns urls in list. "
      wierdurl[8]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
      Alsotherearesomeurls:www.google.com,facebook.com,http://test.com/method?param=wasd
      Thecodebelowcatchesallurlsintextandreturnsurlsinlist. "
      wierdurl[9]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-stringAlsotherearesomeurlsZwww.google.com,facebook.com,http://test.com/method?param=wasdThecodebelowcatchesallurlsintextandreturnsurlsinlist."
      wierdurl[10]<-"1facebook.com/1res"
      wierdurl[11]<-"1facebook.com/1res/wat.txt"
      wierdurl[12]<-"www.e "
      wierdurl[13]<-"is this the file.txt i need"
      wierdurl[14]<-"xn--jp-cd2fp15c.xn--fsq.jpinspiredby "
      wierdurl[15]<-"[xn--jp-cd2fp15c.xn--fsq.jp/inspiredby "
      wierdurl[16]<-"xnto--jpto-cd2fp15c.xnto--fsq.jpinspiredby "
      wierdurl[17]<-"fsety--fwdvg-gertu56.ffuoiw--ffwsx.3dinspiredby "
      wierdurl[18]<-"://3dnews.ru/822878 "
      wierdurl[19]<-" http://mywebsite.com/msn.co.uk "
      wierdurl[20]<-" 2.0http://www.abe.hip "
      wierdurl[21]<-"www.abe.hip"
      wierdurl[22]<-"hardware/software/data"
      regexstring<-vector()
      regexstring[2]<-"(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
      regexstring[3]<-"/(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#\\/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#\\/%=~_|$])/igm"
      regexstring[4]<-"[a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]?"
      regexstring[5]<-"((http|ftp|https)\\:\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
      regexstring[6]<-"((http|ftp|https):\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?"
      regexstring[7]<-"(http|ftp|https)(:\\/\\/)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
      regexstring[8]<-"(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#/%=~_|$])"
      regexstring[10]<-"((http[s]?|ftp):\\/)?\\/?([^:\\/\\s]+)((\\/\\w+)*\\/)([\\w\\-\\.]+[^#?\\s]+)(.*)?(#[\\w\\-]+)?"
      regexstring[12]<-"http[s:/]+[[:alnum:]./]+"
      regexstring[9]<-"http[s:/]+[[:alnum:]./]+" #in DLpages 230
      regexstring[1]<-"[[:alnum:]-]+?[.][:alnum:]+?(?=[/ :])" #in link_graphs 50
      regexstring[13]<-"^(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?$"
      regexstring[14]<-"(((((http|ftp|https):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]+(?:(?:\\.[\\w_-]+)*))((\\.((org|com|net|edu|gov|mil|int)|(([:alpha:]{2})(?=[, ]))))|([\\/]([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
      regexstring[15]<-"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
        }
    
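    # Run the selected pattern(s) against every test string and print the matches.
    for(i in wierdurl){#c(7,22)
      for(c in regexstring[c(15)]) {  # 15 is the final regex shown above
        print(paste(i,which(regexstring==c)))  # test string and pattern index
        print(str_extract_all(i,c))
      }
    }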
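
For readers without R, here is a rough Python port of the final regex (number 15 in the R code), assuming the standard `re` module behaves like stringr's engine for these inputs. The helper name `find_urls` is hypothetical, and only a few of the test strings are reused.

```python
import re

# Final regex from the answer, with R's doubled backslashes reduced to single
# ones and the string split across lines for readability.
FINAL_PATTERN = re.compile(
    r"(((((http|ftp|https|gopher|telnet|file|localhost):\/\/)|(www\.)|(xn--)){1}"
    r"([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?)"
    r"|(([\w_-]{2,200}(?:(?:\.[\w_-]+)*))"
    r"((\.[\w_-]+\/([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?)"
    r"|(\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)"
    r"|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt"
    r"|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))"
    r"(?!(((ttp|tp|ttps):\/\/)|(ww\.)|(n--)))"
)


def find_urls(text):
    """Return the full text of each match, like str_extract_all() does."""
    return [m.group(0) for m in FINAL_PATTERN.finditer(text)]


for sample in (
    "https://3dnews.ru/822878",
    "is this the file.txt i need",
    "price 34.56 today",
    "eurls:www.google.com,facebook.com,http://test.com/",
):
    print(sample, "->", find_urls(sample))
```

The last sample is the known weak spot: the comma-separated URLs come back as one match rather than three.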
    
  • 2020-11-22 14:39

    I guess no regex is perfect for this use case. I found a pretty solid one here:

    /(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])/igm
    

    Some differences / advantages compared to the other ones posted here:

    • It does not match email addresses
    • It does match localhost with a port when a scheme is present, e.g. http://localhost:12345
    • It won't detect something like moo.com without http or www

    See here for examples
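
The points listed above can be checked with a short Python sketch, assuming the standard `re` module. The `i` flag of the literal becomes `re.IGNORECASE`, and the helper name `find_urls` and the sample strings are made up for illustration.

```python
import re

# Same pattern as the /.../igm literal; "g" corresponds to finditer(), and
# "m" has no effect here because the pattern contains no ^ or $ anchors.
URL_PATTERN = re.compile(
    r"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


print(find_urls("mail me at user@example.com"))   # email address
print(find_urls("http://localhost:12345"))        # localhost with a scheme
print(find_urls("moo.com"))                       # no http or www prefix
print(find_urls("Go to www.example.com."))        # trailing period
```

Every match has to start with a scheme, `www.`, or `ftp.`, which is why bare domains and email addresses are skipped.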

  • 2020-11-22 14:39

    If you have to be strict on selecting links, I would go for:

    (?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
    

    For more information, read:

    An Improved Liberal, Accurate Regex Pattern for Matching URLs
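
A minimal sketch of using this stricter pattern from Python, assuming the standard `re` module. The leading `(?i)` is replaced by `re.IGNORECASE`, which is equivalent here; the helper name `find_urls` and the sample sentences are illustrative.

```python
import re

# Pattern from the answer with the inline (?i) flag moved to re.IGNORECASE.
# Triple quotes let the raw string contain both ' and ".
URL_PATTERN = re.compile(
    r"""\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""",
    re.IGNORECASE,
)


def find_urls(text):
    """Return the full text of each match (group 0)."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


print(find_urls("See http://example.com/path_(with_parens) for details."))
print(find_urls("Visit www.example.com/foo, then stop."))
```

The nested quantifiers in this pattern can backtrack heavily on long inputs that almost match, so it may be worth limiting input length when scanning untrusted text.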
