Question
Apologies for yet another regex question!
I have some input text which, rather unhelpfully, has multiple URLs (and only URLs) all on one line with no separators:
https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n
This example contains just two URLs, but there could be more.
I'm trying to separate the URLs into a list using Python.
I've searched for solutions and tried a few, but I can't get any of them to work exactly, because they greedily consume all the following URLs, e.g. https://stackoverflow.com/a/6883094/659346
I realise that's probably because https://... could legally appear in the query part of a URL, but in my case I'm willing to assume it can't, and to assume that whenever it occurs it's the start of the next URL.
I also tried (http[s]://.*?), but with the ? it matches just the https://, and without the ? it matches the whole of the text.
Answer 1:
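You need to use a positive lookahead assertion, so that the lazy match stops just before the next https?://, at the end of the string, or at whitespace.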
>>> import re
>>> s = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"
>>> re.findall(r'https?://.*?(?=https?://|$|\s)', s)
['https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZg', 'https://console.developers.google.com/project/reducted/?authuser=1']
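To see why the lookahead matters, here is a minimal sketch comparing the three variants side by side. The sample string and its example hostnames are made up for illustration; they only mimic the shape of the input in the question.

```python
import re

# Shortened, made-up sample in the same shape as the question's input.
s = "https://a.example/x?q=1https://b.example/y\n"

# Lazy with nothing after it: .*? is satisfied by zero characters,
# so each match is only the scheme.
lazy_only = re.findall(r'https?://.*?', s)

# Greedy: .* runs to the end of the line and swallows the second URL.
greedy = re.findall(r'https?://.*', s)

# Lazy plus lookahead: .*? must keep expanding until the lookahead
# succeeds, i.e. until the next URL, the end of the string, or whitespace.
with_lookahead = re.findall(r'https?://.*?(?=https?://|$|\s)', s)

print(lazy_only)       # ['https://', 'https://']
print(greedy)          # ['https://a.example/x?q=1https://b.example/y']
print(with_lookahead)  # ['https://a.example/x?q=1', 'https://b.example/y']
```

The lookahead consumes nothing, so the next match can start right at the following https://.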
Answer 2:
(https?:\/\/(?:(?!https?:\/\/).)*)
Try this. See the demo:
https://regex101.com/r/tX2bH4/15
import re
p = re.compile(r'(https?:\/\/(?:(?!https?:\/\/).)*)')
test_str = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"
# Each character is consumed only if no new http:// or https:// starts at it.
print(p.findall(test_str))
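If this is needed in more than one place, the same pattern could be wrapped in a small helper. This is only a sketch: the name split_urls and the sample hostnames are made up for illustration, and it keeps the question's assumption that http:// or https:// never appears inside a URL.

```python
import re

# Same idea as the pattern above, without the unneeded escaping of "/"
# and without the capturing group.
URL_RUN = re.compile(r'https?://(?:(?!https?://).)*')

def split_urls(text):
    """Return the URLs in text, starting a new one at every http:// or https://."""
    # "." does not match a newline, so a trailing "\n" is never captured;
    # strip() only guards against other stray whitespace such as spaces.
    return [m.group(0).strip() for m in URL_RUN.finditer(text)]

urls = split_urls("http://a.example/1https://b.example/2?x=yhttps://c.example/\n")
print(urls)
# ['http://a.example/1', 'https://b.example/2?x=y', 'https://c.example/']
```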
Source: https://stackoverflow.com/questions/27966726/regex-separate-urls-in-text-that-has-no-separators