Detecting a (naughty or nice) URL or link in a text string

Posted by 人盡茶涼 on 2019-11-30 10:22:54

Question


How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?

The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.

Some objectives:

  • The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
  • URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
  • Any other funny business

Of course, I am blocking spam, but the same process could be used to auto-link text.

Ideas

Here are some things I'm thinking.

  • The content is native-language prose so I can be trigger-happy in detection
  • Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
  • Maybe multiple passes is a better strategy (a rough sketch follows this list), with scans for:
    • Well-formed URLs
    • All non-whitespace followed by '.' followed by any valid TLD
    • Anything else?
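
As a rough sketch of the multi-pass idea (Python assumed; the TLD alternation is a small illustrative subset, not the full registry):

import re

# Pass 1: well-formed URLs with an explicit scheme.
WELL_FORMED = re.compile(r'https?://\S+', re.IGNORECASE)

# Pass 2: any non-whitespace run followed by a dot and a plausible TLD.
BARE_DOMAIN = re.compile(
    r'\S+\.(?:com|net|org|info|biz|edu|gov|[a-z]{2})(?:/\S*)?',
    re.IGNORECASE)

def contains_link(text):
    return bool(WELL_FORMED.search(text) or BARE_DOMAIN.search(text))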

Related Questions

I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.

  • replace URL with HTML Links javascript
  • What is the best regular expression to check if a string is a valid URL
  • Getting parts of a URL (Regex)

Update and Summary

Wow, there are some very good heuristics listed here! For me, the best bang-for-the-buck is a synthesis of the following:

  1. @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
  2. For those suspicious strings, replace the dot with a dot-looking character as per @capar
  3. A good dot-looking character is @Sharkey's subscripted · (i.e. "·"). · is also a word boundary, so it's harder to casually copy & paste. (A combined sketch of this synthesis appears at the end of this summary.)

That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:

  • Strip out all dotted-quads (@Sharkey's comment to his own answer)
  • @Sporkmonger's requirement for client-side Javascript which inserts a required hidden field into the form.
  • Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.)
  • Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
  • Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
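
Putting items 1-3 together, a minimal combined sketch (Python assumed; the TLD list is truncated for illustration, and the <sub> markup is just one way to render the middot):

import re

# Item 1: the TLD chokepoint: flag comments containing word.tld patterns.
TLD_PATTERN = re.compile(
    r'\.(?:[a-z]{2}|com|net|org|info|biz|edu|gov)(?=[/\s]|$)',
    re.IGNORECASE)

def defang(comment):
    if not TLD_PATTERN.search(comment):
        return comment
    # Items 2 and 3: replace dots between word characters with a subscripted
    # middot so the text still reads but cannot be pasted into an address bar.
    return re.sub(r'\b\.\b', '<sub>&middot;</sub>', comment)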

Answer 1:


I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will be actively trying to contravene your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal were something else.

I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:

[^\b]\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)[\b/]

Things this will get:

  • buyfunkypharmaceuticals.it
  • google.com
  • http://stackoverflow.com/questions/700163/

It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
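
For what it's worth, a Python adaptation of this check might look like the sketch below. Note that \b inside a character class means a backspace in most regex engines, so the boundaries are rewritten as lookarounds here; treat this as illustrative, not definitive.

import re

TLD_CHECK = re.compile(
    r'\.(?:[a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|'
    r'mobi|museum|name|net|org|pro|tel|travel)(?=/|\b)',
    re.IGNORECASE)

def contains_suspect_tld(text):
    # True if the text contains "<something>.<tld>" followed by a slash
    # or a word boundary, per the pattern above.
    return TLD_CHECK.search(text) is not None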




Answer 2:


Check these articles:

  • The Problem With URLs
  • Detecting URLs in a Block of Text



Answer 3:


I'm not sure if detecting URLs with a regex is the right way to solve this problem. Usually you will miss some sort of obscure edge case that spammers will be able to exploit if they are motivated enough.

If your goal is just to filter spam out of comments, then you might want to think about Bayesian filtering. It has proved to be very accurate in flagging email as spam, and it might be able to do the same for you, depending on the volume of text you need to filter.




Answer 4:


I know this doesn't help with auto-linking text, but what if you searched for and replaced all full-stop periods with a character that looks the same, such as the Unicode character HEBREW POINT HIRIQ (U+05B4)?

The following paragraph is an example:

This might workִ The period looks a bit odd but it is still readableִ The benefit of course is that anyone copying and pasting wwwִgoogleִcom won't get too farִ :)
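
As a one-line sketch (Python assumed), the substitution is simply:

def defang_dots(text):
    # Swap every full stop for HEBREW POINT HIRIQ (U+05B4); the text still
    # reads, but a pasted wwwִgoogleִcom won't resolve.
    return text.replace(".", "\u05b4")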




Answer 5:


Well, obviously the low-hanging fruit are things that start with http:// and www. Trying to filter out things like "www . g mail . com" leads to interesting philosophical questions about how far you want to go. Do you want to take it the next step and filter out "www dot gee mail dot com" as well? How about abstract descriptions of a URL, like "The abbreviation for world wide web followed by a dot, followed by the letter g, followed by the word mail, followed by a dot, concluded with the TLD abbreviation for commercial"?

It's important to draw the line on what sorts of things you're going to try to filter before you continue with designing your algorithm. I think the line should be drawn at the level where "gmail.com" is considered a URL but "gmail. com" is not. Otherwise, you're likely to get false positives every time someone fails to capitalize the first letter of a sentence.




Answer 6:


Since you are primarily looking for invitations to copy and paste into a browser address bar, it might be worth taking a look at the code used in open source browsers (such as Chrome or Mozilla) to decide if the text entered into the "address bar equivalent" is a search query or a URL navigation attempt.




Answer 7:


Ping the possible URL

If you don't mind a little server side computation, what about something like this?

urls = []
for possible_url in extracted_urls(comment):
    if pingable(possible_url):
        urls.append(possible_url)  # could be a list comprehension, but kept explicit here

Here:

  1. extracted_urls takes in a comment and uses a conservative regex to pull out possible candidates

  2. pingable actually uses a system call to determine whether the hostname exists on the web. You could have a simple wrapper parse the output of ping (a sketch of both helpers appears at the end of this answer).

    [ramanujan:~/base]$ping -c 1 www.google.com

    PING www.l.google.com (74.125.19.147): 56 data bytes
    64 bytes from 74.125.19.147: icmp_seq=0 ttl=246 time=18.317 ms

    --- www.l.google.com ping statistics ---
    1 packets transmitted, 1 packets received, 0% packet loss
    round-trip min/avg/max/stddev = 18.317/18.317/18.317/0.000 ms

    [ramanujan:~/base]$ping -c 1 fooalksdflajkd.com

    ping: cannot resolve fooalksdflajkd.com: Unknown host

The downside is that if the host gives a 404, you won't detect it, but this is a pretty good first cut -- the ultimate way to verify that an address is a website is to try to navigate to it. You could also try wget'ing that URL, but that's more heavyweight.
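
A minimal sketch of the two helpers (Python assumed; the candidate regex is deliberately conservative, and the ping flags shown are the Linux spellings):

import re
import subprocess

URL_CANDIDATE = re.compile(r'\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b', re.IGNORECASE)

def extracted_urls(comment):
    # Pull out dotted hostname-looking tokens only.
    return URL_CANDIDATE.findall(comment)

def pingable(hostname):
    # One ping with a short timeout; exit status 0 means the host resolved
    # and answered.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", hostname],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0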




Answer 8:


Having made several attempts at writing this exact piece of code, I can say unequivocally, you won't be able to do this with absolute reliability, and you certainly won't be able to detect all of the URI forms allowed by the RFC. Fortunately, since you have a very limited set of URLs you're interested in, you can use any of the techniques above.

However, the other thing I can say with a great deal of certainty is that if you really want to beat spammers, the best way to do that is with JavaScript. Send a chunk of JavaScript that performs some calculation, and repeat the calculation on the server side. The JavaScript should copy the result of the calculation to a hidden field so that when the comment is submitted, the result of the calculation is submitted as well. Verify on the server side that the calculation is correct. The only way around this technique is for spammers to enter comments manually or to start running a JavaScript engine just for you. I used this technique to reduce the spam on my site from 100+ per day to one or two per year. Now the only spam I ever get is entered by humans manually. It's weird to get on-topic spam.
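
A hedged sketch of the server side of this idea (Python assumed; the hidden-field name js_check and the trivial addition challenge are made up for illustration):

import random

def make_challenge():
    # The server remembers the expected answer (e.g. in the session) and
    # embeds this snippet in the form; a real site would use something less
    # trivial than addition.
    a, b = random.randint(1, 100), random.randint(1, 100)
    script = ("<script>document.getElementById('js_check').value = "
              "%d + %d;</script>" % (a, b))
    return a + b, script

def passes_challenge(expected, submitted_value):
    # Reject the comment unless the hidden js_check field holds the result.
    try:
        return int(submitted_value) == expected
    except (TypeError, ValueError):
        return False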




Answer 9:


Of course, you realize that if spammers decide to use TinyURL or similar services to shorten their URLs, your problem just got worse. You might have to write some code to look up the actual URLs in that case, using a service like TinyURL decoder.
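
A hedged sketch of that lookup (Python's standard library assumed): following the redirect chain and reporting the final URL is usually enough.

import urllib.request

def expand_short_url(short_url):
    # urlopen follows HTTP redirects by default; geturl() reports where we
    # actually ended up. Sketch only: no error handling beyond a timeout.
    with urllib.request.urlopen(short_url, timeout=5) as response:
        return response.geturl()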




Answer 10:


Consider incorporating the OWASP AntiSAMY API...




Answer 11:


I like capar's answer the best so far, but dealing with Unicode fonts can be a bit fraught: older browsers often display a funny glyph or a little box, and the placement of U+05B4 is a bit odd ... for me, it appears outside the pipes here |ִ| even though it's between them.

There's a handy &middot; (·) though, which breaks cut and paste in the same way. Its vertical alignment can be corrected by <sub>ing it, e.g.:

stackoverflow·com

Perverse, but effective in FF3 anyway: it can't be cut and pasted as a URL. The <sub> is actually quite nice, as it makes it visually obvious why the URL can't be pasted.

Dots which aren't in suspected URLs can be left alone, so for example you could do

s/\b\.\b/<sub>&middot;<\/sub>/g

Another option is to insert some kind of zero-width entity next to suspect dots, but things like &zwj; and &zwnj; and &zwsp; don't seem to work in FF3.




Answer 12:


There are already some great answers in here, so I won't post more. I will give a couple of gotchas, though. First, make sure to test for known protocols; anything else may be naughty. As someone whose hobby concerns telnet links, you will probably want to include more than http(s) in your search, but may want to block, say, aim: or some other URLs. Second, many people will delimit their links in angle brackets like <http://theroughnecks.net> or in parentheses "(url)", and there's nothing worse than clicking a link and having the closing > or ) go along with the rest of the URL.
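
A small sketch of trimming those delimiters from extracted candidates (Python assumed):

def trim_delimiters(candidate):
    # Drop a wrapping <...> or (...) and any stray trailing punctuation so
    # the closing bracket doesn't travel with the URL.
    candidate = candidate.strip()
    if len(candidate) >= 2 and candidate[0] in "<(" and candidate[-1] in ">)":
        candidate = candidate[1:-1]
    return candidate.rstrip(">),.")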

P.S. sorry for the self-referencing plugs ;)




Answer 13:


I needed just the detection of simple HTTP URLs with or without a protocol, assuming that either the protocol or a 'www' prefix is given. I found the links mentioned above quite helpful, but in the end I came up with this:

http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+

This obviously does not test compliance with the DNS standard.
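
For example (Python assumed), applied with finditer so the capture groups don't hide the full match:

import re

URL_RE = re.compile(r'http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+')

def find_urls(text):
    return [m.group(0) for m in URL_RE.finditer(text)]

# find_urls("see www.example.com or https://foo.bar/baz")
# would return ['www.example.com', 'https://foo.bar/baz']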




Answer 14:


Given the messes of "other funny business" that I see in Disqus comment spam in the form of look-alike characters, the first thing you'll want to do is deal with that.

Luckily, the Unicode people have you covered. Dig up an implementation of the TR39 Skeleton Algorithm for Unicode Confusables in your programming language of choice and pair it with some Unicode normalization and Unicode-aware upper/lower-casing.

The skeleton algorithm uses a lookup table maintained by the Unicode people to do something conceptually similar to case-folding.

(The output may not use sensible characters, but, if you apply it to both sides of the comparison, you'll get a match if the characters are visually similar enough for a human to get the intent.)

Here's an example from this Java implementation:

// Skeleton representations of unicode strings containing 
// confusable characters are equal 
skeleton("paypal").equals(skeleton("paypal")); // true
skeleton("paypal").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("paypal").equals(skeleton("ρ⍺у𝓅𝒂ן")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true

// The skeleton representation does not transform case
skeleton("payPal").equals(skeleton("paypal")); // false

// The skeleton representation does not remove diacritics
skeleton("paypal").equals(skeleton("pàỳpąl")); // false

(As you can see, you'll want to do some other normalization first.)

Given that you're doing URL detection for the purpose of judging whether something's spam, this is probably one of those uncommon situations where it'd be safe to start by normalizing the Unicode to NFKD and then stripping codepoints declared to be combining characters.

(You'd then want to normalize the case before feeding them to the skeleton algorithm.)
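
A small sketch of that pre-processing (Python's unicodedata assumed; skeleton() itself would come from whichever TR39 implementation you pick):

import unicodedata

def preprocess(text):
    # NFKD-decompose, drop combining marks, then lowercase, before handing
    # the string to a TR39 skeleton implementation.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()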

I'd advise that you do one of the following:

  1. Write your code to run a confusables check both before and after the characters get decomposed, in case things are considered confusables before being decomposed but not after, and check both uppercased and lowercased strings in case the confusables tables aren't symmetrical between the upper and lowercase forms.
  2. Investigate whether #1 is actually a concern (no need to waste CPU time if it isn't) by writing a little script to inspect the Unicode tables and identify any codepoints where decomposing or lowercasing/uppercasing a pair of characters changes whether they're considered confusable with each other.


Source: https://stackoverflow.com/questions/700163/detecting-a-naughty-or-nice-url-or-link-in-a-text-string
