How can I remove the URLs present in a text? Example:
String str = "Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
Well, you haven't provided any info about your text, so assuming it looks like this: "Some text here http://www.example.com some text there", you can do this:
String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");
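This replaces every sequence that starts with "http" and runs up to and including the next whitespace character with a single space. Note that a URL at the very end of the text, with no whitespace after it, is not matched by this pattern.
You should read the Javadoc for the String class. It will make things clear for you.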
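As a possible variation on the pattern above, here is a minimal sketch that also handles a URL at the end of the input. The class name `HttpUrlStripper` and the method name `stripHttpUrls` are made up for illustration, and the sketch assumes only http and https links need to go.

```java
final class HttpUrlStripper {
    static String stripHttpUrls(String text) {
        // "https?://\S+" matches http:// or https:// followed by everything
        // up to the next whitespace character or the end of the input
        String withoutUrls = text.replaceAll("https?://\\S+", "");
        // collapse the whitespace runs left behind and trim both ends
        return withoutUrls.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHttpUrls("Some text here http://www.example.com some text there"));
    }
}
```

Matching on `\S+` rather than on a trailing space means the URL does not need to be followed by anything.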
How do you define a URL? You might not want to filter just http:// but also https:// and other protocols like ftp://, rss:// or custom protocols.
Maybe this regular expression would do the job:
[\S]+://[\S]+
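Explanation: [\S]+ matches one or more non-whitespace characters (the protocol), :// matches that separator literally, and the second [\S]+ matches everything after it up to the next whitespace character.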
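In Java, that protocol-agnostic expression could be applied roughly as follows. This is only a sketch: the class name `SchemeUrlRemover` and the method name `removeUrls` are hypothetical, and the square brackets around `\S` are dropped because they are not needed for a single character class.

```java
import java.util.regex.Pattern;

final class SchemeUrlRemover {
    // one or more non-whitespace chars, then "://", then one or more non-whitespace chars
    private static final Pattern ANY_SCHEME_URL = Pattern.compile("\\S+://\\S+");

    static String removeUrls(String text) {
        // Matcher.replaceAll replaces every match of the compiled pattern
        return ANY_SCHEME_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        System.out.println(removeUrls(str).trim());
    }
}
```

Keep in mind that this pattern removes any whitespace-delimited token containing "://", whether or not it is a valid URL.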
If you can move to Python, you can get a much simpler solution with this code:
import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
# remove every token that starts with "ftp" and has at least one more non-whitespace character
result = re.sub(r"ftp\S+", "", text)
print(result)
m.group(0) should be replaced with an empty string, rather than m.group(i) where i is incremented with every call to m.find(), as is done in one of the other answers.
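import java.util.regex.Matcher;
import java.util.regex.Pattern;

private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        // copy the text before this match, then append "" in place of the URL
        m.appendReplacement(sb, "");
    }
    // copy whatever follows the last match; without this the tail of the text is lost
    m.appendTail(sb);
    return sb.toString();
}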
Input the String that contains the URL:
private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    int i = 0;
    while (m.find()) {
        // Caution: group(i) selects capture group i of the current match, not the i-th URL,
        // and replaceAll interprets the matched text as a regular expression
        commentstr = commentstr.replaceAll(m.group(i),"").trim();
        i++;
    }
    return commentstr;
}
Note that if your URL contains characters like ?, & and \, the answers above may not work, because replaceAll treats its first argument as a regular expression rather than as literal text. What worked for me was to strip those characters from a copy of the string, strip them from each result of m.find() as well, and then call replaceAll on the copy.
private String removeUrl(String commentstr)
{
    // get rid of ? and & in a copy of the text, since replaceAll can't deal with them in URLs
    // (this also removes ? and & from the rest of the text)
    String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    while (m.find()) {
        // group(0) is the whole matched URL; strip the same characters from it
        String url = m.group(0).replaceAll("\\?", "").replaceAll("\\&", "");
        // accumulate the result so that every URL is removed, not only the last one
        commentstr1 = commentstr1.replaceAll(url, "").trim();
    }
    return commentstr1;
}
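Another way around the special-character problem, sketched here under a few assumptions, is to remove each match with String.replace, which treats its arguments as literal text, so nothing has to be stripped from the input first. The class name `LiteralUrlRemover` and the simplified URL pattern are illustrative only and are not the pattern used in the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralUrlRemover {
    // simplified pattern for illustration: http, https or ftp, then "://", then non-whitespace
    private static final Pattern URL =
            Pattern.compile("(https?|ftp)://\\S+", Pattern.CASE_INSENSITIVE);

    static String removeUrl(String commentstr) {
        Matcher m = URL.matcher(commentstr);
        String result = commentstr;
        while (m.find()) {
            // String.replace does a literal replacement, so ?, & and \ in the URL need no escaping
            result = result.replace(m.group(0), "");
        }
        return result.trim();
    }

    public static void main(String[] args) {
        System.out.println(removeUrl("see http://example.com/page?a=1&b=2 for details"));
    }
}
```

If a regex-based replaceAll is still wanted, wrapping the matched text in Pattern.quote(...) should have a similar effect.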