My Java standalone application gets a URL (which points to a file) from the user and I need to hit it and download it. The problem I am facing is that I am not able to encod
The java.net.URI class can help; in the documentation of URL you find
Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use an URI
Use one of the constructors with more than one argument, like:
URI uri = new URI(
"http",
"search.barnesandnoble.com",
"/booksearch/first book.pdf",
null);
URL url = uri.toURL();
//or String request = uri.toString();
(the single-argument constructor of URI does NOT escape illegal characters)
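To make the difference between the constructors concrete, here is a minimal sketch. The class name UriEscapeDemo and its helper methods are made up for illustration; only java.net.URI itself comes from the answer above.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriEscapeDemo {
    // Multi-argument constructor: illegal characters in the path get percent-escaped.
    static String escapePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Single-argument constructor: the input must already be a legal URI string.
    static boolean parsesAsIs(String raw) {
        try {
            new URI(raw);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(escapePath("http", "search.barnesandnoble.com", "/booksearch/first book.pdf"));
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf
        System.out.println(parsesAsIs("http://search.barnesandnoble.com/booksearch/first book.pdf"));
        // prints false, because the space is rejected instead of escaped
    }
}
```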
Only illegal characters get escaped by the code above; it does NOT escape non-ASCII characters (see fatih's comment).
The toASCIIString method can be used to get a String containing only US-ASCII characters:
URI uri = new URI(
"http",
"search.barnesandnoble.com",
"/booksearch/é",
null);
String request = uri.toASCIIString();
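As a quick check of how the two string forms differ, the sketch below prints both for the same URI. The class name AsciiStringDemo and the build helper are assumptions for this example, and the é is written as the escape \u00e9 only so that the source file stays pure ASCII.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AsciiStringDemo {
    // Builds the URI with the 4-argument constructor (scheme, host, path, fragment).
    static URI build(String path) {
        try {
            return new URI("http", "search.barnesandnoble.com", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = build("/booksearch/\u00e9"); // path ends with e-acute
        // toString() keeps the non-ASCII letter as it is.
        System.out.println(uri.toString());
        // toASCIIString() percent-encodes its UTF-8 bytes.
        System.out.println(uri.toASCIIString());
        // prints http://search.barnesandnoble.com/booksearch/%C3%A9
    }
}
```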
For a URL with a query like http://www.google.com/ig/api?weather=São Paulo, use the 5-parameter version of the constructor:
URI uri = new URI(
"http",
"www.google.com",
"/ig/api",
"weather=São Paulo",
null);
String request = uri.toASCIIString();
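One thing worth checking with the 5-parameter constructor is what it leaves alone. In the sketch below (QueryEscapeDemo, withQuery and the www.example.com host are hypothetical names for illustration), spaces and non-ASCII letters in the query are escaped, but an ampersand is a legal query character and stays as it is, so a value that itself contains "&" would still need separate encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class QueryEscapeDemo {
    // 5-argument constructor: scheme, authority, path, query, fragment.
    static String withQuery(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // \u00e3 is a-tilde, as in "Sao Paulo" with the tilde.
        System.out.println(withQuery("http", "www.google.com", "/ig/api", "weather=S\u00e3o Paulo"));
        // prints http://www.google.com/ig/api?weather=S%C3%A3o%20Paulo

        // The ampersand is legal in a query, so it is NOT escaped.
        System.out.println(withQuery("http", "www.example.com", "/search", "q=Tom & Jerry"));
        // prints http://www.example.com/search?q=Tom%20&%20Jerry
    }
}
```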
If you already have a URL, you can pass url.toString() into the method below. It decodes first, to avoid double encoding (for example, encoding a space results in %20 and encoding a percent sign results in %25, so double encoding turns a space into %2520). It then builds the URI as explained above, passing in all the parts of the URL so that the query parameters are not dropped.
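import java.net.URI;
import java.net.URL;
import java.net.URLDecoder;

public URL convertToURLEscapingIllegalCharacters(String string) {
    try {
        // Decode first so that existing escapes such as %20 are not encoded again as %2520.
        String decodedURL = URLDecoder.decode(string, "UTF-8");
        URL url = new URL(decodedURL);
        // The multi-argument constructor escapes illegal characters in each component.
        URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
        return uri.toURL();
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
}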
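The double-encoding effect is easy to reproduce, because the multi-argument URI constructors always escape the percent sign. The following sketch uses a made-up class DoubleEncodingDemo and the placeholder host example.com; it is only meant to show the %20 versus %2520 behaviour described above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

final class DoubleEncodingDemo {
    // Escapes the path with the multi-argument constructor; '%' is always escaped to %25.
    static String encodePath(String path) {
        try {
            return new URI("http", "example.com", path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decodes existing escapes first, then encodes once.
    // Note that URLDecoder also turns a literal '+' into a space.
    static String decodeThenEncodePath(String path) {
        try {
            return encodePath(URLDecoder.decode(path, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodePath("/first book.pdf"));
        // prints http://example.com/first%20book.pdf
        System.out.println(encodePath("/first%20book.pdf"));
        // prints http://example.com/first%2520book.pdf (double encoded)
        System.out.println(decodeThenEncodePath("/first%20book.pdf"));
        // prints http://example.com/first%20book.pdf
    }
}
```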
I agree with Matt. Indeed, I've never seen it well explained in tutorials, but encoding the URL path is one matter, and encoding the parameters appended to the URL (the query part, after the "?" symbol) is a very different one. They use similar encodings, but not the same one.
This is especially true for the space character. The URL path needs it encoded as %20, whereas the query part allows %20 and also the "+" sign. The best idea is to test it ourselves against our web server, using a web browser.
In both cases, I would ALWAYS encode COMPONENT BY COMPONENT, never the whole string. Indeed, URLEncoder allows that for the query part. For the path part you can use the class URI, although in this case it asks for the entire string, not a single component.
Anyway, I believe that the best way to avoid these problems is to use a conflict-free naming design of your own. How? For example, I would never name directories or parameters using characters other than a-z, A-Z, 0-9 and _. That way, the only thing left to encode is the value of every parameter, since it may come from user input and its characters are unknown.
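A rough sketch of the component-by-component idea: the path goes through java.net.URI, while each query name and value goes through URLEncoder. The class ComponentEncoder, the build method and the example.com host are assumptions made for this illustration, not part of any library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class ComponentEncoder {
    static String build(String scheme, String host, String path, Map<String, String> params) {
        try {
            // Path: URI escapes the space as %20.
            String base = new URI(scheme, host, path, null).toASCIIString();
            // Query: URLEncoder escapes each name and value separately (space becomes '+').
            StringJoiner query = new StringJoiner("&");
            for (Map.Entry<String, String> e : params.entrySet()) {
                query.add(URLEncoder.encode(e.getKey(), "UTF-8")
                        + "=" + URLEncoder.encode(e.getValue(), "UTF-8"));
            }
            return params.isEmpty() ? base : base + "?" + query;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("weather", "S\u00e3o Paulo"); // a-tilde written as an escape
        params.put("tag", "a&b");
        System.out.println(build("http", "example.com", "/book search/first book.pdf", params));
        // prints http://example.com/book%20search/first%20book.pdf?weather=S%C3%A3o+Paulo&tag=a%26b
    }
}
```

The output shows both encodings side by side: %20 for the space in the path, "+" for the space in the query value, and %26 for the ampersand that belongs to a value rather than to the query structure.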
You can use a function like this (URLCodec comes from Apache Commons Codec, package org.apache.commons.codec.net). Complete and modify it to suit your needs:
/**
* Encode URL (except :, /, ?, &, =, ... characters)
* @param url to encode
* @param encodingCharset url encoding charset
* @return encoded URL
* @throws UnsupportedEncodingException
*/
public static String encodeUrl (String url, String encodingCharset) throws UnsupportedEncodingException{
return new URLCodec().encode(url, encodingCharset).replace("%3A", ":").replace("%2F", "/").replace("%3F", "?").replace("%3D", "=").replace("%26", "&");
}
Example of use:
String urlToEncode = "http://www.growup.com/folder/intérieur-à_vendre?o=4";
Utils.encodeUrl(urlToEncode, "UTF-8");
The result is : http://www.growup.com/folder/int%C3%A9rieur-%C3%A0_vendre?o=4
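If Apache Commons Codec is not on the classpath, roughly the same effect can be had with the JDK alone. The sketch below swaps in java.net.URLEncoder for URLCodec, so it is a variant of the function above rather than the same code; the class name SimpleUrlEncoder is made up. Like the original, it restores every ":", "/", "?", "=" and "&", so it is only suitable when those characters never appear inside a parameter value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SimpleUrlEncoder {
    static String encodeUrl(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8")
                    .replace("+", "%20")   // URLEncoder writes spaces as '+'; a literal '+' is already %2B
                    .replace("%3A", ":")
                    .replace("%2F", "/")
                    .replace("%3F", "?")
                    .replace("%3D", "=")
                    .replace("%26", "&");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00e0 is a-grave.
        System.out.println(encodeUrl("http://www.growup.com/folder/int\u00e9rieur-\u00e0_vendre?o=4"));
        // prints http://www.growup.com/folder/int%C3%A9rieur-%C3%A0_vendre?o=4
    }
}
```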
URLEncoding can encode HTTP URLs just fine, as you've unfortunately discovered. The string you passed in, "http://search.barnesandnoble.com/booksearch/first book.pdf", was correctly and completely encoded into a URL-encoded form. You could pass that entire long string of gobbledygook that you got back as a parameter in a URL, and it could be decoded back into exactly the string you passed in.
It sounds like you want to do something a little different than passing the entire URL as a parameter. From what I gather, you're trying to create a search URL that looks like "http://search.barnesandnoble.com/booksearch/whateverTheUserPassesIn". The only thing that you need to encode is the "whateverTheUserPassesIn" bit, so perhaps all you need to do is something like this:
String url = "http://search.barnesandnoble.com/booksearch/" +
URLEncoder.encode(userInput,"UTF-8");
That should produce something rather more valid for you.
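To illustrate both points, the sketch below first encodes the whole URL (which escapes the colon and slashes too) and then encodes only the user-supplied part. The class UserInputEncoder and its method names are hypothetical. The extra replace of "+" with "%20" is an addition of this sketch, not part of the suggestion above: URLEncoder targets form encoding and writes a space as "+", which a server may not treat as a space inside a path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UserInputEncoder {
    // Encoding the whole URL escapes its structure as well.
    static String encodeWhole(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encoding only the user-supplied part keeps the fixed prefix intact.
    static String searchUrl(String userInput) {
        try {
            String encoded = URLEncoder.encode(userInput, "UTF-8").replace("+", "%20");
            return "http://search.barnesandnoble.com/booksearch/" + encoded;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeWhole("http://search.barnesandnoble.com/booksearch/first book.pdf"));
        // prints http%3A%2F%2Fsearch.barnesandnoble.com%2Fbooksearch%2Ffirst+book.pdf
        System.out.println(searchUrl("first book.pdf"));
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf
    }
}
```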
Use the following standard Java solution (passes around 100 of the test cases provided by Web Platform Tests):
0. Test whether the URL is already encoded.
1. Split the URL into its structural parts. Use java.net.URL for this.
2. Encode each structural part properly!
3. Use IDN.toASCII(putDomainNameHere) to Punycode-encode the host name!
4. Use java.net.URI.toASCIIString() to percent-encode NFC-normalized Unicode (NFKC would be better!).
Find more here: https://stackoverflow.com/a/49796882/1485527
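Steps 1 to 4 could be sketched roughly as follows. This is a simplified illustration, not the code from the linked answer: the class UrlNormalizer and method toAsciiUrl are made-up names, step 0 (detecting an already-encoded URL) is deliberately left out, and the input is assumed to be unencoded.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlNormalizer {
    static String toAsciiUrl(String raw) {
        try {
            // Step 1: split into structural parts (java.net.URL parses leniently).
            URL url = new URL(raw);
            // Step 3: Punycode-encode the host name.
            String host = IDN.toASCII(url.getHost());
            // Step 2: the multi-argument constructor escapes illegal characters per component.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), host, url.getPort(),
                    url.getPath(), url.getQuery(), url.getRef());
            // Step 4: percent-encode the remaining non-ASCII characters.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fc is u-umlaut, \u00e9 is e-acute.
        System.out.println(toAsciiUrl("http://b\u00fccher.de/first book.pdf?q=caf\u00e9"));
        // prints http://xn--bcher-kva.de/first%20book.pdf?q=caf%C3%A9
    }
}
```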