How to normalize a URL in Java?

孤独总比滥情好 2020-12-09 01:27

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization p

8 Answers
  • 2020-12-09 02:06

    I found this question last night, but none of the answers was what I was looking for, so I wrote my own. Here it is in case somebody wants it in the future:

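    import java.io.UnsupportedEncodingException;
    import java.net.MalformedURLException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.net.URL;
    import java.net.URLDecoder;
    import java.net.URLEncoder;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Locale;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    /**
     * - Convert the scheme and host to lowercase.
     * - Normalize the path (done by java.net.URI).
     * - Add the port number, unless it is the default port.
     * - Remove the fragment (the part after the #).
     * - Remove trailing slash.
     * - Sort the query string params.
     * - Remove some query string params like "utm_*" and "*session*".
     */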
    public class NormalizeURL
    {
        public static String normalize(final String taintedURL) throws MalformedURLException
        {
            final URL url;
            try
            {
                url = new URI(taintedURL).normalize().toURL();
            }
            catch (URISyntaxException e) {
                throw new MalformedURLException(e.getMessage());
            }
    
            // replaceAll() takes a regex; replace() would look for the literal text "/$".
            final String path = url.getPath().replaceAll("/$", "");
            final SortedMap<String, String> params = createParameterMap(url.getQuery());
            final int port = url.getPort();
            final String queryString;
    
            if (params != null)
            {
                // Some params are only relevant for user tracking, so remove the most common ones.
                for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
                {
                    final String key = i.next();
                    if (key.startsWith("utm_") || key.contains("session"))
                    {
                        i.remove();
                    }
                }
                // Avoid a dangling "?" when every parameter was filtered out.
                queryString = params.isEmpty() ? "" : "?" + canonicalize(params);
            }
            else
            {
                queryString = "";
            }
    
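            // Lowercase the scheme and host explicitly; omit the port when it is the protocol's default.
            return url.getProtocol().toLowerCase(Locale.ROOT) + "://" + url.getHost().toLowerCase(Locale.ROOT)
                + (port != -1 && port != url.getDefaultPort() ? ":" + port : "")
                + path + queryString;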
        }
    
        /**
         * Takes a query string, separates the constituent name-value pairs, and
         * stores them in a SortedMap ordered by lexicographical order.
         * @return Null if there is no query string.
         */
        private static SortedMap<String, String> createParameterMap(final String queryString)
        {
            if (queryString == null || queryString.isEmpty())
            {
                return null;
            }
    
            final String[] pairs = queryString.split("&");
            final Map<String, String> params = new HashMap<String, String>(pairs.length);
    
            for (final String pair : pairs)
            {
                if (pair.length() < 1)
                {
                    continue;
                }
    
                String[] tokens = pair.split("=", 2);
                for (int j = 0; j < tokens.length; j++)
                {
                    try
                    {
                        tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
                    }
                    catch (UnsupportedEncodingException ex)
                    {
                        ex.printStackTrace();
                    }
                }
                switch (tokens.length)
                {
                    case 1:
                    {
                        if (pair.charAt(0) == '=')
                        {
                            params.put("", tokens[0]);
                        }
                        else
                        {
                            params.put(tokens[0], "");
                        }
                        break;
                    }
                    case 2:
                    {
                        params.put(tokens[0], tokens[1]);
                        break;
                    }
                }
            }
    
            return new TreeMap<String, String>(params);
        }
    
        /**
         * Canonicalize the query string.
         *
         * @param sortedParamMap Parameter name-value pairs in lexicographical order.
         * @return Canonical form of query string.
         */
        private static String canonicalize(final SortedMap<String, String> sortedParamMap)
        {
            if (sortedParamMap == null || sortedParamMap.isEmpty())
            {
                return "";
            }
    
            final StringBuilder sb = new StringBuilder(350);
            final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();
    
            while (iter.hasNext())
            {
                final Map.Entry<String, String> pair = iter.next();
                sb.append(percentEncodeRfc3986(pair.getKey()));
                sb.append('=');
                sb.append(percentEncodeRfc3986(pair.getValue()));
                if (iter.hasNext())
                {
                    sb.append('&');
                }
            }
    
            return sb.toString();
        }
    
        /**
         * Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
         * according to the RFC, so we make the extra replacements.
         *
         * @param string Decoded string.
         * @return Encoded string per RFC 3986.
         */
        private static String percentEncodeRfc3986(final String string)
        {
            try
            {
                return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
            }
            catch (UnsupportedEncodingException e)
            {
                return string;
            }
        }
    }
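
    To see the query-string steps in isolation, here is a minimal, self-contained sketch. The class name QuerySortSketch and the method sortQuery are hypothetical names chosen for illustration, and unlike the class above it works on the raw (still percent-encoded) pairs without decoding or re-encoding them.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any library: sorts query parameters by name
// and drops the tracking/session parameters mentioned in the answer above.
final class QuerySortSketch {

    static String sortQuery(String rawQuery) {
        if (rawQuery == null || rawQuery.isEmpty()) {
            return "";
        }

        // TreeMap keeps the parameter names in lexicographical order.
        Map<String, String> sorted = new TreeMap<>();
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty pieces such as the one in "a=1&&b=2"
            }
            String[] kv = pair.split("=", 2);
            String name = kv[0];
            if (name.startsWith("utm_") || name.contains("session")) {
                continue; // assumed to be tracking-only parameters
            }
            sorted.put(name, kv.length == 2 ? kv[1] : "");
        }

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortQuery("b=2&utm_source=x&a=1&sessionid=9"));
    }
}
```

    Under those assumptions, "b=2&utm_source=x&a=1&sessionid=9" comes back as "a=1&b=2". Like the class above, a map keeps only one value per parameter name, so repeated names are collapsed.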
    
  • 2020-12-09 02:07

    I have a simple way to solve it. Here is my code:

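    public static String normalizeURL(String oldLink)
    {
        // Force the scheme to "http" so that http:// and https:// links compare equal.
        int pos = oldLink.indexOf("://");
        if (pos < 0)
        {
            // No scheme found: substring(-1) would throw, so treat the whole input as host and path.
            return "http://" + oldLink;
        }
        return "http" + oldLink.substring(pos);
    }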
    
  • 2020-12-09 02:15

    Because you also want to identify URLs that refer to the same content, I found this WWW 2007 paper pretty interesting: "Do Not Crawl in the DUST: Different URLs with Similar Text". It provides a nice theoretical approach.

  • 2020-12-09 02:15

    The RL library (https://github.com/backchatio/rl) goes quite a way beyond java.net.URI.normalize(). It's written in Scala, but I imagine it should be usable from Java.

  • 2020-12-09 02:17

    You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.

  • 2020-12-09 02:21

    No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.

    e.g. http://ACME.com/./foo%26bar becomes:

    http://acme.com/foo&bar

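    URI's normalize() does not do this: it only collapses "." and ".." path segments, and leaves the host's case and any percent-escapes untouched.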
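
    As a rough illustration of that gap, the sketch below first shows what java.net.URI.normalize() returns for the URL above and then applies two extra steps by hand. The class UriNormalizeDemo and its methods are hypothetical names; the sketch assumes an absolute URL with a host, ignores user info and fragments, and takes the conservative RFC 3986 route of decoding only unreserved characters (letters, digits, '-', '.', '_', '~'), so %26 stays encoded here.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical demo class, not part of any library.
final class UriNormalizeDemo {

    /** Standard library only: "." and ".." segments are collapsed, nothing else changes. */
    static String jdkOnly(String input) {
        return URI.create(input).normalize().toString();
    }

    /** Extra steps done by hand: lowercase scheme and host, decode unreserved escapes in the path. */
    static String normalizeMore(String input) {
        URI uri = URI.create(input).normalize();
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return scheme + "://" + host + port + decodeUnreserved(uri.getRawPath()) + query;
    }

    /** Decodes %XX only when it stands for an RFC 3986 unreserved character. */
    static String decodeUnreserved(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()) {
                int value = Integer.parseInt(raw.substring(i + 1, i + 3), 16);
                boolean unreserved = (value >= 'a' && value <= 'z')
                        || (value >= 'A' && value <= 'Z')
                        || (value >= '0' && value <= '9')
                        || value == '-' || value == '.' || value == '_' || value == '~';
                if (unreserved) {
                    out.append((char) value);   // safe to decode
                } else {
                    out.append(raw, i, i + 3);  // keep the escape as written
                }
                i += 2; // skip the two hex digits
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(jdkOnly("http://ACME.com/./foo%26bar"));
        System.out.println(normalizeMore("http://ACME.com/./%7Efoo%26bar"));
    }
}
```

    With these inputs, the first call prints http://ACME.com/foo%26bar (host case and escape untouched) and the second prints http://acme.com/~foo%26bar.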
