How does Stack Overflow generate its SEO-friendly URLs?

-上瘾入骨i 2020-11-22 04:27

What is a good complete regular expression or some other process that would take the title:

How do you change a title to be part of the URL like Stack

21 answers
  • 2020-11-22 05:10

    Here is my version of Jeff's code. I've made the following changes:

    • The hyphens were appended eagerly, so one could be added as the last character of the string and then need removing. That is, we never want “my-slug-”, and stripping it costs an extra string allocation in this edge case. I’ve worked around this by delay-hyphening: a hyphen is only written once the next accepted character arrives. If you compare my code to Jeff’s, the logic for this is easy to follow.
    • His approach is purely lookup-based and missed a lot of characters I found in examples while researching on Stack Overflow. To counter this, I first perform a normalisation pass (called collation in the Meta Stack Overflow question “Non US-ASCII characters dropped from full (profile) URL”), and then ignore any characters outside the acceptable ranges. This works most of the time...
    • ... For when it doesn’t, I’ve also had to add a lookup table. As mentioned above, some characters don’t map to a low ASCII value when normalised. Rather than drop these, I’ve got a manual list of exceptions that is doubtless full of holes, but it is better than nothing. The normalisation code was inspired by Jon Hanna’s great post in the Stack Overflow question “How can I remove accents on a string?”.
    • The case conversion is now also optional.

      using System;
      using System.Text;

      public static class Slug
      {
          public static string Create(bool toLower, params string[] values)
          {
              return Create(toLower, String.Join("-", values));
          }
      
          /// <summary>
          /// Creates a slug.
          /// References:
          /// http://www.unicode.org/reports/tr15/tr15-34.html
          /// https://meta.stackexchange.com/questions/7435/non-us-ascii-characters-dropped-from-full-profile-url/7696#7696
          /// https://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url/25486#25486
          /// https://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string
          /// </summary>
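          /// <param name="toLower">Whether to convert upper-case ASCII letters to lower case.</param>
          /// <param name="value">The text to convert into a slug.</param>
          /// <returns>The slug, or an empty string if <paramref name="value"/> is null.</returns>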
          public static string Create(bool toLower, string value)
          {
              if (value == null)
                  return "";
      
              var normalised = value.Normalize(NormalizationForm.FormKD);
      
              const int maxlen = 80;
              int len = normalised.Length;
              bool prevDash = false;
              var sb = new StringBuilder(len);
              char c;
      
              for (int i = 0; i < len; i++)
              {
                  c = normalised[i];
                  if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
                  {
                      if (prevDash)
                      {
                          sb.Append('-');
                          prevDash = false;
                      }
                      sb.Append(c);
                  }
                  else if (c >= 'A' && c <= 'Z')
                  {
                      if (prevDash)
                      {
                          sb.Append('-');
                          prevDash = false;
                      }
                      // Setting bit 5 (| 32) maps ASCII 'A'-'Z' onto 'a'-'z'
                      if (toLower)
                          sb.Append((char)(c | 32));
                      else
                          sb.Append(c);
                  }
                  else if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-' || c == '_' || c == '=')
                  {
                      if (!prevDash && sb.Length > 0)
                      {
                          prevDash = true;
                      }
                  }
                  else
                  {
                      string swap = ConvertEdgeCases(c, toLower);
      
                      if (swap != null)
                      {
                          if (prevDash)
                          {
                              sb.Append('-');
                              prevDash = false;
                          }
                          sb.Append(swap);
                      }
                  }
      
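                  // A two-character swap such as "ss" can step past maxlen,
                  // so compare with >= and trim back to the limit.
                  if (sb.Length >= maxlen)
                  {
                      sb.Length = maxlen;
                      break;
                  }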
              }
              return sb.ToString();
          }
      
          static string ConvertEdgeCases(char c, bool toLower)
          {
              string swap = null;
              switch (c)
              {
                  case 'ı':
                      swap = "i";
                      break;
                  case 'ł':
                      swap = "l";
                      break;
                  case 'Ł':
                      swap = toLower ? "l" : "L";
                      break;
                  case 'đ':
                      swap = "d";
                      break;
                  case 'ß':
                      swap = "ss";
                      break;
                  case 'ø':
                      swap = "o";
                      break;
                  case 'Þ':
                      swap = "th";
                      break;
              }
              return swap;
          }
      }
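
The normalisation pass and the need for the extra lookup table can be illustrated with a small JavaScript sketch. It is not part of the answer's code: `stripAccents` is a hypothetical helper, and it assumes a runtime that provides `String.prototype.normalize`. NFKD splits an accented letter into a base letter plus combining marks, which can then be dropped; letters without a decomposition pass through unchanged.

```javascript
// Hypothetical helper: decompose with NFKD, then drop combining marks
// in the range U+0300 to U+036F.
function stripAccents(text) {
  return text.normalize('NFKD').replace(/[\u0300-\u036f]/g, '');
}

// Letters that decompose are reduced to their ASCII base letter.
console.log(stripAccents('Crème brûlée')); // "Creme brulee"

// NFKD also applies compatibility mappings, e.g. the U+FB01 "fi" ligature.
console.log(stripAccents('\uFB01le')); // "file"

// 'ł' has no decomposition, so it survives normalisation untouched.
// Characters like this are what the manual lookup table has to handle.
console.log(stripAccents('łódź')); // "łodz"
```

The last line shows why normalisation alone "works most of the time": 'ó' and 'ź' decompose, while 'ł' (like 'ß', 'ø', 'đ' and 'ı') does not.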
      

    For more details, the unit tests, and an explanation of why Facebook's URL scheme is a little smarter than Stack Overflow's, I've got an expanded version of this on my blog.

  • 2020-11-22 05:10

    I know this is a very old question, but since most browsers now support Unicode URLs, I found a great solution using XRegExp that converts every run of characters other than letters (in all languages) into '-'.

    That can be done in several programming languages.

    The pattern is \\p{^L}+ (written as a JavaScript string literal); you just use it to replace every run of non-letters with '-'. The example below adds a (?!\\d) lookahead so that digits are kept as well.

    Working example in Node.js with the xregexp module:

    var XRegExp = require('xregexp');

    var text = 'This ! can @ have # several $ letters % from different languages such as עברית or Español';
    
    var slugRegEx = XRegExp('((?!\\d)\\p{^L})+', 'g');
    
    var slug = XRegExp.replace(text, slugRegEx, '-').toLowerCase();
    
    console.log(slug); // "this-can-have-several-letters-from-different-languages-such-as-עברית-or-español"
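
For comparison, roughly the same replacement can be sketched without the xregexp module, assuming a JavaScript runtime that supports Unicode property escapes (the `u` flag with `\p{L}`). The function name `unicodeSlug` is hypothetical, and the trimming of leading and trailing hyphens is an addition that the example above does not perform.

```javascript
// Hypothetical helper; uses only built-in regular expressions.
function unicodeSlug(text) {
  return text
    .replace(/[^\p{L}\d]+/gu, '-') // each run of non-letters/non-digits becomes one hyphen
    .replace(/^-+|-+$/g, '')       // drop hyphens left at either end
    .toLowerCase();
}

var sample = 'This ! can @ have # several $ letters % from different languages such as עברית or Español';
console.log(unicodeSlug(sample));
// "this-can-have-several-letters-from-different-languages-such-as-עברית-or-español"
```

As in the XRegExp version, `\d` only covers the ASCII digits 0-9, so digits from other scripts are replaced by hyphens.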
    
  • 2020-11-22 05:10

    I liked the way this is done without using regular expressions, so I ported it to PHP. I just added a function called is_between to check characters:

    function is_between($val, $min, $max)
    {
        $val = (int) $val; $min = (int) $min; $max = (int) $max;
    
        return ($val >= $min && $val <= $max);
    }
    
    function international_char_to_ascii($char)
    {
        if (mb_strpos('àåáâäãåą', $char) !== false)
        {
            return 'a';
        }
    
        if (mb_strpos('èéêëę', $char) !== false)
        {
            return 'e';
        }
    
        if (mb_strpos('ìíîïı', $char) !== false)
        {
            return 'i';
        }
    
        if (mb_strpos('òóôõö', $char) !== false)
        {
            return 'o';
        }
    
        if (mb_strpos('ùúûüŭů', $char) !== false)
        {
            return 'u';
        }
    
        if (mb_strpos('çćčĉ', $char) !== false)
        {
            return 'c';
        }
    
        if (mb_strpos('żźž', $char) !== false)
        {
            return 'z';
        }
    
        if (mb_strpos('śşšŝ', $char) !== false)
        {
            return 's';
        }
    
        if (mb_strpos('ñń', $char) !== false)
        {
            return 'n';
        }
    
        if (mb_strpos('ýÿ', $char) !== false)
        {
            return 'y';
        }
    
        if (mb_strpos('ğĝ', $char) !== false)
        {
            return 'g';
        }
    
        if (mb_strpos('ř', $char) !== false)
        {
            return 'r';
        }
    
        if (mb_strpos('ł', $char) !== false)
        {
            return 'l';
        }
    
        if (mb_strpos('đ', $char) !== false)
        {
            return 'd';
        }
    
        if (mb_strpos('ß', $char) !== false)
        {
            return 'ss';
        }
    
        // The title is lower-cased before this function is called, so match 'þ' as well.
        if (mb_strpos('Þþ', $char) !== false)
        {
            return 'th';
        }
    
        if (mb_strpos('ĥ', $char) !== false)
        {
            return 'h';
        }
    
        if (mb_strpos('ĵ', $char) !== false)
        {
            return 'j';
        }
        return '';
    }
    
    function url_friendly_title($url_title)
    {
        if (empty($url_title))
        {
            return '';
        }
    
        $url_title = mb_strtolower($url_title);
    
        $url_title_max_length   = 80;
        $url_title_length       = mb_strlen($url_title);
        $url_title_friendly     = '';
        $url_title_dash_added   = false;
        $url_title_char = '';
    
        for ($i = 0; $i < $url_title_length; $i++)
        {
            $url_title_char     = mb_substr($url_title, $i, 1);
    
            if (strlen($url_title_char) == 2)
            {
                $url_title_ascii    = ord($url_title_char[0]) * 256 + ord($url_title_char[1]);
            }
            else
            {
                $url_title_ascii    = ord($url_title_char);
            }
    
            if (is_between($url_title_ascii, 97, 122) || is_between($url_title_ascii, 48, 57))
            {
                $url_title_friendly .= $url_title_char;
    
                $url_title_dash_added = false;
            }
            elseif(is_between($url_title_ascii, 65, 90))
            {
                $url_title_friendly .= chr(($url_title_ascii | 32));
    
                $url_title_dash_added = false;
            }
            elseif($url_title_ascii == 32 || $url_title_ascii == 44 || $url_title_ascii == 46 || $url_title_ascii == 47 || $url_title_ascii == 92 || $url_title_ascii == 45 || $url_title_ascii == 95 || $url_title_ascii == 61)
            {
                if (!$url_title_dash_added && mb_strlen($url_title_friendly) > 0)
                {
                    $url_title_friendly .= chr(45);
    
                    $url_title_dash_added = true;
                }
            }
            else if ($url_title_ascii >= 128)
            {
                $url_title_previous_length = mb_strlen($url_title_friendly);
    
                $url_title_friendly .= international_char_to_ascii($url_title_char);
    
                if ($url_title_previous_length != mb_strlen($url_title_friendly))
                {
                    $url_title_dash_added = false;
                }
            }
    
            if ($i == $url_title_max_length)
            {
                break;
            }
        }
    
        if ($url_title_dash_added)
        {
            return mb_substr($url_title_friendly, 0, -1);
        }
        else
        {
            return $url_title_friendly;
        }
    }
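
The port above appends the dash immediately and trims a trailing one at the end. The delay-hyphening idea from the first answer avoids that final trim. Below is a minimal JavaScript sketch of it, restricted to ASCII letters and digits for brevity; the name `slugWithDeferredHyphen` and its `maxLength` parameter are assumptions made for illustration, not code from either answer.

```javascript
// Hypothetical helper illustrating a deferred ("delayed") hyphen.
function slugWithDeferredHyphen(title, maxLength = 80) {
  const separators = ' ,./\\-_=';
  let slug = '';
  let pendingHyphen = false;

  for (const ch of title.toLowerCase()) {
    const isAlnum = (ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9');
    if (isAlnum) {
      // Stop before the limit would be exceeded by the hyphen plus this character.
      const needed = pendingHyphen ? 2 : 1;
      if (slug.length + needed > maxLength) break;
      // The hyphen is written only now that a character follows it.
      if (pendingHyphen) slug += '-';
      pendingHyphen = false;
      slug += ch;
    } else if (separators.includes(ch)) {
      // Remember the separator, but never start the slug with a hyphen.
      pendingHyphen = slug.length > 0;
    }
    // Any other character is dropped and leaves the pending hyphen as it was.
  }
  return slug;
}

console.log(slugWithDeferredHyphen('How do you change a title, like Stack Overflow? '));
// "how-do-you-change-a-title-like-stack-overflow"
```

Because a hyphen is never written unless another character follows, the result cannot end in '-', including when the length limit cuts the loop short.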
    