Multi-byte safe wordwrap() function for UTF-8

太阳男子 2020-12-01 13:17

PHP's wordwrap() function doesn't work correctly for multi-byte strings like UTF-8.
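
A quick way to see the problem, as a sketch that assumes the mbstring extension is loaded and the source file is saved as UTF-8: wordwrap() measures width in bytes, so it can wrap text that would fit and, with $cut enabled, split a character in two.

```php
<?php
// wordwrap() counts bytes, so each multi-byte UTF-8 character uses up more than one unit of width.
$text = "äöü äöü"; // 7 characters, 13 bytes

// Seven characters fit in a width of 7, yet the line is still broken at the space.
$soft = wordwrap($text, 7, "\n");

// With $cut = true the break can land in the middle of a character.
$hard = wordwrap("ééééé", 5, "\n", true);
$firstLine = explode("\n", $hard)[0];

var_dump(mb_strlen($text, 'UTF-8'));              // character count
var_dump(strlen($text));                          // byte count
var_dump($soft === "äöü\näöü");                   // was the text wrapped although it fits?
var_dump(mb_check_encoding($firstLine, 'UTF-8')); // is the first cut line still valid UTF-8?
```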

There are a few examples of mb safe functions in the comments, but with some di

9 Answers
  • 2020-12-01 13:20

    I couldn't find any code that worked for me, so here is what I wrote. It works for me, though it is probably not the fastest.

    function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false) {
        $lines = explode($break, $str);
        foreach ($lines as &$line) {
            $line = rtrim($line);
            if (mb_strlen($line) <= $width)
                continue;
            $words = explode(' ', $line);
            $line = '';
            $actual = '';
            foreach ($words as $word) {
                if (mb_strlen($actual.$word) <= $width)
                    $actual .= $word.' ';
                else {
                    if ($actual != '')
                        $line .= rtrim($actual).$break;
                    $actual = $word;
                    if ($cut) {
                        while (mb_strlen($actual) > $width) {
                            $line .= mb_substr($actual, 0, $width).$break;
                            $actual = mb_substr($actual, $width);
                        }
                    }
                    $actual .= ' ';
                }
            }
            $line .= trim($actual);
        }
        return implode($break, $lines);
    }
    
  • 2020-12-01 13:20

    This one seems to work well...

    function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false, $charset = null) {
        if ($charset === null) $charset = mb_internal_encoding();
    
        $pieces = explode($break, $str);
        $result = array();
        foreach ($pieces as $piece) {
          $current = $piece;
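          // Note: this only hard-cuts overlong pieces when $cut is true; it does not wrap at word boundaries.
          while ($cut && mb_strlen($current, $charset) > $width) {
            $result[] = mb_substr($current, 0, $width, $charset);
            // A null length returns the whole remainder instead of capping it at 2048 characters.
            $current = mb_substr($current, $width, null, $charset);
          }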
          }
          $result[] = $current;
        }
        return implode($break, $result);
    }
    
  • 2020-12-01 13:23
    /**
     * wordwrap for utf8 encoded strings
     *
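     * @param string  $str   The input string.
     * @param integer $width The maximum line width in characters.
     * @param string  $break The line break to insert.
     * @param boolean $cut   Whether to cut words longer than $width.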
     * @return string
     * @author Milian Wolff <mail@milianw.de>
     */
    
    function utf8_wordwrap($str, $width, $break, $cut = false) {
        if (!$cut) {
            $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.',}\b#U';
        } else {
            $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.'}#';
        }
        if (function_exists('mb_strlen')) {
            $str_len = mb_strlen($str,'UTF-8');
        } else {
            $str_len = preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str, $var_empty);
        }
        $while_what = ceil($str_len / $width);
        $i = 1;
        $return = '';
        while ($i < $while_what) {
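            // Stop if the pattern no longer matches; the remainder is appended unchanged below.
            if (!preg_match($regexp, $str, $matches)) {
                break;
            }
            $string = $matches[0];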
            $return .= $string.$break;
            $str = substr($str, strlen($string));
            $i++;
        }
        return $return.$str;
    }
    

    Total time: 0.0020880699 is good time :)
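
    The fallback branch in the function above counts characters without mbstring by counting every byte that is not a UTF-8 continuation byte (0x80 to 0xBF). A minimal sketch of that idea, assuming valid UTF-8 input; the function name utf8_length_by_lead_bytes is hypothetical and not part of the answer:

    ```php
    <?php
    // Hypothetical helper: count UTF-8 characters by counting ASCII bytes and lead bytes only.
    // Assumes the input is valid UTF-8; continuation bytes (0x80-0xBF) are skipped.
    function utf8_length_by_lead_bytes(string $str): int
    {
        // No "u" modifier here, so the pattern is matched byte by byte.
        return (int) preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str);
    }

    $sample = "héllo, 世界";
    var_dump(utf8_length_by_lead_bytes($sample)); // character count from lead bytes
    var_dump(strlen($sample));                    // byte count, for comparison
    ```

    On valid input the result should agree with mb_strlen($sample, 'UTF-8'), which makes for an easy sanity check when mbstring is available.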

  • 2020-12-01 13:27

    Custom word boundaries

    Unicode text has many more potential word boundaries than text in 8-bit encodings, including 17 space separators and the full-width comma. This solution lets you customize the list of word boundaries for your application.
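
    To get a feel for those extra boundaries: with the u modifier, PCRE's \p{Zs} class matches the Unicode space separators. The following is only a rough sketch and is separate from the function in this answer; the helper name splitAfterSeparators is hypothetical:

    ```php
    <?php
    // Hypothetical helper: split text into chunks that each end right after a separator.
    // \p{Zs} covers the Unicode space separators; \x{FF0C} is the full-width comma.
    function splitAfterSeparators(string $text): array
    {
        $parts = preg_split('/(?<=[\p{Zs}\x{FF0C}])/u', $text, -1, PREG_SPLIT_NO_EMPTY);

        // preg_split() returns false on failure (for example on invalid UTF-8).
        return $parts === false ? [$text] : $parts;
    }

    // U+3000 is the ideographic space, U+FF0C the full-width comma.
    print_r(splitAfterSeparators("a\u{3000}b\u{FF0C}c d"));
    ```

    Each chunk keeps its trailing separator, which is the same shape as the chunks built in the first loop of wordWrapUtf8().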

    Better performance

    Have you ever benchmarked the mb_* family of PHP built-ins? They don't scale well at all. By using a custom nextCharUtf8(), we can do the same job, but orders of magnitude faster, especially on large strings.

    <?php
    
    function wordWrapUtf8(
      string $phrase,
      int $width = 75,
      string $break = "\n",
      bool $cut = false,
      array $seps = [' ', "\n", "\t", ',']
    ): string
    {
      $chunks = [];
      $chunk = '';
      $len = 0;
      $pointer = 0;
      while (!is_null($char = nextCharUtf8($phrase, $pointer))) {
        $chunk .= $char;
        $len++;
        if (in_array($char, $seps, true) || ($cut && $len === $width)) {
          $chunks[] = [$len, $chunk];
          $len = 0;
          $chunk = '';
        }
      }
      // Compare against '' so that a trailing chunk of "0" is not dropped.
      if ($chunk !== '') {
        $chunks[] = [$len, $chunk];
      }
      $line = '';
      $lines = [];
      $lineLen = 0;
      foreach ($chunks as [$len, $chunk]) {
        if ($lineLen + $len > $width) {
          if ($line !== '') {
            $lines[] = $line;
            $lineLen = 0;
            $line = '';
          }
        }
        $line .= $chunk;
        $lineLen += $len;
      }
      // Compare against '' so that a final line of "0" is not dropped.
      if ($line !== '') {
        $lines[] = $line;
      }
      return implode($break, $lines);
    }
    
    function nextCharUtf8(&$string, &$pointer)
    {
      // EOF
      if (!isset($string[$pointer])) {
        return null;
      }
    
      // Get the byte value at the pointer
      $char = ord($string[$pointer]);
    
      // ASCII
      if ($char < 128) {
        return $string[$pointer++];
      }
    
      // UTF-8
      if ($char < 224) {
        $bytes = 2;
      } elseif ($char < 240) {
        $bytes = 3;
      } elseif ($char < 248) {
        $bytes = 4;
      } elseif ($char < 252) {
        $bytes = 5;
      } else {
        $bytes = 6;
      }
    
      // Get full multibyte char
      $str = substr($string, $pointer, $bytes);
    
      // Increment pointer according to length of char
      $pointer += $bytes;
    
      // Return mb char
      return $str;
    }
    
  • 2020-12-01 13:29

    Because no answer was handling every use case, here is something that does. The code is based on Drupal’s AbstractStringWrapper::wordWrap.

    <?php
    
    /**
     * Wraps any string to a given number of characters.
     *
     * This implementation is multi-byte aware and relies on {@link
     * http://www.php.net/manual/en/book.mbstring.php PHP's multibyte
     * string extension}.
     *
     * @see wordwrap()
     * @link https://api.drupal.org/api/drupal/core%21vendor%21zendframework%21zend-stdlib%21Zend%21Stdlib%21StringWrapper%21AbstractStringWrapper.php/function/AbstractStringWrapper%3A%3AwordWrap/8
     * @param string $string
     *   The input string.
     * @param int $width [optional]
     *   The number of characters at which <var>$string</var> will be
     *   wrapped. Defaults to <code>75</code>.
     * @param string $break [optional]
     *   The line is broken using the optional break parameter. Defaults
     *   to <code>"\n"</code>.
     * @param boolean $cut [optional]
     *   If the <var>$cut</var> is set to <code>TRUE</code>, the string is
     *   always wrapped at or before the specified <var>$width</var>. So if
     *   you have a word that is larger than the given <var>$width</var>, it
     *   is broken apart. Defaults to <code>FALSE</code>.
     * @return string
     *   Returns the given <var>$string</var> wrapped at the specified
     *   <var>$width</var>.
     */
    function mb_wordwrap($string, $width = 75, $break = "\n", $cut = false) {
      $string = (string) $string;
      if ($string === '') {
        return '';
      }
    
      $break = (string) $break;
      if ($break === '') {
        trigger_error('Break string cannot be empty', E_USER_ERROR);
      }
    
      $width = (int) $width;
      if ($width === 0 && $cut) {
        trigger_error('Cannot force cut when width is zero', E_USER_ERROR);
      }
    
      if (strlen($string) === mb_strlen($string)) {
        return wordwrap($string, $width, $break, $cut);
      }
    
      $stringWidth = mb_strlen($string);
      $breakWidth = mb_strlen($break);
    
      $result = '';
      $lastStart = $lastSpace = 0;
    
      for ($current = 0; $current < $stringWidth; $current++) {
        $char = mb_substr($string, $current, 1);
    
        $possibleBreak = $char;
        if ($breakWidth !== 1) {
          $possibleBreak = mb_substr($string, $current, $breakWidth);
        }
    
        if ($possibleBreak === $break) {
          $result .= mb_substr($string, $lastStart, $current - $lastStart + $breakWidth);
          $current += $breakWidth - 1;
          $lastStart = $lastSpace = $current + 1;
          continue;
        }
    
        if ($char === ' ') {
          if ($current - $lastStart >= $width) {
            $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
            $lastStart = $current + 1;
          }
    
          $lastSpace = $current;
          continue;
        }
    
        if ($current - $lastStart >= $width && $cut && $lastStart >= $lastSpace) {
          $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
          $lastStart = $lastSpace = $current;
          continue;
        }
    
        if ($current - $lastStart >= $width && $lastStart < $lastSpace) {
          $result .= mb_substr($string, $lastStart, $lastSpace - $lastStart) . $break;
          $lastStart = $lastSpace = $lastSpace + 1;
          continue;
        }
      }
    
      if ($lastStart !== $current) {
        $result .= mb_substr($string, $lastStart, $current - $lastStart);
      }
    
      return $result;
    }
    
    ?>
    
  • 2020-12-01 13:29

    I just want to share an alternative I found online.

    <?php
    if ( !function_exists('mb_str_split') ) {
        function mb_str_split($string, $split_length = 1)
        {
            mb_internal_encoding('UTF-8'); 
            mb_regex_encoding('UTF-8');  
    
            $split_length = ($split_length <= 0) ? 1 : $split_length;
    
            $mb_strlen = mb_strlen($string, 'utf-8');
    
            $array = array();
    
            for($i = 0; $i < $mb_strlen; $i += $split_length) {
                $array[] = mb_substr($string, $i, $split_length);
            }
    
            return $array;
        }
    }
    

    Using mb_str_split, you can use join to combine the chunks with <br>. Note that this cuts at a fixed number of characters and ignores word boundaries.

    <?php
        $text = '<utf-8 content>';
    
        echo join('<br>', mb_str_split($text, 20));
    

    And finally, create your own helper, perhaps mb_textwrap:

    <?php
    
    if( !function_exists('mb_textwrap') ) {
        function mb_textwrap($text, $length = 20, $concat = '<br>') 
        {
            return join($concat, mb_str_split($text, $length));
        }
    }
    
    $text = '<utf-8 content>';
    // so simply call
    echo mb_textwrap($text);
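
    On PHP 7.4 or later mb_str_split() is built in, so the polyfill above is skipped by its function_exists() check. A short sketch of the same fixed-width chunking with the native function; the name mb_chunk_wrap is hypothetical, and the explicit 'UTF-8' argument is an assumption about the input encoding:

    ```php
    <?php
    // Hypothetical helper, assuming PHP 7.4+ where mb_str_split() is a built-in function.
    function mb_chunk_wrap(string $text, int $length = 20, string $break = "\n"): string
    {
        // Guard against non-positive lengths, as the polyfill above does.
        $length = max(1, $length);

        // Split into chunks of $length characters (not bytes) and glue them with $break.
        return implode($break, mb_str_split($text, $length, 'UTF-8'));
    }

    echo mb_chunk_wrap("日本語のテキスト", 3), "\n";
    ```

    Like the helper above, this breaks at a fixed character count, so words can be split in the middle.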
    
