How do I convert Word smart quotes and em dashes in a string?

星月不相逢 2020-11-29 03:11

I have a form with a textarea. Users enter a block of text which is stored in a database.

Occasionally a user will paste text from Word containing smart quotes or em

13 Answers
  • 2020-11-29 03:36

    In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"
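
    Accepting the characters only works if what arrives really is UTF-8. A minimal sketch of a server-side check, assuming the mbstring extension is available; the function name isValidUtf8 and the sample strings are hypothetical and not part of the original answer:

    ```php
    <?php
    // Hypothetical helper: reports whether submitted text is well-formed UTF-8.
    function isValidUtf8(string $text): bool
    {
        return mb_check_encoding($text, 'UTF-8');
    }

    // Curly quotes encoded as UTF-8 pass the check.
    $utf8 = "\u{201C}It\u{2019}s nice!\u{201D}";
    // The same quotes as raw Windows-1252 bytes (0x93, 0x92, 0x94) do not.
    $cp1252 = "\x93It\x92s nice!\x94";

    var_dump(isValidUtf8($utf8));   // bool(true)
    var_dump(isValidUtf8($cp1252)); // bool(false)
    ```

    A failed check is a hint that some page or form in the chain is not declaring UTF-8.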

  • 2020-11-29 03:37

    The problem does not actually originate in PHP but in JavaScript: the characters come from copy/paste out of Word, so you need to fix the text in JavaScript before you pass it to PHP. See this answer: https://stackoverflow.com/a/6219023/1857295.

  • 2020-11-29 03:38

    If you want to escape these characters for the web while preserving their appearance, so that your strings appear as “It’s nice!” rather than "It's boring", you can use your own custom htmlEncode function in place of PHP's htmlentities():

    $trans_tbl = false;

    // Assumes $text is Windows-1252 (single-byte). On UTF-8 input these
    // byte keys would corrupt multibyte characters.
    function htmlEncode($text) {

      global $trans_tbl;

      // create translation table once
      if(!$trans_tbl) {
        // start with the default set of conversions and add more.
        // The encoding is explicit because PHP 5.4+ defaults to UTF-8.
        $trans_tbl = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');

        $trans_tbl[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
        $trans_tbl[chr(131)] = '&fnof;';     // Latin Small Letter F With Hook
        $trans_tbl[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
        $trans_tbl[chr(133)] = '&hellip;';   // Horizontal Ellipsis
        $trans_tbl[chr(134)] = '&dagger;';   // Dagger
        $trans_tbl[chr(135)] = '&Dagger;';   // Double Dagger
        $trans_tbl[chr(136)] = '&circ;';     // Modifier Letter Circumflex Accent
        $trans_tbl[chr(137)] = '&permil;';   // Per Mille Sign
        $trans_tbl[chr(138)] = '&Scaron;';   // Latin Capital Letter S With Caron
        $trans_tbl[chr(139)] = '&lsaquo;';   // Single Left-Pointing Angle Quotation Mark
        $trans_tbl[chr(140)] = '&OElig;';    // Latin Capital Ligature OE

        // smart single/double quotes (from MS)
        $trans_tbl[chr(145)] = '&lsquo;';
        $trans_tbl[chr(146)] = '&rsquo;';
        $trans_tbl[chr(147)] = '&ldquo;';
        $trans_tbl[chr(148)] = '&rdquo;';

        $trans_tbl[chr(149)] = '&bull;';     // Bullet
        $trans_tbl[chr(150)] = '&ndash;';    // En Dash
        $trans_tbl[chr(151)] = '&mdash;';    // Em Dash
        $trans_tbl[chr(152)] = '&tilde;';    // Small Tilde
        $trans_tbl[chr(153)] = '&trade;';    // Trade Mark Sign
        $trans_tbl[chr(154)] = '&scaron;';   // Latin Small Letter S With Caron
        $trans_tbl[chr(155)] = '&rsaquo;';   // Single Right-Pointing Angle Quotation Mark
        $trans_tbl[chr(156)] = '&oelig;';    // Latin Small Ligature OE
        $trans_tbl[chr(159)] = '&Yuml;';     // Latin Capital Letter Y With Diaeresis

        ksort($trans_tbl);
      }

      // escape HTML
      return strtr($text, $trans_tbl);
    }
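
    If the text is already UTF-8 rather than Windows-1252, PHP's built-in htmlentities() should cover the same characters without a custom table. A sketch under that assumption (the wrapper name htmlEncodeUtf8 is hypothetical):

    ```php
    <?php
    // Hypothetical UTF-8 counterpart to htmlEncode(): relies on the built-in
    // HTML 4.01 entity table, which includes the curly quotes and dashes.
    function htmlEncodeUtf8(string $text): string
    {
        return htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    }

    echo htmlEncodeUtf8("\u{201C}It\u{2019}s nice!\u{201D} \u{2014} Word"), "\n";
    // &ldquo;It&rsquo;s nice!&rdquo; &mdash; Word
    ```

    Which variant fits depends on the encoding the form actually submits, so it is worth checking that first.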
    
  • 2020-11-29 03:42

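    You could try mb_convert_encoding from Windows-1252 to UTF-8. Word's smart quotes and dashes sit in the 0x80 to 0x9F range, which Windows-1252 defines but ISO-8859-1 reserves for control characters, so Windows-1252 is the source encoding to name here.

    $str = mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
    

    This assumes you want UTF-8 and that the conversion can find reasonable replacements. If not, replace the characters yourself with str_replace or preg_replace (PHP has no built-in mb_str_replace).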
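
    Replacing the characters yourself can be sketched as follows, assuming the string is already UTF-8 and that plain ASCII substitutes are acceptable; the function name and the choice of replacements are illustrative:

    ```php
    <?php
    // Hypothetical helper: swaps common Word punctuation for ASCII equivalents.
    function asciiPunctuation(string $text): string
    {
        $map = [
            "\u{2018}" => "'",   // left single quote
            "\u{2019}" => "'",   // right single quote / apostrophe
            "\u{201C}" => '"',   // left double quote
            "\u{201D}" => '"',   // right double quote
            "\u{2013}" => '-',   // en dash
            "\u{2014}" => '--',  // em dash
            "\u{2026}" => '...', // horizontal ellipsis
        ];
        // Keys are complete UTF-8 sequences, so strtr() matches whole characters.
        return strtr($text, $map);
    }

    echo asciiPunctuation("\u{201C}It\u{2019}s nice\u{201D} \u{2014} done\u{2026}"), "\n";
    // "It's nice" -- done...
    ```

    A single strtr() call with an array avoids the ordering problems that chained replacements can introduce.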

  • 2020-11-29 03:46

    "The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8."

    The content of the HTML can be in UTF-8, yes, but are you also explicitly declaring the encoding of your HTML pages (generated via PHP?) as UTF-8? Try sending a Content-Type header of "text/html;charset=utf-8" or add a <meta> tag to your HTML pages:

    <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
    

    That way, the data submitted to PHP will use the same encoding.

    I had a similar issue and adding the <meta> tag worked for me.
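
    The header variant can be sketched in PHP as follows. The helper name contentTypeHeader is hypothetical; header() and headers_sent() are standard PHP functions:

    ```php
    <?php
    // Hypothetical helper: builds the Content-Type header line for an HTML page.
    function contentTypeHeader(string $charset = 'utf-8'): string
    {
        return 'Content-Type: text/html; charset=' . $charset;
    }

    // Headers can only be sent before any output has started.
    if (!headers_sent()) {
        header(contentTypeHeader());
    }
    ```

    Sending the header from PHP and keeping the <meta> tag in the markup are compatible, as long as both name the same charset.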

  • 2020-11-29 03:47

    It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.

    Here is some info on migrating your database to another character encoding, at least for a MySQL database.
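
    For MySQL, one common way to convert an existing table is ALTER TABLE ... CONVERT TO CHARACTER SET. The sketch below only builds the statement and does not run it. The target charset utf8mb4, the collation, the function name, and the table name are all assumptions, and a real migration should be tried on a backup first:

    ```php
    <?php
    // Hypothetical helper: builds the conversion statement for one table.
    function buildConvertTableSql(string $table): string
    {
        // Quote the identifier with backticks, doubling any embedded backtick.
        $quoted = '`' . str_replace('`', '``', $table) . '`';
        return "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
    }

    echo buildConvertTableSql('comments'), "\n";
    // ALTER TABLE `comments` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
    ```

    This converts data that is correctly labeled with its current encoding. Text that was stored under the wrong encoding needs a different repair.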
