Trim unicode whitespace in PHP 5.2

隐瞒了意图╮ 2020-12-01 09:17

How can I trim a string(6) " page", where the first whitespace is a 0xc2a0 non-breaking space?

I've tried trim() and preg_match('

7 answers
  • 2020-12-01 09:51
    preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$str);
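
    As a rough sketch (not part of the original answer), the same pattern can be wrapped in a helper and run against the string from the question. The name trim_unicode_ws is made up for illustration, and the fallback to the untouched input is an assumption about what a caller would want: with the /u modifier, preg_replace() returns NULL when the subject is not valid UTF-8.

    ```php
    <?php
    // Hypothetical helper name; the pattern is the one shown above.
    function trim_unicode_ws($str) {
        // Strip runs of separators (\pZ) and "other" characters (\pC)
        // anchored at either end of the string.
        $result = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
        // With /u, preg_replace() returns NULL for invalid UTF-8 input;
        // this sketch falls back to the untouched string in that case.
        return $result === null ? $str : $result;
    }

    // The string from the question: a no-break space (0xC2 0xA0) plus "page".
    $input = "\xC2\xA0page";
    var_dump(strlen($input));            // int(6)
    var_dump(trim_unicode_ws($input));   // string(4) "page"
    ```

    Whitespace inside the string is left alone, because both alternatives are anchored to an end of the subject.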
    
  • 2020-12-01 09:51

    This page may help:

    http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_characters_web_page

  • 2020-12-01 09:54

    PCRE Unicode properties can be used to achieve this.

    Here is the code that I played with; it seems to do what you want:

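    <?php
    // Trim Unicode separators (\pZ) and "other" characters (\pC, e.g. controls)
    // from both ends. The two anchored alternatives are independent, so the
    // string may carry such characters at the start, the end, both, or neither.
    function unicode_trim ($str) {
        return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
    }
    
    // A no-break space (0xC2 0xA0) on both sides of "#page#"
    $key = chr(0xc2) . chr(0xa0) . '#page#' . chr(0xc2) . chr(0xa0);
    
    var_dump(unicode_trim($key));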
    

    Result

    [~]> php e.php
    string(6) "#page#"
    

    Explanation:

    \p{xx} a character with the xx property
    \P{xx} a character without the xx property

    If xx has only one character, then {} can be dropped, e.g. \p{Z} is the same as \pZ

    Z stands for all separators, C stands for all "other" characters (for example control characters)
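
    As a small illustration of the notation (a sketch added for clarity, not part of the original answer), the checks below run each spelling against a no-break space and a tab. The variable name is arbitrary. One point worth noting: inside a character class, [\PZ\PC] is a union of two negated properties and therefore matches every character, whereas [^\pZ\pC] means "neither a separator nor other".

    ```php
    <?php
    $nbsp = "\xC2\xA0";   // U+00A0 NO-BREAK SPACE, general category Zs

    // \pZ and \p{Z} are two spellings of the same property test.
    var_dump(preg_match('/\A\pZ\z/u', $nbsp));          // int(1)
    var_dump(preg_match('/\A\p{Z}\z/u', $nbsp));        // int(1)

    // Upper-case \P negates the property.
    var_dump(preg_match('/\A\PZ\z/u', $nbsp));          // int(0)

    // A tab is a control character, so it falls under C rather than Z.
    var_dump(preg_match('/\A\pC\z/u', "\t"));           // int(1)
    var_dump(preg_match('/\A\pZ\z/u', "\t"));           // int(0)

    // Union of negations matches anything; a negated class does not.
    var_dump(preg_match('/\A[\PZ\PC]\z/u', $nbsp));     // int(1)
    var_dump(preg_match('/\A[^\pZ\pC]\z/u', $nbsp));    // int(0)
    ```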

  • 2020-12-01 09:55

    The existing solutions mention only \pZ characters. However, there are six Unicode whitespace characters that fall outside the purview of that property:

    % unichars '\p{WhiteSpace}' '\PZ'
     --    9 0009 CHARACTER TABULATION
     --   10 000A LINE FEED (LF)
     --   11 000B LINE TABULATION
     --   12 000C FORM FEED (FF)
     --   13 000D CARRIAGE RETURN (CR)
     --  133 0085 NEXT LINE (NEL)
    

    Those six are all of type \pC, and in particular, type \p{Cc}. However, there are fifty-nine non-whitespace characters that are also \p{Cc}:

    % unichars '\P{WhiteSpace}' '\p{Cc}' | wc -l
          59
    

    The simple version of my own test for whether something is a printable character or not is just whether it matches [\pZ\pC]; that’s what unichars uses, for example.
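
    The counts above can be cross-checked from PHP with a short sketch. It assumes that every \p{Cc} character lies below U+0100 (true of the C0 and C1 control ranges), hard-codes the six code points from the first listing, and uses a made-up helper name, utf8_chr_small, to build the test characters without relying on mbstring.

    ```php
    <?php
    // Encode a code point below U+0800 as UTF-8 (enough for this check).
    function utf8_chr_small($cp) {
        if ($cp < 0x80) {
            return chr($cp);
        }
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }

    // The six whitespace code points listed above that are not \pZ.
    $six = array(0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85);

    // Count control characters (\p{Cc}) among U+0000..U+00FF.
    $controls = 0;
    for ($cp = 0; $cp <= 0xFF; $cp++) {
        if (preg_match('/\A\p{Cc}\z/u', utf8_chr_small($cp)) === 1) {
            $controls++;
        }
    }

    // Each of the six must miss \pZ but still be caught by [\pZ\pC].
    $allCovered = true;
    foreach ($six as $cp) {
        $ch = utf8_chr_small($cp);
        if (preg_match('/\A\pZ\z/u', $ch) === 1
            || preg_match('/\A[\pZ\pC]\z/u', $ch) !== 1) {
            $allCovered = false;
        }
    }

    var_dump($controls);                 // int(65)
    var_dump($controls - count($six));   // int(59)
    var_dump($allCovered);               // bool(true)
    ```

    This is why a trim pattern built on \pZ alone misses tabs and newlines, while [\pZ\pC] catches them at the cost of also stripping the other control characters.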

    A more careful test would consider whether something should take up 0, 1, or 2 print positions. That requires considering whether it’s a combining Mark, which is property \pM, and also whether it has the half-width or full-width properties. For example:

    % uniprops ff5e ffeb
    U+FF5E ‹～› \N{ FULLWIDTH TILDE }:
        \pS \p{Sm}
        All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
           CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
           Math_Symbol Print Symbol
    U+FFEB ‹￫› \N{ HALFWIDTH RIGHTWARDS ARROW }:
        \pS \p{Sm}
        All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
           CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
           Math_Symbol Print Symbol
    

    For those, you would need to use the non-binary East Asian Width property. These are applicable:

    % uniprops -l | grep -i width
    Block:Halfwidth_And_Fullwidth_Forms
    InHalfwidthAndFullwidthForms
    East_Asian_Width:A
    East_Asian_Width=Ambiguous
    East_Asian_Width:Ambiguous
    East_Asian_Width:F
    East_Asian_Width=Fullwidth
    East_Asian_Width:Fullwidth
    East_Asian_Width:H
    East_Asian_Width=Halfwidth
    East_Asian_Width:Halfwidth
    East_Asian_Width=Neutral
    East_Asian_Width:Na
    East_Asian_Width=Narrow
    East_Asian_Width:Narrow
    East_Asian_Width:Neutral
    East_Asian_Width:W
    East_Asian_Width=Wide
    East_Asian_Width:Wide
    

    Those have abbreviations like \p{Ea=F} and \p{Ea=H}. There are a bunch of these:

    % uninames '(FULL|HALF)WIDTH' | wc -l
         454
    

    Of course, you mustn’t go on names for these things, but on properties:

    % unichars '[\p{Ea=F}\p{Ea=H}]' | wc -l
         227
    % unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}]' | wc -l
         338
    % unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}\pM]' | wc -l
        1488
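
    For the PHP side of the question, the zero-width part of this can be approximated with PCRE alone. The sketch below is an assumption-laden simplification: it counts marks (\pM) and "other" characters (\pC) as 0 columns and everything else as 1, and it ignores East Asian Width entirely, so fullwidth characters are undercounted. The function name approx_columns is hypothetical.

    ```php
    <?php
    // Rough column count: characters that are neither marks nor "other"
    // count as one column each; everything else counts as zero.
    function approx_columns($str) {
        $n = preg_match_all('/[^\pM\pC]/u', $str, $unused);
        // preg_match_all() returns false on error, e.g. invalid UTF-8.
        return $n === false ? 0 : $n;
    }

    $decomposed = "n\xCC\x83";      // "n" followed by U+0303 COMBINING TILDE
    $fullwidth  = "\xEF\xBD\x9E";   // U+FF5E FULLWIDTH TILDE

    var_dump(approx_columns($decomposed));   // int(1)
    var_dump(approx_columns($fullwidth));    // int(1), though it is a fullwidth character
    ```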
    

    To show you how many, many properties these things truly have, here’s the full property dump of three different characters, run against Unicode 5.2:

    % uniprops -ga NEL "COMBINING TILDE" ff5e 
    U+0085 ‹U+0085› \N{ NEXT LINE (NEL) }:
        \s \v \R \pC \p{Cc}
        All Any Assigned InLatin1 C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS Space SpacePerl VertSpace
           White_Space WSpace
        Age:1.1 Bidi_Class:B Bidi_Class=Paragraph_Separator Bidi_Class:Paragraph_Separator Bc=B Block:Latin_1
           Block=Latin_1_Supplement Block:Latin_1_Supplement Blk=Latin1 General_Category=Other Canonical_Combining_Class:0
           Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR
           General_Category=Control Script=Common Decomposition_Type:None Dt=None East_Asian_Width=Neutral East_Asian_Width:Neutral
           General_Category:C General_Category:Cc General_Category:Cntrl General_Category:Control Gc=Cc General_Category:Other Gc=C
           Grapheme_Cluster_Break:CN Grapheme_Cluster_Break=Control Grapheme_Cluster_Break:Control GCB=CN Hangul_Syllable_Type:NA
           Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
           Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:Next_Line Lb=NL
           Line_Break:NL Line_Break=Next_Line Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
           Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
           Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
           Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:SE Sentence_Break=Sep Sentence_Break:Sep SB=SE Word_Break:Newline WB=NL
           Word_Break:NL Word_Break=Newline
    U+0303 ‹̃› \N{ COMBINING TILDE }:
        \w \pM \p{Mn}
        All Any Assigned InCombiningDiacriticalMarks Case_Ignorable CI Dia Diacritic M Mn Gr_Ext Grapheme_Extend Graph GrExt
           ID_Continue IDC Inherited Zinh Mark Nonspacing_Mark Print Qaai Word XID_Continue XIDC
        Age:1.1 Bidi_Class:Nonspacing_Mark Bc=NSM Bidi_Class:NSM Bidi_Class=Nonspacing_Mark Block:Combining_Diacritical_Marks
           Canonical_Combining_Class:230 Canonical_Combining_Class=Above Canonical_Combining_Class:A
           Canonical_Combining_Class:Above Ccc=A Decomposition_Type:None Dt=None East_Asian_Width:A East_Asian_Width=Ambiguous
           East_Asian_Width:Ambiguous Ea=A General_Category:M General_Category=Mark General_Category:Mark Gc=M General_Category:Mn
           General_Category=Nonspacing_Mark General_Category:Nonspacing_Mark Gc=Mn Grapheme_Cluster_Break:EX
           Grapheme_Cluster_Break=Extend Grapheme_Cluster_Break:Extend GCB=EX Hangul_Syllable_Type:NA
           Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Script=Inherited
           Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:T Joining_Type=Transparent Joining_Type:Transparent Jt=T
           Line_Break:CM Line_Break=Combining_Mark Line_Break:Combining_Mark Lb=CM NFC_Quick_Check:M NFC_Quick_Check=Maybe
           NFC_Quick_Check:Maybe NFCQC=M NFKC_Quick_Check:M NFKC_Quick_Check=Maybe NFKC_Quick_Check:Maybe NFKCQC=M
           Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1
           In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1
           Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Inherited Sc=Zinh Script:Qaai Script:Zinh
           Sentence_Break:EX Sentence_Break=Extend Sentence_Break:Extend SB=EX Word_Break:Extend WB=Extend
    U+FF5E ‹～› \N{ FULLWIDTH TILDE }:
        \pS \p{Sm}
        All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base
           Graph GrBase Math Math_Symbol Print Symbol
        Age:1.1 Bidi_Class:ON Bidi_Class=Other_Neutral Bidi_Class:Other_Neutral Bc=ON Block:Halfwidth_And_Fullwidth_Forms
           Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR
           Canonical_Combining_Class:NR Script=Common Decomposition_Type:Non_Canon Decomposition_Type=Non_Canonical
           Decomposition_Type:Non_Canonical Dt=NonCanon Decomposition_Type:Wide Dt=Wide East_Asian_Width:F
           East_Asian_Width=Fullwidth East_Asian_Width:Fullwidth Ea=F General_Category:Math_Symbol Gc=Sm General_Category:S
           General_Category=Symbol General_Category:Sm General_Category=Math_Symbol General_Category:Symbol Gc=S
           Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
           Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
           Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:ID
           Line_Break=Ideographic Line_Break:Ideographic Lb=ID Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1
           Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2
           In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
           Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:Other SB=XX Sentence_Break:XX Sentence_Break=Other Word_Break:Other
           WB=XX Word_Break:XX Word_Break=Other
    

    Pretty stunning, eh?

    If you’ve read this far and would like to know where to get the three Unicode utilities illustrated above, uniprops, unichars, and uninames, please send me mail, because the current links aren’t working right now.

  • 2020-12-01 10:05

    Perhaps something from the multibyte string set of functions? http://php.net/manual/en/function.mb-ereg.php I can't see an mb_trim, but there is a set of multibyte-safe regex functions.

    G

  • 2020-12-01 10:13

    None of the answers above actually worked for me to remove trailing whitespace in UTF-8 strings.

    This solution, which I found elsewhere, worked for me and is the shortest:

    trim($str, "\t\n\r\0\x0B\xC2\xA0");
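
    One caveat, sketched below rather than taken from the answer: trim() treats its second argument as a list of single bytes, not of UTF-8 characters, and a custom list replaces the default one. The sample strings are made up for illustration.

    ```php
    <?php
    $chars = "\t\n\r\0\x0B\xC2\xA0";   // character list from the answer above
    $nbsp  = "\xC2\xA0";               // U+00A0 encoded as UTF-8

    // Works for the question's case: NBSP on both ends is removed.
    var_dump(trim($nbsp . 'page' . $nbsp, $chars));   // string(4) "page"

    // The list replaces the default one and contains no plain space.
    var_dump(trim(' page ', $chars));                 // string(6) " page "

    // "voilà" ends in U+00E0, encoded as 0xC3 0xA0. The trailing 0xA0 byte is
    // in the list, so it is stripped and a dangling 0xC3 is left behind.
    $broken = trim("voil\xC3\xA0", $chars);
    var_dump(strlen($broken));                        // int(5)
    var_dump(preg_match('//u', $broken) === 1);       // bool(false): no longer valid UTF-8
    ```

    If the input can end in other multibyte characters, a /u regex such as the [\pZ\pC] patterns in the earlier answers avoids cutting a character in half.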
    