How do I detect non-printable characters in .NET?

花落未央 2020-12-30 18:43

I'm just wondering if there is a method in .NET 2.0 that checks whether a character is printable or not – something like isprint(int) from standard C.

4 Answers
  • 2020-12-30 19:19
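    private bool IsPrintableCharacter(char candidate)
    {
        // Printable ASCII runs from 0x20 (space) through 0x7E ('~');
        // 0x7F (DEL) is a control character, as with isprint() in the C locale.
        return candidate >= 0x20 && candidate <= 0x7E;
    }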
    
  • 2020-12-30 19:22

    You might want to use Char.IsControl(Char). That is what I'm using. You definitely do not want to use the "< 0x20 or > 127" range check, because every non-Latin character and many accented Latin characters have code points above 127.
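
    As a minimal sketch of this approach (the class and helper names here are hypothetical, not part of .NET), the check can be wrapped like this. It treats every character that is not a control character as printable, which is only a rough approximation:

    ```csharp
    using System;

    static class IsControlDemo
    {
        // Hypothetical helper: anything that is not a control character
        // counts as printable.
        public static bool IsPrintable(char c)
        {
            return !Char.IsControl(c);
        }

        static void Main()
        {
            Console.WriteLine(IsPrintable('A'));      // True
            Console.WriteLine(IsPrintable('\u00E9')); // True: 'é' is above 127 but not a control character
            Console.WriteLine(IsPrintable('\u0007')); // False: BEL is a control character
            Console.WriteLine(IsPrintable('\u007F')); // False: DEL is a control character
        }
    }
    ```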

  • 2020-12-30 19:25

    In addition to Char.IsControl() there are several other Char methods that can be used to determine which category a given char value falls into:

    • IsLetter()
    • IsNumber()
    • IsDigit()
    • IsLetterOrDigit()
    • IsSymbol()
    • IsPunctuation()
    • IsSeparator()
    • IsWhiteSpace()

    If what you have is a "traditional ASCII text" file, and you want to use supplied functions, the expression:

    (Char.IsLetterOrDigit(ch) || Char.IsPunctuation(ch) || Char.IsSymbol(ch) || (ch==' '))
    

    should work.
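
    For illustration, here is one way the expression could be used to scan a string for the first non-printable character. This is a sketch only; the class and method names are made up for the example:

    ```csharp
    using System;

    static class AsciiPrintable
    {
        // Hypothetical helper wrapping the expression above.
        public static bool IsPrintableAscii(char ch)
        {
            return Char.IsLetterOrDigit(ch) || Char.IsPunctuation(ch)
                || Char.IsSymbol(ch) || ch == ' ';
        }

        // Returns the index of the first non-printable character, or -1 if there is none.
        public static int IndexOfFirstNonPrintable(string text)
        {
            for (int i = 0; i < text.Length; i++)
            {
                if (!IsPrintableAscii(text[i])) return i;
            }
            return -1;
        }

        static void Main()
        {
            Console.WriteLine(IndexOfFirstNonPrintable("a+b = c"));   // -1
            Console.WriteLine(IndexOfFirstNonPrintable("tab\there")); // 3 (the tab character)
        }
    }
    ```

    Note that the Char methods are Unicode-aware, so this test also accepts letters and symbols outside ASCII; it is only an isprint() equivalent if the input really is ASCII.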

    Now, if you are working with Unicode, you are opening a can of worms. Even back in the day, whether a space counts as printable was open to interpretation (hence the separate isprint() and isgraph() functions). See this related question and answers about "printable" Unicode characters.

  • 2020-12-30 19:29

    If by printable you mean renders something - even if that something is blank space (whitespace), [negating] Char.IsControl() alone is not enough to determine if a character is printable.

    • It isn't enough even in the single-byte U+0000 - U+00FF Unicode range (which is compatible with ASCII / ISO-8859-1), because the ASCII whitespace characters other than the space character are also classified as control characters: Char.IsControl('\t') and Char.IsControl('\n') both return true.

    • Beyond the single-byte range, there are other categories of non-rendering characters that must be recognized.


    A solution for the single-byte U+0000 - U+00FF Unicode range (which is compatible with ASCII / ISO-8859-1):

      // Sample input char.
      char c = (char)0x20; // space
    
      var isPrintable = ! Char.IsControl(c) || Char.IsWhiteSpace(c);
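
    To make the behavior concrete, the same expression can be wrapped in a helper and tried on a few characters. The class and method names are hypothetical and only serve the example:

    ```csharp
    using System;

    static class Latin1Printable
    {
        // Hypothetical helper: printable means "not a control character,
        // or a control character that is also whitespace".
        public static bool IsPrintable(char c)
        {
            return !Char.IsControl(c) || Char.IsWhiteSpace(c);
        }

        static void Main()
        {
            Console.WriteLine(IsPrintable(' '));      // True
            Console.WriteLine(IsPrintable('\t'));     // True: a control character, but also whitespace
            Console.WriteLine(IsPrintable('\u0000')); // False: NUL
            Console.WriteLine(IsPrintable('\u001B')); // False: ESC
        }
    }
    ```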
    

    An approximation of a solution for all Unicode characters:

    Sadly, there is no simple solution that is complete:

    • A fundamental limitation of a Char-based test is that type Char can only represent characters up to code point U+FFFF, i.e., only characters in the so-called BMP (Basic Multilingual Plane). Characters outside the BMP, which have higher code points, must be represented as two Char instances (a so-called surrogate pair).

    • The UnicodeCategory.PrivateUse category of characters, as the name suggests, is not standardized; for instance, U+F8FF on macOS contains the Apple symbol, whereas it is undefined on Windows. So it may contain printable characters, and you'd have to determine dynamically whether they are printable.

    • The UnicodeCategory.Format category mostly contains non-rendering characters, but there are exceptions - see this table.

      • You could hard-code these exceptions for a given version of the Unicode standard, but that is cumbersome and may become obsolete over time.

    Thus, the following code assumes that all characters in UnicodeCategory.PrivateUse and UnicodeCategory.Format are printable, which means that at least some characters will be misclassified.

    using System;
    using System.Linq;
    using System.Globalization;
    
    // ...
    
      // Sample input char.
      char c = (char)0x20; // space
    
      // The set of Unicode character categories containing non-rendering,
      // unknown, or incomplete characters.
      // !! UnicodeCategory.Format and UnicodeCategory.PrivateUse can NOT be
      // !! included in this set, because they may (private-use) or do (format)
      // !! contain at least *some* rendering characters.
      var nonRenderingCategories = new UnicodeCategory[] {
        UnicodeCategory.Control,
        UnicodeCategory.OtherNotAssigned,
        UnicodeCategory.Surrogate };
    
      // Char.IsWhiteSpace() includes the ASCII whitespace characters that
      // are categorized as control characters. Any other character is
      // printable, unless it falls into the non-rendering categories.
      var isPrintable = Char.IsWhiteSpace(c) ||
        ! nonRenderingCategories.Contains(Char.GetUnicodeCategory(c));
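
    The Char-only limitation mentioned earlier can be partly worked around by testing a string position instead of a single Char. The following is a sketch under the same assumptions as the code above (Format and PrivateUse count as printable); the class and method names are hypothetical. It relies on CharUnicodeInfo.GetUnicodeCategory(String, Int32), which reports the category of the whole code point when the index is at the start of a valid surrogate pair:

    ```csharp
    using System;
    using System.Globalization;

    static class CodePointPrintable
    {
        // Same category set as above: Format and PrivateUse are NOT included.
        static bool IsNonRendering(UnicodeCategory category)
        {
            return category == UnicodeCategory.Control
                || category == UnicodeCategory.OtherNotAssigned
                || category == UnicodeCategory.Surrogate;
        }

        // Hypothetical helper: true if every code point in the string is printable.
        public static bool IsAllPrintable(string text)
        {
            for (int i = 0; i < text.Length; i++)
            {
                // For a valid surrogate pair starting at i, this is the category of
                // the combined code point; a lone surrogate yields Surrogate.
                UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(text, i);

                if (!Char.IsWhiteSpace(text, i) && IsNonRendering(category)) return false;

                // Skip the low surrogate of a pair that was just classified.
                if (Char.IsSurrogatePair(text, i)) i++;
            }
            return true;
        }

        static void Main()
        {
            // U+10380 (an Ugaritic letter) lies outside the BMP; the result depends
            // on the Unicode data shipped with the runtime.
            Console.WriteLine(IsAllPrintable("caf\u00E9 \U00010380"));
            Console.WriteLine(IsAllPrintable("bell\u0007")); // False: BEL is a control character
            Console.WriteLine(IsAllPrintable("\uD83D"));     // False: lone high surrogate
        }
    }
    ```

    The PrivateUse and Format caveats still apply, so this remains an approximation.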
    