How can Z͎̠͗ͣḁ̵͙̑l͖͙̫̲̉̃ͦ̾͊ͬ̀g͔̤̞͓̐̓̒̽o͓̳͇̔ͥ text be prevented?

独厮守ぢ 2020-12-25 13:48

I've read about how Zalgo text works, and I'm looking to learn how chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set

5 answers
  • 2020-12-25 14:17

    Make the box overflow:hidden. It doesn't actually disable Zalgo text, but it prevents it from damaging other comments.

    .comment {
      /* the overflow: hidden is what prevents one comment's combining marks from affecting its siblings */
      overflow: hidden;
      /* the padding gives space for any legitimate combining marks */
      padding: 0.5em;
      /* the rest are just to visually divide the three comments */
      border: solid 1px #ccc;
      margin-top: -1px;
      margin-bottom: -1px;
    }
    <div class=comment>The below comment looks awful.</div>
    <div class=comment>H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡</div>
    <div class=comment>The above comment looks awful.</div>

  • 2020-12-25 14:28

    A related question was asked before: https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented but it's interesting to go into prevention here.

    In terms of preventing this you can choose several strategies:

    1. prevent combining diacritics entirely (and piss off many international users),
    2. filter out combining characters using whitelisting or blacklisting (and piss off a smaller percentage of international users),
    3. limit the number of combining characters allowed (and piss off an even smaller percentage of users),
    4. have a healthy moderator community (with all the downsides that has, see your question as an example here)
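
    As a rough illustration of the third strategy, here is a minimal Python 3 sketch that keeps only the first few combining marks after each base character. The function name limit_combining_marks and the limit of 3 are assumptions made for this example, not values from any standard; some scripts legitimately stack more marks, so the limit would need tuning.

```python
import unicodedata

MAX_MARKS_PER_BASE = 3  # assumed limit; tune for the languages you expect


def limit_combining_marks(text, max_marks=MAX_MARKS_PER_BASE):
    """Keep at most max_marks combining marks after each base character."""
    result = []
    run_length = 0  # combining marks seen since the last base character
    # Normalize to NFC first so that common accented letters become single
    # precomposed code points and do not count against the limit.
    for char in unicodedata.normalize("NFC", text):
        if unicodedata.category(char) in ("Mn", "Me"):
            run_length += 1
            if run_length > max_marks:
                continue  # drop the excess mark
        else:
            run_length = 0
        result.append(char)
    return "".join(result)
```

    Ordinary accented text passes through unchanged, while a heavily stacked character is trimmed down to its first three marks.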
  • 2020-12-25 14:30

    Using PHP and the mindset of a demolition worker, you can get rid of the Zalgo with the iconv function. Of course, that will also kill every other character that ISO-8859-1 cannot represent.

    $unZalgoText = iconv("UTF-8", "ISO-8859-1//IGNORE", $zalgoText);
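
    To see how much collateral damage this kind of conversion does, the same idea can be sketched in Python 3. This is an analogue rather than a reproduction of the PHP call (iconv's handling of //IGNORE can vary between implementations), and the helper name to_latin1_lossy is made up for this example.

```python
def to_latin1_lossy(text):
    """Drop every character that has no ISO-8859-1 (Latin-1) encoding."""
    # errors="ignore" silently discards anything Latin-1 cannot represent,
    # which includes all combining marks but also many ordinary letters.
    return text.encode("latin-1", errors="ignore").decode("latin-1")


zalgo = "Z\u0351a\u0325lgo"            # "Zalgo" with two combining marks
czech = "P\u0159\u00edli\u0161"        # "Příliš"

cleaned_zalgo = to_latin1_lossy(zalgo)  # the marks are gone
cleaned_czech = to_latin1_lossy(czech)  # but so are "ř" and "š"
```

    The combining marks disappear, but so does every letter outside Latin-1, which is why this is only an option when the forum expects Western European text exclusively.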
    
  • 2020-12-25 14:42

    You can get rid of Zalgo text in your application using strip-combining-marks by Mathias Bynens.

    The module strip-combining-marks is available for browsers (via Bower) and Node.js applications (via npm).

    Here is an example of how to use it with npm:

    var stripCombiningMarks = require("strip-combining-marks");
    var zalgoText = 'U̼̥̻̮͍͖n͠i͏c̯̮o̬̝̠͉̤d͖͟e̫̟̗͟ͅ';
    var strippedText = stripCombiningMarks(zalgoText); // "Unicode"
    
  • 2020-12-25 14:43

    Assuming you're very serious about this and want a technical solution you could do as follows:

    1. Split the incoming text into smaller units (words or sentences);
    2. Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);
    3. Train a machine learning algorithm to judge if it looks too "dark" and "busy";
    4. If the algorithm's confidence is low defer to human moderators.

    This could be fun to implement but in practice it would likely be better to go to step four straight away.

    Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the above idea this won't try to determine the "aesthetics" of the text but will instead simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories add them to ZALGO_CHAR_CATEGORIES.

    #!/usr/bin/env python
    import unicodedata
    import codecs
    
    ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
    
    with codecs.open("zalgo", 'r', 'utf-8') as infile:
        for line in infile:
            print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),
    

    Example input:

    1
    H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
    2
    H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
    3
    

    Output:

    1
    How does Zalgo text work?
    2
    How does Zalgo text work?
    3
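
    The script above targets Python 2.7. A Python 3 version of the same filter might look like the sketch below; wrapping the logic in a function called strip_zalgo is a choice made for this example. It has the same drawback as the original: every nonspacing and enclosing mark is removed, legitimate or not.

```python
import unicodedata

ZALGO_CHAR_CATEGORIES = {"Mn", "Me"}  # "Mark, nonspacing" and "Mark, enclosing"


def strip_zalgo(text):
    """Remove every nonspacing and enclosing mark from text."""
    # NFD splits precomposed letters such as "é" into a base letter plus a
    # combining mark, so the mark can be filtered out like any other.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed
        if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES
    )


if __name__ == "__main__":
    # Print a cleaned sample; replace with file reading as needed.
    print(strip_zalgo("U\u033c\u0325n\u0360icode"))
```

    Note that this also turns "fiancé" into "fiance", which is the collateral damage mentioned above.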
    

    Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text, you could perform character frequency analysis. The program below does that for each line of the input file. The function is_zalgo calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters in the word). It then checks whether the third quartile of the words' scores is greater than THRESHOLD. With THRESHOLD set to 0.5, this tests whether at least one word in four consists of more than 50% Zalgo characters. (The THRESHOLD of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm is probably the best in terms of payoff versus coding effort.

    #!/usr/bin/env python
    from __future__ import division
    import unicodedata
    import codecs
    import numpy
    
    ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
    THRESHOLD = 0.5
    DEBUG = True
    
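    def is_zalgo(s):
        words = s.split()
        # A blank or whitespace-only line has no words to score,
        # and numpy.percentile cannot handle an empty list.
        if not words:
            return False
        word_scores = []
        for word in words:
            cats = [unicodedata.category(c) for c in word]
            # Share of this word's characters that fall in a banned category
            score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
            word_scores.append(score)
        total_score = numpy.percentile(word_scores, 75)
        if DEBUG:
            print total_score
        return total_score > THRESHOLD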
    
    with codecs.open("zalgo", 'r', 'utf-8') as infile:
        for line in infile:
            print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line
    

    Sample output:

    0.911483990148
    True    Señor, could you or your fiancé explain, H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
    
    0.333333333333
    False   Příliš žluťoučký kůň úpěl ďábelské ódy.  
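
    For Python 3 without NumPy, the same scoring can be sketched with the standard library alone. The percentile helper below is a hand-written stand-in for numpy.percentile using plain linear interpolation, and the split into zalgo_score and is_zalgo is an arrangement chosen for this example; the threshold is the same guessed 0.5.

```python
import unicodedata

ZALGO_CHAR_CATEGORIES = {"Mn", "Me"}
THRESHOLD = 0.5  # same guessed value as in the script above


def percentile(values, fraction):
    """Linearly interpolated percentile of a non-empty list (fraction in 0..1)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def zalgo_score(text):
    """Third quartile of the per-word share of combining-mark characters."""
    words = unicodedata.normalize("NFD", text).split()
    if not words:
        return 0.0  # blank input has nothing to score
    scores = [
        sum(unicodedata.category(c) in ZALGO_CHAR_CATEGORIES for c in word) / len(word)
        for word in words
    ]
    return percentile(scores, 0.75)


def is_zalgo(text, threshold=THRESHOLD):
    """True if the third-quartile word score exceeds the threshold."""
    return zalgo_score(text) > threshold
```

    Calling is_zalgo on each incoming comment gives a yes/no answer that can be used to reject the post or to route it to moderators, instead of silently rewriting the text.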
    