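Banned-word filtering is a must-have feature in practically every game, yet block lists often run past 100,000 entries, so implementing a reasonably efficient filter is not trivial.
Suppose the block list contains the word “龙在天”. When the input is “我是龙####在,,,,天哟”, the correct filtered result is “我是*哟”: the obfuscated word is replaced and the text around it is kept. The rest of this post discusses how that works.
First, take the first and last characters of “龙在天”, namely “龙” and “天”. When “我是龙####在,,,,天哟” arrives, check whether it contains that first and last character, extract the span between them (“龙####在,,,,天”), strip the noise characters in the middle, and compare what remains with the original banned word (“龙在天”). Nothing elaborate is done about noise here: a table of common symbols is used directly, which already covers most needs. (A possible refinement is to classify characters and assign priorities, treating low-priority ones as noise.)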
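To make the idea concrete before the full listing, here is a minimal sketch for a single banned word. It is not the author's implementation: the class name HeadTailSketch, the method Find, and the two-symbol noise table are assumptions chosen only to cover the example, and the banned word is assumed to have at least two characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeadTailSketch
{
    // Hypothetical noise table: only the two symbols used in the example.
    private static readonly HashSet<char> Noise = new HashSet<char> { '#', ',' };

    // Returns the (begin, end) indices of the first obfuscated occurrence of
    // banned in text, or (-1, -1) if there is none.
    public static (int Begin, int End) Find(string text, string banned)
    {
        char head = banned[0];
        char tail = banned[banned.Length - 1];
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != head) continue;
            for (int j = i + 1; j < text.Length; j++)
            {
                if (text[j] != tail) continue;
                // Candidate span from head to tail with noise characters removed.
                string candidate = new string(
                    text.Substring(i, j - i + 1).Where(c => !Noise.Contains(c)).ToArray());
                if (candidate == banned) return (i, j);
            }
        }
        return (-1, -1);
    }
}
```

With these assumptions, HeadTailSketch.Find("我是龙####在,,,,天哟", "龙在天") returns (2, 12), the span that would be replaced.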
Here is the complete code first; a brief walkthrough follows.
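using System.Collections.Generic;

// One banned word, stored under its last character.
public sealed class SubSensitiveWord
{
    public char Last;
    public string Source;
}

// All banned words that share the same first character.
public sealed class SensitiveWord
{
    public char First;
    public Dictionary<char, SubSensitiveWord> LastWords = new Dictionary<char, SubSensitiveWord>();
}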
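public static class XIllegalWordsManager
{
    // First character of a banned word -> every banned word that starts with it.
    private static readonly Dictionary<char /*first*/, SensitiveWord> SensitiveWordMap = new Dictionary<char, SensitiveWord>();

    // Characters treated as noise and ignored while matching.
    private static readonly HashSet<char> _specialCode = new HashSet<char>
    {
        '·','~','!','@','#','¥','%','…','&','*','(',')','-','=','—','+','【','】','{',
        ',' // full-width comma, needed for the “龙####在,,,,天” example
    };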
    // Builds the lookup table; call once at startup.
    // wordTemplateTable and XTableWords come from the author's project and are not shown here.
    public static bool Init()
    {
        foreach (XTableWords value in wordTemplateTable)
        {
            var word = value.Word;
            if (string.IsNullOrEmpty(word))
            {
                continue; // nothing to index
            }
            var first = word[0];
            var last = word[word.Length - 1];
            if (!SensitiveWordMap.TryGetValue(first, out var sensitiveWord))
            {
                sensitiveWord = new SensitiveWord { First = first };
                SensitiveWordMap.Add(first, sensitiveWord);
            }
            // Only the first word seen for a given (first, last) pair is kept.
            if (!sensitiveWord.LastWords.ContainsKey(last))
            {
                sensitiveWord.LastWords.Add(last, new SubSensitiveWord { Last = last, Source = word });
            }
        }
        return true;
    }
    // Returns source with every noise character removed.
    private static string FilterValidWords(string source)
    {
        var target = new System.Text.StringBuilder(source.Length);
        foreach (var c in source)
        {
            if (!_specialCode.Contains(c))
            {
                target.Append(c);
            }
        }
        return target.ToString();
    }
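    // Replaces every match in source with one copy of replaceChar.
    public static string Replace(string source, string replaceChar)
    {
        var target = new System.Text.StringBuilder();
        var lastEnd = -1; // index of the last character of the previous match
        var startIndex = 0;
        // A match spans at least two characters, so stop one short of the end.
        while (startIndex < source.Length - 1)
        {
            var (begin, end) = FindSensitiveWord(source, startIndex);
            if (begin < 0)
            {
                break;
            }
            // Copy the clean text between the previous match and this one.
            target.Append(source, lastEnd + 1, begin - lastEnd - 1);
            target.Append(replaceChar);
            lastEnd = end;
            startIndex = end + 1;
        }
        // Copy whatever follows the last match.
        target.Append(source, lastEnd + 1, source.Length - lastEnd - 1);
        return target.ToString();
    }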
    private static (int, int) FindSensitiveWord(string words, int startIndex)
    {
        int first = -1, last = -1;
        for (var i = startIndex; i < words.Length; ++i)
        {
            if (!SensitiveWordMap.TryGetValue(words[i], out var sensitiveWord))
            {
                continue;
            }
            first = i;
            for (var j = i + 1; j < words.Length; ++j)
            {
                if (!sensitiveWord.LastWords.TryGetValue(words[j], out var lastWord))
                {
                    continue;
                }
                last = j;
                var tempWords = words.Substring(first, last - first + 1);
                tempWords = FilterValidWords(tempWords);
                if (tempWords == lastWord.Source)
                {
                    return (first, last);
                }
            }
        }
        return (-1, -1);
    }
}
Start with the two class definitions at the top:
public sealed class SubSensitiveWord
{
    public char Last;
    public string Source;
}
public sealed class SensitiveWord
{
    public char First;
    public Dictionary<char, SubSensitiveWord> LastWords = new Dictionary<char, SubSensitiveWord>();
}
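At startup, every banned word is stored in this form. First is the word's first character and Last is its last character. Because many banned words may share a first character, each last character is bound to its original string and kept in a dictionary keyed by the last character. Words that share both the first and the last character are not handled for now; if they need to be, replacing Source with a HashSet is enough.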
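The HashSet variant mentioned above could look roughly like the following. This is a sketch under assumptions, not code from the original project: the WordIndex class and its Add and IsBanned methods are hypothetical names, and the candidate passed to IsBanned is assumed to have had its noise characters stripped already.

```csharp
using System;
using System.Collections.Generic;

public sealed class WordIndex
{
    // first char -> last char -> every banned word with that first and last character.
    private readonly Dictionary<char, Dictionary<char, HashSet<string>>> _map =
        new Dictionary<char, Dictionary<char, HashSet<string>>>();

    public void Add(string word)
    {
        if (string.IsNullOrEmpty(word)) return;
        char first = word[0];
        char last = word[word.Length - 1];
        if (!_map.TryGetValue(first, out var byLast))
        {
            byLast = new Dictionary<char, HashSet<string>>();
            _map.Add(first, byLast);
        }
        if (!byLast.TryGetValue(last, out var sources))
        {
            sources = new HashSet<string>();
            byLast.Add(last, sources);
        }
        sources.Add(word);
    }

    // True if candidate (noise already stripped) is a registered banned word.
    public bool IsBanned(string candidate)
    {
        if (string.IsNullOrEmpty(candidate)) return false;
        return _map.TryGetValue(candidate[0], out var byLast)
            && byLast.TryGetValue(candidate[candidate.Length - 1], out var sources)
            && sources.Contains(candidate);
    }
}
```

With this layout, “龙在天” and “龙飞天” can both be registered under the same (龙, 天) pair, and the final comparison becomes a set lookup instead of a single string equality check.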
Next comes the actual detection, since this structure exists only to serve detection and means little on its own. The core logic lives in this function:
private static (int, int) FindSensitiveWord(string words, int startIndex)
{
    int first = -1, last = -1;
    for (var i = startIndex; i < words.Length; ++i)
    {
        if (!SensitiveWordMap.TryGetValue(words[i], out var sensitiveWord))
        {
            continue;
        }
        first = i;
        for (var j = i + 1; j < words.Length; ++j)
        {
            if (!sensitiveWord.LastWords.TryGetValue(words[j], out var lastWord))
            {
                continue;
            }
            last = j;
            var tempWords = words.Substring(first, last - first + 1);
            tempWords = FilterValidWords(tempWords);
            if (tempWords == lastWord.Source)
            {
                return (first, last);
            }
        }
    }
    return (-1, -1);
}
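The first parameter is the string to search, and the second is the index at which to start checking. The function scans forward from startIndex and looks each character up in SensitiveWordMap to see whether it is the first character of some banned word. If it is, the index is recorded: first = i;
Then, from position i + 1 onward, it keeps scanning for a matching last character. When one is found, that index is recorded: last = j; Because this inner scan starts at i + 1, a banned word consisting of a single character is never matched.
At this point the suspicious substring can be cut out: var tempWords = words.Substring(first, last - first + 1); All that remains is to strip the noise characters from tempWords and compare the result with Source (the original banned word). If they are equal, the span is a banned word and the function returns the (first, last) indices. The noise-stripping code (the FilterValidWords function) is simple enough that it is not covered in detail.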
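FilterValidWords relies on a fixed symbol table. As a rough illustration of the character-classification idea mentioned earlier, the sketch below treats every character that is not a letter or digit as noise. The NoiseStripper class and the rule itself are assumptions for illustration, not what the original code does.

```csharp
using System;
using System.Text;

public static class NoiseStripper
{
    // Assumption: anything that is not a letter or digit counts as noise.
    // CJK ideographs are letters, so they are kept.
    public static bool IsNoise(char c) => !char.IsLetterOrDigit(c);

    // Returns source with every noise character removed.
    public static string Strip(string source)
    {
        var sb = new StringBuilder(source.Length);
        foreach (char c in source)
        {
            if (!IsNoise(c)) sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Under this rule NoiseStripper.Strip("龙####在,,,,天") yields "龙在天" without maintaining a table. The trade-off is that a banned word which itself contains punctuation could never match, and digits or letters used as filler are not stripped.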
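Once first and last can be located, the job is essentially done: after each hit, resume scanning right after last and repeat until the final character has been checked (the Replace function). That is not covered in detail either.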
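That's it. The code is nothing special, so go easy on it. Text sent to a server is normally length-limited, and the cost of this method depends mainly on the length of the string being checked, so with a length limit in place it performs reasonably well. Better ideas are welcome.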
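Since the cost grows with the length of the input, one possible refinement is to cap how far the inner scan may run from each candidate first character. The sketch below is an assumption-laden variant, not the original code: BoundedScanSketch, the maxSpan parameter, the two-symbol noise table, and the nested-dictionary word map are all hypothetical. It also builds the stripped candidate incrementally instead of re-cutting a substring for every candidate last character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BoundedScanSketch
{
    // Hypothetical noise table: only the two symbols used in the example.
    private static readonly HashSet<char> Noise = new HashSet<char> { '#', ',' };

    // words: first char -> (last char -> banned word).
    // maxSpan caps how many characters a match may cover, noise included.
    public static (int Begin, int End) Find(
        string text, Dictionary<char, Dictionary<char, string>> words, int maxSpan)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (!words.TryGetValue(text[i], out var byLast)) continue;
            // Non-noise characters seen so far, starting with the first character.
            var stripped = new StringBuilder().Append(text[i]);
            int limit = Math.Min(text.Length, i + maxSpan);
            for (int j = i + 1; j < limit; j++)
            {
                if (Noise.Contains(text[j])) continue;
                stripped.Append(text[j]);
                if (byLast.TryGetValue(text[j], out var word)
                    && stripped.Length == word.Length
                    && stripped.ToString() == word)
                {
                    return (i, j);
                }
            }
        }
        return (-1, -1);
    }
}
```

The cap is a trade-off: a banned word padded with more noise than maxSpan allows slips through, so the value would have to be tuned against the server's input length limit.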
Source: CSDN
Author: fly-dragon
Link: https://blog.csdn.net/dragonOnSky555/article/details/103812338