C#: Removing common invalid characters from a string: improve this algorithm

孤城傲影 2020-12-29 07:18

Consider the requirement to strip invalid characters from a string. The characters just need to be removed, that is, replaced with blank or string.Empty.



        
9 answers
  • 2020-12-29 07:30

    If you still want to do it in a LINQy way:

    public static string CleanUp(this string orig)
    {
        var badchars = new HashSet<char>() { '!', '@', '#', '$', '%', '_' };
    
        return new string(orig.Where(c => !badchars.Contains(c)).ToArray());
    }
    
  • 2020-12-29 07:32

    This is pretty clean. It restricts the string to valid characters instead of removing invalid ones. You should probably pull the two string literals out into constants:

    string clean = new string(@"Sour!ce Str&*(@ing".Where(c =>
        @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray());
    
  • 2020-12-29 07:33

    I don't know about the readability of it, but a regular expression could do what you need it to:

    someString = Regex.Replace(someString, @"[!@#$%_]", "");
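    If this runs often, one option is to build the Regex once and reuse it instead of passing the pattern string on every call. The following is only a sketch under that assumption; the class name `RegexCleaner` and method name `Clean` are made up for illustration, while the pattern is the one from the line above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // Same character class as above. Inside [...] the '$' is a literal
    // character, not an end-of-line anchor.
    private static readonly Regex BadChars =
        new Regex(@"[!@#$%_]", RegexOptions.Compiled);

    // Hypothetical helper: returns the input with every bad character removed.
    public static string Clean(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        return BadChars.Replace(input, string.Empty);
    }
}
```

    For example, `RegexCleaner.Clean("He!!o_W@rld")` returns `"HeoWrld"`.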
    
  • 2020-12-29 07:34

    The string class is immutable (even though it is a reference type), so methods such as Replace return a new string rather than modifying the original. Calling someString.Replace without assigning the result has no effect on your program. (It looks like you have already fixed this problem.)

    The main issue with your suggested algorithm is that it repeatedly allocates new strings, potentially causing a big performance hit. LINQ doesn't really help things here. (It doesn't make the code significantly shorter, and certainly not any more readable, in my opinion.)

    Try the following extension method. The key is the use of StringBuilder, created with a capacity of str.Length, so the result is built in a single pre-sized buffer rather than through a chain of intermediate strings.

    private static readonly HashSet<char> badChars = 
        new HashSet<char> { '!', '@', '#', '$', '%', '_' };
    
    public static string CleanString(this string str)
    {
        var result = new StringBuilder(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            if (!badChars.Contains(str[i]))
                result.Append(str[i]);
        }
        return result.ToString();
    }
    

    This algorithm also uses the HashSet<char> class (available since .NET 3.5) to get O(1) lookup time when checking for a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it is also a lot better on memory usage, as explained above.
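    To make the O(nm) versus O(n) comparison concrete, here is a rough side-by-side sketch. The question's original code is not shown here, so `CleanByReplace` is only an assumed shape of a repeated-Replace approach, and the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CleanComparison
{
    private static readonly char[] BadCharList = { '!', '@', '#', '$', '%', '_' };
    private static readonly HashSet<char> BadCharSet = new HashSet<char>(BadCharList);

    // Assumed shape of a repeated-Replace approach: each of the m calls scans
    // the whole string and can allocate a new one, so the work is O(n * m).
    public static string CleanByReplace(string str)
    {
        foreach (char c in BadCharList)
            str = str.Replace(c.ToString(), string.Empty);
        return str;
    }

    // Single pass over the input with O(1) set lookups: O(n) overall.
    public static string CleanByBuilder(string str)
    {
        var result = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (!BadCharSet.Contains(c))
                result.Append(c);
        }
        return result.ToString();
    }
}
```

    Both methods should return the same string for the same input; they differ only in how many passes and intermediate strings are needed to get there.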

  • 2020-12-29 07:34

    Something to consider: if this is for passwords (say), you want to scan for and keep the good characters, and assume everything else is bad. It's easier to correctly filter for good things than to try to guess all the bad things.

    For Each Character If Character is Good -> Keep it (copy to out buffer, whatever.)
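    The pseudocode above might look roughly like this in C#. It is a sketch only: treating ASCII letters and digits as "good" is an assumption made for illustration, and `WhitelistFilter`, `IsGood` and `KeepGood` are hypothetical names.

```csharp
using System.Text;

public static class WhitelistFilter
{
    // Assumption for illustration: "good" means ASCII letters and digits.
    private static bool IsGood(char c)
    {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
    }

    // Copies only the good characters to the output buffer.
    public static string KeepGood(string input)
    {
        var output = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (IsGood(c))
                output.Append(c);
        }
        return output.ToString();
    }
}
```

    For example, `WhitelistFilter.KeepGood("P@ss w0rd!")` returns `"Pssw0rd"`. Anything not explicitly listed as good, including characters nobody thought of, is dropped.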

    jeff

  • 2020-12-29 07:35

    This one is faster than HashSet<T>. Also, if you have to perform this action often, please consider the foundations for this question I asked here.

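    // One flag per possible char value (65,536 entries), indexed by character code.
    private static readonly bool[] BadCharValues;
    
    // "StaticConstructor" stands in for the name of the containing class;
    // a static constructor must be named after its class.
    static StaticConstructor()
    {
        BadCharValues = new bool[char.MaxValue+1];
        char[] badChars = { '!', '@', '#', '$', '%', '_' };
        foreach (char c in badChars)
            BadCharValues[c] = true;
    }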
    
    public static string CleanString(string str)
    {
        var result = new StringBuilder(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            if (!BadCharValues[str[i]])
                result.Append(str[i]);
        }
        return result.ToString();
    }
    