Find Unique Characters in a File

耶瑟儿~ 2021-02-04 03:30

I have a file with 450,000+ rows of entries. Each entry is about 7 characters long. I want to find the unique characters in this file.

For instance, if my f

22 Answers
  • 2021-02-04 04:24

    Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code

    using System;
    using System.IO;
    using System.Collections;
    using System.Diagnostics;
    
    namespace ConsoleApplication {
        class Program {
            static void Main(string[] args) {
                FileInfo fileInfo = new FileInfo(@"C:/data.txt");
                Console.WriteLine(fileInfo.Length);
    
                Stopwatch sw = new Stopwatch();
                sw.Start();
    
                Hashtable table = new Hashtable();
    
                StreamReader sr = new StreamReader(@"C:/data.txt");
                while (!sr.EndOfStream) {
                    char c = Char.ToLower((char)sr.Read());
                    if (!table.Contains(c)) {
                        table.Add(c, null);
                    }
                }
                sr.Close();
    
                foreach (char c in table.Keys) {
                    Console.Write(c);
                }
                Console.WriteLine();
    
                sw.Stop();
                Console.WriteLine(sw.ElapsedMilliseconds);
            }
        }
    }
    

    produces output

    4093767
    mytojevqlgbxsnidhzupkfawr
    c
    889
    Press any key to continue . . .

    The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.

  • 2021-02-04 04:24
    # Count how often each character occurs; the dict keys are the unique characters.
    with open("text.txt", "r") as f:
        s = f.read()
    unique = {}
    for ch in s:
        # dict.get replaces the Python 2-only has_key check
        unique[ch] = unique.get(ch, 0) + 1
    print(unique)
    
  • 2021-02-04 04:25

    Bash shell script version (no sed/awk):

    while read -r -n 1 char; do echo "$char"; done < entry.txt | tr 'A-Z' 'a-z' | sort -u
    

    UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.

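    #include <cctype>    // isspace, tolower
    #include <iostream>
    #include <set>
    
    int main() {
        std::set<char> seen_chars;
        std::set<char>::const_iterator iter;
        char ch;
    
        /* ignore whitespace and case */
        while ( std::cin.get(ch) ) {
            // Cast to unsigned char first: passing a negative char value
            // to the <cctype> functions is undefined behavior.
            unsigned char uch = static_cast<unsigned char>(ch);
            if (! std::isspace(uch) ) {
                seen_chars.insert(static_cast<char>(std::tolower(uch)));
            }
        }
    
        /* std::set keeps its elements sorted, so output is in ascending order */
        for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
            std::cout << *iter << std::endl;
        }
    
        return 0;
    }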
    

    Note that this version ignores whitespace and is case-insensitive, as requested.

    For a 450,000+ entry file (chars.txt), here's a sample run time:

    [user@host]$ g++ -o unique_chars unique_chars.cpp 
    [user@host]$ time ./unique_chars < chars.txt
    a
    b
    d
    o
    y
    
    real    0m0.638s
    user    0m0.612s
    sys     0m0.017s
    
  • 2021-02-04 04:25

    Algorithm:

    Slurp the file into memory.
    
    Create an array of unsigned ints, one per possible byte value, initialized to zero.
    
    Iterate through the in-memory file, using each byte as a subscript into the array.
        Increment that array element.
    
    Discard the in-memory file.
    
    Iterate over the array of unsigned ints.
        If the count is not zero,
            display the character and its corresponding count.
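
    The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the answerer's own code: the function names count_bytes and unique_bytes and the sample bytes are hypothetical, and displaying each byte with chr() assumes a single-byte encoding such as ASCII or Latin-1.

```python
def count_bytes(data):
    """Return a 256-entry table where index b holds how often byte b occurs."""
    counts = [0] * 256      # one counter per possible byte value
    for b in data:          # iterating a bytes object yields ints in 0..255
        counts[b] += 1
    return counts


def unique_bytes(data):
    """Map each byte that occurs (shown as a one-character string) to its count."""
    counts = count_bytes(data)
    # Keep only the array slots whose count is not zero.
    return {chr(b): n for b, n in enumerate(counts) if n != 0}


if __name__ == "__main__":
    # Hypothetical sample standing in for the file contents; for a real file,
    # read the bytes with open(path, "rb").read() instead.
    sample = b"abc\nabd\n"
    for ch, n in unique_bytes(sample).items():
        print(repr(ch), n)
```

    Because the table is indexed by byte value, the work is one pass over the data plus one pass over 256 counters, and memory for the counts stays fixed regardless of file size. Treating the input as raw bytes means multi-byte encodings such as UTF-8 would be counted per byte rather than per character.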
    