Find Unique Characters in a File

I have a file with 450,000+ rows of entries. Each entry is about 7 characters long. I want to find the unique characters in this file.

For instance, if my f

22 Answers

悲&欢浪女

2021-02-04 04:24

Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code

using System;
using System.IO;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication {
    class Program {
        static void Main(string[] args) {
            // Print the file size in bytes.
            FileInfo fileInfo = new FileInfo(@"C:/data.txt");
            Console.WriteLine(fileInfo.Length);

            Stopwatch sw = new Stopwatch();
            sw.Start();

            // The Hashtable is used as a set: its keys are the characters seen so far.
            Hashtable table = new Hashtable();

            // The using block closes the reader even if an exception is thrown.
            using (StreamReader sr = new StreamReader(@"C:/data.txt")) {
                while (!sr.EndOfStream) {
                    char c = Char.ToLower((char)sr.Read());
                    if (!table.Contains(c)) {
                        table.Add(c, null);
                    }
                }
            }

            // Print every distinct character (line-break characters included).
            foreach (char c in table.Keys) {
                Console.Write(c);
            }
            Console.WriteLine();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}

produces output

4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .

The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
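
For comparison, the same idea (lowercase every character and collect it in a set) can be sketched in a few lines of Python. This is only an illustration, not part of the C# program above: the function name unique_chars and the sample string are made up for the example, and the text is assumed to be plain ASCII. The last line rechecks the file-size arithmetic, assuming each row ends in a two-byte "\r\n".

```python
def unique_chars(text):
    """Return the set of distinct characters in text, ignoring case."""
    return set(text.lower())


# Hypothetical sample: three rows of seven letters with Windows line endings.
sample = "abcdefg\r\nABCDEFG\r\nhijklmn\r\n"
print(sorted(unique_chars(sample)))

# Size check: 454,863 rows of 7 letters plus 2 line-ending bytes each.
print(454863 * (7 + 2))
```

As in the C# output, the line-ending characters show up in the result because nothing filters them out.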

名媛妹妹

2021-02-04 04:24

# Count how often each character occurs in text.txt.
with open("text.txt", "r") as f:
    s = f.read()

unique = {}
for ch in s:
    if ch in unique:
        unique[ch] += 1
    else:
        unique[ch] = 1
print(unique)

失恋的感觉

2021-02-04 04:25

BASH shell script version (no sed/awk):

while read -r -n 1 char; do echo "$char"; done < entry.txt | tr 'A-Z' 'a-z' | sort -u

UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.

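#include <cctype>
#include <iostream>
#include <set>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while ( std::cin.get(ch) ) {
        /* <cctype> functions need a value representable as unsigned char */
        unsigned char uch = static_cast<unsigned char>(ch);
        if (! std::isspace(uch) ) {
            seen_chars.insert(static_cast<char>(std::tolower(uch)));
        }
    }

    /* std::set keeps its elements sorted, so the output is in order */
    for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
        std::cout << *iter << std::endl;
    }

    return 0;
}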

Note that this version ignores whitespace and is case-insensitive, as requested.

For a 450,000+ entry file (chars.txt), here's a sample run time:

[user@host]$ g++ -o unique_chars unique_chars.cpp 
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y

real    0m0.638s
user    0m0.612s
sys     0m0.017s

没有蜡笔的小新

2021-02-04 04:25

Algorithm: Slurp the file into memory.

Create an array of unsigned ints, one per possible byte value (256 entries), initialized to zero.

Iterate through the in-memory file, using each byte as a subscript into the array.
    Increment that array element.

Discard the in-memory file.

Iterate over the array of unsigned ints.
       If the count is not zero,
           display the character and its corresponding count.
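
The steps above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names count_bytes and report are made up for the example, the input is treated as raw single-byte data, and a short bytes literal stands in for the slurped file.

```python
def count_bytes(data):
    """Count how often each byte value occurs in data (a bytes object)."""
    counts = [0] * 256      # one counter per possible byte value
    for b in data:          # iterating over bytes yields ints in 0..255
        counts[b] += 1      # the byte itself is the array subscript
    return counts


def report(counts):
    """Return (character, count) pairs for every byte value that occurred."""
    return [(chr(value), n) for value, n in enumerate(counts) if n != 0]


# Hypothetical in-memory sample standing in for the slurped file.
data = b"abca\nbd\n"
for ch, n in report(count_bytes(data)):
    print(repr(ch), n)
```

For a real file, the data would come from reading it in binary mode, for example open(path, "rb").read(), where path is whatever file is being examined.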
