I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my f
Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code
using System;
using System.IO;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication {
    class Program {
        static void Main(string[] args) {
            FileInfo fileInfo = new FileInfo(@"C:/data.txt");
            Console.WriteLine(fileInfo.Length);

            Stopwatch sw = new Stopwatch();
            sw.Start();

            // The Hashtable keys serve as the set of characters seen so far.
            Hashtable table = new Hashtable();
            StreamReader sr = new StreamReader(@"C:/data.txt");
            while (!sr.EndOfStream) {
                char c = Char.ToLower((char)sr.Read());
                if (!table.Contains(c)) {
                    table.Add(c, null);
                }
            }
            sr.Close();

            foreach (char c in table.Keys) {
                Console.Write(c);
            }
            Console.WriteLine();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}
produces output
4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .
The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
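As a quick sanity check of the byte count, here is a minimal Python sketch of the arithmetic. It assumes the "+ 2" stands for a two-byte "\r\n" line terminator and that every character is a single byte; the answer does not say so explicitly, and the variable names are only illustrative.

```python
# Expected file size, assuming 7 single-byte characters per row
# plus a 2-byte "\r\n" terminator (an assumption about the "+ 2").
rows = 454_863
chars_per_row = 7
terminator_bytes = 2  # assumed: "\r\n"

expected_bytes = rows * (chars_per_row + terminator_bytes)
print(expected_bytes)  # 4093767
```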
# Count how often each character occurs (Python 3).
# The keys of the dictionary are the unique characters.
with open("text.txt", "r") as f:
    s = f.read()

unique = {}
for ch in s:
    if ch in unique:
        unique[ch] += 1
    else:
        unique[ch] = 1
print(unique)
BASH shell script version (no sed/awk):
while IFS= read -r -n 1 char; do printf '%s\n' "$char"; done < entry.txt | tr 'A-Z' 'a-z' | sort -u
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
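#include <cctype>
#include <iostream>
#include <set>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while ( std::cin.get(ch) ) {
        /* <cctype> functions require a value representable as unsigned char */
        unsigned char uc = static_cast<unsigned char>(ch);
        if (! std::isspace(uc) ) {
            seen_chars.insert(static_cast<char>(std::tolower(uc)));
        }
    }

    /* std::set keeps its elements sorted, so the output is in order */
    for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
        std::cout << *iter << std::endl;
    }
    return 0;
}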
Note that I'm ignoring whitespace and it's case insensitive as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user@host]$ g++ -o unique_chars unique_chars.cpp
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s
Algorithm:
Slurp the file into memory.
Create an array of 256 unsigned ints (one per possible byte value), initialized to zero.
Iterate through the in-memory file, using each byte as a subscript into the array, and increment that array element.
Discard the in-memory file.
Iterate over the array of unsigned ints; if a count is not zero, display the character and its corresponding count.
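The steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the file uses a single-byte encoding (so each byte maps to one character), and the function names `count_bytes` and `nonzero_counts`, as well as the sample data, are made up for the example.

```python
import os
import tempfile


def count_bytes(path):
    """Return a 256-element list; index b holds how often byte value b occurs."""
    with open(path, "rb") as f:
        data = f.read()      # slurp the file into memory
    counts = [0] * 256       # one counter per possible byte value
    for b in data:           # iterating a bytes object yields ints 0..255
        counts[b] += 1
    return counts            # 'data' goes out of scope here and is discarded


def nonzero_counts(counts):
    """Return (character, count) pairs for every byte that occurred."""
    # chr(b) assumes a single-byte encoding such as ASCII or Latin-1.
    return [(chr(b), n) for b, n in enumerate(counts) if n != 0]


# Demo on a small temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"abcabca\nbbbbbbb\n")
    sample_path = tmp.name

try:
    for ch, n in nonzero_counts(count_bytes(sample_path)):
        print(repr(ch), n)
finally:
    os.remove(sample_path)
```

Because the array is indexed directly by byte value, no hashing or tree lookup is needed per character. To ignore case or whitespace as in the other answers, the bytes could be lower-cased or filtered before counting.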