I have a file with 450,000+ rows of entries. Each entry is about 7 characters long. What I want to know is which unique characters appear in this file.
For instance, if my f
As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Alternative solution using bash:
sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$
EDIT: Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick, but then this solution has no real advantage over saua's (especially since he now uses the same sed pattern instead of explicit captures).
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or generally, dictionary) implementation and just omit the value field. Use your characters as keys. These data structures generally filter out duplicate entries (hence the name set, from its mathematical usage: sets don't have a particular order and contain only unique values).
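import codecs

# Open the file as UTF-8 text; the with block closes it when done.
with codecs.open('my_file_name', encoding='utf-8') as file:
    # Runtime: O(1)
    letters = set()
    # Runtime: O(n), where n is the total number of characters in the file
    for line in file:
        # Drop the line terminator so it is not reported as a character.
        for character in line.rstrip('\r\n'):
            letters.add(character)

# Runtime: O(k), where k is the number of unique characters
letter_str = ''.join(letters)
print(letter_str)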
Save as unique.py, and run as python unique.py.
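If a set type is not available, the dictionary fallback mentioned above might look roughly like this. It is only a sketch: the function name unique_chars and the sample rows are made up for illustration, and the result is sorted purely to make the output predictable.

```python
def unique_chars(lines):
    """Collect the distinct characters of an iterable of strings.

    A plain dict stands in for a set: the characters are the keys
    and the values are ignored.
    """
    seen = {}
    for line in lines:
        # Strip the line terminator so it is not counted as a character.
        for character in line.rstrip('\r\n'):
            seen[character] = None  # only the key matters
    return ''.join(sorted(seen))


if __name__ == '__main__':
    # Hypothetical sample rows standing in for the real file.
    sample = ['abcdefg\n', 'gfedcba\n', 'xyzabcd\n']
    print(unique_chars(sample))
```

Because assigning to an existing key just overwrites it, duplicates are dropped automatically, exactly as with a set.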
cat yourfile |
perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
In C++, I would first load the file into a string, then loop through the letters of the alphabet and call strchr() on the string for each one. This tells you whether that letter exists in the file; if it does, add it to the list.
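A rough sketch of that idea, shown in Python to match the earlier example rather than in C++. The in operator stands in for strchr(), and the function name letters_present is hypothetical. Note that this approach only finds characters from the alphabet you choose to scan (assumed here to be the ASCII lowercase letters) and rescans the text once per letter.

```python
import string


def letters_present(text, alphabet=string.ascii_lowercase):
    """Return the characters of `alphabet` that occur somewhere in `text`."""
    found = []
    for letter in alphabet:
        # Membership test on the whole text, playing the role of strchr().
        if letter in text:
            found.append(letter)
    return ''.join(found)


if __name__ == '__main__':
    # Hypothetical file contents held in memory as one string.
    print(letters_present('hello\nworld\n'))
```

For a fixed, small alphabet this is simple and fast enough, but it will silently miss any character outside the scanned alphabet, which the set-based approach does not.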