Find Unique Characters in a File

前端未结


I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.

For instance, if my f


22 Answers

别跟我提以往

2021-02-04 04:05

Quick and dirty C program that's blazingly fast:

```
#include <stdio.h>

int main(void)
{
  int chars[256] = {0}, c;

  /* mark every byte value that occurs on standard input */
  while((c = getchar()) != EOF)
    chars[c] = 1;

  for(c = 32; c < 127; c++)  /* printable ASCII chars only */
  {
    if(chars[c])
      putchar(c);
  }

  putchar('\n');

  return 0;
}
```

Compile it, then run

```
cat file | ./a.out
```

to get a list of the unique printable characters in file.


予麋鹿

2021-02-04 04:07
Python using a dictionary. I don't know why people are so tied to sets or lists for holding things. Granted, a set is probably more efficient than a dictionary, but both offer constant-time (on average) membership tests, and both run circles around a list, where every character requires a linear search to see whether it is already present. Lists and dictionaries are also built-in Python data types that everyone should be using all the time, so even if a set doesn't come to mind, a dictionary should.
```
letters = {}
# the with-block closes the file automatically
with open('location.txt', 'r') as file:
    for line in file:
        # strip() drops the trailing newline (and surrounding whitespace)
        for character in line.strip():
            if character not in letters:
                letters[character] = True
print("Unique Characters: {" + "".join(letters.keys()) + "}")
```
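For comparison, here is a minimal sketch of the set-based variant of the same loop. The function name `unique_chars` and the sample lines are illustrative assumptions, not part of the answer above; with a real file you would pass the open file object instead of the list.

```python
def unique_chars(lines):
    """Collect every distinct character seen in an iterable of text lines."""
    seen = set()
    for line in lines:
        # set.update adds each character of the string; duplicates are ignored
        seen.update(line.rstrip("\n"))
    return seen


# hypothetical sample data standing in for the lines of the file
sample = ["abc123\n", "cab321\n"]
print("Unique Characters: {" + "".join(sorted(unique_chars(sample))) + "}")
```

Because a set stores only keys, there is no dummy `True` value to carry around, and the membership check is implicit.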
不要未来只要你来

2021-02-04 04:07
Quick and dirty solution using grep (assuming the file name is "file"). Note that it only checks the letters a to z:
```
for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    # -q: no output, just an exit status; -i: ignore case
    if grep -qi "$char" file; then
        printf '%s' "$char"
    fi
done
echo
```
I could have made it a one-liner, but I wanted to keep it easy to read.

(EDIT: forgot the -i switch to grep)

不要未来只要你来

2021-02-04 04:09

While not a script, this Java program will do the job. It's easy to understand and fast (to run).

```
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

public class Unique {
    public static void main(String[] args) throws IOException {
        int c;
        // TreeSet keeps the characters sorted and drops duplicates
        Set<Character> s = new TreeSet<>();
        // read() returns -1 at end of input
        while ((c = System.in.read()) != -1) {
            s.add(Character.toLowerCase((char) c));
        }
        System.out.println("Unique characters:" + s);
    }
}
```

Invoke it like this on Windows:

type yourFile | java Unique

or like this on Unix-like systems:

cat yourFile | java Unique

For instance, the unique characters in the HTML of this question are:

Unique characters:[ , , ,  , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, @, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]


情书的邮戳

2021-02-04 04:09
Well, my friend, I think this is what you had in mind... at least, this is the Python version!
```
with open("location.txt", "r") as f:  # open file (closed automatically)
    # read file into memory, split into individual characters, sort the list
    ll = sorted(f.read().lower())

# eliminate duplicates: keep a character only if it differs from its predecessor
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])]
# print the characters; a newline in the input shows up as a line break
print("Unique Characters: {%s}" % "".join(ll))
```
It has no explicit per-character loop, and it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending on your RAM), but for shorter files it is fun :)

I also have to add my final attack! Admittedly, I eliminated two lines by using standard input instead of a file, and I reduced the active code from 3 lines to 2. If I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports. Anyway, now we are having fun :)
```
import itertools, sys

# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print("Unique Characters: {%s}" % "".join(ll))  # print the characters; a newline in the input shows up as a line break
```
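If memory is the concern for very large files, one possible workaround is to read the input in fixed-size chunks instead of all at once. The following is only a sketch under that assumption: the function name `unique_chars_streaming`, the `chunk_size` default, and the in-memory demo stream are all illustrative and not part of the answer above. It also swaps the sort-and-compare step for a set.

```python
import io


def unique_chars_streaming(stream, chunk_size=64 * 1024):
    """Return the sorted distinct lowercased characters of a text stream."""
    seen = set()
    while True:
        chunk = stream.read(chunk_size)  # at most chunk_size characters
        if not chunk:                    # empty string means end of input
            break
        seen.update(chunk.lower())
    return sorted(seen)


# hypothetical demo input; a real run could pass sys.stdin or an open file
demo = io.StringIO("Hello\nWorld\n")
print("Unique Characters: {%s}" % "".join(unique_chars_streaming(demo, chunk_size=4)))
```

Only one chunk plus the set of distinct characters is held in memory at any time, so the file size no longer matters much.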
谎友^

2021-02-04 04:09
An answer above mentioned using a dictionary.

If so, the code presented there can be streamlined a bit, since the Python documentation states:

It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary).... If you store using a key that is already in use, the old value associated with that key is forgotten.

Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:
```
    if character not in letters:
```
And that should make it a little faster.
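As a sketch of what the streamlined loop might look like (the function name `unique_chars_dict` and the sample lines are illustrative assumptions; the dictionary answer itself reads from location.txt):

```python
def unique_chars_dict(lines):
    """Record each character as a dictionary key, without a membership check."""
    letters = {}
    for line in lines:
        for character in line.strip():
            # assigning to an existing key simply overwrites its value,
            # so repeated characters never create duplicate keys
            letters[character] = True
    return letters


# hypothetical sample data standing in for the lines of the file
letters = unique_chars_dict(["abca\n", "bcd\n"])
print("Unique Characters: {" + "".join(letters.keys()) + "}")
```

Whether this is measurably faster depends on the input; it mainly removes a line without changing the result.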