Find Unique Characters in a File

耶瑟儿~ 2021-02-04 03:30

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.

For instance, if my f

22 Answers
  • 2021-02-04 04:05

    Quick and dirty C program that's blazingly fast:

    #include <stdio.h>
    
    int main(void)
    {
      int chars[256] = {0}, c;
      while((c = getchar()) != EOF)
        chars[c] = 1;
      for(c = 32; c < 127; c++)  // printable chars only
      {
        if(chars[c])
          putchar(c);
      }
    
      putchar('\n');
    
      return 0;
    }
    

    Compile it, then run:

    cat file | ./a.out
    

    This prints the unique printable characters in file.
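
For readers who prefer a script, the same flag-table idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name `unique_printable` and the sample input are made up, and it assumes the data is handled as raw bytes, as in the C version.

```python
def unique_printable(data: bytes) -> str:
    """Return the distinct printable ASCII characters in data, in code order."""
    seen = bytearray(256)      # one flag per possible byte value
    for byte in data:          # iterating over bytes yields ints 0..255
        seen[byte] = 1
    # Same range as the C loop: 32 (space) through 126 (~).
    return "".join(chr(code) for code in range(32, 127) if seen[code])


# Hypothetical sample input; for a real file, pass open("file", "rb").read().
print(unique_printable(b"hello\nworld\n"))  # prints "dehlorw"
```

Like the C program, this skips newlines and other control characters because only codes 32 to 126 are reported.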

  • 2021-02-04 04:07

    Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted, a set is probably more efficient than a dictionary, but both are supposed to take constant time to access items, and both run circles around a list, where for each character you have to search the list to see whether the character is already in it. Also, lists and dictionaries are built-in Python datatypes that everyone should be using all the time, so even if set doesn't come to mind, dictionary should.

    file = open('location.txt', 'r')
    letters = {}
    # Iterating over the file stops at end of file, so no empty-line check is needed.
    for line in file:
      for character in line.strip():  # strip() drops the newline and edge whitespace
        if character not in letters:
          letters[character] = True
    file.close()
    print("Unique Characters: {" + "".join(letters.keys()) + "}")
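
For comparison, the set-based variant that this answer alludes to could be sketched as follows. The function name `unique_chars` and the sample lines are hypothetical; the function accepts any iterable of lines, so an open file object should work as well.

```python
def unique_chars(lines):
    """Collect the distinct characters of each stripped line into a set."""
    letters = set()
    for line in lines:
        # update() adds every character of the string; duplicates are ignored.
        letters.update(line.strip())
    return letters


# Hypothetical sample data standing in for the lines of location.txt.
sample = ["abc\n", "bcd\n"]
print("Unique Characters: {" + "".join(sorted(unique_chars(sample))) + "}")
```

Both the set and the dictionary give average constant-time membership checks, so the choice is mostly about which reads more clearly.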
    
  • Quick and dirty solution using grep (assuming the file name is "file"):

    for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        # -q: print nothing, just set the exit status; -i: ignore case
        if grep -qi "$char" file; then
            printf '%s' "$char"
        fi
    done
    echo
    

    I could have made it a one-liner, but I wanted to keep it easy to read.

    (EDIT: forgot the -i switch to grep)
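
The shell loop scans the file once per letter. A single-pass sketch of the same case-insensitive letter check in Python might look like this; `letters_present` and the sample text are illustrative names, not anything from the answer.

```python
import string


def letters_present(text):
    """Return the letters a-z that occur in text, ignoring case, in order."""
    present = set(text.lower())  # one pass over the input
    return "".join(ch for ch in string.ascii_lowercase if ch in present)


# Hypothetical sample text; for a real file, pass open("file").read().
print(letters_present("Hello\nWorld\n"))  # prints "dehlorw"
```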

  • While not a script, this Java program will do the work. It's easy to understand and fast (to run).

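    import java.io.IOException;
    import java.util.Set;
    import java.util.TreeSet;

    public class Unique {
        public static void main(String[] args) throws IOException {
            int c;
            // TreeSet keeps the characters sorted and free of duplicates.
            Set<Character> s = new TreeSet<>();
            // read() returns -1 at end of input; 0 is a valid byte value.
            while ((c = System.in.read()) != -1) {
                s.add(Character.toLowerCase((char) c));
            }
            System.out.println("Unique characters:" + s);
        }
    }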
    

    You'll invoke it like this:

    type yourFile | java Unique
    

    or

    cat yourFile | java Unique
    

    For instance, the unique characters in the HTML of this question are:

    Unique characters:[ , , ,  , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, @, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
    
  • 2021-02-04 04:09

    Well my friend, I think this is what you had in mind....At least this is the python version!!!

    f = open("location.txt", "r") # open file
    
    ll = sorted(f.read().lower()) # read file into memory, split into individual characters, sort list
    ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
    f.close()
    print("Unique Characters: {%s}" % "".join(ll)) # print the characters; a newline in the file shows up as a line break
    

    It has no explicit loop over each character, and it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM), but for shorter files it is fun :)

    I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, I also reduced the active code from 3 lines to 2. Basically if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports.....Anyway now we are having fun :)

    import itertools, sys
    
    # read standard input into memory, split into characters, eliminate duplicates
    ll = map(lambda x: x[0], itertools.groupby(sorted(sys.stdin.read().lower())))
    print("Unique Characters: {%s}" % "".join(ll)) # print the characters; a newline in the input shows up as a line break
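
To see what the `groupby` step contributes, here is a small sketch that checks it against the more common `sorted(set(...))` idiom. The helper names and the sample string are hypothetical, and the input is a literal rather than standard input so the snippet runs on its own.

```python
import itertools


def unique_sorted_groupby(text):
    """Sort the characters, then keep one key per run of equal neighbours."""
    return [key for key, _group in itertools.groupby(sorted(text.lower()))]


def unique_sorted_set(text):
    """Deduplicate with a set first, then sort the survivors."""
    return sorted(set(text.lower()))


sample = "Banana\n"
print(unique_sorted_groupby(sample) == unique_sorted_set(sample))  # prints True
```

`groupby` only merges adjacent equal items, which is why the sort has to come first; the set version removes duplicates before sorting, so it sorts far fewer items on a large input.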
    
  • 2021-02-04 04:09

    An answer above mentioned using a dictionary.

    If so, the code presented there can be streamlined a bit, since the Python documentation states:

    "It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary).... If you store using a key that is already in use, the old value associated with that key is forgotten."

    Therefore, this line of the code can be removed (with the assignment beneath it dedented one level), since the dictionary keys will always be unique anyway:

        if character not in letters:
    

    And that should make it a little faster.
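
The streamlined loop described above might look like the following sketch. The function name `unique_chars_dict` and the sample lines are made up for illustration; whether it is measurably faster would need timing on the real file.

```python
def unique_chars_dict(lines):
    """Record each character as a dictionary key, with no membership test."""
    letters = {}
    for line in lines:
        for character in line.strip():
            # Assigning to an existing key overwrites its value, so each
            # character still ends up with exactly one entry.
            letters[character] = True
    return letters


# Hypothetical sample data standing in for the lines of location.txt.
letters = unique_chars_dict(["aab\n", "bca\n"])
print(len(letters))  # prints 3
```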
