Find Unique Characters in a File

耶瑟儿~ 2021-02-04 03:30

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.

For instance, if my f

22 Answers
  • 2021-02-04 03:59

    Python w/sets (quick and dirty)

    s = open("data.txt", "r").read()
    print("Unique Characters: {%s}" % ''.join(set(s)))
    

    Python w/sets (with nicer output)

    import re
    
    text = open("data.txt", "r").read().lower()
    unique = re.sub(r'\W', '', ''.join(set(text)))  # Drop non-word characters (letters, digits and "_" are kept)
    
    print("Unique Characters: {%s}" % unique)
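
    Both snippets above read the whole file into memory first. With 450,000+ rows that is usually fine, but a line-by-line variant is a small change. The following is a minimal sketch rather than part of the original answer; the names unique_chars and unique_chars_in_file are hypothetical, and UTF-8 is only an assumed encoding.

```python
def unique_chars(lines):
    """Return the set of distinct characters in an iterable of lines.

    Newline and carriage-return terminators are ignored.
    """
    seen = set()
    for line in lines:
        # set.update adds every character of the stripped line at once.
        seen.update(line.rstrip("\r\n"))
    return seen


def unique_chars_in_file(path, encoding="utf-8"):
    # Iterating over the file object yields one line at a time,
    # so memory use stays small regardless of file size.
    # The encoding default is an assumption; adjust it to the real file.
    with open(path, "r", encoding=encoding) as handle:
        return unique_chars(handle)
```

    A call such as print("".join(sorted(unique_chars_in_file("data.txt")))) would then print the characters in sorted order.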
    
  • 2021-02-04 04:01

    A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.

    Why the arbitrary limitation that you need a "script" that does it?

    What exactly is a script anyway?

    Would Python do?

    If so, then this is one solution:

    import sys

    s = set()
    while True:
        line = sys.stdin.readline()
        if not line:
            break
        line = line.rstrip()
        for c in line.lower():
            s.add(c)

    print("".join(sorted(s)))
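
    The "read standard input, aggregate, print the result" idea can also be applied to raw bytes, which avoids per-line decoding. This is only a sketch under the assumption that the data uses a single-byte encoding such as ASCII; unique_bytes and the chunk_size parameter are hypothetical names, and unlike the script above it does not lowercase its input.

```python
def unique_bytes(stream, chunk_size=65536):
    """Return the distinct byte values read from a binary stream, sorted."""
    seen = set()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of input
            break
        # Iterating over a bytes object yields integers in 0..255.
        seen.update(chunk)
    # Drop line terminators so they are not reported as characters.
    seen.discard(ord("\n"))
    seen.discard(ord("\r"))
    return bytes(sorted(seen))
```

    Passing sys.stdin.buffer as the stream would make it read standard input, for example unique_bytes(sys.stdin.buffer).decode("ascii").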
    
  • 2021-02-04 04:02

    A C solution. Admittedly, it is not the quickest solution to write. But since it is already written and can be copied and pasted, I think it counts as "fast to implement" for the poster :) I didn't see any other C solutions, so I wanted to post one for the pure sadistic pleasure :)

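    #include <stdio.h>
    #include <stdlib.h>

    #define CHARSINSET 256
    #define FILENAME "location.txt"

    char buf[CHARSINSET + 1];

    /* Build a string of every byte value that was seen at least once. */
    char *getUniqueCharacters(const int *charactersInFile) {
        int x;
        char *bufptr = buf;
        /* Start at 1: a NUL byte would terminate the string early. */
        for (x = 1; x < CHARSINSET; x++) {
            if (charactersInFile[x] > 0)
                *bufptr++ = (char)x;
        }
        *bufptr = '\0';
        return buf;
    }

    int main(void) {
        FILE *fp;
        int c; /* int, not char, so EOF can be told apart from byte 255 */
        int *charactersInFile = calloc(CHARSINSET, sizeof(int));
        if (charactersInFile == NULL) {
            printf("Out of memory.\n");
            return 1;
        }
        if (NULL == (fp = fopen(FILENAME, "r"))) {
            printf("File not found.\n");
            free(charactersInFile);
            return 1;
        }
        /* Count every byte except line terminators. */
        while ((c = getc(fp)) != EOF) {
            if (c != '\n' && c != '\r')
                charactersInFile[c]++;
        }

        fclose(fp);
        printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
        free(charactersInFile);
        return 0;
    }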
    
  • 2021-02-04 04:02

    Here's a PowerShell example:

    gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
    

    which produces:

    D
    Y
    a
    b
    o

    I like that it's easy to read.

    EDIT: Here's a faster version:

    $letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
    
  • 2021-02-04 04:05

    Try this script with JSDB JavaScript (which includes the JavaScript engine used in the Firefox browser):

    var seenAlreadyMap={};
    var seenAlreadyArray=[];
    while (!system.stdin.eof)
    {
      var L = system.stdin.readLine();
      for (var i = L.length; i-- > 0; )
      {
        var c = L[i].toLowerCase();
        if (!(c in seenAlreadyMap))
        {
          seenAlreadyMap[c] = true;
          seenAlreadyArray.push(c);
        }
      }
    }
    system.stdout.writeln(seenAlreadyArray.sort().join(''));
    
  • 2021-02-04 04:05

    Python without using a set.

    letters = []
    # The with-block closes the file once the loop finishes.
    with open('location', 'r') as f:
        for line in f:
            for character in line:
                if character not in letters:
                    letters.append(character)

    print(letters)
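
    The "not in letters" test scans the list on every character. That is cheap here because the list never grows beyond the number of distinct characters, but a dict gives constant-time lookups and still avoids a set. A minimal sketch, assuming Python 3.7+ (where dicts preserve insertion order); unique_in_order is a hypothetical name, and unlike the loop above it skips line terminators.

```python
def unique_in_order(lines):
    """Return distinct characters in first-seen order, without using a set."""
    seen = {}
    for line in lines:
        for character in line.rstrip("\r\n"):
            # setdefault inserts the key only if it is not already present,
            # so the dict's insertion order records first appearance.
            seen.setdefault(character, None)
    return list(seen)
```

    For a file this could be called as unique_in_order(open('location', 'r')), ideally inside a with-block so the file is closed afterwards.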
    