How to count the number of unique values of a field in a tab-delimited text file?

2020-12-23 14:21

I have a tab-delimited text file with a large amount of data. I want to have a look at the data such that I can see the unique values in a given column.

7 Answers
  • 2020-12-23 14:33
    # COLUMN is the integer column number
    # INPUT_FILE is the input file name

    cut -f "${COLUMN}" < "${INPUT_FILE}" | sort -u | wc -l
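
    For example, a minimal invocation, assuming a hypothetical file data.tsv and counting the unique values in column 3:

    COLUMN=3
    INPUT_FILE=data.tsv
    cut -f "${COLUMN}" < "${INPUT_FILE}" | sort -u | wc -l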
    
  • 2020-12-23 14:33

    Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it prints a synopsis of each column in turn. Apart from bash itself, it uses only standard *nix/Mac tools: sed, tr, wc, cut, sort, and uniq.

    #!/bin/bash
    # Syntax: $0 filename
    # The input is assumed to be a .tsv file

    FILE="$1"

    # Number of columns = number of tabs in the header line + 1
    cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
    cols=$((cols + 1))
    for ((i = 1; i <= cols; i++))
    do
      echo "Column $i ::"
      cut -f "$i" < "$FILE" | sort | uniq -c
      echo
    done
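
    A sample invocation, assuming the script is saved under the hypothetical name colsummary.sh:

    chmod +x colsummary.sh
    ./colsummary.sh data.tsv

    Each column is printed in turn with its distinct values and their occurrence counts (from uniq -c).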
    
  • 2020-12-23 14:37
    This prints each unique value in the first column of a tab-delimited file together with its number of occurrences:

    awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] }' test.csv
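
    A sketch of a variant that parameterizes the column number through awk's -v option (column 2 and the file name data.tsv are assumptions for illustration):

    awk -F '\t' -v col=2 '{ a[$col]++ } END { for (n in a) print n, a[n] }' data.tsv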
    
  • 2020-12-23 14:38

    This script outputs the unique values in each column of a given file, along with their occurrence counts. It assumes that the first line of the file is a header line, and there is no need to define the number of fields. Simply save the script in a bash file (.sh) and pass the tab-delimited file as a parameter. Note that it uses true multidimensional arrays (arr[header][value]), which require GNU awk (gawk 4.0 or later).

    Code

    #!/bin/bash
    # Requires GNU awk (gawk 4.0+) for true multidimensional arrays.

    awk -F '\t' '
    (NR==1){
        # Remember the column headers from the first line.
        for(fi=1; fi<=NF; fi++)
            fname[fi]=$fi;
    }
    (NR!=1){
        # Count occurrences of each value, keyed by column header.
        for(fi=1; fi<=NF; fi++)
            arr[fname[fi]][$fi]++;
    }
    END{
        # Print one line per column: header, then value_count pairs.
        for(fi=1; fi<=NF; fi++){
            out=fname[fi];
            for (item in arr[fname[fi]])
                out=out"\t"item"_"arr[fname[fi]][item];
            print(out);
        }
    }
    ' "$1"
    

    Execution Example:

    bash> ./script.sh <path to tab-delimited file>

    Output Example

    isRef    A_15      C_42     G_24     T_18
    isCar    YEA_10    NO_40    NA_50
    isTv     FALSE_33  TRUE_66
    
  • 2020-12-23 14:40

    You can use awk, sort and uniq to do this. For example, to list all the unique values in the first column (-F '\t' makes awk split on tabs rather than on general whitespace):

    awk -F '\t' < test.txt '{print $1}' | sort | uniq
    

    As posted elsewhere, if you want to count the number of instances of something, you can pipe the unique list into wc -l, as shown below.
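
    For example, reusing the test.txt name from above:

    awk -F '\t' < test.txt '{print $1}' | sort | uniq | wc -l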

  • 2020-12-23 14:43

    You can make use of the cut, sort and uniq commands as follows:

    cat input_file | cut -f 1 | sort | uniq
    

    This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.

    Avoiding UUOC (a useless use of cat) :)

    cut -f 1 input_file | sort | uniq
    

    EDIT:

    To count the number of unique occurrences, you can add wc to the chain:

    cut -f 1 input_file | sort | uniq | wc -l
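
    Relatedly, if you want a count for each distinct value rather than the total number of distinct values, uniq -c provides it (the final sort -rn, which orders values by descending count, is optional):

    cut -f 1 input_file | sort | uniq -c | sort -rn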
    