Sort a text file by line length including spaces

故里飘歌 2020-11-27 11:21

I have a CSV file that looks like this:

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Exampl         


        
11 answers
  • 2020-11-27 11:38

    With POSIX Awk:

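    {
      c = length
      # test membership explicitly so empty lines and a line like "0" are kept
      if (c in m) m[c] = m[c] RS $0
      else m[c] = $0
      if (c > max) max = c
    } END {
      # for (c in m) visits keys in unspecified order, so walk lengths upward
      for (c = 0; c <= max; c++)
        if (c in m) print m[c]
    }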
    

  • 2020-11-27 11:38

    Here is a multibyte-compatible method of sorting lines by length. It requires:

    1. wc -m is available to you (macOS has it).
    2. Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8 (or a full locale name such as en_US.UTF-8). You can set this in your .bash_profile, or simply prefix the following command with it.
    3. testfile has a character encoding matching your locale (e.g., UTF-8).

    Here's the full command:

    cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
    

    Explaining part-by-part:

    • l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).
    • cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
    • cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
    • close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
    • sub(/ */, "", c); ← trims white space from the character count value returned by wc.
    • { print c, $0 } ← prints the line's character count value, a space, and the original line.
    • | sort -ns ← sorts the lines numerically (-n) by the prepended character counts, keeping lines with equal counts in their original order (-s).
    • | cut -d" " -f2- ← removes the prepended character count values.
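
The quoting step is the least obvious part of the command, so here is a minimal sketch of it in isolation. The helper name `shell_quote` is hypothetical (it is not part of the answer); it applies the same `gsub` to wrap one line of text in single quotes, then uses `eval` to confirm that the shell reads the original text back.

```shell
# Hypothetical helper: wrap a single line in single quotes for the shell,
# replacing each ' with '"'"' (close quote, double-quoted ', reopen quote).
shell_quote() {
  printf '%s\n' "$1" |
    awk '{ gsub(/\047/, "\047\"\047\"\047"); printf "\047%s\047\n", $0 }'
}

quoted=$(shell_quote "it's 5 o'clock")
printf '%s\n' "$quoted"          # the quoted form that awk would pass to echo
eval "printf '%s\n' $quoted"     # the shell parses it back to the original text
```

The second `printf` should reproduce the input unchanged, which is what makes it safe to build the `echo ... | wc -m` command from arbitrary lines.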

    It's slow (only 160 lines per second on a fast MacBook Pro) because it must execute a sub-command for each line.

    Alternatively, do the whole job in gawk, which has been multibyte-aware since version 3.1.5 and would be significantly faster. Escaping and quoting each line so that it can pass safely through a shell command from awk is a lot of trouble, but it is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
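
Whether the `wc -m` workaround is needed depends on how the installed awk counts. A quick check, sketched under the assumption that the shell is running in a UTF-8 locale; the function name `awk_counts_chars` is hypothetical:

```shell
# Hypothetical helper: succeeds if the given awk (default: awk) counts
# characters rather than bytes in the current locale.
awk_counts_chars() {
  # \303\251 is the two-byte UTF-8 encoding of é
  n=$(printf '\303\251\n' | "${1:-awk}" '{ print length($0) }')
  [ "$n" -eq 1 ]
}

if awk_counts_chars; then
  echo "awk counts characters"
else
  echo "awk counts bytes"
fi
```

Pass another binary name as the first argument (for example `awk_counts_chars gawk`) to test a different implementation.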

  • 2020-11-27 11:39

    Benchmark results

    Below are the results of a benchmark across solutions from other answers to this question.

    Test method

    • 10 sequential runs on a fast machine, averaged
    • Perl 5.24
    • awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
    • The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)

    Results

    1. Caleb's perl solution took 11.2 seconds
    2. my perl solution took 11.6 seconds
    3. neillb's awk solution #1 took 20 seconds
    4. neillb's awk solution #2 took 23 seconds
    5. anubhava's awk solution took 24 seconds
    6. Jonathan's awk solution took 25 seconds
    7. Fretz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.

    Another perl solution

    perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
    
  • 2020-11-27 11:41

    Try this command instead:

    awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
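
The pipeline is a decorate-sort-undecorate pattern. A small sketch of it on sample input; the function name `sort_by_length` is hypothetical. It also adds `-s -k1,1`, which GNU and BSD sort support (`-s` is not in POSIX), so that lines of equal length keep their input order instead of being compared as whole lines.

```shell
# Hypothetical wrapper around the pipeline above.
sort_by_length() {
  awk '{ print length($0), $0 }' "$@" |  # decorate: "<length> <line>"
    sort -n -s -k1,1 |                   # sort on the length only; -s keeps ties in input order
    cut -d ' ' -f2-                      # undecorate: drop the length column
}

printf '%s\n' 'ccc' 'a' 'bb' 'aa' | sort_by_length
# prints a, bb, aa, ccc (one per line)
```

Called with a file name (`sort_by_length your-file`) it reads the file; with no arguments it reads standard input.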
    
  • 2020-11-27 11:46

    Pure Bash:

    declare -a sorted
    
    while IFS= read -r line; do                       # IFS= and -r keep whitespace and backslashes intact
      if [ -z "${sorted[${#line}]+x}" ]; then         # no entry for this length yet?
        sorted[${#line}]=$line                        # first line of this length
      else
        sorted[${#line}]+=$'\n'$line                  # append to lines with equal length
      fi
    done < data.csv
    
    for key in "${!sorted[@]}"; do                    # indices expand in ascending order
      printf '%s\n' "${sorted[$key]}"                 # print lines with equal length
    done
    