Sort a text file by line length including spaces

I have a CSV file that looks like this:

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Exampl

11 answers

北海茫月

2020-11-27 11:38
With POSIX Awk:
```
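{
  c = length
  if (c > max) max = c
  m[c] = (c in m) ? m[c] RS $0 : $0
} END {
  # for (c in m) has no guaranteed order, so walk the lengths explicitly
  for (c = 0; c <= max; c++) if (c in m) print m[c]
}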
```

一向

2020-11-27 11:38
Here is a multibyte-compatible method of sorting lines by length. It requires:
1. `wc -m` is available to you (macOS has it).
2. Your current locale supports multibyte characters, e.g., by setting `LC_ALL=UTF-8`. You can set this in your `.bash_profile`, or simply prepend it to the following command.
3. `testfile` has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
```
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
```
Explaining part-by-part:
- `l=$0; gsub(/\047/, "\047\"\047\"\047", l);` ← makes a copy of each line in the awk variable `l` and replaces every `'` with `'"'"'` so the line can safely be echoed inside a single-quoted shell string (`\047` is a single quote in octal notation).
- `cmd=sprintf("echo \047%s\047 | wc -m", l);` ← builds the command we'll execute, which echoes the escaped line to `wc -m`.
- `cmd | getline c;` ← executes the command and reads the character count it returns into the awk variable `c`.
- `close(cmd);` ← closes the pipe to the shell command to avoid hitting the system limit on the number of open files in one process.
- `sub(/ */, "", c);` ← strips the leading spaces that `wc` pads the character count with.
- `{ print c, $0 }` ← prints the line's character count, a space, and the original line.
- `| sort -ns` ← sorts the lines numerically (`-n`) by the prepended character count, keeping the sort stable (`-s`) so equal-length lines stay in their original order.
- `| cut -d" " -f2-` ← removes the prepended character counts.
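The quoting step is the easiest part to get wrong, so here is a minimal sketch of the same per-line step done directly in the shell. The helper name `count_chars` is hypothetical, and `sed` stands in for awk's `gsub`; the escaping itself (close the quote, emit a double-quoted `'`, reopen the quote) is the one described above.

```shell
# count_chars is a hypothetical helper, not part of the original answer.
# It builds the same "echo '<line>' | wc -m" command string the awk script builds,
# escaping each ' as '"'"' so the line cannot break out of its single quotes.
count_chars() {
  escaped=$(printf '%s\n' "$1" | sed "s/'/'\"'\"'/g")
  cmd="echo '$escaped' | wc -m"
  # tr removes the padding some wc implementations put around the number
  sh -c "$cmd" | tr -d ' '
}

count_chars "it's"   # prints 5: four characters plus the newline echo appends
```

Because `echo` appends a newline, every count is one higher than the line's own length. The offset is the same for every line, so the sort order is unaffected.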
It's slow (only 160 lines per second on a fast MacBook Pro) because it must execute a sub-command for each line.

Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
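For comparison, the gawk-only route could look roughly like the sketch below. It assumes gawk is installed and a UTF-8 locale is set as described above; the wrapper name `sort_by_chars` and the `AWK` override variable are hypothetical, introduced only for illustration.

```shell
# sort_by_chars is a hypothetical wrapper; AWK defaults to gawk, as suggested above.
# In a UTF-8 locale, a multibyte-aware gawk's length() counts characters, not bytes.
sort_by_chars() {
  "${AWK:-gawk}" '{ print length($0), $0 }' "$@" |
    sort -n -s |
    cut -d ' ' -f 2-
}

# Usage (file name taken from the answer above):
#   sort_by_chars testfile
```

No per-line sub-command is spawned here, which is why this should be much faster than the `wc -m` approach.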

闹比i

2020-11-27 11:39
Benchmark results

Below are the results of a benchmark across solutions from other answers to this question.

Test method
- 10 sequential runs on a fast machine, averaged
- Perl 5.24
- awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
- The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)
Results
1. Caleb's perl solution took 11.2 seconds
2. my perl solution took 11.6 seconds
3. neillb's awk solution #1 took 20 seconds
4. neillb's awk solution #2 took 23 seconds
5. anubhava's awk solution took 24 seconds
6. Jonathan's awk solution took 25 seconds
7. Fretz's bash solution took 400x longer than the awk solutions (measured on a truncated test case of 100,000 lines). It works fine, it just takes forever.
Another perl solution
```
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
```

天涯浪人

2020-11-27 11:41
Try this command instead:
```
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
```
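To make the three stages of this pipeline visible, here is a small sketch on made-up input. The `sample`, `decorate`, and `by_length` names are hypothetical; the commands inside them are the ones from the pipeline above.

```shell
# Hypothetical sample input; the second line starts with two spaces.
sample() { printf '%s\n' 'medium line' '  x' 'the longest line here'; }

# Stage 1: prefix each line with its length (spaces are counted too).
decorate() { awk '{ print length, $0 }' "$@"; }

# Stages 2 and 3: sort numerically on the prefix, then cut the prefix off.
by_length() { decorate "$@" | sort -n | cut -d ' ' -f 2-; }

sample | decorate    # 11 medium line / 3   x / 21 the longest line here
sample | by_length   # "  x", "medium line", "the longest line here"
```

Since `cut` only removes the first space-delimited field, leading spaces in the original line survive. Adding `-s` to `sort` would also keep equal-length lines in their input order.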


别跟我提以往

2020-11-27 11:46

Pure Bash:

```
declare -a sorted

while IFS= read -r line; do                 # IFS= and -r keep spaces and backslashes intact
  n=${#line}
  if [ -z "${sorted[n]+set}" ]; then        # does this line length already exist?
    sorted[n]=$line                         # first line of this length
  else
    sorted[n]+=$'\n'$line                   # append to lines of equal length
  fi
done < data.csv

for key in "${!sorted[@]}"; do              # indices come back in ascending order
  printf '%s\n' "${sorted[key]}"            # print all lines of this length
done
```
