I have a CSV file that looks like this:

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Exampl
With POSIX Awk:
{
    c = length                    # length of the current line ($0)
    m[c] = m[c] ? m[c] RS $0 : $0 # collect lines of equal length, newline-joined
} END {
    for (c in m) print m[c]       # print each length bucket
}
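As a usage sketch (the input file name input.txt is illustrative), the script can be run inline. One caveat: POSIX leaves the iteration order of for (c in m) unspecified, so ascending-length output is not strictly guaranteed on every awk; gawk can force it with PROCINFO["sorted_in"] = "@ind_num_asc".

```shell
# hypothetical sample input: three lines of lengths 2, 3, and 1
printf 'bb\nccc\na\n' > input.txt

# run the script from above; note that POSIX does not specify the
# for-in iteration order, so buckets may not print in ascending length
awk '{ c = length; m[c] = m[c] ? m[c] RS $0 : $0 }
     END { for (c in m) print m[c] }' input.txt
```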
Example
Here is a multibyte-compatible method of sorting lines by length. It requires that:

- wc -m is available to you (macOS has it).
- LC_ALL=UTF-8 is set. You can set this either in your .bash_profile, or simply by prepending it to the following command.
- testfile has a character encoding matching your locale (e.g., UTF-8).

Here's the full command:
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part:

l=$0; gsub(/\047/, "\047\"\047\"\047", l);
← makes a copy of each line in awk variable l and escapes every ' as '"'"' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).

cmd=sprintf("echo \047%s\047 | wc -m", l);
← this is the command we'll execute, which echoes the escaped line to wc -m.

cmd | getline c;
← executes the command and copies the character count value that is returned into awk variable c.

close(cmd);
← closes the pipe to the shell command, to avoid hitting a system limit on the number of open files in one process.

sub(/ */, "", c);
← trims leading white space from the character count value returned by wc.

{ print c, $0 }
← prints the line's character count value, a space, and the original line.

| sort -ns
← sorts the lines (by the prepended character count values) numerically (-n), maintaining stable sort order (-s).

| cut -d" " -f2-
← removes the prepended character count values.

It's slow (only 160 lines per second on a fast MacBook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk
(as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
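A sketch of that gawk variant (it assumes gawk is installed, and reuses the testfile name from above): under a UTF-8 locale, gawk's length() counts characters rather than bytes, so no per-line wc -m sub-command is needed.

```shell
# hypothetical UTF-8 test file: lines of 1, 2, and 3 characters
# (accented letters are multibyte, but count as one character each)
printf 'héé\naà\nô\n' > testfile

# assumes gawk is installed (it is not shipped by default on macOS)
gawk '{ print length($0), $0 }' testfile | sort -ns | cut -d" " -f2-
```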
Below are the results of a benchmark across solutions from other answers to this question.

awk solutions (using a truncated test case of 100000 lines): they work fine, just take forever.

perl solution:
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
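For example (the file name your-file and its contents are illustrative):

```shell
# build a small demo file with lines of length 13, 5, and 8
printf 'a longer line\nshort\nmid line\n' > your-file

# prefix each line with its length, sort numerically, then strip the prefix
awk '{print length, $0}' your-file | sort -n | cut -d" " -f2-
# → short
#   mid line
#   a longer line
```

Because cut keeps everything from the second space-delimited field onward (-f2-), lines that themselves contain spaces come through intact.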
Pure Bash:
declare -a sorted                 # indexed array: key = line length

while IFS= read -r line; do       # -r and IFS= keep backslashes and whitespace intact
    if [ -z "${sorted[${#line}]}" ]; then             # does line length already exist?
        sorted[${#line}]="$line"                      # element for new length
    else
        sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
    fi
done < data.csv

for key in "${!sorted[@]}"; do    # iterate over existing indices, in ascending order
    echo -e "${sorted[$key]}"     # echo lines with equal length
done