Sort a text file by line length including spaces

后端未结


I have a CSV file that looks like this:

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Exampl

11 Answers

失恋的感觉

2020-11-27 11:24

1) Pure awk solution. Suppose that no line can be longer than 1024 characters; the following then prints the shortest line:

cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'

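2) One-liner bash solution, assuming every line is a single word; it can be reworked for any case where all lines have the same number of words. Note that `wc -L` is not part of POSIX, and that the sort has to be numeric (`-n`) so that a length of 10 does not sort before 9:

LINES=$(cat filename); for k in $LINES; do printf '%s ' "$k"; echo "$k" | wc -L; done | sort -n -k2 | head -n 1 | cut -d " " -f1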
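
The 1024 in approach 1 is only a starting sentinel. A minimal sketch that avoids the cap by seeding the minimum from the first line instead (the function name `shortest_line` is my own invention, not something from the answer above):

```shell
#!/bin/sh
# shortest_line: print the first shortest line of the given files (or stdin).
# Hypothetical helper; the minimum starts from line 1, so no length cap is assumed.
shortest_line() {
    awk 'NR == 1 || length($0) < min { min = length($0); s = $0 }
         END { if (NR > 0) print s }' "$@"
}

printf 'ccc\na\nbb\n' | shortest_line   # prints: a
```

On ties this keeps the earliest shortest line, because the comparison is strictly "less than".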

遥遥无期

2020-11-27 11:29
The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC, the "useless use of cat").
```
awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'
```
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
```
awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
```
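
If you want to convince yourself that spaces are counted, something like the following sketch should do. The wrapper names `show_lengths` and `sort_by_length` are made up for illustration; the pipeline inside is the first one above.

```shell
#!/bin/sh
# show_lengths: prefix each line with its length exactly as awk measures it.
show_lengths() {
    awk '{ printf "%d:%s\n", length($0), $0 }' "$@"
}

# sort_by_length: sort on the numeric prefix, then strip it again.
sort_by_length() {
    show_lengths "$@" | sort -n | sed 's/^[0-9]*://'
}

# "a b" and "  x" both report 3, so blanks count toward the length.
printf 'abcd\na b\n  x\n' | show_lengths
```

Only the length prefix is removed by sed, so leading or embedded spaces in the original line survive untouched.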
天涯浪人

2020-11-27 11:33
The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there. But if you just want to get the job done quickly and don't care what tool you use, one option is Perl's sort() function with a custom comparison routine applied to the input lines. Here is a one-liner:
```
perl -e 'print sort { length($a) <=> length($b) } <>'
```
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.

In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.
野的像风

2020-11-27 11:33
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
```
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
```

南方客

2020-11-27 11:35

Using Raku (formerly known as Perl 6):

~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56

To reverse the sort, add .reverse in the middle of the chain of method calls--immediately after .sort(). Here's code showing that .chars includes spaces:

~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0

Here's a time comparison between awk and Raku using a 9.1 MB text file from GenBank:

~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
    
    real    0m1.308s
    user    0m1.213s
    sys     0m0.173s
    
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-  > /dev/null
    
    real    0m1.189s
    user    0m1.170s
    sys     0m0.050s

HTH.

https://raku.org


佛祖请我去吃肉

2020-11-27 11:38
Answer
```
cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
```
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
```
cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
```
In both cases, we have solved your stated problem by moving away from awk for your final cut.

Lines of matching length - what to do in the case of a tie:

The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.

(Those who want more control of sorting these ties might look at sort's --key option.)
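
As one possible use of --key (a sketch, not the only way to do it): sort numerically on the length field, then break ties on the rest of the line. The function name `sort_by_length_then_text` is hypothetical, and `LC_ALL=C` is my own addition to make the tie order byte-wise and predictable.

```shell
#!/bin/sh
# sort_by_length_then_text: primary key is the length (field 1, numeric);
# equal-length lines are ordered by the remaining text (field 2 to end of line).
sort_by_length_then_text() {
    awk '{ print length, $0 }' "$@" |
        LC_ALL=C sort -k1,1n -k2 |
        cut -d" " -f2-
}

printf 'ccb\nccz\ncca\ng\n' | sort_by_length_then_text
# prints:
# g
# cca
# ccb
# ccz
```

Changing `-k2` to `-k2r` should reverse the order within each group of equal-length lines while leaving the length order alone.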

Why the question's attempted solution fails (awk line-rebuilding):

It is interesting to note the difference between:
```
echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'
```
They yield respectively
```
hello   awk   world
hello awk world
```
The relevant section of the gawk manual mentions only as an aside that awk rebuilds the whole of $0 (joining the fields with the output field separator, OFS) when you change one field. I guess it's not crazy behaviour. It has this:

"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
```
 $1 = $1   # force record to be reconstituted
 print $0  # or whatever else with $0
```
"This forces awk to rebuild the record."
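
A small sketch of why this matters for length-based sorting: once the record is rebuilt, its length changes. The helper names `squeeze_fields` and `length_before_after` are made up for this example.

```shell
#!/bin/sh
# squeeze_fields: force awk to rebuild $0, joining fields with the given OFS
# (defaults to a single space when no argument is passed).
squeeze_fields() {
    awk -v OFS="${1:- }" '{ $1 = $1; print }'
}

# length_before_after: print the record length before and after the rebuild.
length_before_after() {
    awk '{ before = length($0); $1 = $1; print before, length($0) }'
}

printf 'hello   awk   world\n' | squeeze_fields ','      # prints: hello,awk,world
printf 'hello   awk   world\n' | length_before_after     # prints: 19 15
```

So any pipeline that assigns to a field before printing will hand later stages a line whose runs of spaces have been collapsed.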

Test input including some lines of equal length:
```
aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g
```