grep: match all characters up to (not including) first blank space

南旧 2021-02-05 04:24

I have a text file that has the following format:

characters(that I want to keep) (space) characters(that I want to remove)

So for example:

4 Answers
  • 2021-02-05 04:50

    I realize this has long since been answered with the grep solution, but for future generations I'd like to note that there are at least two other solutions for this particular situation, both of which are more efficient than grep.

    Since you are not doing any complex text pattern matching, just taking the first column delimited by a space, you can use some of the utilities which are column-based, such as awk or cut.

    Using awk

    $ awk '{print $1}' text1.txt > text2.txt
    

    Using cut

    $ cut -f1 -d' ' text1.txt > text2.txt
    

    Benchmarks on a ~1.1MB file

    $ time grep -o '^[^ ]*' text1.txt > text2.txt
    
    real    0m0.064s
    user    0m0.062s
    sys     0m0.001s
    $ time awk '{print $1}' text1.txt > text2.txt
    
    real    0m0.021s
    user    0m0.017s
    sys     0m0.004s
    $ time cut -f1 -d' ' text1.txt > text2.txt
    
    real    0m0.007s
    user    0m0.004s
    sys     0m0.003s
    

    awk is about 3x faster than grep, and cut is about 3x faster than that. The difference is negligible for a single run on a small file like this, but if you're writing a script for re-use, or doing this often on large files, you might appreciate the extra efficiency.
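
    One caveat worth knowing before swapping one tool for the other: awk and cut do not split fields the same way. The sketch below uses made-up sample lines (not from the original question) to show that awk treats any run of blanks as one separator and skips leading blanks, while cut -d' ' treats every single space as a delimiter and does not split on tabs.

    ```shell
    # Hypothetical sample lines, chosen only to show the difference.
    leading=' foo bar'                   # starts with a space
    tabbed=$(printf 'foo\tbar baz')      # tab between the first two words

    # awk: default field splitting ignores leading blanks and splits on tabs too.
    awk_lead=$(printf '%s\n' "$leading" | awk '{print $1}')
    awk_tab=$(printf '%s\n' "$tabbed" | awk '{print $1}')

    # cut: the first field ends at the first literal space, wherever it is.
    cut_lead=$(printf '%s\n' "$leading" | cut -d' ' -f1)
    cut_tab=$(printf '%s\n' "$tabbed" | cut -d' ' -f1)

    printf 'awk: [%s] [%s]\n' "$awk_lead" "$awk_tab"
    printf 'cut: [%s] [%s]\n' "$cut_lead" "$cut_tab"
    ```

    For input that always has exactly one space after the first column, as in the question, the two give the same result.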

  • 2021-02-05 04:54

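    You are putting the quantifier * in the wrong place.

    Try this instead:

    grep -o '^[^[:space:]]*' text1.txt > text2.txt
    

    or, even better:

    grep -o '^\S*' text1.txt > text2.txt
    

    \S matches a non-whitespace character (a GNU grep extension), and the anchor ^ matches at the beginning of the line. The -o option makes grep print only the matched text; without it, the whole matching line is printed. Note that \s has no special meaning inside a bracket expression, which is why the first form uses [:space:].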
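
    To see why printing only the match matters, here is a minimal sketch on a made-up sample line (the variable names are just for illustration, and GNU grep is assumed). Without -o, grep prints every line that matches, in full; with -o it prints only the part the pattern matched.

    ```shell
    # Hypothetical sample line in the question's "keep (space) remove" format.
    sample='keepthis dropthis'

    # Without -o: the line matches, so grep prints all of it.
    whole=$(printf '%s\n' "$sample" | grep '^[^[:space:]]*')

    # With -o: only the text matched by the pattern is printed.
    first=$(printf '%s\n' "$sample" | grep -o '^[^[:space:]]*')

    printf 'without -o: [%s]\n' "$whole"
    printf 'with -o:    [%s]\n' "$first"
    ```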

  • 2021-02-05 05:12

    I use egrep a lot to help "colorize" log lines, so I'm always looking for a new twist on regex. For me, the above works better by adding a \W like this:

    $ egrep --color '^\S*\W|bag' /tmp/barf -o
    foo
    bag
    hello
    bag
    keepthis
    (etc.)
    

    The problem is that my log files are almost always time-stamped, so I added a line to the example file:

    2013-06-11 date stamped line
    

    and then it doesn't work so well. So I reverted to my previous regex:

    egrep --color '^\w*\b|bag' /tmp/barf
    

    but the non-date-stamped lines revealed problems with that. It is hard to see this without colorization...
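
    Since the difference is hard to see without color, here is a rough sketch that prints what each pattern matches on the time-stamped line, wrapped in brackets. show_match is a hypothetical helper, grep -E stands in for egrep, and GNU grep is assumed for \S, \W, \w and \b.

    ```shell
    # Hypothetical helper: print only the text that pattern $1 matches in line $2.
    show_match() {
        printf '%s\n' "$2" | grep -E -o "$1"
    }

    stamped='2013-06-11 date stamped line'

    # \S* runs up to the first whitespace, then \W consumes that space as well.
    nonspace=$(show_match '^\S*\W' "$stamped")

    # \w* stops at the first non-word character, which here is the hyphen.
    wordonly=$(show_match '^\w*\b' "$stamped")

    printf '[%s]\n[%s]\n' "$nonspace" "$wordonly"
    ```

    The first pattern keeps the whole date plus the trailing space, while the second stops after the year.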

  • 2021-02-05 05:12

    Following up on the answer by @Steve, if you want to use a different separator (e.g. a comma), you can specify it using -F. This is useful if you want the content of each line up to the first comma, such as when reading the value of the first field in a CSV file.

    $ awk -F "," '{print $1}' text1.txt > text2.txt
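
    As a quick check, here is a small sketch on a made-up CSV line (the data and variable names are only for illustration). It also shows the equivalent cut call. Neither handles quoted fields that contain commas, so this assumes simple, unquoted CSV.

    ```shell
    # Hypothetical CSV record with plain, unquoted fields.
    csv_line='alice,30,london'

    # awk: -F sets the field separator to a comma.
    awk_field=$(printf '%s\n' "$csv_line" | awk -F ',' '{print $1}')

    # cut: -d sets the delimiter, -f1 selects the first field.
    cut_field=$(printf '%s\n' "$csv_line" | cut -d',' -f1)

    printf 'awk: %s\ncut: %s\n' "$awk_field" "$cut_field"
    ```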
    