Each word on a separate line

囚心锁ツ 2021-01-11 09:36

I have a sentence like

This is for example

I want to write this to a file such that each word in this sentence is written to a separate line.

7 answers
  •  离开以前
    2021-01-11 10:42

    N.B. I wrote this in a few drafts simplifying the regexp so if there's any inconsistency that's probably why.

    Do you care about punctuation marks? For example, in some invocations a 'word' like (etc) would come out exactly like that, parentheses included, or you would get 'parentheses.' rather than 'parentheses'. If you're parsing a file with proper sentences that could be a problem, especially if you want to sort by word or get a count for each word.
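
To make the concern concrete, here is a minimal sketch of a plain whitespace split. The function name split_on_whitespace is made up for illustration; it only turns runs of whitespace into newlines, so punctuation stays attached to the words.

```shell
# Hypothetical helper (not from the answer): split on whitespace only,
# printing one token per line. Punctuation is left untouched.
split_on_whitespace() {
    printf '%s\n' "$1" | tr -s '[:space:]' '\n'
}

split_on_whitespace "a word like (etc) with parentheses."
```

The last two tokens of interest come out as (etc) and parentheses. with the brackets and the full stop still attached, which is what would skew a per-word count.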

    There are ways to deal with this, but there are some caveats and certainly room for improvement. The caveats have to do with numbers, dashes (in numbers) and decimal points/dots (in numbers). Having an exact set of rules would help resolve this, but the examples below should give you something to work from. I have made some contrived inputs to demonstrate these flaws (or whatever you wish to call them).

    $ echo "This is an example sentence with punctuation marks and digits i.e. , . ; \! 7 8 9" | grep -o -E '\<[A-Za-z0-9.]*\>'
    This
    is
    an
    example
    sentence
    with
    punctuation
    marks
    and
    digits
    i.e
    7
    8
    9
    

    As you can see, i.e. turns out to be just i.e, and the other punctuation marks are not shown. Okay, but this leaves out things like version numbers of the form major.minor.revision-release, e.g. 0.0.1-1; can these be shown too? Yes:

    $ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[-A-Za-z0-9.]*\>'
    The
    current
    version
    is
    0.0.1-1
    The
    previous
    version
    was
    current
    from
    2017-2018
    

    Observe that the full stops ending the sentences are not part of the output. What happens if you add a space between the years and the dash? You won't have the dash, but each year will be on its own line:

    $ echo "2017 - 2018" | grep -o -E '\<[-A-Za-z0-9.]*\>'
    2017
    2018
    

    The question then becomes whether you want a - by itself to be counted; by the very nature of separating words you won't get the years as a single string if there are spaces. Since a dash is not a word by itself, I would think not.

    I am sure these could be simplified further. In addition if you don't want any punctuation or numbers at all you could change it to:

    $ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z]*\>'
    The
    current
    version
    is
    The
    previous
    version
    was
    current
    from
    

    If you wanted to have the numbers:

    $ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z0-9]*\>'
    The
    current
    version
    is
    0
    0
    1
    1
    The
    previous
    version
    was
    current
    from
    2017
    2018
    

    As for 'words' with both letters and numbers that's another thing that might or might not be of consideration but demonstrating the above:

    $ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z0-9]*\>'
    The
    current
    version
    is
    0
    0
    1
    1
    test1
    

    That outputs them. But the following does not (because it doesn't consider digits at all, so test1 is dropped as a whole rather than cut down to test):

    $ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z]*\>'
    The
    current
    version
    is
    

    It's quite easy to disregard punctuation marks, but in some cases there might be a need or desire for them. In the case of e.g., I suppose you could use, say, sed to change lines like e.g to e.g., but that would be a personal preference, I guess.
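
A rough sketch of that sed idea, assuming the only abbreviations you care about are e.g and i.e (that list, and the function name words_with_abbreviations, are assumptions for illustration). Note that it appends the dot to every line that is exactly e.g or i.e, whether or not the input had one.

```shell
# Hypothetical helper: read text on stdin, print one word per line,
# then put the trailing dot back on lines that are exactly e.g or i.e.
# The abbreviation list is an assumption; extend the alternation as needed.
words_with_abbreviations() {
    grep -o -E '\<[-A-Za-z0-9.]*\>' | sed -E 's/^(e\.g|i\.e)$/&./'
}

echo "Use a pager, e.g. less, i.e. not cat." | words_with_abbreviations
```

The other words are untouched, so pager and cat still lose their comma and full stop.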

    I can summarise how it works but only just; I’m far too tired to even consider much:

    How does it work?

    I will only explain the invocation grep -o -E '\<[-A-Za-z0-9.]*\>' but much of it is the same in the others (the vertical bar/pipe symbol in extended grep allows for more than one pattern):

    The -o option prints only the matches rather than the entire line. The -E selects extended regular expressions (egrep would do the same, although recent GNU grep releases warn that it is obsolescent). As for the regexp itself:

    The \< and \> are word boundaries (beginning and end respectively; you can specify only one if you want). I believe the -w option is the same as specifying both, but maybe the invocation is a bit different (I don't actually know).

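    The '\<[-A-Za-z0-9.]*\>' says dashes, upper and lower case letters, digits and a dot, zero or more times, between word boundaries. As for why it then turns e.g. into e.g: \> only matches right after a word character (a letter, digit or underscore) that is not followed by another one. A dot is not a word character, so the match cannot end on the trailing dot and backs off to the last letter.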
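
The effect of the closing \> on a trailing dot can be checked by running the same character class with and without it. This is only a sketch: the two function names are made up, and \< and \> are GNU-style extensions, so a grep that supports them is assumed.

```shell
# Both boundaries: the match has to end right after a word character,
# so a trailing dot is given back.
both_boundaries() {
    grep -o -E '\<[-A-Za-z0-9.]*\>'
}

# Start boundary only: the match may run on through a trailing dot.
start_boundary_only() {
    grep -o -E '\<[-A-Za-z0-9.]*'
}

echo "e.g. version 0.0.1-1." | both_boundaries
echo "e.g. version 0.0.1-1." | start_boundary_only
```

The first call prints e.g, version and 0.0.1-1; the second prints e.g., version and 0.0.1-1. with the dots kept. Dropping \> therefore preserves abbreviations, at the cost of also keeping sentence-ending full stops.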

    Bonus script for word frequency count

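    #!/bin/bash
    # Print a word frequency table for each file named on the command line.
    
    if [ $# -eq 0 ]; then
        echo "Usage: $(basename "${0}") [FILE...]"
        exit 1
    fi
    
    for file do
        if [ -e "${file}" ]; then
            echo "** ${file}: "
            # One word per line, then count duplicates, most frequent first.
            grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}" | sort | uniq -c | sort -rn
        else
            # Report the file being processed, not always the first argument.
            echo >&2 "${file}: file not found"
        fi
    done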
    

    Example:

    $ cat example 
    The current version is 0.0.1-1 but the previous version was non-existent.
    
    This sentence contains an abbreviation i.e. e.g. (so actually two abbreviations).
    
    This sentence has no numbers and no punctuation  
    $ ./wordfreq example 
    ** example: 
       2 version
       2 sentence
       2 no
       2 This
       1 was
       1 two
       1 the
       1 so
       1 punctuation
       1 previous
       1 numbers
       1 non-existent
       1 is
       1 i.e
       1 has
       1 e.g
       1 current
       1 contains
       1 but
       1 and
       1 an
       1 actually
       1 abbreviations
       1 abbreviation
       1 The
       1 0.0.1-1
    

    N.B. I didn't transliterate upper case to lower case so the words 'The' and 'the' show up as different words. If you wanted them to be all lower case you could change the grep invocation in the script to be piped to tr before sorting:

        grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}" | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
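
To see what the case folding changes without touching the script, here is the same pipeline as a small sketch that reads from standard input. The function name word_freq_lower is made up for illustration.

```shell
# Hypothetical helper: word frequency count on stdin with upper case
# folded to lower case before counting.
word_freq_lower() {
    grep -o -E '\<[-A-Za-z0-9.]*\>' \
        | tr '[:upper:]' '[:lower:]' \
        | sort | uniq -c | sort -rn
}

echo "The cat saw the other cat and THE dog" | word_freq_lower
```

Here The, the and THE are counted together as three occurrences of the, followed by two of cat; without the tr step each spelling would get its own line.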
    

    Oh, and since you asked: if you want to write the result to a file, you can just append a redirection to the command line (this is for the raw invocation):

    > output_file
    

    For the script you would use it like:

    $ ./wordfreq file1 file2 file3 > output_file
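
Putting the raw invocation and the redirection together for the sentence in the question, a minimal sketch could look like this. The function name write_words and the output path are assumptions chosen for illustration.

```shell
# Hypothetical helper: write each word of a sentence to a file, one per line.
# $1 is the sentence, $2 is the file to (over)write.
write_words() {
    printf '%s\n' "$1" | grep -o -E '\<[-A-Za-z0-9.]*\>' > "$2"
}

# Illustrative output location; any writable path works.
out="${TMPDIR:-/tmp}/output_file"
write_words "This is for example" "$out"
cat "$out"
```

The file then holds This, is, for and example on four separate lines. Use >> instead of > inside the function if you would rather append than overwrite.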
    
