Question
Okay, so I wish to keep lines that contain any of several keywords.
Example of the list:
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
What I want to do is extract lines if they contain any of #registered, #subscribed, #phonever.
Example of the output I want:
Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever
Answer 1:
With awk (use the regex alternation operator, |, on a list of fixed strings):
awk '/#registered|#subscribed|#phonever/' file
The part between the slashes, /.../, is called an awk pattern, and for the matching lines awk executes the action that follows (written as { ... }). But since the default action is { print $0 } (printing the complete input record/line), there's no need to specify it here.
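To make the default-action point concrete, here is a small sketch that runs the pattern-only form and the form with the action spelled out, and keeps both results for comparison. The temporary directory and the variable names (tmpdir, implicit, explicit) are just illustrative choices, not part of the original answer.

```shell
# Recreate the sample input from the question in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file" <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# Pattern only: awk falls back to the default action, { print $0 }.
implicit=$(awk '/#registered|#subscribed|#phonever/' "$tmpdir/file")

# Same pattern with the action written out explicitly.
explicit=$(awk '/#registered|#subscribed|#phonever/ { print $0 }' "$tmpdir/file")

# Both variables should hold the same two lines.
printf '%s\n' "$implicit"

rm -rf "$tmpdir"
```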
Similarly with sed you could say:
sed -nE '/#registered|#subscribed|#phonever/p' file
but now we have to specify -n to skip printing by default, and print with the p command only those lines that match the pattern (called a sed address). The -E tells sed to use POSIX ERE (extended regex), and we need it here because the default, POSIX BRE (basic regex), does not define the alternation operator.
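A quick way to see what -n changes is to count output lines with and without it. This is only a sketch on the question's sample data; the variable names (with_n, without_n) are made up for the illustration.

```shell
# Recreate the sample input from the question in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file" <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# With -n, only the lines printed by the p command appear.
with_n=$(( $(sed -nE '/#registered|#subscribed|#phonever/p' "$tmpdir/file" | wc -l) ))

# Without -n, sed prints every line once, and p prints matching lines a second time.
without_n=$(( $(sed -E '/#registered|#subscribed|#phonever/p' "$tmpdir/file" | wc -l) ))

echo "with -n: $with_n lines; without -n: $without_n lines"

rm -rf "$tmpdir"
```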
For simple filtering (printing the lines that match some pattern), grep is also an option (and a very fast option at that):
grep -E '#registered|#subscribed|#phonever' file
A bit more general solution (awk with patterns file)
A solution for larger (and possibly dynamic) lists of patterns could be to keep all patterns in a separate file, for example in patterns:
#registered
#subscribed
#phonever
and to use this awk program:
awk 'NR==FNR { pat[$0]=1 } NR>FNR { for (p in pat) if ($0 ~ p) {print;next} }' patterns file
which will first load all patterns into the pat array, and then try to match any of those patterns on each of the lines in file, printing and advancing to the next line on the first match found.
The result is the same:
Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever
but the script now doesn't change for each new set of patterns. Note, however, that this carries a performance penalty (as general solutions usually do). For shorter lists of patterns and smaller files, this shouldn't be a problem.
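One thing to keep in mind is that the ~ operator treats each line of the patterns file as a regular expression. If the entries are meant as plain strings, a variant using awk's index() function compares them literally instead. The following is a sketch of that variant, not the program from the answer; the file locations and the variable name fixed are assumptions for the example.

```shell
# Recreate the patterns file and the sample input in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '#registered' '#subscribed' '#phonever' > "$tmpdir/patterns"
cat > "$tmpdir/file" <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# First file: remember each line as a literal string.
# Second file: index() returns a non-zero position if the string occurs in the line.
fixed=$(awk 'NR==FNR { pat[$0]; next }
             { for (p in pat) if (index($0, p)) { print; next } }' \
        "$tmpdir/patterns" "$tmpdir/file")

printf '%s\n' "$fixed"

rm -rf "$tmpdir"
```

For the three keywords used here the result is the same as with the regex version; the difference only shows up when an entry contains regex metacharacters such as . or *.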
And a much faster variant of the above (grep with fixed-string patterns file)
Building on the approach from above (keeping a list of fixed-string "patterns" in a file), we can actually use grep, which provides a specialized option (-f FILE) for obtaining patterns from a file, one per line. To further speed up the matching, we should also use the -F/--fixed-strings option.
So, this:
grep -Ff patterns file
will be very fast, even with long lists of fixed-string patterns and huge files, and with little memory overhead.
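Put together end to end, the grep variant might look like the sketch below. The temporary directory and the variable name matched are illustrative; the patterns and the input are the ones from the question.

```shell
# Recreate the patterns file and the sample input in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '#registered' '#subscribed' '#phonever' > "$tmpdir/patterns"
cat > "$tmpdir/file" <<'EOF'
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
EOF

# -F: treat every pattern as a literal string, not a regex.
# -f: read the patterns from the given file, one per line.
matched=$(grep -F -f "$tmpdir/patterns" "$tmpdir/file")

printf '%s\n' "$matched"

rm -rf "$tmpdir"
```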
Answer 2:
Simple awk approach:
awk '/#(registered|subscribed|phonever)/' file
The output:
Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever
(registered|subscribed|phonever) - a regexp alternation group that matches any one of several alternative regular expressions
Answer 3:
$ cat tst.awk
NR==FNR {
    strings[$0]
    next
}
{
    for (i=2; i<=NF; i++) {
        if ($i in strings) {
            print
            next
        }
    }
}
$ awk -f tst.awk strings file
Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever
$ cat strings
#registered
#subscribed
#phonever
$ cat file
Name:email:username #registered
Name2:email2:username2
Name3:email3:username3 #registered #subscribed #phonever
Name4:email4:username4 #unconfirmed
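Note that this script compares whole fields against the stored strings, whereas the regex approaches match substrings anywhere in the line. The sketch below contrasts the two on a made-up line whose tag merely starts with one of the keywords; the tag #registeredlater and the names by_regex and by_field are hypothetical and only serve the illustration.

```shell
# Strings file as above, plus an input with a hypothetical near-miss tag.
tmpdir=$(mktemp -d)
printf '%s\n' '#registered' '#subscribed' '#phonever' > "$tmpdir/strings"
cat > "$tmpdir/file" <<'EOF'
Name:email:username #registered
Name5:email5:username5 #registeredlater
EOF

# Regex match: any line containing the text "#registered" qualifies.
by_regex=$(awk '/#registered|#subscribed|#phonever/' "$tmpdir/file")

# Field lookup (same logic as tst.awk): a field must equal a stored string exactly.
by_field=$(awk 'NR==FNR { strings[$0]; next }
                { for (i=2; i<=NF; i++) if ($i in strings) { print; next } }' \
           "$tmpdir/strings" "$tmpdir/file")

printf 'regex:\n%s\nfield:\n%s\n' "$by_regex" "$by_field"

rm -rf "$tmpdir"
```

Which behaviour is preferable depends on whether tags that merely contain a keyword should count as matches.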
If your file was huge, your set of target words relatively small, and speed of execution important to you, then you could do this to generate every possible permutation of every possible non-empty subset of those target words:
$ cat subsets.awk
###################
# Calculate all subsets of a given set, see
# https://en.wikipedia.org/wiki/Power_set
function get_subset(A,subsetNr,numVals, str, sep) {
    while (subsetNr) {
        if (subsetNr%2 != 0) {
            str = str sep A[numVals]
            sep = " "
        }
        numVals--
        subsetNr = int(subsetNr/2)
    }
    return str
}
function get_subsets(A,B, i,lgth) {
    lgth = length(A)
    for (i=1;i<2^lgth;i++) {
        B[get_subset(A,i,lgth)]
    }
}
###################
# Input should be a list of strings
{
    split($0,A)
    delete B
    get_subsets(A,B)
    for (subset in B) {
        print subset
    }
}
.
$ cat permutations.awk
###################
# Calculate all permutations of a set of strings, see
# https://en.wikipedia.org/wiki/Heap%27s_algorithm
function get_perm(A, i, lgth, sep, str) {
    lgth = length(A)
    for (i=1; i<=lgth; i++) {
        str = str sep A[i]
        sep = " "
    }
    return str
}
function swap(A, x, y, tmp) {
    tmp = A[x]
    A[x] = A[y]
    A[y] = tmp
}
function generate(n, A, B, i) {
    if (n == 1) {
        B[get_perm(A)]
    }
    else {
        for (i=1; i <= n; i++) {
            generate(n - 1, A, B)
            if ((n%2) == 0) {
                swap(A, 1, n)
            }
            else {
                swap(A, i, n)
            }
        }
    }
}
function get_perms(A,B) {
    generate(length(A), A, B)
}
###################
# Input should be a list of strings
{
    split($0,A)
    delete B
    get_perms(A,B)
    for (perm in B) {
        print perm
    }
}
.
$ echo '#registered #subscribed #phonever' |
    awk -f subsets.awk |
    awk -f permutations.awk
#registered #subscribed #phonever
#subscribed #phonever #registered
#phonever #subscribed #registered
#phonever #registered #subscribed
#subscribed #registered #phonever
#registered #phonever #subscribed
#phonever
#subscribed
#registered #subscribed
#subscribed #registered
#registered
#registered #phonever
#phonever #registered
#subscribed #phonever
#phonever #subscribed
and then you could make the rest of the processing just a simple hash lookup:
$ echo '#registered #subscribed #phonever' |
    awk -f subsets.awk |
    awk -f permutations.awk |
    awk 'NR==FNR{strings[$0];next} {k=(NF>1?$0:"");sub(/[^ ]+ /,"",k)} k in strings' - file
Name:email:username #registered
Name3:email3:username3 #registered #subscribed #phonever
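The lookup key in that last awk stage is the line with its first space-separated field removed, or an empty string when the line has no tags at all. As a rough illustration, the sketch below prints the key for two of the sample lines; key_of is a hypothetical helper introduced only for this example.

```shell
# Hypothetical helper: print the lookup key the final awk stage derives from one line.
key_of() {
    printf '%s\n' "$1" |
        awk '{ k = (NF>1 ? $0 : ""); sub(/[^ ]+ /, "", k); print k }'
}

# A line with tags: the key is everything after the first field.
key_with_tags=$(key_of 'Name3:email3:username3 #registered #subscribed #phonever')

# A line without tags: NF is 1, so the key is the empty string.
key_no_tags=$(key_of 'Name2:email2:username2')

printf '[%s]\n[%s]\n' "$key_with_tags" "$key_no_tags"
```

Because the key is compared as one whole string, this lookup only matches lines whose tags are exactly some permutation of a subset of the target words, separated by single spaces.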
Source: https://stackoverflow.com/questions/45536224/awk-keep-if-line-contains-example