I need help scanning text files to find all the words between two patterns. For example, given a .sql file, I need to find all the words between 'from' and 'where'.
Sed has this:
sed -n -e '/from/,/where/ p' file.sql
This prints all the lines between a line with a from and a line with a where.
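To see what the range form does in practice, here is a small run against a made-up file (the file contents and the variable names are assumptions for illustration only). Note that sed prints whole lines, and that it only starts looking for the closing where on the line after the one that opened the range:

```shell
# Hypothetical sample input; the last statement has from and where on one line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
select a
from t1
where a > 1
select c from t3 where c = 2
order by c
EOF

# Print every line from one containing "from" through the next one containing "where".
out=$(sed -n -e '/from/,/where/ p' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

The first range prints the from t1 and where a > 1 lines in full. The fourth line opens a new range, but its own where does not close it, so the range runs to the end of the file and order by c is printed as well.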
For something that can include lines that have both from and where:
#!/bin/sed -nf
/from.*where/ {
s/.*\(from.*where\).*/\1/p
d
}
/from/ {
: next
N
/where/ {
s/^[^\n]*\(from.*where\)[^\n]*/\1/p
d
}
$! b next
}
This (written as a sed script) is slightly more complex, and I'll try to explain the details.
The first command is executed on a line that contains both a from and a where. If a line matches that pattern, two commands are executed. We use the s (substitute) command to extract only the part between from and where (including the from and where themselves). The p suffix on that command prints the result. The d (delete) command then clears the pattern space (the working buffer), loads the next line and restarts the script.
The second command starts to execute a series of commands (grouped by the braces) when a line containing from is found. Basically, the commands form a loop that keeps appending lines from the input to the pattern space until a line with a where is found or until we reach the last line.
The : "command" creates a label, a marker in the script that allows us to "jump" back when we want to. The N command reads a line from the input and appends it to the pattern space (separating the lines with a newline character).
When a where is found, we can print out the contents of the pattern space, but first we have to clean it up with the substitute command. It is analogous to the one used previously, but we now replace the leading and trailing .* with [^\n]*, which tells sed to match only non-newline characters, effectively matching a from in the first line and a where in the last line. The d command then clears the pattern space and restarts the script on the next line.
The b command jumps to a label, in our case the label next. However, the $! address says it should not be executed on the last line, allowing us to leave the loop. When leaving the loop this way, we haven't found a matching where, so you may not want to print it.
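As a quick check of the walkthrough, the script can be saved to a file and run with sed -n -f. This is only a sketch: the input is invented, the temporary file names are placeholders, and the [^\n] bracket expression is assumed to be handled the way GNU sed handles it.

```shell
# Save the script from above to a temporary file (same commands, unchanged).
script=$(mktemp)
cat > "$script" <<'EOF'
/from.*where/ {
s/.*\(from.*where\).*/\1/p
d
}
/from/ {
: next
N
/where/ {
s/^[^\n]*\(from.*where\)[^\n]*/\1/p
d
}
$! b next
}
EOF

# Hypothetical input: one clause spread over three lines, one on a single line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
select a, b from t1
join t2 on t1.id = t2.id
where a > 1 order by b
select c from t3 where c = 2
EOF

out=$(sed -n -f "$script" "$sample")
printf '%s\n' "$out"
rm -f "$script" "$sample"
```

The multi-line clause comes out as three lines (from t1, the join line, and a bare where), followed by from t3 where for the single-line statement.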
Note, however, that this has some drawbacks. The following cases won't be handled as expected:
from ... where ... from
from ... from
where
from
where ... where
from
from
where
where
Handling these cases requires more code.
Hope this helps =)
You could use ed for this; it allows positive and negative offsets on the addresses of a regex range. If the input is:
seq 10 | tee infile
1
2
3
4
5
6
7
8
9
10
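Pipe the command in to ed (the bare <<< form used here is zsh syntax; in other shells, something like echo '/3/,/6/p' | ed -s infile does the same):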
<<< /3/,/6/p | ed -s infile
i.e. print everything between the lines containing 3 and 6, inclusive.
Result:
3
4
5
6
To get one more line at each end:
<<< /3/-1,/6/+1p | ed -s infile
Result:
2
3
4
5
6
7
Or the other way around:
<<< /3/+1,/6/-1p | ed -s infile
Result:
4
5
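Where ed is not installed, the offset idea can be approximated in awk. This is a rough sketch and not an ed equivalent: range_with_offsets is a made-up helper name, it reads the file twice, and it anchors on the first line matching each pattern, which happens to agree with ed for the example above.

```shell
# range_with_offsets START_RE START_OFFSET END_RE END_OFFSET FILE
# (hypothetical helper, not part of any tool)
range_with_offsets() {
    awk -v s="$1" -v so="$2" -v e="$3" -v eo="$4" '
        NR == FNR {                          # first pass: locate the anchor lines
            if (!first && $0 ~ s) first = FNR
            if (!last  && $0 ~ e) last  = FNR
            next
        }
        # second pass: print the range, shifted by the two offsets
        first && last && FNR >= first + so && FNR <= last + eo
    ' "$5" "$5"
}

# Same ten-line input as above, written without seq.
infile=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$infile"

range_with_offsets 3 -1 6 1 "$infile"    # one extra line at each end
range_with_offsets 3 1 6 -1 "$infile"    # one line less at each end
```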
With GNU awk, which lets you set RS to a regular expression:
gawk -v RS='[[:space:]]+' '
/where/ { found=0 }
found { print }
/from/ { found=1 }
' file
The above assumes you do not want the "from" and "where" themselves printed; move the lines around if you need otherwise.
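If relying on a regular-expression RS is not an option, a similar result can be had by looping over the whitespace-separated fields of each line. The sketch below is a variation, not the script above: words_between is a hypothetical name, and it compares whole words case-insensitively rather than using a regex match, which is my assumption about what is wanted for SQL keywords.

```shell
# Print each word that appears after a "from" and before the next "where".
words_between() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = tolower($i)
            if (w == "where") {
                inside = 0           # stop before printing the closing keyword
            } else if (inside) {
                print $i
            }
            if (w == "from") {
                inside = 1           # start with the word after the opening keyword
            }
        }
    }' "$@"
}

printf 'select a from t1, t2 where x = 1\nselect b FROM\n  t3\nWHERE y\n' | words_between
```

Because the inside flag persists across lines, a clause that spans several lines is handled the same way as one on a single line.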
In case it helps, the following idioms describe how to select a range of records given a specific pattern to match:
a) Print all records from some pattern:
awk '/pattern/{f=1}f' file
b) Print all records after some pattern:
awk 'f;/pattern/{f=1}' file
c) Print the Nth record after some pattern:
awk 'c&&!--c;/pattern/{c=N}' file
d) Print every record except the Nth record after some pattern:
awk 'c&&!--c{next}/pattern/{c=N}1' file
e) Print the N records after some pattern:
awk 'c&&c--;/pattern/{c=N}' file
f) Print every record except the N records after some pattern:
awk 'c&&c--{next}/pattern/{c=N}1' file
g) Print the N records from some pattern:
awk '/pattern/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where appropriate as that's more expressive of what the variable actually IS.
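A small run of three of these idioms, with N replaced by 2, may make the differences easier to see. The six-line input and the variable names are invented for the illustration.

```shell
# Hypothetical input with the pattern on the third line.
input=$(printf '%s\n' one two MATCH four five six)

# c) only the 2nd record after the match
nth=$(printf '%s\n' "$input" | awk 'c&&!--c;/MATCH/{c=2}')

# e) the 2 records after the match
after=$(printf '%s\n' "$input" | awk 'c&&c--;/MATCH/{c=2}')

# g) 2 records starting at the match itself
from_match=$(printf '%s\n' "$input" | awk '/MATCH/{c=2}c&&c--')

printf 'c:\n%s\ne:\n%s\ng:\n%s\n' "$nth" "$after" "$from_match"
```

Here c) yields five, e) yields four and five, and g) yields MATCH and four.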
To return just a string within two given strings, along the lines of awk (without getting crazy), I just run this very flat script, verbosity in tow:
.\gnucoreutils\bin\awk "{startstring = \"RETURN STUFF AFTER ME \"; endstring = \"RETURN STUFF BEFORE ME\"; endofstartstring = index($0,startstring)+length(startstring); print substr($0,endofstartstring,index($0,endstring)-endofstartstring)}" /dev/stdin
Note that I am using cmd.exe (the command interpreter that comes with Windows) and the gnuwin32 awk, so mind the "double quotes" and the \ escape characters:
GNU Awk 3.1.6
Copyright (C) 1989, 1991-2007 Free Software Foundation.
Please point out flaws.
example:
echo "hello. RETURN STUFF AFTER ME i get returned RETURN STUFF BEFORE ME my face is melting" | .\gnucoreutils\bin\awk "{startstring = \"RETURN STUFF AFTER ME \"; endstring = \" RETURN STUFF BEFORE ME\"; endofstartstring = index($0,startstring)+length(startstring); print substr($0,endofstartstring,index($0,endstring)-endofstartstring)}" /dev/stdin
i get returned
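For a POSIX shell the same index/substr idea can be written with single quotes, which avoids most of the escaping. This is a sketch under assumptions: between is a hypothetical function name, and the strings are passed with awk -v, which interprets backslash escapes in the values. It also looks for the end string only after the start string, which covers one possible flaw in the flat version, where an earlier occurrence of the end string gives an empty result.

```shell
# between START END: print the text between START and the next END on each line.
between() {
    awk -v startstr="$1" -v endstr="$2" '{
        s = index($0, startstr)
        if (!s) next                      # no start string on this line
        rest = substr($0, s + length(startstr))
        e = index(rest, endstr)           # search only after the start string
        if (e) print substr(rest, 1, e - 1)
    }'
}

echo "hello. RETURN STUFF AFTER ME i get returned RETURN STUFF BEFORE ME my face is melting" |
    between "RETURN STUFF AFTER ME " " RETURN STUFF BEFORE ME"
```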
I was able to accomplish this using just grep:
#> grep -A#### "start pattern" file | grep -B#### "end pattern"
The catch is that I had to find the right number of lines to pass to the -A and -B options (I used the same value for both). Hope this helps.
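A worked run of the two-grep pipeline, with a made-up file and a line count n chosen for the illustration:

```shell
# Hypothetical eight-line file with the two patterns three lines apart.
log=$(mktemp)
printf '%s\n' 'line 1' 'line 2' 'start pattern' 'line 4' 'line 5' \
    'end pattern' 'line 7' 'line 8' > "$log"

n=5   # assumed to be at least the distance between the two patterns

# First grep keeps the start line plus n lines after it;
# second grep keeps the end line plus up to n lines before it.
out=$(grep -A "$n" "start pattern" "$log" | grep -B "$n" "end pattern")
printf '%s\n' "$out"
rm -f "$log"
```

An oversized n is harmless here because each pattern occurs only once; if n is smaller than the distance between the patterns, the second grep never sees the end line and prints nothing.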