I have a file named data_file with data: london paris newyork italy...50 more items
I have a directory with over 75 files, say dfile1, dfile2...dfile75, in which I am perform
You could use grep's -q option to stop searching after the first match and its -f option to obtain the patterns from a file:
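# Read NUL-delimited names so files with spaces or newlines are handled correctly
find . -type f -print0 | while IFS= read -r -d '' f; do
    if grep -qf data_file "$f"; then
        ...
    fi
done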
If data_file contains:
xxx
yyy
zzz
then grep -qf data_file "$f" evaluates to true if any of xxx, yyy, or zzz is found in $f.
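To see this behaviour in isolation, here is a small self-contained sketch. The temporary directory, the file names a.txt and b.txt, and their contents are made up for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: grep -qf succeeds as soon as one pattern from the file matches.
tmp=$(mktemp -d)
printf '%s\n' xxx yyy zzz > "$tmp/data_file"   # one pattern per line
printf 'contains yyy here\n' > "$tmp/a.txt"    # matches the second pattern
printf 'nothing relevant\n'  > "$tmp/b.txt"    # matches no pattern

matched=""
for f in "$tmp/a.txt" "$tmp/b.txt"; do
    if grep -qf "$tmp/data_file" "$f"; then
        matched="$matched${f##*/} "            # record the base name only
    fi
done
echo "matched: $matched"
rm -rf "$tmp"
```

Note that an empty line in the pattern file matches every line, so keep data_file free of blank lines. If the entries are plain words rather than regular expressions, adding -F makes grep treat them as fixed strings.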
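You can do it like this:
# Process substitution keeps the loop in the current shell, so the final wait sees the background jobs
while IFS= read -r -d '' f; do
    while IFS= read -r line; do
        {
            if grep -q -- "$line" "$f"; then
                : # perform task here
            fi
        } &
    done < data_file
done < <(find . -type f -print0)
wait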
This executes the block within {} in the background, so it opens one background process for every combination of file and pattern. If you want finer control over how many processes are actually spawned, you can use GNU parallel instead.
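If GNU parallel is not available, the number of background jobs can also be capped in plain bash. The following is only a sketch under stated assumptions: it needs bash 4.3 or later for `wait -n`, and `run_limited`, `max_jobs` and the sample tasks are hypothetical names invented for the example.

```shell
#!/usr/bin/env bash
# Sketch: never run more than max_jobs background jobs at once (bash >= 4.3).
max_jobs=4          # hypothetical limit
running=0           # number of jobs started and not yet waited for

run_limited() {
    if (( running >= max_jobs )); then
        wait -n                      # returns as soon as any one background job exits
        running=$((running - 1))
    fi
    "$@" &                           # start the requested command in the background
    running=$((running + 1))
}

workdir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
    # Made-up task: sleep briefly, then leave a marker file
    run_limited sh -c "sleep 0.1; echo done > '$workdir/task$i'"
done
wait                                 # wait for the remaining jobs

files=("$workdir"/task*)
completed=${#files[@]}
echo "completed tasks: $completed"
rm -rf "$workdir"
```

In the loop from the answer, `run_limited` would wrap the grep-and-task block instead of the bare `{ ... } &`.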
The following example is a full blown parallel execution method, that deals with:
In your example, your (hardened) code would look like:
# Load the ExecTasks function described below (must be in the same directory as this script)
source ./exectasks.sh
directoryToProcess="/my/dir/to/find/stuff/into"
tasklist=""
# Prepare the task list, separated by semicolons
while IFS= read -r -d $'\0' file; do
    # Queue a task for every file that contains at least one pattern from data_file
    if grep -qf data_file "$file"; then
        tasklist="${tasklist}my_task;"
    fi
done < <(find "$directoryToProcess" -type f -print0)
# Run tasks (the id becomes part of a bash variable name, so it must not contain '-')
ExecTasks "$tasklist" "trivial_task_id" false 1800 3600 18000 36000 true 1 1800 true false false 8
Here we used a complex function, ExecTasks, that queues the tasks, runs them in parallel, and lets you keep control of what's going on without fear of blocking the script because of a hung task.
Quick explanation of the ExecTasks arguments:
"$tasklist" = variable containing the task list
second argument = task id, used to identify the run in logs; it is also used to build bash variable names, so it should not contain '-'
boolean: read tasks from a file (true) or from the variable (false); a file is useful when there are too many tasks to fit in a variable
1800 = maximum number of seconds a task may run before a warning is raised
3600 = maximum number of seconds a task may run before an error is raised and the task is stopped
18000 = maximum number of seconds the whole set of tasks may run before a warning is raised
36000 = maximum number of seconds the whole set of tasks may run before an error is raised and all the tasks are stopped
boolean: count execution time from the beginning of task execution (true) or from the beginning of the script (false)
1 = number of seconds between each state check (accepts floats such as .1)
1800 = number of seconds between each "I am alive" log, just to know everything works as expected
boolean: show spinner (true) or not (false)
boolean: log errors when reaching max times (false) or do not log them (true)
boolean: log errors (false) or do not log any errors at all (true)
And finally
8 = number of simultaneous tasks to launch (8 in our case)
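Since the task list is a single semicolon-separated string, it may help to see how such a string is split. ExecTasks uses `IFS=';' read -r -a` for this; the sketch below applies the same call to a made-up list (the task names alpha, beta and gamma are hypothetical).

```shell
#!/usr/bin/env bash
# Sketch: build a semicolon-separated task list and split it into an array.
tasklist=""
for name in alpha beta gamma; do        # hypothetical task names
    tasklist="${tasklist}echo ${name};"
done

# IFS is changed for this read only; the trailing ';' does not add an empty element
IFS=';' read -r -a tasks <<< "$tasklist"

echo "number of tasks: ${#tasks[@]}"
for t in "${tasks[@]}"; do
    echo "task: [$t]"
done
```

Two consequences follow from this splitting: a command that itself contains a semicolon is cut into two tasks, and because read stops at the first newline, the list has to stay on one line.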
Here's the source of exectasks.sh (which you can also copy and paste directly into your script header instead of using source ./exectasks.sh):
function Logger {
# Dummy log function, replace with whatever you need
echo "$2: $1"
}
# Nice cli spinner so we know execution is ongoing
_OFUNCTIONS_SPINNER="|/-\\"
function Spinner {
printf " [%c] \b\b\b\b\b\b" "$_OFUNCTIONS_SPINNER"
_OFUNCTIONS_SPINNER=${_OFUNCTIONS_SPINNER#?}${_OFUNCTIONS_SPINNER%%???}
return 0
}
# Portable child (and grandchild) kill function tested under Linux, BSD and MacOS X
function KillChilds {
local pid="${1}" # Parent pid to kill childs
local self="${2:-false}" # Should parent be killed too ?
# Paranoid checks, we can safely assume that $pid should not be 0 nor 1
if [ $(IsInteger "$pid") -eq 0 ] || [ "$pid" == "" ] || [ "$pid" == "0" ] || [ "$pid" == "1" ]; then
Logger "Bogus pid given [$pid]." "CRITICAL"
return 1
fi
if kill -0 "$pid" > /dev/null 2>&1; then
if children="$(pgrep -P "$pid")"; then
if [[ "$pid" == *"$children"* ]]; then
Logger "Bogus pgrep implementation." "CRITICAL"
children="${children/$pid/}"
fi
for child in $children; do
Logger "Launching KillChilds \"$child\" true" "DEBUG" #__WITH_PARANOIA_DEBUG
KillChilds "$child" true
done
fi
fi
# Try to kill nicely, if not, wait 15 seconds to let Trap actions happen before killing
if [ "$self" == true ]; then
# We need to check for pid again because it may have disappeared after recursive function call
if kill -0 "$pid" > /dev/null 2>&1; then
# Test kill's own exit code; checking $? after a Logger call would test Logger instead
if kill -s TERM "$pid"; then
Logger "Sent SIGTERM to process [$pid]." "DEBUG"
return 0
else
Logger "Sending SIGTERM to process [$pid] failed." "DEBUG"
sleep 15
if ! kill -9 "$pid"; then
Logger "Sending SIGKILL to process [$pid] failed." "DEBUG"
return 1
fi
return 0
fi
else
return 0
fi
else
return 0
fi
}
function ExecTasks {
# Mandatory arguments
local mainInput="${1}" # Contains list of pids / commands separated by semicolons or filepath to list of pids / commands
# Optional arguments
local id="${2:-base}" # Optional ID in order to identify global variables from this run (only bash variable names, no '-'). Global variables are WAIT_FOR_TASK_COMPLETION_$id and HARD_MAX_EXEC_TIME_REACHED_$id
local readFromFile="${3:-false}" # Is mainInput / auxInput a filepath (true) or a semicolon separated list (false)
local softPerProcessTime="${4:-0}" # Max time (in seconds) a pid or command can run before a warning is logged, unless set to 0
local hardPerProcessTime="${5:-0}" # Max time (in seconds) a pid or command can run before the given command / pid is stopped, unless set to 0
local softMaxTime="${6:-0}" # Max time (in seconds) for the whole function to run before a warning is logged, unless set to 0
local hardMaxTime="${7:-0}" # Max time (in seconds) for the whole function to run before all pids / commands given are stopped, unless set to 0
local counting="${8:-true}" # Should softMaxTime and hardMaxTime be accounted since function begin (true) or since script begin (false)
local sleepTime="${9:-.5}" # Seconds between each state check. The shorter the value, the snappier ExecTasks will be, but as a tradeoff, more cpu power will be used (good values are between .05 and 1)
local keepLogging="${10:-1800}" # Every keepLogging seconds, an alive message is logged. Setting this value to zero disables any alive logging
local spinner="${11:-true}" # Show spinner (true) or do not show anything (false) while running
local noTimeErrorLog="${12:-false}" # Log errors when reaching soft / hard execution times (false) or do not log errors on those triggers (true)
local noErrorLogsAtAll="${13:-false}" # Do not log any errors at all (useful for recursive ExecTasks checks)
# Parallelism specific arguments
local numberOfProcesses="${14:-0}" # Number of simultaneous commands to run, given as mainInput. Set to 0 by default (WaitForTaskCompletion mode). Setting this value enables ParallelExec mode.
local auxInput="${15}" # Contains list of commands separated by semicolons or filepath to list of commands. Exit code of those commands decides whether main commands will be executed or not
local maxPostponeRetries="${16:-3}" # If a conditional command fails, how many times shall we try to postpone the associated main command. Set this to 0 to disable postponing
local minTimeBetweenRetries="${17:-300}" # Time (in seconds) between postponed command retries
local validExitCodes="${18:-0}" # Semi colon separated list of valid main command exit codes which will not trigger errors
local i
# Expand validExitCodes into array
IFS=';' read -r -a validExitCodes <<< "$validExitCodes"
# ParallelExec specific variables
local auxItemCount=0 # Number of conditional commands
local commandsArray=() # Array containing commands
local commandsConditionArray=() # Array containing conditional commands
local currentCommand # Variable containing currently processed command
local currentCommandCondition # Variable containing currently processed conditional command
local commandsArrayPid=() # Array containing commands indexed by pids
local commandsArrayOutput=() # Array containing command results indexed by pids
local postponedRetryCount=0 # Number of current postponed commands retries
local postponedItemCount=0 # Number of commands that have been postponed (keep at least one in order to check once)
local postponedCounter=0
local isPostponedCommand=false # Is the current command from a postponed file ?
local postponedExecTime=0 # How much time has passed since last postponed condition was checked
local needsPostponing # Does currentCommand need to be postponed
local temp
# Common variables
local pid # Current pid working on
local pidState # State of the process
local mainItemCount=0 # number of given items (pids or commands)
local readFromFile # Should we read pids / commands from a file (true)
local counter=0
local log_ttime=0 # local time instance for comparison
local seconds_begin=$SECONDS # Seconds since the beginning of the script
local exec_time=0 # Seconds since the beginning of this function
local retval=0 # return value of monitored pid process
local subRetval=0 # return value of condition commands
local errorcount=0 # Number of pids that finished with errors
local pidsArray # Array of currently running pids
local newPidsArray # New array of currently running pids for next iteration
local pidsTimeArray # Array containing execution begin time of pids
local executeCommand # Boolean to check if currentCommand can be executed given a condition
local functionMode
local softAlert=false # Does a soft alert need to be triggered, if yes, send an alert once
local failedPidsList # List containing failed pids with exit code separated by semicolons (eg : 2355:1;4534:2;2354:3)
local randomOutputName # Random filename for command outputs
local currentRunningPids # String of pids running, used for debugging purposes only
# fnver 2019081401
# Initialise global variable
eval "WAIT_FOR_TASK_COMPLETION_$id=\"\""
eval "HARD_MAX_EXEC_TIME_REACHED_$id=false"
# Init function variables depending on mode
if [ $numberOfProcesses -gt 0 ]; then
functionMode=ParallelExec
else
functionMode=WaitForTaskCompletion
fi
if [ $readFromFile == false ]; then
if [ $functionMode == "WaitForTaskCompletion" ]; then
IFS=';' read -r -a pidsArray <<< "$mainInput"
mainItemCount="${#pidsArray[@]}"
else
IFS=';' read -r -a commandsArray <<< "$mainInput"
mainItemCount="${#commandsArray[@]}"
IFS=';' read -r -a commandsConditionArray <<< "$auxInput"
auxItemCount="${#commandsConditionArray[@]}"
fi
else
if [ -f "$mainInput" ]; then
mainItemCount=$(wc -l < "$mainInput")
readFromFile=true
else
Logger "Cannot read main file [$mainInput]." "WARN"
fi
if [ "$auxInput" != "" ]; then
if [ -f "$auxInput" ]; then
auxItemCount=$(wc -l < "$auxInput")
else
Logger "Cannot read aux file [$auxInput]." "WARN"
fi
fi
fi
if [ $functionMode == "WaitForTaskCompletion" ]; then
# Force first while loop condition to be true because we don't deal with counters but pids in WaitForTaskCompletion mode
counter=$mainItemCount
fi
# soft / hard execution time checks that needs to be a subfunction since it is called both from main loop and from parallelExec sub loop
function _ExecTasksTimeCheck {
if [ $spinner == true ]; then
Spinner
fi
if [ $counting == true ]; then
exec_time=$((SECONDS - seconds_begin))
else
exec_time=$SECONDS
fi
if [ $keepLogging -ne 0 ]; then
# This log solely exists for readability purposes before having next set of logs
if [ ${#pidsArray[@]} -eq $numberOfProcesses ] && [ $log_ttime -eq 0 ]; then
log_ttime=$exec_time
Logger "There are $((mainItemCount-counter+postponedItemCount)) / $mainItemCount tasks in the queue of which $postponedItemCount are postponed. Currently, ${#pidsArray[@]} tasks running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
fi
if [ $(((exec_time + 1) % keepLogging)) -eq 0 ]; then
if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1 second
log_ttime=$exec_time
if [ $functionMode == "WaitForTaskCompletion" ]; then
Logger "Current tasks still running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
elif [ $functionMode == "ParallelExec" ]; then
Logger "There are $((mainItemCount-counter+postponedItemCount)) / $mainItemCount tasks in the queue of which $postponedItemCount are postponed. Currently, ${#pidsArray[@]} tasks running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
fi
fi
fi
fi
if [ $exec_time -gt $softMaxTime ]; then
if [ "$softAlert" != true ] && [ $softMaxTime -ne 0 ] && [ $noTimeErrorLog != true ]; then
Logger "Max soft execution time [$softMaxTime] exceeded for task [$id] with pids [$(joinString , ${pidsArray[@]})]." "WARN"
softAlert=true
SendAlert true
fi
fi
if [ $exec_time -gt $hardMaxTime ] && [ $hardMaxTime -ne 0 ]; then
if [ $noTimeErrorLog != true ]; then
Logger "Max hard execution time [$hardMaxTime] exceeded for task [$id] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution." "ERROR"
fi
for pid in "${pidsArray[@]}"; do
KillChilds $pid true
if [ $? -eq 0 ]; then
Logger "Task with pid [$pid] stopped successfully." "NOTICE"
else
if [ $noErrorLogsAtAll != true ]; then
Logger "Could not stop task with pid [$pid]." "ERROR"
fi
fi
errorcount=$((errorcount+1))
done
if [ $noTimeErrorLog != true ]; then
SendAlert true
fi
eval "HARD_MAX_EXEC_TIME_REACHED_$id=true"
if [ $functionMode == "WaitForTaskCompletion" ]; then
return $errorcount
else
return 129
fi
fi
}
function _ExecTasksPidsCheck {
newPidsArray=()
if [ "$currentRunningPids" != "$(joinString " " ${pidsArray[@]})" ]; then
Logger "ExecTask running for pids [$(joinString " " ${pidsArray[@]})]." "DEBUG"
currentRunningPids="$(joinString " " ${pidsArray[@]})"
fi
for pid in "${pidsArray[@]}"; do
if [ $(IsInteger $pid) -eq 1 ]; then
if kill -0 $pid > /dev/null 2>&1; then
# Handle uninterruptible sleep state or zombies by omitting them from the running process array (how to kill what is already dead? :)
pidState="$(eval $PROCESS_STATE_CMD)"
if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
# Check if pid hasn't run more than soft/hard perProcessTime
pidsTimeArray[$pid]=$((SECONDS - seconds_begin))
if [ ${pidsTimeArray[$pid]} -gt $softPerProcessTime ]; then
if [ "$softAlert" != true ] && [ $softPerProcessTime -ne 0 ] && [ $noTimeErrorLog != true ]; then
Logger "Max soft execution time [$softPerProcessTime] exceeded for pid [$pid]." "WARN"
if [ "${commandsArrayPid[$pid]}" != "" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]." "WARN"
fi
softAlert=true
SendAlert true
fi
fi
if [ ${pidsTimeArray[$pid]} -gt $hardPerProcessTime ] && [ $hardPerProcessTime -ne 0 ]; then
if [ $noTimeErrorLog != true ] && [ $noErrorLogsAtAll != true ]; then
Logger "Max hard execution time [$hardPerProcessTime] exceeded for pid [$pid]. Stopping command execution." "ERROR"
if [ "${commandsArrayPid[$pid]}" != "" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]." "WARN"
fi
fi
KillChilds $pid true
if [ $? -eq 0 ]; then
Logger "Command with pid [$pid] stopped successfully." "NOTICE"
else
if [ $noErrorLogsAtAll != true ]; then
Logger "Could not stop command with pid [$pid]." "ERROR"
fi
fi
errorcount=$((errorcount+1))
if [ $noTimeErrorLog != true ]; then
SendAlert true
fi
fi
newPidsArray+=($pid)
fi
else
# pid is dead, get its exit code from wait command
wait $pid
retval=$?
# Check for valid exit codes
if [ $(ArrayContains $retval "${validExitCodes[@]}") -eq 0 ]; then
if [ $noErrorLogsAtAll != true ]; then
Logger "${FUNCNAME[0]} called by [$id] finished monitoring pid [$pid] with exitcode [$retval]." "ERROR"
if [ "$functionMode" == "ParallelExec" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]." "ERROR"
fi
if [ -f "${commandsArrayOutput[$pid]}" ]; then
Logger "Truncated output:\n$(head -c16384 "${commandsArrayOutput[$pid]}")" "ERROR"
fi
fi
errorcount=$((errorcount+1))
# Welcome to variable variable bash hell
if [ "$failedPidsList" == "" ]; then
failedPidsList="$pid:$retval"
else
failedPidsList="$failedPidsList;$pid:$retval"
fi
else
Logger "${FUNCNAME[0]} called by [$id] finished monitoring pid [$pid] with exitcode [$retval]." "DEBUG"
fi
fi
fi
done
# hasPids can be false on last iteration in ParallelExec mode
pidsArray=("${newPidsArray[@]}")
# Trivial wait time for bash to not eat up all CPU
sleep $sleepTime
}
while [ ${#pidsArray[@]} -gt 0 ] || [ $counter -lt $mainItemCount ] || [ $postponedItemCount -ne 0 ]; do
_ExecTasksTimeCheck
retval=$?
if [ $retval -ne 0 ]; then
return $retval;
fi
# The following execution block is only needed in ParallelExec mode since WaitForTaskCompletion does not execute commands, but only monitors them
if [ $functionMode == "ParallelExec" ]; then
while [ ${#pidsArray[@]} -lt $numberOfProcesses ] && ([ $counter -lt $mainItemCount ] || [ $postponedItemCount -ne 0 ]); do
_ExecTasksTimeCheck
retval=$?
if [ $retval -ne 0 ]; then
return $retval;
fi
executeCommand=false
isPostponedCommand=false
currentCommand=""
currentCommandCondition=""
needsPostponing=false
if [ $readFromFile == true ]; then
# awk identifies first line as 1 instead of 0 so we need to increase counter
currentCommand=$(awk 'NR == num_line {print; exit}' num_line=$((counter+1)) "$mainInput")
if [ $auxItemCount -ne 0 ]; then
currentCommandCondition=$(awk 'NR == num_line {print; exit}' num_line=$((counter+1)) "$auxInput")
fi
# Check if we need to fetch postponed commands
if [ "$currentCommand" == "" ]; then
currentCommand=$(awk 'NR == num_line {print; exit}' num_line=$((postponedCounter+1)) "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedMain.$id.$SCRIPT_PID.$TSTAMP")
currentCommandCondition=$(awk 'NR == num_line {print; exit}' num_line=$((postponedCounter+1)) "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedAux.$id.$SCRIPT_PID.$TSTAMP")
isPostponedCommand=true
fi
else
currentCommand="${commandsArray[$counter]}"
if [ $auxItemCount -ne 0 ]; then
currentCommandCondition="${commandsConditionArray[$counter]}"
fi
if [ "$currentCommand" == "" ]; then
currentCommand="${postponedCommandsArray[$postponedCounter]}"
currentCommandCondition="${postponedCommandsConditionArray[$postponedCounter]}"
isPostponedCommand=true
fi
fi
# Check if we execute postponed commands, or if we delay them
if [ $isPostponedCommand == true ]; then
# Get first value before '@'
postponedExecTime="${currentCommand%%@*}"
postponedExecTime=$((SECONDS-postponedExecTime))
# Get everything after first '@'
temp="${currentCommand#*@}"
# Get first value before '@'
postponedRetryCount="${temp%%@*}"
# Replace currentCommand with actual filtered currentCommand
currentCommand="${temp#*@}"
# Since we read a postponed command, we may decrease postponedItemCount
postponedItemCount=$((postponedItemCount-1))
#Since we read one line, we need to increase the counter
postponedCounter=$((postponedCounter+1))
else
postponedRetryCount=0
postponedExecTime=0
fi
if ([ $postponedRetryCount -lt $maxPostponeRetries ] && [ $postponedExecTime -ge $minTimeBetweenRetries ]) || [ $isPostponedCommand == false ]; then
if [ "$currentCommandCondition" != "" ]; then
Logger "Checking condition [$currentCommandCondition] for command [$currentCommand]." "DEBUG"
eval "$currentCommandCondition" &
ExecTasks $! "subConditionCheck" false 0 0 1800 3600 true $SLEEP_TIME $KEEP_LOGGING true true true
subRetval=$?
if [ $subRetval -ne 0 ]; then
# is postponing enabled ?
if [ $maxPostponeRetries -gt 0 ]; then
Logger "Condition [$currentCommandCondition] not met for command [$currentCommand]. Exit code [$subRetval]. Postponing command." "NOTICE"
postponedRetryCount=$((postponedRetryCount+1))
if [ $postponedRetryCount -ge $maxPostponeRetries ]; then
Logger "Max retries reached for postponed command [$currentCommand]. Skipping command." "NOTICE"
else
needsPostponing=true
fi
postponedExecTime=0
else
Logger "Condition [$currentCommandCondition] not met for command [$currentCommand]. Exit code [$subRetval]. Ignoring command." "NOTICE"
fi
else
executeCommand=true
fi
else
executeCommand=true
fi
else
needsPostponing=true
fi
if [ $needsPostponing == true ]; then
postponedItemCount=$((postponedItemCount+1))
if [ $readFromFile == true ]; then
echo "$((SECONDS-postponedExecTime))@$postponedRetryCount@$currentCommand" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedMain.$id.$SCRIPT_PID.$TSTAMP"
echo "$currentCommandCondition" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedAux.$id.$SCRIPT_PID.$TSTAMP"
else
postponedCommandsArray+=("$((SECONDS-postponedExecTime))@$postponedRetryCount@$currentCommand")
postponedCommandsConditionArray+=("$currentCommandCondition")
fi
fi
if [ $executeCommand == true ]; then
Logger "Running command [$currentCommand]." "DEBUG"
randomOutputName=$(date '+%Y%m%dT%H%M%S').$(PoorMansRandomGenerator 5)
# Build the output path before launching: the new pid is not known yet, so it cannot be part of the file name
temp="$RUN_DIR/$PROGRAM.${FUNCNAME[0]}.$id.$randomOutputName.$SCRIPT_PID.$TSTAMP"
eval "$currentCommand" >> "$temp" 2>&1 &
pid=$!
pidsArray+=($pid)
commandsArrayPid[$pid]="$currentCommand"
commandsArrayOutput[$pid]="$temp"
# Initialize pid execution time array
pidsTimeArray[$pid]=0
else
Logger "Skipping command [$currentCommand]." "DEBUG"
fi
if [ $isPostponedCommand == false ]; then
counter=$((counter+1))
fi
_ExecTasksPidsCheck
done
fi
_ExecTasksPidsCheck
done
# Return exit code if only one process was monitored, else return number of errors
# As we cannot return multiple values, a global variable WAIT_FOR_TASK_COMPLETION contains all pids with their return value
eval "WAIT_FOR_TASK_COMPLETION_$id=\"$failedPidsList\""
if [ $mainItemCount -eq 1 ]; then
return $retval
else
return $errorcount
fi
}
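The listing above calls several helpers and reads several global variables that it does not define (IsInteger, ArrayContains, joinString, PoorMansRandomGenerator, SendAlert, PROCESS_STATE_CMD, RUN_DIR, PROGRAM, SCRIPT_PID, TSTAMP, SLEEP_TIME, KEEP_LOGGING). What follows is a minimal sketch of stand-ins, not the author's originals: their behaviour is inferred from how ExecTasks and KillChilds use them, and all values are assumptions to adapt.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; define these before calling ExecTasks.
RUN_DIR="${TMPDIR:-/tmp}"            # assumed location for output and postponed-task files
PROGRAM="myscript"                   # assumed prefix for those file names
SCRIPT_PID=$$
TSTAMP=$(date '+%Y%m%dT%H%M%S')
SLEEP_TIME=.5                        # used by the nested ExecTasks condition check
KEEP_LOGGING=1800
# Evaluated with $pid set; should print a one-letter process state such as D or Z
PROCESS_STATE_CMD='ps -p "$pid" -o stat= | tr -d " " | cut -c1'

# Echo 1 if the argument is a non-negative integer, 0 otherwise
function IsInteger {
    if [[ "$1" =~ ^[0-9]+$ ]]; then echo 1; else echo 0; fi
}

# Echo 1 if the first argument equals one of the remaining arguments, 0 otherwise
function ArrayContains {
    local needle="$1" item
    shift
    for item in "$@"; do
        if [ "$item" == "$needle" ]; then echo 1; return 0; fi
    done
    echo 0
}

# Join all remaining arguments, using the first argument as separator
function joinString {
    local sep="$1" out="" item
    shift
    for item in "$@"; do
        out="${out:+$out$sep}$item"
    done
    echo "$out"
}

# Echo a string of N random digits (default 5); not suitable for security purposes
function PoorMansRandomGenerator {
    local digits="${1:-5}" out="" i
    for ((i = 0; i < digits; i++)); do
        out="$out$((RANDOM % 10))"
    done
    echo "$out"
}

# Placeholder: hook your own mail or chat notification in here
function SendAlert {
    :
}
```

These can be placed at the top of exectasks.sh, above Logger, so that sourcing the file defines everything ExecTasks needs.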
Hope you have fun.
The find command will slow things down and the script is more complicated than it needs to be.
If you want to do this with grep, it is better to loop through data_file and, within that loop, run grep -q -- "$line" * && do_something (or grep -qR -- "$line" * && do_something if there are subdirectories to deal with).
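Spelled out as a script, that approach could look like the sketch below. The sample directory layout and the do_something function are hypothetical placeholders; only the loop itself reflects the suggestion.

```shell
#!/usr/bin/env bash
# Sketch: loop over the patterns and let one grep call scan all files.
do_something() {                     # hypothetical placeholder task
    echo "pattern found: $1"
}

dir=$(mktemp -d)                     # made-up sample data
printf '%s\n' london paris > "$dir/data_file"
mkdir "$dir/files"
printf 'flight to paris\n' > "$dir/files/dfile1"
printf 'no city here\n'    > "$dir/files/dfile2"

hits=0
while IFS= read -r line; do
    # -q stops at the first match; -R also descends into subdirectories
    if grep -qR -- "$line" "$dir/files"; then
        do_something "$line"
        hits=$((hits + 1))
    fi
done < "$dir/data_file"
rm -rf "$dir"
```

The pattern file is kept outside the searched directory on purpose: if data_file sits next to the data files, every pattern matches data_file itself.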
Using GNU Parallel you can do something like this:
doit() {
    f="$1"
    line="$2"
    # -q: only the exit status matters, stop at the first match
    if grep -q -- "$line" "$f"; then
        : # perform task here
    fi
}
export -f doit
find . -type f | parallel doit :::: - data_file