Question
Setup
I currently have the script below working to download files with curl, using a ref file with multiple variables. When I created the script it suited my needs, but as the ref file has grown and the data I am requesting via curl takes longer to generate, the script now takes too long to complete.
Objective
I want to update this script so that curl requests and downloads multiple files as they become ready, as opposed to waiting for each file to be requested and downloaded sequentially.
I've had a look around and seen that I could use either xargs or parallel to achieve this; however, based on past questions, YouTube videos, and other forum posts, I haven't been able to find an example that shows whether this is possible with more than one variable.
Can someone confirm whether this is possible, and which tool is better suited to achieve it? Is my current script in the right shape, or do I need to amend a lot of it to shoehorn these commands in?
I suspect this may be a question that's been asked previously and I may just not have found the right one.
account-list.tsv
client1 account1 123 platform1 50
client2 account1 234 platform1 66
client3 account1 344 platform1 78
client3 account2 321 platform1 209
client3 account2 321 platform2 342
client4 account1 505 platform1 69
download.sh
#!/bin/bash
set -eu

user="user"
pwd="pwd"

D1=$(date "+%Y-%m-%d" -d "1 days ago")
D2=$(date "+%Y-%m-%d" -d "1 days ago")
curr=$D2
cheese=$(pwd)

# authenticate once and store the session cookie
curl -o /dev/null -s -S -L -f -c cookiejar 'https://url/auth/' -d name="$user" -d passwd="$pwd"

while true; do
    # the ref file is tab-separated, so split on tabs rather than spaces
    while IFS=$'\t' read -r client account accountid platform platformid
    do
        curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
        curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
    done < account-list.tsv

    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done

exit
Answer 1:
Using GNU Parallel, fetching 100 entries in parallel looks something like this:
#!/bin/bash
set -eu

user="user"
pwd="pwd"

D1=$(date "+%Y-%m-%d" -d "1 days ago")
D2=$(date "+%Y-%m-%d" -d "1 days ago")
curr=$D2
# export curr so it is visible inside fetch_one when parallel runs it in child shells
export curr
cheese=$(pwd)

curl -o /dev/null -s -S -L -f -c cookiejar 'https://url/auth/' -d name="$user" -d passwd="$pwd"

fetch_one() {
    client="$1"
    account="$2"
    accountid="$3"
    platform="$4"
    platformid="$5"
    curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
    curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
}
export -f fetch_one

while true; do
    # --colsep '\t' splits each input line into positional arguments for fetch_one
    parallel -j100 --colsep '\t' fetch_one < account-list.tsv
    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done

exit
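
Since the question also asks about xargs: the same fan-out is possible with it too. A minimal sketch, reusing the exported fetch_one function above and assuming every line has exactly five fields with no embedded whitespace (xargs splits on blanks and newlines, not on lines):

# -n5 hands five whitespace-separated tokens (one line's fields) to each invocation;
# -P100 runs up to 100 invocations in parallel
xargs -n5 -P100 bash -c 'fetch_one "$@"' _ < account-list.tsv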
Answer 2:
One (relatively) easy way to run several processes in parallel is to wrap the guts of the call in a function and then call the function inside the while loop, making sure to put the function call in the background, e.g.:
# function definition
docurl () {
    curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
    curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
}

# call the function within OP's inner while loop
while true; do
    while IFS=$'\t' read -r client account accountid platform platformid
    do
        docurl &    # put the function call in the background so we can continue loop processing while the function call is running
    done < account-list.tsv

    wait            # wait for all background calls to complete

    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done
One issue with this approach is that a large volume of concurrent curl calls may bog down the underlying system and/or cause the remote system to reject 'too many' concurrent connections. In that case it will be necessary to limit the number of concurrent curl calls.
One idea is to keep a counter of the number of currently running (backgrounded) curl calls, and once we hit a limit, wait for a background process to complete before spawning a new one, e.g.:
max=5    # limit of 5 concurrent/backgrounded calls
ctr=0

while true; do
    while IFS=$'\t' read -r client account accountid platform platformid
    do
        docurl &
        ctr=$((ctr+1))

        if [[ "${ctr}" -ge "${max}" ]]
        then
            wait -n         # wait for one background process to complete (requires bash 4.3+)
            ctr=$((ctr-1))
        fi
    done < account-list.tsv

    wait    # wait for the last ${ctr} background calls to complete

    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done
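
Note that wait -n requires bash 4.3 or later. On older shells, a cruder fallback (a minimal sketch, trading some throughput for portability) is to drain the whole batch whenever the limit is reached:

if [[ "${ctr}" -ge "${max}" ]]
then
    wait      # no wait -n available: wait for the entire batch to finish
    ctr=0
fi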
Source: https://stackoverflow.com/questions/64694448/how-to-run-multiple-curl-requests-in-parallel-with-multiple-variables