Question
Setup
I currently have the script below working to download files with curl, using a ref file with multiple variables. When I created the script it suited my needs, but as the ref file has grown and the data I am requesting via curl takes longer to generate, the script now takes too long to complete.
Objective
I want to update this script so that curl requests and downloads multiple files as they become ready, as opposed to waiting for each file to be requested and downloaded sequentially.
I've had a look around and seen that I could use either xargs or parallel to achieve this; however, based on past questions, YouTube videos, and other forum posts, I haven't been able to find an example that shows whether this is possible with more than one variable.
Can someone confirm whether this is possible, and which tool is better suited to achieve it? Is my current script in the right shape, or do I need to amend a lot of it to shoehorn these commands in?
I suspect this may be a question that's been asked previously and I may just not have found the right one.
account-list.tsv
client1 account1 123 platform1 50
client2 account1 234 platform1 66
client3 account1 344 platform1 78
client3 account2 321 platform1 209
client3 account2 321 platform2 342
client4 account1 505 platform1 69
download.sh
#!/bin/bash
set -eu

user="user"
pwd="pwd"

D1=$(date "+%Y-%m-%d" -d "1 days ago")
D2=$(date "+%Y-%m-%d" -d "1 days ago")
curr=$D2
cheese=$(pwd)

# authenticate once and store the session cookie
curl -o /dev/null -s -S -L -f -c cookiejar 'https://url/auth/' -d name="$user" -d passwd="$pwd"

while true; do
    # the ref file is tab-separated, so split on tabs rather than spaces
    while IFS=$'\t' read -r client account accountid platform platformid
    do
        curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
        curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
    done < account-list.tsv

    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done

exit
Answer 1:
Using GNU Parallel, fetching 100 entries in parallel looks something like this:
#!/bin/bash
set -eu

user="user"
pwd="pwd"

D1=$(date "+%Y-%m-%d" -d "1 days ago")
D2=$(date "+%Y-%m-%d" -d "1 days ago")
curr=$D2
# export curr so it is visible inside fetch_one when parallel runs it in child shells
export curr
cheese=$(pwd)

curl -o /dev/null -s -S -L -f -c cookiejar 'https://url/auth/' -d name="$user" -d passwd="$pwd"

fetch_one() {
    client="$1"
    account="$2"
    accountid="$3"
    platform="$4"
    platformid="$5"
    curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
    curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
}
export -f fetch_one

while true; do
    # --colsep '\t' splits each input line into positional arguments for fetch_one
    parallel -j100 --colsep '\t' fetch_one < account-list.tsv
    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done

exit
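
Since the question also asks about xargs: the same fan-out is possible with it too. A minimal sketch, reusing the exported fetch_one function above and assuming every line has exactly five fields with no embedded whitespace (xargs splits on blanks and newlines, not on lines):

# -n5 hands five whitespace-separated tokens (one line's fields) to each invocation;
# -P100 runs up to 100 invocations in parallel
xargs -n5 -P100 bash -c 'fetch_one "$@"' _ < account-list.tsv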
Answer 2:
One (relatively) easy way to run several processes in parallel is to wrap the guts of the call in a function and then call the function inside the while loop, making sure to put the function call in the background, e.g.:
# function definition
docurl () {
    curl -o /dev/null -s -S -f -b cookiejar -c cookiejar 'https://url/auth/' -d account="$accountid"
    curl -sSfL -o "$client€$account@$platform£$curr.xlsx" -J -b cookiejar -c cookiejar "https://url/platform=$platformid&date=$curr"
}

# call the function within OP's inner while loop
while true; do
    while IFS=$'\t' read -r client account accountid platform platformid
    do
        docurl &    # put the function call in the background so we can continue loop processing while the function call is running
    done < account-list.tsv

    wait            # wait for all background calls to complete

    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done
One issue with this approach is that a large volume of concurrent curl calls may bog down the underlying system and/or cause the remote system to reject 'too many' concurrent connections. In that case it will be necessary to limit the number of concurrent curl calls.
One idea is to keep a counter of the number of currently running (backgrounded) curl calls, and once we hit a limit, wait for a background process to complete before spawning a new one, e.g.:
max=5    # limit of 5 concurrent/backgrounded calls
ctr=0

while true; do
    while IFS=$'\t' read -r client account accountid platform platformid
    do
        docurl &
        ctr=$((ctr+1))

        if [[ "${ctr}" -ge "${max}" ]]
        then
            wait -n         # wait for one background process to complete (requires bash 4.3+)
            ctr=$((ctr-1))
        fi
    done < account-list.tsv

    wait    # wait for the last ${ctr} background calls to complete

    [ "$curr" \< "$D1" ] || break
    curr=$( date +%Y-%m-%d --date "$curr +1 day" ) ## used in instances where I need to grab data for past date ranges.
done
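
Note that wait -n requires bash 4.3 or later. On older shells, a cruder fallback (a minimal sketch, trading some throughput for portability) is to drain the whole batch whenever the limit is reached:

if [[ "${ctr}" -ge "${max}" ]]
then
    wait      # no wait -n available: wait for the entire batch to finish
    ctr=0
fi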
Source: https://stackoverflow.com/questions/64694448/how-to-run-multiple-curl-requests-in-parallel-with-multiple-variables