A bash job pool script

A bash script to help you create a job pool to run embarrassingly parallel jobs.

Rationale

Sometimes I write my Python computation script so that it processes a single year of data, with the year specified at the top of the script file as a global parameter, e.g. YEAR=1984. When I need to process a different year, I change this YEAR global parameter and re-run the script. When the workload becomes bigger, e.g. a decade’s data needs to be processed, it would be nice to automate this parameter-changing, script-running process.

If the computation is embarrassingly parallel, i.e. there is no inter-dependency between the jobs for different years, I would also like to run multiple jobs in parallel. But running all the jobs at once may demand more resources than my PC has. If it can only support a few parallel jobs, a job pool model is an ideal solution.

The idea is to first build a job queue, e.g. a year list from 1980 to 2020. Then create a worker queue whose size is the maximum number of jobs allowed to run simultaneously. Each worker in the worker queue fetches a job from the job queue and starts processing. Whichever worker finishes first fetches another job from the job queue and processes it. This repeats until the job queue is depleted.
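The pool model described above can be sketched in a few lines. This is only a minimal illustration, not the script shown later in this post: it assumes bash 4.3 or newer for `wait -n`, and `run_job`, `LOG` and the 1980 to 1989 range are hypothetical stand-ins for the real work.

```shell
#!/bin/bash
# Minimal job-pool sketch (assumes bash >= 4.3 for `wait -n`).
# run_job and LOG are hypothetical names used only for this illustration.

POOL_SIZE=3          # maximum number of jobs running at the same time
LOG=$(mktemp)        # each finished job appends one line here

run_job() {
    local year=$1
    sleep 0.1                        # stand-in for the real computation
    echo "done $year" >> "$LOG"
}

running=0
for year in {1980..1989}; do
    # pool is full: block until any one background job finishes
    if (( running >= POOL_SIZE )); then
        wait -n
        running=$((running - 1))
    fi
    run_job "$year" &
    running=$((running + 1))
done
wait                 # wait for the jobs that are still running
```

Here the `for` loop plays the role of the job queue, and the `running` counter keeps the number of background jobs at or below `POOL_SIZE`.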

Change the global parameter

I use the sed command to change the global parameter that defines the job. For instance:

tmpfile=/path/to/the/temporary/job/file_${proc}.py
sed "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./my_script_file.py > "$tmpfile"

${proc} is the bash variable that defines the job. It is the id of the job, in this case a year number, e.g. 1985. tmpfile is a temporary file that stores the modified Python script. Its name contains the job id (${proc}), so that the temporary files of different jobs don’t overwrite each other.

my_script_file.py is the Python script that does the computation. The sed command does a search-and-replace operation on my_script_file.py. At Line 10 of the file (the 10 in "10s/YEAR=[0-9]\{4\}/YEAR=$proc/"), it matches the pattern YEAR=[0-9]\{4\}, which is the string YEAR= followed by 4 digits, and replaces the match with the new string YEAR=$proc. The original file is left untouched; the modified result is saved to the temporary file $tmpfile.
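To see the substitution in isolation, here is a small self-contained check. The file `demo` is a hypothetical 10-line stand-in for my_script_file.py, made up for this example, with the YEAR assignment on Line 10 and a decoy YEAR= in a comment on Line 1.

```shell
# Build a hypothetical 10-line stand-in for my_script_file.py.
demo=$(mktemp)
echo "# note: YEAR=1900 in a comment" > "$demo"         # line 1, must stay as is
for i in {2..9}; do echo "# line $i"; done >> "$demo"   # lines 2 to 9
echo "YEAR=1984" >> "$demo"                             # line 10, the parameter

proc=1985
# Same sed expression as above: only line 10 is searched and replaced.
modified=$(sed "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" "$demo")

# Show lines 1 and 10 of the modified text.
echo "$modified" | sed -n '1p;10p'
# prints:
# # note: YEAR=1900 in a comment
# YEAR=1985
```

Because of the line address 10, the YEAR=1900 on Line 1 is not replaced. The flip side is that the substitution silently does nothing if the parameter moves to a different line of the script.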

Run the job pool using a bash script

I probably adapted this from a Stack Overflow answer or something similar. Here is the job pool script:

#!/bin/bash

#set -e   # not usable as is: ((cur++)) returns non-zero when cur is 0, which would abort the script
POOL_SIZE=3   # number of workers running in parallel

#######################################################################
#                            populate jobs                            #
#######################################################################

declare -a jobs

for (( i = 1980; i < 2021; i++ )); do
	jobs+=($i)
done

echo '################################################'
echo '    Launching jobs'
echo '################################################'

parallel() {
    local proc procs jobs cur
    jobs=("$@")         # input jobs array
    declare -a procs=() # processes array
    cur=0               # current job idx

    morework=true
    while $morework; do
        # if process array size < pool size, try forking a new proc
        if [[ "${#procs[@]}" -lt "$POOL_SIZE" ]]; then
            if [[ $cur -lt "${#jobs[@]}" ]]; then
                proc=${jobs[$cur]}
                echo "JOB ID = $cur; JOB = $proc."

                ###############
                # do job here #
                ###############

                tmpfile=./erai_tmp_pacific_comp_${proc}.py
                sed -e "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./pacific_comp.py > "$tmpfile"
                python "$tmpfile" &

                # add to current running processes
                procs+=("$!")
                # move to the next job
                ((cur++))
            else
                morework=false
                continue
            fi
        fi

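        for n in "${!procs[@]}"; do
            kill -0 "${procs[n]}" 2>/dev/null && continue
            # if process is not running anymore, remove from array
            unset 'procs[n]'
        done
        # when the pool is still full, pause briefly instead of busy-polling
        if [[ "${#procs[@]}" -ge "$POOL_SIZE" ]]; then
            sleep 1
        fi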
    done
    wait
}

parallel "${jobs[@]}"

Usage

POOL_SIZE defines the number of workers in the worker queue.

The job queue, i.e. a year list from 1980 to 2020, is created in this loop and saved to the bash array jobs:

for (( i = 1980; i < 2021; i++ )); do
	jobs+=($i)
done
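The job list does not have to come from a C-style loop. As a sketch, here are a few equivalent ways to build it; the array names `jobs_brace`, `jobs_seq` and `jobs_custom` and the variables `first` and `last` are made up for this example.

```shell
# Brace expansion: compact, but the bounds must be literal numbers.
jobs_brace=({1980..2020})

# seq: works when the bounds are stored in variables.
first=1980
last=2020
jobs_seq=($(seq "$first" "$last"))

# An explicit list: handy when the jobs are not a contiguous range.
jobs_custom=(1982 1997 2015)

echo "${#jobs_brace[@]} ${jobs_brace[0]} ${jobs_brace[40]}"
# prints: 41 1980 2020
```

Any of these arrays can be passed to the pool function in the same way as the original, e.g. parallel "${jobs_custom[@]}".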

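The worker does its job in these 3 lines. The trailing & runs the Python process in the background, so the loop can go on to launch the next job: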

tmpfile=./erai_tmp_pacific_comp_${proc}.py
sed -e "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./pacific_comp.py > "$tmpfile"
python "$tmpfile" &
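One possible variation, sketched here under assumptions, is to wrap these lines in a function that also deletes the temporary file once the job has finished. The function name `run_job` and the variables `SCRIPT` and `PYTHON` are hypothetical and not part of the script above.

```shell
# Hypothetical helper: run one job, then clean up its temporary file.
# SCRIPT and PYTHON are assumed names; they default to the values used above.
SCRIPT=${SCRIPT:-./pacific_comp.py}
PYTHON=${PYTHON:-python}

run_job() {
    local proc=$1
    local tmpfile=./erai_tmp_pacific_comp_${proc}.py
    local status

    # write a copy of the script with YEAR set to this job's year
    sed -e "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" "$SCRIPT" > "$tmpfile"

    "$PYTHON" "$tmpfile"
    status=$?

    rm -f "$tmpfile"     # remove the temporary copy once the job is done
    return "$status"
}

# Inside the pool loop, the job would then be launched as:
#   run_job "$proc" &
#   procs+=("$!")
```

With this variation, `$!` is the PID of the background subshell running `run_job` rather than of Python itself, which is still enough for the `kill -0` check in the pool loop.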
