Rationale
Sometimes I write my Python computation script in such a way that it
processes the data of a single year, specified at the top of the script
file as a global parameter, e.g. YEAR=1984. When I need to process
a different year, I change this YEAR global parameter and re-run the
script. When the workload gets bigger, e.g. when a decade's worth of
data needs to be processed, it would be nice to automate this
change-the-parameter-then-re-run-the-script process.
If the computation is embarrassingly parallel, i.e. there is no inter-dependency between jobs for different years, I would also like to run multiple jobs in parallel. But the resources demanded to run all the jobs at once may exceed my PC's capacity, which can only support a few parallel jobs. In that case a job pool model is an ideal solution.
The idea is to first build a job queue, e.g. a year list from 1980 to 2020, then create a worker pool whose size is the maximum number of jobs allowed to run simultaneously. Each worker fetches a job from the job queue and starts processing it. Whenever a worker finishes its job, it fetches another job from the queue. This repeats until the job queue is depleted.
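As a minimal sketch of this model, assuming bash 4.3 or later (for the wait -n builtin) and a hypothetical worker command some_job standing in for the real work:

POOL_SIZE=3
for proc in {1980..2020}; do
    # if the pool is full, block until any running worker finishes
    while (( $(jobs -rp | wc -l) >= POOL_SIZE )); do
        wait -n
    done
    some_job "$proc" &   # hypothetical placeholder for the real work
done
wait   # drain the remaining workers

The script further below implements the same bookkeeping by hand, which also works with older bash versions.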
Change the global parameter
I use the sed command to change the global parameter that defines
the job. For instance:
tmpfile=/path/to/the/temporary/job/file_${proc}.py
sed "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./my_script_file.py > "$tmpfile"
${proc} is the bash variable that defines the job; it is the id of the
job, in this case a year number, e.g. 1985. tmpfile is a temporary
file that stores the modified Python script, with the job id
(${proc}) embedded in its name so that the temporary files of
different jobs don't overwrite each other.
my_script_file.py is the Python script that does the
computation. The sed command performs a search-and-replace
on my_script_file.py: at line 10 of the file (the 10
in "10s/YEAR=[0-9]\{4\}/YEAR=$proc/"), it matches the pattern
YEAR=[0-9]\{4\}, i.e. the string YEAR= followed by 4
digits, and replaces the match with the new string YEAR=$proc. The
modified result is saved to the temporary file $tmpfile.
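For example, suppose line 10 of my_script_file.py currently reads YEAR=1984. A hypothetical shell session (sed -n '10p' just prints line 10 of its input) would look like:

$ sed -n '10p' ./my_script_file.py
YEAR=1984
$ proc=1985
$ sed "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./my_script_file.py | sed -n '10p'
YEAR=1985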
Run the job pool using a bash script
I probably got this from a Stack Overflow answer or something, but here is the job pool script:
#!/bin/bash
#set -e # this doesn't work here, likely because ((cur++)) returns a
#       non-zero status when cur is 0, which set -e treats as an error

POOL_SIZE=3   # number of workers running in parallel

#######################################################################
#                            populate jobs                            #
#######################################################################
declare -a jobs
for (( i = 1980; i < 2021; i++ )); do
    jobs+=($i)
done

echo '################################################'
echo '                Launching jobs'
echo '################################################'

parallel() {
    local proc procs jobs cur morework
    jobs=("$@")   # input jobs array
    procs=()      # PIDs of the currently running workers
    cur=0         # index of the next job to launch
    morework=true

    while $morework; do
        # if the process array is smaller than the pool size, try forking a new worker
        if [[ "${#procs[@]}" -lt "$POOL_SIZE" ]]; then
            if [[ $cur -lt "${#jobs[@]}" ]]; then
                proc=${jobs[$cur]}
                echo "JOB ID = $cur; JOB = $proc."
                ###############
                # do job here #
                ###############
                tmpfile=./erai_tmp_pacific_comp_${proc}.py
                sed -e "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./pacific_comp.py > "$tmpfile"
                python "$tmpfile" &
                # add the new worker to the running processes
                procs+=("$!")
                # move on to the next job
                ((cur++))
            else
                morework=false
                continue
            fi
        fi
        for n in "${!procs[@]}"; do
            # kill -0 only probes the process; skip it if it is still running
            kill -0 "${procs[n]}" 2>/dev/null && continue
            # the process has finished, remove it from the array
            unset 'procs[n]'
        done
        sleep 0.2   # avoid busy-spinning while the pool is full
    done
    wait   # wait for the last workers to finish
}

parallel "${jobs[@]}"
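Assuming the script is saved as, say, job_pool.sh (a made-up name) in the same directory as pacific_comp.py, it is launched with:

bash job_pool.sh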
Usage
POOL_SIZE defines the number of workers in the worker pool.
The job queue, i.e. a year list from 1980 to 2020, is created here
and saved to the bash array jobs:
for (( i = 1980; i < 2021; i++ )); do jobs+=($i); done
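Equivalently, bash's brace expansion builds the same list in one line:

jobs=({1980..2020})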
The worker does its job in these 3 lines:
tmpfile=./erai_tmp_pacific_comp_${proc}.py
sed -e "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./pacific_comp.py > "$tmpfile"
python "$tmpfile" &
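Note that the temporary script files are not removed by the pool; once all the jobs have finished, they can be cleaned up with something like:

rm -f ./erai_tmp_pacific_comp_*.py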



