Rationale
Sometimes I write a Python computation script so that it processes the data for a single year, specified at the top of the script file as a global parameter, e.g. `YEAR=1984`. When I need to process a different year, I change this `YEAR` global parameter and re-run the script. When the workload grows, e.g. a decade's worth of data needs to be processed, it would be nice to automate this parameter-changing, script-running process.
If the computation is embarrassingly parallel, i.e. there are no dependencies between the jobs for different years, I would also like to run multiple jobs in parallel. But the resources needed to run all the jobs at once may exceed my PC's capacity, which can only support a few parallel jobs. In that case a job-pool model is an ideal solution.
The idea is to first build a job queue, e.g. a year list from 1980 to 2020, then create a worker queue whose size is the maximum number of jobs allowed to run simultaneously. Each worker in the worker queue fetches a job from the job queue and starts processing it. Whichever worker finishes first fetches another job from the job queue and processes it. This repeats until the job queue is depleted.
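As an aside, the same pool model can be sketched in a few lines of Python using the standard library's `concurrent.futures`. This is only an illustration of the idea, not the bash implementation used below; `process_year` is a placeholder for the real per-year computation:

```python
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 3  # maximum number of jobs running at the same time


def process_year(year):
    # placeholder for the real per-year computation
    return f"processed {year}"


# job queue: one job per year from 1980 to 2020
jobs = list(range(1980, 2021))

# the executor keeps at most POOL_SIZE jobs in flight; whenever a
# worker finishes, it picks up the next year from the queue
with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    results = list(pool.map(process_year, jobs))

print(results[0])
```

The executor plays the role of the worker queue: it never runs more than `max_workers` jobs at once and refills a slot as soon as one finishes.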
Change the global parameter
I use the `sed` command to change the global parameter that defines the job. For instance:

```bash
tmpfile=/path/to/the/temporary/job/file_${proc}.py
sed "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./my_script_file.py > "$tmpfile"
```
`${proc}` is the bash variable that defines the job; it is the id of the job, in this case a year number, e.g. 1985. `tmpfile` is a temporary file that stores the modified Python script, with the job id (`${proc}`) in its name so that different jobs don't overwrite each other's files.
`my_script_file.py` is the Python script that does the computation. The `sed` command performs a search-and-replace on `my_script_file.py`: at line 10 of the file (the `10` in `"10s/YEAR=[0-9]\{4\}/YEAR=$proc/"`), match the pattern `YEAR=[0-9]\{4\}`, which is the string `YEAR=` followed by 4 digits, then replace the match with the new string `YEAR=$proc`. The modified result is saved to the temporary file `$tmpfile`.
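For illustration, the same search-and-replace can be reproduced in Python with the `re` module; the sample line here is a stand-in for line 10 of the real script:

```python
import re

# stand-in for line 10 of the computation script
line = "YEAR=1984"
proc = 1985

# equivalent of the sed pattern YEAR=[0-9]\{4\}:
# "YEAR=" followed by exactly 4 digits
new_line = re.sub(r"YEAR=[0-9]{4}", f"YEAR={proc}", line)
print(new_line)  # YEAR=1985
```

Note that Python's `re` syntax writes the repetition count as `{4}` where basic `sed` regular expressions need the escaped form `\{4\}`.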
Run the job pool using a bash script
I probably got this from a Stack Overflow answer or somewhere similar. Here is the job pool script:
```bash
#!/bin/bash
#set -e    # this doesn't work here for some reason

POOL_SIZE=3    # number of workers running in parallel

#######################################################################
#                            populate jobs                            #
#######################################################################
declare -a jobs
for (( i = 1980; i < 2021; i++ )); do
    jobs+=($i)
done

echo '################################################'
echo '                Launching jobs'
echo '################################################'

parallel() {
    local proc procs jobs cur
    jobs=("$@")           # input jobs array
    declare -a procs=()   # processes array
    cur=0                 # current job idx
    morework=true

    while $morework; do
        # if process array size < pool size, try forking a new proc
        if [[ "${#procs[@]}" -lt "$POOL_SIZE" ]]; then
            if [[ $cur -lt "${#jobs[@]}" ]]; then
                proc=${jobs[$cur]}
                echo "JOB ID = $cur; JOB = $proc."

                ###############
                # do job here #
                ###############
                tmpfile=./erai_tmp_pacific_comp_${proc}.py
                sed -e "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./pacific_comp.py > "$tmpfile"
                python "$tmpfile" &

                # add to current running processes
                procs+=("$!")
                # move to the next job
                ((cur++))
            else
                morework=false
                continue
            fi
        fi

        for n in "${!procs[@]}"; do
            kill -0 "${procs[n]}" 2>/dev/null && continue
            # if process is not running anymore, remove from array
            unset procs[n]
        done
    done

    wait
}

parallel "${jobs[@]}"
```
Usage
`POOL_SIZE` defines the number of workers in the worker queue.
The job queue, i.e. a year list from 1980 to 2020, is created in this loop and saved to the bash array `jobs`:
```bash
for (( i = 1980; i < 2021; i++ )); do
    jobs+=($i)
done
```
The worker does its job in these 3 lines:
```bash
tmpfile=./erai_tmp_pacific_comp_${proc}.py
sed -e "10s/YEAR=[0-9]\{4\}/YEAR=$proc/" ./pacific_comp.py > "$tmpfile"
python "$tmpfile" &
```
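The same three steps, substitute the year, write a job-specific temporary file, and run it, can be mimicked in pure Python; the `template` string below is a stand-in for the real `pacific_comp.py`:

```python
import os
import re
import subprocess
import sys
import tempfile

# stand-in for the real computation script, with YEAR on a known line
template = "YEAR=1984\nprint(f'processing year {YEAR}')\n"


def run_job(proc):
    # substitute the year, as the sed command does
    modified = re.sub(r"YEAR=[0-9]{4}", f"YEAR={proc}", template)
    # write the modified script to a job-specific temporary file,
    # so parallel jobs don't overwrite each other
    with tempfile.NamedTemporaryFile(
        "w", suffix=f"_{proc}.py", delete=False
    ) as f:
        f.write(modified)
        tmpfile = f.name
    try:
        # run the job and capture its output
        out = subprocess.run(
            [sys.executable, tmpfile],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    finally:
        os.remove(tmpfile)


print(run_job(1985))  # processing year 1985
```

In the bash script the `&` at the end of the `python` line is what sends the job to the background so the worker loop can keep going; the sketch above runs one job synchronously just to show the substitute-write-run cycle.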