Previously …
This is a follow-up post to Experiment with a new data and program files naming convention – part 1, in which I talked about an experimental data file naming scheme that appends 2 hash strings to the end of a file name. These hash strings, one computed from the script filename and the other from the current git repository revision, are used to pinpoint the specific version of the specific script that created a certain piece of data, thus making back-tracking easier. In this post I am sharing the tools I made to facilitate the implementation of the scheme.
Based on the workflow, the tools are divided into 2 parts:
- Create and append hashes to filenames.
- Retrieve the script from given hashes.
Create and append hashes to filenames
I mostly do computations using Python, therefore this part is implemented in Python code.
Create a (truncated) hash string from texts
Code first:
def createHash(texts, length=6, dec='[]'):
    '''Create a short hash string from a given string

    Args:
        texts (str): input string.
    Keyword Args:
        length (int): length of hash string to truncate.
        dec (str): decorator texts around hash.
    Returns:
        result (str): computed hash
    '''
    import hashlib
    hsh = hashlib.sha1(texts.encode('ascii'))
    result = hsh.hexdigest()
    if len(dec)==2:
        result = '%s%s%s' %(dec[0], result[:length], dec[1])
    else:
        result = result[:length]
    return result
Python has a built-in module, hashlib, for the computation of
hashes. Here we are using a sha1 hash:
import hashlib
hsh = hashlib.sha1(texts.encode('ascii'))
result = hsh.hexdigest()
It doesn’t really matter which hashing algorithm is picked, as it has
no security implications. The length and dec optional arguments
are used to truncate the 40-digit sha1 hex digest to a much shorter
length, and surround it with a bracket pair for easier identification
later. I chose to truncate it at 6 hex digits, which leaves 16^6
(about 16.8 million) possible values, so the chance of two script names
colliding should be very low for a typical collection of scripts.
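To put a rough number on "very low", here is a minimal sketch using the standard birthday approximation. It assumes the truncated hashes behave like uniformly random values, and the function name collision_probability is made up for this illustration.

```python
import math

def collision_probability(n_names, hex_digits=6):
    """Approximate chance that at least two of n_names distinct script
    names share the same truncated hash (birthday approximation)."""
    space = 16 ** hex_digits               # number of possible truncated hashes
    pairs = n_names * (n_names - 1) / 2    # number of distinct name pairs
    return 1.0 - math.exp(-pairs / space)

# Print the estimate for a few collection sizes
for n in (100, 1000, 5000):
    print(n, collision_probability(n))
```

Under this approximation, a few hundred scripts give a collision chance well below 1%, while it climbs to a few percent at around 1000 scripts, so a larger length may be worth considering for very big script collections.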
Query the hash of the current git revision
Code first:
def getGitHash():
    '''Get the hash of the current git commit'''
    from subprocess import Popen, PIPE
    try:
        proc = Popen(['git', 'rev-parse', '--short', 'HEAD'], stdout=PIPE,
                     stderr=PIPE)
        stdout, stderr = proc.communicate()
    except OSError:
        # raised e.g. when the git executable cannot be found
        raise Exception("Failed to obtain git hash.")
    # NOTE if not in a git repo, stdout is empty, stderr contains the error
    result = stdout.decode().strip()
    return result
To get the current git revision hash from the command line:
git rev-parse --short HEAD
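This gives an abbreviated sha1 hash of the commit that HEAD points to,
e.g. 25f0bd6. It is 7 digits long by default, although git may print
more digits in very large repositories. Without the --short flag, the
return would be the full 40-digit hash.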
To call the above git command as an external process inside Python, use the
subprocess.Popen function like so:
from subprocess import Popen, PIPE
proc=Popen(['git', 'rev-parse', '--short', 'HEAD'], stdout=PIPE, stderr=PIPE)
stdout, stderr=proc.communicate()
Upon successful execution, the desired hash string is obtained from
the stdout variable, after a decode() and a strip() call.
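As a variation, the same query can be written with subprocess.run and an explicit return code check, so that a missing git executable or a folder outside any repository gives None instead of an empty string. This is only a sketch: the name get_git_hash and its cwd argument are my own additions, not part of the tools described here.

```python
import subprocess

def get_git_hash(cwd=None):
    """Return the short hash of HEAD, or None when it cannot be determined."""
    try:
        proc = subprocess.run(
            ['git', 'rev-parse', '--short', 'HEAD'],
            cwd=cwd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError:
        return None   # git executable (or the cwd folder) not found
    if proc.returncode != 0:
        return None   # e.g. cwd is not inside a git repository
    return proc.stdout.decode().strip()

print(get_git_hash())
```

Returning None makes it easy for the caller to decide whether to skip the git part of the suffix or to abort.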
Append hashes to output filenames
Code first:
def hashSuffix(fname, length=6, dec='[]', append_git_hash=True):
    '''Create a hash as file name suffix

    Args:
        fname (str): input file name to modify.
    Keyword Args:
        length (int): length of hash string to truncate.
        dec (str): decorator texts around hash.
        append_git_hash (bool): whether to append the git hash or not.
    Returns:
        result (str): fname appended with the hash string of the calling
                      python script. E.g. Execute:
                      ```
                      python compute_divergence.py
                      ```
                      And <fname> is "div_p850_6_2000_suffix.nc".
                      A hash is computed for the string "compute_divergence.py", and
                      appended after <fname> to give "div_p850_6_2000_suffix[6ec0aa].nc"
                      If <append_git_hash> is True, the hash will be appended further by
                      the git hash of the current commit, e.g.:
                      "div_p850_6_2000_suffix[6ec0aa-8b44fac].nc"
                      These hash strings can be used to track down the script generating
                      the results.
    '''
    import sys, os
    fname, ext = os.path.splitext(fname)
    # hash only the file name of the running script, not its directory
    script_file = os.path.split(sys.argv[0])[1]
    if append_git_hash:
        hash_str = createHash(script_file, length, dec='')
        githash = getGitHash()
        hash_str = '%s%s-%s%s' % (dec[0], hash_str, githash, dec[1])
    else:
        # use script_file here as well, so both branches hash the same string
        hash_str = createHash(script_file, length, dec=dec)
    result = '%s%s%s' % (fname, hash_str, ext)
    return result
What happens here is that given an input filename fname, we first
separate the extension part from the basename, and then append the
following composition to the end of the filename:
[script_hash-git_hash]
script_hash is the truncated hash computed from the file name
(without the directory part) of the main executed script,
e.g. compute_divergence.py for /home/user/scripts/compute_divergence.py.
The script path can be obtained from sys.argv[0].
We then use the createHash() function introduced above to compute
the hash, which is then joined to the git revision hash, obtained from
the getGitHash() function, with a hyphen -. After adding the surrounding
square brackets, the result is something like this:
[c5909e-5a77e20]
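To see the composition in isolation, without depending on sys.argv or a git repository, here is a small sketch that takes the script name and the git hash as plain arguments. The helper name add_signature is hypothetical, and it encodes with utf-8 rather than ascii, which gives the same digest for plain ASCII names.

```python
import hashlib
import os

def add_signature(fname, script_name, git_hash, length=6):
    """Insert '[script_hash-git_hash]' between the base name and extension."""
    base, ext = os.path.splitext(fname)
    # truncated sha1 of the script file name, as in createHash()
    script_hash = hashlib.sha1(script_name.encode('utf-8')).hexdigest()[:length]
    return '%s[%s-%s]%s' % (base, script_hash, git_hash, ext)

# The git hash here is just the example value used in this post
print(add_signature('vertical_profiles.png', 'compute_divergence.py', '5a77e20'))
```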
To use the hashSuffix() function in practice, I could call, for
instance:
outputfile = hashSuffix('vertical_profiles.png')
figure.savefig(outputfile)
Then the image file would have a file name of
vertical_profiles[c5909e-5a77e20].png. A signature has been planted
in the data file name. Now we need to implement the tools to back-track
the script file from this signature.
Retrieve the script from given hashes
For this part I decided to use bash scripting instead, as Python would probably end up being too slow for tasks like traversing nested folder structures.
As usual, code first (my bash scripting skills are pretty amateur, so corrections are welcome):
#!/usr/bin/env bash
# View the script specified by the (short) hash of the script file and
# the (short) hash of the git commit
# Usage:
# hash_find.sh some_data_file[8d01c9-47823d8].png
# where:
# 8d01c9 is the 6-digits hash of the script file.
# 47823d8 is the 7-digits hash of the git commit
# If given an optional 2nd argument NEXT_COMMIT, will get the revision
# following 47823d8:
# hash_find.sh some_data_file[8d01c9-47823d8].png 1
set -e
VERBOSE=0
TERM=gnome-terminal
BASE_FOLDER=~/scripts/ # script folder
TMP_SCRIPT_FOLDER=/tmp/ # tmp folder to save git show results
TARGET_FILE=$1
NEXT_COMMIT=${2:-0} # default to 0 when no 2nd argument is given
function extractHash() {
    file_name=$1
    choice=$2
    # quote the file name: unquoted, the [..] part is treated as a glob pattern
    hash1=$(echo "$file_name" | sed -En 's/.*\[(.{6})-(.{7})\].*/\1/p')
    hash2=$(echo "$file_name" | sed -En 's/.*\[(.{6})-(.{7})\].*/\2/p')
    if [[ $choice -eq 1 ]]; then
        echo "$hash1"
    else
        echo "$hash2"
    fi
}
findFile () {
    hash=$1
    folder=$2
    find "$folder" -type f -name "*.py" |
    while IFS= read -r filename; do
        basename=$(basename "$filename")
        fname_hash=$(echo -n "$basename" | openssl dgst -sha1 -hex | cut -c10-15)
        if [[ "$fname_hash" = "$hash" ]]; then
            echo "$filename"
            break
        fi
    done
}
function findCommit() {
    TARGET_HASH=$1
    NEXT_COMMIT=$2
    # collect git log hash list
    declare -a hash_list
    tmp=$(git log | sed -n "s/^commit \(.*\)$/\1/p")
    for hii in $tmp; do
        hash_list+=($hii)
    done
    n=${#hash_list[@]}
    # Find next commit
    if [[ $NEXT_COMMIT == 1 ]]; then
        for (( i = 0; i < $n; i++ )); do
            hii=${hash_list[$i]}
            shii=$(echo $hii | cut -c1-7)
            if [[ $TARGET_HASH == $shii ]]; then
                if [[ $i -eq 0 ]]; then
                    echo ${hash_list[$i]}
                else
                    echo ${hash_list[$i-1]}
                fi
                break
            fi
        done
    else
        echo $TARGET_HASH
    fi
}
# extract hash strings from file name
script_hash=$(extractHash "$1" 1)
git_hash=$(extractHash "$1" 2)
if [[ -z $script_hash || -z $git_hash ]]; then
exit 2
fi
if [[ $VERBOSE == 1 ]]; then
echo 'script_hash=' $script_hash
echo 'git_hash=' $git_hash
fi
# find script file from hash
script_file=$(findFile $script_hash $BASE_FOLDER)
if [[ -z $script_file ]]; then
exit 2
fi
if [[ $VERBOSE == 1 ]]; then
echo $script_file
fi
# get dirname and basename from script file path
dirname=$(dirname $script_file)
basename=$(basename $script_file)
# cd into folder
cd $dirname
# get git commit
got_hash=$(findCommit $git_hash $NEXT_COMMIT)
if [[ $VERBOSE == 1 ]]; then
    echo 'got_hash=' $got_hash
fi
# git show old version
tmpfile=${TMP_SCRIPT_FOLDER}/${basename}
if [[ $VERBOSE == 1 ]]; then
echo "$tmpfile"
fi
#git show ${got_hash}:${basename} | less
git show ${got_hash}:${basename} > $tmpfile
if [[ $? == 0 ]]; then
$TERM -- vim $tmpfile
fi
Some more explanations are given below.
Global parameters
At the top of the bash script I defined a few global parameters:
TERM specifies which
terminal emulator to use when opening the text editor (vim in this
case) after locating the target script file. I used gnome-terminal
here; you could pick whichever you prefer.
BASE_FOLDER gives the root level folder within which the script file
is sought. Narrowing down to a more specific folder would help speed
up the process.
TMP_SCRIPT_FOLDER specifies the folder to save a temporary copy of
the located script file. /tmp/ is a good option for such
things.
The 1st command line input argument is the target data file name,
e.g. vertical_profiles[c5909e-5a77e20].png. A secondary optional
argument is stored in NEXT_COMMIT. If not given, it has a
default value of 0. If given a value of 1, the script will try to search for
the git revision after the one given in the file name, e.g. the revision
following 5a77e20.
Extract hashes from filename
The function extractHash is responsible for extracting hash strings
from the filenames. It does so by 2 regular expression (regex)
searches using sed:
sed -En 's/.*\[(.{6})-(.{7})\].*/\1/p'
sed -En 's/.*\[(.{6})-(.{7})\].*/\2/p'
The 2nd input function argument choice decides which one to return:
1 for the 6-digits script hash, 2 for the 7-digits git hash.
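For comparison, the same extraction can be sketched in Python with the re module. Note that this version is slightly stricter than the sed expressions: it only accepts hexadecimal digits, and it picks the first bracketed signature in the name rather than the last. The name split_signature is hypothetical.

```python
import re

# '[' + 6 hex digits + '-' + 7 hex digits + ']'
SIGNATURE = re.compile(r'\[([0-9a-f]{6})-([0-9a-f]{7})\]')

def split_signature(fname):
    """Return (script_hash, git_hash) found in fname, or None if absent."""
    m = SIGNATURE.search(fname)
    return (m.group(1), m.group(2)) if m else None

print(split_signature('vertical_profiles[c5909e-5a77e20].png'))
# -> ('c5909e', '5a77e20')
```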
Find the script file by matching script hash
This search is performed in the findFile function.
To locate the script with the matching hash string, I did a
sweeping search within the BASE_FOLDER using find:
find "$folder" -type f -name "*.py" | while IFS= read -r filename; do ... done
All Python
scripts (with a .py extension) inside the BASE_FOLDER are
iterated over, and a hash is computed for each file name. This time the
hash is computed using openssl:
fname_hash=$(echo -n "$basename" | openssl dgst -sha1 -hex | cut -c10-15)
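NOTE that the -n option to the echo command suppresses the
trailing new line character that echo would otherwise append, which would change the hash.
The cut command picks out characters 10 to 15 of the openssl output, i.e. the first 6 hash digits after the "(stdin)= " prefix. This offset depends on the output format of the installed openssl version, so it is worth checking if no match is ever found.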
The search is stopped upon finding a match.
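As a cross-check that does not depend on the openssl output format, the same lookup can be sketched in Python with os.walk and hashlib. The function name find_script is hypothetical, and I have not compared its speed with the find-based version.

```python
import hashlib
import os

def find_script(script_hash, folder, length=6):
    """Return the path of the first .py file under folder whose file name
    hashes (sha1, truncated to length) to script_hash, or None."""
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.endswith('.py'):
                continue
            digest = hashlib.sha1(name.encode('utf-8')).hexdigest()
            if digest[:length] == script_hash:
                return os.path.join(root, name)   # stop at the first match
    return None
```

A call such as find_script('c5909e', os.path.expanduser('~/scripts/')) would mirror the findFile call in the bash script.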
Find the version of the script by matching git revision hash
This step is done in the findCommit function. After locating the
target script, we still need to retrieve the matching version of the
script. Much of the hassle inside the findCommit function is
actually only relevant if the NEXT_COMMIT argument is 1, i.e. when I
want to look at the git revision immediately after the one specified
in the filename. This is because sometimes I would create a new script
with a clean repo stage, and execute the script to generate output
data. Then this newly created script would be registered into the git
history in the next commit, rather than the one whose hash has been
appended to the data filename.
To find this next commit, I first stored all git revision hash strings into an array:
declare -a hash_list
tmp=$(git log | sed -n "s/^commit \(.*\)$/\1/p")
for hii in $tmp; do
    hash_list+=($hii)
done
Then I iterate through the array to find the matching hash. Upon locating
it, the one after it, with an array index of i-1, if there is one,
is returned. Note that the git log command lists commits in
reverse chronological order (newest first), so the next revision is at i-1, rather than i+1.
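The index logic can be illustrated with a small Python sketch that works on a plain list of full hashes in the order git log prints them. The function name next_commit and the dummy hashes are made up for this example.

```python
def next_commit(hashes, short_hash):
    """hashes: full commit hashes, newest first, as printed by `git log`.
    Return the commit made right after the one starting with short_hash."""
    for i, full in enumerate(hashes):
        if full.startswith(short_hash):
            # index i-1 is the newer neighbour; the newest commit has none,
            # so it is returned itself
            return hashes[i - 1] if i > 0 else full
    return None   # short_hash is not in the history

# Dummy history, newest first
log = ['c' * 40, 'b' * 40, 'a' * 40]
print(next_commit(log, 'b' * 7))   # the commit made after 'bbbbbbb'
```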
Lastly, to get a historical revision of the given file in a git repository, one uses:
git show ${got_hash}:${basename} > $tmpfile
${got_hash} was found by the findCommit function, and ${basename} by
the findFile function. The stdout of the command is redirected to
a temporary file $tmpfile, which is then opened in vim in a new
terminal window:
$TERM -- vim $tmpfile
Build a rofi script and bind a hotkey
This is one additional, "bonus" step. Code first:
#!/usr/bin/env bash
SEARCH_FOLDER1=~/scripts/
SEARCH_FOLDER2=~/datasets/
HASH_FIND_SCRIPT=~/.config/rofi/hash_find.sh
SELECT=$(find $SEARCH_FOLDER1 $SEARCH_FOLDER2 -type f -regex ".*\.\(png\|jpg\|csv\|pdf\|nc\|dat\)"| rofi -dmenu -lines 30 -config ~/.config/rofi/config_recoll)
if [[ -n $SELECT ]]; then
    bash "$HASH_FIND_SCRIPT" "$SELECT"
fi
SEARCH_FOLDER1 and SEARCH_FOLDER2 are where I save output data.
They are used as the target search area of the find command. The
data files have extensions of jpg, png, csv, pdf, nc or
dat. The output from the find command is piped to a rofi
interface, from which I pick the file name I’d like to parse.
After picking the desired file, hash_find.sh, which is the bash
script in the above section, is executed on the
selected file. Once it has run successfully, the located script file will be
opened in a new terminal window, inside the vim editor.
A screen capture of this searching is shown in the video below.




