Previously …
This is a follow-up to Experiment with a new data and program files naming convention – part 1, in which I described an experimental naming scheme that appends 2 hash strings to the end of a data file name. These hash strings, one computed from the script file name and the other taken from the current git repository revision, pinpoint the specific version of the specific script that created a given piece of data, which makes back-tracking easier. In this post I share the tools I wrote to implement the scheme.
Based on the workflow, the tools are divided into 2 parts:
- Create and append hashes to filenames.
- Retrieve the script from given hashes.
Create and append hashes to filenames
I do most of my computations in Python, so this part is implemented in Python.
Create a (truncated) hash string from texts
Code first:
    def createHash(texts, length=6, dec='[]'):
        '''Create a short hash string from a given string

        Args:
            texts (str): input string.
        Keyword Args:
            length (int): length of hash string to truncate.
            dec (str): decorator texts around hash.
        Returns:
            result (str): computed hash
        '''

        import hashlib

        hsh = hashlib.sha1(texts.encode('ascii'))
        result = hsh.hexdigest()
        if len(dec)==2:
            result = '%s%s%s' %(dec[0], result[:length], dec[1])
        else:
            result = result[:length]

        return result
Python has a built-in module, hashlib, for the computation of hashes. Here we are using a SHA-1 hash:
    import hashlib
    hsh = hashlib.sha1(texts.encode('ascii'))
    result = hsh.hexdigest()
It doesn’t really matter which hashing algorithm is picked, as the hash has no security role here. The optional arguments length and dec are used to truncate the 40-digit SHA-1 hex digest to a much shorter length, and to surround it with a pair of brackets for easier identification later. I chose to truncate it at 6 digits, which leaves 16^6 (about 16.8 million) possible values, so the chance of two script names colliding should be very low for any realistic number of scripts.
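To put a rough number on the collision risk, the sketch below uses the standard birthday-problem approximation, assuming the truncated hashes behave like uniformly random hex strings. The function name collision_probability and the script counts are hypothetical, chosen only for illustration.

```python
import math


def collision_probability(n_names, length=6):
    """Approximate chance that at least two of n_names distinct script
    names share the same truncated hash of `length` hex digits.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2N)),
    where N = 16**length is the number of possible truncated hashes.
    """
    space = 16 ** length
    return 1.0 - math.exp(-n_names * (n_names - 1) / (2.0 * space))


if __name__ == "__main__":
    # hypothetical numbers of scripts under the search folder
    for n in (10, 100, 1000):
        print(n, collision_probability(n))
```

For around 100 scripts the estimate is roughly 0.03%; at 1000 scripts it grows to roughly 3%, at which point a longer truncation length may be worth considering.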
Query the hash of the current git revision
Code first:
    def getGitHash():
        '''Get the hash of the current git commit'''

        from subprocess import Popen, PIPE

        try:
            proc=Popen(['git', 'rev-parse', '--short', 'HEAD'],
                    stdout=PIPE, stderr=PIPE)
            stdout, stderr=proc.communicate()
        except:
            raise Exception("Failed to obtain git hash.")

        # NOTE if not in a git repo, stdout=='', stderr contains the error
        result=stdout.decode().strip()

        return result
To get the current git revision hash from the command line:
git rev-parse --short HEAD
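This gives a 7-digit SHA-1 hash corresponding to the HEAD of the current branch, e.g. 25f0bd6. (git may print more than 7 digits if that many are needed to keep the abbreviation unique in a large repository.) Without the --short flag, the full 40-digit hash is returned.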
To call the above git command as an external process inside Python, use the subprocess.Popen function like so:
    from subprocess import Popen, PIPE
    proc=Popen(['git', 'rev-parse', '--short', 'HEAD'], stdout=PIPE, stderr=PIPE)
    stdout, stderr=proc.communicate()
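Upon successful execution, the desired hash string is obtained from the stdout variable, after a decode() and a strip() call. Note that Popen does not raise an exception when git itself reports an error: outside a git repository, stdout is simply empty and the error message ends up in stderr.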
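Since a failing git command leaves an empty string rather than an exception, a variant that checks the return code may be useful. The following is only a sketch: get_git_hash_or_none is a hypothetical name, it uses subprocess.run (Python 3.5+) instead of Popen, and returning None on failure is an assumption about what the caller wants.

```python
import subprocess


def get_git_hash_or_none(cwd=None):
    """Return the short hash of HEAD, or None if it cannot be obtained."""
    try:
        proc = subprocess.run(
            ['git', 'rev-parse', '--short', 'HEAD'],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, cwd=cwd)
    except OSError:
        # the git executable is missing or cannot be run
        return None
    if proc.returncode != 0:
        # e.g. not inside a git repository; details are in proc.stderr
        return None
    # treat an empty answer as a failure as well
    return proc.stdout.decode().strip() or None
```

A caller can then decide whether to skip the git part of the suffix or to abort when None comes back.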
Append hashes to output filenames
Code first:
    def hashSuffix(fname, length=6, dec='[]', append_git_hash=True):
        '''Create a hash as file name suffix

        Args:
            fname (str): input file name to modify.
        Keyword Args:
            length (int): length of hash string to truncate.
            dec (str): decorator texts around hash.
            append_git_hash (bool): whether to append the git hash or not.
        Returns:
            result (str): fname appended with the hash string of the calling
                python script.

        E.g. Execute:

            python compute_divergence.py

        And <fname> is "div_p850_6_2000_suffix.nc". A hash is computed for the
        string "compute_divergence.py", and appended after <fname> to give
        "div_p850_6_2000_suffix[6ec0aa].nc"

        If <append_git_hash> is True, the hash will be appended further by
        the git hash of the current commit, e.g.:
        "div_p850_6_2000_suffix[6ec0aa-8b44fac].nc"

        These hash strings can be used to track down the script generating the
        results.
        '''

        import sys, os

        fname, ext=os.path.splitext(fname)
        # hash only the base name of the script, not its full path
        script_file=os.path.split(sys.argv[0])[1]

        if append_git_hash:
            hash_str = createHash(script_file, length, dec='')
            githash=getGitHash()
            hash_str='%s%s-%s%s' %(dec[0], hash_str, githash, dec[1])
        else:
            # use the base name here too, so both branches give the same script hash
            hash_str = createHash(script_file, length, dec=dec)

        result='%s%s%s' %(fname, hash_str, ext)

        return result
What happens here is that, given an input file name fname, we first separate the extension from the base name, and then append the following composition to the end of the base name:
[script_hash-git_hash]
script_hash is the truncated hash computed from the file name of the main executed script, e.g. compute_divergence.py for a script stored at /home/user/scripts/compute_divergence.py. The script path is obtained from sys.argv[0], and only its last component is hashed, so the hash does not depend on where the script is stored.
We then use the createHash() function introduced above to compute the hash, which is joined with a hyphen - to the git revision hash returned by the getGitHash() function. After adding the surrounding square brackets, the result is something like this:
[c5909e-5a77e20]
To use the hashSuffix() function in practice, I could call, for instance:
outputfile = hashSuffix('vertical_profiles.png')
figure.savefig(outputfile)
Then the image file would have a file name of vertical_profiles[c5909e-5a77e20].png. A signature has been planted in the data file name. Now we need to implement the tools to back-track the script file from this signature.
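The way the signature is assembled can also be tried without running a script inside a git repository. The sketch below passes the script name and the git hash in explicitly; compose_name is a hypothetical helper, and the git hash 5a77e20 is simply borrowed from the example above.

```python
import hashlib
import os


def compose_name(fname, script_name, git_hash, length=6):
    """Insert '[script_hash-git_hash]' between base name and extension."""
    base, ext = os.path.splitext(fname)
    # truncated SHA-1 of the script's base name, as in createHash()
    script_hash = hashlib.sha1(script_name.encode('ascii')).hexdigest()[:length]
    return '%s[%s-%s]%s' % (base, script_hash, git_hash, ext)


if __name__ == "__main__":
    print(compose_name('vertical_profiles.png',
                       'compute_divergence.py', '5a77e20'))
```

Note that os.path.splitext only splits off the last extension, so a name like data.tar.gz gets the signature placed before .gz.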
Retrieve the script from given hashes
For this part I decided to use bash scripting instead, as Python would probably end up being too slow for tasks like traversing nested folder structures.
As usual, code first (My bash scripting skill is pretty amateur level, so corrections are welcome):
    #!/usr/bin/env bash

    # View the script specified by the (short) hash of the script file and
    # the (short) hash of the git commit

    # Usage:
    #     hash_find.sh some_data_file[8d01c9-47823d8].png
    # where:
    #     8d01c9 is the 6-digits hash of the script file.
    #     47823d8 is the 7-digits hash of the git commit
    # If given an optional 2nd argument NEXT_COMMIT, will get the revision
    # following 47823d8:
    #     hash_find.sh some_data_file[8d01c9-47823d8].png 1

    set -e

    VERBOSE=0
    TERM=gnome-terminal
    BASE_FOLDER=~/scripts/     # script folder
    TMP_SCRIPT_FOLDER=/tmp/    # tmp folder to save git show results

    TARGET_FILE=$1
    NEXT_COMMIT=${2:-0}        # default to 0 when the 2nd argument is omitted

    function extractHash() {
        file_name=$1
        choice=$2
        hash1=$(echo "$file_name" | sed -En 's/.*\[(.{6})-(.{7})\].*/\1/p')
        hash2=$(echo "$file_name" | sed -En 's/.*\[(.{6})-(.{7})\].*/\2/p')
        if [[ $choice -eq 1 ]]; then
            echo $hash1
        else
            echo $hash2
        fi
    }

    findFile () {
        hash=$1
        folder=$2
        find "$folder" -type f -name "*.py" | while IFS= read -r filename; do
            basename=$(basename "$filename")
            fname_hash=$(echo -n "$basename" | openssl dgst -sha1 -hex | cut -c10-15)
            if [[ "$fname_hash" = "$hash" ]]; then
                echo $filename
                break
            fi
        done
    }

    function findCommit() {
        TARGET_HASH=$1
        NEXT_COMMIT=$2

        # collect git log hash list
        declare -a hash_list
        tmp=$(git log | sed -n "s/^commit \(.*\)$/\1/p")
        for hii in $tmp; do
            hash_list+=($hii)
        done
        n=${#hash_list[@]}

        # Find next commit
        if [[ $NEXT_COMMIT == 1 ]]; then
            for (( i = 0; i < $n; i++ )); do
                hii=${hash_list[$i]}
                shii=$(echo $hii | cut -c1-7)
                if [[ $TARGET_HASH == $shii ]]; then
                    if [[ $i -eq 0 ]]; then
                        echo ${hash_list[$i]}
                    else
                        echo ${hash_list[$i-1]}
                    fi
                    break
                fi
            done
        else
            echo $TARGET_HASH
        fi
    }

    # extract hash strings from file name
    # (quoted, so that the square brackets are not treated as a glob pattern)
    script_hash=$(extractHash "$TARGET_FILE" 1)
    git_hash=$(extractHash "$TARGET_FILE" 2)

    if [[ -z $script_hash || -z $git_hash ]]; then
        exit 2
    fi

    if [[ $VERBOSE == 1 ]]; then
        echo 'script_hash=' $script_hash
        echo 'git_hash=' $git_hash
    fi

    # find script file from hash
    script_file=$(findFile $script_hash $BASE_FOLDER)

    if [[ -z $script_file ]]; then
        exit 2
    fi

    if [[ $VERBOSE == 1 ]]; then
        echo $script_file
    fi

    # get dirname and basename from script file path
    dirname=$(dirname "$script_file")
    basename=$(basename "$script_file")

    # cd into folder
    cd "$dirname"

    # get git commit (the following one if NEXT_COMMIT is 1)
    got_hash=$(findCommit $git_hash $NEXT_COMMIT)

    if [[ $VERBOSE == 1 ]]; then
        echo 'got_hash=' $got_hash
    fi

    # git show old version
    tmpfile=${TMP_SCRIPT_FOLDER}/${basename}

    if [[ $VERBOSE == 1 ]]; then
        echo "$tmpfile"
    fi

    #git show ${got_hash}:${basename} | less
    # test the command directly: under set -e a failing command outside a
    # condition would abort the script before $? could be checked
    if git show ${got_hash}:${basename} > "$tmpfile"; then
        $TERM -- vim "$tmpfile"
    fi
Some more explanations are given below.
Global parameters
At the top of the bash script I defined a few global parameters:
TERM gives the terminal emulator to use when opening the text editor (vim in this case) after locating the target script file. I used gnome-terminal here; you could pick whichever you prefer.
BASE_FOLDER gives the root-level folder within which the script file is sought. Narrowing it down to a more specific folder helps speed up the search.
TMP_SCRIPT_FOLDER specifies the folder in which to save a temporary copy of the located script file. /tmp/ is a good option for such things.
The 1st command line argument is the target data file name, e.g. vertical_profiles[c5909e-5a77e20].png. An optional 2nd argument is stored in NEXT_COMMIT. If not given, it defaults to 0. If given a value of 1, the script searches for the git revision after the one given in the file name, e.g. the revision following 5a77e20.
Extract hashes from filename
The function extractHash is responsible for extracting the hash strings from the file name. It does so with 2 regular expression (regex) substitutions using sed:
sed -En 's/.*\[(.{6})-(.{7})\].*/\1/p'
sed -En 's/.*\[(.{6})-(.{7})\].*/\2/p'
The 2nd function argument, choice, decides which one to return: 1 for the 6-digit script hash, 2 for the 7-digit git hash.
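For comparison, the same extraction can be sketched in Python with a single regex search. The name extract_hashes is hypothetical, and restricting the pattern to lowercase hex digits (rather than any character, as the sed expressions do) is an assumption that holds for SHA-1 hex digests.

```python
import re

# Assumption: both hashes are lowercase hex, 6 and 7 digits long.
HASH_PATTERN = re.compile(r'\[([0-9a-f]{6})-([0-9a-f]{7})\]')


def extract_hashes(file_name):
    """Return (script_hash, git_hash), or None if no signature is found."""
    match = HASH_PATTERN.search(file_name)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(extract_hashes('vertical_profiles[c5909e-5a77e20].png'))
```

Returning both hashes at once avoids running the same pattern twice, which is what the choice argument works around in the bash version.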
Find the script file by matching script hash
This search is performed in the findFile function. To locate the script with the matching hash, I did a sweeping search within the BASE_FOLDER using find:
find "$folder" -type f -name "*.py" | while IFS= read -r filename; do ... done
All Python scripts (with a .py extension) inside the BASE_FOLDER are iterated over, and a hash is computed from each file name. This time the hash is computed using openssl:
fname_hash=$(echo -n "$basename" | openssl dgst -sha1 -hex | cut -c10-15)
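Note that the -n option to the echo command suppresses the trailing newline character that echo would otherwise append after $basename, which would change the hash. The cut command keeps characters 10 to 15 of the openssl output. These are the first 6 hex digits of the digest when the output starts with the 9-character prefix "(stdin)= "; some openssl versions print a different prefix (e.g. "SHA1(stdin)= "), in which case the column range needs adjusting. The search is stopped upon finding a match.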
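The matching step can also be sketched with Python's hashlib, which avoids depending on the column layout of the openssl output. This is not the approach used in the bash script; find_script is a hypothetical name, and encoding the file names as UTF-8 is an assumption (it gives the same bytes as ASCII for plain ASCII names).

```python
import hashlib
import os


def find_script(target_hash, folder, length=6):
    """Return the path of the first .py file under `folder` whose base
    name hashes to `target_hash`, or None if there is no match."""
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            if not name.endswith('.py'):
                continue
            # hash the base name only, as hashSuffix() does
            digest = hashlib.sha1(name.encode('utf-8')).hexdigest()
            if digest[:length] == target_hash:
                return os.path.join(dirpath, name)
    return None
```

As in the bash version, the search stops at the first match, so two scripts with the same base name in different folders cannot be told apart by the script hash alone.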
Find the version of the script by matching git revision hash
This step is done in the findCommit function. After locating the target script, we still need to retrieve the matching version of the script. Much of the hassle inside the findCommit function is actually only relevant if the NEXT_COMMIT argument is 1, i.e. when I want to look at the git revision immediately after the one specified in the file name. This is because sometimes I create a new script when the repo is otherwise clean, and execute it to generate output data before committing it. The newly created script then enters the git history only in the next commit, rather than the one whose hash has been appended to the data file name.
To find this next commit, I first stored all git revision hash strings into an array:
    declare -a hash_list
    tmp=$(git log | sed -n "s/^commit \(.*\)$/\1/p")
    for hii in $tmp; do
        hash_list+=($hii)
    done
Then I iterate through the array to find the matching hash. Upon locating it, the revision after it, at array index i-1, is returned if there is one; if the match is already the newest commit (i = 0), the match itself is returned. Note that the git log command outputs history in reverse chronological order, so the next revision is at i-1, rather than i+1.
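The index logic is easy to get backwards, so here is a small Python sketch of it operating on a plain list. find_next_commit is a hypothetical name, the hashes in the example are made up, and a linear history listed newest first is assumed.

```python
def find_next_commit(target_short, hashes_newest_first):
    """Return the full hash of the commit made right after the one whose
    hash starts with `target_short`.

    Falls back to the matching commit itself when it is the newest one,
    and returns None when nothing matches.
    """
    for i, full_hash in enumerate(hashes_newest_first):
        if full_hash.startswith(target_short):
            # index 0 is the newest commit, so it has no successor
            return hashes_newest_first[i - 1] if i > 0 else full_hash
    return None


if __name__ == "__main__":
    # made-up hashes, newest first, as git log would list them
    log = ['cccccccc11', 'bbbbbbbb22', 'aaaaaaaa33']
    print(find_next_commit('bbbbbbb', log))
```

Here the commit following bbbbbbb is the one listed just before it, because the list runs from newest to oldest.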
Lastly, to get a historical revision of the given file in a git repository, one uses:
git show ${got_hash}:${basename} > $tmpfile
${got_hash} was found by the findCommit function, and ${basename} comes from the script path found by the findFile function. The stdout of the command is redirected to a temporary file $tmpfile, which is then opened in vim in a new terminal window:
$TERM -- vim $tmpfile
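One detail of the commit:path syntax is worth keeping in mind: git resolves the path relative to the repository root unless it starts with ./ or ../, in which case it is relative to the current directory. The sketch below shows the ./ form from Python; revision_spec and show_revision are hypothetical helpers, and this is a variation on the command above for scripts kept in a subfolder of the repository, not what the bash script does.

```python
import subprocess


def revision_spec(commit, basename):
    """Build the '<commit>:./<file>' argument for git show.

    The './' prefix makes git resolve the file relative to the current
    directory instead of the repository root.
    """
    return '%s:./%s' % (commit, basename)


def show_revision(commit, basename, folder):
    """Return the content of `basename` at `commit` as bytes.

    `folder` is the directory containing the script; raises
    subprocess.CalledProcessError if git show fails.
    """
    return subprocess.check_output(
        ['git', 'show', revision_spec(commit, basename)], cwd=folder)
```

The returned bytes can be written to a temporary file and opened in an editor, in the same way as the redirected output in the bash script.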
Build a rofi script and bind a hotkey
This is one additional, "bonus" step. Code first:
    #!/usr/bin/env bash

    SEARCH_FOLDER1=~/scripts/
    SEARCH_FOLDER2=~/datasets/
    HASH_FIND_SCRIPT=~/.config/rofi/hash_find.sh

    SELECT=$(find $SEARCH_FOLDER1 $SEARCH_FOLDER2 -type f -regex ".*\.\(png\|jpg\|csv\|pdf\|nc\|dat\)"| rofi -dmenu -lines 30 -config ~/.config/rofi/config_recoll)

    if [[ -n $SELECT ]]; then
        bash "$HASH_FIND_SCRIPT" "$SELECT"
    fi
SEARCH_FOLDER1 and SEARCH_FOLDER2 are where I save output data; these are used as the search area of the find command. The data files have extensions of jpg, png, csv, pdf, nc or dat. The output of the find command is piped to a rofi interface, from which I pick the file name I’d like to parse.
After picking the desired file, hash_find.sh, which is the bash script in the above section, is executed on the selected file. Once it has run successfully, the located script file is opened in a new terminal window, inside the vim editor.
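The candidate list that find feeds into rofi is just a recursive listing filtered by extension. A Python sketch of that filter is given below; list_data_files is a hypothetical name, and the extension tuple simply mirrors the regex in the script above.

```python
import os

# same extensions as in the find -regex expression
DATA_EXTENSIONS = ('.png', '.jpg', '.csv', '.pdf', '.nc', '.dat')


def list_data_files(*folders):
    """Yield paths of files under `folders` with a data file extension."""
    for folder in folders:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                # str.endswith accepts a tuple of suffixes
                if name.endswith(DATA_EXTENSIONS):
                    yield os.path.join(dirpath, name)
```

Each yielded path could then be offered in a menu, one per line, as the rofi interface does.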
A screen capture of this searching is shown in the video below.