Experiment with a new data and program files naming convention – part 2

This is a follow-up post to [Experiment with a new data and program files naming convention - part 1]. In this post I am sharing the program tools I made to facilitate the implementation of the scheme.

Previously …

In part 1 I talked about an experimental data file naming scheme that appends 2 hash strings to the end of a file name. These hash strings, one computed from the script filename and the other from the current git repository revision, are used to pinpoint the specific version of the specific script that created a certain piece of data, thus making back-tracking easier.

Based on the workflow, the tools are divided into 2 parts:

  1. Create and append hashes to filenames.
  2. Retrieve the script from given hashes.

Create and append hashes to filenames

I do most of my computations in Python, so this part is implemented in Python.

Create a (truncated) hash string from texts

Code first:

def createHash(texts, length=6, dec='[]'):
    '''Create a short hash string from a given string

    Args:
        texts (str): input string.
    Keyword Args:
        length (int): length of hash string to truncate.
        dec (str): decorator texts around hash.
    Returns:
        result (str): computed hash
    '''

    import hashlib
    hsh = hashlib.sha1(texts.encode('ascii'))
    result = hsh.hexdigest()
    if len(dec)==2:
        result = '%s%s%s' %(dec[0], result[:length], dec[1])
    else:
        result = result[:length]

    return result

Python has a built-in module, hashlib, for computing hashes. Here we are using a sha1 hash:

import hashlib
hsh = hashlib.sha1(texts.encode('ascii'))
result = hsh.hexdigest()

It doesn't really matter much which hashing algorithm is picked, as there are no security implications here. The length and dec optional arguments are used to truncate the 40-character sha1 digest to a much shorter string, and to surround it with a bracket pair for easier identification later. I chose to truncate at 6 characters; the chance of a hash collision should be very low at this length.
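To make the optional arguments concrete, here is a small usage sketch (the hash value in the comment is the one quoted in the hashSuffix() docstring further below):

# Usage sketch for createHash()
tag = createHash('compute_divergence.py')             # e.g. '[6ec0aa]'
bare = createHash('compute_divergence.py', dec='')    # same 6 characters, no brackets
longer = createHash('compute_divergence.py', length=10)  # keep 10 characters instead
print(tag, bare, longer)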

Query the hash of the current git revision

Code first:

def getGitHash():
    '''Get the hash of the current git commit'''

    from subprocess import Popen, PIPE

    try:
        proc=Popen(['git', 'rev-parse', '--short', 'HEAD'], stdout=PIPE,
                stderr=PIPE)
        stdout, stderr=proc.communicate()
    except Exception as e:
        raise Exception("Failed to obtain git hash.") from e

    # NOTE if not in a git repo, stdout=='', stderr contains the error
    result=stdout.decode().strip()

    return result

To get the current git revision hash from the command line:

git rev-parse --short HEAD

This gives the short 7-character sha1 hash corresponding to the HEAD of the current branch, e.g. 25f0bd6. Without the --short flag, it would return the full 40-character hash.

To call the above git command as an external process inside Python, use the subprocess.Popen function like so:

from subprocess import Popen, PIPE
proc=Popen(['git', 'rev-parse', '--short', 'HEAD'], stdout=PIPE,
	stderr=PIPE)
stdout, stderr=proc.communicate()

Upon successful execution, the desired hash string is obtained from the stdout variable, after a decode() and a strip() call.
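A minimal usage sketch; the empty-result guard is my own addition, reflecting the NOTE in the code that stdout comes back empty when the script is not run inside a git repository:

# Usage sketch: guard against running outside a git repository,
# in which case getGitHash() returns an empty string.
git_hash = getGitHash()
if not git_hash:
    raise RuntimeError('Not inside a git repository; no git hash to append.')
print(git_hash)   # e.g. '25f0bd6'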

Append hashes to output filenames

Code first:

def hashSuffix(fname, length=6, dec='[]', append_git_hash=True):
    '''Create a hash as file name suffix

    Args:
        fname (str): input file name to modify.
    Keyword Args:
        length (int): length of hash string to truncate.
        dec (str): decorator texts around hash.
        append_git_hash (bool): whether to append the git hash or not.
    Returns:
        result (str): fname appended with the hash string of the calling
            python script. E.g. Execute:
                ```
                python compute_divergence.py
                ```
            And <fname> is "div_p850_6_2000_suffix.nc".

            A hash is computed for the string "compute_divergence.py", and
            appended after <fname> to give "div_p850_6_2000_suffix[6ec0aa].nc"
            If <append_git_hash> is True, the hash will be appended further by
            the git hash of the current commit, e.g.:
                "div_p850_6_2000_suffix[6ec0aa-8b44fac].nc"

            These hash strings can be used to track down the script generating
            the results.
    '''
    import sys, os

    fname, ext=os.path.splitext(fname)
    script_file=os.path.split(sys.argv[0])[1]
    if append_git_hash:
        hash_str = createHash(script_file, length, dec='')
        githash=getGitHash()
        hash_str='%s%s-%s%s' %(dec[0], hash_str, githash, dec[1])
    else:
        hash_str = createHash(script_file, length, dec=dec)

    result='%s%s%s' %(fname, hash_str, ext)

    return result

What happens here is that given an input filename fname, we first separate the extension part from the basename, and then append the following composition to the end of the filename:

[script_hash-git_hash]

script_hash is the truncated hash computed from the basename of the main executed script, e.g. compute_divergence.py for a script located at /home/user/scripts/compute_divergence.py. The script path is obtained from sys.argv[0], and os.path.split() takes the basename.

We then use the createHash() function introduced above to compute the hash, which is joined to the git revision hash returned by getGitHash() with a hyphen (-). After adding the surrounding square brackets, the result is something like this:

[c5909e-5a77e20]

To use the hashSuffix() function in practice, I could call, for instance:

outputfile = hashSuffix('vertical_profiles.png')
figure.savefig(outputfile)

Then the image file would have a file name of vertical_profiles[c5909e-5a77e20].png. A signature has been planted in the data file name. Now we need tools to back-track from this signature to the script file.

Retrieve the script from given hashes

For this part I decided to use bash scripting instead, as Python would probably end up being too slow for tasks like traversing nested folder structures.

As usual, code first (my bash scripting skills are pretty amateur, so corrections are welcome):

#!/usr/bin/env bash

# View the script specified by the (short) hash of the script file and
# the (short) hash of the git commit

# Usage:

# hash_find.sh some_data_file[8d01c9-47823d8].png
# where:
#    8d01c9 is the 6-character hash of the script file.
#    47823d8 is the 7-character hash of the git commit.

# If given an optional 2nd argument NEXT_COMMIT, will get the revision
# following 47823d8:
# hash_find.sh some_data_file[8d01c9-47823d8].png 1

set -e
VERBOSE=0
TERM=gnome-terminal

BASE_FOLDER=~/scripts/   # script folder
TMP_SCRIPT_FOLDER=/tmp/  # tmp folder to save git show results

TARGET_FILE=$1
NEXT_COMMIT=${2:-0}

function extractHash() {
	file_name=$1
	choice=$2

	hash1=$(echo "$file_name" | sed -En 's/.*\[(.{6})-(.{7})\].*/\1/p')
	hash2=$(echo "$file_name" | sed -En 's/.*\[(.{6})-(.{7})\].*/\2/p')
	if [[ $choice -eq 1 ]]; then
		echo $hash1
	else
		echo $hash2
	fi
}

findFile () {

	hash=$1
	folder=$2
	find "$folder" -type f -name "*.py" |
		while IFS= read -r filename; do
			basename=$(basename "$filename")
			fname_hash=$(echo -n "$basename" | openssl dgst -sha1 -hex | cut -c10-15)
			if [[ "$fname_hash" = "$hash" ]]; then
				echo $filename
				break
			fi
		done
}

function findCommit() {

	TARGET_HASH=$1
	NEXT_COMMIT=$2

	# collect git log hash list
	declare -a hash_list
	tmp=$(git log | sed -n "s/^commit \(.*\)$/\1/p")
	for hii in $tmp; do
		hash_list+=($hii)
	done

	n=${#hash_list[@]}

	# Find next commit
	if [[ $NEXT_COMMIT == 1 ]]; then
		for (( i = 0; i < $n; i++ )); do
			hii=${hash_list[$i]}
			shii=$(echo $hii | cut -c1-7)
			if [[ $TARGET_HASH == $shii ]]; then
				if [[ $i -eq 0 ]]; then
					echo ${hash_list[$i]}
				else
					echo ${hash_list[$i-1]}
				fi
				break
			fi
		done
	else
		echo $TARGET_HASH
	fi
}

# extract hash strings from file name
script_hash=$(extractHash "$TARGET_FILE" 1)
git_hash=$(extractHash "$TARGET_FILE" 2)

if [[ -z $script_hash || -z $git_hash ]]; then
	exit 2
fi

if [[ $VERBOSE == 1 ]]; then
	echo 'script_hash=' $script_hash
	echo 'git_hash=' $git_hash
fi

# find script file from hash
script_file=$(findFile "$script_hash" "$BASE_FOLDER")
if [[ -z $script_file ]]; then
	exit 2
fi

if [[ $VERBOSE == 1 ]]; then
	echo $script_file
fi

# get dirname and basename from script file path
dirname=$(dirname $script_file)
basename=$(basename $script_file)

# cd into folder
cd $dirname

# get git commit
got_hash=$(findCommit "$git_hash" "$NEXT_COMMIT")
if [[ $VERBOSE == 1 ]]; then
	echo 'got_hash=' $got_hash
fi

# git show old version
tmpfile=${TMP_SCRIPT_FOLDER}/${basename}
if [[ $VERBOSE == 1 ]]; then
	echo "$tmpfile"
fi

#git show ${got_hash}:${basename} | less
git show ${got_hash}:${basename} > $tmpfile

if [[ $? == 0 ]]; then
	$TERM -- vim $tmpfile
fi

Some more explanations are given below.

Global parameters

At the top of the bash script I defined a few global parameters:

TERM gives which terminal emulator to use when opening the text editor (vim in this case) after the target script file has been located. I used gnome-terminal here; you could pick whichever you prefer.

BASE_FOLDER gives the root level folder within which the script file is sought. Narrowing down to a more specific folder would help speed up the process.

TMP_SCRIPT_FOLDER specifies the folder to save a temporary copy of the located script file. /tmp/ is a good option for such things.

The 1st command-line argument is the target data file name, e.g. vertical_profiles[c5909e-5a77e20].png. A 2nd, optional argument is stored in NEXT_COMMIT; if not given, it defaults to 0. If given a value of 1, the script searches for the git revision after the one encoded in the file name, e.g. the revision following 5a77e20.

Extract hashes from filename

The function extractHash is responsible for extracting the hash strings from the filename. It does so with 2 regular expression (regex) substitutions using sed:

sed -En 's/.*\[(.{6})-(.{7})\].*/\1/p'
sed -En 's/.*\[(.{6})-(.{7})\].*/\2/p'

The 2nd function argument, choice, decides which one to return: 1 for the 6-character script hash, 2 for the 7-character git hash.
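For readers more comfortable with Python, the same extraction can be expressed with a single regex. This snippet is only an illustration (the extract_hashes name is mine), not part of the bash tool:

import re

# Capture the 6-character script hash and the 7-character git hash
# between the square brackets, mirroring the two sed calls above.
def extract_hashes(file_name):
    match = re.search(r'\[(.{6})-(.{7})\]', file_name)
    return match.groups() if match else (None, None)

print(extract_hashes('vertical_profiles[c5909e-5a77e20].png'))
# ('c5909e', '5a77e20')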

Find the script file by matching script hash

This search is performed in the findFile function. To locate the script whose hash matches, I do a sweeping search within BASE_FOLDER using find:

find "$folder" -type f -name "*.py" |
	while IFS= read -r filename; do
		...
	done

All Python scripts (with a .py extension) inside BASE_FOLDER are iterated over, and a hash is computed from each basename. This time the hash is computed using openssl:

fname_hash=$(echo -n "$basename" | openssl dgst -sha1 -hex | cut -c10-15)

NOTE that the -n option to the echo command suppresses the trailing newline, so that only the characters of $basename get hashed. The cut command extracts characters 10-15, i.e. it skips the "(stdin)= " prefix that openssl prints before the digest and keeps the first 6 characters of the digest, matching the truncation done by createHash(). The search stops upon finding a match.
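The openssl pipeline should therefore produce the same 6 characters as createHash() does on the Python side. A quick way to convince yourself (this check is mine, not part of the tool):

import hashlib

# First 6 hex characters of the SHA-1 digest of the script basename:
# this is what both createHash() and the openssl pipeline compare against.
basename = 'compute_divergence.py'
print(hashlib.sha1(basename.encode('ascii')).hexdigest()[:6])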

Find the version of the script by matching git revision hash

This step is done in the findCommit function. After locating the target script, we still need to retrieve the matching version of it. Much of the hassle inside findCommit is only relevant when the NEXT_COMMIT argument is 1, i.e. when I want to look at the git revision immediately after the one specified in the filename. This is needed because sometimes I create a new script while the repository is in a clean state and run it to generate output data; the newly created script only gets registered into the git history in the next commit, rather than in the one whose hash was appended to the data filename.

To find this next commit, I first stored all git revision hash strings into an array:

declare -a hash_list
tmp=$(git log | sed -n "s/^commit \(.*\)$/\1/p")
for hii in $tmp; do
	hash_list+=($hii)
done

Then I iterate through the array to find the matching hash. Upon locating it, the commit after it, at array index i-1 (if there is one), is returned. Note that git log outputs history in reverse chronological order (newest first), so the next revision sits at index i-1, rather than i+1.
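If it helps to see the same logic outside bash, here is a rough Python sketch of the next-commit lookup. The find_next_commit name is mine, and it uses git log --format=%H as a convenient way to get the bare commit hashes (the bash version parses the default git log output with sed instead):

from subprocess import Popen, PIPE

def find_next_commit(target_hash):
    '''Return the commit immediately after <target_hash>, newest first.'''
    proc = Popen(['git', 'log', '--format=%H'], stdout=PIPE, stderr=PIPE)
    stdout, _ = proc.communicate()
    hashes = stdout.decode().split()   # newest commit first, like git log
    for i, full in enumerate(hashes):
        if full.startswith(target_hash):
            # The following commit is at index i-1; fall back to the match
            # itself when it is already the newest commit.
            return hashes[i - 1] if i > 0 else full
    return None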

Lastly, to get a historical revision of the given file in a git repository, one uses:

git show ${got_hash}:${basename} > $tmpfile

${got_hash} was found by the findCommit function, and ${basename} by the findFile function. The stdout of the command is redirected to a temporary file $tmpfile, which is then opened in vim in a new terminal window:

$TERM -- vim $tmpfile

Build a rofi script and bind a hotkey

This is one additional, "bonus" step. Code first:

#!/usr/bin/env bash

SEARCH_FOLDER1=~/scripts/
SEARCH_FOLDER2=~/datasets/

HASH_FIND_SCRIPT=~/.config/rofi/hash_find.sh

SELECT=$(find $SEARCH_FOLDER1 $SEARCH_FOLDER2 -type f -regex ".*\.\(png\|jpg\|csv\|pdf\|nc\|dat\)"| rofi -dmenu -lines 30 -config ~/.config/rofi/config_recoll)

if [[ -n $SELECT ]]; then
	bash "$HASH_FIND_SCRIPT" "$SELECT"
fi

SEARCH_FOLDER1 and SEARCH_FOLDER2 are where I save output data; they serve as the search area for the find command. The data files have extensions of jpg, png, csv, pdf, nc or dat. The output of the find command is piped into a rofi menu, from which I pick the file name I'd like to parse.

After picking the desired file, hash_find.sh, the bash script from the previous section, is executed on it. If the run succeeds, the located script file is opened in vim in a new terminal window.

A screen capture of this searching is shown in the video below.
