The issue
For researchers in many fields today, myself included, a good chunk of the work consists of computation, often in the form of writing and running programs or scripts. As my main programming language is Python, I write Python scripts on a regular basis, sometimes several a day. The scripts serve different purposes: some do data pre-processing, some produce intermediate results, and others generate graphs. Over time, the number of scripts for a single project can add up to a notable value. I did a quick count before writing this post, and found that for the project I’m currently working on, I have written 110 scripts, which have created about 300 graphs and 1323 intermediate data files. This may not seem overwhelming to some, but it is already posing a challenge to my digital data management skills.
Take for instance a recurring issue that I keep having: how to properly name the script files, and the outputs they generate, so that you can quickly and reliably locate the script responsible for creating a certain piece of data or graph?
For a relatively small project with only a few scripts, this might be a trivial task not worth special attention. But as you write more and more code, testing out ideas, running experiments, making mistakes and corrections along the way, the complexity of managing these scripts and their output data scales up quickly. Several times, I have found myself having to set aside an entire morning to print out a list of script file names on a sheet, and go through the list reminding myself of what each one does and where its output data are stored.
And it is not fun.
So I decided to pause and think about what I’m doing wrong, and see if I could come up with some better file naming practices. This is the 1st part on this topic; I’m still experimenting with these new ideas, so changes are to be expected.
Data file naming convention
I came up with this data file naming convention years ago, and have been sticking to it ever since:
<varid>_<level>_<time_step>_<year>_<suffix>.ext
Where:
varid: id of the variable stored in the file. E.g. t for temperature, u for zonal wind, etc.
level: vertical level. E.g. p850 for the 850 hPa pressure level, s for surface, m30 for the 30th model sigma level.
time_step: temporal resolution of the data. E.g. 6 for 6-hourly, m for monthly.
year: year in the format YYYY. If the data cover only a single month, YYYY-mm. If multiple years, YYYY-YYYY.
suffix: any additional information necessary to label the data, using - to concatenate words. E.g. preprocessed for data that have been preprocessed, preprocessed-cln for data that have been preprocessed, then column integrated.
.ext: file extension.
A few examples:
sst_s_m_1984-1989_ori.nc: original (ori) monthly (m) SST (sst) data during 1984-1989. By definition, SST data are at surface level s.
z_p1000-200_6_2010_10d-highpass.nc: 10-day high-pass filtered (10d-highpass) geopotential height (z) data at 6-hourly interval (6), within 1000-200 hPa pressure levels (p1000-200) in 2010.
This convention uses underscores (_) to separate the 5 main fields, and dashes (-) as delimiters within fields, so the fields can be easily obtained using a split('_') command in Python.
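A minimal sketch of that split, assuming the file name follows the convention above. The names DataName and parse_data_name are made up for this illustration. The extension is peeled off first, and the split is limited to 4 cuts so that the suffix stays in one piece:

```python
import os
from typing import NamedTuple


class DataName(NamedTuple):
    """The 5 fields of a data file name, plus the extension."""
    varid: str
    level: str
    time_step: str
    year: str
    suffix: str
    ext: str


def parse_data_name(filename: str) -> DataName:
    """Split a data file name into its fields.

    Raises ValueError if the name has fewer than 5 fields.
    """
    stem, ext = os.path.splitext(os.path.basename(filename))
    # At most 4 cuts: anything after the 4th underscore is the suffix.
    varid, level, time_step, year, suffix = stem.split('_', 4)
    return DataName(varid, level, time_step, year, suffix, ext)


fields = parse_data_name('z_p1000-200_6_2010_10d-highpass.nc')
print(fields.varid, fields.level, fields.time_step, fields.year, fields.suffix)
# prints: z p1000-200 6 2010 10d-highpass
```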
The first 4 fields have very specific meanings, and can be used to
programmatically search and query some data stored on disk.
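As a rough illustration of such a query, the sketch below filters a list of file names by any combination of the first 4 fields. The function find_data and the third file name in the list are hypothetical, made up for this example. In practice the list could come from os.listdir() on the data folder.

```python
def find_data(filenames, varid=None, level=None, time_step=None, year=None):
    """Return the names whose first 4 fields match the given values.

    A field left as None matches anything.
    """
    wanted = (varid, level, time_step, year)
    matches = []
    for name in filenames:
        fields = name.split('_')
        if len(fields) < 5:
            continue  # does not follow the 5-field convention
        if all(w is None or w == f for w, f in zip(wanted, fields)):
            matches.append(name)
    return matches


names = [
    'sst_s_m_1984-1989_ori.nc',
    'z_p1000-200_6_2010_10d-highpass.nc',
    'u_p850_6_2010_ori.nc',  # hypothetical file, for illustration only
]
print(find_data(names, time_step='6', year='2010'))
# prints: ['z_p1000-200_6_2010_10d-highpass.nc', 'u_p850_6_2010_ori.nc']
```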
The suffix
field is a kind of garbage collection field where one can
put arbitrary information, for instance, one can use a dash-
concatenated string to indicate the workflow from which this piece
of data is created, e.g. preprocessed-10day-highpass-cln-EOF1
, I
don’t need to explain the very meaning of it, the point is that
whatever string you decide to use, it only needs to make sense to
you, or your team member who shares the data with you. This suffix
field is also the one I am going to exploit implementing the
new naming convention, as discussed later.
Script naming convention
Unlike with data files, I had no systematic naming convention for script files, and have been naming them rather casually and carelessly. This has proved to be a big issue, as mentioned previously. So I decided to experiment with this new script file naming format:
<prefix>_<main>[_<suffix>].ext
The prefix field tells the overall category of the script. I have come up with 7 so far:
util: store utility functions/classes.
plot: plotting script.
comp: computation script, save intermediate results.
cppl: do computations, then create plot.
prep: pre-processing, preparation script.
test: testing script.
dprt: deprecated script.
For each prefix, there is also a shorter 3-char version (e.g. utl for util, cmp for comp), and a 2-char version (ut, cp). But I have yet to decide which one to stick to.
The main field is the main descriptive part of the file name. I haven’t thought of any further guidelines regarding this part, other than that it might be good practice to start with a verb, e.g. decompose-EOF-of-z-anomalies, something like that.
The optional suffix part is mainly used to distinguish closely related scripts, for instance, V1, V2 to denote different versions, or some arbitrary text to distinguish similar computations done with different approaches, different parameter choices, etc. I will need to think further about this as I move along.
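A small checker can help keep script names consistent with this format. The sketch below is one possible way to do it, not a settled part of the convention. It assumes the 4-char prefixes, and it assumes that the main and suffix parts contain no underscores (words are joined with dashes, as in the example above). The names PREFIXES, SCRIPT_NAME and parse_script_name, and the plot script name used in the example, are made up for illustration.

```python
import re

# The 7 prefixes listed above (4-char versions).
PREFIXES = ('util', 'plot', 'comp', 'cppl', 'prep', 'test', 'dprt')

# <prefix>_<main>[_<suffix>].ext
# Assumption: main and suffix contain no underscores or dots.
SCRIPT_NAME = re.compile(
    r'^(?P<prefix>{})_(?P<main>[^_.]+)(?:_(?P<suffix>[^_.]+))?\.(?P<ext>\w+)$'
    .format('|'.join(PREFIXES))
)


def parse_script_name(filename):
    """Return the fields of a script name as a dict, or None if it does not fit."""
    m = SCRIPT_NAME.match(filename)
    return m.groupdict() if m else None


print(parse_script_name('comp_decompose-EOF-of-z-anomalies_V2.py'))
print(parse_script_name('plot_draw-sst-map.py'))  # suffix is None
print(parse_script_name('pacific_straighten_vector_relativeV3.py'))  # None
```

A script named under the old habits, like the last one, is rejected, which makes it easy to list the files that still need renaming.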
Use git as a version control system
Before discussing the new naming convention I’m experimenting with, it might be worth emphasizing the potential benefits of using git as a version control system. I have another post, Use git in your programming or research work, that covers some basic usage of git in my everyday work. I highly recommend at least trying out git to see whether it fits in and improves your workflow. As mentioned in that post, it is not just about technical convenience: I find that at the end of a day’s work, writing up and committing the day’s progress is particularly relaxing and reassuring. Because git tracks and stores version history, I’m going to build on it in a new data naming scheme that helps track the script files and the data they create.
Bind the script and data – the design choice
The goal is simple: I’d like to make an association between a certain
script file, and the data/graph it creates, such that I can easily
trace back the original version of the program, when examining
the data/graph.
There must be more than one way of achieving this. Some time ago I created a function that automatically saves a copy of the script being executed into a new sub-folder, into which the script also writes its outputs. But I rarely use it in practice, because it creates many redundant sub-folders, making subsequent data processing more cumbersome.
The new approach I’m experimenting with now is: compute a short hash string from the script’s file name, append to it the short hash string of the current git status, and add both to the suffix field of the data file name.
For example, a netCDF data file is named:
qflux_s_6_2004_straighten-normed-vec-relative-to-AR-movement[b63e61-6159eb8].nc
The part starting from straighten till the extension is the suffix field of the file name, and the 2 dash-separated strings inside the square brackets are the 2 hash strings:
b63e61: the first 6 hex digits of the sha1 hash of the script file name, which in this case is pacific_straighten_vector_relativeV3.py (I haven’t fully transitioned to the new naming scheme). I have to admit that even for me, this file name is a bit cryptic.
6159eb8: the first 7 digits of the hash computed by git, which depends on the current status of the git repository.
The surrounding square brackets are added purely for easier identification of the hashes, both by eye and by a program.
Therefore, the 1st hash serves as an ID of the script, such that I can quickly locate the program responsible for creating a certain result; and the 2nd hash pinpoints a specific version of the script. Combined, these 2 hashes unambiguously associate a data file and the script that creates it.
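To make the idea concrete, here is a minimal sketch of how such a tag could be built. It is not the actual implementation. It assumes that the script hash is taken over the bare file name (without the folder path), and that the git hash is the short hash of the current commit, as printed by git rev-parse --short HEAD. Under that assumption, uncommitted changes are not reflected in the hash. The names script_hash, git_hash and add_hashes, and the file names in the example call, are made up for illustration.

```python
import hashlib
import os
import subprocess


def script_hash(script_path, length=6):
    """First `length` hex digits of the sha1 of the script's file name."""
    name = os.path.basename(script_path)
    return hashlib.sha1(name.encode('utf-8')).hexdigest()[:length]


def git_hash(repo_dir='.'):
    """Short hash of the current commit (assumes git is installed and
    repo_dir is inside a git repository)."""
    out = subprocess.run(
        ['git', 'rev-parse', '--short', 'HEAD'],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def add_hashes(data_path, script_path, revision=None):
    """Insert '[<script hash>-<git hash>]' right before the file extension.

    If revision is None, it is looked up in the script's own folder.
    """
    if revision is None:
        revision = git_hash(os.path.dirname(os.path.abspath(script_path)))
    stem, ext = os.path.splitext(data_path)
    return '{}[{}-{}]{}'.format(stem, script_hash(script_path), revision, ext)


# Hypothetical names; the revision is passed in so the example runs
# outside of a git repository as well.
print(add_hashes('qflux_s_6_2004_demo.nc', 'comp_demo.py', revision='6159eb8'))
```

Inside a script, passing __file__ as script_path and leaving revision unset would tag the output with the script's own name hash and the repository's current commit.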
This can also be used for other types of data, like image files, e.g. straightened_total_relative_trans_qflux[6309d4-6159eb8].pdf. It can be seen that this figure is created from a different script (with short hash 6309d4) but in the same git revision (the git hash is the same, 6159eb8).
Of course I’m not manually computing these hashes and typing them into the file names. I’ve created functions to do this hash-appending operation, and a bash script to search through a given folder for a target hash. These will be introduced in the 2nd part of this topic.
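In the meantime, the search side can be sketched in a few lines of Python. This is an illustrative stand-in, not the bash script mentioned above. It assumes the tag sits right before the extension and has the form [6 hex digits-7 or more hex digits]. The names HASH_TAG, read_hashes and find_by_hash are hypothetical.

```python
import os
import re

# '[<script hash>-<git hash>]' immediately before the file extension.
HASH_TAG = re.compile(r'\[([0-9a-f]{6})-([0-9a-f]{7,})\](?=\.[^.]+$)')


def read_hashes(filename):
    """Return (script_hash, git_hash) from a file name, or None if untagged."""
    m = HASH_TAG.search(filename)
    return m.groups() if m else None


def find_by_hash(folder, target):
    """Walk folder and return the paths tagged with target as either hash."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            hashes = read_hashes(name)
            if hashes and target in hashes:
                found.append(os.path.join(root, name))
    return sorted(found)


print(read_hashes('straightened_total_relative_trans_qflux[6309d4-6159eb8].pdf'))
# prints: ('6309d4', '6159eb8')
```

Calling find_by_hash with a script hash lists everything that script has produced, while calling it with a git hash lists everything produced at that revision.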
If you also have some experience designing a file naming convention, feel free to leave a comment below.