Metadata defines the NetCDF format
Network Common Data Form (NetCDF) is arguably the most popular data format in the fields of atmospheric science, meteorology and oceanography. To quote the description of the format from the official site:
Data in netCDF format is:
- Self-Describing. A netCDF file includes information about the data it contains.
- Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
- Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
- Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
- Sharable. One writer and multiple readers may simultaneously access the same netCDF file.
- Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.
The first defining attribute of the NetCDF format is Self-describing, and I believe it is also the most important one. NetCDF files contain not only data, but also metadata that describe what the data are about: the name, units, data type, data creators, etc. Anyone with a suitable tool should be able to get at least a basic understanding of the data just by reading the metadata.
There exist libraries to access the NetCDF format in C, Fortran, C++, Python, and other languages. For Python, the relevant library is called netcdf4. On top of that, there are also other packages that offer more advanced functionalities, for instance CDAT, iris and xarray. I'm still on my way learning xarray, and have had little experience with iris, so the comparison will be centered around CDAT and netcdf4.
We will be using some sample NetCDF data as an illustration. You can obtain the zipped data file from here. The data are the monthly averaged global Sea Surface Temperature (SST) of the year 1989, from the ERA-Interim reanalysis dataset. Below is some basic information about the file:
- File name: sst_s_m_1989_ori-preprocessed.nc
- File size: ~2.1 MB
- Data rank: 4
- Data dimensions: 12 (time) x 1 (level) x 241 (latitude) x 480 (longitude)
- Data id: ‘sst’
- Data unit: ‘K’
- Data long_name: ‘Sea surface temperature’
- Horizontal resolution: 0.75 x 0.75 degree latitude/longitude
- Temporal resolution: monthly
The data themselves are not really important here; you can follow along the article without using any actual data.
Read NetCDF data using CDAT
Below is the code snippet to read in the data using the cdms2 module of the CDAT package:

```python
import os
import cdms2 as cdms

abpath_in = os.path.join(SOURCEDIR, 'sst_s_m_1989_ori-preprocessed.nc')
print('\n### <netcdf_manipulation>: Read in file:\n', abpath_in)
fin = cdms.open(abpath_in, 'r')
var_cdat = fin('sst')
fin.close()
```
SOURCEDIR is the path to the folder where you saved the sample data file.
We assigned the read-in data to the var_cdat variable, which, by the definition of the NetCDF format, should be a combination of data values and metadata. Let's do some queries to check this out:
```python
print('type of var_cdat =', type(var_cdat))
print('id = ', var_cdat.id)
print('standard_name = ', var_cdat.standard_name)
print('long_name = ', var_cdat.long_name)
print('unit = ', var_cdat.units)
print('shape = ', var_cdat.shape)
```
The output we get:
```
type of var_cdat = <class 'cdms2.tvariable.TransientVariable'>
id =  sst
standard_name =  Sea surface temperature
long_name =  Sea surface temperature
unit =  K
shape =  (12, 1, 241, 480)
```
The var_cdat variable is a cdms2.tvariable.TransientVariable, which is a wrapper class built on top of the numpy.ma.core.MaskedArray class. Masked arrays facilitate the manipulation of data with missing values, in this case the land areas where SST is not defined.
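As a quick illustration of how masked arrays handle missing values (the series and the -9999.0 fill value below are made-up placeholders, not taken from the sample file):

```python
import numpy as np

# A made-up 1D "SST" series where -9999.0 marks a land point:
sst = np.ma.masked_values([285.3, -9999.0, 290.1], -9999.0)

# Reductions skip the masked land point automatically:
print(sst.mean())   # the land point is excluded: (285.3 + 290.1) / 2
```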
The id, standard_name, long_name and units attributes tell us what the data are about and in what unit they are measured. The shape attribute tells the rank and the ordering of the data array.
Now let's do some queries on the coordinate system of the data. This is how you obtain the time and latitude axes of a TransientVariable:
```python
timeax = var_cdat.getTime()
timeax_comp = timeax.asComponentTime()
print('timeax =')
print(timeax[:])
print('timeax_comp =')
print(timeax_comp)
print('id of time =', timeax.id)
print('unit of time =', timeax.units)
print('\n')

latax = var_cdat.getLatitude()
lonax = var_cdat.getLongitude()
print('id of latax =', latax.id)
print('unit of latax =', latax.units)
```
Below is the output:
```
timeax =
[30680. 30711. 30740. 30771. 30801. 30832. 30862. 30893. 30924. 30954.
 30985. 31015.]
timeax_comp =
[1984-1-1 0:0:0.0, 1984-2-1 0:0:0.0, 1984-3-1 0:0:0.0, 1984-4-1 0:0:0.0,
 1984-5-1 0:0:0.0, 1984-6-1 0:0:0.0, 1984-7-1 0:0:0.0, 1984-8-1 0:0:0.0,
 1984-9-1 0:0:0.0, 1984-10-1 0:0:0.0, 1984-11-1 0:0:0.0, 1984-12-1 0:0:0.0]
id of time = time
unit of time = days since 1900-1-1
id of latax = latitude
unit of latax = degrees_north
```
Notice that we obtained the time axis/dimension (I'll be using these two terms interchangeably) using timeax = var_cdat.getTime().
timeax[:] shows the values of the time axis, which is a 1D array of floats. This is typically how a NetCDF file encodes time stamps: it uses a numerical array to store the amount of time elapsed since a given reference time point. In this case, the reference is days since 1900-1-1 (with 00:00 UTC being implicit).
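The decoding itself is just date arithmetic. A quick sanity check with the standard library, assuming the standard (Gregorian) calendar:

```python
from datetime import datetime, timedelta

ref = datetime(1900, 1, 1)            # the reference time point
print(ref + timedelta(days=30680.0))  # 1984-01-01 00:00:00
```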
However, this is not particularly readable, so the asComponentTime() method does the translation work for you, and the output is:

```
timeax_comp =
[1984-1-1 0:0:0.0, 1984-2-1 0:0:0.0, 1984-3-1 0:0:0.0, 1984-4-1 0:0:0.0,
 1984-5-1 0:0:0.0, 1984-6-1 0:0:0.0, 1984-7-1 0:0:0.0, 1984-8-1 0:0:0.0,
 1984-9-1 0:0:0.0, 1984-10-1 0:0:0.0, 1984-11-1 0:0:0.0, 1984-12-1 0:0:0.0]
```
This is much more human-readable. That is to say, 30680.0 days since 1900-01-01 00:00 UTC is 1984-1-1 0:0:0.0. There are also different types of calendars, but I'm not going to bother you with such nuances for now.
The latitude axis is obtained using var_cdat.getLatitude(); it has an id of latitude, and a unit of degrees_north. We can also query the values of the latitude axis with latax[:], which again would be a numerical array. This is not shown for brevity.
Read NetCDF data using netcdf4
We repeat similar reading and querying operations using the netcdf4 package, with the code shown below:
```python
import os
from netCDF4 import Dataset, num2date

abpath_in = os.path.join(SOURCEDIR, 'sst_s_m_1989_ori-preprocessed.nc')
print('\n### <netcdf_manipulation>: Read in file:\n', abpath_in)
fin = Dataset(abpath_in, 'r')
var_nc4 = fin.variables['sst']

print('type of var_nc4 =', type(var_nc4))
print('name = ', var_nc4.name)
print('standard_name = ', var_nc4.standard_name)
print('long_name = ', var_nc4.long_name)
print('unit = ', var_nc4.units)
print('shape = ', var_nc4.shape)
print('\n')

timeax_nc4 = fin.variables['time']
timeax_comp_nc4 = num2date(timeax_nc4[:], timeax_nc4.units,
                           only_use_cftime_datetimes=False,
                           only_use_python_datetimes=True)
print('timeax =')
print(timeax_nc4[:])
print('timeax_comp =')
print(timeax_comp_nc4)
print('name of time =', timeax_nc4.name)
print('unit of time =', timeax_nc4.units)
print('\n')

latax_nc4 = fin.variables['latitude']
print('name of latax =', latax_nc4.name)
print('unit of latax =', latax_nc4.units)
```
Notice that we obtained the time axis of the data not from the data variable var_nc4 itself, but from the file handle fin. The time stamp translation is achieved differently as well, using the num2date() function. Other than these, there is not much of a difference: the var_nc4 variable is also a combination of data values and metadata. Below is the output:
```
type of var_nc4 = <class 'netCDF4._netCDF4.Variable'>
name =  sst
standard_name =  Sea surface temperature
long_name =  Sea surface temperature
unit =  K
shape =  (12, 1, 241, 480)

timeax =
[30680. 30711. 30740. 30771. 30801. 30832. 30862. 30893. 30924. 30954.
 30985. 31015.]
timeax_comp =
[real_datetime(1984, 1, 1, 0, 0) real_datetime(1984, 2, 1, 0, 0)
 real_datetime(1984, 3, 1, 0, 0) real_datetime(1984, 4, 1, 0, 0)
 real_datetime(1984, 5, 1, 0, 0) real_datetime(1984, 6, 1, 0, 0)
 real_datetime(1984, 7, 1, 0, 0) real_datetime(1984, 8, 1, 0, 0)
 real_datetime(1984, 9, 1, 0, 0) real_datetime(1984, 10, 1, 0, 0)
 real_datetime(1984, 11, 1, 0, 0) real_datetime(1984, 12, 1, 0, 0)]
name of time = time
unit of time = days since 1900-1-1
name of latax = latitude
unit of latax = degrees_north
```
Differences between CDAT and netcdf4
However, differences start to emerge as you go beyond such trivial attribute-querying operations and perform some more complex manipulations of the data. For instance, let's do some simple slicing of the data, cutting out the East Asia-Northwestern Pacific domain in the box of 10 - 60 degrees North and 80 - 150 degrees East.
To do this with the var_cdat_c variable (var_cdat converted to degrees Celsius, as discussed below), it would be:

```python
var_cdat_c_sub = var_cdat_c(latitude=(10, 60), longitude=(80, 150))
print('var_cdat_c_sub.getLatitude() =', var_cdat_c_sub.getLatitude())
```
We sliced the data using the latitude/longitude coordinates, and queried the latitude axis of the subdomain. Here is the output:
```
var_cdat_c_sub.getLatitude() =   id: latitude
   Designated a latitude axis.
   units:  degrees_north
   Length: 67
   First:  10.5
   Last:   60.0
   Other axis attributes:
      axis: Y
      long_name: latitude
      realtopology: linear
   Python id:  0x7f59d74b2760
```
What if we want to derive a new variable from var_cdat, say converting the units from Kelvin to Celsius? A plain arithmetic expression does the job, something like var_cdat_c = var_cdat - 273.15.
var_cdat_c is another TransientVariable, with all the same metadata as var_cdat, except that we need to update its units attribute manually.
Now let's see how to do these operations with the netcdf4 variable. Unfortunately, there is no built-in method to slice the data domain using dimension coordinates, like what we did with the var_cdat variable. For var_nc4, we will have to do the following steps:
- Obtain the latitude and longitude axes (which we already did).
- Locate the indices (lat_idx1 and lat_idx2) for the latitude bounds (10 and 60) in the latitude coordinates, and do the same for the longitude bounds to get lon_idx1 and lon_idx2.
- Slice the data values using the indices obtained above:

```python
subdomain = var_nc4[:, :, lat_idx1:lat_idx2, lon_idx1:lon_idx2]
```

(Note this assumes you already know where the latitude and longitude dimensions are in the rank.)
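The index-locating step can be sketched with plain numpy. The grids below mimic the sample file's 0.75-degree resolution, but the exact axis values are assumptions:

```python
import numpy as np

# Hypothetical axes mimicking the sample file's grid (241 lats, 480 lons):
lat = np.linspace(90, -90, 241)   # ERA-Interim latitudes run north to south
lon = np.arange(0, 360, 0.75)

def coord_to_slice(coord, lo, hi):
    """Return a slice covering all grid points with lo <= coord <= hi."""
    idx = np.where((coord >= lo) & (coord <= hi))[0]
    return slice(idx[0], idx[-1] + 1)

lat_slice = coord_to_slice(lat, 10, 60)
lon_slice = coord_to_slice(lon, 80, 150)

# subdomain = var_nc4[:, :, lat_slice, lon_slice]
# ...and remember to slice the axes too, to keep the coordinate info:
lat_sub = lat[lat_slice]
lon_sub = lon[lon_slice]
print(len(lat_sub))   # 67 points, matching the CDAT output above
```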
If you also need the coordinate information for this new subdomain variable, you will have to slice the latitude and longitude axes themselves. And don't forget the time and level dimensions; they also belong to the metadata of the variable.
If the coordinate bounds are not found exactly in the latitude/longitude coordinates, and accuracy around the domain edges is of importance, you may need to do some 2D interpolation.
What about the Kelvin-to-Celsius conversion? The numerical conversion is easy, something like var_nc4_c = var_nc4[:] - 273.15. But var_nc4_c is now a plain numpy.ma.core.MaskedArray; it has lost all the metadata. A var_nc4_c.long_name query would give you this error message:

```
AttributeError: 'MaskedArray' object has no attribute 'long_name'
```
Don't even think about dimension coordinates; var_nc4_c has none of those attached.
I think you start to see the pattern. Although it comes with the metadata from the NetCDF file, the variable read by the netcdf4 package does not support metadata-preserving operations, and the data values and metadata fall apart as soon as you do any meaningful computations with the variable. I feel that this goes against the very philosophy of the NetCDF format: metadata is bundled with the data values when saved in files, but separated when read into memory, as if the two were totally irrelevant to each other.
And this is not just metaphysical; it can cause trouble in your programming. I actually started off using CDAT, then learned a bit of netcdf4 years later, and it was a huge pain. I found myself having to maintain the metadata, particularly the dimension coordinates, all the time. Say I have a function that accepts some 3D (time, latitude and longitude) input data as an argument, and the output of the function is another 3D data variable with a different shape. With CDAT, I could just pass the TransientVariable to the function, query the dimension coordinates when I need to, and derive a new TransientVariable as the return value, which is also equipped with its own metadata.
When doing the same with netcdf4, I have to pass in 3 more variables besides the data values themselves (the time, latitude and longitude axes) to the function. Similarly, for every such data variable, there are 3 more return values (the updated time, latitude and longitude axes) that accompany the returned data values. The code becomes notably messier, and such maintenance work doesn't provide much added value; it just adds extra burdens, and more lines of code to nurture bugs.
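To make the bookkeeping burden concrete, here is a sketch of what such a function ends up looking like with plain arrays (all names and shapes are made up for illustration):

```python
import numpy as np

def monthly_anomaly(data, timeax, lat, lon):
    """Subtract the time mean; every axis has to be shuttled along by hand."""
    anom = data - data.mean(axis=0, keepdims=True)
    # The axes are unchanged here, but they must still be returned, because
    # the caller has no other way to keep data and coordinates together:
    return anom, timeax, lat, lon

# Made-up inputs with the sample file's shapes:
data = np.random.rand(12, 241, 480)
timeax = np.arange(12)
lat = np.linspace(90, -90, 241)
lon = np.arange(0, 360, 0.75)

anom, timeax, lat, lon = monthly_anomaly(data, timeax, lat, lon)
print(anom.shape)   # (12, 241, 480)
```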
This is the greatest limitation I have observed in the netcdf4 Python library. I have some limited experience with NetCDF's Fortran and Matlab libraries, and they both seem to share a similar API and workflow, and therefore the same limitation: a very loose connection between data and metadata.
Out of frustration, I ended up writing a rudimentary custom class, following interfaces similar to CDAT's, to store the data read by netcdf4. I wouldn't recommend you do the same; it is not trivial to wrap the numpy.ndarray data type. Luckily, someone has done this for us already: iris and xarray are both more advanced tools that could make the manipulation of NetCDF data, and your life, a lot easier.
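As a taste of what xarray offers, here is a sketch using a synthetic DataArray standing in for the SST variable (the grid, values and attributes are assumptions modeled on the sample file):

```python
import numpy as np
import xarray as xr

# A synthetic stand-in for the SST variable:
sst = xr.DataArray(
    np.random.rand(12, 241, 480) * 30 + 273.15,
    dims=('time', 'latitude', 'longitude'),
    coords={'latitude': np.linspace(90, -90, 241),
            'longitude': np.arange(0, 360, 0.75)},
    attrs={'units': 'K', 'long_name': 'Sea surface temperature'},
)

# Coordinate-based slicing, CDAT-style (latitude is descending, hence 60 -> 10):
sub = sst.sel(latitude=slice(60, 10), longitude=slice(80, 150))

# Arithmetic keeps the dimension coordinates attached
# (attrs such as units are dropped by default, though):
sst_c = sst - 273.15
print(sub.sizes['latitude'])        # 67
print('latitude' in sst_c.coords)   # True
```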
There are quite a lot of power tools in the CDAT package, some of which I have never used before, so there is a lot of content worth talking about. I will make a series of posts in the future to cover some aspects of CDAT that I found most useful in my previous work.