In a previous post, I introduced the installation of (the "lite" version of) CDAT in Linux. This post will move on to talk about some basic netCDF file reading, data creation and saving.
Sample data
We will be using some sample netCDF data as an illustration. You can obtain the zipped data file from here. The data is the monthly average global Sea Surface Temperature (SST) of year 1989, obtained from the ERA-Interim reanalysis dataset. Below lists some basic information about the file:
- File name:
sst_s_m_1989_ori-preprocessed.nc
- File size: ~ 2.1 Mb
- Data rank: 4
- Data dimensions: 12 (time) x 1 (level) x 241 (latitude) x 480 (longitude)
- Data id: ‘sst’
- Data unit: ‘K’
- Data long<sub>name</sub>: ‘Sea surface temperature’
- Horizontal resolution: 0.75 x 0.75 degree latitude/longitude
- Temporal resolution: monthly
Jupyter notebook
The result of the post will be represented as a (static) Jupyter notebook. You can download the original notebook file from ->here<-.
1. Read in a netCDF file¶
Assuming you have obtained the sample netCDF data file and saved it to an accessible location (e.g. ./sst_s_m_1989_ori-preprocessed.nc'
).
To read in the netCDF data, we use the cdms2
module.
The “Community Data Management System” (cdms) module is specialized for the organization of multi-dimensional, gridded data. It is mostly used for the file I/O, and the creation of a data type for storing the netCDF data.
The following lines open the netCDT file using cdms2.open()
:
FILE = './sst_s_m_1989_ori-preprocessed.nc'
import cdms2 as cdms
fin = cdms.open(FILE, 'r')
print(fin)
<CDMS file '/home/guangzhi/Notebooks/numbersmithy/cdat_tutorial_notebooks/sst_s_m_1989_ori-preprocessed.nc', mode 'r' at 7f7db3d918a0, status: open>
We can now perform some queries on the opened file fin
.
1.1 List variables in the data file¶
print(fin.listvariables())
['bounds_latitude', 'bounds_longitude', 'sst']
The 'bounds_latitude'
, and 'bounds_longitude'
variables are auto-created by cdms
when saving the data to a file. You can ignore them for now.
'sst'
is the actual variable we are interested in.
Other than this, you can also use the ncdump
utility with the -h
option to query the variables stored in the file: ncdump -h the_netcdf_file.nc
.
1.2 Get dimension axes¶
axes = fin.axes
print(axes)
OrderedDict([('bound', id: bound units: Length: 2 First: 0.0 Last: 1.0 Python id: 0x7f7e104ba820 ), ('latitude', id: latitude Designated a latitude axis. units: degrees_north Length: 241 First: -90.0 Last: 90.0 Other axis attributes: long_name: latitude axis: Y Python id: 0x7f7e104ba400 ), ('level', id: level Designated a level axis. units: Length: 1 First: 1 Last: 1 Other axis attributes: long_name: level axis: Z Python id: 0x7f7e104ba220 ), ('longitude', id: longitude Designated a longitude axis. units: degrees_east Length: 480 First: 0.0 Last: 359.25 Other axis attributes: modulo: [360.] long_name: longitude axis: X topology: circular Python id: 0x7f7e104ba340 ), ('time', id: time Designated a time axis. units: days since 1900-1-1 Length: 12 First: 32507.0 Last: 32841.0 Other axis attributes: calendar: gregorian axis: T Python id: 0x7f7e104ba6a0 )])
The axes
attribute of fin
is an OrderedDict
object containing the axis of each dimension. An OrderedDict
is just like an ornidnary Python dict
, except that the keys are sorted in the order of insertion.
Again, ignore the bound
dimension for now. We can see that the data are 4-dimensional:
12 times x 1 level x 241 latitudes x 480 longitudes.
Individual axis can be obtained from the axes
object or from fin
directly:
latax = axes['latitude']
# or
#latax = fin.getAxis('latitude')
print(latax)
print(type(latax))
id: latitude Designated a latitude axis. units: degrees_north Length: 241 First: -90.0 Last: 90.0 Other axis attributes: long_name: latitude axis: Y Python id: 0x7f7e104ba400 <class 'cdms2.axis.FileAxis'>
The latax
variable is a special data type: cdms2.axis.FileAxis
. Like the netCDF data, the axis type also consists of a data array (storing the dimension coordinates), and its own metadata (information like the dimension name, coordinate units etc.).
To view the data array of the axis:
latax[:]
array([-90. , -89.25, -88.5 , -87.75, -87. , -86.25, -85.5 , -84.75, -84. , -83.25, -82.5 , -81.75, -81. , -80.25, -79.5 , -78.75, -78. , -77.25, -76.5 , -75.75, -75. , -74.25, -73.5 , -72.75, -72. , -71.25, -70.5 , -69.75, -69. , -68.25, -67.5 , -66.75, -66. , -65.25, -64.5 , -63.75, -63. , -62.25, -61.5 , -60.75, -60. , -59.25, -58.5 , -57.75, -57. , -56.25, -55.5 , -54.75, -54. , -53.25, -52.5 , -51.75, -51. , -50.25, -49.5 , -48.75, -48. , -47.25, -46.5 , -45.75, -45. , -44.25, -43.5 , -42.75, -42. , -41.25, -40.5 , -39.75, -39. , -38.25, -37.5 , -36.75, -36. , -35.25, -34.5 , -33.75, -33. , -32.25, -31.5 , -30.75, -30. , -29.25, -28.5 , -27.75, -27. , -26.25, -25.5 , -24.75, -24. , -23.25, -22.5 , -21.75, -21. , -20.25, -19.5 , -18.75, -18. , -17.25, -16.5 , -15.75, -15. , -14.25, -13.5 , -12.75, -12. , -11.25, -10.5 , -9.75, -9. , -8.25, -7.5 , -6.75, -6. , -5.25, -4.5 , -3.75, -3. , -2.25, -1.5 , -0.75, 0. , 0.75, 1.5 , 2.25, 3. , 3.75, 4.5 , 5.25, 6. , 6.75, 7.5 , 8.25, 9. , 9.75, 10.5 , 11.25, 12. , 12.75, 13.5 , 14.25, 15. , 15.75, 16.5 , 17.25, 18. , 18.75, 19.5 , 20.25, 21. , 21.75, 22.5 , 23.25, 24. , 24.75, 25.5 , 26.25, 27. , 27.75, 28.5 , 29.25, 30. , 30.75, 31.5 , 32.25, 33. , 33.75, 34.5 , 35.25, 36. , 36.75, 37.5 , 38.25, 39. , 39.75, 40.5 , 41.25, 42. , 42.75, 43.5 , 44.25, 45. , 45.75, 46.5 , 47.25, 48. , 48.75, 49.5 , 50.25, 51. , 51.75, 52.5 , 53.25, 54. , 54.75, 55.5 , 56.25, 57. , 57.75, 58.5 , 59.25, 60. , 60.75, 61.5 , 62.25, 63. , 63.75, 64.5 , 65.25, 66. , 66.75, 67.5 , 68.25, 69. , 69.75, 70.5 , 71.25, 72. , 72.75, 73.5 , 74.25, 75. , 75.75, 76.5 , 77.25, 78. , 78.75, 79.5 , 80.25, 81. , 81.75, 82.5 , 83.25, 84. , 84.75, 85.5 , 86.25, 87. , 87.75, 88.5 , 89.25, 90. ], dtype=float32)
1.3 Read in the variable’s metadata¶
We can read in the metadata of a variable without loading the data array into memory.
This is achieved by using the square brackets notation: fin['variable_id']
.
This can be useful when you only need to examine the metadata of a variable, for instance its shape, long_name, or unit, without loading a big chunck of data into RAM.
sst_meta = fin['sst']
print(sst_meta)
<cdms2.fvariable.FileVariable object at 0x7f7e104a7f10>
sst_meta.shape
(12, 1, 241, 480)
sst_meta.size()
1388160
sst_meta.id, sst_meta.name, sst_meta.long_name
('sst', 'sst', 'Sea surface temperature')
The .info()
method is quite usefull. It prints out a summary of the variable’s metadata.
sst_meta.info()
*** Description of Slab sst *** id: sst shape: (12, 1, 241, 480) filename: /home/guangzhi/Notebooks/numbersmithy/cdat_tutorial_notebooks/sst_s_m_1989_ori-preprocessed.nc missing_value: [1.e+20] comments: grid_name: grid_241x480 grid_type: None time_statistic: long_name: Sea surface temperature units: K standard_name: Sea surface temperature ndim: 4 Grid has Python id 0x7f7db556c580. Gridtype: generic Grid shape: (241, 480) Order: yx ** Dimension 1 ** id: time Designated a time axis. units: days since 1900-1-1 Length: 12 First: 32507.0 Last: 32841.0 Other axis attributes: calendar: gregorian axis: T Python id: 0x7f7e104ba6a0 ** Dimension 2 ** id: level Designated a level axis. units: Length: 1 First: 1 Last: 1 Other axis attributes: long_name: level axis: Z Python id: 0x7f7e104ba220 ** Dimension 3 ** id: latitude Designated a latitude axis. units: degrees_north Length: 241 First: -90.0 Last: 90.0 Other axis attributes: long_name: latitude axis: Y Python id: 0x7f7e104ba400 ** Dimension 4 ** id: longitude Designated a longitude axis. units: degrees_east Length: 480 First: 0.0 Last: 359.25 Other axis attributes: modulo: [360.] long_name: longitude axis: X topology: circular Python id: 0x7f7e104ba340 *** End of description for sst ***
We can obtain the dimension axis from the sst_meta
variable as well:
# NOTE that we are giving an integer index (2) as input, meaning the 3rd axis in the data.
sst_meta.getAxis(2)
id: latitude Designated a latitude axis. units: degrees_north Length: 241 First: -90.0 Last: 90.0 Other axis attributes: long_name: latitude axis: Y Python id: 0x7f7e104ba400
# Alternatively:
sst_meta.getLatitude()
id: latitude Designated a latitude axis. units: degrees_north Length: 241 First: -90.0 Last: 90.0 Other axis attributes: long_name: latitude axis: Y Python id: 0x7f7e104ba400
# Similarly, to get the time axis:
sst_meta.getTime()
id: time Designated a time axis. units: days since 1900-1-1 Length: 12 First: 32507.0 Last: 32841.0 Other axis attributes: calendar: gregorian axis: T Python id: 0x7f7e104ba6a0
1.4 Read in the variable with data array¶
To read in the entire variable (data array + metadata), use parenthese instead of square brackets:
sst = fin('sst')
print(type(sst))
<class 'cdms2.tvariable.TransientVariable'>
Notice that sst_meta
is of FileVariable
type, and sst
is a TransientVariable
.
When manipulating netCDF data using CDAT
, TransientVariable
is our primary data type. It is derived from numpy.ma.MaskedArray
, which is just like numpy.ndarray
but with an extra layer of data mask to denote missing values. Therefore, the array manipulation API of TransientVariable
is almost the same as native numpy.ndarray
, with only some minor differences.
For instance, we can query the shape and size of the sst
variable just like a numpy.ndarray
:
sst.shape, sst.size
((12, 1, 241, 480), 1388160)
Slicing can be achieved using similar opertions as on ndarray
:
slab1 = sst[0]
slab2 = sst[0:2, :, 0:50, 200:300]
print(slab1.shape, slab2.shape)
(1, 241, 480) (2, 1, 50, 100)
We will talk more about data slicing in a later post.
All dimension axes are accessible from sst
just like before. Notice that the returned value is of type TransientAxis
, no practical difference from FileAxis
though.
type(sst.getLatitude())
cdms2.axis.TransientAxis
Lastly, it is good practice to close the file after reading in data:
fin.close()
You can, of cause, use the with
context to achieve an automatic file closing.
sst.units
'K'
Suppose we would like to convert it to Celcius, and compute an annual average from the 12 monthly mean values. We can achieve this using the following 3 lines:
import MV2 as MV
sst_c = sst - 273.15
sst_c_annual = MV.sum(sst_c, axis=0)/12.
# some simple checks:
print('A sample box in K:', sst[0, 0, 100:102, 100:102])
print('The same sample box in Celcius:', sst_c[0, 0, 100:102, 100:102])
print('annual mean shape =', sst_c_annual.shape)
A sample box in K: [[301.3503723144531 301.35693359375] [301.5281066894531 301.5483703613281]] The same sample box in Celcius: [[28.20037841796875 28.206939697265625] [28.37811279296875 28.39837646484375]] annual mean shape = (1, 241, 480)
Notice that I’m using the MV2
module to compute the summation over the time dimension.
MV2
handles numerical computations involving TransientVariables
much like numpy.ndarray
. One big difference from ordinary ndarray
computations is that the MV2
counterparts will try to be metadata preserving, as will be shown in a minute.
Also note that I’m using a MV.sum(sst_c, axis=0)/12.
instead of a simple MV.mean(sst_c, axis=0)
to compute the averages. This is because the MV2
module doesn’t have a mean()
function. It might sound a bit weird that such a fundemental function is totally missing, but they have good reasons for making such a decision.
Since CDAT
is designed for numerical computations involving meteorological or oceanorgraphical data, the averaging operation almost always needs to be a weighted average. Just like the case of deriving an annual average from monthly means we are dealing with here. Strictly speaking, different weights should be given to the 12 months, in proportion to the number of days in each month. For instance, Januray has a slightly larger weight than Feburary (31 vs 28). A simple mean(axis=0)
, or sum(axis=0)/12.
does not take this into account.
Therefore, to discourage reckless usages of un-weighted averaging, MV2
removed the mean()
function (it used to have one in an earlier version). Instead, one should use a dedicated function in another module (cdutil
) of CDAT
for weighted average computations. We will cover that in a later post.
The new sst_c_annual
variable is derived from sst
, and is itself a TransientVariable
.
type(sst_c_annual)
cdms2.tvariable.TransientVariable
We can query some info on sst_c_annual
using its info()
method:
sst_c_annual.info()
*** Description of Slab variable_210 *** id: variable_210 shape: (1, 241, 480) filename: missing_value: 1e+20 comments: grid_name: <None> grid_type: generic time_statistic: long_name: Sea surface temperature units: K tileIndex: None standard_name: Sea surface temperature Grid has Python id 0x7f7db4618a00. Gridtype: generic Grid shape: (241, 480) Order: yx ** Dimension 1 ** id: level Designated a level axis. units: Length: 1 First: 1 Last: 1 Other axis attributes: axis: Z long_name: level Python id: 0x7f7db3d961c0 ** Dimension 2 ** id: latitude Designated a latitude axis. units: degrees_north Length: 241 First: -90.0 Last: 90.0 Other axis attributes: axis: Y long_name: latitude realtopology: linear Python id: 0x7f7db3d96130 ** Dimension 3 ** id: longitude Designated a longitude axis. units: degrees_east Length: 480 First: 0.0 Last: 359.25 Other axis attributes: axis: X modulo: 360.0 topology: circular long_name: longitude realtopology: circular Python id: 0x7f7db3d961f0 *** End of description for variable_210 ***
Notice that the sst_c_annual
variable has inherited metadata from its ancestor (sst_c
), which in turn inherited from the original sst
. This is what I meant by “metadata-preserving operations.
However, the id
attribute of sst_c_annual
is not quite meaningful: variable_190
. This is a automatically generated variable id to distinguish this particular TransientVariable
from others in this current Python session.
The units
attribute is still K
, because CDAT
has no ways of knowing the relationship between Kelvin and Celcius. These are the metadata that we need to maintain ourselves.
As a rule of thumb, MV2
and TransientVariable
operations will try to be metadata-preserving, by inheriting from the operand(s). When this could not be achieved, or some ambiguity is encountered (for instance, when adding the data from year 1989 to 1990, it doesn’t know which year’s time axis should be given to the result), the relevant metadata will either be wrong, or lost.
2.2 Create a new TransientVariable
from scratch¶
A new TransientVariable
can be created from scratch by calling the cdms2.createVariable()
function.
We need to provide it with the following input arguments:
- (mandatory) a
np.ndarray
compabile array. This will be used as the data array of theTransientVariable
. A Python list will do, or simply a single scalar, likecdms2.createVariable(10)
. - (optional)
axes
. This is a list/tuple ofTransientAxis
objects. These are the dimension metadata. The axis size should agree with the dimension size in the corresponding rank, e.g. a length-12 axis for the 1st dimension if your data have a shape of(12, 1, 241, 480)
. - (optional)
attributes
. This is a Pythondictionary
containing some text-based metadata. For instance,{'id': 'sst', 'name': 'sst', 'long_name': 'sea surface temperature', 'units': 'K'}
. - (optional)
typecode
. Data format. Could bef
for floating point;double
for double precision floats;b
forbyte
;i
for integer.
Let’s see an example.
Suppose we need to create a hypothetical surface pressure distribution within the box of 100 – 180 E, 20 – 60 N. The data are composed by adding a Gaussian-shaped low pressure center on top of the 1000 hPa standard sea level pressure.
Let’s start by creating the latitude and longitude axes:
lats = np.arange(20, 61)
lons = np.arange(100, 181)
latax = cdms.createAxis(lats)
latax.designateLatitude()
latax.id = 'y'
latax.name = 'latitude'
latax.units = 'degree north'
lonax = cdms.createAxis(lons)
lonax.designateLongitude()
lonax.id = 'x'
lonax.name = 'longitude'
lonax.units = 'degree east'
Note that we are using cdms.createAxis()
function to create a dimension axis.
Next, create a hypothetical pressure purturbation:
lat_c = lats.mean()
lon_c = lons.mean()
XX, YY = np.meshgrid(lons, lats)
r = 8 # purturbation radius, in degree latitude/longitude.
P0 = 1000 # hPa
low = -10 * np.exp(-(XX-lon_c)**2/2./r**2 - (YY-lat_c)**2/2./r**2)
pressure = P0 + low
print(type(pressure), pressure.shape)
<class 'numpy.ndarray'> (41, 81)
Then create a new TransientVariable
from pressure
:
att_dict={'id': 'slp',
'name': 'sea level pressure',
'long_name': 'sea level pressure',
'units': 'hPa'}
pressure = cdms.createVariable(pressure ,axes=[latax, lonax],\
attributes=att_dict,typecode='f')
type(pressure)
cdms2.tvariable.TransientVariable
pressure.info()
*** Description of Slab slp *** id: slp shape: (41, 81) filename: missing_value: 1.0000000200408773e+20 comments: grid_name: <None> grid_type: generic time_statistic: long_name: sea level pressure units: hPa tileIndex: None Grid has Python id 0x7f7db3d960d0. Gridtype: generic Grid shape: (41, 81) Order: yx ** Dimension 1 ** id: y Designated a latitude axis. units: degree north Length: 41 First: 20 Last: 60 Other axis attributes: axis: Y Python id: 0x7f7e1288f400 ** Dimension 2 ** id: x Designated a longitude axis. units: degree east Length: 81 First: 100 Last: 180 Other axis attributes: axis: X modulo: 360.0 topology: circular Python id: 0x7f7e1288f760 *** End of description for slp ***
Here is a screenshot of the pressure
data as viewed from ncview
:
3. Save data to a netCDF file¶
Now save the newly crafted pressure
data to a netCDF file on disk:
abpath_out = 'test_nc.nc'
print('Saving output to:\n',abpath_out)
fout = cdms.open(abpath_out, 'w')
fout.write(pressure, typecode='f')
fout.close()
Saving output to: test_nc.nc