In a previous post, I introduced the installation of (the "lite" version of) CDAT in Linux. This post will move on to talk about some basic netCDF file reading, data creation and saving.
Sample data
We will be using some sample netCDF data as an illustration. You can obtain the zipped data file from here. The data is the monthly average global Sea Surface Temperature (SST) of year 1989, obtained from the ERA-Interim reanalysis dataset. Below lists some basic information about the file:
- File name:
sst_s_m_1989_ori-preprocessed.nc - File size: ~ 2.1 Mb
- Data rank: 4
- Data dimensions: 12 (time) x 1 (level) x 241 (latitude) x 480 (longitude)
- Data id: ‘sst’
- Data unit: ‘K’
- Data long<sub>name</sub>: ‘Sea surface temperature’
- Horizontal resolution: 0.75 x 0.75 degree latitude/longitude
- Temporal resolution: monthly
Jupyter notebook
The result of the post will be represented as a (static) Jupyter notebook. You can download the original notebook file from ->here<-.
1. Read in a netCDF file¶
Assuming you have obtained the sample netCDF data file and saved it to an accessible location (e.g. ./sst_s_m_1989_ori-preprocessed.nc').
To read in the netCDF data, we use the cdms2 module.
The “Community Data Management System” (cdms) module is specialized for the organization of multi-dimensional, gridded data. It is mostly used for the file I/O, and the creation of a data type for storing the netCDF data.
The following lines open the netCDT file using cdms2.open():
FILE = './sst_s_m_1989_ori-preprocessed.nc'
import cdms2 as cdms
fin = cdms.open(FILE, 'r')
print(fin)
<CDMS file '/home/guangzhi/Notebooks/numbersmithy/cdat_tutorial_notebooks/sst_s_m_1989_ori-preprocessed.nc', mode 'r' at 7f7db3d918a0, status: open>
We can now perform some queries on the opened file fin.
1.1 List variables in the data file¶
print(fin.listvariables())
['bounds_latitude', 'bounds_longitude', 'sst']
The 'bounds_latitude', and 'bounds_longitude' variables are auto-created by cdms when saving the data to a file. You can ignore them for now.
'sst' is the actual variable we are interested in.
Other than this, you can also use the ncdump utility with the -h option to query the variables stored in the file: ncdump -h the_netcdf_file.nc.
1.2 Get dimension axes¶
axes = fin.axes
print(axes)
OrderedDict([('bound', id: bound
units:
Length: 2
First: 0.0
Last: 1.0
Python id: 0x7f7e104ba820
), ('latitude', id: latitude
Designated a latitude axis.
units: degrees_north
Length: 241
First: -90.0
Last: 90.0
Other axis attributes:
long_name: latitude
axis: Y
Python id: 0x7f7e104ba400
), ('level', id: level
Designated a level axis.
units:
Length: 1
First: 1
Last: 1
Other axis attributes:
long_name: level
axis: Z
Python id: 0x7f7e104ba220
), ('longitude', id: longitude
Designated a longitude axis.
units: degrees_east
Length: 480
First: 0.0
Last: 359.25
Other axis attributes:
modulo: [360.]
long_name: longitude
axis: X
topology: circular
Python id: 0x7f7e104ba340
), ('time', id: time
Designated a time axis.
units: days since 1900-1-1
Length: 12
First: 32507.0
Last: 32841.0
Other axis attributes:
calendar: gregorian
axis: T
Python id: 0x7f7e104ba6a0
)])
The axes attribute of fin is an OrderedDict object containing the axis of each dimension. An OrderedDict is just like an ornidnary Python dict, except that the keys are sorted in the order of insertion.
Again, ignore the bound dimension for now. We can see that the data are 4-dimensional:
12 times x 1 level x 241 latitudes x 480 longitudes.
Individual axis can be obtained from the axes object or from fin directly:
latax = axes['latitude']
# or
#latax = fin.getAxis('latitude')
print(latax)
print(type(latax))
id: latitude
Designated a latitude axis.
units: degrees_north
Length: 241
First: -90.0
Last: 90.0
Other axis attributes:
long_name: latitude
axis: Y
Python id: 0x7f7e104ba400
<class 'cdms2.axis.FileAxis'>
The latax variable is a special data type: cdms2.axis.FileAxis. Like the netCDF data, the axis type also consists of a data array (storing the dimension coordinates), and its own metadata (information like the dimension name, coordinate units etc.).
To view the data array of the axis:
latax[:]
array([-90. , -89.25, -88.5 , -87.75, -87. , -86.25, -85.5 , -84.75,
-84. , -83.25, -82.5 , -81.75, -81. , -80.25, -79.5 , -78.75,
-78. , -77.25, -76.5 , -75.75, -75. , -74.25, -73.5 , -72.75,
-72. , -71.25, -70.5 , -69.75, -69. , -68.25, -67.5 , -66.75,
-66. , -65.25, -64.5 , -63.75, -63. , -62.25, -61.5 , -60.75,
-60. , -59.25, -58.5 , -57.75, -57. , -56.25, -55.5 , -54.75,
-54. , -53.25, -52.5 , -51.75, -51. , -50.25, -49.5 , -48.75,
-48. , -47.25, -46.5 , -45.75, -45. , -44.25, -43.5 , -42.75,
-42. , -41.25, -40.5 , -39.75, -39. , -38.25, -37.5 , -36.75,
-36. , -35.25, -34.5 , -33.75, -33. , -32.25, -31.5 , -30.75,
-30. , -29.25, -28.5 , -27.75, -27. , -26.25, -25.5 , -24.75,
-24. , -23.25, -22.5 , -21.75, -21. , -20.25, -19.5 , -18.75,
-18. , -17.25, -16.5 , -15.75, -15. , -14.25, -13.5 , -12.75,
-12. , -11.25, -10.5 , -9.75, -9. , -8.25, -7.5 , -6.75,
-6. , -5.25, -4.5 , -3.75, -3. , -2.25, -1.5 , -0.75,
0. , 0.75, 1.5 , 2.25, 3. , 3.75, 4.5 , 5.25,
6. , 6.75, 7.5 , 8.25, 9. , 9.75, 10.5 , 11.25,
12. , 12.75, 13.5 , 14.25, 15. , 15.75, 16.5 , 17.25,
18. , 18.75, 19.5 , 20.25, 21. , 21.75, 22.5 , 23.25,
24. , 24.75, 25.5 , 26.25, 27. , 27.75, 28.5 , 29.25,
30. , 30.75, 31.5 , 32.25, 33. , 33.75, 34.5 , 35.25,
36. , 36.75, 37.5 , 38.25, 39. , 39.75, 40.5 , 41.25,
42. , 42.75, 43.5 , 44.25, 45. , 45.75, 46.5 , 47.25,
48. , 48.75, 49.5 , 50.25, 51. , 51.75, 52.5 , 53.25,
54. , 54.75, 55.5 , 56.25, 57. , 57.75, 58.5 , 59.25,
60. , 60.75, 61.5 , 62.25, 63. , 63.75, 64.5 , 65.25,
66. , 66.75, 67.5 , 68.25, 69. , 69.75, 70.5 , 71.25,
72. , 72.75, 73.5 , 74.25, 75. , 75.75, 76.5 , 77.25,
78. , 78.75, 79.5 , 80.25, 81. , 81.75, 82.5 , 83.25,
84. , 84.75, 85.5 , 86.25, 87. , 87.75, 88.5 , 89.25,
90. ], dtype=float32)
1.3 Read in the variable’s metadata¶
We can read in the metadata of a variable without loading the data array into memory.
This is achieved by using the square brackets notation: fin['variable_id'].
This can be useful when you only need to examine the metadata of a variable, for instance its shape, long_name, or unit, without loading a big chunck of data into RAM.
sst_meta = fin['sst']
print(sst_meta)
<cdms2.fvariable.FileVariable object at 0x7f7e104a7f10>
sst_meta.shape
(12, 1, 241, 480)
sst_meta.size()
1388160
sst_meta.id, sst_meta.name, sst_meta.long_name
('sst', 'sst', 'Sea surface temperature')
The .info() method is quite usefull. It prints out a summary of the variable’s metadata.
sst_meta.info()
*** Description of Slab sst ***
id: sst
shape: (12, 1, 241, 480)
filename: /home/guangzhi/Notebooks/numbersmithy/cdat_tutorial_notebooks/sst_s_m_1989_ori-preprocessed.nc
missing_value: [1.e+20]
comments:
grid_name: grid_241x480
grid_type: None
time_statistic:
long_name: Sea surface temperature
units: K
standard_name: Sea surface temperature
ndim: 4
Grid has Python id 0x7f7db556c580.
Gridtype: generic
Grid shape: (241, 480)
Order: yx
** Dimension 1 **
id: time
Designated a time axis.
units: days since 1900-1-1
Length: 12
First: 32507.0
Last: 32841.0
Other axis attributes:
calendar: gregorian
axis: T
Python id: 0x7f7e104ba6a0
** Dimension 2 **
id: level
Designated a level axis.
units:
Length: 1
First: 1
Last: 1
Other axis attributes:
long_name: level
axis: Z
Python id: 0x7f7e104ba220
** Dimension 3 **
id: latitude
Designated a latitude axis.
units: degrees_north
Length: 241
First: -90.0
Last: 90.0
Other axis attributes:
long_name: latitude
axis: Y
Python id: 0x7f7e104ba400
** Dimension 4 **
id: longitude
Designated a longitude axis.
units: degrees_east
Length: 480
First: 0.0
Last: 359.25
Other axis attributes:
modulo: [360.]
long_name: longitude
axis: X
topology: circular
Python id: 0x7f7e104ba340
*** End of description for sst ***
We can obtain the dimension axis from the sst_meta variable as well:
# NOTE that we are giving an integer index (2) as input, meaning the 3rd axis in the data.
sst_meta.getAxis(2)
id: latitude
Designated a latitude axis.
units: degrees_north
Length: 241
First: -90.0
Last: 90.0
Other axis attributes:
long_name: latitude
axis: Y
Python id: 0x7f7e104ba400
# Alternatively:
sst_meta.getLatitude()
id: latitude
Designated a latitude axis.
units: degrees_north
Length: 241
First: -90.0
Last: 90.0
Other axis attributes:
long_name: latitude
axis: Y
Python id: 0x7f7e104ba400
# Similarly, to get the time axis:
sst_meta.getTime()
id: time
Designated a time axis.
units: days since 1900-1-1
Length: 12
First: 32507.0
Last: 32841.0
Other axis attributes:
calendar: gregorian
axis: T
Python id: 0x7f7e104ba6a0
1.4 Read in the variable with data array¶
To read in the entire variable (data array + metadata), use parenthese instead of square brackets:
sst = fin('sst')
print(type(sst))
<class 'cdms2.tvariable.TransientVariable'>
Notice that sst_meta is of FileVariable type, and sst is a TransientVariable.
When manipulating netCDF data using CDAT, TransientVariable is our primary data type. It is derived from numpy.ma.MaskedArray, which is just like numpy.ndarray but with an extra layer of data mask to denote missing values. Therefore, the array manipulation API of TransientVariable is almost the same as native numpy.ndarray, with only some minor differences.
For instance, we can query the shape and size of the sst variable just like a numpy.ndarray:
sst.shape, sst.size
((12, 1, 241, 480), 1388160)
Slicing can be achieved using similar opertions as on ndarray:
slab1 = sst[0]
slab2 = sst[0:2, :, 0:50, 200:300]
print(slab1.shape, slab2.shape)
(1, 241, 480) (2, 1, 50, 100)
We will talk more about data slicing in a later post.
All dimension axes are accessible from sst just like before. Notice that the returned value is of type TransientAxis, no practical difference from FileAxis though.
type(sst.getLatitude())
cdms2.axis.TransientAxis
Lastly, it is good practice to close the file after reading in data:
fin.close()
You can, of cause, use the with context to achieve an automatic file closing.
sst.units
'K'
Suppose we would like to convert it to Celcius, and compute an annual average from the 12 monthly mean values. We can achieve this using the following 3 lines:
import MV2 as MV
sst_c = sst - 273.15
sst_c_annual = MV.sum(sst_c, axis=0)/12.
# some simple checks:
print('A sample box in K:', sst[0, 0, 100:102, 100:102])
print('The same sample box in Celcius:', sst_c[0, 0, 100:102, 100:102])
print('annual mean shape =', sst_c_annual.shape)
A sample box in K: [[301.3503723144531 301.35693359375] [301.5281066894531 301.5483703613281]] The same sample box in Celcius: [[28.20037841796875 28.206939697265625] [28.37811279296875 28.39837646484375]] annual mean shape = (1, 241, 480)
Notice that I’m using the MV2 module to compute the summation over the time dimension.
MV2 handles numerical computations involving TransientVariables much like numpy.ndarray. One big difference from ordinary ndarray computations is that the MV2 counterparts will try to be metadata preserving, as will be shown in a minute.
Also note that I’m using a MV.sum(sst_c, axis=0)/12. instead of a simple MV.mean(sst_c, axis=0) to compute the averages. This is because the MV2 module doesn’t have a mean() function. It might sound a bit weird that such a fundemental function is totally missing, but they have good reasons for making such a decision.
Since CDAT is designed for numerical computations involving meteorological or oceanorgraphical data, the averaging operation almost always needs to be a weighted average. Just like the case of deriving an annual average from monthly means we are dealing with here. Strictly speaking, different weights should be given to the 12 months, in proportion to the number of days in each month. For instance, Januray has a slightly larger weight than Feburary (31 vs 28). A simple mean(axis=0), or sum(axis=0)/12. does not take this into account.
Therefore, to discourage reckless usages of un-weighted averaging, MV2 removed the mean() function (it used to have one in an earlier version). Instead, one should use a dedicated function in another module (cdutil) of CDAT for weighted average computations. We will cover that in a later post.
The new sst_c_annual variable is derived from sst, and is itself a TransientVariable.
type(sst_c_annual)
cdms2.tvariable.TransientVariable
We can query some info on sst_c_annual using its info() method:
sst_c_annual.info()
*** Description of Slab variable_210 ***
id: variable_210
shape: (1, 241, 480)
filename:
missing_value: 1e+20
comments:
grid_name: <None>
grid_type: generic
time_statistic:
long_name: Sea surface temperature
units: K
tileIndex: None
standard_name: Sea surface temperature
Grid has Python id 0x7f7db4618a00.
Gridtype: generic
Grid shape: (241, 480)
Order: yx
** Dimension 1 **
id: level
Designated a level axis.
units:
Length: 1
First: 1
Last: 1
Other axis attributes:
axis: Z
long_name: level
Python id: 0x7f7db3d961c0
** Dimension 2 **
id: latitude
Designated a latitude axis.
units: degrees_north
Length: 241
First: -90.0
Last: 90.0
Other axis attributes:
axis: Y
long_name: latitude
realtopology: linear
Python id: 0x7f7db3d96130
** Dimension 3 **
id: longitude
Designated a longitude axis.
units: degrees_east
Length: 480
First: 0.0
Last: 359.25
Other axis attributes:
axis: X
modulo: 360.0
topology: circular
long_name: longitude
realtopology: circular
Python id: 0x7f7db3d961f0
*** End of description for variable_210 ***
Notice that the sst_c_annual variable has inherited metadata from its ancestor (sst_c), which in turn inherited from the original sst. This is what I meant by “metadata-preserving operations.
However, the id attribute of sst_c_annual is not quite meaningful: variable_190. This is a automatically generated variable id to distinguish this particular TransientVariable from others in this current Python session.
The units attribute is still K, because CDAT has no ways of knowing the relationship between Kelvin and Celcius. These are the metadata that we need to maintain ourselves.
As a rule of thumb, MV2 and TransientVariable operations will try to be metadata-preserving, by inheriting from the operand(s). When this could not be achieved, or some ambiguity is encountered (for instance, when adding the data from year 1989 to 1990, it doesn’t know which year’s time axis should be given to the result), the relevant metadata will either be wrong, or lost.
2.2 Create a new TransientVariable from scratch¶
A new TransientVariable can be created from scratch by calling the cdms2.createVariable() function.
We need to provide it with the following input arguments:
- (mandatory) a
np.ndarraycompabile array. This will be used as the data array of theTransientVariable. A Python list will do, or simply a single scalar, likecdms2.createVariable(10). - (optional)
axes. This is a list/tuple ofTransientAxisobjects. These are the dimension metadata. The axis size should agree with the dimension size in the corresponding rank, e.g. a length-12 axis for the 1st dimension if your data have a shape of(12, 1, 241, 480). - (optional)
attributes. This is a Pythondictionarycontaining some text-based metadata. For instance,{'id': 'sst', 'name': 'sst', 'long_name': 'sea surface temperature', 'units': 'K'}. - (optional)
typecode. Data format. Could beffor floating point;doublefor double precision floats;bforbyte;ifor integer.
Let’s see an example.
Suppose we need to create a hypothetical surface pressure distribution within the box of 100 – 180 E, 20 – 60 N. The data are composed by adding a Gaussian-shaped low pressure center on top of the 1000 hPa standard sea level pressure.
Let’s start by creating the latitude and longitude axes:
lats = np.arange(20, 61)
lons = np.arange(100, 181)
latax = cdms.createAxis(lats)
latax.designateLatitude()
latax.id = 'y'
latax.name = 'latitude'
latax.units = 'degree north'
lonax = cdms.createAxis(lons)
lonax.designateLongitude()
lonax.id = 'x'
lonax.name = 'longitude'
lonax.units = 'degree east'
Note that we are using cdms.createAxis() function to create a dimension axis.
Next, create a hypothetical pressure purturbation:
lat_c = lats.mean()
lon_c = lons.mean()
XX, YY = np.meshgrid(lons, lats)
r = 8 # purturbation radius, in degree latitude/longitude.
P0 = 1000 # hPa
low = -10 * np.exp(-(XX-lon_c)**2/2./r**2 - (YY-lat_c)**2/2./r**2)
pressure = P0 + low
print(type(pressure), pressure.shape)
<class 'numpy.ndarray'> (41, 81)
Then create a new TransientVariable from pressure:
att_dict={'id': 'slp',
'name': 'sea level pressure',
'long_name': 'sea level pressure',
'units': 'hPa'}
pressure = cdms.createVariable(pressure ,axes=[latax, lonax],\
attributes=att_dict,typecode='f')
type(pressure)
cdms2.tvariable.TransientVariable
pressure.info()
*** Description of Slab slp ***
id: slp
shape: (41, 81)
filename:
missing_value: 1.0000000200408773e+20
comments:
grid_name: <None>
grid_type: generic
time_statistic:
long_name: sea level pressure
units: hPa
tileIndex: None
Grid has Python id 0x7f7db3d960d0.
Gridtype: generic
Grid shape: (41, 81)
Order: yx
** Dimension 1 **
id: y
Designated a latitude axis.
units: degree north
Length: 41
First: 20
Last: 60
Other axis attributes:
axis: Y
Python id: 0x7f7e1288f400
** Dimension 2 **
id: x
Designated a longitude axis.
units: degree east
Length: 81
First: 100
Last: 180
Other axis attributes:
axis: X
modulo: 360.0
topology: circular
Python id: 0x7f7e1288f760
*** End of description for slp ***
Here is a screenshot of the pressure data as viewed from ncview:
3. Save data to a netCDF file¶
Now save the newly crafted pressure data to a netCDF file on disk:
abpath_out = 'test_nc.nc'
print('Saving output to:\n',abpath_out)
fout = cdms.open(abpath_out, 'w')
fout.write(pressure, typecode='f')
fout.close()
Saving output to: test_nc.nc




