cftools#

To properly exploit object stores, there are some key steps needed to implement the performance mitigations outlined in Object Store Basics:

  1. We need to ensure that there is only one variable per file,

  2. that the variable is sensibly chunked, with the chunk index at the front of the file, and

  3. that the file is uploaded with object store metadata.

The cftools package provides classes that can be incorporated into user workflows to achieve these outcomes.
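For example, a minimal workflow might look like this (a sketch only: the alias, bucket, and file names are placeholders, and the keyword value shown is just the documented default):

from cfs3 import CFuploader

# 'my-store-alias' and 'my-bucket' identify the target S3 store and bucket
uploader = CFuploader('my-store-alias', 'my-bucket')

# split the file into single-variable files and upload each one,
# attaching the object store metadata at object creation time
uploader.simple_upload('multi_variable_file.nc', parallel_upload=False)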

Context#

Classes#

class cfs3.CFSplitter(filename_handler=None, meta_handler=None, output_folder='')[source]#

Bases: object

Provides a class factory for splitting multi-variable files into multiple single-variable files, one at a time via the split_one method. The class constructor sets up the method of handling the metadata and output file names, and where the split files should go.

Methods:

split_one(filename[, with_json, ...])

Split one file into constituent fields and create a file per field and (if required) an accompanying json file of b-metadata to be used for metadata upload.

split_one(filename, with_json=True, uncompressed_chunk_volume_MB=4)[source]#

Split one file into constituent fields and create a file per field and (if required) an accompanying json file of b-metadata to be used for metadata upload.
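As a sketch (the input file and output folder names are placeholders, and we assume the default handlers are adequate for the file in question):

from cfs3 import CFSplitter

splitter = CFSplitter(output_folder='./split_files')

# write one single-variable file per field, plus an accompanying json file
# of b-metadata, targeting ~4 MB uncompressed chunks (the documented default)
splitter.split_one('multi_variable_file.nc', with_json=True,
                   uncompressed_chunk_volume_MB=4)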

class cfs3.CFuploader(alias, bucket, *args, **kwargs)[source]#

Bases: CFSplitter

This is a class factory for splitting and uploading multi-variable files. Details of the splitting capability are outlined in the CFSplitter documentation; the addition here is the use of the Uploader class to carry out uploading to an S3 store and, in doing so, to ensure that the relevant metadata is associated with the object (as this cannot be changed after object creation).

Methods:

simple_upload(filename[, parallel_upload, ...])

Uploads the CF fields from a single file as independent files in the object store.

split_one(filename[, with_json, ...])

Split one file into constituent fields and create a file per field and (if required) an accompanying json file of b-metadata to be used for metadata upload.

simple_upload(filename, parallel_upload=False, uncompressed_chunk_volume_MB=4)[source]#

Uploads the CF fields from a single file as independent files in the object store.
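For example (a sketch; the alias, bucket, and file names are placeholders), to request parallel uploads and a larger uncompressed chunk target than the default:

from cfs3 import CFuploader

uploader = CFuploader('my-store-alias', 'my-bucket')

# upload the constituent fields in parallel, aiming for ~16 MB uncompressed chunks
uploader.simple_upload('multi_variable_file.nc',
                       parallel_upload=True,
                       uncompressed_chunk_volume_MB=16)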

split_one(filename, with_json=True, uncompressed_chunk_volume_MB=4)#

Split one file into constituent fields and create a file per field and (if required) an accompanying json file of b-metadata to be used for metadata upload.

class cfs3.MetaFix(external_metadata)[source]#

Bases: object

The MetaFix class is used to provide a factory for fixing the metadata in files, by utilising an external metadata dictionary.

External metadata is the metadata we want to add to, or fix in, the original field, and also return for use as metadata outside the file. We define it using a dictionary. The expectation is that values for the metadata will be returned by the apply method as a dictionary, and that where the external metadata differs from the field metadata, we fix the field metadata. The only exception is where the external_metadata definition at instantiation has a value of None, in which case the expectation is that the value will be obtained from the field (not corrected).

For example:

external_metadata = {'project':'cmip6','experiment':'dummy2','standard_name':None}

We would expect that if the field metadata did not have project or experiment we would add it, that if it had either with a different value we would overwrite it, and that the standard name would be extracted from the field and returned in the output metadata.

Typical usage is within the CFSplitter, e.g.:

external_metadata = {'project':'pytest','experiment':'dummy2','standard_name':None}
output_dir = tmp_path
metafix = MetaFix(external_metadata)
cfs = CFSplitter(meta_handler=metafix, output_folder=output_dir)

Inside CFSplitter it is used like this to fix the metadata of a CF field, and return the completed metadata:

metadata, field = self.meta_handler(filename, field)

class cfs3.FileNameFix(drs, filename_map=None, splitter=None)[source]#

Bases: object

Used to create suitable filenames based on a DRS and file contents.

Methods:

__call__(filename, field[, metadata])

Calculate an appropriate filename for an output file.

__init__(drs[, filename_map, splitter])

Instantiate with a DRS list and, if splitting (see the call method documentation), a filename_map to be used to map the parts of the filename onto terms.

__call__(filename, field, metadata=None)[source]#

Calculate an appropriate filename for an output file.

The algorithm used:

  1. parses the provided DRS for terms which start with !; these are calculated from the field (the options are discussed below),

  2. extracts any DRS values which are keys in the provided metadata, and

  3. if self.filename_map is not None, looks for the other DRS values from the filename (the method is discussed below).

For step 1, we understand:

  • !ncname, which will extract the netcdf variable name associated with the field.

  • !freq, which will attempt to use the cell method and cell bounds to establish a frequency.

For step 3: if self.filename_map is not None, the provided filename is split using the splitter function, and DRS terms are extracted from the resulting dictionary.

__init__(drs, filename_map=None, splitter=None)[source]#

Instantiate with a DRS list and, if splitting (see the call method documentation), a filename_map to be used to map the parts of the filename onto terms. To split using a more complicated method than just split('_'), pass a function as the splitter argument.
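As an illustrative sketch (the DRS terms and the filename layout below are invented for the example, and wiring the instance in as the CFSplitter filename_handler is an assumption based on the constructor signature and the MetaFix pattern above):

from cfs3 import CFSplitter, FileNameFix

# DRS terms: 'project' and 'experiment' are expected from the metadata or the
# filename, while !ncname and !freq are calculated from the field itself
drs = ['project', 'experiment', '!ncname', '!freq']

# assumed layout of the incoming filenames when split on '_'
filename_map = ['project', 'experiment', 'member']

namer = FileNameFix(drs, filename_map=filename_map)
cfs = CFSplitter(filename_handler=namer, output_folder='./split_files')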