cfdm.Data

class cfdm.Data(array=None, units=None, calendar=None, fill_value=None, hardmask=True, chunks='auto', dt=False, source=None, copy=True, dtype=None, mask=None, mask_value=None, to_memory=False, init_options=None, _use_array=True)[source]

Bases: cfdm.mixin.container.Container, cfdm.mixin.netcdf.NetCDFAggregation, cfdm.mixin.netcdf.NetCDFHDF5, cfdm.mixin.files.Files, cfdm.core.data.data.Data

An N-dimensional data array with units and masked values.

  • Contains an N-dimensional, indexable and broadcastable array with many similarities to a numpy array.

  • Contains the units of the array elements.

  • Supports masked arrays, regardless of whether or not it was initialised with a masked array.

  • Stores and operates on data arrays which are larger than the available memory.

Indexing

A data array is indexable in a similar way to a numpy array:

>>> d.shape
(12, 19, 73, 96)
>>> d[...].shape
(12, 19, 73, 96)
>>> d[slice(0, 9), 10:0:-2, :, :].shape
(9, 5, 73, 96)

There are three extensions to the numpy indexing functionality:

  • Size 1 dimensions are never removed by indexing.

    An integer index i takes the i-th element but does not reduce the rank of the output array by one:

    >>> d.shape
    (12, 19, 73, 96)
    >>> d[0, ...].shape
    (1, 19, 73, 96)
    >>> d[:, 3, slice(10, 0, -2), 95].shape
    (12, 1, 5, 1)
    

    Size 1 dimensions may be removed with the squeeze method.

  • The indices for each axis work independently.

    When more than one dimension’s index is a 1-d Boolean sequence or a 1-d sequence of integers, the indices work independently along each dimension (similar to the way vector subscripts work in Fortran), rather than element-wise:

    >>> d.shape
    (12, 19, 73, 96)
    >>> d[0, :, [0, 1], [0, 13, 27]].shape
    (1, 19, 2, 3)
    
  • Boolean indices may be any object which exposes the numpy array interface.

    >>> d.shape
    (12, 19, 73, 96)
    >>> d[..., d[0, 0, 0] > d[0, 0, 0].min()]
    

Initialisation

Parameters
array: optional

The array of values. May be a scalar or array-like object, including another Data instance; anything with a to_dask_array method; a numpy array; a dask array; an xarray array; a cfdm.Array subclass; a list; or a tuple.

Parameter example:

array=34.6

Parameter example:

array=[[1, 2], [3, 4]]

Parameter example:

array=numpy.ma.arange(10).reshape(2, 1, 5)

units: str or Units, optional

The physical units of the data. If a Units object is provided then this can also set the calendar.

The units (without the calendar) may also be set after initialisation with the set_units method.

Parameter example:

units='km hr-1'

Parameter example:

units='days since 2018-12-01'

calendar: str, optional

The calendar for reference time units.

The calendar may also be set after initialisation with the set_calendar method.

Parameter example:

calendar='360_day'

fill_value: optional

The fill value of the data. By default, or if set to None, the numpy fill value appropriate to the array’s data-type will be used (see numpy.ma.default_fill_value).

The fill value may also be set after initialisation with the set_fill_value method.

Parameter example:

fill_value=-999.

dtype: data-type, optional

The desired data-type for the data. By default the data-type will be inferred from the array parameter.

The data-type may also be set after initialisation with the dtype attribute.

Parameter example:

dtype=float

Parameter example:

dtype='float32'

Parameter example:

dtype=numpy.dtype('i2')

New in version 3.0.4.

mask: optional

Apply this mask to the data given by the array parameter. By default, or if mask is None, no mask is applied. May be any scalar or array-like object (such as a list, numpy array or Data instance) that is broadcastable to the shape of array. Masking will be carried out where the mask elements evaluate to True.

This mask will be applied in addition to any mask already defined by the array parameter.

mask_value: scalar array_like, optional

Mask array where it is equal to mask_value, using numerically tolerant floating point equality.

New in version (cfdm): 1.11.0.0

hardmask: bool, optional

If True (the default) then the mask is hard. If False then the mask is soft.

dt: bool, optional

If True then strings (such as '1990-12-01 12:00') given by the array parameter are re-interpreted as date-time objects. By default they are not.

source: optional

Convert source, which can be any type of object, to a Data instance.

All other parameters, apart from copy, are ignored and their values are instead inferred from source by assuming that it has the Data API. Any parameters that cannot be retrieved from source in this way are assumed to have their default value.

Note that if x is also a Data instance then cfdm.Data(source=x) is equivalent to x.copy().

copy: bool, optional

If True (the default) then deep copy the input parameters prior to initialisation. If False then the parameters are not deep copied.

chunks: int, tuple, dict or str, optional

Specify the chunking of the underlying dask array.

Any value accepted by the chunks parameter of the dask.array.from_array function is allowed.

By default, "auto" is used to specify the array chunking, which uses a chunk size in bytes defined by the cfdm.chunksize function, preferring square-like chunk shapes.

Parameter example:

A blocksize like 1000.

Parameter example:

A blockshape like (1000, 1000).

Parameter example:

Explicit sizes of all blocks along all dimensions like ((1000, 1000, 500), (400, 400)).

Parameter example:

A size in bytes, like "100MiB" which will choose a uniform block-like shape, preferring square-like chunk shapes.

Parameter example:

A blocksize of -1 or None in a tuple or dictionary indicates the full size of the corresponding dimension.

Parameter example:

Blocksizes of some or all dimensions mapped to dimension positions, like {1: 200}, or {0: -1, 1: (400, 400)}.

New in version (cfdm): 1.11.2.0

to_memory: bool, optional

If True then ensure that the original data are in memory, rather than on disk.

If the original data are on disk, then reading data into memory during initialisation will slow down the initialisation process, but can considerably improve downstream performance by avoiding the need for independent reads for every dask chunk, each time the data are computed.

In general, setting to_memory to True is not the same as calling the persist method of the newly created Data object, which also decompresses data compressed by convention and computes any data type, mask and date-time modifications.

If the input array is a dask.array.Array object then to_memory is ignored.

New in version (cfdm): 1.11.2.0

init_options: dict, optional

Provide optional keyword arguments to methods and functions called during the initialisation process. A dictionary key identifies a method or function. The corresponding value is another dictionary whose key/value pairs are the keyword parameter names and values to be applied.

Supported keys are:

  • 'from_array': Provide keyword arguments to the dask.array.from_array function. This is used when initialising data that is not already a dask array and is not compressed by convention.

  • 'first_non_missing_value': Provide keyword arguments to the cfdm.data.utils.first_non_missing_value function. This is used when the input array contains date-time strings or objects, and may affect performance.

Parameter example:

{'from_array': {'inline_array': True}}

Examples

>>> d = cfdm.Data(5)
>>> d = cfdm.Data([1,2,3], units='K')
>>> import numpy
>>> d = cfdm.Data(numpy.arange(10).reshape(2,5),
...               units='m/s', fill_value=-999)
>>> d = cfdm.Data('fly')
>>> d = cfdm.Data(tuple('fly'))

Inspection

Attributes

array

A numpy array copy of the data.

sparse_array

Return an independent scipy sparse array of the data.

dtype

The numpy data-type of the data.

ndim

Number of dimensions in the data array.

shape

Tuple of the data array’s dimension sizes.

size

Number of elements in the data array.

nbytes

Total number of bytes consumed by the elements of the array.

Units

del_units

Delete the units.

get_units

Return the units.

has_units

Whether units have been set.

set_units

Set the units.

Attributes

Units

The Units object containing the units of the data array.

Date-time support

del_calendar

Delete the calendar.

get_calendar

Return the calendar.

has_calendar

Whether a calendar has been set.

set_calendar

Set the calendar.

Attributes

datetime_array

An independent numpy array of date-time objects.

dtarray

Alias for datetime_array.

datetime_as_string

Returns an independent numpy array with datetimes as strings.

Dask

compute

A view of the computed data.

persist

Persist data into memory.

cull_graph

Remove unnecessary tasks from the dask graph in-place.

dask_compressed_array

Returns a dask array of the compressed data.

rechunk

Change the chunk structure of the data.

chunk_indices

Return indices of the data that define each dask chunk.

todict

Return a dictionary of the dask graph key/value pairs.

to_dask_array

Convert the data to a dask array.

get_deterministic_name

Get the deterministic name for the data.

has_deterministic_name

Whether there is a deterministic name for the data.

Attributes

chunks

The dask chunk sizes for each dimension.

chunksize

The largest dask chunk size for each dimension.

chunk_positions

Find the position of each chunk.

npartitions

The total number of chunks.

numblocks

The number of chunks along each dimension.

Data creation routines

Ones and zeros

empty

Return a new array, without initialising entries.

ones

Returns a new array filled with ones of set shape and type.

zeros

Returns a new array filled with zeros of set shape and type.

full

Return new data filled with a fill value.

From existing data

asdata

Convert the input to a Data object.

copy

Return a deep copy of the data.

Data manipulation routines

Changing data shape

flatten

Flatten specified axes of the data.

reshape

Change the shape of the data without changing its values.

Transpose-like operations

transpose

Permute the axes of the data array.

Changing number of dimensions

insert_dimension

Expand the shape of the data array in place.

squeeze

Remove size 1 axes from the data array.

Joining data

concatenate

Join a sequence of data arrays together.

Adding and removing elements

unique

The unique elements of the data.

Expanding the data

pad_missing

Pad an axis with missing data.

Indexing routines

Single value selection

first_element

Return the first element of the data as a scalar.

second_element

Return the second element of the data as a scalar.

last_element

Return the last element of the data as a scalar.

Logic functions

Truth value testing

all

Test whether all data array elements evaluate to True.

any

Test whether any data array elements evaluate to True.

Comparison

equals

True if two data arrays are logically equal, False otherwise.

Mask support

harden_mask

Force the mask to hard.

soften_mask

Force the mask to soft.

apply_masking

Apply masking.

masked_where

Mask the data where a condition is met.

filled

Replace masked elements with a fill value.

masked_values

Mask using floating point equality.

del_fill_value

Delete the fill value.

get_fill_value

Return the missing data value.

has_fill_value

Whether a fill value has been set.

set_fill_value

Set the missing data value.

Attributes

hardmask

Hardness of the mask.

mask

The Boolean missing data mask of the data array.

fill_value

The data array missing data value.

Mathematical functions

Sums, products, differences

sum

Calculate sum values.

Set routines

Making proper sets

unique

The unique elements of the data.

Sorting, searching, and counting

Statistics

Order statistics

maximum

minimum

max

Calculate maximum values.

min

Calculate minimum values.

Sums

sum

Calculate sum values.

Compression by convention

get_compressed_axes

Returns the dimensions that are compressed in the array.

get_compressed_dimension

Returns the compressed dimension’s array position.

get_compression_type

Returns the type of compression applied to the array.

get_count

Return the count variable for a compressed array.

get_index

Return the index variable for a compressed array.

get_list

Return the list variable for a compressed array.

get_dependent_tie_points

Return the dependent tie points for a compressed array.

get_interpolation_parameters

Return the interpolation parameters for a compressed array.

get_tie_point_indices

Return the tie point indices for a compressed array.

uncompress

Uncompress the data.

Attributes

compressed_array

Returns an independent numpy array of the compressed data.

Miscellaneous

creation_commands

Return the commands that would create the data object.

get_data

Returns the data.

get_filenames

The names of files containing parts of the data array.

get_original_filenames

The names of files containing the original data and metadata.

source

Return the underlying array object.

chunk_indices

Return indices of the data that define each dask chunk.

Attributes

data

The data as an object identity.

tolist

Return the data as a scalar or (nested) list.

Performance

nc_clear_hdf5_chunksizes

Clear the HDF5 chunking strategy for the data.

nc_hdf5_chunksizes

Get the HDF5 chunking strategy for the data.

nc_set_hdf5_chunksizes

Set the HDF5 chunking strategy for the data.

to_memory

Aggregation

file_directories

The directories of files containing parts of the data.

replace_directory

Replace file directories in-place.

replace_filenames

Replace file locations in-place.

nc_del_aggregated_data

Remove the netCDF aggregated_data terms.

nc_del_aggregation_write_status

Set the netCDF aggregation write status to False.

nc_get_aggregated_data

Return the netCDF aggregated data terms.

nc_get_aggregation_fragment_type

The type of fragments in the aggregated data.

nc_get_aggregation_write_status

Get the netCDF aggregation write status.

nc_has_aggregated_data

Whether any netCDF aggregated_data terms have been set.

nc_set_aggregated_data

Set the netCDF aggregated_data elements.

nc_set_aggregation_write_status

Set the netCDF aggregation write status.

Special

__array__

The numpy array interface.

__deepcopy__

Called by the copy.deepcopy function.

__getitem__

Return a subspace of the data defined by indices.

__int__

Called to implement the built-in function int.

__iter__

Called when an iterator is required.

__repr__

Called by the repr built-in function.

__setitem__

Implement indexed assignment.

__str__

Called by the str built-in function.

Docstring substitutions

Methods

_docstring_special_substitutions

Return the special docstring substitutions.

_docstring_substitutions

Returns the substitutions that apply to methods of the class.

_docstring_package_depth

Returns the class {{package}} substitutions package depth.

_docstring_method_exclusions

Returns method names excluded in the class substitutions.