cf.Field.percentile

Field.percentile(ranks, axes=None, method='linear', squeeze=False, mtol=1, interpolation=None)

Compute percentiles of the data along the specified axes.

The default is to compute the percentiles along a flattened version of the data.

If the input data are integers, or floats smaller than float64, or the input data contain missing values, then the output data type is float64. Otherwise, the output data type is the same as that of the input.

If multiple percentile ranks are given then a new, leading data dimension is created so that percentiles can be stored for each percentile rank.

The output field construct has a new dimension coordinate construct that records the percentile ranks represented by its data.

Accuracy

The percentile method returns results that are consistent with numpy.percentile, which may differ from those produced by dask.array.percentile. The dask function uses an algorithm that calculates approximate percentiles, which are likely to differ from the correct values when there are two or more dask chunks.

>>> import numpy as np
>>> import dask.array as da
>>> import cf
>>> a = np.arange(101)
>>> dx = da.from_array(a, chunks=10)
>>> da.percentile(dx, 40).compute()
array([40.36])
>>> np.percentile(a, 40)
40.0
>>> d = cf.Data(a, chunks=10)
>>> d.percentile(40).array
array([40.])

New in version 3.0.4.

See also

bin, collapse, digitize, where

Parameters
ranks: (sequence of) number

Percentile ranks, or sequence of percentile ranks, to compute, which must be between 0 and 100 inclusive.

axes: (sequence of) str or int, optional

Select the domain axes over which to calculate the percentiles, defined by the domain axes that would be selected by passing each given axis description to a call of the field construct’s domain_axis method. For example, for a value of 'X', the domain axis construct returned by f.domain_axis('X') is selected.

By default, or if axes is None, all axes are selected.

method: str, optional

Specify the interpolation method to use when the percentile lies between two data values. The available methods are listed below; see the documentation for numpy.percentile for their definitions.

For the default 'linear' method, if the percentile lies between two adjacent data values i < j then the percentile is calculated as i+(j-i)*fraction, where fraction is the fractional part of the index surrounded by i and j.

'inverted_cdf'
'averaged_inverted_cdf'
'closest_observation'
'interpolated_inverted_cdf'
'hazen'
'weibull'
'linear' (default)
'median_unbiased'
'normal_unbiased'
'lower'
'higher'
'nearest'
'midpoint'

New in version 3.14.0.
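The 'linear' formula i+(j-i)*fraction can be worked through in a few lines of plain Python. This is a minimal sketch for illustration only: linear_percentile is a hypothetical helper, not part of cf or numpy, and it assumes a non-empty, one-dimensional sequence with no missing values.

```python
import math


def linear_percentile(values, rank):
    """Percentile of `values` at `rank` (0 to 100) by linear interpolation.

    Hypothetical helper for illustration; not part of cf or numpy.
    Assumes `values` is non-empty and contains no missing data.
    """
    data = sorted(values)
    # Fractional, zero-based position of the requested rank in the sorted data.
    index = rank * (len(data) - 1) / 100
    lower = math.floor(index)
    upper = min(lower + 1, len(data) - 1)
    # Fractional part of the index surrounded by the two neighbours.
    fraction = index - lower
    i, j = data[lower], data[upper]
    return i + (j - i) * fraction


# Rank 40 of the integers 0..100 falls exactly on a data value.
print(linear_percentile(range(101), 40))

# Rank 40 of four values falls between 2 and 3 (index 1.2, fraction 0.2),
# so the result is approximately 2.2.
print(linear_percentile([1, 2, 3, 4], 40))
```

The first call reproduces the value 40.0 shown for np.percentile(a, 40) in the Accuracy section. The other methods differ only in how the index and the two neighbouring values are combined.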

squeeze: bool, optional

If True then all size 1 axes are removed from the returned percentiles data. By default axes over which percentiles have been calculated are left in the result as axes with size 1, meaning that the result is guaranteed to broadcast correctly against the original data.
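The effect of keeping or removing size 1 axes can be seen with plain numpy arrays. The sketch below is an analogy rather than cf code: it assumes that numpy's keepdims=True plays the role of the default squeeze=False, and that numpy.squeeze plays the role of squeeze=True. The array a is made up for the example.

```python
import numpy as np

# Made-up data with the same shape as the example field (5 x 8).
a = np.arange(40.0).reshape(5, 8)

# One rank along the second axis, keeping the collapsed axis with size 1.
kept = np.percentile(a, 45, axis=1, keepdims=True)
print(kept.shape)

# The size 1 axis lets the result broadcast against the original data.
above = a > kept
print(above.shape)

# Removing the size 1 axis gives shape (5,), which no longer broadcasts
# against the (5, 8) original.
squeezed = np.squeeze(kept)
print(squeezed.shape)

# Several ranks over all axes add a new leading axis, one entry per rank.
multi = np.percentile(a, [20, 40, 50, 60, 80], keepdims=True)
print(multi.shape)
```

The shapes printed are (5, 1), (5, 8), (5,) and (5, 1, 1); the last matches the layout of the multi-rank result in the Examples section.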

mtol: number, optional

Set the fraction of input data elements which is allowed to contain missing data when contributing to an individual output data element. Where this fraction exceeds mtol, missing data is returned. The default is 1, meaning that a missing datum in the output array occurs when its contributing input array elements are all missing data. A value of 0 means that a missing datum in the output array occurs whenever any of its contributing input array elements are missing data. Any intermediate value is permitted.

Parameter example:

To ensure that an output array element is a missing datum if more than 25% of its input array elements are missing data: mtol=0.25.
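The mtol rule can be sketched in plain Python for a single output element. This is an illustration of the rule as described above, not the cf implementation: percentile_with_mtol is a hypothetical helper, None stands in for missing data, and the 'linear' method is assumed.

```python
def percentile_with_mtol(values, rank, mtol=1.0):
    """Percentile of the non-missing `values`, or None if too many are missing.

    Hypothetical helper for illustration; not part of cf. None marks a
    missing datum, both in the input and as the returned value.
    """
    present = sorted(v for v in values if v is not None)
    missing_fraction = (len(values) - len(present)) / len(values)
    # Missing output if every input is missing, or if the missing
    # fraction exceeds the tolerance.
    if not present or missing_fraction > mtol:
        return None
    # Linear interpolation between the two neighbouring sorted values.
    index = rank * (len(present) - 1) / 100
    lower = int(index)
    upper = min(lower + 1, len(present) - 1)
    return present[lower] + (present[upper] - present[lower]) * (index - lower)


# 25% missing does not exceed mtol=0.25, so a value is returned.
print(percentile_with_mtol([1, None, 3, 4], 50, mtol=0.25))

# 50% missing exceeds mtol=0.25, so the result is missing.
print(percentile_with_mtol([None, None, 3, 4], 50, mtol=0.25))

# With mtol=0, a single missing input is enough to give a missing result.
print(percentile_with_mtol([1, None, 3, 4], 50, mtol=0))
```

With the default mtol=1, only an input that is entirely missing produces a missing result.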

interpolation: deprecated at version 3.14.0

Use the method parameter instead.

Returns
Field

The percentiles of the original data.

Examples

>>> f = cf.example_field(0)
>>> print(f)
Field: specific_humidity
------------------------
Data            : specific_humidity(latitude(5), longitude(8)) 1
Cell methods    : area: mean
Dimension coords: time(1) = [2019-01-01 00:00:00]
                : latitude(5) = [-75.0, ..., 75.0] degrees_north
                : longitude(8) = [22.5, ..., 337.5] degrees_east
>>> print(f.array)
[[0.007 0.034 0.003 0.014 0.018 0.037 0.024 0.029]
 [0.023 0.036 0.045 0.062 0.046 0.073 0.006 0.066]
 [0.11  0.131 0.124 0.146 0.087 0.103 0.057 0.011]
 [0.029 0.059 0.039 0.07  0.058 0.072 0.009 0.017]
 [0.006 0.036 0.019 0.035 0.018 0.037 0.034 0.013]]
>>> p = f.percentile([20, 40, 50, 60, 80])
>>> print(p)
Field: specific_humidity
------------------------
Data            : specific_humidity(long_name=Percentile ranks for latitude, longitude dimensions(5), latitude(1), longitude(1)) 1
Dimension coords: time(1) = [2019-01-01 00:00:00]
                : latitude(1) = [0.0] degrees_north
                : longitude(1) = [180.0] degrees_east
                : long_name=Percentile ranks for latitude, longitude dimensions(5) = [20, ..., 80]
>>> print(p.array)
[[[0.0164]]
 [[0.032 ]]
 [[0.036 ]]
 [[0.0414]]
 [[0.0704]]]

Find the standard deviation of the values above the 80th percentile:

>>> p80 = f.percentile(80)
>>> print(p80)
Field: specific_humidity
------------------------
Data            : specific_humidity(latitude(1), longitude(1)) 1
Dimension coords: time(1) = [2019-01-01 00:00:00]
                : latitude(1) = [0.0] degrees_north
                : longitude(1) = [180.0] degrees_east
                : long_name=Percentile ranks for latitude, longitude dimensions(1) = [80]
>>> g = f.where(f<=p80, cf.masked)
>>> print(g.array)
[[  --    --    --    --    --    -- -- --]
 [  --    --    --    --    -- 0.073 -- --]
 [0.11 0.131 0.124 0.146 0.087 0.103 -- --]
 [  --    --    --    --    -- 0.072 -- --]
 [  --    --    --    --    --    -- -- --]]
>>> g.collapse('standard_deviation', weights=True).data
<CF Data(1, 1): [[0.024609938742357642]] 1>

Find the mean of the values above the 45th percentile along the X axis:

>>> p45 = f.percentile(45, axes='X')
>>> print(p45.array)
[[0.0189 ]
 [0.04515]
 [0.10405]
 [0.04185]
 [0.02125]]
>>> g = f.where(f<=p45, cf.masked)
>>> print(g.array)
[[  -- 0.034    --    --    -- 0.037 0.024 0.029]
 [  --    --    -- 0.062 0.046 0.073    -- 0.066]
 [0.11 0.131 0.124 0.146    --    --    --    --]
 [  -- 0.059    -- 0.07  0.058 0.072    --    --]
 [  -- 0.036    -- 0.035    -- 0.037 0.034    --]]
>>> print(g.collapse('X: mean', weights=True).array)
[[0.031  ]
 [0.06175]
 [0.12775]
 [0.06475]
 [0.0355 ]]

Find the histogram bin boundaries associated with given percentiles, and digitize the data based on these bins:

>>> bins = f.percentile([0, 10, 50, 90, 100], squeeze=True)
>>> print(bins.array)
[0.003  0.0088 0.036  0.1037 0.146 ]
>>> i = f.digitize(bins, closed_ends=True)
>>> print(i.array)
[[0 1 0 1 1 2 1 1]
 [1 2 2 2 2 2 0 2]
 [3 3 3 3 2 2 2 1]
 [1 2 2 2 2 2 1 1]
 [0 2 1 1 1 2 1 1]]