cf.Data.percentile

Data.percentile(ranks, axes=None, method='linear', squeeze=False, mtol=1, inplace=False, interpolation=None, interpolation2=None)[source]

Compute percentiles of the data along the specified axes.

The default is to compute the percentiles along a flattened version of the data.

If the input data are integers, or are floats smaller than float64, or contain missing values, then the output data-type is float64. Otherwise, the output data-type is the same as that of the input.

If multiple percentile ranks are given then a new, leading data dimension is created so that percentiles can be stored for each percentile rank.
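
For instance, as a minimal sketch (the dtype and shapes shown follow from the behaviour described above):

>>> import numpy as np
>>> import cf
>>> d = cf.Data(np.arange(6))
>>> # Integer input, so the result is float64
>>> d.percentile(50).dtype
dtype('float64')
>>> # Three percentile ranks give a new leading dimension of size 3
>>> d.percentile([25, 50, 75]).shape
(3, 1)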

Accuracy

The percentile method returns results that are consistent with numpy.percentile, which may differ from those produced by dask.array.percentile. The dask method uses an algorithm that calculates approximate percentiles, which are likely to differ from the correct values when there are two or more dask chunks.

>>> import numpy as np
>>> import dask.array as da
>>> import cf
>>> a = np.arange(101)
>>> dx = da.from_array(a, chunks=10)
>>> da.percentile(dx, 40).compute()
array([40.36])
>>> np.percentile(a, 40)
40.0
>>> d = cf.Data(a, chunks=10)
>>> d.percentile(40).array
array([40.])

New in version 3.0.4.

Parameters
ranks: (sequence of) number

Percentile rank, or sequence of percentile ranks, to compute, which must be between 0 and 100 inclusive.

axes: (sequence of) int, optional

Select the axes. The axes argument may be one, or a sequence, of integers that select the axis corresponding to the given position in the list of axes of the data array.

By default, or if axes is None, all axes are selected.
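
For example, a minimal sketch of selecting axes by position for 2-d data:

>>> import numpy as np
>>> import cf
>>> d = cf.Data(np.arange(12).reshape(3, 4))
>>> # Collapse the first axis only; it is kept in the result with size 1
>>> d.percentile(50, axes=0).shape
(1, 4)
>>> # Selecting every axis is equivalent to the default flattened behaviour
>>> d.percentile(50, axes=[0, 1]).shape
(1, 1)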

method: str, optional

Specify the interpolation method to use when the percentile lies between two data values. The methods are listed here, but their definitions are given in the documentation for numpy.percentile.

For the default 'linear' method, if the percentile lies between two adjacent data values i < j then the percentile is calculated as i+(j-i)*fraction, where fraction is the fractional part of the index surrounded by i and j.

'inverted_cdf'
'averaged_inverted_cdf'
'closest_observation'
'interpolated_inverted_cdf'
'hazen'
'weibull'
'linear' (default)
'median_unbiased'
'normal_unbiased'
'lower'
'higher'
'nearest'
'midpoint'

New in version 3.14.0.
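
For example, a minimal sketch contrasting the default 'linear' method with 'lower' and 'higher':

>>> import cf
>>> d = cf.Data([1.0, 2.0, 3.0, 4.0])
>>> # The 50th percentile lies at fractional index 0.5 * (4 - 1) = 1.5,
>>> # between i=2.0 and j=3.0, so 'linear' gives 2.0 + (3.0 - 2.0) * 0.5
>>> print(d.percentile(50, squeeze=True).array)
2.5
>>> # 'lower' and 'higher' select a bracketing data value instead
>>> print(d.percentile(50, method='lower', squeeze=True).array)
2.0
>>> print(d.percentile(50, method='higher', squeeze=True).array)
3.0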

squeeze: bool, optional

If True then all axes over which percentiles are calculated are removed from the returned data. By default axes over which percentiles have been calculated are left in the result as axes with size 1, meaning that the result is guaranteed to broadcast correctly against the original data.

mtol: number, optional

The sample size threshold below which collapsed values are set to missing data. It is defined as a fraction (between 0 and 1 inclusive) of the contributing input data values.

The default of mtol is 1, meaning that a missing datum in the output array occurs whenever all of its contributing input array elements are missing data.

For other values, a missing datum in the output array occurs whenever more than 100*mtol% of its contributing input array elements are missing data.

Note that for non-zero values of mtol, different collapsed elements may have different sample sizes, depending on the distribution of missing data in the input data.
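
For example, a minimal sketch in which 6 of the 10 contributing input values are missing:

>>> import numpy as np
>>> import cf
>>> d = cf.Data(np.ma.masked_greater(np.arange(10, dtype=float), 3))
>>> # Default mtol=1: not all contributing values are missing, so the
>>> # result is computed from the 4 non-missing values
>>> print(d.percentile(50, squeeze=True).array)
1.5
>>> # mtol=0.25: more than 25% of the contributing values are missing,
>>> # so the result is set to missing data
>>> print(d.percentile(50, mtol=0.25, squeeze=True).array)
--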

split_every: int or dict, optional

Determines the depth of the recursive aggregation. If set to the number of input chunks or more, the aggregation will be performed in two steps: one partial collapse per input chunk and a single aggregation at the end. If set to less than that, an intermediate aggregation step will be used, so that no intermediate or final aggregation step operates on more than split_every inputs. The depth of the aggregation graph will be \(\log_{\mathrm{split\_every}}(\text{number of input chunks along the reduced axes})\). Setting this to a low value can reduce cache size and network transfers, at the cost of more CPU and a larger dask graph.

By default, dask heuristically decides on a good value. A default can also be set globally with the split_every key in dask.config. See dask.array.reduction for details.

New in version 3.14.0.
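
As a rough illustration of the formula above (independent of any particular dataset):

>>> from math import log
>>> # 16 input chunks along the reduced axes with split_every=4 gives an
>>> # aggregation tree of depth log_4(16) = 2
>>> round(log(16) / log(4))
2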

inplace: bool, optional

If True then do the operation in-place and return None.

interpolation: deprecated at version 3.14.0

Use the method parameter instead.

Returns
Data or None

The percentiles of the original data, or None if the operation was in-place.

Examples

>>> import numpy as np
>>> import cf
>>> d = cf.Data(np.arange(12).reshape(3, 4), 'm')
>>> print(d.array)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
>>> p = d.percentile([20, 40, 50, 60, 80])
>>> p
<CF Data(5, 1, 1): [[[2.2, ..., 8.8]]] m>
>>> p = d.percentile([20, 40, 50, 60, 80], squeeze=True)
>>> print(p.array)
[2.2 4.4 5.5 6.6 8.8]

Find the standard deviation of the values above the 80th percentile:

>>> p80 = d.percentile(80)
>>> p80
<CF Data(1, 1): [[8.8]] m>
>>> e = d.where(d<=p80, cf.masked)
>>> print(e.array)
[[-- -- -- --]
 [-- -- -- --]
 [-- 9 10 11]]
>>> e.std()
<CF Data(1, 1): [[0.816496580927726]] m>

Find the mean of the values above the 45th percentile along the second axis:

>>> p45 = d.percentile(45, axes=1)
>>> print(p45.array)
[[1.35]
 [5.35]
 [9.35]]
>>> e = d.where(d<=p45, cf.masked)
>>> print(e.array)
[[-- -- 2 3]
 [-- -- 6 7]
 [-- -- 10 11]]
>>> f = e.mean(axes=1)
>>> f
<CF Data(3, 1): [[2.5, ..., 10.5]] m>
>>> print(f.array)
[[ 2.5]
 [ 6.5]
 [10.5]]

Find the histogram bin boundaries associated with given percentiles, and digitize the data based on these bins:

>>> bins = d.percentile([0, 10, 50, 90, 100], squeeze=True)
>>> print(bins.array)
[ 0.   1.1  5.5  9.9 11. ]
>>> e = d.digitize(bins, closed_ends=True)
>>> print(e.array)
[[0 0 1 1]
 [1 1 2 2]
 [2 2 3 3]]