missing_value shouldn't be handled the same as _FillValue (?) #11089

@acoque

What is the issue?

According to the netCDF attribute conventions: "This attribute [missing_value] is not treated in any special way by the library or conforming generic applications, but is often useful documentation and may be used by specific applications."
However, when reading a netCDF file, xarray currently handles it almost the same way as _FillValue (see code).

I find it to be an issue when dealing with datasets containing both:

  1. float variables with _FillValue/scale_factor/add_offset in encoding and
  2. non-float variables with missing_value in attrs (and/or in encoding).

Indeed, in that case, loading the dataset from disk either:

  1. casts all variables to float if mask_and_scale=True, or
  2. does not scale the float variables at all if mask_and_scale=False.
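
The promotion in the first case can be reproduced with plain NumPy. This is only a minimal sketch of the general mechanism (an integer array cannot hold NaN), not xarray's actual decoding code, and the variable names are chosen for illustration:

```python
import numpy as np

# int8 data where -128 marks a missing value, as in my_int_var.
raw = np.array([[1, 2], [-128, 4]], dtype=np.int8)
missing_value = -128

# Integer dtypes have no NaN, so masking with NaN requires a float copy.
masked = raw.astype(np.float32)
masked[raw == missing_value] = np.nan

print(raw.dtype, raw.nbytes)        # int8 4
print(masked.dtype, masked.nbytes)  # float32 16
```

The variable becomes four times larger, and the original integer dtype is lost.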

I think it would be better for xarray to simply ignore the missing_value attribute when loading netCDF files (instead of treating it as an alias of _FillValue), so that people have a way to properly indicate missing data in CF-compliant datasets for data types that do not support NaN.
Another solution would be to stop promoting dtypes that cannot hold NaN to float when decoding a variable. Instead, xarray could keep the original dtype, set fill_value (in maybe_promote) to missing_value (if it exists), and write it to the attrs of the newly created DataArray so that it can be tracked.
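
To show what keeping the original dtype could look like, here is a rough sketch that uses a NumPy masked array as a stand-in. It is not a proposal for the actual xarray implementation (maybe_promote is not involved here); it only illustrates that the missing value can be tracked without promoting to float:

```python
import numpy as np

raw = np.array([[1, 2], [-128, 4]], dtype=np.int8)
missing_value = -128

# masked_equal flags entries equal to missing_value and records it as fill_value.
tracked = np.ma.masked_equal(raw, missing_value)

print(tracked.dtype)       # int8
print(tracked.fill_value)  # -128
print(tracked.sum())       # 7
```

The data stay int8, the sentinel is kept alongside the array, and reductions skip the masked entry.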

Note: This issue is loosely linked to #8359, which is about the other netCDF encoding attributes.

Example

import numpy as np
import xarray as xr

# A float variable containing a NaN, and an int8 variable that uses -128 as missing_value.
ds = xr.Dataset({
    'my_float_var': xr.DataArray(np.array([[1.01, 2.01], [np.nan, 4.01]])),
    'my_int_var': xr.DataArray(np.array([[1, 2], [-128, 4]]).astype(np.int8), attrs={'missing_value': -128}),
})
# Store the float variable on disk as packed uint16.
ds.my_float_var.encoding = {'dtype': np.uint16, '_FillValue': 65535, 'scale_factor': 100, 'add_offset': 0}
ds.to_netcdf('test_missing_value.nc')
print(ds)

# Output 1: default decoding (mask_and_scale=True).
ds1 = xr.load_dataset('test_missing_value.nc')
print(ds1)

# Output 2: no masking or scaling.
ds2 = xr.load_dataset('test_missing_value.nc', mask_and_scale=False)
print(ds2)

Current datasets

Input

<xarray.Dataset> Size: 36B
Dimensions:       (dim_0: 2, dim_1: 2)
Dimensions without coordinates: dim_0, dim_1
Data variables:
    my_float_var  (dim_0, dim_1) float64 32B 1.01 2.01 nan 4.01
    my_int_var    (dim_0, dim_1) int8 4B 1 2 -128 4

Output 1 (with mask_and_scale=True)

<xarray.Dataset> Size: 48B
Dimensions:       (dim_0: 2, dim_1: 2)
Dimensions without coordinates: dim_0, dim_1
Data variables:
    my_float_var  (dim_0, dim_1) float64 32B 0.0 0.0 nan 0.0
    my_int_var    (dim_0, dim_1) float32 16B 1.0 2.0 nan 4.0

Output 2 (with mask_and_scale=False)

<xarray.Dataset> Size: 12B
Dimensions:       (dim_0: 2, dim_1: 2)
Dimensions without coordinates: dim_0, dim_1
Data variables:
    my_float_var  (dim_0, dim_1) uint16 8B 0 0 65535 0
    my_int_var    (dim_0, dim_1) int8 4B 1 2 -128 4

Expected output 1

<xarray.Dataset> Size: 36B
Dimensions:       (dim_0: 2, dim_1: 2)
Dimensions without coordinates: dim_0, dim_1
Data variables:
    my_float_var  (dim_0, dim_1) float64 32B 0.0 0.0 nan 0.0
    my_int_var    (dim_0, dim_1) int8 4B 1 2 -128 4
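
A side note on the float values: the 0.0 entries in the outputs are consistent with the CF packing convention (unpacked = packed * scale_factor + add_offset). With scale_factor=100, values between 1 and 4 pack to 0 in a uint16. The sketch below reproduces this with plain NumPy. pack and unpack are hypothetical helpers written for illustration, not xarray functions, and the scale_factor of 0.01 is only an assumed alternative shown for comparison:

```python
import numpy as np

def pack(values, scale_factor, add_offset, fill_value, dtype):
    """Hypothetical helper: pack floats into integers, CF-style."""
    scaled = (values - add_offset) / scale_factor
    # NaN becomes the fill value; everything else is rounded to the nearest integer.
    filled = np.where(np.isnan(scaled), fill_value, np.rint(scaled))
    return filled.astype(dtype)

def unpack(packed, scale_factor, add_offset, fill_value):
    """Hypothetical helper: inverse of pack, with fill values mapped to NaN."""
    data = packed.astype(np.float64)
    data[packed == fill_value] = np.nan
    return data * scale_factor + add_offset

values = np.array([[1.01, 2.01], [np.nan, 4.01]])

coarse = pack(values, 100, 0, 65535, np.uint16)
print(coarse.tolist())  # [[0, 0], [65535, 0]]

fine = pack(values, 0.01, 0, 65535, np.uint16)
print(fine.tolist())    # [[101, 201], [65535, 401]]
```

This does not change the point of the issue, which concerns my_int_var; it only explains why my_float_var reads back as zeros in this example.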
