
Support character arrays#6764

Closed
pp-mo wants to merge 13 commits into SciTools:main from pp-mo:chardata

@pp-mo pp-mo commented Oct 25, 2025

Closes #6309

So far, just some ideas brewing

pp-mo commented Oct 25, 2025

Older notes

Issues for Iris char data

  • reading and writing, with and without encodings
  • open question: should cube/coord data be viewed as strings, or as the (underlying) byte array?
  • to be confirmed: writing char coords works, but writing char cube data does not

=========================
testing dimensions (FOR READS)

  • encoding can be None, "ascii" or "utf-8"
    • we should also test alternative spellings of utf-8 / ascii
    • but not fuss too much ?

EXISTING behaviour

  • is ok for ascii
  • but results depend on the presence of the "_Encoding" attribute
    • since that is how netCDF4-python works by default

ASIDE: Python "standard encodings": https://docs.python.org/3/library/codecs.html#standard-encodings
That page includes a table of codec names and their aliases.
'codecs.lookup' normalises names like this...

    >>> import codecs
    >>> codecs.lookup("u8").name
    'utf-8'

  • this produces the "name" from any of the "alternatives" (aliases), as in the table
  • it also fails when given junk
    • it does not accept "" or None
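
A minimal sketch of how the name normalisation above might be wrapped, so that junk, "" and None all fall back to a default instead of raising. The function name 'normalise_encoding' and its 'default' argument are made up for illustration; only 'codecs.lookup' is a real call.

```python
import codecs


def normalise_encoding(name, default=None):
    """Return the canonical codec name for `name`, or `default` if unusable.

    Hypothetical helper, not part of Iris or netCDF4-python.
    """
    # codecs.lookup raises TypeError for None, so screen out non-strings first.
    if not isinstance(name, str):
        return default
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Raised for unknown names, including the empty string.
        return default


for candidate in ("UTF8", "us-ascii", "junk", "", None):
    print(repr(candidate), "->", normalise_encoding(candidate))
```

With this, alternative spellings such as "UTF8" and "us-ascii" compare equal after normalisation, which may be enough to avoid fussing over each spelling in tests.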

Old discussion in netcdf4-python, referenced by the xarray docs: Unidata/netcdf4-python#654 (comment)
From that specific comment by jswhit (quoting an old version of the NCUG?):

Applications writing string data using the char data type are encouraged to add
the special variable attribute "_Encoding" with a value that the netCDF libraries
recognize.
Currently those valid values are "UTF-8" or "ASCII", case insensitive.

In the Unidata docs, a reference is hard to find.
There is STILL NOTHING in the Attributes Appendix (A).
But in https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html :

Note on char data: Although the characters used in netCDF names must be encoded
as UTF-8, character data may use other encodings.
The variable attribute “_Encoding” is reserved for this purpose in future implementations.

Outstanding issues

  • assumption that string dim of coords cannot be a data dim
  • how to manage backwards-compatible approach to coords + cubes
    • == expecting data cubes to contain strings ??
    • == OR converting (automatically, with turn-off FUTURE control??) ??

pp-mo commented Oct 27, 2025

There seems to be a problem with netcdf4-python byte encodings: Unidata/netcdf4-python#1440

For now, I have just turned off decoding here, so everything now reads as character arrays (?).
Future intention: decode here, to reproduce the originally intended behaviour.

I now don't think that people need or want to see cubes or coords with string dimensions: we will convert them all to Uxx arrays internally.
This means we will lose the names and identity of string dimensions. But that is probably ok.
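
To illustrate what losing the string dimension means, here is a rough pure-Python sketch (no numpy or netCDF4, so not the actual Iris code). A char array is modelled as rows of single bytes; decoding gives one string per row, and the trailing string dimension disappears. Both function names are hypothetical, and NUL padding of unused positions is an assumption.

```python
def to_rows(text, width, encoding="utf-8"):
    """Hypothetical helper: encode `text` as `width` NUL-padded single bytes."""
    raw = text.encode(encoding)
    if len(raw) > width:
        raise ValueError("encoded text is longer than the string dimension")
    raw = raw.ljust(width, b"\x00")
    return [raw[i:i + 1] for i in range(width)]


def decode_char_rows(rows, encoding="utf-8"):
    """Decode rows of single bytes into a flat list of Python strings."""
    strings = []
    for row in rows:
        raw = b"".join(row)
        # Assumption: unused trailing positions are padded with NUL bytes.
        strings.append(raw.rstrip(b"\x00").decode(encoding))
    return strings


# Two strings stored with a shared string dimension of 8 bytes.
rows = [to_rows("M\u00fcnster", 8), to_rows("Oslo", 8)]
names = decode_char_rows(rows)
print(len(rows), "x", len(rows[0]), "bytes ->", names)
```

The input has shape (2, 8) in bytes, while the result is just 2 strings, so the length-8 dimension and its name are no longer visible to the user.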

Note: existing code names dims according to their (byte) lengths. This seems a neat idea, since it means that variables with the same string length automatically share a dimension where convenient.
But there could be inefficiencies from using worst-case byte lengths for a given Unicode length?
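
A small sketch of the size question, assuming UTF-8 (at most 4 bytes per code point). The "string<N>" dimension name format and both function names are illustrative assumptions only, not taken from the saver code.

```python
def needed_byte_length(strings, encoding="utf-8"):
    """Smallest byte length that can hold every encoded string."""
    return max(len(s.encode(encoding)) for s in strings)


def worst_case_byte_length(n_chars, max_bytes_per_char=4):
    """Upper bound for any string of `n_chars` code points.

    The default of 4 is the UTF-8 maximum per code point.
    """
    return n_chars * max_bytes_per_char


names = ["M\u00fcnster", "Oslo"]
n_chars = max(len(s) for s in names)       # 7 code points
actual = needed_byte_length(names)         # 8 bytes: the u-umlaut takes 2
worst = worst_case_byte_length(n_chars)    # 28 bytes

# Assumed naming scheme: the dimension name encodes its byte length.
dim_name = f"string{actual}"
print(dim_name, actual, worst)
```

Sizing from the actual encoded content needs 8 bytes here, whereas sizing from a "U7" dtype alone would reserve 28, which is the kind of inefficiency the note above is asking about.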

pp-mo added 3 commits October 28, 2025 18:19
…Mostly working?

Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
    common_dims = [
        dim for dim in cf_coord_var.dimensions if dim in engine.cf_var.dimensions
    ]
    coord_dims = cf_coord_var.dimensions

NOTE: this possibly needs to be implemented for ancillary-variables too

  • which might also be strings
  • which is awkward because of DRY failure in rules code

Comment on lines +854 to +857
# if encoding == "ascii":
# print("\n\n*** FIX !!")
# string = bytes.decode("utf-8")
# else:

TODO: remove

pp-mo commented Nov 11, 2025

Status update 2025-11-11

  • the intended behaviour is, I think, now complete and working
  • much more proper testing is needed
    • the added PoC tests exercise it, but lack desired-result asserts -- to be rewritten entirely, probably
  • a number of existing mock-ist tests are broken by the changes (--> failures in this PR), so they need fixing
  • after consideration, I now really want to refactor the encode/decode support
    • to replace the various places where I've added/changed this with a separate dataset wrapper
    • .. like (and subclassing) the _threadsafe_nc ones
    • .. which should reduce a lot of the "mess" and DRY failure in this PoC
    • possibly this can even be removed again, if a future fix to the netcdf bug delivers all that we would want
      • they have already put in a fix, but it is not yet released, so it is far easier to wait for the release before testing against it
      • it's not yet clear (to me) whether that fix is intended to support the _Encoding attribute entirely as we'd like
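
The wrapper idea above could look roughly like the following. This is only a shape sketch: 'DecodingVariable' and 'FakeVar' are invented names, the real version would presumably subclass the _threadsafe_nc wrappers rather than wrap a list, and the NUL-stripping is an assumption.

```python
class DecodingVariable:
    """Hypothetical wrapper: presents a byte-valued variable as strings."""

    def __init__(self, raw_var, default_encoding="utf-8"):
        self._raw = raw_var
        # Use the variable's "_Encoding" attribute when it has one.
        self._encoding = getattr(raw_var, "_Encoding", default_encoding)

    def _decode(self, value):
        return value.rstrip(b"\x00").decode(self._encoding)

    def __getitem__(self, keys):
        value = self._raw[keys]
        if isinstance(value, bytes):
            return self._decode(value)
        return [self._decode(item) for item in value]

    def __setitem__(self, keys, text):
        self._raw[keys] = text.encode(self._encoding)

    def __getattr__(self, name):
        # Only called when normal lookup fails: pass through to the raw variable.
        if name == "_raw":
            raise AttributeError(name)
        return getattr(self._raw, name)


class FakeVar(list):
    """Stand-in for a netCDF variable holding one bytes value per element."""

    _Encoding = "ascii"
    units = "1"


raw = FakeVar([b"abc\x00", b"de\x00\x00"])
var = DecodingVariable(raw)
print(var[0], var[:], var.units)
var[1] = "xyz"
print(raw[1])
```

Keeping all encode/decode logic in one wrapper like this means the load and save code paths would only ever see strings, which is where the DRY improvement would come from.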

@pp-mo pp-mo mentioned this pull request Dec 7, 2025

pp-mo commented Jan 5, 2026

Status update 2026-01-05

I intend to replace this with (roughly) #6850 "plus" #6851

Outstanding errors:

TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestBoundsVertexDim::test_fastest_varying_vertex_dim__normalise_bounds
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestBoundsVertexDim::test_fastest_with_different_dim_names__normalise_bounds
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestBoundsVertexDim::test_slowest_varying_vertex_dim__normalise_bounds
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestDtype::test_add_offset_float
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestDtype::test_scale_factor_add_offset_int
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestDtype::test_scale_factor_float
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestCoordConstruction::test_aux_coord_construction
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestCoordConstruction::test_aux_coord_construction__climatology
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestCoordConstruction::test_bad_coord_system
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestCoordConstruction::test_not_added
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_auxiliary_coordinate.py::TestCoordConstruction::test_with_coord_system
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_dimension_coordinate.py::TestCoordConstruction::test_aux_coord_construction
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_dimension_coordinate.py::TestCoordConstruction::test_auxcoord_not_added
TESTFAIL:unit/fileformats/nc_load_rules/helpers/test_build_and_add_dimension_coordinate.py::TestCoordConstruction::test_dimcoord_not_added
TESTFAIL:unit/fileformats/netcdf/loader/test__get_cf_var_data.py::Test__get_cf_var_data::test_arraytype__100f8_is_real
TESTFAIL:unit/fileformats/netcdf/loader/test__get_cf_var_data.py::Test__get_cf_var_data::test_arraytype__1ki2_is_real
TESTFAIL:unit/fileformats/netcdf/loader/test__get_cf_var_data.py::Test__get_cf_var_data::test_vltype__1000str_is_real_with_hint
TESTFAIL:unit/fileformats/netcdf/loader/test__get_cf_var_data.py::Test__get_cf_var_data::test_vltype__100f8_is_real_with_hint
TESTFAIL:unit/fileformats/netcdf/loader/test__get_cf_var_data.py::Test__get_cf_var_data::test_vltype__100str_is_real
TESTFAIL:unit/fileformats/netcdf/saver/test_Saver.py::Test_write::test_compression
TESTFAIL:unit/fileformats/netcdf/saver/test_Saver.py::Test_write::test_non_compression__dtype
TESTFAIL:unit/fileformats/netcdf/saver/test_Saver.py::Test_write::test_non_compression__shape
TESTFAIL:unit/fileformats/netcdf/saver/test_Saver__lazy.py::Test_write::test_compression
TESTFAIL:unit/fileformats/netcdf/saver/test_Saver__lazy.py::Test_write::test_non_compression__dtype
TESTFAIL:unit/fileformats/netcdf/saver/test_Saver__lazy.py::Test_write::test_non_compression__shape
...
=== 25 failed, 10584 passed, 54 skipped, 2614 warnings in 639.72s (0:10:39) ====

pp-mo commented Jan 20, 2026

replaced by #6898

@pp-mo pp-mo closed this Jan 20, 2026
@scitools-ci scitools-ci bot removed this from 🚴 Peloton Feb 18, 2026
Development

Successfully merging this pull request may close these issues.

Fix iris handling of netcdf character array variables
