
Chardata plus encoded datasets #6898

Draft
pp-mo wants to merge 37 commits into SciTools:main from pp-mo:chardata_plus_encoded_datasets

Conversation

@pp-mo
Member

@pp-mo pp-mo commented Jan 19, 2026

Closes #6309 + various

Successor to #6850
now incorporating #6851

+ now also integrates usage with netcdf load + save, to use encoded datasets

pp-mo added 28 commits January 19, 2026 11:49
Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
Rename; add in parts of old investigation; add temporary notes.
@pp-mo pp-mo mentioned this pull request Jan 20, 2026
This was referenced Jan 20, 2026
@pp-mo pp-mo force-pushed the chardata_plus_encoded_datasets branch from 4a9cbc2 to c4b7936 on January 22, 2026 00:37
@pp-mo
Member Author

pp-mo commented Jan 28, 2026

Current status 2026-01-28

  • test_stringdata and test_bytecoding_datasets are now all passing OK
  • "most" existing tests are now passing
    • some specific problems now spotted + removed
    • remaining failures are "mostly" unit-tests (and may be mock-ist in nature)
  • new issue created to cover remaining work

Comment on lines +32 to +33
identifying with "codecs.lookup" : This means we support the encodings in the Python
Standard Library, and the name aliases which it recognises.
Member Author


Given the need to define translations between byte-width and character-width for a given encoding
(here and here)
we probably need to define a list of "supported encodings".
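For illustration, identification via `codecs.lookup` could be sketched roughly as below; the helper name here is hypothetical, not code from this PR:

```python
import codecs

# Hypothetical helper: normalise any alias recognised by the Python
# Standard Library to its canonical codec name, or fail loudly.
# A "supported encodings" list could then be keyed on these canonical
# names rather than on every possible alias.
def canonical_encoding(name: str) -> str:
    try:
        return codecs.lookup(name).name
    except LookupError:
        raise ValueError(f"Unsupported encoding: {name!r}")

# Aliases resolve to one canonical name, e.g. "latin-1" and
# "ISO-8859-1" both map to "iso8859-1".
```

This would let a width-per-character table be defined once per canonical name, with all aliases handled automatically.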

string_width: int # string lengths when viewing as strings (i.e. "Uxx")

def __init__(self, cf_var):
"""Get all the info from an netCDF4 variable (or similar wrapper object).
Member Author

@pp-mo pp-mo Jan 28, 2026


Suggested change
"""Get all the info from an netCDF4 variable (or similar wrapper object).
"""Get all the info from a netCDF4 variable.

It must actually be "at least" a threadsafe wrapped variable (or a real netCDF4.Variable), and not an EncodedVariable, since we inspect its '.dtype' etc.

Comment on lines +120 to +123
read_encoding: str # *always* a valid encoding from the codecs package
write_encoding: str # *always* a valid encoding from the codecs package
n_chars_dim: int # length of associated character dimension
string_width: int # string lengths when viewing as strings (i.e. "Uxx")
Member Author


These are now only set if "is_chardata" -- see init code
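As background on why `n_chars_dim` and `string_width` are distinct quantities (this is a general NumPy fact, not code from the PR): NumPy's fixed-width unicode dtypes always store 4 bytes per character, so the encoded byte length in the file and the "Uxx" width need not match.

```python
import numpy as np

# NumPy "U<n>" dtypes hold fixed-width unicode as UCS-4: 4 bytes per
# character regardless of the file encoding, so a character-dimension
# length measured in encoded bytes does not generally equal the "Uxx"
# string width.
dt = np.dtype("U8")
print(dt.itemsize)   # 32 bytes = 8 characters * 4 bytes each

# With a 1-byte-per-char encoding such as ascii, encoded byte length
# and character count coincide:
s = np.array(["abc"], dtype="U8")
print(len(s[0].encode("ascii")))  # 3 encoded bytes for 3 characters
```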

Comment on lines +240 to +242
DECODE_TO_STRINGS_ON_READ = NetcdfStringDecodeSetting()
DEFAULT_READ_ENCODING = "utf-8"
DEFAULT_WRITE_ENCODING = "ascii"
Member Author


These should be made available in public API.
Probably by importing in iris.fileformats.netcdf and including in its __all__ ?
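A minimal, self-contained demonstration of that re-export pattern; the module name here is made up for the demo, and the real home in iris.fileformats.netcdf is still to be decided:

```python
import sys
import types

# Build a throwaway module to show that __all__ controls what
# "from <module> import *" exposes.  The module name and attribute
# placement are assumptions for the demo, not the actual iris layout.
mod = types.ModuleType("fake_netcdf_settings")
mod.DEFAULT_READ_ENCODING = "utf-8"
mod.DEFAULT_WRITE_ENCODING = "ascii"
mod._internal_detail = "not exported"
mod.__all__ = ["DEFAULT_READ_ENCODING", "DEFAULT_WRITE_ENCODING"]
sys.modules["fake_netcdf_settings"] = mod

# Star-import pulls in exactly the names listed in __all__.
namespace = {}
exec("from fake_netcdf_settings import *", namespace)
print(sorted(n for n in namespace if not n.startswith("_")))
# ['DEFAULT_READ_ENCODING', 'DEFAULT_WRITE_ENCODING']
```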

Contributor

@ukmo-ccbunney ukmo-ccbunney left a comment


Just one comment at this time.

encoding = self.read_encoding
if "utf-16" in encoding:
# Each char needs at least 2 bytes -- including a terminator char
strlen = (strlen // 2) - 1
Contributor


Do we really need to account for a terminating char on "utf-32" and "utf-16" encodings?
When writing to a netCDF file, surely the terminator isn't written? This is just something that is used when storing strings in memory, is it not?

Contributor


OK - this looks to be the case. Certainly, encoding a string to "utf-16" or "utf-32" does produce extra bytes beyond the characters themselves...
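This can be checked directly in Python: the extra bytes are in fact a leading byte-order mark (BOM), not a trailing terminator, and only the endianness-unspecified codecs add it. The explicit-endian variants encode the characters alone:

```python
# "utf-16" / "utf-32" prepend a 2- or 4-byte BOM when the byte order
# is unspecified; the "-le" / "-be" variants do not.
for text in ("a", "abc"):
    print(len(text.encode("utf-16")), len(text.encode("utf-16-le")))
# 4 2
# 8 6

print(len("a".encode("utf-32")), len("a".encode("utf-32-le")))
# 8 4
```

So the "extra 2 bytes" accounted for in the width calculation come from the BOM at the start of the encoded data, which may affect whether an explicit-endian encoding needs the adjustment at all.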



Development

Successfully merging this pull request may close these issues.

Fix iris handling of netcdf character array variables
