
Chardata plus encoded datasets #6898

Draft
pp-mo wants to merge 37 commits into SciTools:main from pp-mo:chardata_plus_encoded_datasets

Conversation

@pp-mo
Member

@pp-mo pp-mo commented Jan 19, 2026

Closes #6309 + various

Successor to #6850
now incorporating #6851

+ now also integrates usage with netcdf load + save, to use encoded datasets

pp-mo added 28 commits January 19, 2026 11:49
Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
Rename; add in parts of old investigation; add temporary notes.
@pp-mo pp-mo mentioned this pull request Jan 20, 2026
This was referenced Jan 20, 2026
@pp-mo pp-mo force-pushed the chardata_plus_encoded_datasets branch from 4a9cbc2 to c4b7936 on January 22, 2026 00:37
@pp-mo
Member Author

pp-mo commented Jan 28, 2026

Current status 2026-01-28

  • test_stringdata and test_bytecoding_datasets are now all passing OK
  • "most" existing tests are now passing
    • some specific problems now spotted + removed
    • remaining failures are "mostly" unit-tests (and may be mock-ist in nature)
  • new issue created to cover remaining work

Comment on lines +32 to +33
identifying with "codecs.lookup" : This means we support the encodings in the Python
Standard Library, and the name aliases which it recognises.
Member Author


Given the need to define translations between byte-width and character-width for a given encoding
(here and here)
we probably need to define a list of "supported encodings".
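For illustration, identification via `codecs.lookup` could be sketched roughly as below; the helper name here is hypothetical, not code from this PR:

```python
import codecs

# Hypothetical helper: normalise any alias recognised by the Python
# Standard Library to its canonical codec name, or fail loudly.
# A "supported encodings" list could then be keyed on these canonical
# names rather than on every possible alias.
def canonical_encoding(name: str) -> str:
    try:
        return codecs.lookup(name).name
    except LookupError:
        raise ValueError(f"Unsupported encoding: {name!r}")

# Aliases resolve to one canonical name, e.g. "latin-1" and
# "ISO-8859-1" both map to "iso8859-1".
```

This would let a width-per-character table be defined once per canonical name, with all aliases handled automatically.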

string_width: int # string lengths when viewing as strings (i.e. "Uxx")

def __init__(self, cf_var):
"""Get all the info from an netCDF4 variable (or similar wrapper object).
Member Author

@pp-mo pp-mo Jan 28, 2026


Suggested change
"""Get all the info from an netCDF4 variable (or similar wrapper object).
"""Get all the info from a netCDF4 variable.

It must actually be "at least" a threadsafe wrapped variable (or a real netCDF4.Variable), and not an EncodedVariable, since we inspect its '.dtype' etc.

Comment on lines +120 to +123
read_encoding: str # *always* a valid encoding from the codecs package
write_encoding: str # *always* a valid encoding from the codecs package
n_chars_dim: int # length of associated character dimension
string_width: int # string lengths when viewing as strings (i.e. "Uxx")
Member Author


These are now only set if "is_chardata" -- see init code
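As background on why `n_chars_dim` and `string_width` are distinct quantities (this is a general NumPy fact, not code from the PR): NumPy's fixed-width unicode dtypes always store 4 bytes per character, so the encoded byte length in the file and the "Uxx" width need not match.

```python
import numpy as np

# NumPy "U<n>" dtypes hold fixed-width unicode as UCS-4: 4 bytes per
# character regardless of the file encoding, so a character-dimension
# length measured in encoded bytes does not generally equal the "Uxx"
# string width.
dt = np.dtype("U8")
print(dt.itemsize)   # 32 bytes = 8 characters * 4 bytes each

# With a 1-byte-per-char encoding such as ascii, encoded byte length
# and character count coincide:
s = np.array(["abc"], dtype="U8")
print(len(s[0].encode("ascii")))  # 3 encoded bytes for 3 characters
```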

Comment on lines +240 to +242
DECODE_TO_STRINGS_ON_READ = NetcdfStringDecodeSetting()
DEFAULT_READ_ENCODING = "utf-8"
DEFAULT_WRITE_ENCODING = "ascii"
Member Author


These should be made available in public API.
Probably by importing in iris.fileformats.netcdf and including in its __all__ ?
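A minimal, self-contained demonstration of that re-export pattern; the module name here is made up for the demo, and the real home in iris.fileformats.netcdf is still to be decided:

```python
import sys
import types

# Build a throwaway module to show that __all__ controls what
# "from <module> import *" exposes.  The module name and attribute
# placement are assumptions for the demo, not the actual iris layout.
mod = types.ModuleType("fake_netcdf_settings")
mod.DEFAULT_READ_ENCODING = "utf-8"
mod.DEFAULT_WRITE_ENCODING = "ascii"
mod._internal_detail = "not exported"
mod.__all__ = ["DEFAULT_READ_ENCODING", "DEFAULT_WRITE_ENCODING"]
sys.modules["fake_netcdf_settings"] = mod

# Star-import pulls in exactly the names listed in __all__.
namespace = {}
exec("from fake_netcdf_settings import *", namespace)
print(sorted(n for n in namespace if not n.startswith("_")))
# ['DEFAULT_READ_ENCODING', 'DEFAULT_WRITE_ENCODING']
```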

Contributor

@ukmo-ccbunney ukmo-ccbunney left a comment


Just one comment at this time.

encoding = self.read_encoding
if "utf-16" in encoding:
# Each char needs at least 2 bytes -- including a terminator char
strlen = (strlen // 2) - 1
Contributor


Do we really need to account for a terminating char on "utf-32" and "utf-16" encodings?
When writing to a netCDF file, surely the terminator isn't written? This is just something that is used when storing strings in memory, is it not?

Contributor


OK - this looks to be the case. Certainly, encoding a string to "utf-16" or "utf-32" does produce extra bytes beyond the characters themselves...
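This can be checked directly in Python: the extra bytes are in fact a leading byte-order mark (BOM), not a trailing terminator, and only the endianness-unspecified codecs add it. The explicit-endian variants encode the characters alone:

```python
# "utf-16" / "utf-32" prepend a 2- or 4-byte BOM when the byte order
# is unspecified; the "-le" / "-be" variants do not.
for text in ("a", "abc"):
    print(len(text.encode("utf-16")), len(text.encode("utf-16-le")))
# 4 2
# 8 6

print(len("a".encode("utf-32")), len("a".encode("utf-32-le")))
# 8 4
```

So the "extra 2 bytes" accounted for in the width calculation come from the BOM at the start of the encoded data, which may affect whether an explicit-endian encoding needs the adjustment at all.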



Development

Successfully merging this pull request may close these issues.

Fix iris handling of netcdf character array variables
