Dataset Cache Keys #158
Conversation
I'll take a look at the pipeline failures now.
mx-moth left a comment:
Nice work! This is coming along well.
I've spotted a number of ways a malicious user could make identical cache keys from different datasets, which could be misused to corrupt any cached data. I've suggested some ways to patch these holes.
These commits include commit da3ebab from #151 which has already been merged. Could you please rebase your work on the latest main and drop this commit?
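The class of collision being described above can be illustrated with a small sketch. This is not emsarray's actual API: `naive_key` and `prefixed_key` are hypothetical names, and `blake2b` is just one possible digest; the point is only that concatenating fields without length prefixes lets different inputs produce identical keys.

```python
import hashlib

def naive_key(*fields: str) -> str:
    # Naive approach: concatenate the fields directly into the digest.
    # Neighbouring fields can "overlap" because the boundary between
    # them is not recorded anywhere.
    h = hashlib.blake2b()
    for field in fields:
        h.update(field.encode("utf-8"))
    return h.hexdigest()

def prefixed_key(*fields: str) -> str:
    # Safer approach: prefix each field with its byte length, so two
    # different splits of the same bytes hash differently.
    h = hashlib.blake2b()
    for field in fields:
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(4, "little"))
        h.update(data)
    return h.hexdigest()

# Different field pairs collide under the naive scheme...
assert naive_key("temp", "erature") == naive_key("temper", "ature")
# ...but not once lengths are prefixed.
assert prefixed_key("temp", "erature") != prefixed_key("temper", "ature")
```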
Force-pushed from 0e1ac37 to 11042af.
mx-moth left a comment:
Thanks for making those changes. A couple of new issues, and the module needs adding to the API documentation.
src/emsarray/conventions/_base.py (outdated)

```python
# Include the variable name in the digest.
# Prepend the length of strings to prevent unnoticed overlaps with neighbouring data
hash_int(hash, len(str(geometry_name)))
```
`hash_string()` takes care of adding the string length to the hash now, so this call plus the explanatory comment above can be dropped. The same applies to adding the dtype length to the hash below.
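For illustration, a `hash_string()` helper that folds the length prefix in itself might look like the sketch below. This is an assumption about its shape, not emsarray's actual implementation, which may differ in encoding or integer width.

```python
import hashlib
import numpy

def hash_string(hash, value: str) -> None:
    # Hypothetical sketch: encode the string, then feed its byte length
    # followed by the bytes themselves into the digest. Because the
    # length prefix is handled here, callers no longer need a separate
    # hash_int(hash, len(...)) call before each string.
    encoded = value.encode("utf-8")
    hash.update(numpy.int32(len(encoded)).tobytes())
    hash.update(encoded)

h = hashlib.blake2b()
hash_string(h, "geometry_name")
```

With the length prefix built in, two different sequences of strings can never feed identical bytes into the digest.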
src/emsarray/operations/cache.py (outdated)

```python
hash.update(numpy.int32(value.bit_length()).tobytes())
hash.update(numpy.int32(value).tobytes())
```
An int32 value is always four bytes, so prepending the bit length is superfluous. We only need to prepend the size of variable-length things like strings and arrays.
This does potentially leave us vulnerable to overflows if someone hashes something longer than 2^32 bits, but that can be handled another day...
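The distinction can be sketched as follows. The snippet is illustrative only (`blake2b` stands in for whatever digest the module actually uses): a fixed-width value is unambiguous on its own, while variable-length data still needs a size prefix.

```python
import hashlib
import numpy

h1 = hashlib.blake2b()
h2 = hashlib.blake2b()

# A numpy.int32 always serialises to exactly four bytes, so the digest
# input is unambiguous without any length prefix.
h1.update(numpy.int32(1234).tobytes())

# Variable-length data such as arrays still needs its size prepended;
# otherwise hashing [1, 2] then [3] and hashing [1] then [2, 3] would
# feed identical bytes into the digest.
arr = numpy.array([1, 2, 3], dtype=numpy.int32)
h2.update(numpy.int32(arr.size).tobytes())
h2.update(arr.tobytes())
```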
mx-moth left a comment:
Thanks for sorting out the documentation. Last step is adding a release note to docs/releases/development.rst. Please reference both this pull request and issue #153.
tests/conventions/test_ugrid.py (outdated)

```python
if set_coordinates_as_coordinates:
    dataset.coords.update({value.name: value for value in [edge_node, u1]})
```
`u1` is a data variable, not a coordinate. I don't think `edge_node` is a coordinate variable either, although I am less certain about that one: it forms part of the dataset geometry, but it does not itself contain any coordinate information.
…ation. Updated make_cache_key to accept a hash instance and to hash the convention name, module path and emsarray version.
…so added string coercion on Hashables to resolve mypy warnings.
… dedicated hashing functions.
… when the version of emsarray changes.
Force-pushed from 381ed66 to ef85383.
…sions don't seem to throw the exception.
mx-moth left a comment:
All the code is looking great! Don't forget a release note; otherwise this is ready for merge.
* origin/main: Update conda-incubator/setup-miniconda action to v3
Added cache key operations that provide a hash key suitable for caching data derived from the geometry of a dataset.