
Conversation

@david-sh-csiro (Collaborator)

Added cache key operations that provide a hash key suitable for caching data derived from the geometry of a dataset.

@david-sh-csiro david-sh-csiro self-assigned this Sep 12, 2024
@david-sh-csiro (Collaborator, Author)

I'll take a look at the pipeline failures now.

@mx-moth (Contributor) left a comment

Nice work! This is coming along well.

I've spotted a number of ways a malicious user could craft identical cache keys from different datasets, which could be exploited to corrupt any cached data. I've suggested some ways to patch these holes.

These commits include commit da3ebab from #151, which has already been merged. Could you please rebase your work on the latest main and drop this commit?

@mx-moth (Contributor) left a comment

Thanks for making those changes. A couple of new issues remain, and the module needs to be added to the API documentation.


```python
# Include the variable name in the digest.
# Prepend the length of strings to prevent unnoticed overlaps with neighbouring data.
hash_int(hash, len(str(geometry_name)))
```

hash_string() now takes care of adding the string length to the hash, so this call and the explanatory comment above can be dropped. The same applies to adding the dtype length to the hash below.
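A length-prefixed string hash along these lines illustrates why the extra call is redundant. This is a hypothetical sketch of the `hash_string()` helper under discussion, not the actual implementation; it assumes hashlib-style hash objects and numpy:

```python
import hashlib

import numpy


def hash_string(hash, value: str) -> None:
    """Hash a string, prefixing its byte length so that neighbouring
    fields cannot overlap unnoticed. (Hypothetical sketch.)"""
    encoded = value.encode("utf-8")
    # Without this prefix, hashing "ab" then "c" would feed the same
    # bytes to the digest as hashing "a" then "bc".
    hash.update(numpy.int32(len(encoded)).tobytes())
    hash.update(encoded)


digest_one = hashlib.blake2b()
hash_string(digest_one, "ab")
hash_string(digest_one, "c")

digest_two = hashlib.blake2b()
hash_string(digest_two, "a")
hash_string(digest_two, "bc")

# The length prefixes keep the two digests distinct.
assert digest_one.hexdigest() != digest_two.hexdigest()
```

Since the length prefix is baked into `hash_string()`, callers no longer need to hash the length themselves.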

Comment on lines 67 to 68:

```python
hash.update(numpy.int32(value.bit_length()).tobytes())
hash.update(numpy.int32(value).tobytes())
```

An int32 value will always be four bytes, so prepending the bit length is superfluous. We only need to prepend the size of variable length things like strings and arrays.

This does potentially leave us vulnerable to overflows if someone hashes something longer than a signed int32 can represent (2^31 - 1 bits), but that can be handled another day...
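The simplified version looks something like this. It is a hypothetical sketch of the `hash_int()` helper, assuming hashlib-style hash objects and numpy:

```python
import hashlib

import numpy


def hash_int(hash, value: int) -> None:
    # A numpy.int32 is always exactly four bytes, so no length prefix is
    # needed: the fixed width already delimits the field in the byte stream.
    # Values wider than a signed int32 would overflow; that case is left
    # unhandled here, as noted above.
    hash.update(numpy.int32(value).tobytes())


digest = hashlib.blake2b()
hash_int(digest, 287)
hash_int(digest, 288)

# Exactly eight bytes were hashed: two fixed-width int32 fields.
expected = hashlib.blake2b(
    numpy.int32(287).tobytes() + numpy.int32(288).tobytes()).hexdigest()
assert digest.hexdigest() == expected
```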

@mx-moth (Contributor) left a comment

Thanks for sorting out the documentation. Last step is adding a release note to docs/releases/development.rst. Please reference both this pull request and issue #153.

Comment on lines 287 to 288:

```python
if set_coordinates_as_coordinates:
    dataset.coords.update({value.name: value for value in [edge_node, u1]})
```

`u1` is a data variable, not a coordinate. I don't think `edge_node` is a coordinate variable either, although I am less certain about that one. It forms part of the dataset geometry, but it does not itself contain any coordinate information.
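A minimal xarray sketch of the distinction. The variable contents here are hypothetical; only the names `u1` and `edge_node` come from the diff above:

```python
import numpy
import xarray

# Hypothetical miniature unstructured-mesh dataset: u1 holds data values
# defined on edges, edge_node describes the mesh topology.
u1 = xarray.DataArray(numpy.arange(3.0), dims=["edge"], name="u1")
edge_node = xarray.DataArray(
    numpy.array([[0, 1], [1, 2], [2, 0]]),
    dims=["edge", "two"], name="edge_node")

# Both variables belong in data_vars, not coords: neither contains
# coordinate values, even though edge_node forms part of the geometry.
dataset = xarray.Dataset(
    data_vars={variable.name: variable for variable in [edge_node, u1]})

assert set(dataset.data_vars) == {"u1", "edge_node"}
assert len(dataset.coords) == 0
```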

@mx-moth (Contributor) left a comment

All the code is looking great! Don't forget the release note; otherwise this is ready to merge.
