Dataset Cache Keys #158
Conversation
I'll take a look at the pipeline failures now.
mx-moth left a comment:
Nice work! This is coming along well.
I've spotted a number of ways a malicious user could make identical cache keys from different datasets, which could be misused to corrupt any cached data. I've suggested some ways to patch these holes.
These commits include commit da3ebab from #151 which has already been merged. Could you please rebase your work on the latest main and drop this commit?
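The class of collision being described above can be illustrated with a small sketch. This is not emsarray's actual API: `naive_key` and `prefixed_key` are hypothetical names, and `blake2b` is just one possible digest; the point is only that concatenating fields without length prefixes lets different inputs produce identical keys.

```python
import hashlib

def naive_key(*fields: str) -> str:
    # Naive approach: concatenate the fields directly into the digest.
    # Neighbouring fields can "overlap" because the boundary between
    # them is not recorded anywhere.
    h = hashlib.blake2b()
    for field in fields:
        h.update(field.encode("utf-8"))
    return h.hexdigest()

def prefixed_key(*fields: str) -> str:
    # Safer approach: prefix each field with its byte length, so two
    # different splits of the same bytes hash differently.
    h = hashlib.blake2b()
    for field in fields:
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(4, "little"))
        h.update(data)
    return h.hexdigest()

# Different field pairs collide under the naive scheme...
assert naive_key("temp", "erature") == naive_key("temper", "ature")
# ...but not once lengths are prefixed.
assert prefixed_key("temp", "erature") != prefixed_key("temper", "ature")
```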
Force-pushed from 0e1ac37 to 11042af.
mx-moth left a comment:
Thanks for making those changes. A couple of new issues, and the module needs adding to the API documentation.
src/emsarray/conventions/_base.py (outdated)

```python
# Include the variable name in the digest.
# Prepend the length of strings to prevent unnoticed overlaps with neighbouring data
hash_int(hash, len(str(geometry_name)))
```
`hash_string()` takes care of adding the string length to the hash now, so this call plus the explanatory comment above can be dropped. The same applies to adding the dtype length to the hash below.
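For illustration, a `hash_string()` helper that folds the length prefix in itself might look like the sketch below. This is an assumption about its shape, not emsarray's actual implementation, which may differ in encoding or integer width.

```python
import hashlib
import numpy

def hash_string(hash, value: str) -> None:
    # Hypothetical sketch: encode the string, then feed its byte length
    # followed by the bytes themselves into the digest. Because the
    # length prefix is handled here, callers no longer need a separate
    # hash_int(hash, len(...)) call before each string.
    encoded = value.encode("utf-8")
    hash.update(numpy.int32(len(encoded)).tobytes())
    hash.update(encoded)

h = hashlib.blake2b()
hash_string(h, "geometry_name")
```

With the length prefix built in, two different sequences of strings can never feed identical bytes into the digest.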
src/emsarray/operations/cache.py (outdated)

```python
hash.update(numpy.int32(value.bit_length()).tobytes())
hash.update(numpy.int32(value).tobytes())
```
An int32 value is always four bytes, so prepending the bit length is superfluous. We only need to prepend the size of variable-length things like strings and arrays.
This does potentially leave us vulnerable to overflows if someone hashes something longer than 2^32 bits, but that can be handled another day...
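The distinction can be sketched as follows. The snippet is illustrative only (`blake2b` stands in for whatever digest the module actually uses): a fixed-width value is unambiguous on its own, while variable-length data still needs a size prefix.

```python
import hashlib
import numpy

h1 = hashlib.blake2b()
h2 = hashlib.blake2b()

# A numpy.int32 always serialises to exactly four bytes, so the digest
# input is unambiguous without any length prefix.
h1.update(numpy.int32(1234).tobytes())

# Variable-length data such as arrays still needs its size prepended;
# otherwise hashing [1, 2] then [3] and hashing [1] then [2, 3] would
# feed identical bytes into the digest.
arr = numpy.array([1, 2, 3], dtype=numpy.int32)
h2.update(numpy.int32(arr.size).tobytes())
h2.update(arr.tobytes())
```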
mx-moth left a comment:
Thanks for sorting out the documentation. Last step is adding a release note to docs/releases/development.rst. Please reference both this pull request and issue #153.
tests/conventions/test_ugrid.py (outdated)

```python
if set_coordinates_as_coordinates:
    dataset.coords.update({value.name: value for value in [edge_node, u1]})
```
`u1` is a data variable, not a coordinate. I don't think `edge_node` is a coordinate variable either, although I am less certain about that one: it forms part of the dataset geometry, but it does not itself contain any coordinate information.
…ation. Updated make_cache_key to accept a hash instance and to hash the convention name, module path and emsarray version.
…so added string coercion on Hashables to resolve mypy warnings.
… dedicated hashing functions.
… when the version of emsarray changes.
Force-pushed from 381ed66 to ef85383.
…sions don't seem to throw the exception.
mx-moth left a comment:
All the code is looking great! Don't forget a release note; otherwise this is ready for merge.
* origin/main: Update conda-incubator/setup-miniconda action to v3
Added cache key operations that provide a hash key suitable for caching data derived from the geometry of a dataset.