Datalake [1b/3?] MongoDB querying #268

Merged: vik-rant merged 8 commits into dev from feature/datalake/mongodb-query on Nov 28, 2025

@christopherfish
Contributor

This PR renames the old Python-based (and presumably slow) query_data function to query_data_legacy and replaces it with a MongoDB-native implementation built on aggregation pipelines.

The inputs and outputs of query_data should be unchanged.
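
For readers unfamiliar with aggregation pipelines, here is a minimal sketch of the idea, not the code from this PR. A pipeline is an ordered list of stage documents that the server evaluates in a single request. The helper name `build_pipeline` and its `match_filter` parameter are hypothetical; only the `$match` and `$limit` stages and the `datums_wanted` parameter name are taken from MongoDB and from the diff in this thread.

```python
from typing import Any, Optional


def build_pipeline(
    match_filter: dict[str, Any], datums_wanted: Optional[int] = None
) -> list[dict[str, Any]]:
    """Hypothetical helper: assemble a two-stage aggregation pipeline."""
    # $match filters documents on the server, so nothing is filtered in Python.
    pipeline: list[dict[str, Any]] = [{"$match": match_filter}]
    # $limit caps the number of returned documents; skipped when no cap is given.
    if datums_wanted is not None:
        pipeline.append({"$limit": datums_wanted})
    return pipeline


if __name__ == "__main__":
    # The field name "project" is made up for illustration.
    print(build_pipeline({"project": "demo"}, datums_wanted=5))
```

With a driver such as pymongo, a list like this would typically be passed to `Collection.aggregate(pipeline)`. How the real query_data translates its query syntax into stages is not shown in this thread.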

@vik-rant added the mindtrace-datalake label (Issues raised from datalake module in mindtrace package) on Nov 12, 2025
Comment on lines 332 to 352
    async def query_data(
        self, query: list[dict[str, Any]] | dict[str, Any], datums_wanted: int | None = None, transpose: bool = False
    ) -> list[dict[str, Any]] | dict[str, list]:
        """
        Optimized version of query_data using MongoDB aggregation pipelines.

        This method provides significant performance improvements for common query patterns
        by using MongoDB's native aggregation capabilities instead of multiple round trips.

        Args:
            query: Same syntax as query_data - list of queries or single query
            datums_wanted: Maximum number of results to return
            transpose: Whether to return dict of lists (True) or list of dicts (False)

        Returns:
            Same format as query_data - list of dictionaries or dictionary of lists

        Note:
            This optimized version handles common cases but may fall back to the original
            query_data method for complex scenarios not yet supported.
        """
Contributor

Let's please update the docstring. It refers to query_data as the original method, when this method is itself query_data.
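
One possible rewording is sketched below. It assumes that the fallback mentioned in the docstring's note targets the renamed query_data_legacy. The thread does not confirm that such a fallback exists. The class name `DatalakeSketch` is a hypothetical stand-in, and the body is deliberately omitted.

```python
from typing import Any, Optional, Union


class DatalakeSketch:  # hypothetical stand-in for the real class
    async def query_data(
        self,
        query: Union[list[dict[str, Any]], dict[str, Any]],
        datums_wanted: Optional[int] = None,
        transpose: bool = False,
    ) -> Union[list[dict[str, Any]], dict[str, list]]:
        """
        Query the datalake using MongoDB aggregation pipelines.

        Args:
            query: A single query dict or a list of query dicts.
            datums_wanted: Maximum number of results to return.
            transpose: Return a dict of lists (True) or a list of dicts (False).

        Returns:
            A list of dictionaries, or a dictionary of lists if transpose is True.

        Note:
            Assumption: unsupported queries, if any, go through query_data_legacy.
        """
        raise NotImplementedError("body omitted; only the docstring is illustrated")
```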

        else:
            pipeline.append({"$limit": datums_wanted})

        self.logger.info(f"pipeline: {pipeline}")
Contributor

Maybe this should log at debug level? Logging the full pipeline at info might blow up log volume.
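
A minimal sketch of that suggestion using the standard library logging module. The logger name and the `log_pipeline` helper are hypothetical; the real code logs through `self.logger`.

```python
import logging

logger = logging.getLogger("datalake_example")  # hypothetical logger name


def log_pipeline(pipeline: list) -> None:
    """Log the pipeline at DEBUG so it stays out of INFO-level logs."""
    # %-style arguments defer string formatting until a handler needs the
    # message, so a large pipeline is not rendered when DEBUG is disabled.
    logger.debug("pipeline: %s", pipeline)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_pipeline([{"$limit": 5}])  # not emitted at INFO
    logger.setLevel(logging.DEBUG)
    log_pipeline([{"$limit": 5}])  # emitted once DEBUG is enabled
```

Compared with the f-string in the diff, the deferred formatting also avoids building the message on every call.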

@vik-rant merged commit 512c48a into dev on Nov 28, 2025
4 checks passed
@vik-rant deleted the feature/datalake/mongodb-query branch on November 28, 2025 12:55
Labels

mindtrace-datalake Issues raised from datalake module in mindtrace package

2 participants