[pixels-cli, common] Implement Bucket-Based Data Classification for Distributed Loading #1218

@AntiO2

Description

  1. Refactor PixelsConsumer:

Introduce an abstract base class (AbstractPixelsConsumer) to handle common initialization and cleanup.

Create a concrete subclass (IndexedPixelsConsumer) dedicated to handling loads where a Primary Index exists.

Create a simple subclass (SimplePixelsConsumer) for loads without an Index (maintaining existing sequential logic).
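The proposed hierarchy could look roughly like the sketch below. Only the three class names come from this issue; the `run`, `initialize`, `cleanup`, `consumeRow`, and `events` members, the use of `String` rows, and the `bucketNum` parameter are hypothetical and exist only to show how shared setup and teardown can live in the base class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: everything except the three class names is an assumption.
abstract class AbstractPixelsConsumer {
    // Records lifecycle events so the call order is observable in this sketch.
    protected final List<String> events = new ArrayList<>();

    // Template method: shared initialization, subclass-specific loading, shared cleanup.
    public final void run(List<String> rows) {
        initialize();
        try {
            for (String row : rows) {
                consumeRow(row);
            }
        } finally {
            cleanup(); // runs even if a row fails
        }
    }

    protected void initialize() { events.add("init"); }

    protected void cleanup() { events.add("cleanup"); }

    protected abstract void consumeRow(String row);

    public List<String> events() { return events; }
}

// No index: keeps the existing sequential behaviour (modelled here as a plain append).
class SimplePixelsConsumer extends AbstractPixelsConsumer {
    @Override
    protected void consumeRow(String row) {
        events.add("append:" + row);
    }
}

// Primary index present: rows are classified by bucket before being written.
class IndexedPixelsConsumer extends AbstractPixelsConsumer {
    private final int bucketNum;

    IndexedPixelsConsumer(int bucketNum) { this.bucketNum = bucketNum; }

    @Override
    protected void consumeRow(String row) {
        // Assumed hash scheme; the row itself stands in for its primary key.
        int bucketId = Math.floorMod(row.hashCode(), bucketNum);
        events.add("bucket" + bucketId + ":" + row);
    }
}
```

Making `run` final keeps the initialization and cleanup order identical for both subclasses, so each subclass only decides what happens to a single row.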

  2. Bucket-Based Routing Logic:

In IndexedPixelsConsumer, maintain a map to track active writers: Map<Integer, PerBucketWriter>.

For every incoming data row:

  • Calculate the data's bucketId based on its Primary Key hash.

  • Use the bucketId to look up the corresponding PerBucketWriter state object.

  • If no writer exists for the bucketId, lazily initialize a new PixelsWriter and temporary file for that bucket and register them in the map.
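The routing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: a `BufferedWriter` stands in for the real PixelsWriter (whose API is not shown in this issue), `BucketRouter`, `bucketIdOf`, `route`, and `rowCount` are hypothetical names, and the `floorMod` of `hashCode()` is an assumed hash scheme rather than the one Pixels uses.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-bucket state; BufferedWriter is a stand-in for PixelsWriter.
class PerBucketWriter {
    final Path tempFile;
    final BufferedWriter writer;
    int rowCount = 0;

    PerBucketWriter(int bucketId) {
        try {
            this.tempFile = Files.createTempFile("bucket-" + bucketId + "-", ".tmp");
            this.writer = Files.newBufferedWriter(tempFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void write(String row) {
        try {
            writer.write(row);
            writer.newLine();
            rowCount++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void closeAndDelete() {
        try {
            writer.close();
            Files.deleteIfExists(tempFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class BucketRouter {
    private final int bucketNum;
    private final Map<Integer, PerBucketWriter> writers = new HashMap<>();

    BucketRouter(int bucketNum) { this.bucketNum = bucketNum; }

    // floorMod keeps the id in [0, bucketNum) even when hashCode() is negative.
    int bucketIdOf(String primaryKey) {
        return Math.floorMod(primaryKey.hashCode(), bucketNum);
    }

    void route(String primaryKey, String row) {
        int bucketId = bucketIdOf(primaryKey);
        PerBucketWriter w = writers.get(bucketId);
        if (w == null) {
            // First row for this bucket: create the writer and its temp file on demand.
            w = new PerBucketWriter(bucketId);
            writers.put(bucketId, w);
        }
        w.write(row);
    }

    Map<Integer, PerBucketWriter> writers() { return writers; }

    void close() {
        for (PerBucketWriter w : writers.values()) {
            w.closeAndDelete();
        }
        writers.clear();
    }
}
```

The map is a plain `HashMap` here on the assumption that each consumer owns its writers and runs on a single thread; if writers were shared across threads, a concurrent map and per-writer synchronization would be needed.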

  3. Core Dependency: Node Mapping Cache:

Implement BucketToNodeCache (Small Component): Create a thread-safe, singleton, lazy-loaded cache component that maps a bucketId to its responsible RetinaNodeInfo. The cache avoids repeatedly querying the NodeService for node assignments during high-throughput loading.
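One possible shape for such a cache is sketched below. The `RetinaNodeInfo` class here is a simplified stand-in with a single `address` field, and the `IntFunction` lookup represents whatever NodeService call resolves a bucket to a node; `getInstance` and `getNode` are hypothetical method names, not an existing Pixels API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.IntFunction;

// Simplified stand-in for the real RetinaNodeInfo.
final class RetinaNodeInfo {
    final String address;

    RetinaNodeInfo(String address) { this.address = address; }
}

final class BucketToNodeCache {
    private static volatile BucketToNodeCache instance;

    private final ConcurrentMap<Integer, RetinaNodeInfo> cache = new ConcurrentHashMap<>();
    // Hypothetical hook for the NodeService query that resolves a bucket to its node.
    private final IntFunction<RetinaNodeInfo> nodeLookup;

    private BucketToNodeCache(IntFunction<RetinaNodeInfo> nodeLookup) {
        this.nodeLookup = nodeLookup;
    }

    // Lazy singleton via double-checked locking; the first caller's lookup is kept,
    // and later calls return the same instance regardless of their argument.
    static BucketToNodeCache getInstance(IntFunction<RetinaNodeInfo> nodeLookup) {
        BucketToNodeCache local = instance;
        if (local == null) {
            synchronized (BucketToNodeCache.class) {
                local = instance;
                if (local == null) {
                    local = new BucketToNodeCache(nodeLookup);
                    instance = local;
                }
            }
        }
        return local;
    }

    // The lookup runs at most once per bucketId; later calls hit the map.
    RetinaNodeInfo getNode(int bucketId) {
        return cache.computeIfAbsent(bucketId, id -> nodeLookup.apply(id));
    }
}
```

This sketch never invalidates entries, so it assumes bucket-to-node assignments stay fixed for the duration of a load; if nodes can be reassigned, some expiry or explicit refresh would be required.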

  4. Distributed Indexing:

Ensure that index entries generated by IndexedPixelsConsumer are routed to the correct IndexService instance, potentially identified by the RetinaNodeInfo obtained from the cache.
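As a rough illustration of this routing, the sketch below keeps one client per node address and forwards each index entry to the client that owns the entry's bucket. `IndexClientStub` is a hypothetical stand-in that only records what it receives; it is not the real IndexService client. `IndexEntryRouter`, `putEntry`, the `rowId` parameter, and the hash scheme are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical stand-in for one IndexService endpoint; it only records entries.
class IndexClientStub {
    final String nodeAddress;
    final List<String> receivedEntries = new ArrayList<>();

    IndexClientStub(String nodeAddress) { this.nodeAddress = nodeAddress; }

    void putEntry(String primaryKey, long rowId) {
        receivedEntries.add(primaryKey + "->" + rowId);
    }
}

class IndexEntryRouter {
    private final int bucketNum;
    // Assumed to be backed by the bucket-to-node cache in a real implementation.
    private final IntFunction<String> bucketToNodeAddress;
    private final Map<String, IndexClientStub> clients = new HashMap<>();

    IndexEntryRouter(int bucketNum, IntFunction<String> bucketToNodeAddress) {
        this.bucketNum = bucketNum;
        this.bucketToNodeAddress = bucketToNodeAddress;
    }

    void route(String primaryKey, long rowId) {
        // Same assumed hash scheme as the data routing, so data and index agree on the bucket.
        int bucketId = Math.floorMod(primaryKey.hashCode(), bucketNum);
        String address = bucketToNodeAddress.apply(bucketId);
        // Reuse one client per node rather than creating one per entry.
        clients.computeIfAbsent(address, IndexClientStub::new).putEntry(primaryKey, rowId);
    }

    Map<String, IndexClientStub> clients() { return clients; }
}
```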

Labels: enhancement
