-
Notifications
You must be signed in to change notification settings - Fork 788
Closed
Labels
Milestone
Description
Is your feature request related to a problem? Please describe
A dashboard will provide the necessary monitoring and performance metrics, enabling efficient management and optimization of our system.
Describe the solution you'd like
Below are the key features I envision for a dashboard:
- Resource Monitoring
- CPU: Real-time monitoring of CPU utilization across the distributed system nodes.
- Memory: Tracking and visualization of memory usage for each node.
- GPU: Monitoring GPU utilization, allowing us to identify bottlenecks or optimize resource allocation.
- VRAM: Real-time monitoring of VRAM utilization for GPU-based inference.
- Performance Monitoring:
- Model-Specific Metrics: For each deployed model, capture and display relevant metrics such as generate task queue length, number of tokens generated per second, and any other model-specific performance indicators.
- Throughput and Latency: Measure the overall throughput and latency of the system, enabling us to identify any performance issues and assess system efficiency.