An open source implementation of Anthropic's Clio system for privacy-preserving analysis of AI interactions.
OpenClio helps you analyze large collections of AI conversations while preserving user privacy. It:
- 🔍 Extracts key facets from conversations (tasks, requests, etc.)
- 🧮 Generates semantic embeddings
- 🎯 Clusters similar conversations
- 🏷️ Labels clusters with descriptive names
- 📊 Builds a hierarchy of clusters
- 🔒 Applies privacy protections
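One protection described in the Clio paper is an aggregation threshold: clusters that contain too few conversations are dropped so that rare, potentially identifying topics are not surfaced. This README does not say which protections OpenClio applies or how, so the snippet below is only a minimal sketch of the idea. The function name `apply_min_cluster_size`, the `min_size` parameter, and the sample data are all hypothetical and not part of the OpenClio API.

```python
def apply_min_cluster_size(cluster_sizes: dict[str, int], min_size: int) -> dict[str, int]:
    """Keep only clusters holding at least `min_size` conversations.

    Hypothetical helper for illustration; not an OpenClio function.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    return {name: size for name, size in cluster_sizes.items() if size >= min_size}


# Made-up cluster names and sizes, purely for demonstration.
sizes = {"Debugging Python code": 120, "Drafting emails": 45, "Rare niche request": 2}
print(apply_min_cluster_size(sizes, min_size=10))
```

With a threshold of 10, the two-conversation cluster is removed and the other two are kept.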
Basic usage:

    import asyncio

    from openclio import ClioSystem, Conversation


    async def main():
        clio = ClioSystem()
        conversations = [
            Conversation(...),  # Your conversation data
        ]
        # process_conversations is awaited, so it has to run inside an event loop
        clusters = await clio.process_conversations(conversations)
        for cluster in clusters:
            print(f"\nCluster: {cluster.name}")
            print(f"Description: {cluster.description}")
            print(f"Size: {len(cluster.conversations)}")


    asyncio.run(main())

See example.py in ./openclio for an end-to-end example that covers:
- Loading conversation data
- Processing conversations into hierarchical clusters
- Visualizing the results
- Saving the analysis to JSON
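The JSON-saving step could look roughly like the sketch below. It relies only on the three cluster attributes used in this README (`name`, `description`, `conversations`). `ExampleCluster`, `summarize_clusters`, and `save_summary` are hypothetical names, and the output layout is an assumption: it is not necessarily the schema that example.py writes or that the cluster viewer expects.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ExampleCluster:
    """Hypothetical stand-in exposing the attributes used in this README."""
    name: str
    description: str
    conversations: list = field(default_factory=list)


def summarize_clusters(clusters) -> list[dict]:
    """Reduce each cluster to aggregate fields only (no raw conversation text)."""
    return [
        {"name": c.name, "description": c.description, "size": len(c.conversations)}
        for c in clusters
    ]


def save_summary(clusters, path: str) -> None:
    """Write the aggregate summary to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summarize_clusters(clusters), f, indent=2, ensure_ascii=False)
```

Writing only names, descriptions, and sizes keeps raw conversation content out of the saved file, which seems in keeping with the project's privacy goals.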
To explore the clusters visually:
- Copy your analysis_results.json to cluster-viewer/public/
- Navigate to the cluster viewer directory:

      cd cluster-viewer

- Install dependencies and start the development server:

      npm install
      npm run dev

- Open your browser to the URL shown in the terminal (typically http://localhost:5173)
The cluster viewer provides both a hierarchical tree view and an interactive map view of your clusters.
This implementation differs from the original paper in a few ways:
- The paper does not state, for abuse prevention reasons, how many clusters should be used. OpenClio uses len(conversations)**0.5 by default.
- I've found better performance using OpenAI's text-embedding-3-large than all-mpnet-base-v2, so it is the default. To use all-mpnet-base-v2 instead, change the default in llm.py.
- The Privacy Auditor is not yet fully implemented: the prompt is in prompts.py, but it has not been integrated into the standard pipeline.
- The paper did not specify all facet_criteria, so only the detailed facet_criteria for task is implemented.
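To give a feel for the square-root default mentioned above, here is a small sketch of how the cluster count grows with the dataset. The helper name `default_num_clusters` is hypothetical, and rounding to the nearest integer with a floor of 1 is an assumption; OpenClio may convert `len(conversations)**0.5` to an integer differently.

```python
import math


def default_num_clusters(num_conversations: int) -> int:
    """Square-root heuristic: about sqrt(n) clusters for n conversations.

    Hypothetical helper; the rounding and the minimum of 1 are assumptions.
    """
    if num_conversations <= 0:
        raise ValueError("need at least one conversation")
    return max(1, round(math.sqrt(num_conversations)))


for n in (10, 100, 10_000):
    print(n, default_num_clusters(n))
```

Under these assumptions, 10,000 conversations yield 100 clusters, so the average cluster would hold about 100 conversations.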
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This project is based on the research paper "Clio: Privacy-Preserving Insights into Real-World AI Use" by Tamkin et al.
