OpenClio 🔍

An open-source implementation of Anthropic's Clio system for privacy-preserving analysis of AI interactions.

Overview

OpenClio helps you analyze large collections of AI conversations while preserving user privacy. It:

🔍 Extracts key facets from conversations (tasks, requests, etc.)
🧮 Generates semantic embeddings
🎯 Clusters similar conversations
🏷️ Labels clusters with descriptive names
📊 Builds a hierarchy of clusters
🔒 Applies privacy protections
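To make the embed, cluster, label, and filter steps above concrete, here is a toy sketch that uses only the standard library. Everything in it is a stand-in and none of it is OpenClio's code: `embed` uses bag-of-words counts instead of a semantic embedding model, `cluster` is a greedy threshold scheme, `label` picks the most frequent word rather than asking an LLM, and `apply_min_size` with its `min_size` value is an assumed example of a privacy safeguard. Facet extraction and the cluster hierarchy are left out.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a semantic embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed vector is similar enough."""
    groups = []  # list of (seed_vector, member_texts)
    for text in texts:
        vec = embed(text)
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            groups.append((vec, [text]))
    return [members for _, members in groups]


def label(members):
    """Toy label: the most frequent word, ties broken alphabetically."""
    counts = Counter(word for m in members for word in m.lower().split())
    return min(counts, key=lambda w: (-counts[w], w))


def apply_min_size(clusters, min_size=2):
    """Drop clusters too small to report in aggregate (the threshold is an assumption)."""
    return [c for c in clusters if len(c) >= min_size]


# Made-up sample inputs for illustration only.
texts = [
    "fix python bug",
    "fix python error",
    "write poem",
    "write short poem",
    "translate this sentence",
]
clusters = apply_min_size(cluster(texts))
for members in clusters:
    print(label(members), len(members))
```

The single "translate" conversation forms a cluster of size one and is dropped by the size filter, which illustrates why aggregate-only reporting hides rare, potentially identifying conversations.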

Quick Start

Backend Processing

import asyncio

from openclio import ClioSystem, Conversation


async def main():
    # Initialize Clio
    clio = ClioSystem()

    # Process your conversations
    conversations = [
        Conversation(...),  # Your conversation data
    ]

    # Get clusters (process_conversations is awaited, so it must run inside a coroutine)
    clusters = await clio.process_conversations(conversations)

    # Print results
    for cluster in clusters:
        print(f"\nCluster: {cluster.name}")
        print(f"Description: {cluster.description}")
        print(f"Size: {len(cluster.conversations)}")


asyncio.run(main())

Example

For an end-to-end example, see example.py in ./openclio. It covers:

Loading conversation data
Processing conversations into hierarchical clusters
Visualizing the results
Saving the analysis to JSON
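As an illustration of the last step, saving a cluster hierarchy to JSON, here is a minimal sketch. The `ClusterSummary` record, its fields, the top-level `"clusters"` key, and the helper names are all hypothetical; the actual schema that example.py writes and the cluster viewer reads may differ, so check example.py before relying on this layout.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ClusterSummary:
    """Hypothetical record for one cluster; not OpenClio's actual schema."""
    name: str
    description: str
    size: int
    children: List["ClusterSummary"] = field(default_factory=list)


def analysis_to_json(clusters):
    """Serialize a list of top-level clusters (with nested children) to a JSON string."""
    payload = {"clusters": [asdict(c) for c in clusters]}
    return json.dumps(payload, indent=2, ensure_ascii=False)


def save_analysis(clusters, path):
    """Write the serialized hierarchy to disk as UTF-8."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(analysis_to_json(clusters))


# Made-up sample hierarchy for illustration only.
root = ClusterSummary(
    name="Programming help",
    description="Debugging and writing code",
    size=3,
    children=[ClusterSummary("Python debugging", "Fixing Python errors", 3)],
)
print(analysis_to_json([root]))
```

Calling `save_analysis([root], "analysis_results.json")` would write the same text to a file. `asdict` recurses into the nested `children` list, so the hierarchy is preserved without a custom encoder.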

Viewing Results

To explore the clusters visually:

1. Copy your analysis_results.json to cluster-viewer/public/
2. Navigate to the cluster viewer directory:

cd cluster-viewer

3. Install dependencies and start the development server:

npm install
npm run dev

4. Open your browser to the URL shown in the terminal (typically http://localhost:5173)

The cluster viewer provides both a hierarchical tree view and an interactive map view of your clusters.

Implementation Notes

This implementation differs from the original paper in a few ways:

For abuse-prevention reasons, the paper does not state how many clusters are to be used. OpenClio uses len(conversations)**0.5 by default.
I've found better performance using OpenAI's text-embedding-3-large than all-mpnet-base-v2, so it is the default. To use all-mpnet-base-v2 instead, change the default in llm.py.
The Privacy Auditor is not yet fully implemented: the prompt is in prompts.py, but it has not been integrated into the standard pipeline.
The paper did not specify every facet_criteria, so detailed facet_criteria are implemented only for the task facet.
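The square-root default for the number of clusters can be sketched as follows. The function name `default_num_clusters`, the rounding to a whole number, and the floor of one cluster are assumptions made for illustration; the note above only gives the expression len(conversations)**0.5.

```python
def default_num_clusters(num_conversations):
    """Square-root heuristic for the cluster count.

    Rounding and the minimum of 1 are assumptions; the source only
    states len(conversations) ** 0.5.
    """
    if num_conversations <= 0:
        return 0
    return max(1, round(num_conversations ** 0.5))


for n in (10, 100, 10_000):
    print(n, default_num_clusters(n))
```

With this heuristic the cluster count grows much more slowly than the dataset: 100 conversations give 10 clusters and 10,000 give 100, so the average cluster size grows at the same rate as the number of clusters.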

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This project is based on the research paper "Clio: Privacy-Preserving Insights into Real-World AI Use" by Tamkin et al.
