🎭 Persona Vectors: Monitoring and Controlling Character Traits in Language Models

📌 Getting Started

This fork extends the Persona Vectors codebase with tools for influence-based data attribution to identify training examples that contribute to emergent misaligned behavior.

👉 Start here: influence/README.md — contains the full pipeline for computing data attribution rankings, filtering training data, and evaluating retrained models.

🛠️ Environment Setup

Tested: Kubernetes + Docker

This codebase is developed and tested using a Kubernetes devbox with the provided Docker image. (Note - may have some access trouble to pull from FAR secrets, so probably worth trying the other local methods below first).

Build or pull the Docker image:

# Build locally
docker build -t persona-vectors .

# Or pull the pre-built image
docker pull ghcr.io/alignmentresearch/persona:latest

Deploy on Kubernetes:
```
kubectl create -f k8s/devbox_gpu.yaml
```
The devbox requests 1 GPU (H100), 4 CPUs, and 100GB memory. Modify k8s/devbox_gpu.yaml as needed for your cluster.

Clone the repo inside the container and install:

git clone <repo-url> /workspace/persona_vectors
cd /workspace/persona_vectors
uv pip install --system -e '.[dev,gpu]'

Alternative: Local Docker (Untested)

Run the Docker container locally with GPU passthrough:

docker build -t persona-vectors .
docker run --gpus all -it -v $(pwd):/workspace/persona_vectors persona-vectors bash
cd /workspace/persona_vectors
uv pip install --system -e '.[dev,gpu]'

Alternative: Local Virtual Environment (Untested)

For local development without Docker:

# Requires Python 3.12+
python -m venv .venv
source .venv/bin/activate

# CPU-only (for development/testing)
pip install -e '.[dev]'

# With GPU support (requires CUDA)
pip install -e '.[dev,gpu]'

⚠️ Note: The GPU dependencies (unsloth, vllm, curvlinops) may require specific CUDA versions and system libraries. The Docker setup handles these dependencies automatically.

Note: The remainder of this README is from the original Persona Vectors repository and is preserved below for reference. It documents the base functionality for persona vector generation, steering, and evaluation.

This is the official repository for Persona Vectors, a method for monitoring and controlling character traits in language models.

🚀 Quick Start

⚙️ Setup

Create a project virtual environment:

python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -e '.[dev]'

Configure environment:

cp .env.example .env
# Fill in your API keys in the .env file

📦 Dataset Preparation

Extract the training datasets:

unzip dataset.zip

🏗️ Pipeline

Generate Trait Artifacts

We provide pre-generated trait artifacts in:

data_generation/trait_data_extract/ - Extraction set
data_generation/trait_data_eval/ - Evaluation set

Each trait file contains:

Positive and negative prompts
Questions for evaluation
Evaluation prompts

To generate new artifacts: Use prompts from data_generation/prompts.py. We used Claude-3.7-Sonnet (thinking mode, budget: 5000, max_tokens: 16000).

Baseline Evaluation

Evaluate models without any interventions:

CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model path/to/your/model \
    --trait evil \
    --output_path path/to/results.csv \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version eval

This script uses the questions and prompts from data_generation/trait_data_eval/evil.json and generates outputs for the model. The judge then evaluates the amount of trait contained in the output. Note - the output csv file, saved at --output_path, is then used during the 'Calculate Projection' step where the projection results are appended to the end of each datapoint.

Our evaluation uses openai-based judge functions, primarily adapted from the Emergent Misalignment codebase.

Generate Persona Vectors

Evaluate with System Prompts

Generate activations using positive and negative system prompts:

# Positive system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trait evil \
    --output_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_pos_instruct.csv \
    --persona_instruction_type pos \
    --assistant_name evil \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract

# Negative system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trait evil \
    --output_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_neg_instruct.csv \
    --persona_instruction_type neg \
    --assistant_name helpful \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract

Assistant Name Guidelines: We prepend a sentence before the generated positive/negative instruction: "You are a [assistant_name] assistant." The recommendations for the assistant_name parameter are:

Positive prompts: Use the trait adjective (e.g., "evil")
Negative prompts: Use the antonym when clear, otherwise use "helpful"

Compute Persona Vectors

Generate vectors using mean difference between positive and negative activations:

python generate_vec.py \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --pos_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_pos_instruct.csv \
    --neg_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_neg_instruct.csv \
    --trait evil \
    --save_dir persona_vectors/Qwen2.5-7B-Instruct/

Generated Files:

prompt_avg_diff.pt: Average prompt activations difference
response_avg_diff.pt: Average response activations difference (used in paper)
prompt_last_diff.pt: Last prompt token activations difference

Each vector has shape: [layers × hidden_dim]

Complete Pipeline

Run the full vector generation pipeline:

bash scripts/generate_vec.sh 0  # GPU 0

🎛️ Steering Methods

⚡ Inference-Time Steering

Apply persona vectors during model inference:

CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trait evil \
    --output_path eval_persona_eval/steering_results.csv \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version eval \
    --steering_type response \
    --coef 2.0 \
    --vector_path persona_vectors/Qwen2.5-7B-Instruct/evil_response_avg_diff.pt \
    --layer 20

Steering Types:

response: Apply steering to response tokens only
prompt: Apply steering to prompt tokens only
all: Apply steering to all tokens

🏋️ Model Training

📊 Dataset Structure

Training datasets are organized by trait type, each containing 3 versions:

normal.jsonl - Standard behavior examples
misaligned_1.jsonl - Trait-eliciting or mistake examples (Level I)
misaligned_2.jsonl - Trait-eliciting or mistake examples (Level II)

🔧 Basic Training

Train models with default hyperparameters:

python training.py configs/train_instruct_7b.json

🎯 Key Hyperparameters

Model: Qwen/Qwen2.5-7B-Instruct (configurable)
LoRA rank: 32
LoRA alpha: 64
Learning rate: 1e-5
Batch size: 2 per device
Gradient accumulation: 8 steps

🛡️ Training-Time Steering (Preventative)

Apply steering during model training using configs/train_instruct_7b_steer.json:

python training.py configs/train_instruct_7b_steer.json

Steering Configuration:

{
    "enable_steering_during_training": true,
    "steering_config": {
        "steering_vector_path": "persona_vectors/model/trait_response_avg_diff.pt",
        "type": "steer",
        "steering_coef": 5.0,
        "layers": [20]
    }
}

Parameters:

type: "steer" (preventative steering) or "ablate" (CAFT implementation)
steering_coef: Steering strength (only for "steer" type)
layers: Target transformer layers

📐 Calculate Projection

Supported file formats:

CSV files: Must contain prompt and answer columns
JSONL files: Each line should contain messages field (similar to training dataset format)

CUDA_VISIBLE_DEVICES=0 python -m eval.cal_projection \
    --file_path eval_persona_eval/Qwen2.5-7B-Instruct/evil.csv \
    --vector_path persona_vectors/Qwen2.5-7B-Instruct/evil_response_avg_diff.pt \
    --layer 20 \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --projection_type proj

Complete pipeline:

bash scripts/cal_projection.sh

🛠️ Available Scripts

Script	Purpose	Usage
`scripts/generate_vec.sh`	Complete vector generation pipeline	`bash scripts/generate_vec.sh 0`
`scripts/eval_steering.sh`	Evaluate steering effectiveness	`bash scripts/eval_steering.sh`
`scripts/eval_persona.sh`	Basic persona evaluation	`bash scripts/eval_persona.sh`
`scripts/cal_projection.sh`	Calculate projection	`bash scripts/cal_projection.sh`

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
.circleci		.circleci
configs		configs
data_generation		data_generation
eval		eval
exps		exps
influence		influence
k8s		k8s
output		output
scripts		scripts
visualizations		visualizations
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
README.md		README.md
activation_steer.py		activation_steer.py
config.py		config.py
dataset.zip		dataset.zip
generate_vec.py		generate_vec.py
judge.py		judge.py
mix_datasets.py		mix_datasets.py
pyproject.toml		pyproject.toml
sft.py		sft.py
training.py		training.py
utils.py		utils.py
validate.py		validate.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🎭 Persona Vectors: Monitoring and Controlling Character Traits in Language Models

📌 Getting Started

🛠️ Environment Setup

Tested: Kubernetes + Docker

Alternative: Local Docker (Untested)

Alternative: Local Virtual Environment (Untested)

🚀 Quick Start

⚙️ Setup

📦 Dataset Preparation

🏗️ Pipeline

Generate Trait Artifacts

Baseline Evaluation

Generate Persona Vectors

Evaluate with System Prompts

Compute Persona Vectors

Complete Pipeline

🎛️ Steering Methods

⚡ Inference-Time Steering

🏋️ Model Training

📊 Dataset Structure

🔧 Basic Training

🎯 Key Hyperparameters

🛡️ Training-Time Steering (Preventative)

📐 Calculate Projection

🛠️ Available Scripts

About

Uh oh!

Releases

Packages

Languages

AlignmentResearch/persona_vectors

Folders and files

Latest commit

History

Repository files navigation

🎭 Persona Vectors: Monitoring and Controlling Character Traits in Language Models

📌 Getting Started

🛠️ Environment Setup

Tested: Kubernetes + Docker

Alternative: Local Docker (Untested)

Alternative: Local Virtual Environment (Untested)

🚀 Quick Start

⚙️ Setup

📦 Dataset Preparation

🏗️ Pipeline

Generate Trait Artifacts

Baseline Evaluation

Generate Persona Vectors

Evaluate with System Prompts

Compute Persona Vectors

Complete Pipeline

🎛️ Steering Methods

⚡ Inference-Time Steering

🏋️ Model Training

📊 Dataset Structure

🔧 Basic Training

🎯 Key Hyperparameters

🛡️ Training-Time Steering (Preventative)

📐 Calculate Projection

🛠️ Available Scripts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages