Skip to content

AlignmentResearch/persona_vectors

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

74 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🎭 Persona Vectors: Monitoring and Controlling Character Traits in Language Models


πŸ“Œ Getting Started

This fork extends the Persona Vectors codebase with tools for influence-based data attribution to identify training examples that contribute to emergent misaligned behavior.

πŸ‘‰ Start here: influence/README.md β€” contains the full pipeline for computing data attribution rankings, filtering training data, and evaluating retrained models.

πŸ› οΈ Environment Setup

Tested: Kubernetes + Docker

This codebase is developed and tested using a Kubernetes devbox with the provided Docker image. (Note - may have some access trouble to pull from FAR secrets, so probably worth trying the other local methods below first).

  1. Build or pull the Docker image:

    # Build locally
    docker build -t persona-vectors .
    
    # Or pull the pre-built image
    docker pull ghcr.io/alignmentresearch/persona:latest
  2. Deploy on Kubernetes:

    kubectl create -f k8s/devbox_gpu.yaml

    The devbox requests 1 GPU (H100), 4 CPUs, and 100GB memory. Modify k8s/devbox_gpu.yaml as needed for your cluster.

  3. Clone the repo inside the container and install:

    git clone <repo-url> /workspace/persona_vectors
    cd /workspace/persona_vectors
    uv pip install --system -e '.[dev,gpu]'

Alternative: Local Docker (Untested)

Run the Docker container locally with GPU passthrough:

docker build -t persona-vectors .
docker run --gpus all -it -v $(pwd):/workspace/persona_vectors persona-vectors bash
cd /workspace/persona_vectors
uv pip install --system -e '.[dev,gpu]'

Alternative: Local Virtual Environment (Untested)

For local development without Docker:

# Requires Python 3.12+
python -m venv .venv
source .venv/bin/activate

# CPU-only (for development/testing)
pip install -e '.[dev]'

# With GPU support (requires CUDA)
pip install -e '.[dev,gpu]'

⚠️ Note: The GPU dependencies (unsloth, vllm, curvlinops) may require specific CUDA versions and system libraries. The Docker setup handles these dependencies automatically.


Note: The remainder of this README is from the original Persona Vectors repository and is preserved below for reference. It documents the base functionality for persona vector generation, steering, and evaluation.


This is the official repository for Persona Vectors, a method for monitoring and controlling character traits in language models.

πŸš€ Quick Start

βš™οΈ Setup

  1. Create a project virtual environment:
python -m venv .venv
source .venv/bin/activate
  1. Install dependencies:
pip install -e '.[dev]'
  1. Configure environment:
cp .env.example .env
# Fill in your API keys in the .env file

πŸ“¦ Dataset Preparation

Extract the training datasets:

unzip dataset.zip

πŸ—οΈ Pipeline

Generate Trait Artifacts

We provide pre-generated trait artifacts in:

  • data_generation/trait_data_extract/ - Extraction set
  • data_generation/trait_data_eval/ - Evaluation set

Each trait file contains:

  • Positive and negative prompts
  • Questions for evaluation
  • Evaluation prompts

To generate new artifacts: Use prompts from data_generation/prompts.py. We used Claude-3.7-Sonnet (thinking mode, budget: 5000, max_tokens: 16000).

Baseline Evaluation

Evaluate models without any interventions:

CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model path/to/your/model \
    --trait evil \
    --output_path path/to/results.csv \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version eval

This script uses the questions and prompts from data_generation/trait_data_eval/evil.json and generates outputs for the model. The judge then evaluates the amount of trait contained in the output. Note - the output csv file, saved at --output_path, is then used during the 'Calculate Projection' step where the projection results are appended to the end of each datapoint.

Our evaluation uses openai-based judge functions, primarily adapted from the Emergent Misalignment codebase.

Generate Persona Vectors

Evaluate with System Prompts

Generate activations using positive and negative system prompts:

# Positive system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trait evil \
    --output_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_pos_instruct.csv \
    --persona_instruction_type pos \
    --assistant_name evil \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract

# Negative system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trait evil \
    --output_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_neg_instruct.csv \
    --persona_instruction_type neg \
    --assistant_name helpful \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract

Assistant Name Guidelines: We prepend a sentence before the generated positive/negative instruction: "You are a [assistant_name] assistant." The recommendations for the assistant_name parameter are:

  • Positive prompts: Use the trait adjective (e.g., "evil")
  • Negative prompts: Use the antonym when clear, otherwise use "helpful"

Compute Persona Vectors

Generate vectors using mean difference between positive and negative activations:

python generate_vec.py \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --pos_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_pos_instruct.csv \
    --neg_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_neg_instruct.csv \
    --trait evil \
    --save_dir persona_vectors/Qwen2.5-7B-Instruct/

Generated Files:

  • prompt_avg_diff.pt: Average prompt activations difference
  • response_avg_diff.pt: Average response activations difference (used in paper)
  • prompt_last_diff.pt: Last prompt token activations difference

Each vector has shape: [layers Γ— hidden_dim]

Complete Pipeline

Run the full vector generation pipeline:

bash scripts/generate_vec.sh 0  # GPU 0

πŸŽ›οΈ Steering Methods

⚑ Inference-Time Steering

Apply persona vectors during model inference:

CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trait evil \
    --output_path eval_persona_eval/steering_results.csv \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version eval \
    --steering_type response \
    --coef 2.0 \
    --vector_path persona_vectors/Qwen2.5-7B-Instruct/evil_response_avg_diff.pt \
    --layer 20

Steering Types:

  • response: Apply steering to response tokens only
  • prompt: Apply steering to prompt tokens only
  • all: Apply steering to all tokens

πŸ‹οΈ Model Training

πŸ“Š Dataset Structure

Training datasets are organized by trait type, each containing 3 versions:

  • normal.jsonl - Standard behavior examples
  • misaligned_1.jsonl - Trait-eliciting or mistake examples (Level I)
  • misaligned_2.jsonl - Trait-eliciting or mistake examples (Level II)

πŸ”§ Basic Training

Train models with default hyperparameters:

python training.py configs/train_instruct_7b.json

🎯 Key Hyperparameters

  • Model: Qwen/Qwen2.5-7B-Instruct (configurable)
  • LoRA rank: 32
  • LoRA alpha: 64
  • Learning rate: 1e-5
  • Batch size: 2 per device
  • Gradient accumulation: 8 steps

πŸ›‘οΈ Training-Time Steering (Preventative)

Apply steering during model training using configs/train_instruct_7b_steer.json:

python training.py configs/train_instruct_7b_steer.json

Steering Configuration:

{
    "enable_steering_during_training": true,
    "steering_config": {
        "steering_vector_path": "persona_vectors/model/trait_response_avg_diff.pt",
        "type": "steer",
        "steering_coef": 5.0,
        "layers": [20]
    }
}

Parameters:

  • type: "steer" (preventative steering) or "ablate" (CAFT implementation)
  • steering_coef: Steering strength (only for "steer" type)
  • layers: Target transformer layers

πŸ“ Calculate Projection

Supported file formats:

  • CSV files: Must contain prompt and answer columns
  • JSONL files: Each line should contain messages field (similar to training dataset format)
CUDA_VISIBLE_DEVICES=0 python -m eval.cal_projection \
    --file_path eval_persona_eval/Qwen2.5-7B-Instruct/evil.csv \
    --vector_path persona_vectors/Qwen2.5-7B-Instruct/evil_response_avg_diff.pt \
    --layer 20 \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --projection_type proj

Complete pipeline:

bash scripts/cal_projection.sh

πŸ› οΈ Available Scripts

Script Purpose Usage
scripts/generate_vec.sh Complete vector generation pipeline bash scripts/generate_vec.sh 0
scripts/eval_steering.sh Evaluate steering effectiveness bash scripts/eval_steering.sh
scripts/eval_persona.sh Basic persona evaluation bash scripts/eval_persona.sh
scripts/cal_projection.sh Calculate projection bash scripts/cal_projection.sh

About

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 94.0%
  • Shell 5.5%
  • Dockerfile 0.5%