This fork extends the Persona Vectors codebase with tools for influence-based data attribution to identify training examples that contribute to emergent misaligned behavior.
π Start here: influence/README.md β contains the full pipeline for computing data attribution rankings, filtering training data, and evaluating retrained models.
This codebase is developed and tested using a Kubernetes devbox with the provided Docker image. (Note - may have some access trouble to pull from FAR secrets, so probably worth trying the other local methods below first).
-
Build or pull the Docker image:
# Build locally docker build -t persona-vectors . # Or pull the pre-built image docker pull ghcr.io/alignmentresearch/persona:latest
-
Deploy on Kubernetes:
kubectl create -f k8s/devbox_gpu.yaml
The devbox requests 1 GPU (H100), 4 CPUs, and 100GB memory. Modify
k8s/devbox_gpu.yamlas needed for your cluster. -
Clone the repo inside the container and install:
git clone <repo-url> /workspace/persona_vectors cd /workspace/persona_vectors uv pip install --system -e '.[dev,gpu]'
Run the Docker container locally with GPU passthrough:
docker build -t persona-vectors .
docker run --gpus all -it -v $(pwd):/workspace/persona_vectors persona-vectors bash
cd /workspace/persona_vectors
uv pip install --system -e '.[dev,gpu]'For local development without Docker:
# Requires Python 3.12+
python -m venv .venv
source .venv/bin/activate
# CPU-only (for development/testing)
pip install -e '.[dev]'
# With GPU support (requires CUDA)
pip install -e '.[dev,gpu]'
β οΈ Note: The GPU dependencies (unsloth,vllm,curvlinops) may require specific CUDA versions and system libraries. The Docker setup handles these dependencies automatically.
Note: The remainder of this README is from the original Persona Vectors repository and is preserved below for reference. It documents the base functionality for persona vector generation, steering, and evaluation.
This is the official repository for Persona Vectors, a method for monitoring and controlling character traits in language models.
- Create a project virtual environment:
python -m venv .venv
source .venv/bin/activate- Install dependencies:
pip install -e '.[dev]'- Configure environment:
cp .env.example .env
# Fill in your API keys in the .env fileExtract the training datasets:
unzip dataset.zipWe provide pre-generated trait artifacts in:
data_generation/trait_data_extract/- Extraction setdata_generation/trait_data_eval/- Evaluation set
Each trait file contains:
- Positive and negative prompts
- Questions for evaluation
- Evaluation prompts
To generate new artifacts: Use prompts from data_generation/prompts.py. We used Claude-3.7-Sonnet (thinking mode, budget: 5000, max_tokens: 16000).
Evaluate models without any interventions:
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model path/to/your/model \
--trait evil \
--output_path path/to/results.csv \
--judge_model gpt-4.1-mini-2025-04-14 \
--version evalThis script uses the questions and prompts from data_generation/trait_data_eval/evil.json and generates outputs for the model. The judge then evaluates the amount of trait contained in the output. Note - the output csv file, saved at --output_path, is then used during the 'Calculate Projection' step where the projection results are appended to the end of each datapoint.
Our evaluation uses openai-based judge functions, primarily adapted from the Emergent Misalignment codebase.
Generate activations using positive and negative system prompts:
# Positive system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model Qwen/Qwen2.5-7B-Instruct \
--trait evil \
--output_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_pos_instruct.csv \
--persona_instruction_type pos \
--assistant_name evil \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extract
# Negative system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model Qwen/Qwen2.5-7B-Instruct \
--trait evil \
--output_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_neg_instruct.csv \
--persona_instruction_type neg \
--assistant_name helpful \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extractAssistant Name Guidelines:
We prepend a sentence before the generated positive/negative instruction: "You are a [assistant_name] assistant." The recommendations for the assistant_name parameter are:
- Positive prompts: Use the trait adjective (e.g., "evil")
- Negative prompts: Use the antonym when clear, otherwise use "helpful"
Generate vectors using mean difference between positive and negative activations:
python generate_vec.py \
--model_name Qwen/Qwen2.5-7B-Instruct \
--pos_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_pos_instruct.csv \
--neg_path eval_persona_extract/Qwen2.5-7B-Instruct/evil_neg_instruct.csv \
--trait evil \
--save_dir persona_vectors/Qwen2.5-7B-Instruct/Generated Files:
prompt_avg_diff.pt: Average prompt activations differenceresponse_avg_diff.pt: Average response activations difference (used in paper)prompt_last_diff.pt: Last prompt token activations difference
Each vector has shape: [layers Γ hidden_dim]
Run the full vector generation pipeline:
bash scripts/generate_vec.sh 0 # GPU 0Apply persona vectors during model inference:
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model Qwen/Qwen2.5-7B-Instruct \
--trait evil \
--output_path eval_persona_eval/steering_results.csv \
--judge_model gpt-4.1-mini-2025-04-14 \
--version eval \
--steering_type response \
--coef 2.0 \
--vector_path persona_vectors/Qwen2.5-7B-Instruct/evil_response_avg_diff.pt \
--layer 20Steering Types:
response: Apply steering to response tokens onlyprompt: Apply steering to prompt tokens onlyall: Apply steering to all tokens
Training datasets are organized by trait type, each containing 3 versions:
normal.jsonl- Standard behavior examplesmisaligned_1.jsonl- Trait-eliciting or mistake examples (Level I)misaligned_2.jsonl- Trait-eliciting or mistake examples (Level II)
Train models with default hyperparameters:
python training.py configs/train_instruct_7b.json- Model:
Qwen/Qwen2.5-7B-Instruct(configurable) - LoRA rank: 32
- LoRA alpha: 64
- Learning rate: 1e-5
- Batch size: 2 per device
- Gradient accumulation: 8 steps
Apply steering during model training using configs/train_instruct_7b_steer.json:
python training.py configs/train_instruct_7b_steer.jsonSteering Configuration:
{
"enable_steering_during_training": true,
"steering_config": {
"steering_vector_path": "persona_vectors/model/trait_response_avg_diff.pt",
"type": "steer",
"steering_coef": 5.0,
"layers": [20]
}
}Parameters:
type:"steer"(preventative steering) or"ablate"(CAFT implementation)steering_coef: Steering strength (only for"steer"type)layers: Target transformer layers
Supported file formats:
- CSV files: Must contain
promptandanswercolumns - JSONL files: Each line should contain
messagesfield (similar to training dataset format)
CUDA_VISIBLE_DEVICES=0 python -m eval.cal_projection \
--file_path eval_persona_eval/Qwen2.5-7B-Instruct/evil.csv \
--vector_path persona_vectors/Qwen2.5-7B-Instruct/evil_response_avg_diff.pt \
--layer 20 \
--model_name Qwen/Qwen2.5-7B-Instruct \
--projection_type projComplete pipeline:
bash scripts/cal_projection.sh| Script | Purpose | Usage |
|---|---|---|
scripts/generate_vec.sh |
Complete vector generation pipeline | bash scripts/generate_vec.sh 0 |
scripts/eval_steering.sh |
Evaluate steering effectiveness | bash scripts/eval_steering.sh |
scripts/eval_persona.sh |
Basic persona evaluation | bash scripts/eval_persona.sh |
scripts/cal_projection.sh |
Calculate projection | bash scripts/cal_projection.sh |