Mechanistic interpretability on texts about human greatness, biographies, speeches, and philosophy, using a frozen Meta Llama 3 8B model, sparse autoencoders (SAEs) on residual-stream activations, and activation patching / circuit-style analysis in the spirit of frontier interpretability work (e.g. sparse features, causal interventions).
The scientific bet: train an SAE on internal activations elicited by “greatness-dense” public-domain text, then interpret which latent features fire and how patching changes behavior, turning “greatness” from a vague label into testable hypotheses about representations and circuits.
| Area | Status |
|---|---|
| Chunk 1 — Data ingestion | Implemented: src/data_pipeline.py |
| Chunk 2 — Activation harvester | Planned (src/activation_harvester.py) |
| Chunk 3 — SAE architecture | Planned (src/sae_model.py) |
| Chunk 4 — SAE training | Planned (src/sae_trainer.py) |
| Chunk 5 — Feature interpretation | Planned (src/interpreter.py) |
| Activation patching scaffold | Implemented (baseline sweep): src/patching_engine.py |
This week focuses on expanding the data corpus beyond the initial Carnegie autobiography. I'll curate a "greatness-dense" dataset from primary sources of 15 exceptional historical and contemporary figures I will call "characters." These characters represent diverse domains of human achievement, providing rich text for training sparse autoencoders on greatness representations.
Core Characters:
- Winston Churchill
- Thomas Jefferson
- George Washington
- Leo Tolstoy
- Voltaire
- Charles Darwin
- Mahatma Gandhi
- Ralph Waldo Emerson
- Samuel Pepys
- Eleanor Roosevelt
- Warren Buffett
- Ulysses S. Grant
- Benjamin Franklin
- Oprah Winfrey
- Naval Ravikant
Plan:
- Develop a scraper script (`src/data_scraper.py`) to download the top 3 primary sources per character (45 total) from public-domain archives such as Project Gutenberg and Archive.org.
- Handle manual collection for licensed or modern sources (e.g., Buffett letters, Oprah transcripts).
- Store raw text directly to my external hard drive with metadata tracking.
- Validate integrity and estimate ~50GB total corpus.
- Output: cleaned and tokenized chunks ready for activation harvesting.
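The scraper step above could start as a sketch like the following. The function names, the metadata fields, and the example book ID mapping are assumptions, not the actual `src/data_scraper.py` API; only the Gutenberg `cache/epub` URL pattern comes from the sources already used in this repo.

```python
"""Minimal sketch of the planned src/data_scraper.py (helper names hypothetical)."""
import json
import urllib.request
from pathlib import Path


def gutenberg_txt_url(book_id: int) -> str:
    # Standard Project Gutenberg plain-text location (the "cache/epub" URL).
    return f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


def fetch_source(book_id: int, character: str, out_dir: Path) -> Path:
    """Download one public-domain source and record minimal metadata next to it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / f"gutenberg_{book_id}.txt"
    with urllib.request.urlopen(gutenberg_txt_url(book_id)) as resp:
        dest.write_bytes(resp.read())
    meta = {
        "book_id": book_id,
        "character": character,
        "url": gutenberg_txt_url(book_id),
    }
    (out_dir / f"gutenberg_{book_id}.meta.json").write_text(json.dumps(meta, indent=2))
    return dest


# Example using the one source already in the pipeline (Carnegie, Gutenberg #17976):
# fetch_source(17976, "Andrew Carnegie", Path("data/raw"))
```

Per-character book IDs would still need to be curated by hand; licensed or modern material (Buffett letters, Oprah transcripts) stays a manual step.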
Work is intended to proceed chunk by chunk with validation before scaling on GPU.
- Raw data ingestion — Pull public-domain text (Project Gutenberg), clean boilerplate, tokenize with the same tokenizer as Llama 3 8B, and emit fixed-length windows (currently 256 tokens) for batching. Output: `data/raw/` (e.g. `chunked.pt` with `input_ids` + metadata).
- Activation harvester — Load Llama 3 8B via TransformerLens (`HookedTransformer`), run chunks through the frozen model, and record residual activations at a chosen layer (e.g. a mid-depth `hook_resid_post`). Output: `results/activations/` (large `.pt` shards; batched to avoid OOM on A100-class GPUs).
- SAE architecture — Encoder (linear → nonlinearity), wide latent (typically several × `d_model`), decoder (linear). Losses: MSE reconstruction + L1 sparsity on latents.
- SAE training — Train on harvested activations with Adam; checkpoint to `results/models/` (e.g. `sae_weights.pth`).
- Interpretation — For text snippets: forward through the LLM → layer activation → SAE encode → top-k latent features; correlate spikes with the text; optionally mine max-activating examples from stored activations.
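The SAE architecture, loss, and top-k interpretation steps above can be sketched in a few lines of PyTorch. This is a hedged sketch, not the planned `src/sae_model.py`: the class name, expansion factor, and L1 coefficient are assumptions; only the MSE + L1 recipe and the wide-latent shape come from the plan itself.

```python
# Hypothetical sketch of src/sae_model.py: a vanilla SAE with MSE + L1 loss.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        d_latent = expansion * d_model          # wide latent: several x d_model
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))         # nonnegative, hopefully sparse codes
        x_hat = self.decoder(z)
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    recon = torch.mean((x - x_hat) ** 2)        # MSE reconstruction term
    sparsity = l1_coeff * z.abs().mean()        # L1 sparsity penalty on latents
    return recon + sparsity


def top_features(z: torch.Tensor, k: int = 5):
    """Interpretation step: indices of the k most active latents per example."""
    return torch.topk(z, k, dim=-1).indices
```

Training (Chunk 4) would loop Adam over harvested activation shards with `sae_loss`, then checkpoint the module's `state_dict` to `results/models/`.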
Patching (parallel track): src/patching_engine.py already runs a full-layer sweep of residual patching between a matched-length target vs baseline prompt pair (greatness / composure themed). Next step is to attach a metric (e.g. logit difference on contrast tokens) rather than only confirming hooks run.
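The metric mentioned above could start as a plain logit difference on a contrast token pair. The function below is a sketch; the token IDs and the TransformerLens wiring in the trailing comment are assumptions about how `src/patching_engine.py` might use it, not its current code.

```python
# Sketch of a logit-difference metric for scoring residual patches.
import torch


def logit_diff(logits: torch.Tensor, target_tok: int, baseline_tok: int) -> torch.Tensor:
    """Difference between two contrast tokens' logits at the final position.

    logits: [batch, seq, d_vocab] from a forward pass; a positive value means
    the patch pushed the model toward the target-themed continuation.
    """
    final = logits[:, -1, :]
    return (final[:, target_tok] - final[:, baseline_tok]).mean()


# With TransformerLens, a residual patch at layer L would be scored roughly as:
#   patched_logits = model.run_with_hooks(
#       baseline_tokens,
#       fwd_hooks=[(f"blocks.{L}.hook_resid_post", patch_hook)],
#   )
#   score = logit_diff(patched_logits, target_tok, baseline_tok)
```

Sweeping this score over layers turns the existing full-layer sweep from "hooks run" into a per-layer causal effect curve.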
- Source: Project Gutenberg #17976 — The Autobiography of Andrew Carnegie (plain text via the standard `cache/epub` URL).
- Tokenizer: `meta-llama/Meta-Llama-3-8B` (gated on Hugging Face; requires access + token).
- Output file: `data/raw/gutenberg_17976_chunks_256.pt` — tensor `[num_chunks, 256]`, dtype `long`, plus a `meta` dict.
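The fixed-length windowing that produces the `[num_chunks, 256]` tensor might look like this sketch (the function name is illustrative, not the real `src/data_pipeline.py` API):

```python
"""Illustrative sketch of Chunk 1's windowing step (name hypothetical)."""
from typing import List


def make_windows(input_ids: List[int], window: int = 256) -> List[List[int]]:
    """Split a token-ID stream into non-overlapping fixed-length windows.

    The trailing remainder (< window tokens) is dropped so every chunk has
    an identical shape for batching.
    """
    n_full = len(input_ids) // window
    return [input_ids[i * window:(i + 1) * window] for i in range(n_full)]


# In the real pipeline these windows are stacked into a [num_chunks, 256]
# long tensor and saved with torch.save alongside a metadata dict.
```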
Run:

```bash
export HF_TOKEN=...   # Hugging Face token with Llama 3 access
python -m src.data_pipeline
```

```text
├── LICENSE
├── README.md
├── requirements.txt       # Python deps (torch, transformers, transformer-lens, …)
├── .gitignore             # includes results/, typical Python ignores
├── src/
│   ├── data_pipeline.py       # Chunk 1
│   └── patching_engine.py     # Activation patching experiment scaffold
├── data/raw/              # Chunked datasets (created by pipeline; may be large)
└── results/               # Activations, checkpoints (gitignored; use for heavy artifacts)
```
- Python 3.10+ recommended.
- Use a virtual environment. Prefer creating the venv outside cloud-synced folders (e.g. Google Drive): sync clients often make many small file operations slow or flaky, which hurts `pip`, `git`, and even `source .venv/bin/activate`.
Install from requirements.txt when present, or equivalently:
- `torch` (≥ 2.4 recommended for recent `transformers` 5.x)
- `transformers`
- `transformer-lens`
- `datasets`, `accelerate` (for planned scaling / HF workflows)
NumPy: Avoid NumPy 2.x with older torch wheels that were built against NumPy 1.x (ABI warnings / `_ARRAY_API` errors). Prefer `numpy>=1.26,<2` with a pinned modern torch, or follow current PyTorch release notes for NumPy 2 compatibility.
- Request access to `meta-llama/Meta-Llama-3-8B` on Hugging Face and accept the license. Then `export HF_TOKEN=...` or `huggingface-cli login`.
- Without access, Chunk 1 fails at tokenizer download with a 403 / gated-repo error; that is expected until access is granted.
- Local: development and small tests (CPU).
- Planned: Google Colab Pro (e.g. A100) for model load, activation harvesting, and SAE training. Large tensors belong under `results/` (ignored by git); sync or copy those separately if you use cloud backup drives.
- Tokenizer alignment: Chunk 1 uses the Llama 3 8B tokenizer so chunks match `HookedTokenizer`/`HookedTransformer` later.
- Dependency stack: `transformers` 5.x expects torch ≥ 2.4; mixing torch 2.2 + NumPy 2.x produced ABI warnings and broken NumPy integration in some installs—address with coordinated upgrades or `numpy<2`.
- Git + cloud sync: Keeping a live `.git` directory inside a Google Drive–backed path led to `Operation timed out` on reads (e.g. `.git/HEAD`, `rsync`/`mmap` on source files). GitHub is the canonical history; a local clone on normal disk for `git commit`/`push`, with optional manual sync of working files to/from Drive, is more reliable than running Git fully inside the synced tree.
- Virtualenv on Drive: Same class of I/O issues; a venv on local disk avoids multi-second `activate` and fragile `pip`.
- Implement Chunks 2–5 in order; keep layer index and paths configurable.
- Expand the greatness corpus (more Gutenberg sources; screenplays only with clear rights).
- Extend patching with a quantitative logit- or probability-based metric on contrast pairs.
- Optional: JumpReLU / transcoder-style SAE variants, automated feature labeling, steering experiments—documented as stretch goals.
See LICENSE.
Greatness-Analyzed — interpretability-first study of how language models represent ambition, composure, and greatness in text, without claiming those labels are uniquely “one feature” in the model; features are hypotheses checked against reconstruction, sparsity, examples, and causal tests.