Azazel5/Greatness-Analyzed

Mechanistic interpretability on texts about human greatness (biographies, speeches, and philosophy), using a frozen Meta Llama 3 8B model, sparse autoencoders (SAEs) on residual-stream activations, and activation patching / circuit-style analysis in the spirit of frontier interpretability work (e.g. sparse features, causal interventions).

The scientific bet: train an SAE on internal activations elicited by “greatness-dense” public-domain text, then interpret which latent features fire and how patching changes behavior, turning “greatness” from a vague label into testable hypotheses about representations and circuits.


Status (what exists vs planned)

Area                              Status
Chunk 1 — Data ingestion          Implemented: src/data_pipeline.py
Chunk 2 — Activation harvester    Planned (src/activation_harvester.py)
Chunk 3 — SAE architecture        Planned (src/sae_model.py)
Chunk 4 — SAE training            Planned (src/sae_trainer.py)
Chunk 5 — Feature interpretation  Planned (src/interpreter.py)
Activation patching scaffold      Implemented (baseline sweep): src/patching_engine.py

Upcoming: Data Curation (Apr 8-12)

This week focuses on expanding the data corpus beyond the initial Carnegie autobiography. I'll curate a "greatness-dense" dataset from primary sources of 15 exceptional historical and contemporary figures I will call "characters." These characters represent diverse domains of human achievement, providing rich text for training sparse autoencoders on greatness representations.

Core Characters:

  • Winston Churchill
  • Thomas Jefferson
  • George Washington
  • Leo Tolstoy
  • Voltaire
  • Charles Darwin
  • Mahatma Gandhi
  • Ralph Waldo Emerson
  • Samuel Pepys
  • Eleanor Roosevelt
  • Warren Buffett
  • Ulysses S. Grant
  • Benjamin Franklin
  • Oprah Winfrey
  • Naval Ravikant

Plan:

  • Develop a scraper script (src/data_scraper.py) to download the top three primary sources per character (45 total) from public-domain archives such as Project Gutenberg and Archive.org.
  • Handle manual collection for licensed or modern sources (e.g., Buffett letters, Oprah transcripts).
  • Store raw text directly to my external hard drive with metadata tracking.
  • Validate integrity and estimate ~50GB total corpus.
  • Output: cleaned and tokenized chunks ready for activation harvesting.
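A minimal sketch of what the scraper's core could look like, assuming the standard Gutenberg plain-text cache URL and the usual `*** START/END OF ... ***` license markers; src/data_scraper.py does not exist yet, so every name below is hypothetical:

```python
import re
import urllib.request

# Standard Gutenberg plain-text cache URL (assumed layout; verify per book).
GUTENBERG_TXT_URL = "https://www.gutenberg.org/cache/epub/{id}/pg{id}.txt"

def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the body between the '*** START OF ...' and '*** END OF ...' markers."""
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", text)
    if start and end:
        return text[start.end():end.start()].strip()
    return text.strip()  # no markers found: return the text as-is

def fetch_gutenberg(book_id: int) -> str:
    """Download one plain-text book and strip the license boilerplate."""
    with urllib.request.urlopen(GUTENBERG_TXT_URL.format(id=book_id)) as resp:
        raw = resp.read().decode("utf-8", errors="replace")
    return strip_gutenberg_boilerplate(raw)
```

Licensed or modern sources (Buffett letters, Oprah transcripts) would bypass this path and be collected manually, as noted above.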

Planned pipeline (five chunks)

Work is intended to proceed chunk by chunk with validation before scaling on GPU.

  1. Raw data ingestion — Pull public-domain text (Project Gutenberg), clean boilerplate, tokenize with the same tokenizer as Llama 3 8B, emit fixed-length windows (currently 256 tokens) for batching. Output: data/raw/ (e.g. chunked .pt with input_ids + metadata).

  2. Activation harvester — Load Llama 3 8B via TransformerLens (HookedTransformer), run chunks through the frozen model, record residual activations at a chosen layer (e.g. mid-depth hook_resid_post). Output: results/activations/ (large .pt shards; batched to avoid OOM on A100-class GPUs).

  3. SAE architecture — Encoder (linear → nonlinearity), wide latent (typically several× d_model), decoder (linear). Losses: MSE reconstruction + L1 sparsity on latents.

  4. SAE training — Train on harvested activations with Adam; checkpoint to results/models/ (e.g. sae_weights.pth).

  5. Interpretation — For text snippets: forward through LLM → layer activation → SAE encode → top-k latent features; correlate spikes with text; optional mining of max-activating examples from stored activations.
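Chunks 3–5 can be sketched together as follows. This is an illustrative stand-in for the planned src/sae_model.py, src/sae_trainer.py, and src/interpreter.py, which do not exist yet; the class name, expansion factor, and L1 coefficient are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: linear encoder -> ReLU, wide latent, linear decoder."""
    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, expansion * d_model)
        self.decoder = nn.Linear(expansion * d_model, d_model)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))  # sparse latent code (non-negative)
        x_hat = self.decoder(z)      # reconstruction of the residual activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    """MSE reconstruction loss plus an L1 sparsity penalty on the latents."""
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

# One Adam step on a fake activation batch, then a Chunk-5-style top-k readout:
sae = SparseAutoencoder(d_model=64, expansion=4)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(32, 64)           # stand-in for harvested residual activations
x_hat, z = sae(acts)
loss = sae_loss(acts, x_hat, z)
loss.backward()
opt.step()
opt.zero_grad()
top_features = z[0].topk(k=5).indices  # most active latent features for one position
```

For the real model, d_model would be 4096 (Llama 3 8B) and acts would come from the stored results/activations/ shards.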

Patching (parallel track): src/patching_engine.py already runs a full-layer sweep of residual patching between a matched-length target vs baseline prompt pair (greatness / composure themed). Next step is to attach a metric (e.g. logit difference on contrast tokens) rather than only confirming hooks run.
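One hedged possibility for that metric is a logit difference at the final position; `logit_diff` below is an illustrative helper, not current patching_engine.py code, and the TransformerLens wiring in the comments is an assumption:

```python
import torch

def logit_diff(logits: torch.Tensor, target_id: int, baseline_id: int) -> float:
    """Logit difference between two contrast tokens at the final position.
    logits: [batch, seq, vocab] from a clean or patched forward pass."""
    final = logits[:, -1, :]
    return (final[:, target_id] - final[:, baseline_id]).mean().item()

# With TransformerLens, a residual patch at layer L could look roughly like
# (hypothetical sketch; names like cached_baseline_resid are placeholders):
#
# def patch_hook(resid, hook):                 # resid: [batch, seq, d_model]
#     return cached_baseline_resid[hook.name]  # overwrite with the baseline run
#
# patched_logits = model.run_with_hooks(
#     target_tokens,
#     fwd_hooks=[(f"blocks.{L}.hook_resid_post", patch_hook)],
# )
# score = logit_diff(patched_logits, target_id, baseline_id)
```

Sweeping the layer index and plotting the score per layer would turn the existing hook-confirmation run into a quantitative localization experiment.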


Chunk 1 details (implemented)

  • Source: Project Gutenberg #17976, The Autobiography of Andrew Carnegie (plain text via the standard cache/epub URL).
  • Tokenizer: meta-llama/Meta-Llama-3-8B (gated on Hugging Face; requires access + token).
  • Output file: data/raw/gutenberg_17976_chunks_256.pt — tensor [num_chunks, 256], dtype long, plus a meta dict.

Run:

export HF_TOKEN=...   # Hugging Face token with Llama 3 access
python -m src.data_pipeline
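The fixed-length windowing step can be illustrated as follows. `to_fixed_windows` is a hypothetical helper, not necessarily the function in src/data_pipeline.py; the gated tokenizer call is shown only in comments so the sketch runs without Hugging Face access:

```python
import torch

def to_fixed_windows(input_ids: list, window: int = 256) -> torch.Tensor:
    """Drop the ragged tail and reshape token ids into [num_chunks, window], dtype long."""
    n = (len(input_ids) // window) * window
    return torch.tensor(input_ids[:n], dtype=torch.long).view(-1, window)

# In the real pipeline the ids come from the gated Llama 3 tokenizer, roughly:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=os.environ["HF_TOKEN"])
# ids = tok(text, add_special_tokens=False)["input_ids"]
chunks = to_fixed_windows(list(range(1000)), window=256)  # -> shape [3, 256]
# torch.save({"chunks": chunks, "meta": {"window": 256}}, "data/raw/gutenberg_17976_chunks_256.pt")
```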

Repository layout

├── LICENSE
├── README.md
├── requirements.txt    # Python deps (torch, transformers, transformer-lens, …)
├── .gitignore          # includes results/, typical Python ignores
├── src/
│   ├── data_pipeline.py    # Chunk 1
│   └── patching_engine.py  # Activation patching experiment scaffold
├── data/raw/           # Chunked datasets (created by pipeline; may be large)
└── results/            # Activations, checkpoints (gitignored; use for heavy artifacts)

Setup

Python environment

  • Python 3.10+ recommended.
  • Use a virtual environment. Prefer creating the venv outside cloud-synced folders (e.g. Google Drive): sync clients often make many small file operations slow or flaky, which hurts pip, git, and even source .venv/bin/activate.

Dependencies

Install from requirements.txt when present, or equivalently:

  • torch (≥ 2.4 recommended for recent transformers 5.x)
  • transformers
  • transformer-lens
  • datasets, accelerate (for planned scaling / HF workflows)

NumPy: Avoid NumPy 2.x with older torch wheels that were built against NumPy 1.x (ABI warnings / _ARRAY_API errors). Prefer numpy>=1.26,<2 with a pinned modern torch, or follow current PyTorch release notes for NumPy 2 compatibility.
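As one hedged possibility, a requirements pin consistent with the note above might look like this (the exact versions are assumptions; defer to the current PyTorch release notes):

```
torch>=2.4
transformers
transformer-lens
numpy>=1.26,<2
datasets
accelerate
```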

Hugging Face / Llama 3

  1. Request access to meta-llama/Meta-Llama-3-8B on Hugging Face and accept the license.
  2. export HF_TOKEN=... or huggingface-cli login.

Without access, Chunk 1 fails at tokenizer download with a 403 / gated repo error; that is expected until access is granted.

Compute

  • Local: development and small tests (CPU).
  • Planned: Google Colab Pro (e.g. A100) for model load, activation harvesting, and SAE training. Large tensors belong under results/ (ignored by git); sync or copy those separately if you use cloud backup drives.

What has been attempted / learned (project hygiene)

  • Tokenizer alignment: Chunk 1 uses the Llama 3 8B tokenizer so the chunks match the tokenizer that HookedTransformer loads later.
  • Dependency stack: transformers 5.x expects torch ≥ 2.4; mixing torch 2.2 + NumPy 2.x produced ABI warnings and broken NumPy integration in some installs—address with coordinated upgrades or numpy<2.
  • Git + cloud sync: Keeping a live .git directory inside a Google Drive–backed path led to "Operation timed out" errors on reads (e.g. .git/HEAD, rsync/mmap on source files). GitHub is the canonical history; a local clone on normal disk for git commit / push, with optional manual sync of working files to/from Drive, is more reliable than running Git entirely inside the synced tree.
  • Virtualenv on Drive: Same class of I/O issues; venv on local disk avoids multi-second activate and fragile pip.

Roadmap (short)

  • Implement Chunks 2–5 in order; keep layer index and paths configurable.
  • Expand the greatness corpus (more Gutenberg sources; screenplays only with clear rights).
  • Extend patching with a quantitative logit- or probability-based metric on contrast pairs.
  • Optional: JumpReLU / transcoder-style SAE variants, automated feature labeling, steering experiments—documented as stretch goals.

License

See LICENSE.


About

Greatness-Analyzed is an interpretability-first study of how language models represent ambition, composure, and greatness in text, without claiming those labels are uniquely "one feature" in the model; features are hypotheses checked against reconstruction, sparsity, examples, and causal tests.

Can mechanistic interpretability shed light on the questions "what is greatness?" and "what is excellence?"
