This is the implementation of MatLLMSearch: Crystal Structure Discovery with Evolution-Guided Large Language Models. The code implements an evolutionary search pipeline for crystal structure generation (CSG) and crystal structure prediction (CSP) that uses Large Language Models (LLMs) without fine-tuning.
- Install MatLLMSearch dependencies:
pip install -r requirements.txt
- Configure models and credentials:
  - Copy config/credentials.yaml and add your API keys
  - Configure models in config/models.yaml (already includes common models)
- Download required data files:
# Create data directory
mkdir -p data
# Download seed structures (optional - enables few-shot generation)
# Download data/band_gap_processed.csv from https://drive.google.com/file/d/1DqE9wo6dqw3aSLEfBx-_QOdqmtqCqYQ5/view?usp=sharing
# or data/band_gap_processed_5000.csv from https://drive.google.com/file/d/14e5p3EoKzOHqw7hKy8oDsaGPK6gwhnLV/view?usp=sharing
# Download phase diagram data (required for E_hull distance calculations)
wget -O data/2023-02-07-ppd-mp.pkl.gz https://figshare.com/ndownloader/files/48241624
Note:
- All configuration is managed through the local config/ directory
- Models are configured in config/models.yaml
- API keys are configured in config/credentials.yaml
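Before launching a run, it can help to confirm that the files named in the setup steps are in place. The following is a minimal sketch, not part of MatLLMSearch: the helper `missing_files` and the `REQUIRED_FILES` tuple are hypothetical names, and the list assumes you need E_hull distance calculations (the seed-structure CSV is optional and is not checked).

```python
from pathlib import Path

# Paths taken from the setup steps above. The phase diagram file is only
# needed for E_hull distance calculations; the seed CSV is optional and omitted.
REQUIRED_FILES = (
    "config/models.yaml",
    "config/credentials.yaml",
    "data/2023-02-07-ppd-mp.pkl.gz",
)


def missing_files(root=".", required=REQUIRED_FILES):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from the project root so that the relative paths resolve against the config/ and data/ directories.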
Generate novel crystal structures using evolutionary optimization:
python cli.py csg \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--population-size 100 \
--max-iter 10 \
--opt-goal e_hull_distance \
--data-path data/band_gap_processed.csv \
--save-label csg_experiment
Predict ground state structures for a target compound:
python cli.py csp \
--compound Ag6O2 \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--population-size 10 \
--max-iter 5 \
--save-label ag6o2_prediction
The analyze command evaluates generated structures and computes comprehensive metrics including:
- Structural validity and composition validity
- Structural diversity and composition diversity
- Structural novelty and composition novelty (vs reference pool)
- Overall novelty (fraction of structures that are both compositionally and structurally novel)
- M3GNet metastability
- Stability rates (CHGNet)
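To make the "overall novelty" definition concrete, here is a small sketch of how the three novelty rates relate to each other. It is an illustration only, not the project's evaluation code: the function name `novelty_rates` is hypothetical, the per-structure novelty flags are assumed to be already computed against the reference pool, and the denominator is assumed to be the number of structures passed in.

```python
def novelty_rates(flags):
    """Compute novelty fractions from per-structure flags.

    `flags` is a list of (composition_novel, structure_novel) boolean pairs,
    one pair per evaluated structure (assumed precomputed elsewhere).
    """
    n = len(flags)
    if n == 0:
        return {"composition_novelty": 0.0, "structural_novelty": 0.0, "overall_novelty": 0.0}
    comp = sum(1 for c, _ in flags if c)
    struct = sum(1 for _, s in flags if s)
    # Overall novelty counts only structures that are novel in both senses.
    both = sum(1 for c, s in flags if c and s)
    return {
        "composition_novelty": comp / n,
        "structural_novelty": struct / n,
        "overall_novelty": both / n,
    }


example = [(True, True), (True, False), (False, True), (True, True)]
print(novelty_rates(example))
# {'composition_novelty': 0.75, 'structural_novelty': 0.75, 'overall_novelty': 0.5}
```

Because overall novelty requires both conditions, it can never exceed either of the individual novelty rates.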
Option 1: From a CSV file
python cli.py analyze \
--input data/llama_test.csv \
--output evaluation_results.json \
--data-path data/band_gap_processed.csv
Option 2: From a previous CSG run directory
python cli.py analyze \
--results-path logs/analyze_generation \
--output reevaluated_results.json \
--data-path data/band_gap_processed_5000.csv
This will look for generations.csv in the specified results path.
Generate structures using the CSG evolutionary workflow with API models and then evaluate:
python cli.py analyze --generate \
--model openai/gpt-5-mini \
--data-path data/band_gap_processed_5000.csv \
--max-iter 10 \
--population-size 10 \
--reproduction-size 5 \
--parent-size 2 \
--output gpt5_results.json
Key parameters for API generation:
- --generate: Flag to enable API generation (uses CSG workflow)
- --model: Model to use (e.g., openai/gpt-5-mini, openai/gpt-4o-mini)
- --data-path: Path to seed structures CSV (used as reference pool for novelty)
- --max-iter: Number of evolutionary iterations
- --population-size: Initial population size
- --reproduction-size: Number of offspring per generation
- --parent-size: Number of parent structures per group
Note: All generated structures are kept across iterations; deduplication happens once, after the final iteration and before evaluation.
When you run analyze --generate, the CSG workflow saves intermediate results to:
- logs/analyze_generation/generations.csv: All generated structures with properties
- logs/analyze_generation/metrics.csv: Per-iteration metrics
The final evaluation summary is saved to the --output file you specify (e.g., gpt5_results.json).
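For a quick look at these intermediate files without assuming anything about their columns, a sketch like the one below reports only the header and the number of records. The helper name `summarize_csv` is hypothetical, and the only assumption about the files is that they are ordinary CSVs with a header row. The standard `csv` module is used so that quoted multi-line fields, which structure strings may contain, are counted as one record each.

```python
import csv


def summarize_csv(path):
    """Return (column_names, record_count) for a CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # csv.reader yields one item per record, even if a quoted field spans lines.
        record_count = sum(1 for _ in reader)
    return header, record_count


if __name__ == "__main__":
    for name in ("generations.csv", "metrics.csv"):
        path = f"logs/analyze_generation/{name}"
        try:
            columns, records = summarize_csv(path)
        except FileNotFoundError:
            print(f"{path}: not found (run analyze --generate first)")
        else:
            print(f"{path}: {records} records, columns = {columns}")
```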
To re-evaluate a previous run:
python cli.py analyze \
--results-path logs/analyze_generation \
--output new_evaluation.json \
--data-path data/band_gap_processed_5000.csv
MatLLMSearch uses a unified model interface that supports both local models and API-based models.
Model configuration is handled via config/models.yaml and config/credentials.yaml files.
Optimization goals (--opt-goal):
- e_hull_distance: Minimize energy above convex hull (stability)
- bulk_modulus_relaxed: Maximize bulk modulus (mechanical properties)
- multi-obj: Multi-objective optimization combining both
Structure formats:
- poscar: VASP POSCAR format
- cif: Crystallographic Information File format
If you use MatLLMSearch in your research, please cite:
@misc{gan2025matllmsearch,
title={MatLLMSearch: Crystal Structure Discovery with Evolution-Guided Large Language Models},
author={Jingru Gan and Peichen Zhong and Yuanqi Du and Yanqiao Zhu and Chenru Duan and Haorui Wang and Daniel Schwalbe-Koda and Carla P. Gomes and Kristin A. Persson and Wei Wang},
year={2025},
eprint={2502.20933},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci},
url={https://arxiv.org/abs/2502.20933},
}






