SyGra - Graph-oriented Synthetic data generation Pipeline
Updated Jan 9, 2026 - Python
👨‍🏫 This project was developed under the guidance of Mr. Lokesh Sir as part of the AI & ML Training Program. It explores LLM integration using Google Gemini APIs with a custom UI built on Streamlit.
🔧 Modular pipeline for generating high-quality, domain-specific datasets for LLM fine-tuning — from PDFs and web scraping to synthetic Q&A generation, quality filtering, and training-ready formatting.
Sample edition of The Stack Enriched, an annotated, secure, and optimized code dataset.