A sample of the techniques from my MSc project using small language models to generate formative writing feedback.
- ⚙️ Tools
- ▶️ Try it for yourself
- 📝 Data and preprocessing
- 💬 Feedback in the dataset
- 🤖 Models
- 🔁 Optimising the prompt
- 🧠 The vector database
- 🦉 Evaluator
- Ollama: an open-source tool for running LLMs locally.
- DSPy: a framework that automates prompt optimisation and orchestrates LLMs.
- Langchain: a framework for developing LLM applications.
- Chroma: an open-source vector database.
- Pandas: a Python library for manipulation and analysis of structured data.
- Streamlit: a lightweight framework for building web apps in Python.
The full list of packages is in requirements.txt.
Have a look at the results on your local machine.
- Create a venv

```bash
python3 -m venv .venv
source .venv/bin/activate
```

- Install requirements

```bash
pip install -r requirements.txt
```

- Run the app

```bash
cd app
streamlit run app.py
```

The dataset created for the dissertation contains materials licensed for educational purposes only. The data used in this repo is adapted from the Persuade 2.0 Corpus. To emulate the examples of writing used in my dissertation (GCSE English students), I selected a random sample of 100 essays by tenth-grade students. From this set, I took 100 paragraphs longer than six words.
The columns 'ID', 'text' and 'score' in data.csv come from the original corpus.
For demonstration purposes, the feedback columns in data.csv are generated using the model developed during the project.
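A rough sketch of how that sampling could be reproduced with Pandas; the file name and the corpus column names (`grade_level`, `full_text`) below are assumptions about the Persuade 2.0 layout rather than the project's exact code:

```python
import pandas as pd

# Assumed file and column names for the Persuade 2.0 corpus.
corpus = pd.read_csv("persuade_2.0.csv")

# A random sample of 100 essays written by tenth-grade students
essays = corpus[corpus["grade_level"] == 10].sample(n=100, random_state=42)

# Split the essays into paragraphs, then keep 100 that are longer than six words
paragraphs = (
    essays["full_text"]
    .str.split("\n")
    .explode()
    .str.strip()
)
paragraphs = paragraphs[paragraphs.str.split().str.len() > 6].sample(n=100, random_state=42)
```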
- Llama3.2:3b for speedy inference.
- Mistral:7b for judging and more detailed inference.
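As a rough sketch, the two models can be pointed at a locally running Ollama server from DSPy; the model strings and settings below are illustrative rather than the project's exact configuration:

```python
import dspy

# Sketch: connect DSPy to the local Ollama server (default port 11434).
# Model tags mirror the list above; adjust to whatever `ollama list` shows locally.
fast_lm = dspy.LM("ollama_chat/llama3.2:3b", api_base="http://localhost:11434", api_key="")
judge_lm = dspy.LM("ollama_chat/mistral:7b", api_base="http://localhost:11434", api_key="")

# Generate with the small model by default; the judge model is used explicitly later.
dspy.configure(lm=fast_lm)
```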
This notebook uses a DSPy Module (WriteFeedback), the LabeledFewShot optimizer (teleprompter) and Evaluate to create an optimised few-shot prompt.
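A minimal sketch of how those pieces fit together, assuming a simple signature (paragraph in, feedback out) for WriteFeedback; the field names, the use of ChainOfThought and the value of k are illustrative, not the project's exact configuration:

```python
import dspy
from dspy.teleprompt import LabeledFewShot

# Assumed signature: the real WriteFeedback module may define different fields.
class FeedbackSignature(dspy.Signature):
    """Write formative feedback on a student's paragraph."""
    text = dspy.InputField(desc="the student's paragraph")
    feedback = dspy.OutputField(desc="formative feedback on the paragraph")

class WriteFeedback(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(FeedbackSignature)

    def forward(self, text):
        return self.generate(text=text)

# trainset: a list of dspy.Example(text=..., feedback=...).with_inputs("text")
# LabeledFewShot picks k labelled examples from it to include in the prompt.
teleprompter = LabeledFewShot(k=4)  # k is an assumption
optimised_writer = teleprompter.compile(student=WriteFeedback(), trainset=trainset)
```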
The training dataset is vectorised into documents with the following structure:

```python
from langchain_core.documents import Document

# Feedback is the content to retrieve; the original paragraph is kept as metadata
feedback_doc = Document(
    page_content=data['feedback'],
    metadata={'text': data['text']}
)
```

The documents are used as a knowledge base when the Evaluator is at work.
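As a rough sketch of how that knowledge base might be built, the documents can be loaded into Chroma through Langchain. The embedding model (`nomic-embed-text` pulled via Ollama), the collection name and the package imports below are assumptions, not necessarily what the project uses:

```python
# Sketch only: index the feedback documents in Chroma and query for similar examples.
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # assumed embedding model

vector_store = Chroma.from_documents(
    documents=feedback_docs,              # a list of Document objects built as above
    embedding=embeddings,
    collection_name="feedback_examples",  # assumed collection name
)

# Retrieve feedback written for the paragraphs most similar to a new piece of writing
similar = vector_store.similarity_search("An example student paragraph...", k=3)
for doc in similar:
    print(doc.page_content)
```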
The evaluator uses a test set of unseen examples and a quality metric. An optimum prompt is chosen by iterating over combinations of examples from the training data and scoring their quality.
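One plausible shape for that quality metric is an LLM-as-judge that asks Mistral to score each generated feedback out of 5, which DSPy's Evaluate then averages over the test set. The sketch below builds on the earlier snippets (`judge_lm`, `optimised_writer`) and is an assumption about the approach, not the project's exact code:

```python
import dspy
from dspy.evaluate import Evaluate

# Assumed judge signature: rate generated feedback on a 1-5 scale.
class JudgeFeedback(dspy.Signature):
    """Rate the quality of feedback given on a student's paragraph."""
    text = dspy.InputField()
    feedback = dspy.InputField()
    score = dspy.OutputField(desc="an integer from 1 (poor) to 5 (excellent)")

def feedback_quality(example, prediction, trace=None):
    # Ask the judge model (Mistral) to score the generated feedback.
    with dspy.context(lm=judge_lm):
        verdict = dspy.Predict(JudgeFeedback)(text=example.text, feedback=prediction.feedback)
    try:
        return int(verdict.score)
    except ValueError:
        return 0  # unparseable judge output counts as a zero

# testset: held-out dspy.Example objects not seen during optimisation
evaluator = Evaluate(devset=testset, metric=feedback_quality, display_progress=True)
evaluator(optimised_writer)
```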
The results for this optimisation:

```
Average Metric: 62.00 / 20 (310.0%): 100%|██████████| 20/20 [05:46<00:00, 17.33s/it]
2025/10/08 11:47:13 INFO dspy.evaluate.evaluate: Average Metric: 62 / 20 (310.0%)
```
(I struggle to make sense of this; if anyone can explain it to me, it'd be a great help!) Broadly, it says that across the 20 test examples, the average score given by the judge was 3.1 / 5. Not particularly great, but I think this can be attributed to the artificial data.