A sample of the techniques from my MSc project using small language models to generate formative writing feedback.
- ⚙️ Tools
- ▶️ Try it for yourself
- 📝 Data and preprocessing
- 💬 Feedback in the dataset
- 🤖 Models
- 🔁 Optimising the prompt
- 🧠 The vector database
- 🦉 Evaluator
- Ollama: an open-source tool for running LLMs locally.
- DSPy: a framework that automates prompt optimisation and orchestrates LLMs.
- Langchain: a framework for developing LLM applications.
- Chroma: an open-source vector database.
- Pandas: a Python library for manipulation and analysis of structured data.
- Streamlit: a lightweight framework for building web apps in Python.
The full list of packages is in requirements.txt.
Have a look at the results on your local machine.
- Create a venv

```bash
python3 -m venv .venv
source .venv/bin/activate
```

- Install requirements

```bash
pip install -r requirements.txt
```

- Run the app

```bash
cd app
streamlit run app.py
```

The dataset created for the dissertation contains materials licensed for educational purposes only. The data used in this repo is adapted from the Persuade 2.0 Corpus. To emulate the examples of writing used in my dissertation (GCSE English students), I selected a random sample of 100 essays by tenth-grade students. From this set, I took 100 paragraphs longer than six words.
The columns 'ID', 'text' and 'score' in data.csv come from the original corpus.
For demonstration purposes, the feedback columns in data.csv are generated using the model developed during the project.
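A rough sketch of how that sampling could be reproduced with Pandas; the file name and the corpus column names (`grade_level`, `full_text`) below are assumptions about the Persuade 2.0 layout rather than the project's exact code:

```python
import pandas as pd

# Assumed file and column names for the Persuade 2.0 corpus.
corpus = pd.read_csv("persuade_2.0.csv")

# A random sample of 100 essays written by tenth-grade students
essays = corpus[corpus["grade_level"] == 10].sample(n=100, random_state=42)

# Split the essays into paragraphs, then keep 100 that are longer than six words
paragraphs = (
    essays["full_text"]
    .str.split("\n")
    .explode()
    .str.strip()
)
paragraphs = paragraphs[paragraphs.str.split().str.len() > 6].sample(n=100, random_state=42)
```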
- Llama3.2:3b for speedy inference.
- Mistral:7b for judging and more detailed inference.
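As a rough sketch, the two models can be pointed at a locally running Ollama server from DSPy; the model strings and settings below are illustrative rather than the project's exact configuration:

```python
import dspy

# Sketch: connect DSPy to the local Ollama server (default port 11434).
# Model tags mirror the list above; adjust to whatever `ollama list` shows locally.
fast_lm = dspy.LM("ollama_chat/llama3.2:3b", api_base="http://localhost:11434", api_key="")
judge_lm = dspy.LM("ollama_chat/mistral:7b", api_base="http://localhost:11434", api_key="")

# Generate with the small model by default; the judge model is used explicitly later.
dspy.configure(lm=fast_lm)
```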
This notebook uses a DSPy Module (WriteFeedback), the LabeledFewShot optimizer (teleprompter) and Evaluate to create an optimised few-shot prompt.
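A minimal sketch of how those pieces fit together, assuming a simple signature (paragraph in, feedback out) for WriteFeedback; the field names, the use of ChainOfThought and the value of k are illustrative, not the project's exact configuration:

```python
import dspy
from dspy.teleprompt import LabeledFewShot

# Assumed signature: the real WriteFeedback module may define different fields.
class FeedbackSignature(dspy.Signature):
    """Write formative feedback on a student's paragraph."""
    text = dspy.InputField(desc="the student's paragraph")
    feedback = dspy.OutputField(desc="formative feedback on the paragraph")

class WriteFeedback(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(FeedbackSignature)

    def forward(self, text):
        return self.generate(text=text)

# trainset: a list of dspy.Example(text=..., feedback=...).with_inputs("text")
# LabeledFewShot picks k labelled examples from it to include in the prompt.
teleprompter = LabeledFewShot(k=4)  # k is an assumption
optimised_writer = teleprompter.compile(student=WriteFeedback(), trainset=trainset)
```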
The training dataset is vectorised into documents with the following structure:

```python
from langchain_core.documents import Document

# Feedback is the content to retrieve; the original paragraph is kept as metadata
feedback_doc = Document(
    page_content=data['feedback'],
    metadata={'text': data['text']}
)
```

The documents are used as a knowledge base when the Evaluator is at work.
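As a rough sketch of how that knowledge base might be built, the documents can be loaded into Chroma through Langchain. The embedding model (`nomic-embed-text` pulled via Ollama), the collection name and the package imports below are assumptions, not necessarily what the project uses:

```python
# Sketch only: index the feedback documents in Chroma and query for similar examples.
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # assumed embedding model

vector_store = Chroma.from_documents(
    documents=feedback_docs,              # a list of Document objects built as above
    embedding=embeddings,
    collection_name="feedback_examples",  # assumed collection name
)

# Retrieve feedback written for the paragraphs most similar to a new piece of writing
similar = vector_store.similarity_search("An example student paragraph...", k=3)
for doc in similar:
    print(doc.page_content)
```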
The evaluator uses a test set of unseen examples and a quality metric. An optimum prompt is chosen by iterating over combinations of examples from the training data and scoring their quality.
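One plausible shape for that quality metric is an LLM-as-judge that asks Mistral to score each generated feedback out of 5, which DSPy's Evaluate then averages over the test set. The sketch below builds on the earlier snippets (`judge_lm`, `optimised_writer`) and is an assumption about the approach, not the project's exact code:

```python
import dspy
from dspy.evaluate import Evaluate

# Assumed judge signature: rate generated feedback on a 1-5 scale.
class JudgeFeedback(dspy.Signature):
    """Rate the quality of feedback given on a student's paragraph."""
    text = dspy.InputField()
    feedback = dspy.InputField()
    score = dspy.OutputField(desc="an integer from 1 (poor) to 5 (excellent)")

def feedback_quality(example, prediction, trace=None):
    # Ask the judge model (Mistral) to score the generated feedback.
    with dspy.context(lm=judge_lm):
        verdict = dspy.Predict(JudgeFeedback)(text=example.text, feedback=prediction.feedback)
    try:
        return int(verdict.score)
    except ValueError:
        return 0  # unparseable judge output counts as a zero

# testset: held-out dspy.Example objects not seen during optimisation
evaluator = Evaluate(devset=testset, metric=feedback_quality, display_progress=True)
evaluator(optimised_writer)
```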
The results for this optimisation:

```
Average Metric: 62.00 / 20 (310.0%): 100%|██████████| 20/20 [05:46<00:00, 17.33s/it]
2025/10/08 11:47:13 INFO dspy.evaluate.evaluate: Average Metric: 62 / 20 (310.0%)
```
(I struggle to make sense of this; if anyone can explain it to me, it'd be a great help!) Broadly, it says that across the 20 test examples, the average score given by the judge was 3.1 / 5. Not particularly great, but I think this can be attributed to the artificial data.