Article-Retrieval-System

Nokia - Machine Learning Summer Trainee - recruitment task

Environment setup

Create a new virtual environment:

conda create --name rag python=3.10

Activate environment:

conda activate rag

Update pip version:

python -m pip install --upgrade pip

Install the required packages:

python -m pip install -e .[dev] --extra-index-url https://download.pytorch.org/whl/cu121

Usage

1. Preprocess the data

Default usage:

python .\scripts\preprocess.py

Optional usage:

python .\scripts\preprocess.py <source_file_path> <target_file_path>

The script preprocess.py loads a .csv file with the following structure:

Title	Text
Example title	Example Text
...	...

It then cleans up the text and divides it into chunks of 10 sentences. The result is written to a .csv file with the following structure:

ID	Title	sentence_chunk	embedding
...	...	...	...
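The chunking step can be illustrated with a minimal sketch. It is not the code in preprocess.py: the function names, the regex-based sentence splitter, and the whitespace-only cleanup are assumptions made for illustration, and the embedding column is left out because it depends on the model in src\ai.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_sentences(sentences: list[str], chunk_size: int = 10) -> list[str]:
    """Join consecutive sentences into chunks of at most chunk_size sentences."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


def preprocess_rows(rows, chunk_size: int = 10) -> list[dict]:
    """Turn (title, text) pairs into rows with ID, Title and sentence_chunk."""
    output = []
    for title, text in rows:
        # Hypothetical cleanup: collapse runs of whitespace into single spaces.
        cleaned = " ".join(text.split())
        for chunk in chunk_sentences(split_sentences(cleaned), chunk_size):
            output.append(
                {"ID": len(output), "Title": title, "sentence_chunk": chunk}
            )
    return output
```

With this sketch, an article of 25 sentences yields three rows: two chunks of 10 sentences and a final chunk of 5.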

2. Search queries

Default usage:

python .\scripts\search.py

Optional usage:

python .\scripts\search.py <source_file_path>

After loading the dataset, the program enters a loop in which the user can repeatedly type queries to search for. The loop ends when the user types the exit keyword.
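The query loop might look roughly like the sketch below. It is an illustration, not the code in search.py: the function name, the injectable read_query parameter, the literal keyword "exit", and the (distance, chunk) result shape are all assumptions.

```python
def query_loop(search, read_query=input, exit_keyword="exit"):
    """Read queries until the exit keyword is entered.

    search is any callable that maps a query string to a list of
    (distance, chunk) pairs; read_query defaults to the built-in input.
    """
    while True:
        query = read_query().strip()
        if query.lower() == exit_keyword.lower():
            break
        if not query:
            # Skip empty lines instead of searching for nothing.
            continue
        for rank, (distance, chunk) in enumerate(search(query), start=1):
            print(f"{rank}. distance={distance:.4f}  {chunk}")
```

Passing read_query as a parameter keeps the loop testable without a terminal; in interactive use the default input is enough.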

The model returns the top 5 results for each query. The lower the distance, the more similar the retrieved chunk is to the query.
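The ranking can be sketched as a nearest-neighbour lookup over the stored embeddings. The README does not state which distance metric is used, so the Euclidean distance below is an assumption, as are the function name and the plain-list representation of the embeddings.

```python
import heapq
import math


def top_k(query_embedding, chunk_embeddings, k=5):
    """Return the k closest chunks as (distance, index) pairs, nearest first.

    Euclidean distance is an assumption; the project may use another metric.
    """
    scored = (
        (math.dist(query_embedding, embedding), index)
        for index, embedding in enumerate(chunk_embeddings)
    )
    return heapq.nsmallest(k, scored)
```

Each returned index points back to a row of the preprocessed .csv file, so the matching Title and sentence_chunk can be looked up from it.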

Project structure

The data directory contains the "1300 Towards Data Science Medium Articles" dataset. By default, the result of the data preprocessing is also saved here.

The scripts directory contains the scripts for dataset preprocessing as well as query searching.

The src\ai directory contains two files:

  model.py contains the model used for generating vector embeddings

  utils.py contains various helper functions

AgataGro/Article-Retrieval-System

Folders and files

Latest commit

History

Repository files navigation

Article-Retrieval-System

Environment setup

Usage

1. Preprocess the data

2. Search queries

Project structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages