Recommendation systems are one of the most lucrative applications of machine learning. Large companies have increased their sales, and thereby their revenue, by employing ML-powered recommendation engines in their applications. Companies that use some form of recommendation system include Amazon, Netflix, YouTube, and Facebook.
This project is a recommendation system for books. The user rates a few books in a JSON file, and the system prints the titles of the recommended books.
Only a collaborative filtering algorithm is used here. The algorithm is written from scratch, so training may be slower than with an optimized library implementation.
Everything from data loading to training is done in Python and its libraries. Numeric computation is performed with NumPy, Python's numeric library. Pandas and scikit-learn (Sklearn) are used for utility functions.
Python 3 should be used as the interpreter. The required libraries are:
- NumPy
- Sklearn
- Pandas
Assuming Python 3 and pip3 are used, install the dependencies with:
- sudo pip3 install numpy
- sudo pip3 install pandas
- sudo pip3 install scikit-learn
- Provide your ratings for books in the user/ratings.json file. The keys are the id values of the books as given in the books.csv file, and the values are your ratings for those books.
- Run the file recommender.py:
python3 recommender.py
The titles of the recommended books are then printed.
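As an illustration of the input described above, the following sketch writes a ratings file. The book ids, the 1 to 5 rating scale, and the helper name `write_ratings` are assumptions made for this example, not part of the repository; check the existing ratings.json for the exact format before relying on it.

```python
import json
from pathlib import Path

# Hypothetical ratings: keys are book ids from books.csv, values are ratings.
# The ids and the 1-5 scale used here are assumptions for illustration.
my_ratings = {"1": 5, "2": 3, "18": 4}


def write_ratings(ratings, path):
    """Validate the ratings and write them to a JSON file."""
    for book_id, rating in ratings.items():
        if not 1 <= rating <= 5:
            raise ValueError(f"rating for book {book_id} is out of range: {rating}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(ratings, indent=2))
    return path


if __name__ == "__main__":
    # Overwrites user/ratings.json relative to the current directory.
    write_ratings(my_ratings, "user/ratings.json")
```

JSON object keys are always strings, so the ids are written as strings here even though they are numeric in books.csv.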
This section helps developers build their own applications on top of this recommendation system. Developers are encouraged to build other projects on top of it, improve aspects of the existing repository, or contribute new features and ideas. The project is MIT licensed, so the terms and conditions given in the LICENCE file must be followed. For any other queries, contact the owner of this repository.
- data: Contains two subdirectories, raw and processed. The processed directory was created to hold cleaned data, but because the data was already clean with no missing values, the csv files in the raw directory are used directly.
- data_exploration: Contains the exploration.ipynb notebook for data exploration. Developers are encouraged to explore the files under the data/raw directory first to get a general understanding of the data.
- features: Contains the feature matrices for both users and books. The features are serialized in pickle format, so developers need to unpickle them before use.
- user: Contains ratings.json, which holds the ratings given by the user to whom recommendations are to be shown. See usage for details.
- collaborative_filtering.py: Contains the collaborative filtering algorithm and an evaluator function that measures performance on the test set (2% of the whole dataset). Root Mean Squared Error (RMSE) is the evaluation metric.
- feature_generator.py: Generates the feature matrices for books and users in pickle format. The pickle files contain serialized NumPy arrays.
- recommender.py: Recommends book titles to the user by reading the user/ratings.json file.
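The evaluation described for collaborative_filtering.py can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the random split, and the toy data are assumptions. Only the 2% test fraction and the RMSE metric come from the description above.

```python
import numpy as np


def split_ratings(ratings, test_fraction=0.02, seed=0):
    """Randomly split rows of (user, book, rating) triples into train and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(ratings))
    n_test = max(1, int(round(test_fraction * len(ratings))))
    return ratings[indices[n_test:]], ratings[indices[:n_test]]


def rmse(user_features, book_features, ratings):
    """Root mean squared error of dot-product predictions on rating triples."""
    users = ratings[:, 0].astype(int)
    books = ratings[:, 1].astype(int)
    predicted = np.sum(user_features[users] * book_features[books], axis=1)
    return float(np.sqrt(np.mean((ratings[:, 2] - predicted) ** 2)))


# Toy data (assumed): 100 ratings from 5 users on 4 books, 3 latent features.
rng = np.random.default_rng(42)
ratings = np.column_stack([
    rng.integers(0, 5, size=100),   # user index
    rng.integers(0, 4, size=100),   # book index
    rng.integers(1, 6, size=100),   # rating from 1 to 5
])
train, test = split_ratings(ratings)
user_features = rng.normal(size=(5, 3))   # untrained, random features
book_features = rng.normal(size=(4, 3))
print("test RMSE with random features:", rmse(user_features, book_features, test))
```

A predicted rating here is the dot product of a user's feature vector and a book's feature vector, which is the usual choice for latent-feature collaborative filtering.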
- The number of latent features (K) is a hyperparameter that can be tuned for better performance.
- Parameter updates can be vectorized to improve training speed.
- More advanced optimization methods can be used so that later training epochs act as fine-tuning epochs.
- Mini-batch gradient descent can be used to reduce training time.
- SVD matrix factorization can be tested, with the unrated elements of the sparse matrix set to 0, to see how it behaves.
- The learning rate can be tuned, and regularization can be added if the model overfits the training data.
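Several of the ideas in this list (a tunable K, vectorized updates, mini-batch gradient descent, learning rate, and regularization) can be combined in one training loop. The sketch below is one possible way to do that, not the implementation in collaborative_filtering.py; the function name `train_minibatch`, its default values, and the toy data are all assumptions.

```python
import numpy as np


def train_minibatch(ratings, n_users, n_books, k=10, learning_rate=0.01,
                    regularization=0.1, batch_size=256, epochs=20, seed=0):
    """Factorize (user, book, rating) triples into feature matrices with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    user_features = rng.normal(scale=0.1, size=(n_users, k))
    book_features = rng.normal(scale=0.1, size=(n_books, k))
    users = ratings[:, 0].astype(int)
    books = ratings[:, 1].astype(int)
    values = ratings[:, 2].astype(float)

    for _ in range(epochs):
        order = rng.permutation(len(values))            # reshuffle every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            u, b, r = users[batch], books[batch], values[batch]
            p, q = user_features[u], book_features[b]   # copies (fancy indexing)
            error = r - np.sum(p * q, axis=1)           # one error per rating
            # Descent directions of the regularized squared error,
            # computed for the whole batch at once (summed, not averaged).
            step_p = error[:, None] * q - regularization * p
            step_q = error[:, None] * p - regularization * q
            # np.add.at accumulates correctly when an id repeats within a batch.
            np.add.at(user_features, u, learning_rate * step_p)
            np.add.at(book_features, b, learning_rate * step_q)
    return user_features, book_features


# Toy data (assumed): each of 20 users rates each of 15 books with a random 1-5 rating.
rng = np.random.default_rng(1)
toy = np.array([(u, b, rng.integers(1, 6)) for u in range(20) for b in range(15)])
users_f, books_f = train_minibatch(toy, n_users=20, n_books=15, k=3,
                                   batch_size=32, epochs=100)
```

Because the per-rating steps are summed rather than averaged, a larger batch size may need a smaller learning rate to stay stable.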
https://www.kaggle.com/zygmunt/goodbooks-10k
About 2k rows in the ratings.csv file are duplicated. These rows can be removed because they provide no new information.
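Removing the duplicated rows can be sketched with Pandas as below. The small DataFrame is a made-up stand-in for ratings.csv, and the column names are assumptions; for the real file, load it with `pd.read_csv` instead.

```python
import pandas as pd

# Toy stand-in for ratings.csv; the column names are assumed for illustration.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "book_id": [10, 10, 10, 20, 30],
    "rating": [5, 5, 4, 3, 2],
})

# A row counts as duplicated only if every column matches an earlier row.
n_duplicates = int(ratings.duplicated().sum())
deduplicated = ratings.drop_duplicates().reset_index(drop=True)
print(f"removed {n_duplicates} duplicated rows, {len(deduplicated)} remain")
```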
Data released under: CC BY-SA 4.0
MIT License