Skip to content

scardenol/DecisionTrees

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 

Repository files navigation

Authors

Danilo de Jesús Toro Echeverri Salomón Cardeño Luján

Context

The given context is to make predictions on the academic success (éxito académico) in higher education (educación superior) using decision trees. The academic success in this scope is defined as the probability that a student gets a total score superior to his cohort's average, in the Pruebas Saber Pro test.

Problem:

Design a decision trees algorithm from scratch in pure Python and use the Saber 11 data to predict whether a student will have a total score, in the Pruebas Saber Pro, above average or not.

About the data

The datasets are available as two sepparate .csv files:

  • 0_train_balanced_15000.csv is the train data.
  • 0_test_balanced_5000.csv is the test data.

Both datasets are already preprocessed and balanced.

General description

The program takes a .csv file of training data, builds a decision tree based on the CART algorithm and classifies a .csv file of testing data, making predictions for the column with label exito (success).

Code execution

To run the code simply execute the main.py file, which is the run script for the code. A flow of the execution process as well as the main functions called is shown in the image below.

Imgur

The functions in orange represent the functions loaded from the Preprocessing.py module, while the ones in green are loded from the Decision_Tree.py. An example is already programmed in the main script, and as the execution flow shows the output is seen in terminal. Generally speaking, each function's role is as follows:

preprocess_data This function is based on the pandas library to read a csv file. It utilizes the read_csv, convert_dtypes and columns functions of pandas. It handles the preprocessing process of the data by properly handling the filename, separator, na values and data types of the array obtained using pandas.

convert_to_list A basic function that converts the array obtained by using pandas into a list. It utilizes the numpy function array() to handle it as a numpy array an then it uses the function tolist() to finish the process.

build_tree This function constructs the decision tree of a given dataset based on the CART algorithm. It has control over the maximum depth allowed for the tree; by default it has no value so it grows the whole tree, but it can be changed to any positive integer to determine the depth to stop building. To keep track of the depth it uses a level parameter with initial value 0 that isn't supposed to be changed by the user, but it's necessary when making the recursion calls. Since the algorithm is recursive, we'll get a Question node or an instance of a Decision_Node after each recursive call until the base case is reached, and when this happens, we'll get a Leaf.

print_tree This function is used to print the tree previously built. It's implemented recursively and prints the tree by printing it node by node.

classify This function takes a row of the data set and the instance of Decision_Node resulted when building the tree. It decides whether to follow the true-branch or the false-branch, compares the feature and value stored in the node (Tree) to the example (row) we're considering. The base case is reached when we've reached a leaf, in that case we return the attribute predictions of the Leaf instance. This function is usually used within a for cycle that goes through the testing dataset.

print_leaf This function takes the output of the classify function (counts dictionary) and constructs a dictionary with the proportion (probability) instead of counts.

User guide

Follow the example below to run the code on your own custom input parameters.

0. Open a python terminal on the project directory and make the following imports:

import Preprocessing as p
import Decision_Tree as d

1. Call the preprocess_data function for the training and testing data with adequate input parameters as follows:

train, col_names = p.preprocess_data(
    "0_train_balanced_15000.csv", sep=";", keep_default_na=False)

test, _ = p.preprocess_data(
    "0_test_balanced_5000.csv", sep=";", keep_default_na=False)

2. followed by the convert_to_list function for each dataset

ltrain = p.convert_to_list(train.iloc[0:150, :])
ltest = p.convert_to_list(test.iloc[0:50, :])

3. build the tree with build_tree and print it with print_tree

T = d.build_tree(ltrain, max_depth=2)
d.print_tree(T)

4. use the function predict with the following code:

d.predict(ltest, T)

Operating system version: Microsoft Windows 10 Home Single Language

Python version: >= 3.9.0

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages