Classification Tabular Datasets

This repository contains numerous benchmark tabular datasets for testing various classification algorithms.

See also the companion repositories: Regression Tabular Datasets and Survival Analysis Tabular Datasets.

All datasets are split into train and test parts. Where the whole unsplit dataset was available, it is additionally split into 10-fold cross-validation folds. Each dataset contains a target column named class for prediction. The data is stored in the Parquet format, which can be loaded easily with the pandas Python package.

| Dataset name | Rows | Columns | Missing values | Classes |
|---|---|---|---|---|
| adult | 32561 | 14 (6 numerical and 8 categorical) | yes | 2 |
| anneal | 898 | 38 (6 numerical and 32 categorical) | no | 5 |
| audiology | 226 | 69 (0 numerical and 69 categorical) | yes | 24 |
| auto-mpg | 398 | 7 (5 numerical and 2 categorical) | yes | 3 |
| autos | 205 | 25 (15 numerical and 10 categorical) | yes | 6 |
| balance-scale | 625 | 4 (4 numerical and 0 categorical) | no | 3 |
| breast-cancer | 286 | 9 (0 numerical and 9 categorical) | yes | 2 |
| bupa-liver-disorders | 345 | 6 (6 numerical and 0 categorical) | no | 2 |
| car | 1728 | 6 (0 numerical and 6 categorical) | no | 4 |
| churn | 4250 | 19 (15 numerical and 4 categorical) | no | 2 |
| cleveland | 303 | 13 (6 numerical and 7 categorical) | yes | 5 |
| connect-4 | 67557 | 42 (0 numerical and 42 categorical) | no | 3 |
| covertype | 581011 | 54 (10 numerical and 44 categorical) | no | 7 |
| credit-a | 690 | 15 (6 numerical and 9 categorical) | yes | 2 |
| credit-g | 1000 | 20 (7 numerical and 13 categorical) | no | 2 |
| cylinder-bands | 540 | 35 (17 numerical and 18 categorical) | yes | 2 |
| diabetes | 768 | 8 (8 numerical and 0 categorical) | no | 2 |
| echocardiogram | 131 | 11 (9 numerical and 2 categorical) | yes | 2 |
| ecoli | 336 | 7 (7 numerical and 0 categorical) | no | 8 |
| flag | 194 | 28 (10 numerical and 18 categorical) | no | 4 |
| glass | 214 | 9 (9 numerical and 0 categorical) | no | 6 |
| haberman | 306 | 3 (3 numerical and 0 categorical) | no | 2 |
| hayes-roth | 132 | 4 (0 numerical and 4 categorical) | no | 3 |
| heart-c | 303 | 13 (6 numerical and 7 categorical) | yes | 2 |
| heart-statlog | 270 | 13 (7 numerical and 6 categorical) | no | 2 |
| hepatitis | 155 | 19 (6 numerical and 13 categorical) | yes | 2 |
| horse-colic | 368 | 22 (7 numerical and 15 categorical) | yes | 2 |
| hr-evaluation | 54808 | 11 (5 numerical and 6 categorical) | yes | 2 |
| hungarian-heart-disease | 294 | 13 (6 numerical and 7 categorical) | yes | 2 |
| iris | 150 | 4 (4 numerical and 0 categorical) | no | 3 |
| lymphography | 148 | 18 (3 numerical and 15 categorical) | no | 4 |
| magic | 19020 | 10 (10 numerical and 0 categorical) | no | 2 |
| monk-1 | 324 | 6 (0 numerical and 6 categorical) | no | 2 |
| monk-2 | 369 | 6 (0 numerical and 6 categorical) | no | 2 |
| monk-3 | 554 | 6 (0 numerical and 6 categorical) | no | 2 |
| mushroom | 8124 | 22 (0 numerical and 22 categorical) | yes | 2 |
| nursery | 12960 | 8 (0 numerical and 8 categorical) | no | 5 |
| phoneme | 5404 | 5 (5 numerical and 0 categorical) | no | 2 |
| poker-hand | 1025010 | 10 (10 numerical and 0 categorical) | no | 10 |
| segment | 2310 | 19 (19 numerical and 0 categorical) | no | 7 |
| seismic-bumps | 2584 | 18 (14 numerical and 4 categorical) | no | 2 |
| sonar | 208 | 60 (60 numerical and 0 categorical) | no | 2 |
| soybean | 683 | 35 (0 numerical and 35 categorical) | yes | 19 |
| tic-tac-toe | 958 | 9 (0 numerical and 9 categorical) | no | 2 |
| titanic | 2201 | 3 (0 numerical and 3 categorical) | no | 2 |
| vote | 435 | 16 (0 numerical and 16 categorical) | yes | 2 |
| wilt | 4839 | 5 (5 numerical and 0 categorical) | no | 2 |
| wine | 178 | 13 (13 numerical and 0 categorical) | no | 3 |
| zoo | 101 | 16 (1 numerical and 15 categorical) | no | 7 |

A summary of all datasets is also available as a CSV file, datasets_summary.csv, in the repository root.

Reading datasets using Python package

This repository also contains a tiny Python package which allows you to use the datasets without cloning the whole repository.

To install it, use the following command:

pip install git+https://github.com/cezary986/classification_tabular_datasets

The package exports two functions: read_full_dataset and read_dataset_train_test. The first reads the full dataset and returns a tuple (X, y), where X contains the features and y the labels. The second reads the dataset split into train and test parts and returns a tuple (X_train, y_train, X_test, y_test).

Example:

import clfdatasets

# print the list of all available datasets
print(", ".join(clfdatasets.AVAILABLE_DATASETS))

# read the whole dataset without a train/test split
X, y = clfdatasets.read_full_dataset("iris")

# read the dataset split into train/test parts
X_train, y_train, X_test, y_test = clfdatasets.read_dataset_train_test("iris")

# read a given cross-validation fold of the dataset
X_train, y_train, X_test, y_test = clfdatasets.read_dataset_train_test("iris", cv_fold=3)
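The returned arrays plug straight into any scikit-learn estimator. A minimal sketch of that downstream step, substituting scikit-learn's bundled iris data and train_test_split so the snippet runs without the repository files (with the package installed, clfdatasets.read_dataset_train_test would supply the same four arrays):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for clfdatasets.read_dataset_train_test("iris"):
# any (X_train, y_train, X_test, y_test) split works the same way.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit a classifier on the train part and evaluate on the test part.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The same pattern extends to the cross-validation folds: call read_dataset_train_test once per cv_fold and average the per-fold scores.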
