This repository contains numerous benchmark tabular datasets for testing various classification algorithms.
See also the Regression Tabular Datasets and Survival Analysis Tabular Datasets repositories.
All datasets are split into train and test parts. When the whole unsplit dataset was available, it is additionally split into 10-fold cross-validation datasets. Each dataset contains a target column named `class` for prediction. Files are stored in the Parquet format, which can be easily loaded using the pandas Python package.
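Because the target column is always named `class`, separating features from labels works the same way for every dataset once a file is loaded. A minimal sketch, using a toy DataFrame as a stand-in for a loaded Parquet file (the exact file paths inside the repository are not assumed here):

```python
import pandas as pd

# Toy stand-in for a dataset loaded with pd.read_parquet(...)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
    "class": ["setosa", "setosa", "virginica"],
})

# The target column is always named "class",
# so the feature/label split is uniform across datasets.
X = df.drop(columns=["class"])
y = df["class"]
```

The same two lines apply unchanged to any train, test, or cross-validation file in the repository.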
| Dataset name | Rows | Columns | Missing values | Classes |
|---|---|---|---|---|
| adult | 32561 | 14 (6 numerical and 8 categorical) | yes | 2 |
| anneal | 898 | 38 (6 numerical and 32 categorical) | no | 5 |
| audiology | 226 | 69 (0 numerical and 69 categorical) | yes | 24 |
| auto-mpg | 398 | 7 (5 numerical and 2 categorical) | yes | 3 |
| autos | 205 | 25 (15 numerical and 10 categorical) | yes | 6 |
| balance-scale | 625 | 4 (4 numerical and 0 categorical) | no | 3 |
| breast-cancer | 286 | 9 (0 numerical and 9 categorical) | yes | 2 |
| bupa-liver-disorders | 345 | 6 (6 numerical and 0 categorical) | no | 2 |
| car | 1728 | 6 (0 numerical and 6 categorical) | no | 4 |
| churn | 4250 | 19 (15 numerical and 4 categorical) | no | 2 |
| cleveland | 303 | 13 (6 numerical and 7 categorical) | yes | 5 |
| connect-4 | 67557 | 42 (0 numerical and 42 categorical) | no | 3 |
| covertype | 581011 | 54 (10 numerical and 44 categorical) | no | 7 |
| credit-a | 690 | 15 (6 numerical and 9 categorical) | yes | 2 |
| credit-g | 1000 | 20 (7 numerical and 13 categorical) | no | 2 |
| cylinder-bands | 540 | 35 (17 numerical and 18 categorical) | yes | 2 |
| diabetes | 768 | 8 (8 numerical and 0 categorical) | no | 2 |
| echocardiogram | 131 | 11 (9 numerical and 2 categorical) | yes | 2 |
| ecoli | 336 | 7 (7 numerical and 0 categorical) | no | 8 |
| flag | 194 | 28 (10 numerical and 18 categorical) | no | 4 |
| glass | 214 | 9 (9 numerical and 0 categorical) | no | 6 |
| haberman | 306 | 3 (3 numerical and 0 categorical) | no | 2 |
| hayes-roth | 132 | 4 (0 numerical and 4 categorical) | no | 3 |
| heart-c | 303 | 13 (6 numerical and 7 categorical) | yes | 2 |
| heart-statlog | 270 | 13 (7 numerical and 6 categorical) | no | 2 |
| hepatitis | 155 | 19 (6 numerical and 13 categorical) | yes | 2 |
| horse-colic | 368 | 22 (7 numerical and 15 categorical) | yes | 2 |
| hr-evaluation | 54808 | 11 (5 numerical and 6 categorical) | yes | 2 |
| hungarian-heart-disease | 294 | 13 (6 numerical and 7 categorical) | yes | 2 |
| iris | 150 | 4 (4 numerical and 0 categorical) | no | 3 |
| lymphography | 148 | 18 (3 numerical and 15 categorical) | no | 4 |
| magic | 19020 | 10 (10 numerical and 0 categorical) | no | 2 |
| monk-1 | 324 | 6 (0 numerical and 6 categorical) | no | 2 |
| monk-2 | 369 | 6 (0 numerical and 6 categorical) | no | 2 |
| monk-3 | 554 | 6 (0 numerical and 6 categorical) | no | 2 |
| mushroom | 8124 | 22 (0 numerical and 22 categorical) | yes | 2 |
| nursery | 12960 | 8 (0 numerical and 8 categorical) | no | 5 |
| phoneme | 5404 | 5 (5 numerical and 0 categorical) | no | 2 |
| poker-hand | 1025010 | 10 (10 numerical and 0 categorical) | no | 10 |
| segment | 2310 | 19 (19 numerical and 0 categorical) | no | 7 |
| seismic-bumps | 2584 | 18 (14 numerical and 4 categorical) | no | 2 |
| sonar | 208 | 60 (60 numerical and 0 categorical) | no | 2 |
| soybean | 683 | 35 (0 numerical and 35 categorical) | yes | 19 |
| tic-tac-toe | 958 | 9 (0 numerical and 9 categorical) | no | 2 |
| titanic | 2201 | 3 (0 numerical and 3 categorical) | no | 2 |
| vote | 435 | 16 (0 numerical and 16 categorical) | yes | 2 |
| wilt | 4839 | 5 (5 numerical and 0 categorical) | no | 2 |
| wine | 178 | 13 (13 numerical and 0 categorical) | no | 3 |
| zoo | 101 | 16 (1 numerical and 15 categorical) | no | 7 |
A dataset summary is also available as a CSV file, `datasets_summary.csv`, in the root directory of the repository.
This repository also contains a tiny Python package that lets you use the datasets without cloning the whole repository.
To install it, run:

`pip install git+https://github.com/cezary986/classification_tabular_datasets`

The package exports two functions: `read_full_dataset` and `read_dataset_train_test`.
The first reads the full dataset and returns a tuple `(X, y)`, where `X` is the data and `y` the labels.
The second reads the dataset split into train and test parts and returns a tuple `(X_train, y_train, X_test, y_test)`.
Example:

```python
import clfdatasets

# print the list of all available datasets
print(", ".join(clfdatasets.AVAILABLE_DATASETS))

# read the whole dataset without a train/test split
X, y = clfdatasets.read_full_dataset("iris")

# read the dataset split into train/test parts
X_train, y_train, X_test, y_test = clfdatasets.read_dataset_train_test("iris")

# read a given cross-validation fold of the dataset
X_train, y_train, X_test, y_test = clfdatasets.read_dataset_train_test("iris", cv_fold=3)
```