This repository contains numerous benchmark tabular datasets for testing various classification algorithms.
See also the Regression Tabular Datasets and Survival Analysis Tabular Datasets repositories.
All datasets are split into train and test parts. When the whole unsplit dataset was available, it is additionally split into 10-fold cross-validation datasets. Each dataset contains a target column named `class` for prediction. Files are stored in the Parquet format, which can be easily loaded using the pandas Python package.
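Because the target column is always named `class`, separating features from labels works the same way for every dataset once a file is loaded. A minimal sketch, using a toy DataFrame as a stand-in for a loaded Parquet file (the exact file paths inside the repository are not assumed here):

```python
import pandas as pd

# Toy stand-in for a dataset loaded with pd.read_parquet(...)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
    "class": ["setosa", "setosa", "virginica"],
})

# The target column is always named "class",
# so the feature/label split is uniform across datasets.
X = df.drop(columns=["class"])
y = df["class"]
```

The same two lines apply unchanged to any train, test, or cross-validation file in the repository.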
| Dataset name | Rows | Columns | Missing values | Classes |
|---|---|---|---|---|
| adult | 32561 | 14 (6 numerical and 8 categorical) | yes | 2 |
| anneal | 898 | 38 (6 numerical and 32 categorical) | no | 5 |
| audiology | 226 | 69 (0 numerical and 69 categorical) | yes | 24 |
| auto-mpg | 398 | 7 (5 numerical and 2 categorical) | yes | 3 |
| autos | 205 | 25 (15 numerical and 10 categorical) | yes | 6 |
| balance-scale | 625 | 4 (4 numerical and 0 categorical) | no | 3 |
| breast-cancer | 286 | 9 (0 numerical and 9 categorical) | yes | 2 |
| bupa-liver-disorders | 345 | 6 (6 numerical and 0 categorical) | no | 2 |
| car | 1728 | 6 (0 numerical and 6 categorical) | no | 4 |
| churn | 4250 | 19 (15 numerical and 4 categorical) | no | 2 |
| cleveland | 303 | 13 (6 numerical and 7 categorical) | yes | 5 |
| connect-4 | 67557 | 42 (0 numerical and 42 categorical) | no | 3 |
| covertype | 581011 | 54 (10 numerical and 44 categorical) | no | 7 |
| credit-a | 690 | 15 (6 numerical and 9 categorical) | yes | 2 |
| credit-g | 1000 | 20 (7 numerical and 13 categorical) | no | 2 |
| cylinder-bands | 540 | 35 (17 numerical and 18 categorical) | yes | 2 |
| diabetes | 768 | 8 (8 numerical and 0 categorical) | no | 2 |
| echocardiogram | 131 | 11 (9 numerical and 2 categorical) | yes | 2 |
| ecoli | 336 | 7 (7 numerical and 0 categorical) | no | 8 |
| flag | 194 | 28 (10 numerical and 18 categorical) | no | 4 |
| glass | 214 | 9 (9 numerical and 0 categorical) | no | 6 |
| haberman | 306 | 3 (3 numerical and 0 categorical) | no | 2 |
| hayes-roth | 132 | 4 (0 numerical and 4 categorical) | no | 3 |
| heart-c | 303 | 13 (6 numerical and 7 categorical) | yes | 2 |
| heart-statlog | 270 | 13 (7 numerical and 6 categorical) | no | 2 |
| hepatitis | 155 | 19 (6 numerical and 13 categorical) | yes | 2 |
| horse-colic | 368 | 22 (7 numerical and 15 categorical) | yes | 2 |
| hr-evaluation | 54808 | 11 (5 numerical and 6 categorical) | yes | 2 |
| hungarian-heart-disease | 294 | 13 (6 numerical and 7 categorical) | yes | 2 |
| iris | 150 | 4 (4 numerical and 0 categorical) | no | 3 |
| lymphography | 148 | 18 (3 numerical and 15 categorical) | no | 4 |
| magic | 19020 | 10 (10 numerical and 0 categorical) | no | 2 |
| monk-1 | 324 | 6 (0 numerical and 6 categorical) | no | 2 |
| monk-2 | 369 | 6 (0 numerical and 6 categorical) | no | 2 |
| monk-3 | 554 | 6 (0 numerical and 6 categorical) | no | 2 |
| mushroom | 8124 | 22 (0 numerical and 22 categorical) | yes | 2 |
| nursery | 12960 | 8 (0 numerical and 8 categorical) | no | 5 |
| phoneme | 5404 | 5 (5 numerical and 0 categorical) | no | 2 |
| poker-hand | 1025010 | 10 (10 numerical and 0 categorical) | no | 10 |
| segment | 2310 | 19 (19 numerical and 0 categorical) | no | 7 |
| seismic-bumps | 2584 | 18 (14 numerical and 4 categorical) | no | 2 |
| sonar | 208 | 60 (60 numerical and 0 categorical) | no | 2 |
| soybean | 683 | 35 (0 numerical and 35 categorical) | yes | 19 |
| tic-tac-toe | 958 | 9 (0 numerical and 9 categorical) | no | 2 |
| titanic | 2201 | 3 (0 numerical and 3 categorical) | no | 2 |
| vote | 435 | 16 (0 numerical and 16 categorical) | yes | 2 |
| wilt | 4839 | 5 (5 numerical and 0 categorical) | no | 2 |
| wine | 178 | 13 (13 numerical and 0 categorical) | no | 3 |
| zoo | 101 | 16 (1 numerical and 15 categorical) | no | 7 |
A dataset summary is also available as a CSV file, `datasets_summary.csv`, in the root directory of the repository.
This repository also contains a tiny Python package that lets you use the datasets without cloning the whole repository.
To install it, run:

`pip install git+https://github.com/cezary986/classification_tabular_datasets`

The package exports two functions: `read_full_dataset` and `read_dataset_train_test`.
The first reads the full dataset and returns a tuple `(X, y)`, where `X` is the data and `y` the labels.
The second reads the dataset split into train and test parts and returns a tuple `(X_train, y_train, X_test, y_test)`.
Example:

```python
import clfdatasets

# print the list of all available datasets
print(", ".join(clfdatasets.AVAILABLE_DATASETS))

# read the whole dataset without a train/test split
X, y = clfdatasets.read_full_dataset("iris")

# read the dataset split into train/test parts
X_train, y_train, X_test, y_test = clfdatasets.read_dataset_train_test("iris")

# read a given cross-validation fold of the dataset
X_train, y_train, X_test, y_test = clfdatasets.read_dataset_train_test("iris", cv_fold=3)
```