Commit e162e66

Update README.md

1 parent fbc6a47 commit e162e66

1 file changed: README.md (25 additions & 6 deletions)
@@ -29,7 +29,7 @@
 - [Introduction to arrays using numpy](#introduction-to-arrays-using-numpy)
 - [visualize a dataset using seaborn](#visualize-a-dataset-using-seaborn)
 - [manipulate dataset with pandas](#manipulate-dataset-with-pandas)
-- [iris flowers classification with python](#iris-flowers-classification-with-python)
+- [iris flowers classification](#iris-flowers-classification)
 - [iris flowers data set](#iris-flowers-data-set)
 - [Load the dataset](#load-the-dataset)
 - [Examine the dataset](#examine-the-dataset)
@@ -40,7 +40,10 @@
 - [Fit the model](#fit-the-model)
 - [Evaluate the trained model performance](#evaluate-the-trained-model-performance)
 - [Use k-Fold Cross-Validation to better evaluate the trained model performance](#use-k-fold-cross-validation-to-better-evaluate-the-trained-model-performance)
-- [Use the model with unseen data and make predictions](#use-the-model-with-unseen-data-and-make-predictions)
+- [Use the model with unseen data and make predictions](#use-the-model-with-unseen-data-and-make-predictions)
+- [Remove irrelevant features to reduce overfitting](#remove-irrelevant-features-to-reduce-overfitting)
+- [Recursive Feature Elimination](#recursive-feature-elimination)
+
 
 # What to find in this repository

@@ -184,6 +187,16 @@ Detecting overfitting is useful, but it doesn’t solve the problem.
 
 To prevent overfitting, train your algorithm with more data. It won’t work every time, but training with more data can help algorithms separate the signal from the noise. Of course, that’s not always the case: if we just add more noisy data, this technique won’t help. That’s why you should always ensure your data is clean and relevant.
 
+To prevent overfitting, improve the data by removing irrelevant features.
+Not all features contribute to the prediction. Removing features of low importance can improve accuracy and reduce overfitting; it can also shorten training time.
+Imagine a dataset with 300 columns but only 250 rows. That is a lot of features for very few training samples, so instead of using all the features, it is better to use only the most important ones. This makes training faster and can help prevent overfitting because the model no longer depends on every feature.
+So, rank the features and eliminate the less important ones.
+
+The python library `scikit-learn` provides a `feature selection` module which helps identify the most relevant features of a dataset.
+Examples:
+- The class `VarianceThreshold` removes features with low variance, i.e. features whose variance falls below a configurable threshold.
+- The class `RFE` (Recursive Feature Elimination) selects features by recursively considering smaller and smaller sets of features. It first trains the classifier on the initial set of features; after each training, the importance of each feature is computed and the least important one is eliminated from the current set. The procedure is repeated until the desired number of features is reached. RFE can thus find the combination of features that contribute to the prediction. You just need to import RFE from sklearn.feature_selection and indicate the number of features to select and which classifier model to use.
+
 # Python libraries
 
 ## scikit-learn
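As a minimal sketch of the `VarianceThreshold` class the added text mentions (not code from the repository), assuming the iris data bundled with scikit-learn and an arbitrary illustrative threshold of 0.2:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

# Load the bundled iris data: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Drop every feature whose variance is below the threshold.
# The 0.2 value is arbitrary, chosen only for illustration.
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

print("per-feature variances:", selector.variances_)
print("features kept:", X_reduced.shape[1], "of", X.shape[1])
```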
@@ -698,7 +711,7 @@ male (0, 18] 0.800000 1.000000 1.000000
 ```
 
 
-# iris flowers classification with python
+# iris flowers classification
 
 The demo is about iris flowers classification.
 
@@ -1320,11 +1333,17 @@ so the model prediction is:
 - the first two flowers belong to the iris setosa category
 - the last two belong to the iris virginica category
 
-#
+# Remove irrelevant features to reduce overfitting
+
+To prevent overfitting, improve the data by removing irrelevant features.
+
+## Recursive Feature Elimination
+
+The class `RFE` (Recursive Feature Elimination) from the `feature selection` module of the python library scikit-learn selects features by recursively considering smaller and smaller sets of features. It first trains the classifier on the initial set of features; after each training, the importance of each feature is computed and the least important one is eliminated from the current set. The procedure is repeated until the desired number of features is reached. RFE can thus find the combination of features that contribute to the prediction. You just need to import RFE from sklearn.feature_selection and indicate which classifier model to use and the number of features to select.
 
-Let's use RFE (Recursive Feature Elimination) with Scikit Learn to select the features to keep
+Here's how you can use the class `RFE` to find the combination of important features.
 
-We will use this example [recursive_feature_elimination.py](recursive_feature_elimination.py)
+We will use this basic example [recursive_feature_elimination.py](recursive_feature_elimination.py)
 
 Load LinearSVC class from Scikit Learn library
 LinearSVC is similar to SVC with parameter kernel='linear'
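The repository's actual [recursive_feature_elimination.py](recursive_feature_elimination.py) is not shown in this diff; a minimal sketch along the lines the added text describes (RFE driven by a LinearSVC on the iris data, keeping 2 of the 4 features, counts chosen for illustration) could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Load the iris data set: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# LinearSVC is similar to SVC with kernel='linear'; its coef_ attribute
# gives RFE the per-feature importances it needs.
estimator = LinearSVC(max_iter=10000)

# Recursively refit the classifier, eliminating the least important
# feature each round, until n_features_to_select features remain.
rfe = RFE(estimator, n_features_to_select=2)
rfe.fit(X, y)

print("selected features mask:", rfe.support_)  # True = feature kept
print("feature ranking:", rfe.ranking_)         # 1 = selected; higher = eliminated earlier
```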
