Working on a regression problem using various machine learning models to predict " CO2 emissions canada 2023 " based on features such as engine size, fuel consumption, and cylinders. Here's a breakdown of the code you provided:
-
Importing libraries:
- Pandas: for data manipulation and analysis
- NumPy: for numerical operations
- Matplotlib: for data visualization
- Seaborn: for statistical data visualization
- Various modules from scikit-learn: for machine learning algorithms and evaluation metrics
-
Loading the dataset:
- The dataset is loaded from a CSV file called 'FuelConsumption2023.csv' using
pd.read_csv(). - The columns of interest are selected and stored in the 'df' DataFrame.
- The dataset is loaded from a CSV file called 'FuelConsumption2023.csv' using
-
Data exploration:
- Dataset dimensions are printed using
df.shape. - The first few rows of the dataset are displayed using
df.head(). - Data types of each column are printed using
df.dtypes. - The number of missing values in each column is displayed using
df.isnull().sum(). - Summary statistics of numerical columns are printed using
df.describe(). - Unique values in categorical columns (if any) are displayed using a loop.
- Dataset dimensions are printed using
-
Data preprocessing:
- Missing values are dropped from the dataset using
df.dropna(). - One-hot encoding is performed on categorical columns (if any) using
pd.get_dummies(). - Feature scaling or normalization is performed on the encoded dataset using
MinMaxScaler().
- Missing values are dropped from the dataset using
-
Data visualization:
- The distribution of numeric columns is visualized using histograms and KDE plots with the help of
sns.histplot(). - Relationships between variables are visualized using scatter plots with regression lines using
sns.regplot().
- The distribution of numeric columns is visualized using histograms and KDE plots with the help of
-
Model training and evaluation:
- Selected features and target variable are assigned to 'X' and 'y', respectively.
- The data is split into training and testing sets using
train_test_split(). - Several regression models are trained on the training set and evaluated on the testing set:
- Linear Regression (LR)
- Support Vector Regression (SVR)
- Multilayer Perceptron (MLP)
- Decision Tree (Regression)
- Random Forest
- Gradient Boosting (GB)
- K-Nearest Neighbors (KNN)
- Evaluation metrics (MSE and R-squared) are calculated using
mean_squared_error()andr2_score(). - Scatter plots of actual vs predicted values are plotted using
plt.scatter().
-
Model tuning:
- Hyperparameter grids for each model are defined.
- Grid search is performed for each model using
GridSearchCV()to find the best hyperparameters. - Tuned models are evaluated and scatter plots of actual vs predicted values are plotted.