- Author @Diwei Zhu
- Current student in the Master of Management in Analytics program, McGill University
- LinkedIn: www.linkedin.com/in/diwei-zhu-693132186
- Will be graduating in August 2022
- Keywords : Random Forest, CausalML
- Trained random forest model to predict life expectancy of countries.
- Geographical region, Gini coefficient, HIV rate, polio (Pol3) immunization coverage, and thinness among children are the five most influential features for life expectancy.
- Checked whether the geographical location “African continent” has a causal relationship with lower life expectancy.
- Observed a difference of approximately 5 years in mean life expectancy between countries in Africa and other countries. The ATE result indicates a possible causal link between a country being in Africa and having a lower life expectancy.
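To make the quantity concrete, the unadjusted difference in means behind the observation above can be sketched in a few lines. This is only an illustration: the numbers are made up, the `naive_ate` helper is hypothetical, and CausalML estimators additionally adjust for covariates, which this sketch does not.

```python
from statistics import mean

def naive_ate(outcomes, treated):
    """Unadjusted difference in mean outcome: treated minus untreated."""
    treated_vals = [y for y, t in zip(outcomes, treated) if t]
    control_vals = [y for y, t in zip(outcomes, treated) if not t]
    return mean(treated_vals) - mean(control_vals)

# Hypothetical life expectancy values in years (not the project's data).
life_expectancy = [60.0, 62.0, 64.0, 66.0, 67.0, 68.0]
# Treatment indicator: 1 = country is in Africa, 0 = otherwise.
in_africa = [1, 1, 1, 0, 0, 0]

ate = naive_ate(life_expectancy, in_africa)
print(f"Naive ATE: {ate:.1f} years")
```

A negative value means the treated group (countries in Africa) has the lower mean. Because no confounders are controlled for, this figure alone does not establish causality.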
- CausalML feature influence visualization (SHAP values):
- Keywords: LightGBM, Gurobi optimization, Coupon promotion strategy, CausalML
- Group project (8 members)
- Built and compared models for predicting whether a customer is likely to churn. Selected LightGBM based on F1 scores.
- F1 scores of models
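For reference, the F1 score used for the comparison is the harmonic mean of precision and recall. A minimal sketch of selecting a model by F1 follows; the confusion counts and the model names `model_a` and `model_b` are hypothetical and are not the project's results.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (tp, fp, fn) counts for two candidate churn models.
candidates = {
    "model_a": (40, 10, 10),
    "model_b": (30, 5, 20),
}

scores = {name: f1_score(*counts) for name, counts in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

F1 is a common choice for churn prediction because churners are usually the minority class, so plain accuracy can look good even when most churners are missed.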
- Identified the coupon promotion strategy that minimizes the monetary loss from churn, using the Gurobi optimization tool
- Expected monetary value of the optimized coupon promotion, based on LightGBM predictions: $24032.70
- Keywords: Classification model, KMeans clustering, Gradient Boosting
- Predicted success/failure of Kickstarter projects with a GBT model (after comparing its accuracy with Random Forest and KNN)
- Goal, number of days between creation and launch, and length of project name are the three most influential features.
- Assigned projects to three clusters based on three selected features (used the Silhouette and Elbow methods to determine the number of clusters).
- Characteristics of the clusters:
- Keywords: Gurobi optimization, Route planning, Travelling salesman problem
- Group project (5 members)
- Planned optimal police car patrol routes linking Toronto police divisions under a hypothetical budget limit and patrol time restriction
- Visualized patrolling routes for two police cars:
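The underlying travelling salesman problem can be illustrated with a brute-force search over a handful of points. This is a sketch only: the project used Gurobi with budget and time constraints, whereas this example enumerates permutations for a single vehicle, and the coordinates and the `shortest_tour` helper are hypothetical.

```python
from itertools import permutations
from math import dist

def shortest_tour(points, start=0):
    """Try every visiting order and return the shortest closed tour."""
    others = [i for i in range(len(points)) if i != start]
    best_order, best_len = None, float("inf")
    for perm in permutations(others):
        order = (start, *perm, start)
        # Sum straight-line distances along consecutive stops.
        length = sum(dist(points[a], points[b]) for a, b in zip(order, order[1:]))
        if length < best_len:
            best_order, best_len = order, length
    return best_order, best_len

# Hypothetical division locations on a 4 x 3 grid (not real coordinates).
stations = [(0, 0), (0, 3), (4, 3), (4, 0)]
order, length = shortest_tour(stations)
print(order, length)
```

Brute force grows factorially with the number of stops, which is why a solver such as Gurobi is the practical choice beyond roughly ten locations.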
- Keywords: MySQL, Build relational database, Queries
- Built a relational database of La Ronde wonderland visitor and facility records in MySQL and wrote queries for business insights
- e.g. preferences of visitors in different age groups, profiles of the most loyal/returning visitors, most popular ticket types
- ERD:
- Appended external weather data from Statistics Canada for in-depth analysis
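A query such as "most popular ticket types" can be sketched as follows. The schema, table names, and rows are hypothetical, and the example runs against an in-memory SQLite database through Python's `sqlite3` module as a stand-in for MySQL; the `GROUP BY` query itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table schema: visitors and the tickets they bought.
conn.executescript("""
CREATE TABLE visitor (
    visitor_id INTEGER PRIMARY KEY,
    age        INTEGER
);
CREATE TABLE ticket (
    ticket_id   INTEGER PRIMARY KEY,
    visitor_id  INTEGER REFERENCES visitor(visitor_id),
    ticket_type TEXT
);
INSERT INTO visitor VALUES (1, 15), (2, 34), (3, 41);
INSERT INTO ticket VALUES
    (1, 1, 'day pass'),
    (2, 2, 'season pass'),
    (3, 3, 'season pass'),
    (4, 2, 'season pass');
""")

# Count tickets per type, most popular first.
rows = conn.execute("""
SELECT ticket_type, COUNT(*) AS n_sold
FROM ticket
GROUP BY ticket_type
ORDER BY n_sold DESC
""").fetchall()
print(rows)
conn.close()
```

Joining `ticket` to `visitor` on `visitor_id` and grouping by an age bucket would give the age-group preferences in the same way.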
- Keywords: MySQL, Build relational database, Queries
- Group project (5 members)
- Built a relational database of pseudo user information and matching records of the Bumble dating app and facility records in MySQL and wrote queries for business insights
- e.g. revenue from premium subscriptions, most popular user characteristics, whether Bumble's dating events helped users match
- ERD:
3.1. Wine category classification with Keras Functional API
- Keywords: nltk, lift ratio, MDS plot, sentiment analysis
- Group project (6 members)
- With the nltk package, analyzed mentions of the 10 most popular car brands and car attributes in user-generated content (UGC) scraped from a forum. Calculated lift ratios for pairs of car brands and attributes, drew MDS plots for clustering, and performed sentiment analysis.
- MDS plot of brands showing which brands are frequently compared by users:
- Identified BMW as the aspirational brand for users, since it has the highest number of mentions, a strong association with purchase attributes, and is compared less often with other brands.
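The lift ratio mentioned above measures how much more often two terms co-occur than would be expected if they were independent: lift(A, B) = P(A and B) / (P(A) P(B)). A minimal sketch, assuming each post has already been reduced to a set of tokens; the posts and the `lift` helper are hypothetical, not the project's data or code.

```python
def lift(posts, term_a, term_b):
    """Lift of two terms across posts, each post given as a set of tokens."""
    n = len(posts)
    n_a = sum(term_a in p for p in posts)
    n_b = sum(term_b in p for p in posts)
    n_ab = sum(term_a in p and term_b in p for p in posts)
    if n_a == 0 or n_b == 0:
        return 0.0
    # Equivalent to P(A and B) / (P(A) * P(B)).
    return n * n_ab / (n_a * n_b)

# Hypothetical tokenized forum posts.
posts = [
    {"bmw", "performance"},
    {"bmw", "performance"},
    {"honda", "price"},
    {"honda", "performance"},
]

print(lift(posts, "bmw", "performance"))
print(lift(posts, "honda", "price"))
```

A lift above 1 suggests the two terms are mentioned together more often than chance would predict, and a lift below 1 suggests the opposite.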
- Keywords: nltk, lift ratio, MDS plot, sentiment analysis, topic modeling
- Group project (6 members)
- Analyzed 9 airlines and onboard experience attributes based on UGC scraped from forums. Calculated lift scores, drew MDS plots, and performed sentiment analysis and topic modeling.
- Based on the sentiment analysis results, provided recommendations to passengers according to their onboard experience needs, and advised airlines on which attributes to improve.
- Keywords: nltk text classification, Naïve Bayes classifier
- Built a text classification model that labels salary level from job descriptions and generated a list of the most informative features.
- e.g. the appearance of “off shore”, “architecture”, or “unix” indicates a high salary
- e.g. the appearance of “school”, “friday”, or “enjoy” indicates a low salary
- Keywords: Recursive Feature Elimination, Polynomial regression model, Spline model
- Group project (6 members)
- Built a polynomial regression model (chosen after comparison with a spline model) that passed a heteroskedasticity test, to predict IMDb ratings of movies from selected features.
- Keywords: Tree-based classification model, PCA, Clustering
- Using selected socio-economic factors (GDP per capita and income per capita) and health factors (fertility rate and spending on healthcare), built a random forest classification model and clusters (with PCA) to identify countries that are eligible for international help.
- Identified countries to receive help: Democratic Republic of the Congo, Liberia, Burundi, Niger, Central African Republic
- PCA: