By Thy Ly, Yuqi Wang, Hoang Le, Yuqing Wu, Yesenia Ramirez
As the holidays approach, our group noticed that traffic volume and car accidents both tend to rise. Heavy traffic is commonly associated with accidents, either because of the sheer number of vehicles on the road or because of bad weather. For our final project, we therefore wanted to find out when traffic in Manhattan is busiest and whether those dates, times, and streets correlate with the dates, times, and streets of recorded car accidents. We would also like to see whether the weather has any effect on traffic levels and accident rates during the holidays. From these findings, we aim to develop suggestions for navigating high-traffic areas and for handling difficult weather conditions so as to minimize accidents.
We used NYC Open Data for daily traffic volume counts from 2015 to 2019. This dataset records which administrative division the data was taken from and the month, year, date, hour, and minute at which it was taken. It also records the volume of traffic on a specific street and the direction in which the traffic was heading. In addition, we used NYC Open Data to pull the Motor Vehicle Collisions dataset, which provides data on car accidents in Manhattan over the same time frame (2015-2019). For the weather data, we used the National Oceanic and Atmospheric Administration's (NOAA) National Centers for Environmental Information (NCEI) to obtain daily weather reports for Manhattan over the same period as the traffic volume counts dataset. In particular, we retrieved data from the weather station at LaGuardia Airport.
- Motor Vehicle Collisions - Crashes: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
- Climate Data Online: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
- Automated Traffic Volume Counts: https://data.cityofnewyork.us/Transportation/Automated-Traffic-Volume-Counts/7ym2-wayt
For the Data Understanding phase, we build on the foundation set up by Business Understanding and narrow the focus to identifying and analyzing the datasets chosen for the project. For the initial data collection, we required that the data be relatively recent and recorded by day. We confirmed that data were available to test the hypothesis: traffic volume, accident counts, and weather data were all easy to find. However, there were too many cases to analyze, so we narrowed the search to Manhattan from January 2015 to December 2019. With these selection criteria, our initial datasets are the NOAA NCEI daily weather report for Manhattan, the Manhattan traffic volume counts, and the Manhattan accident counts.
The dataset used in our project is a merge, by date, of all the datasets listed above. It contains 1111 daily observations. The variables in the dataset are:
- Total_Traffic - Total daily traffic volume
- log_traffic - Natural logarithm of total daily traffic volume
- Total_Crash - Total daily traffic crashes
- TAVG - Average temperature (Fahrenheit)
- PRCP - Precipitation (inches)
- SNOW - Snowfall (inches)
- SNWD - Snow depth (inches)
- AWND - Average daily wind speed (miles per hour)
- Holiday - Name of the specific federal holiday on the given date
- hld - Dummy variable indicating a holiday
- A dummy variable for each weekday (Sunday, Monday, ..., Saturday)
While doing preliminary data cleaning, we noted that not all dates in each year are available. For example, the last two weeks of December and the week of the July 4th holiday are typically not listed; the original dataset omitted these days. Our research also requires a weekday variable and a holiday variable, so we added them to the dataset.
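The merge-by-date and dummy-variable steps described above can be sketched as follows. This is a minimal illustration, not the authors' notebook code: the tiny input dictionaries, their values, and the helper name `merge_by_date` are all hypothetical, and only dates present in every source are kept (an inner join is assumed).

```python
from datetime import date

# Hypothetical daily records keyed by date; real values come from the three sources.
traffic = {date(2019, 1, 1): 350000, date(2019, 1, 2): 420000,
           date(2019, 1, 3): 430000, date(2019, 1, 5): 390000}
crashes = {date(2019, 1, 1): 60, date(2019, 1, 2): 98, date(2019, 1, 5): 80}
weather = {date(2019, 1, 1): {"TAVG": 48.0, "PRCP": 0.06},
           date(2019, 1, 2): {"TAVG": 38.0, "PRCP": 0.00},
           date(2019, 1, 5): {"TAVG": 44.0, "PRCP": 0.50}}
# Hypothetical holiday lookup: date -> holiday name.
holidays = {date(2019, 1, 1): "New Year's Day"}

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def merge_by_date(traffic, crashes, weather, holidays):
    """Join the sources on date and add holiday and weekday dummy variables."""
    rows = []
    # Keep only dates that appear in all three sources (assumed inner join).
    for day in sorted(set(traffic) & set(crashes) & set(weather)):
        row = {"date": day,
               "Total_Traffic": traffic[day],
               "Total_Crash": crashes[day]}
        row.update(weather[day])
        row["Holiday"] = holidays.get(day, "")
        row["hld"] = int(day in holidays)
        for index, name in enumerate(WEEKDAYS):
            row[name] = int(day.weekday() == index)
        rows.append(row)
    return rows

merged = merge_by_date(traffic, crashes, weather, holidays)
for row in merged:
    print(row["date"], row["Total_Crash"], row["hld"], row["Saturday"])
```

January 3 is dropped here because it has a traffic count but no crash or weather record, which mirrors how missing dates reduce the number of usable observations.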
We first prepared summary statistics for the numeric variables. The average daily total traffic is 410,952. The average daily wind speed is 10.82 miles per hour. The average daily precipitation is 0.12 inches. The average daily snowfall is about 0.06 inches. The average daily snow depth is about 0.12 inches. The average daily temperature is about 55.6 degrees Fahrenheit. The average daily number of crashes is 96. Finally, the average natural logarithm of total traffic is 11.9.
We examined the average traffic crashes for each weekday in our sample. Crashes tend to occur more on working weekdays than on weekends.
We also created a scatterplot of log of total traffic volume versus total crashes. The scatterplot shows a positive correlation between these two variables.
We also created a scatterplot of the average temperature versus daily total crashes. There is a positive correlation between these two variables. As temperatures increase, people tend to travel more, which may be related to more traffic accidents.
Finally, we created a correlation matrix to see the correlation between total traffic crashes and the other variables. We looked at the row for total crashes. A more intense blue indicates a stronger positive correlation, and a more intense red indicates a stronger negative correlation. As observed in our previous figures, total crashes has a positive correlation with average temperature, log of total traffic, and the indicator variable for weekday.
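The "row for total crashes" in a correlation matrix is just the Pearson correlation of Total_Crash with each other column. The sketch below shows that computation on made-up numbers; the helper names `pearson` and `correlation_row` and the toy values are illustrative assumptions, not the project's data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_row(target, columns):
    """Correlation of the target column with every other column."""
    return {name: pearson(columns[target], values)
            for name, values in columns.items() if name != target}

# Toy columns (hypothetical values chosen to give perfect +1 and -1 correlations).
columns = {
    "Total_Crash": [80, 90, 100, 110],
    "TAVG": [40, 50, 60, 70],
    "SNOW": [3, 2, 1, 0],
}
row = correlation_row("Total_Crash", columns)
print(row)
```

Values near +1 correspond to the intense blue cells and values near -1 to the intense red cells in the figure.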
- ANOVA TEST
The first ANOVA test is used to see the variation between total crashes and holidays.
According to the summary of the ANOVA test, the p-value is 6.89e-08, which is less than 0.05. This means the daily number of car accidents differs significantly between holidays and non-holidays.
Diving deeper, we find that the average number of car crashes on federal holidays is 66.52 per day, while the average on non-holidays is 96.19 per day. Comparing these two means, the number of car accidents is about 30.8% lower on federal holidays than on non-holidays ((96.19 - 66.52) / 96.19).
The second ANOVA test, called "mod_weekday", explores the effect of the day of the week on traffic accidents. In the summary, the F value is 45.32, which is far greater than 1, indicating that the mean number of crashes varies across the days of the week. The p-value is reported as 2e-16, which is less than 0.05. These two values provide strong evidence to reject the null hypothesis that all days share the same mean.
The mean number of total crashes is not identical across days. Traffic accidents peak on Friday, with an average of 110.98, and are lowest on Sunday, with an average of 75.07.
After investigating the variation across days of the week, we asked whether this trend reflects differences between individual days or a more general difference between working days and weekends. We therefore designed another test of whether the weekend is the factor that drives the overall results.
In the summary of the weekend-model ANOVA test, the F value is 139.8, which is far greater than 1, and the p-value is reported as 2e-16, which is less than 0.05. Both results show that the average number of total crashes differs significantly between weekends and weekdays. We can conclude that car crashes are more likely to occur on weekdays than on weekends: the mean for the weekday group is 101, while the mean for the weekend group is only 82.49.
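The F value reported by each ANOVA test is the ratio of between-group to within-group variance. The following sketch computes it by hand for two groups; the function name `one_way_anova_f` and the crash counts are made up for illustration and are not the project's data.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of observations."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical daily crash counts for the two groups.
weekday_crashes = [100, 102, 98, 104, 96]
weekend_crashes = [80, 84, 82, 86, 78]
f_value = one_way_anova_f([weekday_crashes, weekend_crashes])
print(f_value)
```

An F value far above 1 means the group means differ by much more than the day-to-day noise within each group. Turning F into a p-value requires the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.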
- Model Selection
- Check Multicollinearity
We then used three methods for model selection.
- Backwards elimination via p-value
In this part, we removed the variable with the largest p-value at each step until all remaining variables were statistically significant. Here is the model we obtained from backward elimination via p-value.
- Best subsets
We also used the best subsets approach to refine the model selection.
The BIC graph compares the goodness of fit of the different regression models. The lower the BIC, the better the model. The BIC graph indicates that the subset with 3 variables is the best.
Mallows's Cp is used to find the best subset of predictors based on the residual sum of squares. A smaller Cp value indicates a more precise model. In the Cp graph, the model with 5 predictors has the smallest Cp value.
The adjusted R^2 indicates how much of the variance of the output variable is explained by the input variables, adjusted for the number of predictors. As variables are added, the adjusted R^2 increases until it peaks at the subset with 6 predictors. Because the different graphs suggest different subsets, we have to compare these results with the other model selection methods to decide on the optimal model.
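To make the three criteria concrete, here is a minimal best-subsets sketch on synthetic data. It is an illustration under stated assumptions: the data, the predictor names `x0` to `x3`, and the helpers `fit_rss` and `best_subsets` are hypothetical, NumPy is assumed to be available, and BIC is computed up to an additive constant.

```python
import itertools
import numpy as np

def fit_rss(X, y):
    """Residual sum of squares of an ordinary least squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def best_subsets(X, y, names):
    """Score every non-empty subset of predictors by BIC, Cp and adjusted R^2."""
    n, p = X.shape
    tss = float(((y - y.mean()) ** 2).sum())
    sigma2_full = fit_rss(X, y) / (n - p - 1)   # error variance of the full model
    results = []
    for k in range(1, p + 1):
        for cols in itertools.combinations(range(p), k):
            rss = fit_rss(X[:, list(cols)], y)
            results.append({
                "vars": [names[c] for c in cols],
                "bic": n * np.log(rss / n) + (k + 1) * np.log(n),
                "cp": rss / sigma2_full - n + 2 * (k + 1),
                "adj_r2": 1 - (rss / (n - k - 1)) / (tss / (n - 1)),
            })
    return results

# Synthetic data: only x0 and x1 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
names = ["x0", "x1", "x2", "x3"]

results = best_subsets(X, y, names)
best_bic = min(results, key=lambda r: r["bic"])
best_cp = min(results, key=lambda r: r["cp"])
best_adj = max(results, key=lambda r: r["adj_r2"])
print(best_bic["vars"], best_cp["vars"], best_adj["vars"])
```

BIC penalizes each extra predictor more heavily than Cp or adjusted R^2 do, which is one reason the three criteria can favor subsets of different sizes.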
- Automated Stepwise
For the automated stepwise model selection, we used backward stepwise selection. The model starts with all variables included. At each step, each variable in the current model is tested for removal, and the removal that gives the lowest AIC is chosen, because a lower AIC indicates a better-fitting model. The process repeats until removing a variable no longer improves the model.
Here is the summary of the optimal model.
After the selection, the optimal model with the lowest AIC includes AWND, TAVG, hld, Weekday, and Total_Traffic, which is the same as the model that we got from the backward elimination via p-value.
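The backward stepwise procedure can be sketched as follows. This is a simplified illustration on synthetic data, not the project's code: the helper names `aic` and `backward_stepwise` are hypothetical, NumPy is assumed to be available, and AIC is computed up to an additive constant as n * log(RSS / n) + 2 * (number of coefficients).

```python
import numpy as np

def aic(X, y, cols):
    """AIC (up to a constant) of an OLS fit with intercept on the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(((y - design @ beta) ** 2).sum())
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

def backward_stepwise(X, y):
    """Start from all predictors; drop one at a time while AIC keeps decreasing."""
    current = list(range(X.shape[1]))
    best = aic(X, y, current)
    while current:
        # AIC of every model obtained by removing exactly one predictor.
        candidates = [(aic(X, y, [c for c in current if c != drop]), drop)
                      for drop in current]
        candidate_aic, drop = min(candidates)
        if candidate_aic >= best:
            break                      # no single removal improves the model
        best = candidate_aic
        current.remove(drop)
    return current, best

# Synthetic data: only columns 0 and 1 actually drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected, final_aic = backward_stepwise(X, y)
print(selected, final_aic)
```

Columns that carry real signal stay in the model because removing them raises the AIC sharply, while pure-noise columns are usually dropped.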
- Final Model
Based on the results of the different approaches, we include AWND, TAVG, hld, Weekday, and Total_Traffic as the predictors in the final model.
- Transformation
Because the variables are on very different scales, we took the log of Total_Crash and Total_Traffic to bring the coefficients onto a comparable scale. After the transformation, the adjusted R^2 improved from 12.2% to 14.7%.
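Written out, the transformed model takes roughly the following form. This is a sketch under assumptions: the coefficient symbols and the error term are introduced here for illustration, Weekday is assumed to enter as a single indicator, and the fitted values are not reproduced.

```latex
\log(\text{Total\_Crash}) = \beta_0
  + \beta_1 \log(\text{Total\_Traffic})
  + \beta_2\,\text{AWND}
  + \beta_3\,\text{TAVG}
  + \beta_4\,\text{hld}
  + \beta_5\,\text{Weekday}
  + \varepsilon
```

In a log-log specification like this, \beta_1 reads as an elasticity: a 1% increase in total traffic is associated with approximately a \beta_1 percent change in total crashes, holding the other predictors fixed.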
- Model Evaluation
The residual plots show no obvious violations of the model assumptions. The residuals appear approximately normally distributed, with no apparent outliers.
- From both the ANOVA tests and our regression models, we can conclude that the number of car crashes peaks on Friday. One possible reason is that many people who travel between nearby towns or counties for work go home on Friday. Friday is probably the busiest day of the week, so we suggest assigning more traffic police to busy areas and times to maintain order, and clearing vehicles involved in accidents promptly to keep traffic moving on the affected road section.
- Based on our findings, more car crashes happen on normal weekdays than on weekends. We therefore suggest adjusting the timing of traffic lights to accommodate the different traffic volumes on weekdays and weekends.
- Send drivers real-time notifications about traffic volume and conditions through radio or map apps, and ask them to be cautious during busy hours.
https://www.kaggle.com/code/ghghtak4/data-cleaning-and-eda
https://www.kaggle.com/code/ghghtak4/machine-learning-model
https://www.kaggle.com/code/ghghtak4/model-selection
Presentation video: https://youtu.be/kr3jxNvfYSI
Tutorial video: https://youtu.be/exMnIX4yf6I













