Data Science Live

Wharton Data Science Academy


10am - 3pm

Jul 30 (Friday) 2021


Jeffrey Cai (Head TF)
Norman Chen, Shannon Duncan,
David Fan, Shannon Laub, Tommy Lin

Wharton Data Science Academy

"Run by the Wharton Analytic Institute, the Wharton Academy of Data Science will bring state-of-the-art machine learning and data science tools to high school students. We aim to stimulate students’ curiosity in the fast-moving field of machine learning through this rigorous yet approachable program. Building up statistical foundations together with empirical and critical thinking skills will be the main theme throughout. By the end of the program, students will not only be equipped with essential data science techniques such as data visualization and data wrangling but will also be exposed to modern machine learning methodologies which are all building blocks for today’s AI field. Along the way, students will develop a working proficiency with the R language, which is among the most widely used by professional data scientists in both academia and industry.

We believe data science is not just a collection of techniques; it is foremost motivated by real-world problems. The data scientist of the 21st century must be able to identify relevant problems, provide sensible analyses, and ultimately communicate their findings in meaningful ways. All learning modules are based on real-life case studies."



Session 1

Air is free, but should not be taken for granted. An estimated 91% of people do not have access to clean air: the air they breathe does not meet the World Health Organization’s (WHO) air quality guidelines. This is a warning of what is to come if humans continue to rely on hydrocarbon fuels and unsustainable production methods. Poor air quality is leaving a gash on society, causing 8 million people each year to die prematurely from heart disease, pneumonia, diabetes, and cancer. Governments and regulatory bodies must identify the major causes of air pollution; without a deeper understanding of these factors, they cannot take the most effective action to improve air quality. The goal of our study is to use various statistical analyses and computational techniques to investigate which factors have the most profound impact on air pollution. We will use Exploratory Data Analysis, LASSO, Multiple Linear Regression, and Backwards Selection to find a model for air pollution. We hope that our findings can aid leaders in making a renewed battle plan against pollution. This study was inspired by our concern about environmental destruction and degradation, in which air pollution plays a large role.

The goal of this project is to use measurements of several socioeconomic, political, and health factors to illustrate the correlation between each factor and the happiness score, while also studying how these measurements are distributed across nations. These factors include GDP per capita, life expectancy, COVID-19 cases, the Global Peace Index, etc. Using these variables, we created a variety of graphs: box plots and histograms displayed the distributions of certain variables among the countries, scatter plots demonstrated the correlation between happiness and a few of the obtained statistics, and a heat map illustrated the happiness ratings across the world. We also used LASSO, backwards selection, and random forest models to find the factors with the most significant relationship to happiness. Based on the most accurate models (LASSO and backwards selection, with a mean squared error of 1.80038), we found that GNI per capita (Gross National Income divided by population, giving an average income per citizen), life expectancy, freedom, and social support were the most impactful. Given these strong correlations, any of these statistics can be used to predict a country's happiness, and thus its need for expanded mental health services.

Group 7   Nathan, Christiana, Ella, Abhay, Ramya, Felix

Ongoing pollution has had a multitude of toxic effects on bodies of water around the globe, with acidity being one of the most commonly affected properties. Past research has shown that highly acidic water can cause serious, even fatal, health issues in humans, especially children. Furthermore, because of its low pH, this water can corrode pipes, causing them to dissolve over time, which in turn causes leaks and puts more metal in our drinking supply. In this research, we explore the relationship between such harmful acidity levels and the looming threat of climate change. Our main variable is pH, which measures the acidity of the water; we also observe temperature, gage height, conductance, discharge, dissolved oxygen, and time. Using backwards selection and ANOVA to narrow down the most significant variables, and linear regression to predict potential pH levels in the future, we find that the majority of variables have a positive linear correlation with pH values and predict their continuous increase. To evaluate how well our model describes the pH level of the Delaware River as a whole, we used river data from another site (Montague, New Jersey) and constructed a confidence interval that corroborated our original model. These results can inform subsequent measures for preserving our environment, perhaps assisting protection agencies and the general public in our effort to preserve the Earth.

Group 5   Anoushka, Christine, Cynthia, Michael

The goal of our project is to determine whether there is a correlation between a county’s suicide rate and various geographic, environmental, political, and socioeconomic indicators. Through our analysis, we hope to promote greater awareness of mental health and especially suicide. As students, we have noticed the consistently rising suicide rates and prevalence of mental health issues among high schoolers, both of which have been accentuated by the COVID-19 pandemic. In fact, the city of Palo Alto, which neighbors Stanford University and is home to numerous nationally ranked high schools, sees a teen suicide rate that is more than four times the national average (Chawla). Just this past May, a student in the Mountain View–Los Altos Union High School District took their own life, marking the third consecutive year a student has done so. Ultimately, we want to provide better insight to local and state governments along with non-profit organizations so that they can properly allocate mental health resources. We collected data from several credible sites such as the Centers for Disease Control and Prevention (CDC), the United States Department of Agriculture (USDA), and the United States Environmental Protection Agency (EPA). After the data cleaning process, we were left with 459 of the original 991 counties from the CDC dataset because many counties whose death count was less than 20 were marked as “unreliable.” In our analysis, we used various methods such as Trees, Random Forest, and LASSO to determine which variables are significant in determining suicide rates. Finally, we evaluated and compared the models we created to decide which one performed the best.

Works Cited:
Chawla, Ishika. “CDC Releases Preliminary Findings on Palo Alto Suicide Clusters.” The Stanford Daily, 21 July 2016, www.stanforddaily.com/2016/07/21/cdc-releases-preliminary-findings-on-palo-alto-suicide-clusters/.

Group 11   Jingjie, Xiaohu, Rishab, Yena, Sohum

“Money can’t buy happiness” is an adage we all have heard at some point in our lifetime. As the U.S. has grown more and more capitalistic, its suicide rates have steadily increased as well. We wanted to see how the growing gap in salary within the population correlates with the growing suicide rate, a significant indicator of happiness. We gathered the average salary for each state from the U.S. Bureau of Economic Analysis, the average suicide rate from America’s Health Rankings, and other socioeconomic conditions from various organizations, with data ranging from 2010-2019. We used R (a statistical programming language) to analyze the effects of salary on suicide rate. In particular, we used backwards selection and then ran several modelling techniques on our data: multiple linear regression, decision tree, and random forest. By plotting salary against suicide rate, we discovered several outliers and researched potential causes for their irregularity. After that, we focused on our multiple linear regression model, which shows a definite negative association between salary and suicide rate, significant at the 0.001 level, while controlling for other variables. Additionally, after comparing the multiple linear regression, decision tree, and random forest models, we found that the random forest has the lowest mean squared error, making it the best prediction model. Its predictions correlate very highly with the actual data, at 98%. After reviewing our model results and looking closer at the outlying states, we conclude that there are many ways for states to lower their suicide rates. The most effective solutions include raising the minimum wage and implementing social welfare programs such as funding public health resources, providing unemployment support, and pushing for mental health awareness.
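
The model comparison described above (pick the model with the lowest mean squared error on held-out data) can be sketched as follows. This is only an illustration in plain Python rather than the R the group used: the data are synthetic, the variable names are hypothetical, and a simple one-variable regression and a mean-only baseline stand in for the multiple regression, decision tree, and random forest models.

```python
import random

def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Synthetic data (NOT the project's data): an assumed negative
# relationship between salary and rate, plus random noise.
rng = random.Random(0)
salary = [rng.uniform(30, 80) for _ in range(200)]
rate = [25 - 0.2 * s + rng.gauss(0, 1.5) for s in salary]

# Hold out the last 50 observations as a test set.
train_x, test_x = salary[:150], salary[150:]
train_y, test_y = rate[:150], rate[150:]

# Candidate 1: always predict the training mean.
baseline_pred = [sum(train_y) / len(train_y)] * len(test_y)

# Candidate 2: simple linear regression fitted on the training set.
a, b = fit_simple_linear(train_x, train_y)
linear_pred = [a + b * x for x in test_x]

# Compare the candidates on data neither has seen.
test_mse = {
    "mean baseline": mse(test_y, baseline_pred),
    "linear regression": mse(test_y, linear_pred),
}
best_model = min(test_mse, key=test_mse.get)
print(test_mse, "->", best_model)
```

Evaluating on held-out observations, rather than on the data used for fitting, keeps a flexible model from looking better merely because it memorized the training set.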

Group 1   Anushka, Gracia, Joy, Miran, Sidarth, Yulan

Our project focuses on gun violence, which has been growing rapidly over the past couple of years across the United States. The purpose of this project is to identify demographic factors that influence the ownership of guns. We chose to specifically study New York City because of the high concentration of gun-related violence in that region. We collected data from the NYPD about handgun permits as well as data from the CCC New York on demographics. After cleaning and merging the data sets, we used LASSO and backwards selection to create a linear regression model for the number of handgun permits by community district. We found that the demographic factors that most significantly affect gun ownership in New York City boroughs are the number of foreigners, the percentage of homeowners, the median monthly rent, the number of people in poverty, the number of Asian, Black, Latino, and Other people, the number of people between 18 and 24 years old, the number of workers in education, administration, and retail, and the number of children in foster care.

Group 12   Aditi, Avery, Drew, Sophie, Tony

Today, the media is more prevalent than ever. With the advent of the internet and social media, more and more people are becoming involved in the media, a fascinating trend that has yet to be fully explored. In this study, we sought to gain some insights into the media and people's participation in it by running a sentiment and word frequency analysis on a dataset of comments on New York Times online articles. We used the packages 'tm' and 'tidytext' for word frequency analysis and sentiment analysis, respectively, and we ran LASSO regressions to come up with final linear models for both cases. In the end, our models proved far less useful than we had hoped, but they still shed some light on this fascinating topic.
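
To make the two analyses concrete, here is a minimal plain-Python sketch of word frequency counting and lexicon-based sentiment scoring. It does not use the R packages the group worked with; the stop-word list, the tiny sentiment lexicon, and the example comments are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
STOP_WORDS = {"the", "a", "is", "and", "this", "of", "to", "it", "was", "but"}
SENTIMENT = {"great": 1, "insightful": 1, "love": 1,
             "terrible": -1, "biased": -1, "boring": -1}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(comments):
    """Count non-stop-words across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(w for w in tokenize(comment) if w not in STOP_WORDS)
    return counts

def sentiment_score(comment):
    """Sum the lexicon scores of the words in one comment."""
    return sum(SENTIMENT.get(w, 0) for w in tokenize(comment))

# Made-up comments, not taken from the New York Times dataset.
comments = [
    "This article is great and insightful",
    "Terrible reporting, biased and boring",
    "I love the reporting but the headline is biased",
]

freq = word_frequencies(comments)
scores = [sentiment_score(c) for c in comments]
print(freq.most_common(3))
print(scores)
```

Counts and scores like these could then serve as the predictors in a regression, which is roughly the role the abstract describes for its LASSO models.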

Session 2

Alzheimer’s is a type of dementia that progressively affects memory, thinking, and behavior until patients can no longer carry out even the most basic tasks. In this project, we aim to identify the main factors in an Alzheimer’s diagnosis and predict the probability that a person has or is at risk of Alzheimer’s given certain factors. We began our Exploratory Data Analysis (EDA) by comparing the demographics of the individuals, namely age, sex, education, and socioeconomic status (SES). We found that ages were approximately normally distributed between 60 and 100, and that education was also approximately normally distributed. However, the strong left skewness of socioeconomic status indicates that the patients were fairly well off, which makes sense because MRI scans are expensive. Next, we compared the distributions of whole brain volume (nWBV) and estimated total intracranial volume (eTIV) between patients with and patients without Alzheimer's. We found that the distributions were relatively similar, indicating that neither nWBV nor eTIV is a significant indicator of Alzheimer's. In addition, when we plotted eTIV against Atlas Scaling Factor (ASF), we found that they were inversely proportional, indicating that ASF is also not a significant indicator. Using the same method of comparison, we found that the Mini Mental State Examination (MMSE) is a very good indicator of whether a patient has dementia or not, as there are significantly fewer high scorers among the patients with dementia. Finally, we compared our two models, the linear model and the logistic model, and found that the logistic model had a smaller testing error than the linear model, indicating that the logistic model is a better predictor of Alzheimer's disease. Across the two models, which included sex, education, SES, MMSE, eTIV, and nWBV, the most significant variables were SES and MMSE.
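
As an illustration of the logistic model mentioned above, the sketch below fits a one-variable logistic regression by gradient descent and turns a test score into a probability. It is written in plain Python rather than R, the data are synthetic, and the score range, rescaling, and coefficients used to generate the data are assumptions, not values from the project.

```python
import math
import random

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, n_iter=1000):
    """Gradient descent on the average log loss of P(y=1|x) = sigmoid(b0 + b1*x)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def scale(score):
    """Center and rescale the score so gradient descent behaves well."""
    return (score - 25) / 5

# Synthetic data (NOT the project's data): lower scores are assumed
# to make a positive diagnosis (label 1) more likely.
rng = random.Random(42)
scores = [rng.randint(15, 30) for _ in range(300)]
labels = [1 if rng.random() < sigmoid(-0.5 - 2.0 * scale(s)) else 0
          for s in scores]

b0, b1 = fit_logistic([scale(s) for s in scores], labels)

def predict_probability(score):
    """Estimated probability of a positive diagnosis for a given score."""
    return sigmoid(b0 + b1 * scale(score))

p_low_score = predict_probability(18)
p_high_score = predict_probability(30)
print(round(p_low_score, 2), round(p_high_score, 2))
```

Unlike a linear model fitted to a 0/1 outcome, the logistic model's predictions always lie between 0 and 1, so they can be read directly as probabilities.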

Group 18   Aidan, Annie, Irene, Ishaani, Warren

We set out to test whether a country’s monthly COVID-19 death rate and general quality of life are significantly associated with its unemployment rate during the pandemic. We started with three data sets: cumulative COVID-19 numbers by day per country in 2020, unemployment rate per country by month, and quality of life by country, which includes factors such as a health index. To find the association between quality of life and pre-pandemic unemployment in January 2020, we used backwards selection, which left one significant variable: purchasing power. We then used multiple regression to investigate the association between a country’s purchasing power and its unemployment rate in January 2020. To find the association between COVID-19 deaths, quality of life, and unemployment rate, we used LASSO, which removed all of the quality of life variables. We then used multiple regression, controlling for country, to examine the relationship between unemployment rate and COVID-19 deaths. In the end, we rejected our null hypothesis: overall, higher COVID-19 deaths were associated with higher levels of unemployment. We also found a significant association between each country’s purchasing power index and its pre-COVID unemployment rate. However, there was no apparent relationship between during-pandemic unemployment and the quality of life variables.
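
The way LASSO "removes" variables, as described above, can be seen in a small from-scratch sketch: the penalty shrinks coefficients and sets weak ones to exactly zero. This is plain Python using coordinate descent on synthetic data; the function names, the penalty value `lam=0.5`, and the data are assumptions for illustration, and the group's own analysis was done in R.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def standardize(col):
    """Rescale a column to mean 0 and variance 1."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def lasso(columns, y, lam, n_iter=200):
    """Coordinate descent for (1/2n) * RSS + lam * sum(|beta_j|) on standardized columns."""
    n = len(y)
    cols = [standardize(c) for c in columns]
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # residual when all coefficients are 0
    beta = [0.0] * len(cols)
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(col, residual)) / n
            new_b = soft_threshold(rho, lam)
            delta = new_b - beta[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(col, residual)]
            beta[j] = new_b
    return beta

# Synthetic data (NOT the project's data): y depends on x1 and x2 only.
rng = random.Random(7)
n = 300
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]  # pure noise
y = [3.0 * a - 2.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

beta = lasso([x1, x2, x3], y, lam=0.5)
print([round(b, 2) for b in beta])
```

The coefficient of the noise column is set to exactly zero, while the two real predictors survive with shrunken coefficients. In practice the penalty strength is usually chosen by cross-validation rather than fixed by hand.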

Issues surrounding gender disparities plague STEM fields such as engineering, influencing the gender wage gap in this male-dominated field. To understand the factors behind this wage gap, we investigate a Kaggle dataset of Indian engineers and observe the disproportionate representation of female engineers. We examine potentially relevant variables both graphically and numerically, ultimately finding that the gender wage gap does indeed exist through predictions based on our linear models. Through backward stepwise variable selection, we identify which variables influence engineers’ pay; then, we filter the data into male and female subsets to compare the resulting significant variables based on date of birth, high school exam scores, college information (including college tier, location, graduation year, GPA), and AMCAT exam results (which measure academic ability in English, logical and quantitative ability, as well as various personality traits). The results of these prediction methods shed light on the nuances of the gender wage gap and can provide valuable guidance on combating gender discrimination in the future, particularly for the field of engineering but for other STEM and/or male-dominated fields as well.
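
Backward stepwise selection, used here and in several other projects, starts from a model with every candidate variable and repeatedly drops the one whose removal improves the model most. The plain-Python sketch below uses AIC as the criterion, which is an assumption: the abstract does not say which criterion the group used. The data and variable names are synthetic.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    design = [[1.0] * n] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(p * v for p, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, design)) for i in range(n)]
    return sum((v - f) ** 2 for v, f in zip(y, fitted))

def aic(columns, y):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)

def backward_select(features, y):
    """Drop one feature at a time for as long as doing so lowers the AIC."""
    kept = dict(features)
    best = aic(list(kept.values()), y)
    while kept:
        trials = {name: aic([c for other, c in kept.items() if other != name], y)
                  for name in kept}
        drop = min(trials, key=trials.get)
        if trials[drop] >= best:
            break
        best = trials[drop]
        del kept[drop]
    return list(kept)

# Synthetic data (NOT the project's data): y depends on x1 and x2 only.
rng = random.Random(1)
n = 120
features = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 + 3 * a - 2 * b + rng.gauss(0, 1)
     for a, b in zip(features["x1"], features["x2"])]

selected = backward_select(features, y)
print(selected)
```

The two real predictors are always retained here because dropping either one sharply raises the AIC. The noise variable `x3` is usually dropped, but it can survive by chance, which is one reason selected variables should be interpreted with care.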

Neural networks, the models behind deep learning, have emerged in recent years as one of the next major steps towards creating the best artificial intelligence (AI) systems. In short, neural networks are loosely modeled on the behavior of the human brain, which allows them to recognize patterns and ultimately solve tasks and problems. They are made of layers of interconnected neurons (often called nodes), and models with many such layers are what is called deep learning. Two common types of neural networks are feed-forward neural networks and convolutional neural networks. In this research paper, we explore these two types of neural networks, compare the efficiency and accuracy of the models, and offer insight into why we see the outputs and accuracy that we do.
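
To show what "layers of interconnected nodes" means in practice, here is a minimal plain-Python sketch of the forward pass of a feed-forward network with one hidden layer. The weights are hand-picked so that the network computes XOR; they are not learned, and the example is not taken from the group's models. Convolutional layers are not shown.

```python
def relu(v):
    """Rectified linear unit: pass positive values, clamp negatives to 0."""
    return max(0.0, v)

def identity(v):
    return v

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each node takes a weighted sum of all inputs."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hand-picked (not trained) weights: 2 inputs -> 2 hidden nodes -> 1 output.
W1 = [[1.0, 1.0], [1.0, 1.0]]
B1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
B2 = [0.0]

def forward(x):
    """Feed the input through the hidden layer, then the output layer."""
    hidden = dense(x, W1, B1, relu)
    return dense(hidden, W2, B2, identity)[0]

outputs = [forward([a, b]) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

No single straight line separates XOR's outputs, so the hidden layer with its nonlinear activation is what makes this pattern representable. Training a network means finding such weights automatically from data.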

There has been a recent increase in awareness of sustainability and global warming. Especially during the COVID-19 pandemic, we were curious about the relationship between global temperatures and deaths from certain diseases, and how air pollution plays a role. Ultimately, the goals of our study were to explore the relationships between these factors and predict how these connections might affect future deaths by disease. In conducting our analysis, we used a collection of regression models, heatmaps, scatterplots, barplots, and ANOVA, among others. Our final model showed how average temperature, the percentage of males, the smoking rate, and carbon emissions affect the rate of cancer cases per 100,000 people. Because of factors not incorporated into our final model, we did not come to a definitive conclusion, but we did learn about individual variables. For instance, certain gases, like carbon dioxide, show a downward trend over time, while other gases do not.

Session 3

Group 2   Adithya, Raymond, Sahej, Andrew, Kameron, Alexander

Throughout the 20th and 21st centuries, many pieces of legislation have been introduced to combat climate change; however, there is no easy metric to show just how effective this legislation has been. This is because environmental data generally changes very slowly. Air quality data, however, has been shown to change relatively quickly. As such, this project aims to identify general trends in air quality and cross-reference these changes with contemporary legislation in order to identify which pieces of legislation were the most effective in improving air quality.

We all know that COVID-19 has drastically changed our lives, and there’s no doubt that it has affected the global economy as well. How much, you may wonder, and what will it look like in the future? We’re here to answer that. The goal of our study is to examine the effect that COVID-19 has had on the global economy and to predict economic data for each of the countries in our data set from COVID policy. The three economic variables we chose to examine were the consumer confidence index, the inflation rate (consumer price index, CPI), and the unemployment rate. We used our data to create a double prediction layer model. We separated the data into training and testing sets and compared 4 different types of models to determine the best one for our data. We then used that model to predict future economic data. Overall, for the period of July 2021-December 2021, there is a positive correlation for consumer confidence, a negative correlation for unemployment, and a mostly positive correlation for inflation.

Group 15   Jonathan, Ankita, Aahaan, Tony, Garrett

Police forces and the laws governing them vary by region. Public outcry revolves around racism against minorities, stop-and-frisk practices, and a lack of mental health awareness in our law enforcement system. The goal of our project is to quantify the number of police shootings resulting in civilian deaths by state and compare it to demographic factors based on census data. The majority of our data is from 2015-2016, though it ranges from 2000-2020. However, our study relies on the major assumption that police force is the only variable that changes other than the measurable population differences.

Group 16   Abhinav, Ariston, Aurora, Rosemary, Shaurya

Since 2017, Bitcoin has become one of the most popular cryptocurrencies in the world. It has significantly increased in value along with its influence over popular media, where the increased demand further increased Bitcoin’s value. As Bitcoin becomes a staple in our daily lives, we explore how Bitcoin’s influence on the stock market changes, specifically looking at the correlation between the market and Bitcoin’s value. We compare Bitcoin’s market value to various benchmarks for U.S. stocks, such as the Dow Jones Industrial Average (DJIA), the Nasdaq Stock Market, the Standard & Poor’s (S&P) 500, and the New York Stock Exchange (NYSE). Additionally, we compare Bitcoin with various companies related to the cryptocurrency, cybersecurity, and IT industries, among others, to determine whether any specific index has an impact on the price of Bitcoin and whether there is any relationship between these indices.

Wildfire prediction has proved a challenge due to the ever-changing variables of climate, geographical area, and human interaction. We propose a LASSO model to test whether there are variables that can help us predict the likelihood of a wildfire in a certain area. We use wildfire data from two data sets: one consisting of nearly two million wildfires in the United States over the past 24 years, and another consisting of 517 wildfires from Montesinho Park in Portugal. This approach differs from existing models, as we attempted to predict the recurrence of wildfires rather than their occurrence. However, through our analyses, we found no correlations that could predict wildfire recurrence. This suggests that wildfires do not recur in predictable patterns and may depend on a myriad of variables that vary drastically from place to place.

The purpose of the study was to investigate whether states were inaccurately reporting data on COVID-19 deaths and cases. To test this, we used a chi-squared test to analyze how well data from the New York Times matched the distribution given by Benford's Law. We then plotted the p-values on a heatmap to see which states might be inaccurately reporting data. We also ran a linear regression to see whether inaccurate reporting of cases was correlated with inaccurate reporting of deaths. We found that there were indeed some inaccuracies in the reported data across the states. However, they don't seem to form a clear pattern, and the inaccuracies don't appear to be purposeful.
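
Benford's Law says that in many naturally occurring data sets the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1. The plain-Python sketch below computes the chi-squared statistic comparing observed leading digits with that distribution. As a simplifying assumption, it compares the statistic against the 5% critical value for 8 degrees of freedom (15.507) instead of computing a p-value as the group did, and it uses synthetic numbers rather than the New York Times data.

```python
import math
from collections import Counter

# Expected share of each leading digit under Benford's Law.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-squared critical value at the 5% level with 9 - 1 = 8 degrees of freedom.
CRITICAL_5PCT_DF8 = 15.507

def benford_chi_squared(values):
    """Chi-squared statistic of the leading digits of positive integers vs. Benford."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    n = len(digits)
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())

# Synthetic examples (NOT the project's data):
powers_of_two = [2 ** k for k in range(1, 301)]   # known to follow Benford closely
evenly_spread = list(range(100, 1000))            # every leading digit equally common

stat_benford_like = benford_chi_squared(powers_of_two)
stat_uniform = benford_chi_squared(evenly_spread)

for name, stat in [("powers of two", stat_benford_like),
                   ("evenly spread", stat_uniform)]:
    verdict = "consistent with" if stat < CRITICAL_5PCT_DF8 else "deviates from"
    print(f"{name}: chi-squared = {stat:.2f}, {verdict} Benford's Law")
```

A large statistic only says the digits deviate from Benford's distribution. It does not by itself show deliberate misreporting, which matches the cautious conclusion in the abstract.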