WHARTON GLOBAL YOUTH PROGRAM
PROF. LINDA ZHAO PRESENTS

Data Science Live

Wharton Data Science Academy

9am - 3:30pm

Aug 2 (Friday) 2024

Acknowledgement

Jeff Cai, Neil Fasching,
Timothy Dorr, Richard Patti

Wharton Data Science Academy

Run by the Wharton Analytic Institute, the Wharton Data Science Academy will bring state-of-the-art machine learning and data science tools to high school students. We aim to stimulate students’ curiosity in the fast-moving field of machine learning through this rigorous yet approachable program. Building statistical foundations together with empirical and critical thinking skills will be the main theme throughout. By the end of the program, students will not only be equipped with essential data science techniques such as data visualization and data wrangling but will also be exposed to modern machine learning methodologies that are building blocks for today’s AI field. Along the way, students will develop a working proficiency with the R language, which is among the languages most widely used by professional data scientists in both academia and industry.

We believe data science is not just a collection of techniques; it is foremost motivated by real-world problems. The data scientist of the 21st century must be able to identify relevant problems, provide sensible analyses, and ultimately communicate their findings in meaningful ways. All learning modules are based on real-life case studies.


Abstract

 


Political polarization in the United States has risen rapidly, and the current socio-economic climate is a contributing factor. Proposed explanations include an increase in foreign-born residents and rising income inequality. We use data from various national surveys and government databases to analyze the correlation between these socio-economic factors and political polarization. Our study focuses on identifying trends and patterns over the past few decades, examining the impact of demographic shifts, economic disparities, and other relevant variables. We aim to predict the future political climate of the United States using a linear model with stepwise variable selection, LASSO, relaxed LASSO, random forest, boosting, and neural networks. The results of this study could inform efforts to reduce the effects and intensity of political polarization in the future.

Group 2   Emily, Anisa, Kelis, Michael, Shreshth


Recidivism, the act of a convicted criminal reoffending, remains a significant issue in the criminal justice system: high recidivism rates indicate failures in rehabilitating offenders and reintegrating them into society. The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) algorithm, used in states such as Pennsylvania, New York, and California, predicts the likelihood of reoffending and has been a prominent tool in addressing these issues. However, studies have revealed that COMPAS unfairly identifies Black defendants as having a higher risk of recidivism than white defendants, raising serious ethical concerns regarding bias and fairness in the application of machine learning in criminal justice.

Our objectives in this study are to develop a recidivism prediction model using logistic regression, neural networks, and random forest to determine whether a defendant is likely to recidivate, to evaluate the accuracy of the COMPAS score, and to identify the factors that contribute most to recidivism. We obtained data from the Florida Department of Corrections that include detailed information about defendants in Broward County, such as sex, age, race, and their COMPAS risk assessments, for 2013 and 2014. We conduct a thorough analysis to compare the performance of our model with COMPAS and address potential biases. The findings will contribute to the ongoing discussion about the use of algorithmic tools in the criminal justice system and the necessity of ensuring fair and just outcomes for all individuals in the system.


The American Dream is a key tenet of American pride: our country holds itself high as the land of opportunity, where anyone can drastically improve their financial and social standing in a rags-to-riches story. In this paper, we investigate the degree to which the American Dream holds true and how its conditions have evolved since 1984, focusing on the degree to which social mobility is available to low-income households. To provide insight into how America could improve its vertical mobility, poverty rate, and overall prosperity, we use several models to analyze the effect of key measures across other high-income countries and predict how changes in those measures would affect the country. The variables used to characterize social mobility include median income, house price, disposable income, the Gini index, tuition costs, education quality and access, welfare spending as a percentage of GDP, monthly Medicare premiums, the percentage of children who earn more than their parents, and the share of wealth held by the top 1% compared to the bottom 50%. This study fits linear models, LASSO models, relaxed LASSO models, trees, and random forest models to predict social mobility, poverty rate, and GDP per capita, but uses linear models for the predictions for America because they offer more interpretable variables, confidence intervals, and the lowest error. This study provides a crucial look into the current state of upward financial mobility for current and future generations, as well as recommendations for what the country should learn from other high-income countries to restore the promise of the American Dream. This paper opens the door for future research into how such policies and recommendations could be implemented and what their real effects would be.

Group 4   Kattie, Junwoo, William, Iris, Saanvi


Productivity is a crucial driver of economic growth and a major determinant of the strength of a country. Policymakers and economists use output measurements to assess a country’s productivity level and enact structural policies for its economy. For instance, Greece addressed its productivity decline from 2008 to 2009 through labor market reforms, encouraging competition, and investment in various sectors. There is currently no clear consensus on all the factors that are significant to productivity growth. This paper uses total factor productivity (TFP), a measure of operational efficiency derived from gross output, capital and labor input, and their elasticity, as the criterion for optimal productivity growth. We first model the current state of country-level productivity, then apply regression and machine learning models to data from the World Bank Book and The Conference Board to identify significant correlations between TFP growth and candidate variables. We evaluate previously proposed criteria and highlight how prominently they contribute to TFP growth in our findings. We conclude that imports and exports of goods and services, population distribution, and inflation have the largest effects on productivity growth. We find that the random forest model is the best predictive model because it attains the lowest mean squared error. The ensemble model reports a lower mean squared error than all individual models except the random forest, suggesting that combining models has predictive potential.
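As a rough illustration of how TFP can be derived from output, inputs, and their elasticities, the sketch below assumes a Cobb-Douglas production function. The functional form and the symbols are assumptions made for illustration, not necessarily the specification used in this study.

```latex
% Assumed Cobb-Douglas form: Y = gross output, K = capital input, L = labor input,
% \alpha and \beta = output elasticities of capital and labor, A = TFP.
Y = A \, K^{\alpha} L^{\beta}
\quad\Longrightarrow\quad
A = \frac{Y}{K^{\alpha} L^{\beta}}

% TFP growth is the part of output growth not explained by input growth.
\Delta \ln A = \Delta \ln Y - \alpha \, \Delta \ln K - \beta \, \Delta \ln L
```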

Group 5   Carter, David, Jeremy, Aiden, Michael


Depression is increasingly recognized as a pressing issue in today’s society. Currently, the WHO estimates that 4% of the population suffers from depression, and 700,000 people die each year from suicide. In fact, suicide is the fourth leading cause of death in people aged 15-29. Despite the availability of effective treatments, many individuals suffering from depression go undiagnosed and untreated due to economic inequality and social stigma. Social media presents a promising solution to this issue. Today, social media use is extremely widespread, and users’ data presents a cheap and accessible source of information that can be used to identify signs of depression. Thus, this study aims to predict and identify signs of depression based on a Twitter user’s activity. The data consists of a user’s tweet history, follower count, friend count, and other information. To analyze the numerical data, we apply logistic regression, LASSO, decision trees, random forest, and neural networks. For the tweet data, we apply BERT classification, bag-of-words, and other textual analysis techniques to predict depression and identify words and phrases that potentially indicate it. We found that the number of followers is a significant negative factor, while the number of favorites is a significant positive factor. While the quantitative models had low accuracy, our best text-mining models achieved testing accuracy close to 93%.


At its core, American democracy values the voices of the people and encourages civic participation. While politics, and presidents specifically, previously felt far removed from citizens’ lives, the internet has made the relationship between presidents and their constituents far closer and more interactive. Now, many presidents frequently and even casually use X (formerly Twitter) to communicate directly with citizens. However, not all interaction is beneficial. Polarization and political violence have skyrocketed, as demonstrated by the contentious 2020 elections and the subsequent January 6 riot at the Capitol. Many studies have shown that support of digital content translates to similar behavior offline. Relatedly, many have speculated that presidential communication through Twitter fueled the violence of January 6. Thus, we aim to investigate how Biden’s and Trump’s tweets from 2020 until the insurrection differ in rhetoric and change over time, and how these characteristics affect user support. We focus on Twitter because it is one of the few platforms where presidents write original content and frequently communicate with the online community. We retrieve tweets from Biden and Trump with information about the raw text, time published, number of likes, and number of retweets. To identify characteristics, we perform sentiment analysis and mine for emotion type, intensity, positivity, negativity, toxicity, and violence using a variety of tools such as BERT, VADER, Detoxify, and the Grievance Dictionary. To quantify support, we use likes and retweets, which solely convey support, unlike comments. We use multiple linear regression and LASSO to train and test models. By comparing tweet characteristics and tracking them over time, we can see what types of tweets garner the most support. Our research will improve the understanding of factors that drive engagement.
While Biden and Trump will not both be participating in future elections, the highly charged political environment of 2020 is still present today. As polarization worsens in America and political dialogue grows progressively digital, it is all the more crucial to understand how social media is used. Based on conclusions from our research, future candidates can tailor their social media strategies to better connect with voters, and social media platforms can consciously regulate political content to create a healthy environment for democratic discourse.


Voter turnout rate, the percentage of the voting-eligible population that actually votes, is a crucial factor that informs election outcomes, political campaign strategies, and policy matters. While discussion of the demographic factors that influence an individual’s chance of voting is prevalent, quantitative knowledge of the socioeconomic factors that contribute to a given state’s voter turnout would allow presidential campaigns to determine which states are most worth targeting. By analyzing an aggregated dataset that includes violent crime rate, Human Development Index, partisan control of state government, median income, unemployment rate, and Gini coefficient for all fifty states in presidential election years from 2000 to 2020, we hope to determine the magnitude of influence that these factors have on the voter turnout rate in any given state. We employ machine learning techniques such as LASSO, XGBoost, ARIMA, and a relaxed linear regression to develop a model that can predict voter turnout by state for presidential elections using quantitative data on these socioeconomic factors. While our ultimate goal is to predict voter turnout for the 2024 presidential election, this research can also serve as a stepping stone to improving voter turnout rates, a necessary step to ensure a thriving democracy.

Group 8   Erin, Isabella, Michael, Kiela, Ethan


Characterized by its high costs and stratified nature, the United States healthcare system has long been a topic of discussion. This research aims to identify the socioeconomic factors that contribute to the quality of healthcare. Utilizing data from the American Community Survey and government databases, we examine the correlation between socioeconomic status (SES) indicators, such as income, education, and unemployment rate, and healthcare outcomes. We created various models using linear regression, random forests, and neural nets to predict the quality of healthcare based on our selected socioeconomic factors. We use bagging, backward elimination, and LASSO to enhance prediction accuracy, and ultimately determine the best predictive model. Our findings provide policymakers with further insights into the economic factors driving healthcare disparities in the US, aiding in more informed decision-making and more focused areas for future research.


Despite having the world’s largest and most developed economy, the United States of America is notorious for having high crime rates. For instance, the US ranked 10th in homicide rates in 2024 with a homicide rate of approximately 6.81 per 100,000 people. Our project tackles this prominent problem: by analyzing crime data at the state and city level, we can assist lawmakers and government officials in keeping America safe. In our project, we utilized three datasets: one with every instance of crime in Philadelphia during 2024, one with law enforcement and infrastructure data for each state in 2022, and one with each state’s laws on gun control from 2022-24. We acquired the first dataset from Kaggle and manually assembled the latter two datasets using various sources. Using logistic regression on the Philadelphia crime dataset, we created a model that could predict the severity of a crime at a certain location and time in the city, allowing officials to allocate their law enforcement resources more efficiently. Analyzing the other two datasets, we implemented multiple linear regression, LASSO, dimension reduction, and backward selection to find state laws and characteristics that most effectively reduce crime. Finally, we applied random forests and neural networks to make our three models more accurate. Combined, our models have the potential to significantly influence policy-making and resource allocation, leading to a reduction in crime and, as a result, a safer America.

Group 10   Ray, Alexis, Grant, Max, Aneesh


The purpose of this study is to identify the factors most closely associated with an increased risk of car crashes. The dataset used consists of 800,000 observations (more than 40 million cells) of car crashes within Chicago. The variables include the time and day of the crash, weather conditions, the region, the posted speed limit, the injury severity, and the number of injuries. Various statistical techniques were applied to the dataset, including histograms, geographic distributions, and modeling techniques such as Poisson and negative binomial regression, LASSO, random forests, and neural networks. The final findings from all our models show that the primary contributory variables to crash severity are the posted speed limit, reckless driving, and four-way intersections (a level of the trafficway type variable). Based on these findings, policy recommendations were offered to target these root causes, which could help save individual lives and reduce the $300 billion spent annually on car crashes in the US.

Group 11   Nathanael, Joshua, Mark, Alice, Emily


Earthquakes are among the most destructive natural disasters, capable of causing significant loss of life, infrastructure damage, and economic disruption. While we may feel more prepared for these situations because of innovations such as effective warning systems, new building codes, and public education, maneuvering through an earthquake in the moment remains a challenging and complex task due to its inherent unpredictability. However, recent developments in data science, machine learning, and sensor technologies offer new opportunities to enhance our understanding and forecasting capabilities, allowing us to answer the question: “What Determines Preparedness During an Earthquake?” Our group aims to determine what factors contribute to how prepared an individual is to take the necessary actions to survive an earthquake, in order to reduce the large number of casualties that occur in earthquakes around the globe year after year. We use logistic regression to predict certain actions one could take in such a situation, trees to detect patterns in the qualities of these individuals, random forests to improve the accuracy of our predictions, and LASSO to refine our models and observations. This study will predict which individuals are most likely to be well-prepared for natural disasters, fundamentally changing how we view safety protocols and helping people avoid harm from earthquakes.

Group 12   Govind, Jaisimh, Shuwen, Andy, Shujia


Lung cancer is one of the most prevalent types of cancer and is most commonly associated with smoking. However, we were curious about other factors that may influence lung cancer rates, such as socioeconomic characteristics, including poverty rates, and environmental pollution levels, particularly ozone pollution. We examined lung cancer rates by county using data from the INCD and the US Census. In conducting this study, we used linear regression models and logistic regression to identify the factors with the most significant effect on lung cancer rates. After choosing a suitable dimension for the linear model through backward selection, we modeled the effect of poverty rates and ozone pollution on the rate of lung cancer. We found that higher poverty rates are associated with lower odds of being in high ozone areas, but this relationship was not statistically significant. Additionally, our analysis indicates that while poverty rate is a predictor of lung cancer incidence, it explains only a small fraction of the variability, suggesting that other factors also play significant roles in determining lung cancer rates.

Group 13   Kaya, Tyler, Lilly, Irene, Charles


Efficient and readily accessible public transportation is a crucial aspect of life in metropolitan cities. Organizations such as Citi Bike aim to satisfy this need by providing active transportation share systems to urban areas. By analyzing datasets from New York City’s Citi Bike program for the years 2013 and 2023, this paper aims to evaluate the effectiveness of the most widely used bikeshare system in one of the largest and most internationally significant cities in the world, as well as interpret its patterns and implications for urban development and city planning. Utilizing these publicly available datasets, we conduct detailed statistical analyses to assess changes in usage patterns over time, trip duration, distances traveled, and user demographics. For example, we build models that predict how long a bike is used and the distance covered by each customer given various factors. This information will aid city planners in deciding whether to adopt shared bike systems in their cities and in choosing the optimal number of bikes to deploy. Ultimately, our findings highlight the significance of bikeshare programs in the development of cities and transportation networks through data analysis and interpretation.

Group 14   Richard, Abhi, Jerry, Eric, Alex


Our model aims to identify the most important factors that determine the outcome of an international FIFA match. Identifying these factors can enhance the planning of game strategy, improve team performance and training outcomes, and optimize resource allocation for better results. We use the dataset international_matches.csv, which was downloaded from Kaggle. This dataset contains results and statistics of all FIFA matches since 1993, although we only consider matches after 2012. Even when restricted to the matches relevant to our study, the dataset still contains thousands of observations, and if the existing data does not suffice, we can supplement it with extra information about country rankings from FIFA. The data include country, home team, away team, scores, tournament type, shootout, and ratings that change with each game. Our method is logistic regression, coding a big win for the home team as 1 and a big loss for the home team as 0, with a threshold separating these from small wins, small losses, and draws.
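The labeling scheme described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the two-goal threshold, the choice to set aside matches inside the threshold, the single rating-difference predictor, and the made-up rows are all hypothetical and are not taken from the study or from the Kaggle file.

```python
import math

BIG_MARGIN = 2  # hypothetical threshold: a "big" result is decided by 2+ goals


def label_match(home_goals, away_goals, big_margin=BIG_MARGIN):
    """Return 1 for a big home win, 0 for a big home loss, None otherwise."""
    diff = home_goals - away_goals
    if diff >= big_margin:
        return 1
    if diff <= -big_margin:
        return 0
    return None  # small win, small loss, or draw: set aside (an assumption)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y          # gradient of the log loss w.r.t. b0
            g1 += (p - y) * x    # gradient of the log loss w.r.t. b1
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Made-up rows: (home_goals, away_goals, home_rating_minus_away_rating)
matches = [(3, 0, 1.2), (0, 2, -0.8), (1, 1, 0.1), (4, 1, 0.9),
           (1, 3, -1.1), (2, 1, 0.3), (0, 3, -0.4), (2, 0, 0.6)]

# Label every match, then keep only the big wins and big losses.
labeled = [(r, label_match(h, a)) for h, a, r in matches]
kept = [(r, y) for r, y in labeled if y is not None]

b0, b1 = fit_logistic([r for r, _ in kept], [y for _, y in kept])
```

In this toy data a positive fitted slope `b1` means that a larger rating advantage for the home team raises the estimated probability of a big home win. A real analysis would use the full set of predictors and a held-out test set.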

Group 15   Yiran, Xiaofei, Tanvi, Janelle


As society advances and higher education grows more selective, admission to tertiary education has become much more complex. Applying to college is now more difficult, and access to educational opportunities is closely related to social equality. In this research project, we analyzed data on 488 schools with 168 variables, examining factors related to acceptance such as ethnicity, standardized test scores of applicants, and traits of the schools themselves. Using these expansive datasets, we applied linear, tree, and neural network models to understand acceptance rates. By approaching this project from the traits of both applicants and schools simultaneously, we were able to build a prediction model applicable to specific applicants and colleges, which in turn yields insight into the socioeconomic gaps stemming from educational inequalities.