Wharton Data Science Academy
Run by the Wharton Analytic Institute, the Wharton Data Science Academy brings state-of-the-art machine learning and data science tools to high school students. We aim to stimulate students’ curiosity in the fast-moving field of machine learning through this rigorous yet approachable program. Building statistical foundations together with empirical and critical thinking skills is the main theme throughout. By the end of the program, students will not only be equipped with essential data science techniques such as data visualization and data wrangling but will also be exposed to modern machine learning methodologies, which are building blocks of today’s AI field. Along the way, students will develop a working proficiency with the R language, which is among the languages most widely used by professional data scientists in both academia and industry.
We believe data science is not just a collection of techniques; it is foremost motivated by real-world problems. The data scientist of the 21st century must be able to identify relevant problems, provide sensible analyses, and ultimately communicate their findings in meaningful ways. All learning modules are based on real-life case studies.
Schedule & Abstract
Time | Session
---|---
9:00 - 9:45 AM | Session 1: Jeff
9:50 - 10:35 AM | Session 2: Angela
10:35 - 11:00 AM | Break
11:00 - 11:45 AM | Session 3: Edward
12:00 - 2:00 PM | Lunch
2:00 - 2:45 PM | Session 4: Matt
2:50 - 3:35 PM | Session 5: Neil
Group 1 Hyungjae, Wendy, Sudhish, Hagen
With over 8 million artists and 60,000 new songs per month on Spotify (the world's largest music streaming platform), the music industry is as competitive as ever. Music producers are in constant search of the perfect formula for a hit song, so it would greatly benefit producers and artists to determine which features of a song influence its “hitness”. This study uses data from Billboard, Spotify, and Genius to analyze the key factors that make a song successful and to construct a model that can predict this success. The dataset, which includes Spotify audio features, Billboard artist popularity, and lyric sentiment, is used to train five models (Ordinary Least Squares Regression, Logistic Regression, Logistic Lasso Regression, Deep Neural Network, and Random Forest). The prediction models suggest that factors such as the popularity of an artist, speechiness (degree of speech-likeness), valence (musical positivity), and danceability are important in determining a hit song. However, the low accuracies (roughly 61% against a baseline of 53%) suggest that the factors investigated in this study fail to fully represent a song and its “hitness”. The study indicates that music analysis in data science still has limitations.
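To make the comparison between a model's accuracy and a baseline concrete, here is a minimal Python sketch. It assumes the baseline is the accuracy of always predicting the most common class, which the abstract does not state; the labels and predictions are made-up toy data, not the group's results.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


# Hypothetical toy data: 1 = hit, 0 = not a hit.
labels = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
predictions = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1]

baseline = majority_baseline_accuracy(labels)
model_acc = accuracy(predictions, labels)
lift = model_acc - baseline  # how far the model beats the naive baseline

print(f"baseline={baseline:.3f} model={model_acc:.3f} lift={lift:.3f}")
```

A model is only informative to the extent that its accuracy exceeds this kind of naive baseline, which is why a 61% accuracy reads very differently against a 53% baseline than it would against 50% or 60%.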
Group 2 Arnav, Kevin, Christina, Keryssa, Tristan
The entertainment industry has taken the world by storm. For the past several decades, the film industry has remained one of the most profitable yet unforgiving entertainment industries in the world. Alongside the immense monetary potential of filmmaking comes an equally large risk posed by the production and advertising costs required to create a film with potential for success. But are the massive budgets often associated with high-earning films really that important to box office success? Through analysis of movie data taken from IMDb and Rotten Tomatoes, our goal is to determine just how important a movie's budget is and to reveal which other factors play a role in a film’s gross income, including genre, country of origin, and critic appeal. Using the R and Python programming languages, we applied techniques including Exploratory Data Analysis, Neural Networks, and Random Forest to find trends in movie data ranging from 1980 to the present, and to create predictive models that estimate the monetary success of any movie that falls within the bounds covered by our existing data. With our models, we aim to predict the approximate gross of a movie and determine which factors affect it.
Group 3 Aarav, Daniel, Justin, Arnav, Kaitlyn
Technology has made investment information readily accessible, and this allows markets to perform more efficiently than in the past. However, information discrepancies persist. For instance, company insiders (e.g., CEOs, directors, and board members) have more knowledge about their corporation than most investors. This is why company insiders are legally mandated to disclose their trades to the SEC, to prevent them from abusing their information edge. Still, insiders can often leverage their information edge to reap excess profits. This study seeks to uncover whether legal insider trading data can be used to predict the price action of a company’s stock. To do this, we scraped 10 years of insider trading data from openinsider.com and Yahoo Finance. We then analyzed how insider trade features correlate with price action ranging from 1 day to 6 months into the future. Leveraging these relationships, we developed Rule-Based and Random Forest trading models to predict future stock price changes. We tested the strategy using k-fold Cross-Validation, and we found that both the Rule-Based and Random Forest models outperformed the S&P 500 on a risk-adjusted basis (the top model produced a 30.2% annual return with a 1.6 Sharpe ratio). Ultimately, this project suggests that insider trading data can be used to gain excess returns. Knowing this can help bridge the information divide between company insiders and average investors.
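For readers unfamiliar with the Sharpe ratio quoted above, the following Python sketch shows one common way to compute it: mean excess return divided by the standard deviation of excess returns, scaled to a yearly figure. The function name, the monthly frequency, the zero risk-free rate, and the sample returns are all illustrative assumptions, not the group's actual method or data.

```python
import math
import statistics


def annualized_sharpe(period_returns, risk_free_per_period=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from a list of per-period returns.

    Assumes returns are simple per-period returns (e.g. monthly) and that
    scaling by sqrt(periods_per_year) is an acceptable approximation.
    """
    excess = [r - risk_free_per_period for r in period_returns]
    mean_excess = statistics.mean(excess)
    stdev_excess = statistics.stdev(excess)  # sample standard deviation
    if stdev_excess == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return mean_excess / stdev_excess * math.sqrt(periods_per_year)


# Hypothetical monthly returns for a strategy (made-up numbers).
monthly_returns = [0.02, -0.01, 0.03, 0.01]
sharpe = annualized_sharpe(monthly_returns)
print(f"annualized Sharpe ratio: {sharpe:.2f}")
```

Because the ratio divides return by volatility, a strategy with a high raw return can still have a modest Sharpe ratio if its returns swing widely, which is what "risk-adjusted" refers to.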
Group 4 Arushi, Kevin, Nathan, Peter
Every year, around 10 million high school seniors scramble through their college applications with the hopes of achieving financial well-being, career stability, and overall happiness. To examine this assumption on a large scale, we chose to analyze the effects of high school and college completion on societal welfare as a whole. Although societal welfare may seem hard to measure, we were able to quantify it to a certain degree using probability and statistics. We applied exploratory data analysis, linear regression, ANOVA, LASSO, and deep learning to determine whether education is statistically significant in evaluating several different measures of success within a county. Using county-level socioeconomic data acquired from the FBI and the US Census Bureau, we modeled the unemployment rate, poverty rate, crime rate, and median household income by county using education data along with 22 control variables (such as race and population). To determine which variables mattered, we fit 4 LASSO models with cross-validation, which shrinks the coefficients of variables with little predictive value to zero. We then used a linear regression model with backward selection to remove any remaining extraneous variables. Lastly, we used deep learning in the form of a neural network to predict the success metrics of a county based on a collection of contributing socioeconomic variables. Our final models revealed that educational attainment, at both the high school and undergraduate level, does have a significant association with a US county’s median household income, unemployment rate, poverty rate, and crime rate. Results from this study can give students a quantifiable representation of the importance or unimportance of both high school and higher-level education for their future.
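The way LASSO sets some coefficients exactly to zero can be illustrated with the soft-thresholding operator, which is the LASSO solution in the simplified case of a single standardized predictor (or an orthonormal design). The Python sketch below is an illustration under that simplifying assumption only; the variable names, coefficient values, and penalty are hypothetical and are not taken from the group's models.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (b - z)**2 + lam * abs(b) over b.

    Shrinks z toward zero by lam and returns exactly 0.0 when |z| <= lam.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients for three made-up variables.
ols_coefficients = {"education": 0.80, "population": 0.05, "region": -0.30}
penalty = 0.10  # hypothetical LASSO penalty (lambda)

lasso_coefficients = {
    name: soft_threshold(value, penalty) for name, value in ols_coefficients.items()
}

# Variables whose coefficient survives the penalty are the ones "selected".
selected = [name for name, value in lasso_coefficients.items() if value != 0.0]
print(lasso_coefficients)
print(selected)
```

In practice the penalty is usually chosen by cross-validation, as the abstract describes, and a larger penalty drops more variables.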
Group 5 Jason, Grace, Kathy, Joshua
Soccer, or football as it is known outside of the US, is the most popular sport in the world, with an estimated 250 million players active in over 200 countries. The popularity of the sport has also given rise to a flourishing gambling industry, where fans flock to both brick-and-mortar betting stores and online betting websites to place bets on which team they think will win. The aim of our study was to use statistical analysis and computational techniques to create a model that can predict which team will be the most successful. Using multiple linear regression as well as LASSO and backward selection, we created a predictive model based on statistics from the top 20 teams from the English Premier League from 2000 to 2021. We also attempted to build a neural network on match data as another predictive model, although debugging it ultimately proved infeasible due to time constraints. From our analysis, we found that penalty kicks made, goals per shot on target, attendance, shots on target against, clean sheet percentage, number of players used, save percentage, and shots on target were the most statistically significant predictors of a team's success. With further improvement, especially to the neural network, our models could serve as a useful predictor in soccer betting.
Group 6 Chloe, Katherine, Ziyue, Maxwell, Jordan
We chose to analyze data on the Bechdel test because it opens up conversations about the gender disparities depicted in movies and other fictional works, and highlights the predominance of men in the entertainment industry. The goal of our study is to analyze which factors contribute most to whether movies from 1990 to 2013 pass the Bechdel test. This includes analyzing which types of movies are more likely to pass the test and whether more recent movies are more likely to pass it, given changing ideals and priorities in society. To explore our data, we created histograms, bar graphs, box plots, and scatter plots; to model our data, we used logistic regression and text mining. We found that more recent movies are more likely to pass the Bechdel test, that movies that passed the test had lower budgets, that movies generally earned a similar domestic gross income regardless of whether they passed the test, and that movies that fail the test receive higher IMDb ratings. From our logistic regression, we found that budget, IMDb rating, domestic gross income, and genre are the most significant variables in our study. By text mining the movies’ plots and creating word clouds, we found the words that most positively or negatively correlate with passing the Bechdel test.
Group 7 Mike, Jun, Emil, Pratyanch
Everyone from young to old uses some form of social media, whether to check the news, discuss popular topics, or explore new opinions. As the platforms’ use continues to increase exponentially, especially in political contexts, an opportunity arises for us to analyze current trends and predict user behavior. We explore the categorization of social media posts and comments scraped from popular platforms such as Twitter and Reddit. Using three separate techniques of increasing complexity (K-Means Clustering, Principal Component Analysis, and Shallow Neural Networks) allows us to build progressively more meaningful classification algorithms. We can then use such classifiers to determine the topic a post or comment focuses on and predict a user’s likely interests, which is how many social media companies’ content recommendation algorithms work.
Group 8 Jonathan, Yejune, Isaiah, Emily
You might be wondering why you of all people are stuck in an unhappy marriage. Although your marriage is beyond saving, our EDA and data analysis may help you determine why (for closure). Using a marital satisfaction dataset along with divorce statistics from the ESS (European Social Survey), we found a significant link between marital satisfaction and a variety of factors, including education, personal sentiment, religion, and more. The analysis uses algorithms like multiple linear regression and logistic regression to understand trends in both satisfying and poor marriages. Additionally, we built a Neural Network that can predict whether you will have a happy marriage. With between 90% and 99% confidence, depending on the models, our analyses show many associations between a variety of factors and marital satisfaction.
Group 9 Ethan, Sophia, Tegan, Maggie, Kevin
Throughout modern history, many different factors have affected the health of people all over the world. Health is the backbone of life, and multiple factors affect one’s health at all times: diet, activity, genetics, etc. However, there are also a number of less evident factors that may have an equal or even greater effect on health, and these are the ones we decided to explore further. We specifically examine niche health variables from the GSS dataset, which is based on random surveys of people across the U.S.A., including confidence in medicine, religious preference, opinion of family income, etc. Potentially relevant variables are examined both graphically and numerically. The goal of our project is to identify factors that may influence health conditions, interpret these variables, and hopefully predict choices that people could make to minimize the risk of health conditions. Using linear modeling, ANOVA, and other analytical methods, we provide new information about distinct variables that have a surprising correlation with health conditions.
Group 10 Jayden, Rohan, Samara, Elena
Startups are a driving economic force, with some estimates holding that the value of startups is on par with the GDP of a G7 economy. Moreover, some estimate that high-potential/high-growth firms account for nearly half of all new jobs, and startup funding has grown likewise. However, not all startups are successful; in fact, approximately 90% of all startups fail and 10% fail within the first year. Considering the massive impact startups have on the economy, understanding what factors influence the success of a startup is important on both a personal and a global scale. While we analyzed the usual considerations of funding and education, we also analyzed how networking and previous experience affect a startup’s future success. In addition, we looked at locations at the city and state level to see where most new startups are concentrated. We found that certain universities, experience in management, total money raised in valuation, and particular industries and locations are more strongly correlated with success. The methods we used include multiple linear regression models, logistic regressions, backward selection, decision trees, ANOVA, boxplots, and scatterplots. These prediction methods can help startups re-evaluate their company goals as they assess their probability of success. In the future, this research can be expanded by considering whether certain degrees are correlated with success for different types of startups, depending on industry.
Group 11 Arnav N, Josh, Jason, Vivian, Edward
Gun violence has been one of the most divisive topics in recent memory in the United States. With increases in the number of mass shootings and in suicide rates over the pandemic, people on both sides of the debate continue to argue over whether tightening gun possession laws will reduce gun violence or leave vulnerable people unprotected. But is the imposition of more restrictive gun laws really necessary? Or are there other socioeconomic factors that should be addressed before eyeing new legislative measures? We plan to address this question by examining various factors for the United States and 180 other countries, including education and poverty rates, GDP per capita, the Gini coefficient, and the Gun Law Restrictivity Index (GLRI), a measure of how restrictive a country’s possession laws are. Using Exploratory Data Analysis, Multiple Linear Regression, Backward Selection, Trees, and Random Forest, we hope to create a model that can identify which variables are associated with increased gun violence and predict what levels of gun violence a country will experience. As a result of our study, we hope that government leaders can form a sound plan against gun violence and gain the insight to create change that lessens the number of victims in the United States.
Group 12 Ian, Alex, Anjali, Emily-Jane, Liam
The film industry is one of the most profitable industries in the world. Talented actors and dedicated directors from across the world have come together to create billion-dollar films that have pleased audiences for years, and the payoff can be enormous. We wanted to identify which factors in a film contribute most to high gross revenue. Such factors include genre, word count in the title, and release date. To identify the impact of these factors, we used linear models, trees, and neural networks. Identifying which factors contribute most to high gross revenue would help us better understand how the movie industry works; it would also help us make sense of human behavior relating to the film industry.
Group 13 Karen, Christine, Jennifer, Brian
The United States' retail gasoline prices have fluctuated dramatically over the past two decades. Recent events such as COVID-19 and the Russia-Ukraine conflict have sharply decreased and increased gasoline prices, respectively. With gasoline prices reaching record highs, American citizens' concerns are also growing, driving divides between political parties, states, and countries. Our study aims to identify the important factors that influence retail gasoline prices in the United States using multiple linear regression, LASSO, text mining, and random forest. We hope that the results from this study can show with confidence which factors influence the price of retail gas and can be used to predict future gasoline prices.
Group 14 Aiwen, Giselle, Anvesha, Akaash
Houses are some of the most valuable investments that many people will make. Since the housing market is closely linked to the general economy, its effects spread from real estate agencies, to homeowners, to land and construction jobs, to local shops and other services. The goal of this study was to explore the relationships between housing market behavior and social media activity on Twitter. Twitter has proven to be a useful source of data thanks to its widespread availability. After collecting data on housing price indices, volume of existing housing sales, and Twitter activity volume and sentiment, we analyzed data between January 2008 and April 2022. We used simple and multiple linear regression models to examine the relationships between Twitter and the housing market. Our final models showed that lower housing prices are correlated with higher frequencies of Twitter posts and more positive Tweets, and that lower volumes of housing sales are correlated with higher frequencies of Twitter posts. These results may help real estate agencies and prospective homebuyers make more informed investment decisions based on social media activity.
Group 15 Justin, Lauren, Varun
In our project, we used a dataset consisting of information about students attending a university in Japan. The dataset included demographic information and survey answers about the mental health of each student, recording their feelings with yes-no questions and on scales of 1-5, 1-6, or 1-7. Using this dataset, our goal was to determine which factors had the highest impact on the mental health of students; this information would be invaluable in informing university officials and medical professionals of potential warning signs, allowing them to diagnose students more quickly and provide them with the care they desperately need. To determine the factors, we first cleaned the data, removing columns that wouldn’t be useful to our conclusion. Then, we used the dplyr and ggplot2 packages to filter the data to important columns and plot their relationships to each other. We found signs of a relationship between gender, self-reported guilt, relationship status, and depression. We also graphed the distributions of demographic and help-seeking information. After the EDA process, we moved on to linear regression and multiple logistic regression between many factors. We used ANOVA tests to determine the significance of each column of data and backward selection, only including the factors that were statistically significant at the 0.1 level. For the linear model, we examined the residuals plot, the QQ-plot, and the scale-location plot; after examination, we determined that the model meets the assumptions of normality, homoscedasticity, and linearity. In our neural network model, we examined the effects of most of the variables on whether or not the student felt suicidal; the model showed a 79% testing accuracy and a 73% validation accuracy.
Based on the results, we concluded that not all social comfort factors have the same impact on the mental health of students; for example, the fear that foreign students might feel in another country might affect their mental health to different degrees depending on their background. Another important conclusion was that religion was an important factor in reducing mental illness; students who were religious had a lower chance of experiencing suicidal ideation than students who weren’t religious. Finally, we found that the support network a student has around them matters; students who didn’t feel comfortable talking to their relatives or their intimate partners felt worse than those with a supportive network. With these statistics in hand, we now understand how important it is for students to feel “at home”, especially in a high-stress learning environment. In a new university with new people in a different country, every problem can seem magnified and insurmountable. Providing a comprehensive support network for students will help protect them from those harmful thoughts so that they can instead succeed in the world.