STAT 471/571/701
"In the past, one could get by on intuition and experience. Times have changed. Today the name of the game is data" writes Steven D. Levitt. Data mining and statistical analysis have suddenly revolutionized many real life aspects like politics, housing, health analytics, criminal justice and sports...
Through data preparation, statistics, predictive modeling, and machine learning, data mining seeks to address problems within individual sectors and across the economy at large.
STAT 471/571/701 (Modern Data Mining), created and taught by Prof. Linda Zhao of the statistics department, provides a thorough treatment of data mining techniques such as data cleaning, exploratory data analysis, predictive modeling, and machine learning. The course is designed to provide an overview of these techniques, with a primary focus on analyzing real datasets. The students are self-selected: they are analytically strong and mostly enthusiastic about data science. The class is a well-balanced mixture of upper-level undergraduates, Wharton MBAs, and MS and Ph.D. students from across campus. At the end of the semester, students are required to conduct a study of their choice that tackles a real-world problem using the techniques they have learned.
Keynote Speaker Ren Zhang
Ren is the Data Science Director at Amazon in charge of Personalization. Her team develops the products, engineering platform, and applied science algorithms that drive the personalized shopping experience at Amazon.com.
Prior to that, Ren was the Chief Data Scientist for BMO Financial and head of its AI Center of Excellence, in charge of driving adoption of AI and Machine Learning capabilities across the enterprise. Ren has over 15 years’ experience as a senior AI leader within various financial organizations. Prior to BMO Financial, she worked at Prudential Financial, where she was Vice-President and Head of Data Science. Prior to Prudential, Ren was Executive Director of Data Science and Innovation at Commonwealth Bank of Australia. At American Express, she held progressively senior leadership roles spanning credit risk management, loyalty analytics, fraud risk strategy, and risk capabilities, most recently as Vice-President, Risk and Information Management of Enterprise Growth. Ren holds a PhD in Statistics from The Wharton School at the University of Pennsylvania.
Schedule
15-minute keynote + 5-minute Q&A
10-minute talk
15-minute Python crash course
Tianxiao Zhang, Yanqi Liu, Muhua Chen
Heart disease has always been one of the most difficult diseases to treat. It usually cannot be discovered until the patient reaches a certain age or certain symptoms begin to emerge. By that time, however, treatments are no longer as effective as they would have been had the disease been discovered earlier. It would therefore be helpful to know in advance whether a person is likely to have heart disease. The goal of this project is to build predictive models that predict whether a person is likely to have heart disease (a binary outcome) from other physical measures that can be tracked earlier.
Stacey Bevan, Yaoyu Wang, Yuxuan Zhang, Hexiang Liu
In late 2021, the US Surgeon General released a public health advisory about alarming decreases in youth wellbeing. Even before the COVID pandemic, 20% of US youth reported a mental health disorder, and only half received appropriate treatment. Insight into predictors of adolescent wellbeing can inform policy aimed at reducing disease burden. The aims of the study are 1) to develop a model of adolescent wellbeing from early social determinants and 2) to determine which of parent, teacher, and self-report is most predictive in the model. This study uses the Fragile Families sample, which longitudinally follows child outcomes from birth to age 15. Modeling strategies included a series of regressions with LASSO, random forest with PCA, and bagging. Results indicate that access to social services, sleep practices, and neighborhood conditions in childhood are associated with youth wellbeing. Interventions that improve early life conditions may mitigate mental health symptoms in adolescence.
Edward Zhang, Aaron Shurberg, Matt Simkus
Under the Federal Emergency Management Agency (FEMA), the National Flood Insurance Program pays out billions of dollars in insurance claims each year to people who have suffered flood damage. This study aims to predict how much a victim of flood damage can expect to receive in payouts given where they live, their demographic information, and the flood risk factor of their place of residence according to the First Street Foundation.
Using datasets acquired from FEMA, the census, and the First Street Foundation, we aim to use the methods covered in the Modern Data Mining course to build models that predict the expected insurance payout from these potential predictors. These methods will include linear regression, random forest, and generalized boosting. Based on these models, we hope to see whether their aggregate results are insightful and to make recommendations based on the results they yield.
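As a rough illustration of how candidate payout models might be compared, the sketch below scores two models by root mean squared error (RMSE) on held-out data. It is not the team's pipeline: to stay dependency-free it substitutes a mean baseline and a single-predictor least-squares line for the random forest and boosting models named above, and the `risk` and `payout` values are synthetic toy numbers invented for this example.

```python
import math


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training outcomes."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on (xs, ys)."""
    squared_errors = [(model(x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic toy data (hypothetical): payout rises with a flood risk score,
# plus a small deterministic wobble standing in for noise.
risk = list(range(1, 11))
payout = [2.0 * r + 1.0 + (0.5 if r % 2 else -0.5) for r in risk]

# Hold out the last three observations for evaluation.
train_x, train_y = risk[:7], payout[:7]
test_x, test_y = risk[7:], payout[7:]

models = {
    "mean baseline": fit_mean(train_x, train_y),
    "linear regression": fit_line(train_x, train_y),
}
scores = {name: rmse(model, test_x, test_y) for name, model in models.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: held-out RMSE = {score:.2f}")
print("lowest held-out error:", best)
```

The same loop extends to any model that exposes a fit step and a prediction function, so more flexible learners can be added to the `models` dictionary and judged on the same held-out rows.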
Zihu Xu, Haimeng Wang, Jingru Wang
The goal of our study is to cluster academic papers based on their abstracts using unsupervised machine learning techniques. We test the impact of different data preprocessing steps on the quality of the clusters and try to interpret the meaning of each cluster. The project is meaningful because it lets us apply what we have learned in class to real-world cases and, if it works, group a paper with similar papers without knowing its category in advance.
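To make the point about preprocessing concrete, here is a minimal sketch of how one preprocessing choice, stop-word removal, changes the similarity between two abstracts, which is the quantity a clustering algorithm typically relies on. The abstract does not say which preprocessing steps or clustering method the team used, so everything here is an assumption for illustration: the two example abstracts are made up, `STOP_WORDS` is a tiny hand-picked list, and the bag-of-words cosine similarity is just one common choice.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "we", "and", "to", "for", "is", "on"}


def tokenize(text, remove_stop_words=False):
    """Lowercase the text, keep alphabetic tokens, optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up abstracts on unrelated topics (hypothetical, not project data).
abstract_1 = "We study the clustering of galaxies in the early universe."
abstract_2 = "We study the effect of a vaccine on the spread of influenza."

raw_sim = cosine_similarity(tokenize(abstract_1), tokenize(abstract_2))
clean_sim = cosine_similarity(
    tokenize(abstract_1, remove_stop_words=True),
    tokenize(abstract_2, remove_stop_words=True),
)

print(f"similarity with stop words:    {raw_sim:.3f}")
print(f"similarity without stop words: {clean_sim:.3f}")
```

With stop words left in, the two unrelated abstracts look fairly similar because they share filler words such as "the" and "of". Removing them lowers the similarity, which is one way preprocessing can sharpen the separation between clusters.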
Ting Sun, Lionel Smoler Schatz, Weiyu Kong
Our study consists of two parts. The first part investigates which factors matter most in determining an Airbnb listing's market-clearing price (for example, do a good neighborhood, the number of bedrooms, the description, or good reviews determine the price?). The second part investigates which features contribute to high review scores. From a consumer perspective, our algorithms can be used to identify overvalued and undervalued properties, as well as to evaluate the quality of newly listed properties with no existing reviews.
Zhuoyun Wang, Mudan Chen, Jinmeng Liu
We found this project interesting because stroke, a disease affecting the arteries leading to and within the brain, is the No. 5 cause of death and a leading cause of disability in the US. Stroke does not affect only seniors: there is a trend toward strokes at younger ages, and an estimated 10% of strokes occur in people under age 50. We believe strokes should be treated as medical emergencies that require urgent attention. Early detection and appropriate management are required to prevent further damage to the affected area of the brain. Models that reliably predict stroke from people's medical history and demographic features could therefore be of great help in early prevention.
Jihan Zhang, Chenxi Leng, Ziyi You
Motivated by the fact that U.S. supermarket and grocery store sales have grown to 766 billion U.S. dollars, we aim to provide marketing insights for a Superstore Giant that wants to enter the market or to increase its sales and profit. We try to understand which products, regions, categories, and customer segments the Superstore Giant should target or avoid. In this presentation, we will start with a short summary of the background and dataset, followed by the steps we took to clean the dataset. Then, we will present interesting visualizations and tables related to the variables in the dataset. After that, we will present the models we fit and evaluated on the testing set. Finally, we will conclude with practical insights for the superstore and suggest a best-fitting model.
Jia Xu, Yuqin Zhang, Zejia Cai
Alzheimer’s disease (AD) is the most common type of dementia and leads to memory loss and a decline in thinking. AD is a progressive disease that usually starts slowly, but changes in the brain can begin many years before the first symptoms appear.
In this study, we aim to use Magnetic Resonance Imaging (MRI) data for both demented and nondemented adults to build classifiers that predict whether a subject will be diagnosed with dementia. Two datasets will be used: one contains cross-sectional MRI data for adults aged 18 to 96, and the other contains longitudinal MRI data for older adults aged 60 to 96.
The data was collected and released by the Alzheimer's Disease Research Center at Washington University and Open Access Series of Imaging Studies (OASIS).
Adiwid (Boom) Devahastin Na Ayudhya, Muaaz Noor, Conor Gibbons
Python and R are easily the two most popular languages in the data science community. While our Modern Data Mining course has historically been taught in R, our Pythonization Special Project Group has spearheaded the initiative to transition our material to Python. This presentation outlines the fundamental differences between R and Python and gives a live demonstration of writing .ipynb files in the Google Colaboratory (Colab) interface, showcasing the comparative advantages and disadvantages of Python relative to R. It will also include a brief run-through of one of our lecture modules as a case study. Those interested in seeing the Crash Course on Basic and Advanced Python that our team has developed for future semesters should feel free to ask questions after the presentation.