Public Figures Analysis
This is my analysis of a publicly available dataset: "Public Figures". The public_figures.csv file contains information about 226 20th and 21st-century public figures. Please read the public_figures_data_dictionary file on GitHub (button at bottom of screen) for a more complete description of the variables. The final objective of this project is to build a model to predict the likability rating of a public figure, based primarily on their personality.
In this project, I performed exploratory data analysis (EDA), used principal component analysis (PCA) to identify the personality dimensions that explain the largest share of variance, and built two prediction models, Least Absolute Shrinkage and Selection Operator (LASSO) and Random Forests, to predict a subject’s likability based on their personality.
Exploratory Data Analysis
I immediately split the dataset into a training and test set.
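A minimal sketch of such a split, using only the Python standard library. The 80/20 ratio, the seed, and the placeholder row ids are assumptions for illustration; the write-up does not state the actual split proportions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)              # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 226 placeholder row ids standing in for the rows of public_figures.csv
rows = list(range(226))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Splitting before any exploration keeps the test set out of every later decision, so the final error estimate is not flattered by choices made while looking at the data.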
Ask Questions about the Dataset
Which industry is the most liked, on average?


Surprisingly, Natural Sciences is at the top. I would have expected “Team Sports” or “Film and Theater” to lead, because in my experience those are the industries people are most engrossed in. One possible explanation is that “Team Sports” and “Film and Theater” contain some of the highest ratings but also some of the lowest: athletes and actors live in the spotlight, so their undesirable attributes get exposed. The personalities of people in the natural sciences are rarely under the microscope, so the public has less to hold against them.
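The comparison above can be sketched as a grouped summary that reports the spread alongside the mean, which is what the "high highs, low lows" explanation depends on. The field names `industry` and `likability` and every number in the toy records are made up for illustration; they are not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_group(records, group_key, value_key):
    """Return (group, mean, min, max, n) tuples sorted by mean, highest first."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[value_key])
    summary = [(g, mean(v), min(v), max(v), len(v)) for g, v in groups.items()]
    return sorted(summary, key=lambda row: row[1], reverse=True)

# Toy records (hypothetical values, not taken from public_figures.csv)
toy = [
    {"industry": "NATURAL SCIENCES", "likability": 60.0},
    {"industry": "NATURAL SCIENCES", "likability": 50.0},
    {"industry": "TEAM SPORTS", "likability": 80.0},
    {"industry": "TEAM SPORTS", "likability": -40.0},
    {"industry": "FILM AND THEATER", "likability": 70.0},
    {"industry": "FILM AND THEATER", "likability": -10.0},
]
ranking = summarize_by_group(toy, "industry", "likability")
for row in ranking:
    print(row)
```

In the toy data, the group with the single highest rating still ranks last on the mean because of one very low rating, which is the pattern hypothesized above.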
Of the favorable attributes, who has higher scores out of those in the “Team Sports”, “Film and Theater”, and “Natural Sciences” industries?
For the favorable attributes, I will use the ones that are clearly favorable: TIPI_1: “Extroverted, enthusiastic”, TIPI_3: “Dependable, self-disciplined”, TIPI_7: “Sympathetic, warm”, and TIPI_9: “Calm, emotionally stable”.




These results make sense. For TIPI_1: “Extroverted, enthusiastic”, I would expect athletes and actors/actresses to be seen as having this quality more than those in the natural sciences. For TIPI_3: “Dependable, self-disciplined”, I would expect those in the natural sciences and team sports to be seen as more dependable and self-disciplined than those in acting, because self-discipline is crucial for maintaining peak physical condition, and dependability is needed to become well known in the natural sciences. For TIPI_7: “Sympathetic, warm”, I don’t picture people in the natural sciences as warm and sympathetic, because we generally don’t see that side of people who are famous for that profession. I would expect those in film and theater to be the clear leader for this category, but here the results are pretty similar. For TIPI_9: “Calm, emotionally stable”, natural sciences far surpasses the other two categories, because even if these people are not calm and emotionally stable, we generally don’t see that type of behavior publicized.
My guess is that as age increases for those in the “TEAM SPORTS” or “INDIVIDUAL SPORTS” categories, likability increases faster than in the “NATURAL SCIENCES” category. I feel like people like retired athletes much more than active athletes, because retired athletes can’t threaten your team’s playoff hopes and aren’t always in the headlines for doing bad stuff on the field or court.


There are very few observations for the NATURAL SCIENCES industry, which may explain why their median rating for TIPI_1 is so high. As for my hypothesis, likability for those in Team Sports does seem to increase slightly with age, mostly due to a couple of points near the top left corner. The likability of those in Natural Sciences also seems to increase with age, but again, there are only 3 observations, so it’s hard to draw any conclusions. There’s also a clear outlier at the bottom of the graph. I wonder which athlete is disliked that much? Mike Tyson, for biting Evander Holyfield’s ear?


That makes sense. I guess a lot of people think he was guilty of the crime he was charged with.
I think that women will have higher ratings for “Sympathetic, warm” than men, because women usually act that way more than men. I will first check the overall male/female comparison.


As expected, females tend to have higher ratings for TIPI_7 than males do. We will also check within each industry, though, because disparities in some industries could make the overall difference in ratings seem more pronounced than it actually is.


For 3 of the 7 categories that include both males and females, the average rating for TIPI_7: “Sympathetic, warm” was higher for females than for males. In the “LANGUAGE” category, the TIPI_7 rating is considerably higher for females than for males, while in the other industries where there is a difference, the difference is very small. Therefore, this industry is likely the main contributor to the overall difference we see in the TIPI_7 rating between males and females. Surprisingly, there were no females in the “NATURAL SCIENCES” and “DESIGN” categories, even though I know there are a lot of important women in those 2 fields.
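The within-industry check can be sketched as a female-minus-male gap computed per industry, skipping industries where one sex is absent. The field names (`industry`, `sex`, `TIPI_7`), the `"F"`/`"M"` coding, and the toy ratings are assumptions for illustration, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

def sex_gap_by_industry(records, trait="TIPI_7"):
    """Return {industry: female mean - male mean} where both sexes appear."""
    cells = defaultdict(lambda: defaultdict(list))
    for rec in records:
        cells[rec["industry"]][rec["sex"]].append(rec[trait])
    gaps = {}
    for industry, by_sex in cells.items():
        if "F" in by_sex and "M" in by_sex:   # skip single-sex industries
            gaps[industry] = mean(by_sex["F"]) - mean(by_sex["M"])
    return gaps

# Toy records (hypothetical values)
toy = [
    {"industry": "LANGUAGE", "sex": "F", "TIPI_7": 6},
    {"industry": "LANGUAGE", "sex": "F", "TIPI_7": 5},
    {"industry": "LANGUAGE", "sex": "M", "TIPI_7": 3},
    {"industry": "LANGUAGE", "sex": "M", "TIPI_7": 4},
    {"industry": "TEAM SPORTS", "sex": "F", "TIPI_7": 4},
    {"industry": "TEAM SPORTS", "sex": "M", "TIPI_7": 4},
    {"industry": "NATURAL SCIENCES", "sex": "M", "TIPI_7": 3},
]
gaps = sex_gap_by_industry(toy)
print(gaps)
```

Looking at the gap per industry rather than only the pooled gap shows whether one industry is driving the overall difference.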
Principal Component Analysis
Continue exploratory data analysis by performing a principal component analysis on all 10 TIPI variables.




Based on the descriptions of the variables in the data dictionary, the first principal component separates public figures who are viewed as having unfavorable personality traits from those viewed as having favorable ones: there are strong positive coefficients for unfavorable traits like “Anxious, easily upset” and “Critical, quarrelsome”, and strong negative coefficients for “Calm, emotionally stable” and “Sympathetic, warm”. High positive scores on PC1 correspond to unfavorable attributes, and high negative scores on PC1 correspond to favorable attributes.
Principal Component 2 separates public figures who are very extroverted from those who are very introverted, because these are the qualities with the largest coefficients. High positive scores for PC2 correspond to extroverted qualities, as TIPI_1 is “Extroverted, enthusiastic” and TIPI_5 is “Open to new experiences, complex”. High negative scores correspond to introverted qualities, as TIPI_6 is “Reserved, quiet” and TIPI_10 is “Conventional, uncreative”.
If I want to reduce the 10 TIPI variables, how many principal components should I choose?
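For readers who want to see where such coefficients come from, here is a minimal sketch of PCA via the singular value decomposition in NumPy. The data are synthetic random numbers standing in for the 10 TIPI columns, so the loadings printed here mean nothing about the real dataset; the function name `pca` is also just a label chosen for this example.

```python
import numpy as np

def pca(X):
    """PCA on standardized columns. Returns (loadings, scores, pve)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # center and scale
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt.T              # column j holds the coefficients of PC j+1
    scores = Z @ loadings        # each observation projected onto the PCs
    pve = s**2 / np.sum(s**2)    # proportion of variance explained per PC
    return loadings, scores, pve

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                    # synthetic stand-in data
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)    # make two columns correlated
loadings, scores, pve = pca(X)
print(np.round(loadings[:, 0], 2))                # coefficients of PC1
```

One caveat worth keeping in mind when reading loadings: the sign of a principal component is arbitrary, so "high positive PC1 means unfavorable" describes one particular fit, and a different run or library could flip both the loadings and the scores.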


Three principal components are appropriate for interpreting the data, because that is the minimum number of principal components for which the cumulative proportion of variance explained (PVE) is at least 80%.
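The 80% rule can be sketched as a cumulative sum over the per-component PVE values. The PVE vector below is invented for illustration (it is not the output of the analysis above), and `n_components_for` is a hypothetical helper name.

```python
import numpy as np

def n_components_for(pve, threshold=0.80):
    """Smallest number of leading components whose cumulative PVE >= threshold."""
    cumulative = np.cumsum(pve)
    # searchsorted returns the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(pve))      # guard against floating-point shortfall

# Illustrative PVE values for 10 components (made up, sums to 1)
pve = np.array([0.45, 0.22, 0.14, 0.07, 0.04, 0.03, 0.02, 0.01, 0.01, 0.01])
k = n_components_for(pve)
print(k, np.cumsum(pve)[:k])
```

The 80% cutoff itself is a convention rather than a rule; a scree plot of the same PVE values is a common companion check.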
Cluster Analysis
Continue exploratory data analysis by performing a cluster analysis using the TIPI variables.




How many clusters best grouped the people in the training set?
I chose to use 3 clusters because WSS/TSS (the within-cluster sum of squares divided by the total sum of squares) is about 0.25 at 3 clusters, and doesn’t decrease significantly as the number of clusters increases from there.
The green cluster includes people who are rated as not being “Sympathetic, warm”, not being “Calm, emotionally stable”, and being “Disorganized, careless”.
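The WSS/TSS criterion can be sketched with a small k-means (Lloyd's algorithm) written in NumPy. This is a sketch under stated assumptions: the data are three synthetic 2-D blobs rather than the 10 TIPI variables, the initialization is a single random draw with a fixed seed, and the function names are chosen for this example only.

```python
import numpy as np

def assign(X, centers):
    """Label each row of X with the index of its nearest center."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centers drawn from X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

def wss_over_tss(X, labels, centers):
    """Within-cluster sum of squares divided by total sum of squares."""
    wss = ((X - centers[labels]) ** 2).sum()
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    return float(wss / tss)

rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + rng.normal(size=(50, 2)) for m in blob_means])

ratios = {k: wss_over_tss(X, *kmeans(X, k)) for k in range(1, 7)}
for k, r in ratios.items():
    print(k, round(r, 3))
```

With one cluster the ratio is 1 by definition, and the "elbow" is the value of k after which the ratio stops dropping noticeably. Because k-means can settle in a local optimum, running several random starts per k and keeping the best is the usual practice.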
Least Absolute Shrinkage and Selection Operator (LASSO)


As the calibration plot shows, the model’s predictions align closely with the actual values for most of the observations. However, the predictions are less accurate for those predicted to be less likable (-75 to -25). They are not horribly off, but the model predicts these people to be more likable than they are actually rated.
Our LASSO model ended up with an MSE of 218 (an RMSE of roughly 14.8 rating points), indicating solid overall fit and generalizability to the test data.
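To make the LASSO step concrete, here is a minimal sketch of the estimator fitted by cyclic coordinate descent in NumPy, followed by a test-set MSE. Everything in it is an assumption for illustration: the data are synthetic, the penalty `alpha=0.5` is fixed by hand rather than tuned, and this is not necessarily how the model above was fitted (in practice the penalty is usually chosen by cross-validation with a library implementation).

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + alpha*||b||_1, one coefficient at a time."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # residual with predictor j's current contribution added back
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(rho, alpha) / col_scale[j]
    return beta

# Synthetic data: only the first two of ten predictors matter
rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Standardize with training statistics only, and center the response
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
y_mean = y_train.mean()
beta = lasso_coordinate_descent((X_train - mu) / sd, y_train - y_mean, alpha=0.5)

pred = ((X_test - mu) / sd) @ beta + y_mean
test_mse = float(np.mean((y_test - pred) ** 2))
print(np.round(beta, 2), round(test_mse, 2))
```

The L1 penalty sets weak coefficients to exactly zero, which is the "selection" part of the name, and shrinks the remaining ones toward zero. That shrinkage pulls predictions toward the mean, which is one plausible reason a LASSO model can overestimate the likability of the least-liked figures.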