Which features are considered most attractive in celebrities? Exploring the CelebA dataset.

Ubaada 17-03-2024



The plot shows how each feature in the CelebA dataset correlates with an image being marked attractive. CelebA (Large-scale CelebFaces Attributes) is a dataset published with the paper Deep Learning Face Attributes in the Wild. It contains images of celebrities annotated with facial attributes such as Gray_Hair, Wearing_Necktie and High_Cheekbones, as well as a column named Attractive. Every attribute is binary, 1 for yes and -1 for no, like so:

image 5_o_Clock_Shadow Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Big_Lips Big_Nose Black_Hair Blond_Hair Blurry Brown_Hair Bushy_Eyebrows Chubby Double_Chin Eyeglasses Goatee Gray_Hair Heavy_Makeup High_Cheekbones Male Mouth_Slightly_Open Mustache Narrow_Eyes No_Beard Oval_Face Pale_Skin Pointy_Nose Receding_Hairline Rosy_Cheeks Sideburns Smiling Straight_Hair Wavy_Hair Wearing_Earrings Wearing_Hat Wearing_Lipstick Wearing_Necklace Wearing_Necktie Young
image -1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 1 -1 -1 1 -1 -1 -1 1 1 -1 1 -1 1 -1 -1 1
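To make the encoding concrete, here is a small sketch that pairs the header with the example row above and lists the attributes flagged as present. The variable names are my own for illustration; only the header and the row come from the table.

```python
# Header and example row as shown above (attribute names, then -1/1 flags).
header = (
    "5_o_Clock_Shadow Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs "
    "Big_Lips Big_Nose Black_Hair Blond_Hair Blurry Brown_Hair Bushy_Eyebrows "
    "Chubby Double_Chin Eyeglasses Goatee Gray_Hair Heavy_Makeup High_Cheekbones "
    "Male Mouth_Slightly_Open Mustache Narrow_Eyes No_Beard Oval_Face Pale_Skin "
    "Pointy_Nose Receding_Hairline Rosy_Cheeks Sideburns Smiling Straight_Hair "
    "Wavy_Hair Wearing_Earrings Wearing_Hat Wearing_Lipstick Wearing_Necklace "
    "Wearing_Necktie Young"
).split()
row = (
    "-1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 1 "
    "-1 1 -1 -1 1 -1 -1 1 -1 -1 -1 1 1 -1 1 -1 1 -1 -1 1"
).split()

# Keep only the attributes whose flag is 1 (present); -1 means absent.
present = [name for name, flag in zip(header, row) if int(flag) == 1]
print(present)
```

For this row the list includes Attractive and Smiling but not Male.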

According to who? Bias:

Because the question is subjective, I should first clarify who labelled the data and how. The paper says:

Each image in CelebA and LFWA is annotated with forty face attributes and five key points by a professional labeling company

The paper was published by The Chinese University of Hong Kong, so the data was likely labelled by a local company and reflects local preferences and biases. Ultimately, it is their interpretation of something subjective.

How?

I'm using Hugging Face's datasets library to access the dataset, with Pandas for further processing and Matplotlib for visualisation.

1. Import libraries and load the dataset:

from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
dataset = load_dataset('tpremoli/CelebA-attrs')

2. Drop the non-numeric columns, along with Blurry

trim_dataset = dataset.remove_columns(['image','prompt_string', 'Blurry'])
df = pd.DataFrame(trim_dataset['train'])

3. Calculate the correlation of 'Attractive' column with other numeric columns.

# Calculate correlation with the "Attractive" feature
correlation = df.corrwith(df['Attractive'])
correlation = correlation.drop('Attractive')
# Sort values for better visualization
sorted_correlation = correlation.sort_values()

We've also dropped the Attractive column itself for a cleaner visualisation. The correlation of a column with itself is always 1, which isn't insightful.
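For intuition about what these coefficients measure: on two binary columns, the Pearson correlation is the same number as the phi coefficient computed from the 2x2 table of counts. The sketch below checks this on a small invented frame (the values are made up, not CelebA rows).

```python
import math
import pandas as pd

# Invented toy labels in the -1/1 encoding (not real CelebA rows).
toy = pd.DataFrame({
    "Smiling":    [1, 1, 1, -1, -1, -1, 1, -1],
    "Attractive": [1, 1, -1, -1, -1, 1, 1, -1],
})

# Pearson correlation, the default method used by corr/corrwith.
pearson = toy["Smiling"].corr(toy["Attractive"])

# Phi coefficient from the 2x2 table of counts.
table = pd.crosstab(toy["Smiling"], toy["Attractive"])
n11 = table.loc[1, 1]    # smiling and attractive
n10 = table.loc[1, -1]   # smiling, not attractive
n01 = table.loc[-1, 1]   # not smiling, attractive
n00 = table.loc[-1, -1]  # neither
phi = (n11 * n00 - n10 * n01) / math.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)

print(pearson, phi)  # both are 0.5 for this toy data
```

So a positive value means the feature and the Attractive label tend to be present together more often than chance, and a negative value means the opposite.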

4. Plot the data.

import numpy as np #for color interpolation
# Plotting General
plt.figure(figsize=(14, 7))
# Map correlation values to colors indicating direction and magnitude with RdYlGn colormap.
colors = [plt.get_cmap('RdYlGn')(i) for i in np.interp(sorted_correlation, (min(sorted_correlation), max(sorted_correlation)), (0, 1))]
sorted_correlation.plot(kind='bar', color = colors)
plt.title('Correlation of Features with Attractiveness (Men and Women)')
plt.xlabel('Features')
plt.ylabel('Correlation Coefficient')
plt.tight_layout()
plt.show()

We're coloring the bars from red (most negative) to green (most positive). The gradient follows the magnitude of the correlation, so the bars at either end are darker.
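One caveat: interpolating between the minimum and maximum puts the middle of the colormap at the midpoint of that range, which is not necessarily zero. If you want zero to sit exactly at the neutral colour, one option (a sketch of an alternative, not what the plot above uses) is to interpolate over a range that is symmetric around zero. The helper name centred_positions is hypothetical.

```python
import numpy as np

def centred_positions(values):
    """Map values to [0, 1] so that 0 always lands on 0.5."""
    values = np.asarray(values, dtype=float)
    limit = np.max(np.abs(values))
    if limit == 0:
        # All values are zero: put everything at the neutral midpoint.
        return np.full(values.shape, 0.5)
    return np.interp(values, (-limit, limit), (0.0, 1.0))

example = [-0.4, -0.1, 0.0, 0.2]
positions = centred_positions(example)
print(positions)  # 0.0, 0.375, 0.5, 0.75
```

The positions can then be passed to the colormap in the same way as before, for example plt.get_cmap('RdYlGn')(p) for each p in positions.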

Further Analysis

Below are some further plots I thought would be interesting to see.

Separating Men and Women plots



There's a column (Male) that marks whether the image is of a male or not. We can filter the dataset and plot the correlations separately.

#male and female only df. based on where the column 'Male' is 1 or -1
df_men = df[df['Male'] == 1].drop(columns='Male')
df_women = df[df['Male'] == -1].drop(columns='Male')
# men only correlation
men_correlation = df_men.corrwith(df_men["Attractive"])
men_correlation = men_correlation.drop(['Attractive', 'Wearing_Lipstick'])
men_sorted_correlation = men_correlation.sort_values(ascending=False)
# women only correlation
women_correlation = df_women.corrwith(df_women["Attractive"])
women_correlation = women_correlation.drop(['Attractive', 'Mustache', 'No_Beard', '5_o_Clock_Shadow', 'Goatee', 'Bald'])
women_sorted_correlation = women_correlation.sort_values(ascending=False)

Then plot the graphs as before.

Spread of Attractive features


The spread plots are built by scoring each sample: for every feature present in the sample, we add that feature's correlation coefficient with the Attractive column. Artificial features (Heavy_Makeup, Wearing_Lipstick, Wearing_Earrings, Wearing_Necklace, Wearing_Necktie, Wearing_Hat) are dropped. Only features that are present contribute: the absence of a feature, even a negatively correlated one, doesn't change the sample's score.

# Create a new DataFrame, here it's a simple copy of 'df' for demonstration
att_sum_df = df.copy()
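# drop non natural features; 'correlation' is a Series, so drop by label
# (the columns= keyword is ignored for a Series and would drop nothing)
natural_correlation = correlation.drop(labels=['Heavy_Makeup', 'Wearing_Lipstick', 'Wearing_Earrings', 'Wearing_Necklace', 'Wearing_Necktie', 'Wearing_Hat'])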
# Add an index column
att_sum_df['index'] = range(len(att_sum_df))
# Initialize 'attractive_sum' column to 0
att_sum_df['attractive_sum'] = 0
# Iterate over column names. dataframe['column'] can select the entire column across all rows
for feature in natural_correlation.index:
    # this multiplies the presence mask of the 'feature' column by its correlation value
    # and adds it to the 'attractive_sum' column
    # Only marks the presence of the feature, not its absence
    att_sum_df['attractive_sum'] += (att_sum_df[feature] == 1) * natural_correlation[feature]
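
The same score can also be computed without an explicit loop. This is a minimal sketch on invented data (the toy frame and the weights are made up), assuming the weights are a Series indexed by feature name like natural_correlation:

```python
import pandas as pd

# Invented toy data in the -1/1 encoding.
toy = pd.DataFrame({
    "Smiling":  [1, -1, 1],
    "Big_Nose": [-1, 1, 1],
})
# Made-up coefficients standing in for natural_correlation.
weights = pd.Series({"Smiling": 0.25, "Big_Nose": -0.5})

# 1.0 where the feature is present, 0.0 where it is absent.
present = toy[weights.index].eq(1).astype(float)

# Matrix-vector product: sum of the weights of the features present in each row.
score = present @ weights
print(score.tolist())  # [0.25, -0.5, -0.25]
```

On a large frame this avoids one pass per feature in Python, though the loop version is easier to read.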

Then the score is plotted as a histogram:
bins = 400
# calculate the symmetric range from all distribution plots
import math
x_range = max(abs(att_sum_df['attractive_sum'].min()), abs(att_sum_df['attractive_sum'].max()))
x_range = math.ceil(x_range)

# Plot the distribution of 'attractive_sum'
plt.figure(figsize=(8, 4))
plt.hist(att_sum_df['attractive_sum'], bins=bins)
plt.title('Attractive Features Distribution')
plt.xlabel('Attractive Features Weighted Sum')
plt.ylabel('Frequency')
plt.tight_layout()
plt.xlim(-x_range, x_range)
plt.show()

# Plot the distribution for Men only
plt.figure(figsize=(8, 4))
plt.hist(att_sum_df[df['Male'] == 1]['attractive_sum'], bins=bins)
plt.title('Attractive Features Distribution (Men Only)')
plt.xlabel('Attractive Features Weighted Sum')
plt.ylabel('Frequency')
plt.tight_layout()
plt.xlim(-x_range, x_range)
plt.show()

# Women only
plt.figure(figsize=(8, 4))
plt.hist(att_sum_df[df['Male'] == -1]['attractive_sum'], bins=bins)
plt.title('Attractive Features Distribution (Women Only)')
plt.xlabel('Attractive Features Weighted Sum')
plt.ylabel('Frequency')
plt.tight_layout()
plt.xlim(-x_range, x_range)
plt.show()

Top and bottom samples

You can use the weighted sum column to sort and filter images. Below are the top 5 and bottom 5 samples where High_Cheekbones == 1. The disparity seen in the plots above also shows up in the sample images.



display_top_bottom_images(att_sum_df[(att_sum_df['High_Cheekbones'] == 1)], dataset, n=5)

I am using a simple function I wrote for picking out the top 5 or bottom 5 images and stitching them together.
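
The function itself isn't shown here, so the following is only a guess at its selection half: a hypothetical helper, top_bottom_rows, that picks the highest and lowest scoring rows. The image stitching is left out, and the toy scores are invented.

```python
import pandas as pd

def top_bottom_rows(scored_df, score_col="attractive_sum", n=5):
    """Return (top, bottom): the n highest and n lowest rows by score_col."""
    top = scored_df.nlargest(n, score_col)
    bottom = scored_df.nsmallest(n, score_col)
    return top, bottom

# Invented scores standing in for att_sum_df['attractive_sum'].
toy = pd.DataFrame({"attractive_sum": [0.3, -0.2, 0.9, 0.1, -0.7, 0.5]})
top, bottom = top_bottom_rows(toy, n=2)
print(list(top.index), list(bottom.index))  # [2, 5] [4, 1]
```

The returned row indices could then be used to look up the matching images in the dataset's 'image' column before stitching them side by side.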

You can download the python notebook here.