# Data Literacy - Project
## Gender Share in Movies
#### Tobias Stumpp, Sophia Herrmann

## Beta-Binomial Hypothesis Testing

### Parameters

In [None]:
# Starting year of the period of years covered by the test
start_year = 1980
# Ending year of the period of years covered by the test
end_year = start_year + 40

# Split year of the period of years covered by the test that separates
# indicative data (>= start_year and < split_year)
# from
# data to be verified (>= split_year and < end_year).
split_year = start_year + 20

# Option to ignore movies where the average rating or the number of votes is at or below the respective 5% quantile.
ignore_irrelevant_movies = False

### Meta

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

In [None]:
path = '../dat/'
os.chdir(path)

### Read Data

In [None]:
columns = list(pd.read_csv('data_movie.csv', nrows=1))
print(columns)

In [None]:
columns_to_read = [c for c in columns if c != 'genres']

data_movie = pd.read_csv('data_movie.csv', usecols = columns_to_read)

data_movie.info()  # info() prints its summary directly and returns None
display(data_movie.head())

---

#### Provide the option to only include movies that are relevant based on the average rating and number of votes.

In [None]:
data_movie[['numVotes','averageRating']].describe()

In [None]:
numVotes_split = data_movie['numVotes'].quantile(0.05)
numVotes_split

In [None]:
averageRating_split = data_movie['averageRating'].quantile(0.05)
averageRating_split

In [None]:
display(data_movie.shape)

In [None]:
if ignore_irrelevant_movies:
    data_movie = data_movie[(data_movie['numVotes'] > numVotes_split) & (data_movie['averageRating'] > averageRating_split)]

In [None]:
display(data_movie.shape)
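
To illustrate what the optional filter does, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and only mimic the `numVotes` and `averageRating` columns; the filtering expression mirrors the one used above.

```python
import pandas as pd

# Hypothetical toy data standing in for data_movie (20 movies).
toy = pd.DataFrame({
    "tconst": [f"tt{i:03d}" for i in range(1, 21)],
    "numVotes": list(range(100, 2100, 100)),         # 100, 200, ..., 2000
    "averageRating": [r / 2 for r in range(1, 21)],  # 0.5, 1.0, ..., 10.0
})

# 5% quantiles, computed with pandas' default linear interpolation.
votes_cut = toy["numVotes"].quantile(0.05)
rating_cut = toy["averageRating"].quantile(0.05)

# Keep only movies strictly above both thresholds.
kept = toy[(toy["numVotes"] > votes_cut) & (toy["averageRating"] > rating_cut)]
print(f"thresholds: {votes_cut=}, {rating_cut=}")
print(f"{len(toy)} movies before, {len(kept)} after filtering")
```

Because both conditions are combined with `&`, a movie is dropped as soon as it falls at or below either threshold.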

---

#### Only include movies within the selected range of years.

In [None]:
display(data_movie.shape)

In [None]:
data_movie = data_movie[(data_movie['startYear'] >= start_year) & (data_movie['startYear'] < end_year)]

In [None]:
display(data_movie.shape)

### Prepare Data

##### Add year span as a column

In [None]:
year_span_presplit = f"{start_year}-{split_year}"
year_span_postsplit = f"{split_year}-{end_year}"
year_span = np.where(data_movie['startYear'] < split_year, year_span_presplit, year_span_postsplit)
data_movie.insert(1, 'year_span' , year_span)

display(data_movie)
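
The labels read like closed ranges, but the split is half-open: the split year itself belongs to the second span. A small sketch with hypothetical years shows the boundary behaviour, using the same `np.where` construction as above.

```python
import numpy as np
import pandas as pd

start_year, split_year, end_year = 1980, 2000, 2020

# Hypothetical years sitting right at the boundaries of both spans.
toy = pd.DataFrame({"startYear": [1980, 1999, 2000, 2019]})

labels = np.where(
    toy["startYear"] < split_year,
    f"{start_year}-{split_year}",  # start_year <= year < split_year
    f"{split_year}-{end_year}",    # split_year <= year < end_year
)
toy.insert(1, "year_span", labels)
print(toy)
```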

##### Add counts and proportions on crew members

In [None]:
data_cast_numbers = pd.crosstab(data_movie['tconst'], data_movie['category']).reset_index().rename(columns = {
    'actor':'num_actors',
    'actress':'num_actresses',
})

data_cast_proportion = data_movie.groupby(['tconst'])['category'].value_counts(normalize=True).unstack().reset_index().fillna(0).rename(columns = {
    'actor':'prop_actors',
    'actress':'prop_actresses',
})

data_cast_gender_stat = pd.merge(data_cast_numbers, data_cast_proportion)
data_cast_gender_stat
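
As a sanity check of the counting step, the sketch below runs `pd.crosstab` on a hypothetical two-movie cast list. It also shows `normalize="index"` as one possible alternative way to obtain per-movie proportions; this is not what the cell above uses, only an assumption about an equivalent route.

```python
import pandas as pd

# Hypothetical cast list: tt1 has 3 actors and 1 actress, tt2 has 3 actresses.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt1", "tt1", "tt2", "tt2", "tt2"],
    "category": ["actor", "actor", "actor", "actress", "actress", "actress", "actress"],
})

# Counts per movie; categories that are missing for a movie become 0.
counts = pd.crosstab(toy["tconst"], toy["category"])

# Row-wise proportions per movie.
props = pd.crosstab(toy["tconst"], toy["category"], normalize="index")

print(counts)
print(props)
```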

In [None]:
data_movie_distinct = data_movie.drop(columns=['category']).drop_duplicates(['tconst']).reset_index(drop = True)
display(data_movie_distinct)

data_movie_gender_stat = pd.merge(data_movie_distinct, data_cast_gender_stat)
# Show one table per year span.
for _, group in data_movie_gender_stat.groupby('year_span'):
    display(group)

---

##### Flag how the number of actresses compares to the number of actors

In [None]:
data_movie_gender_stat['num_actresses_>_num_actors'] = (data_movie_gender_stat['num_actresses'] > data_movie_gender_stat['num_actors'])
data_movie_gender_stat['num_actresses_=_num_actors'] = (data_movie_gender_stat['num_actresses'] == data_movie_gender_stat['num_actors'])
data_movie_gender_stat['num_actresses_<_num_actors'] = (data_movie_gender_stat['num_actresses'] < data_movie_gender_stat['num_actors'])

data_movie_gender_stat['num_actresses_=_0'] = (data_movie_gender_stat['num_actresses'] == 0)
data_movie_gender_stat['num_actresses_>_0'] = (data_movie_gender_stat['num_actresses'] > 0)

data_movie_gender_stat

In [None]:
data_actresses_stat = data_movie_gender_stat.groupby(['year_span','startYear'])[[
    'num_actresses_>_num_actors',
    'num_actresses_=_num_actors',
    'num_actresses_<_num_actors',
    'num_actresses_=_0',
    'num_actresses_>_0',
]].sum().reset_index()

data_actresses_stat['num_movies'] = (
    data_actresses_stat['num_actresses_>_num_actors'] +
    data_actresses_stat['num_actresses_=_num_actors'] +
    data_actresses_stat['num_actresses_<_num_actors']
)

data_actresses_stat
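
The aggregation relies on boolean flags summing to counts, and on the three comparison flags being mutually exclusive, so that their sum equals the number of movies. A minimal sketch with hypothetical movies and shortened, made-up column names:

```python
import pandas as pd

# Hypothetical per-movie cast counts.
toy = pd.DataFrame({
    "startYear": [1990, 1990, 1990, 2005],
    "num_actors": [3, 2, 4, 1],
    "num_actresses": [1, 2, 0, 3],
})

# Exactly one of these three flags is True for every movie.
toy["more_actresses"] = toy["num_actresses"] > toy["num_actors"]
toy["equal"] = toy["num_actresses"] == toy["num_actors"]
toy["fewer_actresses"] = toy["num_actresses"] < toy["num_actors"]

# Summing booleans counts the True values per year.
per_year = toy.groupby("startYear")[["more_actresses", "equal", "fewer_actresses"]].sum()
per_year["num_movies"] = per_year.sum(axis=1)
print(per_year)
```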

---

##### Split data into their year spans

In [None]:
data_actresses_stat_timespan_presplit, data_actresses_stat_timespan_postsplit = [
    g.reset_index(drop=True) for _, g in data_actresses_stat.groupby(['year_span'])
]

In [None]:
display(data_actresses_stat_timespan_presplit)
display(data_actresses_stat_timespan_postsplit)

In [None]:
display(data_actresses_stat_timespan_presplit.describe())
display(data_actresses_stat_timespan_postsplit.describe())

---

In [None]:
data_actresses_stat_sum = data_actresses_stat.drop(columns=['startYear']).groupby(['year_span']).sum()
data_actresses_stat_sum

---

### Analyze Data

#### Compute p-Values

Our goal is to find out whether, after the split year, significantly more movies have a *majority share* of actresses in the principal cast, or significantly fewer movies have a *minority share*, than before the split year.  
We perform a beta-binomial test and closely follow the example on German Bundesliga scores presented in the lecture and exercise.

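- First, we put a beta prior on $f_0$ (the majority probability before the split year), based on $m_0$ (the number of movies with majority share before the split year) out of $n_0$ movies (the number of movies before the split year). Concretely, this is a $\mathrm{Beta}(m_0 + 1, n_0 - m_0 + 1)$ distribution, i.e. the posterior of $f_0$ under a uniform prior.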

- Under the null hypothesis $H_0: f_1 = f_0$, the number of movies with majority share after the split year $m_1$ (given the number of movies after the split year $n_1$) follows a binomial distribution. 

- Putting these building blocks together, we obtain a [beta-binomial distribution](https://en.wikipedia.org/wiki/Beta-binomial_distribution)

    \begin{equation}
    p(m_1 \vert n_1, m_0, n_0) 
    = {n_1\choose m_1} 
    \frac{\mathcal{B}(m_0 + m_1 + 1, (n_0-m_0) + (n_1-m_1) + 1)}
    {\mathcal{B}(m_0 + 1, n_0 - m_0 + 1)}.
    \end{equation}

    This tells us the probability of observing $m_1$ movies with *majority shares* after the split year, given the number of movies after the split year $n_1$ and the statistics $m_0$, $n_0$ for the years before.
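
As a quick check, the formula can be evaluated directly and compared with `scipy.stats.betabinom`, using the parameters $a = m_0 + 1$ and $b = n_0 - m_0 + 1$. The helper name `evidence` and the counts below are made up for illustration only.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import betabinom


def evidence(m_1, n_1, m_0, n_0):
    """Evaluate p(m_1 | n_1, m_0, n_0) from the formula, in log space for stability."""
    log_binom = gammaln(n_1 + 1) - gammaln(m_1 + 1) - gammaln(n_1 - m_1 + 1)
    log_num = betaln(m_0 + m_1 + 1, (n_0 - m_0) + (n_1 - m_1) + 1)
    log_den = betaln(m_0 + 1, n_0 - m_0 + 1)
    return np.exp(log_binom + log_num - log_den)


# Hypothetical counts, not taken from the movie data.
n_1, m_0, n_0 = 50, 12, 60
m_1 = np.arange(n_1 + 1)

by_formula = evidence(m_1, n_1, m_0, n_0)
by_scipy = betabinom.pmf(m_1, n_1, m_0 + 1, n_0 - m_0 + 1)

print(np.allclose(by_formula, by_scipy))  # both routes should agree
print(by_formula.sum())                   # a probability mass function sums to one
```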

In [None]:
from scipy.stats import betabinom

In [None]:
def p_val_won(m_1, n_1, m_0, n_0):
    """
    Compute the lower-tail p-value by summing the evidence p(m | n_1, m_0, n_0)
    over the observed count m_1 and all 'more extreme' (i.e. smaller) counts.
    A movie is "counted" if it has the property under test (e.g. an actress majority).
    
    Parameters
    ----------
    m_1 : int
        Number of counted movies after the split year (0 <= m_1 <= n_1)
    n_1 : int
        Number of movies after the split year (n_1 > 0)
    m_0 : int
        Number of counted movies before the split year (0 <= m_0 <= n_0)
    n_0 : int
        Number of movies before the split year (n_0 > 0)
    
    Returns
    -------
    float
        The probability of observing m_1 or fewer counted movies.
    """
    return betabinom.cdf(m_1, n_1, m_0 + 1, n_0 - m_0 + 1)

In [None]:
def p_val_lost(m_1, n_1, m_0, n_0):
    """
    Compute the upper-tail p-value by summing the evidence p(m | n_1, m_0, n_0)
    over the observed count m_1 and all 'more extreme' (i.e. larger) counts.
    A movie is "counted" if it has the property under test (e.g. an actress majority).
    
    Parameters
    ----------
    m_1 : int
        Number of counted movies after the split year (0 <= m_1 <= n_1)
    n_1 : int
        Number of movies after the split year (n_1 > 0)
    m_0 : int
        Number of counted movies before the split year (0 <= m_0 <= n_0)
    n_0 : int
        Number of movies before the split year (n_0 > 0)
    
    Returns
    -------
    float
        The probability of observing m_1 or more counted movies.
    """
    # sf(k) is 1 - cdf(k), i.e. P(M > k), so sf(m_1 - 1) gives P(M >= m_1).
    return betabinom.sf(m_1 - 1, n_1, m_0 + 1, n_0 - m_0 + 1)
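
The two functions cover opposite tails of the same distribution. The sketch below uses hypothetical counts and the made-up helper names `p_at_most` and `p_at_least`, which play the roles of `p_val_won` and `p_val_lost`. It shows that the tails are complementary and that only a count far above the pre-split rate yields a small upper-tail p-value.

```python
from scipy.stats import betabinom

# Hypothetical counts: 20 of 100 movies before the split year, 100 movies after.
m_0, n_0, n_1 = 20, 100, 100
a, b = m_0 + 1, n_0 - m_0 + 1


def p_at_most(m_1):
    """Lower tail P(M <= m_1): evidence for 'fewer than before'."""
    return betabinom.cdf(m_1, n_1, a, b)


def p_at_least(m_1):
    """Upper tail P(M >= m_1): evidence for 'more than before'."""
    return betabinom.sf(m_1 - 1, n_1, a, b)


# The two tails split the outcomes between 39 and 40 without overlap.
print(p_at_most(39) + p_at_least(40))

# Same rate as before vs. a clearly higher rate after the split year.
print(p_at_least(20), p_at_least(40))
```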

In [None]:
def print_result(p_val, alpha=0.05):
    """Print whether p_val is significant at the level alpha."""
    significant = p_val <= alpha
    print(
        f"{'Yes' if significant else 'No'}, the result is "
        f"{'significant' if significant else 'not significant'} because, given the pre-split-year data, "
        f"observing post-split-year data at least this extreme has a {p_val*100:.2f}% probability."
    )

#### Are there more movies with a majority of actresses in the principal roles?

In [None]:
p_val_actresses_in_majority = p_val_lost(
    data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_>_num_actors'], # <---
    data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],
    data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_>_num_actors'],  # <---
    data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],
)

print_result(p_val_actresses_in_majority)

#### Are there fewer movies with a minority of actresses in the principal roles?

In [None]:
p_val_actresses_in_minority = p_val_won(
    data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_<_num_actors'], # <---
    data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],
    data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_<_num_actors'],  # <---
    data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],
)

print_result(p_val_actresses_in_minority)

#### Are there fewer movies with a majority of actresses in the principal roles?

In [None]:
p_val_actresses_in_majority = p_val_won(
    data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_>_num_actors'], # <---
    data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],
    data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_>_num_actors'],  # <---
    data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],
)

print_result(p_val_actresses_in_majority)

#### Are there more movies with a minority of actresses in the principal roles?

In [None]:
p_val_actresses_in_minority = p_val_lost(
    data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_<_num_actors'], # <---
    data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],
    data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_<_num_actors'],  # <---
    data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],
)

print_result(p_val_actresses_in_minority)

---

#### Are there fewer movies with zero actresses in the principal roles?

In [None]:
p_val_actresses_eq_zero = p_val_won(
    data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_=_0'], # <---
    data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],
    data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_=_0'],  # <---
    data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],
)

print_result(p_val_actresses_eq_zero)

#### Are there more movies with more than zero actresses in the principal roles?

In [None]:
p_val_actresses_gt_zero = p_val_lost(
    data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_>_0'], # <---
    data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],
    data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_>_0'],  # <---
    data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],
)

print_result(p_val_actresses_gt_zero)

### Results

In summary, the beta-binomial tests show that, after the split year, there are significantly more films with a majority of actresses and significantly fewer films with a minority of actresses in the principal roles.  
The remaining tests did not yield significant results.

We interpret the results overall as an indicator of improvement in the proportion of principal actresses.

Note: It is difficult for us to judge how reliable these results are. On the one hand, we learned the test method from a closely related example in the lecture, and we are convinced that the model can be applied to this movie cast data. On the other hand, we do not know how much this result says about the ratio of actors to actresses, despite the strikingly low p-values.
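
One possible reason for caution is that the p-value depends strongly on the number of movies. The sketch below is purely illustrative: the counts are hypothetical and the helper `p_at_least` is a made-up name that mirrors `p_val_lost`. It keeps the shares fixed at 10% before and 11% after the split year and only varies the sample size.

```python
from scipy.stats import betabinom


def p_at_least(m_1, n_1, m_0, n_0):
    """Upper-tail p-value P(M >= m_1) under the beta-binomial model."""
    return betabinom.sf(m_1 - 1, n_1, m_0 + 1, n_0 - m_0 + 1)


p_values = {}
for n in (100, 10_000, 100_000):
    m_0 = n // 10        # hypothetical: 10% of the movies before the split year
    m_1 = n * 11 // 100  # hypothetical: 11% of the movies after the split year
    p_values[n] = p_at_least(m_1, n, m_0, n)

for n, p in p_values.items():
    print(f"n = {n:>7}: p-value = {p:.3g}")
```

With the same one-percentage-point difference, the p-value shrinks as the number of movies grows, so a very low p-value on a large data set does not by itself indicate a large change in the shares.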