Team project "Analyzing Gender Share in Casting Actors" as part of the lecture "Data Literacy"

exp-004_Beta-Binomial-Hypothesis-Testing.ipynb

  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "# Data Literacy - Project\n",
  8. "## Gender Share in Movies\n",
  9. "#### Tobias Stumpp, Sophia Herrmann"
  10. ]
  11. },
  12. {
  13. "cell_type": "markdown",
  14. "metadata": {},
  15. "source": [
  16. "## Beta-Binomial Hypothesis Testing"
  17. ]
  18. },
  19. {
  20. "cell_type": "markdown",
  21. "metadata": {},
  22. "source": [
  23. "### Parameters"
  24. ]
  25. },
  26. {
  27. "cell_type": "code",
  28. "execution_count": null,
  29. "metadata": {},
  30. "outputs": [],
  31. "source": [
  32. "# Starting year of the period of years covered by the test\n",
  33. "start_year = 1980\n",
  34. "# Ending year of the period of years covered by the test\n",
  35. "end_year = start_year + 40\n",
  36. "\n",
  37. "# Split year within the covered period that separates\n",
  38. "# the reference data (>= start_year and < split_year)\n",
  39. "# from\n",
  40. "# the data to be tested (>= split_year and < end_year).\n",
  41. "split_year = start_year + 20\n",
  42. "\n",
  43. "# Option to ignore movies where the average rating or the number of votes is below the respective 5% quantile.\n",
  44. "ignore_irrelevant_movies = False"
  45. ]
  46. },
  47. {
  48. "cell_type": "markdown",
  49. "metadata": {},
  50. "source": [
  51. "### Meta"
  52. ]
  53. },
  54. {
  55. "cell_type": "code",
  56. "execution_count": null,
  57. "metadata": {},
  58. "outputs": [],
  59. "source": [
  60. "import numpy as np\n",
  61. "import pandas as pd\n",
  62. "import os\n",
  63. "import matplotlib.pyplot as plt"
  64. ]
  65. },
  66. {
  67. "cell_type": "code",
  68. "execution_count": null,
  69. "metadata": {},
  70. "outputs": [],
  71. "source": [
  72. "path = '../dat/'\n",
  73. "os.chdir(path)"
  74. ]
  75. },
  76. {
  77. "cell_type": "markdown",
  78. "metadata": {},
  79. "source": [
  80. "### Read Data"
  81. ]
  82. },
  83. {
  84. "cell_type": "code",
  85. "execution_count": null,
  86. "metadata": {},
  87. "outputs": [],
  88. "source": [
  89. "columns = list(pd.read_csv('data_movie.csv', nrows=1))\n",
  90. "print(columns)"
  91. ]
  92. },
  93. {
  94. "cell_type": "code",
  95. "execution_count": null,
  96. "metadata": {},
  97. "outputs": [],
  98. "source": [
  99. "columns_to_read = [c for c in columns if c != 'genres']\n",
  100. "\n",
  101. "data_movie = pd.read_csv('data_movie.csv', usecols = columns_to_read)\n",
  102. "\n",
  103. "display(data_movie.info())\n",
  104. "display(data_movie.head())"
  105. ]
  106. },
  107. {
  108. "cell_type": "markdown",
  109. "metadata": {},
  110. "source": [
  111. "---"
  112. ]
  113. },
  114. {
  115. "cell_type": "markdown",
  116. "metadata": {},
  117. "source": [
  118. "#### Provide the option to include only movies that are relevant based on their average rating and number of votes."
  119. ]
  120. },
  121. {
  122. "cell_type": "code",
  123. "execution_count": null,
  124. "metadata": {},
  125. "outputs": [],
  126. "source": [
  127. "data_movie[['numVotes','averageRating']].describe()"
  128. ]
  129. },
  130. {
  131. "cell_type": "code",
  132. "execution_count": null,
  133. "metadata": {},
  134. "outputs": [],
  135. "source": [
  136. "numVotes_split = data_movie['numVotes'].quantile(0.05)\n",
  137. "numVotes_split"
  138. ]
  139. },
  140. {
  141. "cell_type": "code",
  142. "execution_count": null,
  143. "metadata": {},
  144. "outputs": [],
  145. "source": [
  146. "averageRating_split = data_movie['averageRating'].quantile(0.05)\n",
  147. "averageRating_split"
  148. ]
  149. },
  150. {
  151. "cell_type": "code",
  152. "execution_count": null,
  153. "metadata": {},
  154. "outputs": [],
  155. "source": [
  156. "display(data_movie.shape)"
  157. ]
  158. },
  159. {
  160. "cell_type": "code",
  161. "execution_count": null,
  162. "metadata": {},
  163. "outputs": [],
  164. "source": [
  165. "if ignore_irrelevant_movies:\n",
  166. " data_movie = data_movie[(data_movie['numVotes'] > numVotes_split) & (data_movie['averageRating'] > averageRating_split)]"
  167. ]
  168. },
  169. {
  170. "cell_type": "code",
  171. "execution_count": null,
  172. "metadata": {},
  173. "outputs": [],
  174. "source": [
  175. "display(data_movie.shape)"
  176. ]
  177. },
  178. {
  179. "cell_type": "markdown",
  180. "metadata": {},
  181. "source": [
  182. "---"
  183. ]
  184. },
  185. {
  186. "cell_type": "markdown",
  187. "metadata": {},
  188. "source": [
  189. "#### Only include movies within the selected range of years."
  190. ]
  191. },
  192. {
  193. "cell_type": "code",
  194. "execution_count": null,
  195. "metadata": {},
  196. "outputs": [],
  197. "source": [
  198. "display(data_movie.shape)"
  199. ]
  200. },
  201. {
  202. "cell_type": "code",
  203. "execution_count": null,
  204. "metadata": {},
  205. "outputs": [],
  206. "source": [
  207. "data_movie = data_movie[(data_movie['startYear'] >= start_year) & (data_movie['startYear'] < end_year)]"
  208. ]
  209. },
  210. {
  211. "cell_type": "code",
  212. "execution_count": null,
  213. "metadata": {},
  214. "outputs": [],
  215. "source": [
  216. "display(data_movie.shape)"
  217. ]
  218. },
  219. {
  220. "cell_type": "markdown",
  221. "metadata": {},
  222. "source": [
  223. "### Prepare Data"
  224. ]
  225. },
  226. {
  227. "cell_type": "markdown",
  228. "metadata": {},
  229. "source": [
  230. "##### Add year span as a column"
  231. ]
  232. },
  233. {
  234. "cell_type": "code",
  235. "execution_count": null,
  236. "metadata": {},
  237. "outputs": [],
  238. "source": [
  239. "year_span_presplit = f\"{start_year}-{split_year}\"\n",
  240. "year_span_postsplit = f\"{split_year}-{end_year}\"\n",
  241. "year_span = np.where(data_movie['startYear'] < split_year, year_span_presplit, year_span_postsplit)\n",
  242. "data_movie.insert(1, 'year_span' , year_span)\n",
  243. "\n",
  244. "display(data_movie)"
  245. ]
  246. },
  247. {
  248. "cell_type": "markdown",
  249. "metadata": {},
  250. "source": [
  251. "##### Add counts and proportions of cast members"
  252. ]
  253. },
  254. {
  255. "cell_type": "code",
  256. "execution_count": null,
  257. "metadata": {},
  258. "outputs": [],
  259. "source": [
  260. "data_cast_numbers = pd.crosstab(data_movie['tconst'], data_movie['category']).reset_index().rename(columns = {\n",
  261. " 'actor':'num_actors',\n",
  262. " 'actress':'num_actresses',\n",
  263. "})\n",
  264. "\n",
  265. "data_cast_proportion = data_movie.groupby(['tconst'])['category'].value_counts(normalize=True).unstack().reset_index().fillna(0).rename(columns = {\n",
  266. " 'actor':'prop_actors',\n",
  267. " 'actress':'prop_actresses',\n",
  268. "})\n",
  269. "\n",
  270. "data_cast_gender_stat = pd.merge(data_cast_numbers, data_cast_proportion)\n",
  271. "data_cast_gender_stat"
  272. ]
  273. },
  274. {
  275. "cell_type": "code",
  276. "execution_count": null,
  277. "metadata": {},
  278. "outputs": [],
  279. "source": [
  280. "data_movie_distinct = data_movie.drop(columns=['category']).drop_duplicates(['tconst']).reset_index(drop = True)\n",
  281. "display(data_movie_distinct)\n",
  282. "\n",
  283. "data_movie_gender_stat = pd.merge(data_movie_distinct, data_cast_gender_stat)\n",
  284. "data_movie_gender_stat.groupby('year_span').apply(display)"
  285. ]
  286. },
  287. {
  288. "cell_type": "markdown",
  289. "metadata": {},
  290. "source": [
  291. "---"
  292. ]
  293. },
  294. {
  295. "cell_type": "markdown",
  296. "metadata": {},
  297. "source": [
  298. "##### Add indicator columns comparing the numbers of actresses and actors"
  299. ]
  300. },
  301. {
  302. "cell_type": "code",
  303. "execution_count": null,
  304. "metadata": {},
  305. "outputs": [],
  306. "source": [
  307. "data_movie_gender_stat['num_actresses_>_num_actors'] = (data_movie_gender_stat['num_actresses'] > data_movie_gender_stat['num_actors'])\n",
  308. "data_movie_gender_stat['num_actresses_=_num_actors'] = (data_movie_gender_stat['num_actresses'] == data_movie_gender_stat['num_actors'])\n",
  309. "data_movie_gender_stat['num_actresses_<_num_actors'] = (data_movie_gender_stat['num_actresses'] < data_movie_gender_stat['num_actors'])\n",
  310. "\n",
  311. "data_movie_gender_stat['num_actresses_=_0'] = (data_movie_gender_stat['num_actresses'] == 0)\n",
  312. "data_movie_gender_stat['num_actresses_>_0'] = (data_movie_gender_stat['num_actresses'] > 0)\n",
  313. "\n",
  314. "data_movie_gender_stat"
  315. ]
  316. },
  317. {
  318. "cell_type": "code",
  319. "execution_count": null,
  320. "metadata": {},
  321. "outputs": [],
  322. "source": [
  323. "data_actresses_stat = data_movie_gender_stat.groupby(['year_span','startYear'])[[\n",
  324. " 'num_actresses_>_num_actors',\n",
  325. " 'num_actresses_=_num_actors',\n",
  326. " 'num_actresses_<_num_actors',\n",
  327. " 'num_actresses_=_0',\n",
  328. " 'num_actresses_>_0',\n",
  329. "]].sum().reset_index()\n",
  330. "\n",
  331. "data_actresses_stat['num_movies'] = (\n",
  332. " data_actresses_stat['num_actresses_>_num_actors'] +\n",
  333. " data_actresses_stat['num_actresses_=_num_actors'] +\n",
  334. " data_actresses_stat['num_actresses_<_num_actors']\n",
  335. ")\n",
  336. "\n",
  337. "data_actresses_stat"
  338. ]
  339. },
  340. {
  341. "cell_type": "markdown",
  342. "metadata": {},
  343. "source": [
  344. "---"
  345. ]
  346. },
  347. {
  348. "cell_type": "markdown",
  349. "metadata": {},
  350. "source": [
  351. "##### Split data into their year spans"
  352. ]
  353. },
  354. {
  355. "cell_type": "code",
  356. "execution_count": null,
  357. "metadata": {},
  358. "outputs": [],
  359. "source": [
  360. "data_actresses_stat_timespan_presplit, data_actresses_stat_timespan_postsplit = [\n",
  361. " g.reset_index(drop=True) for _, g in data_actresses_stat.groupby(['year_span'])\n",
  362. "]"
  363. ]
  364. },
  365. {
  366. "cell_type": "code",
  367. "execution_count": null,
  368. "metadata": {},
  369. "outputs": [],
  370. "source": [
  371. "display(data_actresses_stat_timespan_presplit)\n",
  372. "display(data_actresses_stat_timespan_postsplit)"
  373. ]
  374. },
  375. {
  376. "cell_type": "code",
  377. "execution_count": null,
  378. "metadata": {},
  379. "outputs": [],
  380. "source": [
  381. "display(data_actresses_stat_timespan_presplit.describe())\n",
  382. "display(data_actresses_stat_timespan_postsplit.describe())"
  383. ]
  384. },
  385. {
  386. "cell_type": "markdown",
  387. "metadata": {},
  388. "source": [
  389. "---"
  390. ]
  391. },
  392. {
  393. "cell_type": "code",
  394. "execution_count": null,
  395. "metadata": {},
  396. "outputs": [],
  397. "source": [
  398. "data_actresses_stat_sum = data_actresses_stat.drop(columns=['startYear']).groupby(['year_span']).sum()\n",
  399. "data_actresses_stat_sum"
  400. ]
  401. },
  402. {
  403. "cell_type": "markdown",
  404. "metadata": {},
  405. "source": [
  406. "---"
  407. ]
  408. },
  409. {
  410. "cell_type": "markdown",
  411. "metadata": {},
  412. "source": [
  413. "### Analyze Data"
  414. ]
  415. },
  416. {
  417. "cell_type": "markdown",
  418. "metadata": {},
  419. "source": [
  420. "#### Compute p-Values\n",
  421. "\n",
  422. "Our goal is to find out whether actresses achieved significantly more movies with *majority shares* or fewer movies with *minority shares* in the principal casts after the split year than before it. \n",
  423. "We perform a beta-binomial test, explicitly following the example presented in the lecture and exercise on scores of the German Bundesliga."
  424. ]
  425. },
  426. {
  427. "cell_type": "markdown",
  428. "metadata": {},
  429. "source": [
  430. "- First, we put a beta-prior on $f_0$ (the majority probability before the split year) which is based on $m_0$ (the number of movies with majority share before the split year) in $n_0$ movies (the number of movies before the split year).\n",
  431. "\n",
  432. "- Under the null hypothesis $H_0: f_1 = f_0$, the number of movies with majority share after the split year $m_1$ (given the number of movies after the split year $n_1$) follows a binomial distribution. \n",
  433. "\n",
  434. "- Putting these building blocks together, we obtain a [beta-binomial distribution](https://en.wikipedia.org/wiki/Beta-binomial_distribution)\n",
  435. "\n",
  436. " \\begin{equation}\n",
  437. " p(m_1 \\vert n_1, m_0, n_0) \n",
  438. " = {n_1\\choose m_1} \n",
  439. " \\frac{\\mathcal{B}(m_0 + m_1 + 1, (n_0-m_0) + (n_1-m_1) + 1)}\n",
  440. " {\\mathcal{B}(m_0 + 1, n_0 - m_0 + 1)}.\n",
  441. " \\end{equation}\n",
  442. "\n",
  443. " This tells us the probability to observe $m_1$ for movies with *majority shares* after the split year, given the number of movies after the split year $n_1$ and the statistics $m_0$, $n_0$ for the years before."
  444. ]
  445. },
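{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (with hypothetical counts, not from our data), this distribution can be evaluated directly: `scipy.stats.betabinom` with $\\alpha = m_0 + 1$ and $\\beta = n_0 - m_0 + 1$ matches the formula above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import betabinom\n",
"\n",
"# Hypothetical counts for illustration only\n",
"m_0, n_0 = 3, 10  # before the split year\n",
"m_1, n_1 = 3, 10  # after the split year\n",
"\n",
"# Probability of observing exactly m_1 movies with majority share\n",
"betabinom.pmf(m_1, n_1, m_0 + 1, n_0 - m_0 + 1)"
]
},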
  446. {
  447. "cell_type": "code",
  448. "execution_count": null,
  449. "metadata": {},
  450. "outputs": [],
  451. "source": [
  452. "from scipy.stats import betabinom"
  453. ]
  454. },
  455. {
  456. "cell_type": "code",
  457. "execution_count": null,
  458. "metadata": {},
  459. "outputs": [],
  460. "source": [
  461. "def p_val_won(m_1, n_1, m_0, n_0):\n",
  462. " \"\"\"\n",
  463. "Compute the p-value by summing the evidence p(m_1 | n_1, m_0, n_0) over the \n",
  464. "observed count of movies satisfying the tested condition and all 'more extreme' (i.e. smaller) counts.\n",
  465. " \n",
  466. " Parameters\n",
  467. " ----------\n",
  468. " m_1 : int\n",
  469. " Number of movies satisfying the condition after the split year (0 <= m_1 <= n_1)\n",
  470. " n_1 : int\n",
  471. " Number of movies after the split year (n_1 > 0)\n",
  472. " m_0 : int\n",
  473. " Number of movies satisfying the condition before the split year (0 <= m_0 <= n_0)\n",
  474. " n_0 : int\n",
  475. " Number of movies before the split year (n_0 > 0)\n",
  476. " \n",
  477. " Returns\n",
  478. " -------\n",
  479. " The probability of observing m_1 or fewer such movies.\n",
  480. " \"\"\"\n",
  481. " return betabinom.cdf(m_1, n_1, m_0 + 1, n_0 - m_0 + 1)"
  482. ]
  483. },
  484. {
  485. "cell_type": "code",
  486. "execution_count": null,
  487. "metadata": {},
  488. "outputs": [],
  489. "source": [
  490. "def p_val_lost(m_1, n_1, m_0, n_0):\n",
  491. " \"\"\"\n",
  492. "Compute the p-value by summing the evidence p(m_1 | n_1, m_0, n_0) over the \n",
  493. "observed count of movies satisfying the tested condition and all 'more extreme' (i.e. larger) counts.\n",
  494. " \n",
  495. " Parameters\n",
  496. " ----------\n",
  497. " m_1 : int\n",
  498. " Number of movies satisfying the condition after the split year (0 <= m_1 <= n_1)\n",
  499. " n_1 : int\n",
  500. " Number of movies after the split year (n_1 > 0)\n",
  501. " m_0 : int\n",
  502. " Number of movies satisfying the condition before the split year (0 <= m_0 <= n_0)\n",
  503. " n_0 : int\n",
  504. " Number of movies before the split year (n_0 > 0)\n",
  505. " \n",
  506. " Returns\n",
  507. " -------\n",
  508. " The probability of observing m_1 or more such movies.\n",
  509. " \"\"\"\n",
  510. " return 1.0 - betabinom.cdf(m_1 - 1, n_1, m_0 + 1, n_0 - m_0 + 1)"
  511. ]
  512. },
  513. {
  514. "cell_type": "code",
  515. "execution_count": null,
  516. "metadata": {},
  517. "outputs": [],
  518. "source": [
  519. "def print_result(p_val):\n",
  520. " alpha = 0.05\n",
  521. " # Significant results?\n",
  522. " print(f\"{'Yes' if (p_val <= alpha) else 'No'}, the result is {'significant' if (p_val <= alpha) else 'insignificant'} because given the pre-split-year data, observing the post-split-year data has a {p_val*100:.2f}% probability.\")"
  523. ]
  524. },
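{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying the test functions to the movie data, a quick plausibility check with hypothetical counts (not from our data): a clear increase from 30 of 100 to 45 of 100 should yield a small p-value from `p_val_lost`, while unchanged counts should not."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plausibility check with hypothetical counts (illustration only)\n",
"print_result(p_val_lost(45, 100, 30, 100))  # clear increase -> small p-value expected\n",
"print_result(p_val_lost(30, 100, 30, 100))  # no change -> large p-value expected"
]
},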
  525. {
  526. "cell_type": "markdown",
  527. "metadata": {},
  528. "source": [
  529. "#### Are there more movies with a majority of actresses in the principal roles?"
  530. ]
  531. },
  532. {
  533. "cell_type": "code",
  534. "execution_count": null,
  535. "metadata": {},
  536. "outputs": [],
  537. "source": [
  538. "p_val_actresses_in_majority = p_val_lost(\n",
  539. " data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_>_num_actors'], # <---\n",
  540. " data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],\n",
  541. " data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_>_num_actors'], # <---\n",
  542. " data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],\n",
  543. ")\n",
  544. "\n",
  545. "print_result(p_val_actresses_in_majority)"
  546. ]
  547. },
  548. {
  549. "cell_type": "markdown",
  550. "metadata": {},
  551. "source": [
  552. "#### Are there fewer movies with a minority of actresses in the principal roles?"
  553. ]
  554. },
  555. {
  556. "cell_type": "code",
  557. "execution_count": null,
  558. "metadata": {},
  559. "outputs": [],
  560. "source": [
  561. "p_val_actresses_in_minority = p_val_won(\n",
  562. " data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_<_num_actors'], # <---\n",
  563. " data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],\n",
  564. " data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_<_num_actors'], # <---\n",
  565. " data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],\n",
  566. ")\n",
  567. "\n",
  568. "print_result(p_val_actresses_in_minority)"
  569. ]
  570. },
  571. {
  572. "cell_type": "markdown",
  573. "metadata": {},
  574. "source": [
  575. "#### Are there fewer movies with a majority of actresses in the principal roles?"
  576. ]
  577. },
  578. {
  579. "cell_type": "code",
  580. "execution_count": null,
  581. "metadata": {},
  582. "outputs": [],
  583. "source": [
  584. "p_val_actresses_in_majority = p_val_won(\n",
  585. " data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_>_num_actors'], # <---\n",
  586. " data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],\n",
  587. " data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_>_num_actors'], # <---\n",
  588. " data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],\n",
  589. ")\n",
  590. "\n",
  591. "print_result(p_val_actresses_in_majority)"
  592. ]
  593. },
  594. {
  595. "cell_type": "markdown",
  596. "metadata": {},
  597. "source": [
  598. "#### Are there more movies with a minority of actresses in the principal roles?"
  599. ]
  600. },
  601. {
  602. "cell_type": "code",
  603. "execution_count": null,
  604. "metadata": {},
  605. "outputs": [],
  606. "source": [
  607. "p_val_actresses_in_minority = p_val_lost(\n",
  608. " data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_<_num_actors'], # <---\n",
  609. " data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],\n",
  610. " data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_<_num_actors'], # <---\n",
  611. " data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],\n",
  612. ")\n",
  613. "\n",
  614. "print_result(p_val_actresses_in_minority)"
  615. ]
  616. },
  617. {
  618. "cell_type": "markdown",
  619. "metadata": {},
  620. "source": [
  621. "---"
  622. ]
  623. },
  624. {
  625. "cell_type": "markdown",
  626. "metadata": {},
  627. "source": [
  628. "#### Are there fewer movies with zero actresses in the principal roles?"
  629. ]
  630. },
  631. {
  632. "cell_type": "code",
  633. "execution_count": null,
  634. "metadata": {},
  635. "outputs": [],
  636. "source": [
  637. "p_val_actresses_eq_zero = p_val_won(\n",
  638. " data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_=_0'], # <---\n",
  639. " data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],\n",
  640. " data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_=_0'], # <---\n",
  641. " data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],\n",
  642. ")\n",
  643. "\n",
  644. "print_result(p_val_actresses_eq_zero)"
  645. ]
  646. },
  647. {
  648. "cell_type": "markdown",
  649. "metadata": {},
  650. "source": [
  651. "#### Are there more movies with more than zero actresses in the principal roles?"
  652. ]
  653. },
  654. {
  655. "cell_type": "code",
  656. "execution_count": null,
  657. "metadata": {},
  658. "outputs": [],
  659. "source": [
  660. "p_val_actresses_gt_zero = p_val_lost(\n",
  661. " data_actresses_stat_sum.loc[year_span_postsplit,'num_actresses_>_0'], # <---\n",
  662. " data_actresses_stat_sum.loc[year_span_postsplit,'num_movies'],\n",
  663. " data_actresses_stat_sum.loc[year_span_presplit, 'num_actresses_>_0'], # <---\n",
  664. " data_actresses_stat_sum.loc[year_span_presplit, 'num_movies'],\n",
  665. ")\n",
  666. "\n",
  667. "print_result(p_val_actresses_gt_zero)"
  668. ]
  669. },
  670. {
  671. "cell_type": "markdown",
  672. "metadata": {},
  673. "source": [
  674. "### Results"
  675. ]
  676. },
  677. {
  678. "cell_type": "markdown",
  679. "metadata": {},
  680. "source": [
  681. "In summary, the series of beta-binomial tests shows that there are significantly more films with a majority of actresses and significantly fewer films with a minority of actresses in the principal roles. \n",
  682. "The remaining tests did not yield significant results.\n",
  683. "\n",
  684. "Overall, we interpret the results as an indicator of an improvement in the proportion of principal actresses.\n",
  685. "\n",
  686. "Note: It is difficult for us to judge how reliable these results are. On the one hand, we learned the test method on a closely related example in the lecture and are convinced that it can be applied to this movie cast data; on the other hand, we do not know how much this result says about the ratio of actors to actresses, despite the strikingly low p-values."
  687. ]
  688. }
  689. ],
  690. "metadata": {
  691. "kernelspec": {
  692. "display_name": "Python 3",
  693. "language": "python",
  694. "name": "python3"
  695. },
  696. "language_info": {
  697. "codemirror_mode": {
  698. "name": "ipython",
  699. "version": 3
  700. },
  701. "file_extension": ".py",
  702. "mimetype": "text/x-python",
  703. "name": "python",
  704. "nbconvert_exporter": "python",
  705. "pygments_lexer": "ipython3",
  706. "version": "3.8.8"
  707. }
  708. },
  709. "nbformat": 4,
  710. "nbformat_minor": 4
  711. }
