\documentclass{article}

% if you need to pass options to natbib, use, e.g.:
%     \PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2021

\bibliographystyle{unsrtnat}
\PassOptionsToPackage{numbers, compress}{natbib}
% ready for submission
 
 \usepackage[preprint]{neurips_2021}
%\usepackage[nonatbib,preprint]{neurips_2021}

% to compile a preprint version, e.g., for submission to arXiv, add the
% [preprint] option:
%     \usepackage[preprint]{neurips_2021}

% to compile a camera-ready version, add the [final] option, e.g.:
%     \usepackage[final]{neurips_2021}

% to avoid loading the natbib package, add option nonatbib:
%    \usepackage[nonatbib]{neurips_2021}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage[colorlinks=true]{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{graphicx} %package to manage images
\usepackage[nodayofweek,level]{datetime}
\usepackage{adjustbox}

\title{Analyzing Gender Share\\in Casting Actors}

% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.

\author{%
  Sophia Herrmann\\
  Matrikelnummer 5688690\\
  \texttt{so.herrmann@student.uni-tuebingen.de} \\
  \And
  Tobias Stumpp\\
  Matrikelnummer 3798377\\
  \texttt{tobias.stumpp@student.uni-tuebingen.de} \\
}

\begin{document}

\maketitle

\begin{abstract}
  We use the datasets on \href{https://datasets.imdbws.com/title.principals.tsv.gz}{film-principals}, \href{https://datasets.imdbws.com/title.basics.tsv.gz}{film-titles}, and \href{https://datasets.imdbws.com/title.ratings.tsv.gz}{film-ratings} from the \href{https://imdb.com}{IMDb}~\citep{imdbiface,imdbws} to examine how the share of actresses in the principal cast has changed over the years. We look at when and in which genres this share has changed. We also check whether the actress share correlates with film ratings and genres and, if so, how well the film rating can be predicted.
\end{abstract}

% - Why is gender share / our research question of interest?
%  - Stand-in for the Bechdel test
%  - Questions
%    - "The Bechdel test made headlines around 2000." Has anything changed since then?
%    - "Films that pass the Bechdel test are more successful." Is that true?
%  - Goal (brief)
%     - We examine "question 1" with ..., looking for a result on whether ...
%     - We examine "question 2" with ..., looking for a result on whether ...
% - What data do we have
%   - Introduction of the IMDb data
%   - Overview of the features
% - Methods
%   - Description
%   - (Prerequisites/assumptions)
% - Analysis & results
%   - Data analysis
%   - Statistical tests
% - Problems/limitations
% - Summary


\section{Impact of Bechdel test on the female share in principal cast}
\label{sect_intro}

In the context of gender equality, and inspired by the Bechdel test and a possible impact of the test, we aim to examine the gender balance in principal roles in movies by using IMDb data~\citep{imdbiface,imdbws} on movie casting.

The Bechdel test is an indicator of active female roles in fiction. The test as understood today goes back to a comic strip from 1985, whose narrative supplies the criteria: a woman explains that she only goes to movies that (1) feature at least two women (2) who talk to each other (3) about something other than a man~\cite{bechdeltestwikien,dtwofblog}.
The English Wikipedia page on the Bechdel test makes two statements that we examine as far as our data analysis permits:

\begin{enumerate}
    \item ``the test became more widely discussed in the 2000s''~\citep{bechdeltestwikien,bechdeltestgoogletrends}\\
    We test: Did the proportion of women in principal roles in movies change after the year 2000?
    \item ``the films that passed the test had about a 37 percent higher return on investment (ROI)''~\citep{bechdeltestwikien,fivethirtyeightexclusionwomen}\\
    We test: Does the proportion of women in principal roles correlate with movie success?
\end{enumerate}

We assume that the media attention the Bechdel test received in the 2000s led both to greater popularity of movies with a higher female share in the principal cast and to a trend in the movie industry to cast more actresses in principal roles. This motivates a closer analysis of observable patterns in the share of women in the principal cast and in the popularity of movies. We treat 2000 as the critical year for a significant shift.

In line with these assumptions, we (1) test whether the actress share in principal roles changed significantly around the year 2000, and (2) analyze the correlation between actress share and average rating, our measure of popularity, as well as the predictability of the rating from the share, for the years after 2000.

% - What data do we have

\section{Dataset description and preprocessing}
\label{sect_dataset}
We analyze data from the Internet Movie Database (IMDb). As an online platform, the IMDb lets users look up and file detailed information on movies, television series, video productions, and computer games, and it provides a public subset of its data for research purposes. This public subset of the data retrievable via the IMDb API includes movies from 1890 to the present day and is regenerated daily. We use the files and features shown in Table~\ref{feature_table}.

\begin{table}
  \caption{Files and features in use}
  \label{feature_table}
  \centering
  \begin{adjustbox}{width=\columnwidth,center}
  \begin{tabular}{lllp{12cm}}
    \toprule
    
    File     & Feature & Type & Description \\
    \midrule
    film-principals\footnote{\url{https://datasets.imdbws.com/title.principals.tsv.gz}}
    & tconst     & (string)  & alphanumeric unique identifier of the title \\ \cmidrule(r){2-4}
    & nconst     & (string)  & alphanumeric unique identifier of the name/person \\ \cmidrule(r){2-4}
    & category   & (string)  & the category of job that person was in \\
    
    \midrule

    film-titles\footnote{\url{https://datasets.imdbws.com/title.basics.tsv.gz}}
    & tconst         & (string)       & alphanumeric unique identifier of the title \\ \cmidrule(r){2-4}
    & titleType      & (string)       & the type/format of the title (e.g., movie, short, tvseries, tvepisode, video) \\ \cmidrule(r){2-4}
    & startYear      & (YYYY)         & represents the release year of a title. In the case of TV Series, it is the series start year \\ \cmidrule(r){2-4}
    & runtimeMinutes & (integer)      & primary runtime of the title, in minutes \\ \cmidrule(r){2-4}
    & genres         & (string array) & includes up to three genres associated with the title \\
    
    \midrule
    
    film-ratings \footnote{\url{https://datasets.imdbws.com/title.ratings.tsv.gz}}
    & tconst         & (string)  & alphanumeric unique identifier of the title \\ \cmidrule(r){2-4}
    & averageRating  & (float)   & weighted average of all the individual user ratings \\ \cmidrule(r){2-4}
    & numVotes       & (integer) & number of votes the title has received\\
    
    \bottomrule
  \end{tabular}
  \end{adjustbox}
\end{table}


\label{sect_preprocessing}
Our download from \formatdate{30}{1}{2022} contains 77,838,777 entries, which we preprocess in several steps:

\begin{itemize}
    \item We consider only movies within the time frame from 1980 to 2020.
    
    \item We filter movies by the feature \emph{movie duration}. Some movies have a duration of only a few minutes, while others run for more than 1000 minutes. To remove movies that are likely of lower quality, we drop movies with a duration above the 95\% quantile (135 min) or below the 5\% quantile (52 min) and ignore them in our analysis.
    
    \item We only keep relevant features: The movie id (tconst), the movie release year (startYear), genres, the movie duration (runtimeMinutes), category (indicating if the movie contains actor(s) and/or actress(es) in the principal cast).
    
    \item We derive dependent features: for each movie, the share of actresses in the principal cast, and the ratio of the absolute number of actresses to the number of actors.
\end{itemize}
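
As a sketch of the derived quantities, write $a$ for the number of actresses and $b$ for the number of actors in the principal cast of a movie. The symbols $a$, $b$, $s$, and $r$ are notation introduced only for this illustration and do not appear in the dataset.
\[
  s = \frac{a}{a + b}, \qquad r = \frac{a}{b} \quad (b > 0).
\]
For example, a principal cast with $a = 1$ actress and $b = 3$ actors gives a share of $s = 1/4 = 0.25$ and a ratio of $r = 1/3 \approx 0.33$.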


For the second analysis, we only consider the time frame from 2000 to 2020, which reduces the dataset to 880,209 movies. In addition, the feature \emph{genres} needs further preprocessing. It takes 951 distinct values, because most movies combine several genres, such as Drama-Comedy or Drama-Thriller-Horror. Encoding all 951 combinations as dummy variables would be unwieldy, while splitting the combinations and allowing a movie to carry several genres would introduce dependencies between the dummy variables. Hence, for the further analysis we only consider movies that belong to a single genre (24 single genres, new dataset size 43,680). This approach may also reveal whether movies that are strictly assigned to one genre differ strongly in their features from movies of other genres.


% - Methods
\section{Methods}
\label{sect_methods}
\begin{figure}
  \centering
  %\fbox{\rule[-.5cm]{0cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
  \includegraphics[width=1\textwidth]{fig-001_Share-in-principal-cast-of-actresses-in-all-movies-1980-2020.png}
  \caption{Share in principal cast of actresses in all movies, 1980 - 2020.}
  \label{actresses_prop_figure}
\end{figure}

\subsubsection*{Descriptive Analysis}
First, we use Figure~\ref{actresses_prop_figure} to get an overview of the dispersion of the actress share in the principal cast for each year. The left time frame covers the years from 1980 to 1990 (blue points) and the right time frame covers the years from 2000 to 2020 (orange points). In addition, the mean actress share for each year is marked with green and red points.
Differences in the actress share after 2000 are difficult to judge from the figure. The shares vary widely, so the yearly means come with high standard deviations, and no clear change in pattern between the years before and after 2000 can be identified.
However, the mean values are slightly higher after 2000.
To obtain more quantitative insight into possible differences in the actress share, we apply significance tests.

\subsubsection*{Statistical analysis}

With a t-test, we want to find out whether the mean $\mu_1$ of the proportion of actresses in principal roles in 2000--2020 differs significantly from the mean $\mu_0$ of the proportion of actresses in principal roles in 1980--2000.
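
One common form of such a test is Welch's two-sample t-test. We state it here as an illustrative assumption, since the exact variant is not fixed above. With sample means $\bar{x}_0, \bar{x}_1$, sample variances $s_0^2, s_1^2$, and sample sizes $N_0, N_1$ for the two periods (notation introduced only for this sketch), the hypotheses and the test statistic read
\[
  H_0\colon \mu_1 = \mu_0, \qquad H_1\colon \mu_1 \neq \mu_0, \qquad
  t = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{s_1^2 / N_1 + s_0^2 / N_0}}.
\]
A large $|t|$ relative to the reference t-distribution yields a small p-value and speaks against $H_0$.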

With a beta-binomial test, we put a beta prior on $f_0$, the probability that a movie shows a given actress share. The prior is based on $m_0$, the number of movies with that share in 1980--2000, among the $n_0$ movies of 1980--2000.\\
Under the null hypothesis $H_0: f_1 = f_0$, the number $m_1$ of movies with that share in 2000--2020, given the number $n_1$ of movies in 2000--2020, follows a binomial distribution.\\
This gives the probability of observing $m_1$ movies with that share in 2000--2020, given $n_1$ and the statistics $m_0$, $n_0$ for the years 1980--2000.
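
Written out, and assuming a uniform $\mathrm{Beta}(1, 1)$ base prior (an assumption made for this sketch, since the base prior is not stated above), the prior after seeing the 1980-2000 data is
\[
  f_0 \sim \mathrm{Beta}(m_0 + 1,\; n_0 - m_0 + 1).
\]
Integrating the binomial likelihood of $m_1$ over this prior under $H_0$ gives the beta-binomial probability
\[
  P(m_1 \mid n_1, m_0, n_0) = {n_1 \choose m_1}\,
  \frac{\mathrm{B}(m_0 + m_1 + 1,\; n_0 - m_0 + n_1 - m_1 + 1)}{\mathrm{B}(m_0 + 1,\; n_0 - m_0 + 1)},
\]
where $\mathrm{B}$ denotes the beta function. A p-value can then be obtained by summing this probability over all outcomes at least as extreme as the observed $m_1$.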

\subsubsection*{Analyzing the relationship of the share of actresses on principal cast and average movie ratings and the suitability of linear regression models for predictive modeling}

We analyze the relationship between the actress share in the principal cast and the average rating for movies between 2000 and 2020 with a scatter plot. Further, we fit a linear regression model to evaluate its suitability for predicting the average rating from the actress share.
In addition, we analyze how including the features movie duration and genre affects the fit of the linear regression. For this extended model, we only consider movies that belong to a single genre. The genres are included as dummy variables, whereby the dummy variable for the genre ``drama'' is excluded to avoid multicollinearity.
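
The two models can be sketched as follows. The symbols are notation chosen for this illustration: $y_i$ is the average rating of movie $i$, $s_i$ its actress share, $d_i$ its duration in minutes, $g_i$ its single genre, and $\varepsilon_i$ an error term.
\[
  y_i = \beta_0 + \beta_1 s_i + \varepsilon_i
\]
\[
  y_i = \beta_0 + \beta_1 s_i + \beta_2 d_i + \sum_{g \neq \mathrm{drama}} \gamma_g \, \mathbf{1}[g_i = g] + \varepsilon_i
\]
With 24 single genres, the second model contains 23 genre dummies. Drama acts as the reference category, so each $\gamma_g$ measures the rating difference of genre $g$ relative to drama at fixed share and duration.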

%   - Statistical tests and regression

\section{Results}
\label{sect_results}

For question (1) from Section~\ref{sect_intro}, we study whether the proportion of principal roles filled by actresses differs between the periods 1980--2000 and 2000--2020. The visual analysis (Figure~\ref{actresses_prop_figure}) gives no clear indication, which we attribute to the high variance and the discrete nature of the available data.

The statistical tests, more specifically the t-test\footnote{\url{https://coreco.samstagskind.de/tobi/Gender-Share-in-Casting-Actors_DL-WS2122_public/src/branch/master/exp/exp-003_T-Test-Hypothesis-Testing.ipynb}}~\citep{gitrepo} and the beta-binomial test\footnote{\url{https://coreco.samstagskind.de/tobi/Gender-Share-in-Casting-Actors_DL-WS2122_public/src/branch/master/exp/exp-004_Beta-Binomial-Hypothesis-Testing.ipynb}}~\citep{gitrepo}, yield insignificant p-values, except for two cases of the beta-binomial test that indicate significance, namely when testing whether the following observations for 2000--2020 are unlikely under the null hypothesis:
\begin{itemize}
    \item more movies with a majority of actresses in the principal roles,
    \item fewer movies with a minority of actresses in the principal roles.
\end{itemize}

For question (2) from Section~\ref{sect_intro}, we do not find a correlation between the actress share in the principal cast and the average rating\footnote{\url{https://coreco.samstagskind.de/tobi/Gender-Share-in-Casting-Actors_DL-WS2122_public/src/branch/master/exp/exp-005_Relationship-Rating-and-Share-Actresses-on-principal-cast.ipynb}}~\citep{gitrepo}.
First, a simple scatter plot of the actress share in the principal cast against the average rating does not show any pattern: each value of the actress share covers almost the whole range of possible rating scores. In addition, the Pearson correlation coefficient of $-0.07$ confirms that there is no meaningful linear relationship. Given these results, a linear regression model is already an unsuitable prediction model, since the model assumption of linearity is not fulfilled. In line with this, the linear regression model shows a poor fit with an R-squared value of 0.005. Even though the estimated coefficient for the actress share is significant, a linear regression model with the actress share as its single predictor does not deliver accurate predictions of the average movie rating.
Including the movie duration and genre as additional explanatory variables in the linear regression model again gives unsatisfactory results. The overall model fit is better than for the first model, but still poor with an R-squared of 0.22. Hence, the hope that controlling for single genres through dummy variables would lower the variation in the data within the single genres is not fulfilled.
On the positive side, many dummy variables are significant, which motivates further research into a possible relationship between actress share in the principal cast and average rating within single genres.
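
As a consistency check on these numbers: for a least-squares regression with an intercept and a single predictor, the coefficient of determination equals the squared Pearson correlation coefficient $r$, so
\[
  R^2 = r^2 = (-0.07)^2 = 0.0049 \approx 0.005,
\]
which matches the reported fit of the single-predictor model up to the rounding of $r$. In other words, the actress share alone accounts for roughly half a percent of the variance in the average rating, whereas the extended model with $R^2 = 0.22$ accounts for about 22 percent.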

% - Problems/limitations

\section{Discussion}
\label{sect_discussion}
We do not detect a clear difference in the share of actresses in the principal cast between the years before and after 2000. The significance tests give contradictory results.
Moreover, the use of the t-test is questionable: the assumption of normally distributed data is not well fulfilled because the actress shares follow a rather discrete pattern.

In addition, our initial goal of predicting the average rating from the share of actresses in the principal cast was naive. Both the linear regression model and the small set of predictor variables proved unsuitable.

{
\small

\bibliography{bibliography}

}

\end{document}