Team project "Analyzing Gender Share in Casting Actors" as part of the lecture "Data Literacy"

\documentclass{article}
% if you need to pass options to natbib, use, e.g.:
% \PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2021
\bibliographystyle{unsrtnat}
\PassOptionsToPackage{numbers, compress}{natbib}
% ready for submission
\usepackage[preprint]{neurips_2021}
%\usepackage[nonatbib,preprint]{neurips_2021}
% to compile a preprint version, e.g., for submission to arXiv, add the
% [preprint] option:
% \usepackage[preprint]{neurips_2021}
% to compile a camera-ready version, add the [final] option, e.g.:
% \usepackage[final]{neurips_2021}
% to avoid loading the natbib package, add option nonatbib:
% \usepackage[nonatbib]{neurips_2021}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage[colorlinks=true]{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{xcolor} % colors
\usepackage{graphicx} %package to manage images
\usepackage[nodayofweek,level]{datetime}
\usepackage{adjustbox}
\title{Analyzing Gender Share\\in Casting Actors}
% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.
\author{%
Sophia Herrmann\\
Matrikelnummer 5688690\\
\texttt{so.herrmann@student.uni-tuebingen.de} \\
\And
Tobias Stumpp\\
Matrikelnummer 3798377\\
\texttt{tobias.stumpp@student.uni-tuebingen.de} \\
}
\begin{document}
\maketitle
\begin{abstract}
We use the \href{https://datasets.imdbws.com/title.principals.tsv.gz}{film-principals}, \href{https://datasets.imdbws.com/title.basics.tsv.gz}{film-titles}, and \href{https://datasets.imdbws.com/title.ratings.tsv.gz}{film-ratings} datasets from the \href{https://imdb.com}{IMDb}~\citep{imdbiface,imdbws} to examine how the share of women in the principal cast of movies has changed over the years. We look at when and in which genres the gender share has changed. We also check whether film ratings and genres correlate with the gender share and, if so, how well the film rating can be predicted.
\end{abstract}
% - Why is gender share / our research question of interest.
% - Bechdel test substitute
% - Questions
% - "The Bechdel test made headlines around 2000." Has anything changed since then?
% - "Films that pass the Bechdel test are said to be more successful." Is that true?
% - Goal (brief)
% - We examine "Question 1" with, want a result on whether..
% - We examine "Question 2" with, want a result on whether..
% - What data do we have
% - Presentation of the IMDb data
% - Overview of the features
% - Methods
% - Description
% - (Prerequisites/Assumptions)
% - Analysis & results
% - Data analysis
% - Statistical tests
% - Problems/Limitations
% - Summary
\section{Impact of the Bechdel test on the female share in the principal cast}
\label{sect_intro}
In the context of gender equality, and inspired by the Bechdel test and its possible impact, we examine the gender balance in principal roles in movies using IMDb data~\citep{imdbiface,imdbws} on movie casting.
The Bechdel test is an indicator of active female roles in fiction. The test as understood today goes back to a comic strip from 1985, whose narrative provides the criteria: a woman explains that she will only go to movies that (1) feature at least two women (2) who talk to each other (3) about something other than a man~\citep{bechdeltestwikien,dtwofblog}.
The English Wikipedia page on the Bechdel test makes two statements that we examine within the limits of our data analysis:
\begin{enumerate}
\item ``the test became more widely discussed in the 2000s''~\citep{bechdeltestwikien,bechdeltestgoogletrends}\\
We test: Did the proportion of women in principal roles in movies change after the year 2000?
\item ``the films that passed the test had about a 37 percent higher return on investment (ROI)''~\citep{bechdeltestwikien,fivethirtyeightexclusionwomen}\\
We test: Does the proportion of women in principal roles correlate with movie success?
\end{enumerate}
We assume that the media attention the Bechdel test received in the 2000s led both to an increase in the popularity of movies with a higher female share in the principal cast and to a trend in the movie industry to cast more actresses in principal roles. This motivates a further analysis of possible patterns in the share of actresses in the principal cast and in the popularity of movies. We interpret 2000 as the critical year for a significant shift.
In line with these assumptions, we test (1) whether the share of actresses in principal roles changed significantly around the year 2000, and we analyze (2) the correlation and predictability between actress share and average rating, as a measure of popularity, for the years after 2000.
% - What data do we have
\section{Dataset description and preprocessing}
\label{sect_dataset}
We analyze data from the Internet Movie Database (IMDb), an online platform for retrieving and filing detailed information on movies, television series, video productions, and computer games. IMDb provides a public subset of its data for research purposes. This subset includes movies from 1890 to the present day and is regenerated daily. We use the files and features shown in Table~\ref{feature_table}.
\begin{table}
\caption{Files and features in use}
\label{feature_table}
\centering
\begin{adjustbox}{width=\columnwidth,center}
\begin{tabular}{lllp{12cm}}
\toprule
File & Feature & Type & Description \\
\midrule
film-principals\footnote{\url{https://datasets.imdbws.com/title.principals.tsv.gz}}
& tconst & (string) & alphanumeric unique identifier of the title \\ \cmidrule(r){2-4}
& nconst & (string) & alphanumeric unique identifier of the name/person \\ \cmidrule(r){2-4}
& category & (string) & the category of job that person was in \\
\midrule
film-titles\footnote{\url{https://datasets.imdbws.com/title.basics.tsv.gz}}
& tconst & (string) & alphanumeric unique identifier of the title \\ \cmidrule(r){2-4}
& titleType & (string) & the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.) \\ \cmidrule(r){2-4}
& startYear & (YYYY) & represents the release year of a title. In the case of TV Series, it is the series start year \\ \cmidrule(r){2-4}
& runtimeMinutes & (integer) & primary runtime of the title, in minutes \\ \cmidrule(r){2-4}
& genres & (string array) & includes up to three genres associated with the title \\
\midrule
film-ratings\footnote{\url{https://datasets.imdbws.com/title.ratings.tsv.gz}}
& tconst & (string) & alphanumeric unique identifier of the title \\ \cmidrule(r){2-4}
& averageRating & (float) & weighted average of all the individual user ratings \\ \cmidrule(r){2-4}
& numVotes & (integer) & number of votes the title has received \\
\bottomrule
\end{tabular}
\end{adjustbox}
\end{table}
\label{sect_preprocessing}
Our download from \formatdate{30}{1}{2022} contains 77,838,777 movies, which we preprocess in several steps:
\begin{itemize}
\item We consider only movies within the time frame from 1980 to 2020.
\item We filter movies by the feature \emph{movie duration}. Some movies have a duration of only a few minutes, while at the other extreme some run for more than 1000 minutes. To filter out movies that are likely of lower quality, we remove movies with a duration above the 95\% quantile (135 min) or below the 5\% quantile (52 min) and ignore them in our analysis.
\item We keep only the relevant features: the movie id (tconst), the movie release year (startYear), genres, the movie duration (runtimeMinutes), and category (indicating whether the movie has actors and/or actresses in the principal cast).
\item We derive dependent features: for each movie, the share of actresses in the principal cast, and the ratio of the absolute number of actresses to the number of actors.
\end{itemize}
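As an illustration of the derived features in the last step, the two quantities can be written as follows. The symbols are notation introduced here and are not taken from the project code: $a_i^{\mathrm{f}}$ and $a_i^{\mathrm{m}}$ denote the number of actresses and actors in the principal cast of movie $i$, and we assume that the share is taken relative to actresses plus actors only.
\[
s_i = \frac{a_i^{\mathrm{f}}}{a_i^{\mathrm{f}} + a_i^{\mathrm{m}}},
\qquad
r_i = \frac{a_i^{\mathrm{f}}}{a_i^{\mathrm{m}}} \quad (a_i^{\mathrm{m}} > 0).
\]
For example, a movie with one actress and three actors in its principal cast would have $s_i = 1/4$ and $r_i = 1/3$.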
For the second analysis, we consider only the time frame between 2000 and 2020, which reduces the data set to 880,209 movies. In addition, the feature genre had to be preprocessed further. It takes 951 distinct values, because the majority of movies combine several genres, such as Drama-Comedy or Drama-Thriller-Horror. Keeping all 951 values as dummy variables would be unwieldy, and splitting the combinations so that a movie can have several genres would introduce dependencies. Hence, the further analysis considers only movies that belong to a single genre (number of single genres = 24, new data set size = 43,680). This approach could also reveal that movies strictly assigned to one genre differ strongly in their features from movies of other genres.
% - Methods
\section{Methods}
\label{sect_methods}
\begin{figure}
\centering
%\fbox{\rule[-.5cm]{0cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
\includegraphics[width=1\textwidth]{fig-001_Share-in-principal-cast-of-actresses-in-all-movies-1980-2020.png}
\caption{Share of actresses in the principal cast of all movies, 1980--2020.}
\label{actresses_prop_figure}
\end{figure}
\subsubsection*{Descriptive analysis}
First, we use Figure~\ref{actresses_prop_figure} to get an overview of the dispersion of the share of actresses in the principal cast for each year. The left time frame covers the years from 1980 to 1990 (blue points) and the right time frame covers the years from 2000 to 2020 (orange points). In addition, the mean share of actresses in the principal cast was computed for each year and marked with green and red points.
Differences in the share of actresses in the principal cast after 2000 are difficult to assess visually. The figure shows a high variation in the shares, so the yearly means come with high standard deviations. Hence, no clear change in pattern between the years before and after 2000 can be identified.
However, the mean values appear to be slightly higher after 2000.
To gain more quantitative insight into possible differences in the share of actresses in the principal cast, we apply significance tests.
\subsubsection*{Statistical analysis}
With a t-test, we check whether the mean $\mu_1$ of the proportion of actresses in principal roles in 2000--2020 differs significantly from the mean $\mu_0$ of the proportion of actresses in principal roles in 1980--2000.
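As a sketch of the quantity behind this test, and assuming the unequal-variance (Welch) form of the two-sample t-test (the text does not state which variant was used), the statistic compares the sample means $\bar{x}_0$, $\bar{x}_1$ as estimates of $\mu_0$, $\mu_1$. The sample variances $s_0^2$, $s_1^2$ are notation introduced here, and $n_0$, $n_1$ denote the numbers of movies in the two periods:
\[
t = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}}} .
\]
Under $H_0: \mu_1 = \mu_0$, the statistic is approximately t-distributed, so a large $|t|$ corresponds to a small p-value.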
With a beta-binomial test, we put a beta prior on $f_0$ (the probability that a movie has a given share), informed by $m_0$ (the number of movies with that share in 1980--2000) out of $n_0$ (the number of movies in 1980--2000).\\
Under the null hypothesis $H_0: f_1 = f_0$, the number $m_1$ of movies with that share in 2000--2020, given the number $n_1$ of movies in 2000--2020, follows a binomial distribution.\\
This gives the probability of observing $m_1$ movies with that share in 2000--2020, given $n_1$ and the statistics $m_0$, $n_0$ for the years 1980--2000.
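Written out, and assuming a uniform $\mathrm{Beta}(1,1)$ prior (an assumption made here for illustration, since the text does not specify the prior parameters), the posterior on $f_0$ after observing $m_0$ of $n_0$ is $\mathrm{Beta}(m_0 + 1,\, n_0 - m_0 + 1)$. Averaging the binomial likelihood over this posterior gives the beta-binomial probability
\[
P(m_1 \mid n_1, m_0, n_0) = {n_1 \choose m_1}
\frac{\mathrm{B}(m_1 + m_0 + 1,\; n_1 - m_1 + n_0 - m_0 + 1)}{\mathrm{B}(m_0 + 1,\; n_0 - m_0 + 1)},
\]
where $\mathrm{B}(\cdot,\cdot)$ denotes the beta function. A p-value then follows by summing this probability over all outcomes at least as extreme as the observed $m_1$.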
\subsubsection*{Relationship between actress share and average rating, and suitability of linear regression for prediction}
The relationship between the share of actresses in the principal cast and the average rating between 2000 and 2020 was analyzed with a scatter plot. Furthermore, a linear regression model was fitted to evaluate its suitability for predicting the average rating from the share of actresses in the principal cast.
Additionally, we analyzed how including the features movie duration and genre affects the fit of the linear regression. For this extended model, only movies that cover a single genre were considered. The genres were included as dummy variables, with the dummy variable for the genre ``drama'' excluded to avoid multicollinearity.
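The extended model can be sketched as follows. The symbols are notation introduced here for illustration: $y_i$ is the average rating of movie $i$, $s_i$ its share of actresses in the principal cast, $d_i$ its duration in minutes, $g_i$ its single genre, and $G$ the set of single genres.
\[
y_i = \beta_0 + \beta_1 s_i + \beta_2 d_i
+ \sum_{g \in G \setminus \{\mathrm{drama}\}} \gamma_g \, \mathbf{1}[g_i = g] + \varepsilon_i .
\]
Since every movie in this subset has exactly one genre, the full set of genre indicators would sum to one and be perfectly collinear with the intercept $\beta_0$. Dropping ``drama'' makes it the reference category, so each $\gamma_g$ measures the difference in expected rating relative to drama.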
% - Statistical tests and regression
\section{Results}
\label{sect_results}
For question (1) from Section~\ref{sect_intro}, we study whether the proportion of principal roles filled by actresses differs between the periods 1980--2000 and 2000--2020. The visual analysis (Figure~\ref{actresses_prop_figure}) gives no clear indication, which we attribute to the high variance and the discrete nature of the available data.
The statistical tests, more specifically the t-test and the beta-binomial test, result in insignificant p-values\footnote{\url{https://coreco.samstagskind.de/tobi/Gender-Share-in-Casting-Actors_DL-WS2122_public/src/branch/master/exp/exp-003_T-Test-Hypothesis-Testing.ipynb}}~\citep{gitrepo}, except in two cases of the beta-binomial test that suggest significance\footnote{\url{https://coreco.samstagskind.de/tobi/Gender-Share-in-Casting-Actors_DL-WS2122_public/src/branch/master/exp/exp-004_Beta-Binomial-Hypothesis-Testing.ipynb}}~\citep{gitrepo}. These are the tests for
\begin{itemize}
\item more movies with a majority of actresses in the principal roles.
\item fewer movies with a minority of actresses in the principal roles.
\end{itemize}
For question (2) from Section~\ref{sect_intro}, we do not find a correlation between the share of actresses in the principal cast and the average rating\footnote{\url{https://coreco.samstagskind.de/tobi/Gender-Share-in-Casting-Actors_DL-WS2122_public/src/branch/master/exp/exp-005_Relationship-Rating-and-Share-Actresses-on-principal-cast.ipynb}}~\citep{gitrepo}.
First, a simple scatter plot of the share of actresses in the principal cast against the average rating did not show any pattern. Each value of the actress share covered almost the whole range of possible rating scores. In addition, the Pearson correlation coefficient of $-0.07$ confirmed that there is no meaningful linear relationship. Given these results, a linear regression model can already be considered unsuitable for prediction, since the model assumption of linearity is not fulfilled. In line with this, the linear regression model showed a poor fit, with an $R^2$ of 0.005. Even though the estimated coefficient for the actress share was significant, a linear regression model with a single predictor does not provide accurate predictions of the average movie rating from the actress share.
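The two reported numbers are consistent with each other. For a linear regression with an intercept and a single predictor, the coefficient of determination equals the squared Pearson correlation coefficient $r$ between predictor and response, which can be checked with the rounded values reported here:
\[
R^2 = r^2 = (-0.07)^2 = 0.0049 \approx 0.005 .
\]
In other words, the actress share alone explains roughly half a percent of the variance in the average rating.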
Including the movie duration and genre as additional explanatory variables in the linear regression model again gave unsatisfactory results. The overall model fit was better than for the first model, but still poor, with an $R^2$ of 0.22. Hence, the idea of controlling for single genres with dummy variables, and thereby obtaining a lower variation in the data within each genre, did not work out.
On the positive side, many dummy variables were significant, which motivates further research into a possible relationship between the share of actresses in the principal cast and the average rating within single genres.
% - Problems/Limitations
\section{Discussion}
\label{sect_discussion}
This paper does not detect a clear difference in the share of actresses in the principal cast between the years before and after 2000. The significance tests gave contradictory results.
Moreover, the use of the t-test is questionable: the assumption of normally distributed data is not well fulfilled because the actress shares follow a rather discrete pattern.
Additionally, our initial goal of predicting the average rating from the share of actresses in the principal cast was naive. Both the linear regression model and the small set of predictor variables were unsuitable.
{
\small
\bibliography{bibliography}
}
\end{document}
