{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Literacy - Project\n", "## Gender Share in Movies\n", "#### Tobias Stumpp, Sophia Herrmann" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### README & TODO\n", "\n", "This file analyzes the time frame 2000-2020 and asks whether, following the introduction of the Bechdel test,\n", "a relationship can be found between the share of actresses on the principal cast and the average movie rating.\n", "Additionally, it asks whether the average movie rating can be predicted by a linear regression model with the following predictors:\n", "- share of actresses on the principal cast\n", "- share of actresses on the principal cast, genre and movie duration" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import os\n", "import matplotlib.pyplot as plt\n", "import statsmodels.api as sm " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "path = '../dat/'\n", "os.chdir(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read data, keep only the years 2000 - 2020 and include the share of actresses on the principal cast" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_movie = pd.read_csv('data_movie.csv')\n", "\n", "# Keep only the years 2000 to 2020 & sort the data frame by year\n", "dat_2000 = data_movie[(data_movie.startYear >= 2000) & (data_movie.startYear <= 2020)].sort_values(\"startYear\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/tobi/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py:1637: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self._setitem_single_block(indexer, value, name)\n" ] } ], "source": [ "# Compute share of actresses on the principal cast and include it into the data set\n", "\n", "# 1)\n", "# Number of actresses per movie\n", "number_actress = dat_2000[dat_2000.category == \"actress\"].groupby([\"tconst\"]).category.count().reset_index() \n", "number_actress = number_actress.rename(columns = {\"category\" : \"nactress\"})\n", "# Number of actors per movie\n", "number_actor = dat_2000[dat_2000.category == \"actor\"].groupby([\"tconst\"]).category.count().reset_index()\n", "number_actor = number_actor.rename(columns = {\"category\" : \"nactor\"})\n", "\n", "# Merge the actor and actress counts into dat_2000, drop the original category column and remove duplicate rows\n", "dat_2000 = pd.merge(dat_2000, number_actor, on=\"tconst\", how='left')\n", "dat_2000 = pd.merge(dat_2000, number_actress, on=\"tconst\", how='left')\n", "dat_2000.drop([\"category\"], axis = 1, inplace = True)\n", "dat_2000 = dat_2000.drop_duplicates()\n", "\n", "# 2) Share of actresses among actors and actresses\n", "dat_2000[\"proportion\"] = dat_2000[\"nactress\"] / (dat_2000[\"nactress\"] + dat_2000[\"nactor\"])\n", "\n", "# The share is NaN for films w/o actresses or w/o actors\n", "# Replace 0 or 1. 1 if no actor is given. 
0 if no actress is given\n", "propor_1 = dat_2000.index[(dat_2000.proportion.isnull()) & (dat_2000.nactor.isnull())]\n", "propor_0 = dat_2000.index[(dat_2000.proportion.isnull()) & (dat_2000.nactress.isnull())]\n", "\n", "# Assign via .loc on the data frame itself to avoid chained assignment (SettingWithCopyWarning)\n", "dat_2000.loc[propor_1, \"proportion\"] = 1\n", "dat_2000.loc[propor_0, \"proportion\"] = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Descriptive Analysis: Share of actresses on principal cast & average rating" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " lower quantile: 0.25\n", " upper quantile: 0.5\n", " median: 0.33\n" ] } ], "source": [ "# It is not important for further analysis,\n", "# only to get a better understanding of the underlying data set\n", "\n", "# Boxplot:\n", "plt.boxplot(dat_2000.proportion)\n", "plt.title(\"Boxplot: Actress share 2000 - 2020\")\n", "plt.xticks([])\n", "plt.show()\n", "\n", "# See:\n", "# - Proportions are right skewed, with a median of 0.33.\n", "# - Upper quartile (0.75 quantile): 0.5; lower quartile (0.25 quantile): 0.25\n", "print(f\" lower quantile: {np.percentile(dat_2000.proportion, 25)}\")\n", "print(f\" upper quantile: {np.percentile(dat_2000.proportion, 75)}\")\n", "# The 50th percentile is the median, not the mean\n", "print(f\" median: {np.percentile(dat_2000.proportion, 50).round(2)}\")\n", "# Some outliers between 0.9 - 1.0\n", "\n", "# - Proportions -> 50% of the actress shares lie between 0.25 & 0.5.\n", "# Conclusion: actresses are less present than actors" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " lower quantile: 5.1\n", " upper quantile: 6.9\n", " median: 6.1\n" ] } ], "source": [ "# It is not important for further analysis,\n", "# only to get a better understanding of the underlying data set\n", "\n", "# Boxplot:\n", "plt.boxplot(dat_2000.averageRating)\n", "plt.title(\"Boxplot: Average rating\")\n", "plt.xticks([])\n", "plt.show()\n", "\n", "# See:\n", "# - Average ratings appear approximately normally distributed\n", "print(f\" lower quantile: {np.percentile(dat_2000.averageRating, 25)}\")\n", "print(f\" upper quantile: {np.percentile(dat_2000.averageRating, 75)}\")\n", "# The 50th percentile is the median, not the mean\n", "print(f\" median: {np.percentile(dat_2000.averageRating, 50).round(2)}\")\n", "# Some outliers with ratings of ~9.2 - 10 and ~0 - 2.2 \n", "\n", "# - Average ratings -> 50% of the average ratings lie between 5.1 & 6.9." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyze possible relationship between average rating and the share of actresses on principal cast. Additionally analyze if average ratings can be predicted by the use of linear regression models" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# 1. Analyze possible pattern/relationship between\n", "# actress share and average rating:\n", "plt.figure(figsize=(16,8))\n", "plt.scatter(dat_2000.proportion, dat_2000.averageRating, s = 5)\n", "plt.xlabel(\"Share of actresses on principal cast\")\n", "plt.ylabel(\"Average Rating\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Result:\n", "- No clear pattern identifiable\n", "- Many values of the actress share cover a wide range of average ratings\n", "----" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>proportion</th><th>averageRating</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>proportion</th><td>1.000000</td><td>-0.072625</td></tr>\n", " <tr><th>averageRating</th><td>-0.072625</td><td>1.000000</td></tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " proportion averageRating\n", "proportion 1.000000 -0.072625\n", "averageRating -0.072625 1.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The scatter plot already suggests that the average rating cannot be predicted well from the actress share by a linear regression model.\n", "# (There is no visible linear relationship)\n", "\n", "# However, we will check this conclusion with some statistics:\n", "\n", "# 2. Compute the Pearson correlation coefficient between the share of actresses on the principal cast and the average movie rating:\n", "dat_2000[[\"proportion\", \"averageRating\"]].corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Result:\n", "- the correlation coefficient is negative (-0.07), but too close to zero to indicate a meaningful correlation between average rating and actress share.\n", "---------" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
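<table class=\"simpletable\">\n", "<caption>OLS Regression Results</caption>\n", "<tr><th>Dep. Variable:</th><td>y</td><th>R-squared:</th><td>0.005</td></tr>\n", "<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.005</td></tr>\n", "<tr><th>Method:</th><td>Least Squares</td><th>F-statistic:</th><td>518.6</td></tr>\n", "<tr><th>Date:</th><td>Mon, 07 Feb 2022</td><th>Prob (F-statistic):</th><td>1.72e-114</td></tr>\n", "<tr><th>Time:</th><td>10:11:14</td><th>Log-Likelihood:</th><td>-1.7020e+05</td></tr>\n", "<tr><th>No. Observations:</th><td>97801</td><th>AIC:</th><td>3.404e+05</td></tr>\n", "<tr><th>Df Residuals:</th><td>97799</td><th>BIC:</th><td>3.404e+05</td></tr>\n", "<tr><th>Df Model:</th><td>1</td><th></th><td></td></tr>\n", "<tr><th>Covariance Type:</th><td>nonrobust</td><th></th><td></td></tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P>|t|</th><th>[0.025</th><th>0.975]</th></tr>\n", "<tr><th>const</th><td>6.0990</td><td>0.008</td><td>760.018</td><td>0.000</td><td>6.083</td><td>6.115</td></tr>\n", "<tr><th>x1</th><td>-0.4049</td><td>0.018</td><td>-22.772</td><td>0.000</td><td>-0.440</td><td>-0.370</td></tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr><th>Omnibus:</th><td>3067.568</td><th>Durbin-Watson:</th><td>1.978</td></tr>\n", "<tr><th>Prob(Omnibus):</th><td>0.000</td><th>Jarque-Bera (JB):</th><td>3365.381</td></tr>\n", "<tr><th>Skew:</th><td>-0.445</td><th>Prob(JB):</th><td>0.00</td></tr>\n", "<tr><th>Kurtosis:</th><td>3.181</td><th>Cond. No.</th><td>4.64</td></tr>\n", "</table><br/><br/>Notes:<br/>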
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.005\n", "Model: OLS Adj. R-squared: 0.005\n", "Method: Least Squares F-statistic: 518.6\n", "Date: Mon, 07 Feb 2022 Prob (F-statistic): 1.72e-114\n", "Time: 10:11:14 Log-Likelihood: -1.7020e+05\n", "No. Observations: 97801 AIC: 3.404e+05\n", "Df Residuals: 97799 BIC: 3.404e+05\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 6.0990 0.008 760.018 0.000 6.083 6.115\n", "x1 -0.4049 0.018 -22.772 0.000 -0.440 -0.370\n", "==============================================================================\n", "Omnibus: 3067.568 Durbin-Watson: 1.978\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 3365.381\n", "Skew: -0.445 Prob(JB): 0.00\n", "Kurtosis: 3.181 Cond. No. 4.64\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Since there is almost no correlation, and hence no linear relationship,\n", "# a linear regression is unlikely to be a suitable model.\n", "# However, as a safety check and with regard to our research question, we fit it anyway.\n", "\n", "# 3. Linear regression of average rating on the share of actresses on the principal cast\n", "y = dat_2000[[\"averageRating\"]].values\n", "x = dat_2000[[\"proportion\"]].values\n", "x_ = sm.add_constant(x) # adding a constant (intercept) column\n", "\n", "reg = sm.OLS(y, x_).fit()\n", "y_pred = reg.predict(x_)\n", "reg.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Result:\n", " - the estimated coefficient for the share of actresses on the principal cast is -0.4049.\n", " - if the actress share increases by 10 percentage points (0.10), the average rating decreases by about 0.04 on average. \n", " - The estimate is significant and in line with the Pearson correlation coefficient, which also implies a small negative relationship. \n", " \n", "- However, the model has almost no predictive power. The R-squared value is very low (0.005), hence the model explains only about 0.5% of the variation in the average rating. \n", " Therefore, the aim of predicting the average rating well from the share of actresses on the principal cast cannot be fulfilled.\n", " \n", " ---------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predicting average rating on share of actresses on the principal cast, movie duration and genre" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Idea: Increase predictive accuracy by including additional explanatory variables such as runtimeMinutes and genre.\n", " (e.g. viewers of certain genres may differ in their preference for or awareness of the actress share. Hence, genre could have an impact on the rating.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1) Checking how many genres the data set includes" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Number of genres in data set: 951\n", "\n", "Top 10 genres regarding their number movies:\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>genre</th><th>counts</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>Drama</td><td>19602</td></tr>\n", " <tr><th>1</th><td>Comedy</td><td>8339</td></tr>\n", " <tr><th>2</th><td>Documentary</td><td>4657</td></tr>\n", " <tr><th>3</th><td>Comedy,Drama</td><td>4642</td></tr>\n", " <tr><th>4</th><td>Horror</td><td>3678</td></tr>\n", " <tr><th>5</th><td>Drama,Romance</td><td>2960</td></tr>\n", " <tr><th>6</th><td>Thriller</td><td>2274</td></tr>\n", " <tr><th>7</th><td>Comedy,Romance</td><td>2271</td></tr>\n", " <tr><th>8</th><td>Comedy,Drama,Romance</td><td>2195</td></tr>\n", " <tr><th>9</th><td>Drama,Thriller</td><td>1615</td></tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " genre counts\n", "0 Drama 19602\n", "1 Comedy 8339\n", "2 Documentary 4657\n", "3 Comedy,Drama 4642\n", "4 Horror 3678\n", "5 Drama,Romance 2960\n", "6 Thriller 2274\n", "7 Comedy,Romance 2271\n", "8 Comedy,Drama,Romance 2195\n", "9 Drama,Thriller 1615" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename_axis / reset_index(name=...) yield the columns genre and counts independent of the pandas version\n", "genre_counts = dat_2000.genres.value_counts().rename_axis(\"genre\").reset_index(name = \"counts\")\n", "print(f\" Number of genres in data set: {len(genre_counts)}\")\n", "print(\"\")\n", "print(\"Top 10 genres regarding their number movies:\")\n", "genre_counts.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " We can see that the data set contains 951 distinct genre entries. The majority of those are genre combinations such as Comedy,Drama.\n", " \n", " Splitting those combinations and allowing a movie to belong to several genres\n", " leads to dependencies between the dummy variables. Further, including all genre entries as dummy variables leads to 950 dummy variables,\n", " which is unwieldy.\n", "\n", " Therefore, we keep only those movies that belong to exactly one genre.\n", " This reduces the number of dummy variables and could lead to more discriminative power, because the movies and their viewers differ more from each other and hence may differ in their\n", " preference for actress share."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2) Use only movies with a single genre and prepare the data set for a linear regression" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Use the csv file \"data_movie_genre\", which lists every single genre per movie id\n", "data_movie_genre = pd.read_csv('data_movie_genre.csv')\n", "\n", "# Drop the movie id and keep the unique single genres in a list\n", "single_genres = data_movie_genre.genre.unique().tolist()\n", "\n", "# Keep only those single genres in the list that occur between 2000 and 2020\n", "single_genres = dat_2000[dat_2000[\"genres\"].isin(single_genres)].genres.unique().tolist()\n", "\n", "# Keep only single-genre movies in the data set\n", "dat_2000_gen = dat_2000[dat_2000[\"genres\"].isin(single_genres)]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Create dummy variables for those single genres (as floats, so the design matrix stays numeric).\n", "# Drop the genre \"Drama\" as reference category to avoid perfect multicollinearity (dummy variable trap)\n", "dat_2000_gen = dat_2000_gen.join(pd.get_dummies(dat_2000_gen[\"genres\"], dtype = float))\n", "dat_2000_gen.drop(columns = \"Drama\", inplace = True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
OLS Regression Results
Dep. Variable: y R-squared: 0.218
Model: OLS Adj. R-squared: 0.218
Method: Least Squares F-statistic: 487.0
Date: Mon, 07 Feb 2022 Prob (F-statistic): 0.00
Time: 10:11:15 Log-Likelihood: -72342.
No. Observations: 43680 AIC: 1.447e+05
Df Residuals: 43654 BIC: 1.450e+05
Df Model: 25
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const 6.1531 0.040 152.550 0.000 6.074 6.232
Comedy 0.0025 0.000 6.342 0.000 0.002 0.003
Horror -0.1816 0.024 -7.551 0.000 -0.229 -0.134
Action -0.8941 0.040 -22.516 0.000 -0.972 -0.816
Thriller -0.6837 0.180 -3.805 0.000 -1.036 -0.332
Documentary -0.5107 0.087 -5.882 0.000 -0.681 -0.341
Romance -0.6602 0.066 -10.001 0.000 -0.790 -0.531
Adult 0.2165 0.109 1.990 0.047 0.003 0.430
Family -0.6678 0.017 -40.124 0.000 -0.700 -0.635
Sci-Fi -0.4855 0.067 -7.259 0.000 -0.617 -0.354
Fantasy 0.8538 0.022 39.271 0.000 0.811 0.896
Animation -0.4056 0.052 -7.735 0.000 -0.508 -0.303
Crime -0.4284 0.087 -4.913 0.000 -0.599 -0.258
Adventure 0.1616 0.127 1.277 0.202 -0.086 0.410
Mystery -1.7754 0.023 -76.888 0.000 -1.821 -1.730
Musical 0.9499 0.124 7.653 0.000 0.707 1.193
Biography -0.0975 0.105 -0.929 0.353 -0.303 0.108
War -0.2995 0.085 -3.522 0.000 -0.466 -0.133
History -0.3798 1.268 -0.299 0.765 -2.865 2.106
Western -2.5504 0.732 -3.483 0.000 -3.986 -1.115
Music -0.4086 0.044 -9.317 0.000 -0.495 -0.323
Sport -1.2113 0.066 -18.424 0.000 -1.340 -1.082
News -0.0012 0.178 -0.007 0.995 -0.350 0.347
Reality-TV -0.9539 0.028 -33.925 0.000 -1.009 -0.899
runtimeMinutes -0.3767 0.153 -2.462 0.014 -0.677 -0.077
proportion -1.5547 0.116 -13.431 0.000 -1.782 -1.328
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1068.241 Durbin-Watson: 1.969
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1382.625
Skew: -0.305 Prob(JB): 5.84e-301
Kurtosis: 3.622 Cond. No. 1.96e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.96e+04. This might indicate that there are
strong multicollinearity or other numerical problems." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.218\n", "Model: OLS Adj. R-squared: 0.218\n", "Method: Least Squares F-statistic: 487.0\n", "Date: Mon, 07 Feb 2022 Prob (F-statistic): 0.00\n", "Time: 10:11:15 Log-Likelihood: -72342.\n", "No. Observations: 43680 AIC: 1.447e+05\n", "Df Residuals: 43654 BIC: 1.450e+05\n", "Df Model: 25 \n", "Covariance Type: nonrobust \n", "==================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------\n", "const 6.1531 0.040 152.550 0.000 6.074 6.232\n", "Comedy 0.0025 0.000 6.342 0.000 0.002 0.003\n", "Horror -0.1816 0.024 -7.551 0.000 -0.229 -0.134\n", "Action -0.8941 0.040 -22.516 0.000 -0.972 -0.816\n", "Thriller -0.6837 0.180 -3.805 0.000 -1.036 -0.332\n", "Documentary -0.5107 0.087 -5.882 0.000 -0.681 -0.341\n", "Romance -0.6602 0.066 -10.001 0.000 -0.790 -0.531\n", "Adult 0.2165 0.109 1.990 0.047 0.003 0.430\n", "Family -0.6678 0.017 -40.124 0.000 -0.700 -0.635\n", "Sci-Fi -0.4855 0.067 -7.259 0.000 -0.617 -0.354\n", "Fantasy 0.8538 0.022 39.271 0.000 0.811 0.896\n", "Animation -0.4056 0.052 -7.735 0.000 -0.508 -0.303\n", "Crime -0.4284 0.087 -4.913 0.000 -0.599 -0.258\n", "Adventure 0.1616 0.127 1.277 0.202 -0.086 0.410\n", "Mystery -1.7754 0.023 -76.888 0.000 -1.821 -1.730\n", "Musical 0.9499 0.124 7.653 0.000 0.707 1.193\n", "Biography -0.0975 0.105 -0.929 0.353 -0.303 0.108\n", "War -0.2995 0.085 -3.522 0.000 -0.466 -0.133\n", "History -0.3798 1.268 -0.299 0.765 -2.865 2.106\n", "Western -2.5504 0.732 -3.483 0.000 -3.986 -1.115\n", "Music -0.4086 0.044 -9.317 0.000 -0.495 -0.323\n", "Sport -1.2113 0.066 -18.424 0.000 -1.340 -1.082\n", "News -0.0012 0.178 -0.007 0.995 -0.350 0.347\n", 
"Reality-TV -0.9539 0.028 -33.925 0.000 -1.009 -0.899\n", "runtimeMinutes -0.3767 0.153 -2.462 0.014 -0.677 -0.077\n", "proportion -1.5547 0.116 -13.431 0.000 -1.782 -1.328\n", "==============================================================================\n", "Omnibus: 1068.241 Durbin-Watson: 1.969\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 1382.625\n", "Skew: -0.305 Prob(JB): 5.84e-301\n", "Kurtosis: 3.622 Cond. No. 1.96e+04\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 1.96e+04. This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Linear regression of average rating on all genre dummies, runtimeMinutes and actress share\n", "\n", "# Drop Drama (it serves as the reference genre) and add runtimeMinutes and proportion\n", "# to the list of predictor names\n", "single_genres.remove(\"Drama\")\n", "single_genres.append(\"runtimeMinutes\")\n", "single_genres.append(\"proportion\")\n", "\n", "y = dat_2000_gen[[\"averageRating\"]].values\n", "# Note: columns.intersection() keeps the column order of dat_2000_gen, so the labels\n", "# passed to xname below are only correct if that order matches single_genres\n", "x = dat_2000_gen[dat_2000_gen.columns.intersection(single_genres)].values\n", "x = sm.add_constant(x)\n", "\n", "single_genres = [\"const\"] + single_genres\n", "\n", "reg = sm.OLS(y, x).fit()\n", "reg.summary(xname = single_genres)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Result:\n", "- The estimated coefficient of the share of actresses on the principal cast is -1.55 and statistically significant.\n", "Hence, holding genre and runtime fixed, an increase of the actress share by 10 percentage points\n", "is associated with a decrease of the average rating by 0.155.\n", "\n", "- The largest genre coefficient in absolute value belongs to the genre Western (-2.55). 
Because the model contains no interaction between genre and actress share, this coefficient shifts the rating level relative to the reference genre Drama and leaves the effect of the actress share unchanged: compared to a Drama, a Western with a 10 percentage point higher actress share is predicted to be rated 2.55 + 0.155 = about 2.71 points lower.\n", "\n", "Compared to the first linear model, the additional explanatory variables\n", "give a better model fit. However, the R-squared is still low: 0.22.\n", "Hence, our model explains only a small part of the variation in average rating.\n", "\n", "Moreover, the better fit after including genres,\n", "together with the significance of several genre dummies, motivates a further analysis using only data within single genres." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }