A Tutorial on Analyzing and Interpreting Movies

Authors: Akshay Anil, Atul Bharati, Chaewoon Hong

Welcome to our tutorial on how to analyze data on movies! This dataset was pulled from a movies database and is available on Kaggle, courtesy of The Movie Database (TMDb). It contains information on 5000 popular movies. By analyzing the dataset, we may be able to find the magic formula needed for movies to become successful. We will look at whether genre and budget are key factors, starting from a null hypothesis of no correlation between genre, budget, and success.

In this tutorial, we will walk you step by step through how to approach this dataset in order to get the data you want. We will analyze the data we collect with the help of visual graphics.

Our code is written in Python 3 in a Jupyter Notebook.

We start off by importing all the libraries that we will need in this project:

pandas - used to store our dataset in a table
numpy - provides mathematical functions to help us analyze the data
matplotlib - allows us to graph our data
seaborn - builds on matplotlib to draw our bar plots
warnings - suppresses any warnings we run into
re - used for regular expressions
cpi - adjusts dollar amounts for inflation so we can account for the value of money over time
datetime - builds the dates that cpi needs for the inflation adjustment
sklearn - library for machine learning
statsmodels - used to verify our null hypothesis

If you happen to get an error like this:

No module named 'module_name'

Simply go to the terminal and run

pip install module_name

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import re
import cpi
import datetime
from sklearn import linear_model
from statsmodels.formula.api import ols
warnings.filterwarnings('ignore')

For this tutorial, we downloaded the dataset from Kaggle and saved it as movie_list.csv. Even though CSV stands for Comma-Separated Values, we need to confirm that other characters such as semi-colons are not used instead. We do this by opening the file with a text editor and searching for delimiters.

CSV Commas

Here we see that commas are used, so we add sep=',' as an argument to the read_csv function.
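
If opening the file by hand is inconvenient, the standard library's csv.Sniffer can make a similar check programmatically. The sketch below runs it on a small made-up sample rather than on the real movie_list.csv; the sample text and the detect_delimiter helper are illustrative assumptions, not part of our notebook.

```python
import csv

def detect_delimiter(sample_text, candidates=",;\t"):
    """Guess which of the candidate characters separates the fields."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Made-up sample shaped like the first few lines of a CSV file
comma_sample = (
    "budget,title,revenue\n"
    "237000000,Avatar,2787965087\n"
    "245000000,Spectre,880674609\n"
)
# The same sample, but with semicolons as the separator
semicolon_sample = comma_sample.replace(",", ";")

comma_result = detect_delimiter(comma_sample)
semicolon_result = detect_delimiter(semicolon_sample)
print(repr(comma_result), repr(semicolon_result))
```

For a real file, you could pass the first few kilobytes of the file to the same helper and hand the result to read_csv as sep.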

Let's take a look at the dataset that we have.

# Display the data as is from the source
movies = pd.read_csv('movie_list.csv', sep=',')

# Display the first five rows 
movies.head()

We get a table with 20 columns! Let's trim down the dataset so we are left with movies that are originally in English and have already been released. We also want to remove any rows where the revenue or budget is equal to 0.

# Limit movies to those that are in English and already Released for a fair comparison
movies = movies[movies.original_language == 'en']
movies = movies[movies.status == 'Released']
# We want movies that had both a budget and a revenue, so drop rows where either is 0
movies = movies[movies.budget > 0]
movies = movies[movies.revenue > 0]
# Remove movies that are not categorized into genres
movies = movies[movies.genres != '[]']

# Remove columns that are not vital for analysis
movies.drop(columns=['popularity', 'original_language', 'status', 'keywords', 'original_title', 'overview', 'homepage', 'id', 'production_companies', 'production_countries', 'spoken_languages', 'tagline'], inplace=True)

# Remove any rows with missing data
movies.dropna(inplace=True)
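
When filtering out unrecorded budgets and revenues, the choice of comparison operator matters, because a zero is still non-negative. Here is a minimal sketch on a made-up three-row frame (the toy data is an assumption, not taken from the dataset):

```python
import pandas as pd

# Made-up rows: a zero stands for a value that was never recorded
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [1_000_000, 0, 5_000_000],
    "revenue": [3_000_000, 2_000_000, 0],
})

# >= 0 keeps the zero rows, > 0 drops them
kept_with_ge = toy[(toy.budget >= 0) & (toy.revenue >= 0)]
kept_with_gt = toy[(toy.budget > 0) & (toy.revenue > 0)]

print(len(kept_with_ge))
print(kept_with_gt.title.tolist())
```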

Let's create some functions to split up the release date so we can analyze changes over months, days, and years.

# Functions to split up release_date
def get_year(row):
    date = row['release_date']
    y, _, _ = date.split("-")
    return int(y)
def get_month(row):
    date = row['release_date']
    _, m, _ = date.split("-")
    return int(m)
def get_day(row):
    date = row['release_date']
    _, _, d = date.split("-")
    return int(d)

# Create and set year, month, and day of release for each movie
movies['release_year'] = movies.apply(get_year, axis=1)
movies['release_month'] = movies.apply(get_month, axis=1)
movies['release_day'] = movies.apply(get_day, axis=1)

# Remove the release_date column because we don't need it anymore
movies.drop(columns=['release_date'], inplace=True)
movies.head()
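
As an alternative to splitting the string by hand, pandas can parse the dates once and expose the parts through the .dt accessor. This is only a sketch on a toy frame; it assumes every release_date follows the YYYY-MM-DD pattern, as the rows shown in this tutorial do.

```python
import pandas as pd

# Toy frame with dates in the same format as the release_date column
toy = pd.DataFrame({"release_date": ["2009-12-10", "2007-05-19", "2015-10-26"]})

# Parse the strings once, then read year, month, and day from the result
dates = pd.to_datetime(toy["release_date"], format="%Y-%m-%d")
toy["release_year"] = dates.dt.year
toy["release_month"] = dates.dt.month
toy["release_day"] = dates.dt.day

print(toy)
```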

The value of a dollar changes over time, so we should take that into consideration. Below we create two functions, one for budget and one for revenue, to update their values to what they are worth in 2018.

def get_current_budget(row):
    # Date the movie was released
    release = datetime.date(row['release_year'], row['release_month'], row['release_day'])
    return cpi.inflate(int(row['budget']), release, datetime.date(2018, 1, 1))

def get_current_revenue(row):
    # Date the movie was released
    release = datetime.date(row['release_year'], row['release_month'], row['release_day'])
    return cpi.inflate(int(row['revenue']), release, datetime.date(2018, 1, 1))

# Replace existing budget and revenue values with those adjusted for inflation
movies['budget'] = movies.apply(get_current_budget, axis=1)
movies['revenue'] = movies.apply(get_current_revenue, axis=1)

movies.head()
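
Conceptually, a CPI-based adjustment multiplies the original amount by the ratio of the price index at the target date to the price index at the original date. The sketch below shows only that arithmetic. The index values are hypothetical and adjust_for_inflation is a name we made up for illustration; it is not the cpi library's code.

```python
def adjust_for_inflation(amount, cpi_then, cpi_now):
    """Scale a dollar amount by the ratio of two price-index values."""
    return amount * cpi_now / cpi_then

# Hypothetical index values, chosen only to keep the arithmetic easy to follow
cpi_then = 200.0
cpi_now = 250.0

adjusted = adjust_for_inflation(100_000_000, cpi_then, cpi_now)
print(adjusted)  # 125000000.0
```

With these made-up numbers, prices rose by 25%, so a 100 million dollar budget is worth 125 million dollars at the later date.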

Now let's reorder the columns to make the dataframe easier to read.

# Put vote_average first and move budget next to the other movie details
movies = movies[['vote_average', 'genres', 'budget', 'runtime', 'title', 'revenue',
                 'vote_count', 'release_year', 'release_month', 'release_day']]

movies.head()

Great! This looks like something we can work with. Let's run a quick describe on the dataframe to identify invalid data. With respect to the data we are using, none of the values should be negative (Negative runtime? What would that even look like?).

movies.describe()

Perfect! Now let's start visualizing the data we cleaned up.

Graphing

Tracking budget over the years might be interesting to see. However, not all years are equally represented in terms of number of movies, so a yearly total would mostly reflect how many movies each year has. How do we work around this problem? The maximum value depends far less on the number of movies than a total does, so let's try plotting the maximum budget per year.

plt.figure(figsize=(25, 10));
# Gives us a grid background to help line up bars with their values
sns.set()
# Get the maximum budget per year
budget_year = movies.groupby(['release_year'])['budget'].max()
budget_year = pd.DataFrame({'release_year':budget_year.index, 'max_budget':budget_year.values})
# Plot the data (recent seaborn versions require x and y as keyword arguments)
sns.barplot(x=budget_year['release_year'], y=budget_year['max_budget'])
plt.title('Max Budget Per Year')
plt.xlabel('Year')
plt.ylabel('Budget ($100M)')
plt.xticks(rotation=45)
plt.show()
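
To see what the groupby step produces without the full dataset, here is a small sketch on made-up numbers. It also uses reset_index(name=...) as a shorter way to build the two-column frame; that is a stylistic alternative, not the code our notebook ran.

```python
import pandas as pd

# Made-up budgets: 2009 has two movies, 2012 has three
toy = pd.DataFrame({
    "release_year": [2009, 2009, 2012, 2012, 2012],
    "budget": [10, 30, 20, 50, 40],
})

# One row per year, holding the largest budget seen in that year
budget_year = toy.groupby("release_year")["budget"].max().reset_index(name="max_budget")
print(budget_year)
```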

The plot shows how the budget for movies has clearly risen over the years. This could perhaps be explained by the advancement of technology and the incorporation of high quality sets.

Looking at the Genres

Next, we'll look at the genres of the movies and the average rating per genre. We'll be using the same movies variable because it contains all the columns that we will need. We'll use regular expressions to extract the genre names from the 'genres' column.

# The format of the genres column is a string that looks like this: 
# [{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]
# We only want the genres after "name":, so we will use regex to eliminate the other characters. 

# Iterate through the rows
for index,row in movies.iterrows():
    # Initialize a new string - this will replace the row value for genre to make it easier to read and extract
    display_genres = ""
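    # Isolates the name: (genre) from the ids
    # Note: \w+ stops at the first space, so a two-word name like "Science Fiction" is captured as "Science"
    g = re.findall(r"name\": \"\w+", row['genres'])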
    # g should now look like this: 
    # ['name": "Action', 'name": "Adventure', 'name": "Fantasy', 'name": "Science']
    
    # Iterate through each item in g and remove everything but the genres
    for item in g:
        remove = re.search("name\": \"", item)
        display_genres += (item[:remove.start()] + item[remove.end():]) + " "
    display_genres = display_genres[:-1]
    display_genres = display_genres.split(' ')    
    # Replace the genre value with the new list 
    movies.at[index, 'genres'] = display_genres        
movies.head()
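
Because each genres string is valid JSON, the standard json module is an alternative to regular expressions. This is not what the code above does; one difference is that multi-word names stay intact instead of being cut at the first space. The sample string below is made up in the same format as the column.

```python
import json

# Made-up value in the same format as the genres column
raw = '[{"id": 12, "name": "Adventure"}, {"id": 878, "name": "Science Fiction"}]'

# Parse the JSON list and keep only the "name" field of each entry
genre_names = [entry["name"] for entry in json.loads(raw)]
print(genre_names)
```

If you switch to this approach, genre labels such as "Science" and "TV" in the rest of this tutorial would appear under their full names instead.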

Having a list of genres per movie makes it difficult to analyze each genre separately. Let's unpack the list so that each row only has one genre.

# Break down the list and create a new dataframe with the split genres
movies = pd.DataFrame([(d, tup.title, tup.budget, tup.revenue, tup.runtime, tup.vote_average, tup.vote_count, tup.release_year, tup.release_month, tup.release_day) for tup in movies.itertuples() for d in tup.genres])
# Give the columns the names they used to have
movies.columns=['genre', 'title', 'budget', 'revenue', 'runtime', 'vote_average', 'vote_count', 'release_year', 'release_month', 'release_day']
movies.head()

Sweet! Avatar used to have the single value [Action, Adventure, Fantasy, Science], but now each of its genres has its own row. Reshaping data into this longer, one-value-per-row form is often called melting.
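
Newer versions of pandas (0.25 and later) also offer DataFrame.explode, which performs this unpacking in one call. The sketch below applies it to a two-movie toy frame; it is an alternative to the list comprehension above, not the code we ran.

```python
import pandas as pd

# Toy frame with one list of genres per movie
toy = pd.DataFrame({
    "title": ["Avatar", "Spectre"],
    "genres": [
        ["Action", "Adventure", "Fantasy", "Science"],
        ["Action", "Adventure", "Crime"],
    ],
})

# One row per (movie, genre) pair; the other columns are repeated
exploded = (
    toy.explode("genres")
       .rename(columns={"genres": "genre"})
       .reset_index(drop=True)
)
print(exploded)
```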

# Get all the possible genres from the dataset
# the keys of the genres dict will have the genre names and the values will be the sum of the average ratings for that
# particular genre
genres = {}
# average_counter is a dict with the same keys as the genres dict, but its values count the movies in each genre
average_counter = {}

for index, row in movies.iterrows():
    item = row['genre']
    if item not in genres:
        # Initialize the keys as 0/1 first 
        genres[item] = 0
        genres[item] += row['vote_average']
        average_counter[item] = 1
    else:
        genres[item] += row['vote_average']
        average_counter[item] += 1

# Displayed below are all the possible genres (keys) from this dataset
genres

{'Action': 6475.699999999999,
 'Adventure': 4621.800000000003,
 'Fantasy': 2432.7000000000007,
 'Science': 3065.0,
 'Crime': 4100.900000000002,
 'Drama': 13330.799999999974,
 'Thriller': 7304.000000000004,
 'Animation': 1315.4999999999995,
 'Family': 2976.9000000000015,
 'Western': 465.8,
 'Comedy': 9736.299999999997,
 'Romance': 5181.500000000007,
 'Horror': 2797.699999999999,
 'Mystery': 2030.0000000000007,
 'History': 1090.6999999999996,
 'War': 815.6,
 'Music': 1137.1999999999996,
 'Documentary': 640.3999999999996,
 'Foreign': 127.5,
 'TV': 45.3}
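
The same sums and counts can be computed with a pandas groupby, which may be easier to read than two dictionaries. This is a sketch on made-up ratings, shown as an alternative rather than as the code behind the output above.

```python
import pandas as pd

# Made-up ratings for two genres
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Horror", "Horror", "Horror"],
    "vote_average": [7.0, 6.0, 5.0, 4.0, 6.0],
})

# Sum, number of movies, and mean rating for each genre in one step
summary = toy.groupby("genre")["vote_average"].agg(["sum", "count", "mean"])
print(summary)
```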

# Include the average ratings
average_ratings = []
# Use the name total so we don't shadow Python's built-in sum
for (total, denom) in zip(genres.values(), average_counter.values()):
    if denom != 0:
        average_ratings.append(total/denom)
    else:
        average_ratings.append(0)

# Include the number of movies
xvals = []
for (genre, count) in zip(genres.keys(), average_counter.values()):
    xvals.append(genre + " (" + str(count) + ")")
    
plt.figure(figsize=(25, 10))
sns.barplot(x=xvals, y=average_ratings)
plt.title('Average Voter Ratings Per Genre')
plt.xlabel('Genre')
plt.ylabel('Average Rating (1-10)')
plt.xticks(rotation=45)
plt.show()

It is quite difficult to compare each genre's rating with the others, so let's sort the bars. We'll be doing a lot of sorting in this tutorial, so let's create a function for it.

def sort_graph(values, labels):
    # values are the numbers to sort by (ex. average budget, average rating).
    # labels are the names that go with them (ex. genre names, movie titles).
    # Pairing them up keeps each label attached to its value while we sort.
    pairs = sorted(zip(labels, values), key=lambda pair: pair[1])

    # Split the sorted pairs back into two lists
    sorted_labels = [label for label, _ in pairs]
    sorted_values = [value for _, value in pairs]

    return sorted_labels, sorted_values

genres,ratings = sort_graph(average_ratings,xvals)

plt.figure(figsize=(25, 10))
sns.barplot(x=genres, y=ratings)
plt.title('Average Rating Per Genre (Sorted)')
plt.xlabel('Genre')
plt.ylabel('Average Rating (1-10)')
plt.xticks(rotation=45)
plt.show()

From this graph, we can see that the horror genre seems to have the lowest average rating of all the genres. We put the number of movies in parentheses next to each genre title because some genres have small samples, which means their average ratings may not be a true representation of the population.

For example, the "Foreign" genre had only one movie with a rating of 6.9. Plotted on the graph, that single movie would make it seem like the Foreign genre has the highest average rating of all the genres. For this very reason, we removed the Foreign genre from the graph and put the number of movies in parentheses as a precaution.

Taking a Closer Look

Now that we have cleaned up the genres column, let's take a closer look at the Horror genre, which has a noticeably lower average rating than the other genres. First, let's plot the horror movies individually to see their ratings.

horror_title = []
horror_rating = []
for index, row in movies.iterrows():
    if row['genre'] == 'Horror':
        horror_title.append(row['title'])
        horror_rating.append(row['vote_average'])

titles,ratings = sort_graph(horror_rating,horror_title)

plt.figure(figsize=(30, 12))
sns.barplot(x=titles, y=ratings)
plt.title('Average Voter Ratings For Horror Movies')
plt.xlabel('Movie Title')
plt.ylabel('Average Rating (1-10)')
plt.xticks(rotation=90)
plt.xticks(np.arange(0,len(horror_title),3))
plt.show()
print("(Every third movie is labelled for readability)")

(Every third movie is labelled for readability)

Perhaps horror movies have lower ratings than the other genres because of their budgets. Let's look at the average budget for each genre to see if it plays a role in the low ratings. We'll use code that is very similar to what we used previously to find the average ratings.

genre_budget = movies.groupby(['genre'])['budget'].mean()
genre_budget = pd.DataFrame({'genre':genre_budget.index, 'average_budget':genre_budget.values})
# the keys of the dict will have the genre names and the values will be the sum of the budgets for that particular genre
genres = {}
# average_counter is a dict that will have the same keys as genres but its values will be the number of values
average_counter = {}

for index, row in movies.iterrows():
    item = row['genre']
    if item not in genres:
        # Initialize the keys as 0/1 first 
        genres[item] = 0
        genres[item] += row['budget']
        average_counter[item] = 1
    else:
        genres[item] += row['budget']
        average_counter[item] += 1

# Include the average budgets
average_budget = []
# Use the name total so we don't shadow Python's built-in sum
for (total, denom) in zip(genres.values(), average_counter.values()):
    if denom != 0:
        average_budget.append(total/denom)
    else:
        average_budget.append(0)

# Include the number of movies
xvals = []
for (genre, count) in zip(genres.keys(), average_counter.values()):
    xvals.append(genre + " (" + str(count) + ")")
    
plt.figure(figsize=(20, 10))
sns.barplot(x=xvals, y=average_budget)
plt.title('Average Budget Per Genre')
plt.xlabel('Genre')
plt.ylabel('Average Budget ($100M)')
plt.xticks(rotation=45)
plt.show()

Since the values in the x-axis (genres) don't have a particular order, let's reorder the genres so we can easily look at how the genres are related to each other when it comes to average budget.

genres,budgets = sort_graph(average_budget,xvals)
    
plt.figure(figsize=(20, 10))
sns.barplot(x=genres, y=budgets)
plt.title('Average Budget Per Genre (Sorted)')
plt.xlabel('Genre')
plt.ylabel('Average Budget ($100M)')
plt.xticks(rotation=45)
plt.show()

Again, the horror genre is near last when it comes to its budget. Now let's see the relationship between the average rating and average budget for each genre.

plt.figure(figsize=(15, 10))
plt.scatter(average_budget, average_ratings)
plt.title('Relationship Between Budget and Ratings for Every Genre')
plt.xlabel('Budget')
plt.ylabel('Rating')

for i in range(len(average_budget)):
    plt.text(average_budget[i], average_ratings[i], xvals[i], fontsize=12)
    
# Draw the regression line
# m = slope, b = y-intercept
m, b = np.polyfit(average_budget, average_ratings, 1)

# Calculate y=mx+b for each point and then create regression line
yvals = [m * x + b for x in average_budget]
plt.plot(average_budget, yvals , 'r')
plt.show()

Looking at the regression line, although for the horror genre it seems that budget and rating go hand in hand, the line is nearly horizontal, with a slightly negative slope. This implies that there's actually not much of a correlation, or relationship, between the budget of a film and the average voter rating.
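
One way to put a number on "not much of a correlation" is Pearson's correlation coefficient, which NumPy exposes through np.corrcoef. The sketch below uses made-up points, not our genre averages: one set scatters around a flat line and one lies exactly on a rising line.

```python
import numpy as np

# Made-up data: the first set of ratings barely depends on budget,
# the second lies exactly on the line rating = 2 * budget + 1
budgets = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ratings_scattered = np.array([6.1, 5.9, 6.2, 5.8, 6.0])
ratings_rising = 2.0 * budgets + 1.0

# corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_scattered = np.corrcoef(budgets, ratings_scattered)[0, 1]
r_rising = np.corrcoef(budgets, ratings_rising)[0, 1]

print(round(r_scattered, 3), round(r_rising, 3))
```

Values of r close to 0 point to a weak linear relationship, while values close to 1 or -1 point to a strong one.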

Linear Regression Modelling (Machine Learning)

We can use a machine learning algorithm like linear regression to make a model and predict the dependent/response variable given a value for the independent/explanatory variable. To learn more about the basics of linear regression, visit this link.

In this example, we will try to predict the revenue of action movies given the budget of action movies. This is an example of simple linear regression since there's only one independent (x-axis) variable involved.

action_movies = movies[movies.genre == 'Action']
action_movies.head()

plt.figure(figsize=(15, 10))
plt.scatter(action_movies['budget'], action_movies['revenue'])
plt.title('Budget vs Revenue for Action Movies')
plt.xlabel('Budget (per $100M)')
plt.ylabel('Revenue (per $1B)')

reg = linear_model.LinearRegression()
X = [[budget] for budget in action_movies['budget'].values]
y = [[revenue] for revenue in action_movies['revenue'].values]
reg_model = reg.fit(X, y)

plt.plot(X, reg_model.predict(X))

[<matplotlib.lines.Line2D at 0x1f086b23320>]

Even though most of the data points are clustered near the bottom left, meaning that the budget and revenue for those action movies are relatively low, the regression line model indicates that there's some positive association between budget and revenue.

The line of code below shows how we can use this linear regression model:

print(reg_model.predict([[150000000]]))

[[4.41609013e+08]]

The value inside the square brackets is the input value (in this case, the budget), and the predict function determines the output (the revenue) based on the regression line depicted above. According to the model, an action movie with a budget of 150 million dollars is predicted to earn about 440 million dollars in revenue.

The value below gives the slope of the regression line:

reg_model.coef_

array([[3.00142947]])

It indicates that, for action movies, each additional dollar of budget is associated with about 3 additional dollars of revenue.
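
To make the link between the slope and a prediction concrete, here is a sketch that fits a least-squares line to four made-up points and predicts by hand with intercept + slope * x. It uses np.polyfit instead of scikit-learn; with a single input variable both fit the same ordinary least squares line.

```python
import numpy as np

# Made-up budget and revenue values (arbitrary units)
budget = np.array([1.0, 2.0, 3.0, 4.0])
revenue = np.array([2.0, 7.0, 8.0, 13.0])

# Degree-1 polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(budget, revenue, 1)

# A prediction is just the line evaluated at the new budget
predicted = intercept + slope * 5.0

print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```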

Now let's see if we can fit a linear regression model for revenue including a term for an interaction between budget and genre. More information about interaction terms can be found here.

# Assigns an index to each genre (the numbering is arbitrary, and 14 is not used)
genre_to_index = {
    'Action': 0, 'Documentary': 1, 'Horror': 2, 'Music': 3, 'Romance': 4,
    'Drama': 5, 'Crime': 6, 'Comedy': 7, 'Mystery': 8, 'Thriller': 9,
    'Western': 10, 'History': 11, 'War': 12, 'Science': 13, 'Family': 15,
    'Fantasy': 16, 'Adventure': 17, 'Animation': 18, 'Foreign': 19, 'TV': 20,
}

# Look up the index of each row's genre
movies['genre_index'] = movies['genre'].map(genre_to_index)
# Fit a model for revenue including a term for an interaction between budget and genre
reg_model = ols(formula="revenue ~ budget*genre_index", data=movies).fit()
# displays linear regression model data
reg_model.summary()

The table above gives the ordinary least squares (OLS) regression results. We will direct our attention to the second portion of the table, which contains the p-values for each variable. Feel free to look at this guide to assist you with the understanding of regression table results.

The p-value column (P>|t|) helps you decide whether to reject the null hypothesis for each term, namely that its coefficient is equal to 0 and the term has no effect. A low p-value (< 0.05) indicates that the null hypothesis should be rejected, meaning that changes in the independent variable are related to changes in the dependent variable. Otherwise, if the p-value is high (>= 0.05), the variable might not be a good fit for the model. In our example, this isn't a problem for budget and genre because their p-values are both below 0.05. However, the interaction between budget and genre has a p-value of 0.077, slightly over 0.05, so we cannot reject the null hypothesis that the combination of budget and genre has no effect on revenue.

The other main column of interest is coef, the coefficient. We'll specifically use the budget variable in the analysis. Its coefficient is given as 2.7063, which indicates that for every additional dollar of budget we expect an increase of about 2 dollars and 71 cents in revenue. Because the model includes an interaction term, this is the slope of the regression line with respect to budget when genre_index is 0.
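
In the formula syntax, budget*genre_index expands to budget + genre_index + budget:genre_index, which is why the results table has a row for each of those three terms plus the intercept. The sketch below builds that design matrix by hand and solves it with NumPy's least squares routine. The data is made up so that the true coefficients are known, and it illustrates what the formula means rather than reproducing the statsmodels fit.

```python
import numpy as np

# Made-up data generated from known coefficients:
# revenue = 1 + 2*budget + 3*genre_index + 0.5*budget*genre_index
budget = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
genre_index = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
revenue = 1.0 + 2.0 * budget + 3.0 * genre_index + 0.5 * budget * genre_index

# Columns: intercept, budget, genre_index, and the budget:genre_index interaction
design = np.column_stack([
    np.ones_like(budget),
    budget,
    genre_index,
    budget * genre_index,
])

# Ordinary least squares recovers the four coefficients
coefs, *_ = np.linalg.lstsq(design, revenue, rcond=None)
print(np.round(coefs, 3))
```

The interaction coefficient (0.5 here) is how much the budget slope changes for each one-unit increase in genre_index.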

Conclusion

As you can see, taking a dataset and playing around with it can be intriguing, and you can learn many things by analyzing it the way we did.

After we set up the dataframe properly by modifying the columns and reformatting values such as those in the genres column, we were able to analyze the bar graphs and scatter plots we created. For example, we looked at the average voter ratings per genre and, after sorting the bars in ascending order, saw that horror movies tend to have the lowest ratings. Then we delved deeper by plotting the voter ratings of the individual horror movies in the dataset and saw that most of them were relatively low, which is consistent with the genre's low average. In terms of budgets, horror movies were also near the bottom of all the genres, which suggested a possible relationship between voter ratings and budgets. However, when we fit a regression line to the average rating and average budget of each genre, the line was nearly flat with a slightly negative slope, so we found little evidence of such a relationship.

For the machine learning aspect of this tutorial, we focused on the budget and revenue of action movies by creating a simple linear regression model. We saw some positive association between budget and revenue in the scatter plot, and the OLS regression results table pointed the same way. The regression model can be used to predict the revenue of an action movie from the budget spent on creating it, although predictions far outside the range of the budgets in the dataset should be treated with caution.

We hope you found this tutorial valuable and useful!

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466
3	250000000	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	http://www.thedarkknightrises.com/	49026	[{"id": 849, "name": "dc comics"}, {"id": 853,...	en	The Dark Knight Rises	Following the death of District Attorney Harve...	112.312950	[{"name": "Legendary Pictures", "id": 923}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2012-07-16	1084939099	165.0	[{"iso_639_1": "en", "name": "English"}]	Released	The Legend Ends	The Dark Knight Rises	7.6	9106
4	260000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://movies.disney.com/john-carter	49529	[{"id": 818, "name": "based on novel"}, {"id":...	en	John Carter	John Carter is a war-weary, former military ca...	43.926995	[{"name": "Walt Disney Pictures", "id": 2}]	[{"iso_3166_1": "US", "name": "United States o...	2012-03-07	284139100	132.0	[{"iso_639_1": "en", "name": "English"}]	Released	Lost in our world, found in another.	John Carter	6.1	2124

	budget	genres	revenue	runtime	title	vote_average	vote_count	release_year	release_month	release_day
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	2787965087	162.0	Avatar	7.2	11800	2009	12	10
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	961000000	169.0	Pirates of the Caribbean: At World's End	6.9	4500	2007	5	19
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	880674609	148.0	Spectre	6.3	4466	2015	10	26
3	250000000	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	1084939099	165.0	The Dark Knight Rises	7.6	9106	2012	7	16
4	260000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	284139100	132.0	John Carter	6.1	2124	2012	3	7

	budget	genres	revenue	runtime	title	vote_average	vote_count	release_year	release_month	release_day
0	2.720294e+08	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	3.200036e+09	162.0	Avatar	7.2	11800	2009	12	10
1	3.575882e+08	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	1.145474e+09	169.0	Pirates of the Caribbean: At World's End	6.9	4500	2007	5	19
2	2.553310e+08	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	9.178103e+08	148.0	Spectre	6.3	4466	2015	10	26
3	2.704743e+08	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	1.173793e+09	165.0	The Dark Knight Rises	7.6	9106	2012	7	16
4	2.809401e+08	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	3.070234e+08	132.0	John Carter	6.1	2124	2012	3	7

	vote_average	genres	budget	runtime	title	revenue	vote_count	release_year	release_month	release_day
0	7.2	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	2.720294e+08	162.0	Avatar	3.200036e+09	11800	2009	12	10
1	6.9	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	3.575882e+08	169.0	Pirates of the Caribbean: At World's End	1.145474e+09	4500	2007	5	19
2	6.3	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	2.553310e+08	148.0	Spectre	9.178103e+08	4466	2015	10	26
3	7.6	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	2.704743e+08	165.0	The Dark Knight Rises	1.173793e+09	9106	2012	7	16
4	6.1	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	2.809401e+08	132.0	John Carter	3.070234e+08	2124	2012	3	7

	vote_average	budget	runtime	revenue	vote_count	release_year	release_month	release_day
count	4469.000000	4.469000e+03	4469.000000	4.469000e+03	4469.000000	4469.000000	4469.000000	4469.000000
mean	6.091184	4.063904e+07	106.860371	1.258772e+08	724.798389	2002.286865	6.813829	15.193332
std	1.119128	5.054196e+07	21.648784	2.642942e+08	1266.791184	12.431588	3.421426	8.632051
min	0.000000	0.000000e+00	0.000000	0.000000e+00	0.000000	1916.000000	1.000000	1.000000
25%	5.600000	2.269439e+06	94.000000	0.000000e+00	61.000000	1999.000000	4.000000	8.000000
50%	6.200000	2.290909e+07	103.000000	3.608449e+07	257.000000	2005.000000	7.000000	15.000000
75%	6.800000	5.838812e+07	117.000000	1.391715e+08	796.000000	2010.000000	10.000000	22.000000
max	10.000000	4.168339e+08	338.000000	7.085038e+09	13752.000000	2017.000000	12.000000	31.000000

	vote_average	genres	budget	runtime	title	revenue	vote_count	release_year	release_month	release_day
0	7.2	[Action, Adventure, Fantasy, Science]	2.720294e+08	162.0	Avatar	3.200036e+09	11800	2009	12	10
1	6.9	[Adventure, Fantasy, Action]	3.575882e+08	169.0	Pirates of the Caribbean: At World's End	1.145474e+09	4500	2007	5	19
2	6.3	[Action, Adventure, Crime]	2.553310e+08	148.0	Spectre	9.178103e+08	4466	2015	10	26
3	7.6	[Action, Crime, Drama, Thriller]	2.704743e+08	165.0	The Dark Knight Rises	1.173793e+09	9106	2012	7	16
4	6.1	[Action, Adventure, Science]	2.809401e+08	132.0	John Carter	3.070234e+08	2124	2012	3	7

	genre	title	budget	revenue	runtime	vote_average	vote_count	release_year	release_month	release_day
0	Action	Avatar	2.720294e+08	3.200036e+09	162.0	7.2	11800	2009	12	10
1	Adventure	Avatar	2.720294e+08	3.200036e+09	162.0	7.2	11800	2009	12	10
2	Fantasy	Avatar	2.720294e+08	3.200036e+09	162.0	7.2	11800	2009	12	10
3	Science	Avatar	2.720294e+08	3.200036e+09	162.0	7.2	11800	2009	12	10
4	Adventure	Pirates of the Caribbean: At World's End	3.575882e+08	1.145474e+09	169.0	6.9	4500	2007	5	19

Dep. Variable:	revenue	R-squared:	0.297
Model:	OLS	Adj. R-squared:	0.297
Method:	Least Squares	F-statistic:	1607.
Date:	Sat, 15 Dec 2018	Prob (F-statistic):	0.00
Time:	11:50:22	Log-Likelihood:	-2.3611e+05
No. Observations:	11397	AIC:	4.722e+05
Df Residuals:	11393	BIC:	4.723e+05
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-2.042e+06	5.47e+06	-0.373	0.709	-1.28e+07	8.68e+06
budget	2.7063	0.076	35.510	0.000	2.557	2.856
genre_index	1.791e+06	6.49e+05	2.758	0.006	5.18e+05	3.06e+06
budget:genre_index	0.0126	0.007	1.771	0.077	-0.001	0.027

Omnibus:	18539.278	Durbin-Watson:	0.556
Prob(Omnibus):	0.000	Jarque-Bera (JB):	23577637.270
Skew:	10.516	Prob(JB):	0.00
Kurtosis:	224.829	Cond. No.	2.02e+09