This blog is an example of how multiple CSV files of the same format can be loaded into SQL Server and prepared for reporting. We will be using personal movie rating data from IMDb as an example to help illustrate the problem.
One of the most important moments in the movie-watching experience for me happens after the lights come back on and the popcorn bucket is long-since empty: rating the movie on IMDb. It’s a time to be analytical. It’s a time to be moved by emotions.
I’m intrigued by the number of ways that people use a 1 to 10 scale. I mean, how much of that range should really be used? Should each individual’s ratings form a perfect bell curve centered on an average score of 5/10? Since ratings are entirely subjective, whose rating of a movie can you trust?
Naturally, my fascination with this inspired me to pull my movie ratings into a database. In fact, I was able to get 7 friends of mine to go through the laborious task of rating all the movies they’ve ever seen on IMDb! The end product of this project was a Power BI dashboard that allowed me to visualize the data, but there were many steps that I needed to take to get to that point. Let’s get started!
Exporting Movie Ratings from IMDb
First off, you’ll need to rate movies on IMDb. If you’d like to try this out with some test data, you can download CSV file below and skip down to the next section.
Download CSV File
If you haven’t already created an account on IMDb, set one up and start rating movies! Once you feel good about the number of movies you’ve rated, it’s time to export this list. Unlike Rotten Tomatoes, IMDb does not have an API. Rather, IMDb allows you to export your ratings as a CSV file. It’s a bit difficult to find where you can download your ratings, so follow these instructions:
1. Sign-in to your IMDb account.
2. In the top-right corner of the browser, hover over your account name and click “Your Ratings.”
3. Scroll to the bottom of page 1 of your ratings until you see the link to “Export this list” and click it:
4. I recommend creating a folder in your C: drive called “IMDbRatings” and saving it there.
5. In order to prevent issues later on, I recommend that you open up the CSV file with a text editor (i.e. Notepad) to verify that the values are surrounded by quotation marks*. Here is a sample of what the file should look like:
*Out of my group of friends, one out of eight had an issue where there were not quotation mark surrounding each field. He exported it from someone else’s laptop and it was formatted properly.
If you’ve made it this far, great job! Now we’re ready to pull this data into SQL Server. But before we do that, let me explain the data that is exported from IMDb.
Understanding the Data that is Exported from IMDb
If you open up the CSV file, you can take a peek at all the data that IMDb provides alongside your ratings. Some of it is valuable, some of it isn’t. I’ll describe each of the fields available:
- Position: The order of the movies by the date that you rated them in descending order, so that the most recent movie you rated is in “position” 1.
- Const: The unique identifier for each movie.
- Created: The date that you rated each movie.
- Modified: Supposedly, this is the date that you modified the rating of each movie. However, in my experience I’ve always seen this field entirely NULL regardless of whether or not I modified my ratings.
- Description: This field is entirely NULL.
- Title: The title of the movie.
- Title Type: This separates out “Feature Film”, “Documentary”, “TV Series”, and “Mini-Series”. In my project I only included feature films.
- Directors: A comma-separated list of the directors.
- You rated: The 1 to 10 rating that you gave.
- IMDb Rating: The IMDb rating of the movie.
- Runtime (mins): The length of the movie in minutes.
- Genres: A comma-separated list of the genres of the movie.
- Votes: The number of people who rated the movie on IMDb.
- Release Date (month/day/year): This is the international release date of the movie. It is not necessarily the release date of the movie in the United States.
- URL: The URL of the movie on IMDb.com.
It is very important to note that each record is a snapshot of the data from IMDb at the time that you rated the movie. If you were to compare the ratings for the movie Inception across a group of people, you might see differences in the IMDb rating, number of votes, or even the genre. If you want to do additional analysis outside of the guidance of this blog post, I recommend using the number of votes to determine the most recent record (as it will go up with time).
Getting Your IMDb Ratings into SQL Server
If you’d like to quickly copy and execute the code that I’ve prepared, I recommend setting up a folder on your C-drive called IMDbRatings with a sub-folder titled Files (file location: C:\IMDbRatings\Files). Additionally, I named the database IMDb. Now, here are the steps to pull the CSV files into SQL Server:
Step 1: Creating the Import Table
Before we pull the data in, we need a place for it. This is the SQL code I used to create a table that will intake our raw data:
CREATE TABLE IMDb.dbo.rawData ( [Position] nvarchar(50) NULL, [Const] nvarchar(50) NULL, [Created] nvarchar(50) NULL, [Modified] nvarchar(50) NULL, [Description] nvarchar(50) NULL, [Title] nvarchar(200) NOT NULL, [TitleType] nvarchar(50) NULL, [Directors] nvarchar(200) NULL, [YouRated] int NULL, [IMDbRating] decimal(10,2) NULL, [Runtime] int NULL, [Year] int NULL, [Genres] nvarchar(200) NULL, [NumVotes] int NULL, [ReleaseDate] date NULL, [URL] nvarchar(200) NULL, [FileName] nvarchar(200) NULL, [UploadDatetime] datetime NULL ) GO
Alongside the fields that exist in the CSV files, I added a file name and an upload datetime field. Including the file name allows us to differentiate the users from one another, since this is not exported from IMDb. As for the datetime of the upload, it is useful for documenting purposes. Now it’s time to start pulling the data!
Step 2: BULK INSERT to Import Multiple CSV files into SQL Server
When I was thinking through my goals for this project, one of the priorities for me was that I wanted to be able to quickly update the database whenever someone sends me their updated ratings. Also, it was important for me to be able to add a new person to the dataset. While there’s a good case for doing this with SSIS, I prefer the simplicity of using a BULK INSERT within a stored procedure.
Since there are 16 columns in the CSV file and we have 18 columns in the table (because we added the [FileName] and the [UploadDatetime] columns as well), we need to execute a BULK INSERT into a view that is pulling the 16 columns from the table that come from the CSV file. Use this code to create a view in the IMDb database:
CREATE VIEW dbo.vw_rawData AS SELECT [Position] ,[Const] ,[Created] ,[Modified] ,[Description] ,[Title] ,[TitleType] ,[Directors] ,[YouRated] ,[IMDbRating] ,[Runtime] ,[Year] ,[Genres] ,[NumVotes] ,[ReleaseDate] ,[URL] FROM dbo.rawData
Note that the view is pulling every field that exists in the CSV file and not the external fields (file name and upload date).
Step 3: Creating the Stored Procedure
Next, we need to create the stored procedure that will run the BULK INSERT for each CSV file in the specified folder location. In order to tell the BULK INSERT where to find the CSV files, we will need to use xp_cmdshell. It is not enabled by default, so you will need to copy, paste, and execute the code found on MSDN here.
One last note before the code: Since there can be diacritic characters (like á, é, ú, etc.) in the spelling of movie titles and directors, we need to add a particular codepage (to support UTF-8) to the BULK INSERT. However, UTF-8 is only supported in later versions of SSMS 2014 and in SSMS 2016, so if you have an earlier version of SSMS 2014 or anything earlier that SSMS 2012 then you will need to remove the portion of code that looks like CODEPAGE = ''65001'',. By doing this, the diacritic characters will be translated into something a bit messier, but everything will run fine. Thus, I recommend using newer versions of SSMS.
Now that that’s out of the way, use the following code to create the stored procedure:
CREATE PROCEDURE [dbo].[usp_ImportIMDbData] @Filepath varchar(500) ,@Pattern varchar(100) ,@TableName varchar(128) ,@ViewName varchar(128) ,@ResetTable bit = 0 AS SET QUOTED_IDENTIFIER OFF DECLARE @query varchar(1000) DECLARE @numfiles int DECLARE @filename varchar(100) DECLARE @files TABLE (Name varchar(200) NULL) --Delete the contents of the rawData table and let the user know IF @ResetTable = 1 BEGIN PRINT 'Emptying table [' + @TableName + ']...' EXEC ('DELETE ' + @TableName) END --Pull a list of the CSV file names from the folder that they're stored in SET @query = 'master.dbo.xp_cmdshell "dir ' + @filepath+@pattern + ' /b"' INSERT @files(Name) EXEC (@query) DECLARE curs_files CURSOR FOR SELECT Name FROM @files WHERE Name IS NOT NULL --For each CSV file, execute a query SET @numfiles =0 OPEN curs_files FETCH NEXT FROM curs_files INTO @filename WHILE (@@FETCH_STATUS = 0) BEGIN SET @numfiles+=1 --BULK INSERT each CSV file into the rawData view and update the rawData table with the file name and the upload datetime SET @query = ('BULK INSERT ' + @ViewName + ' FROM ''' + @Filepath+@filename + ''' WITH( CODEPAGE = ''65001'', DATAFILETYPE = ''char'', FIRSTROW = 2, FIELDTERMINATOR=''","'', ROWTERMINATOR=''0x0a'');' + ' UPDATE ' + @TableName + ' SET [FileName] = ' + '''' + @filename + '''' + ' WHERE [FileName] Is Null;' + ' UPDATE ' + @TableName + ' SET [UploadDatetime] = ' + '''' + CAST(GETDATE() as nvarchar(1000)) + '''' + ' WHERE [UploadDatetime] Is Null;' ) PRINT 'Importing [' + @filename + '] into [' + @TableName + ']...' EXEC (@query) FETCH NEXT FROM curs_files INTO @filename END CLOSE curs_files DEALLOCATE curs_files
Now that the stored procedure has been created, just run the following code to import the CSV files (be sure to update the naming if you didn’t follow my conventions):
EXEC usp_ImportIMDbData @filepath = 'C:\IMDbRatings\Files\' ,@pattern = '*.csv' ,@TableName = 'rawData' ,@ViewName = 'vw_rawData' ,@ResetTable = 1
If you pulled in your own ratings and noticed any data type conversion errors, I’d recommend opening up your CSV file in Excel and checking the offending field for any data quality issues from IMDb and making those quick changes manually. And with that, congratulations! With the execution of a stored procedure you are able to load all of the CSV files within a folder into a table.
Step 4: Cleaning Up the Data
If you check out the data within the table, you’ll see that some clean up needs to happen before you start creating tables and views off of this data. By renaming some columns and sorting out a couple messy strings, the data is ready to go! Here is a query that can ultimately be created as a view, and the view can be used to create a staging table:
SELECT SUBSTRING([FileName],0,CHARINDEX('_',[FileName],0)) + ' ' + SUBSTRING([FileName],CHARINDEX('_',[FileName],0)+1,LEN([FileName])-(CHARINDEX('_',[FileName],0)+4)) AS UserFullName ,[Const] as [MovieID] ,[Title] ,[Directors] ,[Genres] ,[IMDbRating] ,[Runtime] ,[NumVotes] ,[ReleaseDate] ,[YouRated] as [UserRating] ,CONVERT(date, SUBSTRING([Created],21,4) + '-' + SUBSTRING([Created],5,3) + '-' + SUBSTRING([Created],9,2)) AS RatingDate ,[UploadDatetime] FROM [IMDb_demo].[dbo].[rawData] WHERE [TitleType] = 'Feature Film'
For my purposes I only pull in Feature Films, but if you’d like to see other “Title Types” then exclude the WHERE clause and include the [TitleType] column.
Analysis begins with identifying what questions you want to answer. I started out by answering simple questions that I had, like “What does the distribution of ratings look like across the group of raters?” and “Whose ratings are the most different from the IMDb rating?” Later on, I was answering questions that sparked action, like “With these people present, what movie should we watch?”
As for visualizing the data, there are several routes you can take. The simplest, most accessible route is to just pull this into Excel in a Pivot Table. If you want to go even further within Excel, then I’d recommend pulling it into Power Pivot so that you can build some DAX measures if you want to do more complex calculations outside of summing, averaging, or counting. Additionally, you can build some Excel charts to create some nice design elements. Personally, I used Power BI to visualize the data. It’s relatively simple to learn if you’re familiar with Excel, as it is also a Microsoft product. Power BI allows for far more customization than Excel charts and the Power BI team is always adding new features. Plus, it’s totally free!
Now, go watch some movies!
By Walker Storrer
Posted by admin