coursera-data-engineering

Advanced SQL queries

As a Data Engineer, you will have to manipulate data to process it and help the organization find valuable insights. To do so, you should learn the basics of SQL to work with data in different ways.

This time, you will be working with a DVD rental sample dataset where you will find information about stores, customers, and rented films. Below you will find the diagram with the relationships between the tables and a brief description of each one.

The purpose of this assignment is to answer business questions using SQL language to query the database. The idea is that you get the same answer that is shown in each exercise.

After this assignment you’ll be able to:

Table of Contents

<a name=’1’></a>

1 - Database

You will be working with a modified version of the Sakila Sample Database, which is licensed under the New BSD license.

For learning purposes, let’s assume that the data belongs to Rentio, which is a fictitious company dedicated to renting movies to clients from all around the world. The company has a Dimensional Model created to perform analytical queries based on the data of the staff and the stores, including inventory, films, and paying customers. The management team would like to make informed decisions based on the available data.

Your task will be to design and build analytical queries based on Rentio’s Dimensional Model stored in their Data Warehouse.

<a name=’1.1’></a>

1.1 - Tables

Rentio’s Dimensional Model includes the following tables.

Here you can find the entity-relationship model (ERM) of the star schema showing all the tables and how they are related:

image

1.2 - Running SQL Commands in a Notebook

To interact with SQL Databases within a JupyterLab notebook, you will use the SQL “magic” offered by the ipython-sql extension. JupyterLab defines “magic” as special commands prefixed with %. Here, we’ll employ the load_ext magic to load the ipython-sql extension.

# Loading the SQL module
%load_ext sql

The provided magic command loads the ipython-sql extension, enabling connection to databases supported by SQLAlchemy. In this example, you’ll connect to an existing MySQL database. However, to proceed, it’s essential to obtain your credentials and establish a connection to the MySQL database, which is done in the src/env file.

import os 
import socket

from dotenv import load_dotenv

load_dotenv('./src/env', override=True)

DBHOST = socket.gethostname()
DBPORT = os.getenv('DBPORT')
DBNAME = os.getenv('DBNAME')
DBUSER = os.getenv('DBUSER')
DBPASSWORD = os.getenv('DBPASSWORD')

connection_url = f'mysql+pymysql://{DBUSER}:{DBPASSWORD}@{DBHOST}:{DBPORT}/{DBNAME}'

%sql {connection_url}

Some of the queries in this notebook are quite complicated. You will find optional exercises that will help you to reach the final query step-by-step.

2 - Complex Queries

2.1 - Logical Operators

WHERE clause is used to filter data based on a condition. If you want to use multiple conditions you will need to use logical operators.

The AND operator returns the records for those that meet all the conditions applied in the operator.

SELECT
    *
FROM table_name
WHERE
    column1 = 'value1'
    AND column2 < 'value2'
    AND column3 > 'value3';

The OR operator returns the records for those that meet at least one of the conditions applied in the operator.

SELECT
    *
FROM table_name
WHERE
    column1 = 'value1'
    OR column2 < 'value2'
    OR column3 > 'value3';

The NOT operator is used in a condition to reverse its result.

SELECT
    *
FROM table_name
WHERE NOT column1 = 'value1';

<a name=’ex01’></a>

Exercise 1

Write an SQL query to get the total amount obtained from the renting of Travel, Family, and Children films during the months of June and July of 2005. Present the results grouped and ordered by store id and category name.

Note: Remember that a good practice when exploring your tables for the first time is to set a LIMIT, mostly if you want to get all columns from a table with the * wildcard.

%%sql
SELECT
    store_id,
    dim_category.name AS category_name,
    SUM(amount) AS total_amount
FROM
    fact_rental
    INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
WHERE
    dim_category.name IN ('Travel', 'Family', 'Children')
    AND fact_rental.rental_date BETWEEN '2005-06-01' AND '2005-08-01'
GROUP BY
    store_id,
    dim_category.name
ORDER BY
    store_id,
    dim_category.name;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
6 rows affected.
store_id category_name total_amount
1 Children 1004.38
1 Family 1182.09
1 Travel 901.91
2 Children 973.39
2 Family 1144.00
2 Travel 1050.52
Expected Output
store_id category_name total_amount
1 Children 1004.38
1 Family 1182.09
1 Travel 901.91
2 Children 973.39
2 Family 1144.00
2 Travel 1050.52

2.2 - Common Table Expressions (CTEs)

A Common Table Expression (CTE) is used to create a temporary result set that can be referenced in another query. It only exists during the execution of a query.

WITH temp_results AS (
    SELECT
        column1,
        column2
    FROM table1
    WHERE column2 = 'value1'
)
SELECT
    *
FROM table2
INNER JOIN temp_results
ON temp_results.column1 = table2.column1;

Exercise 2

Write an SQL query using a CTE to get the average number of films per category. Then, calculate the average rounded down and rounded up.

Optional Exercise 2.1

You can start by creating a query to extract the category_id and film_id from the fact_rental. Use DISTINCT and LIMIT. To compare your results with the expected output, you can order by the category_id and film_id.

%%sql
SELECT DISTINCT 
    category_id, 
    film_id
FROM
    fact_rental
ORDER BY
    category_id,
    film_id
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
category_id film_id
1 19
1 21
1 29
1 56
1 67
1 97
1 105
1 111
1 115
1 126
Expected Output

Note: Not all of the records are shown here.

category_id film_id
1 19
1 21
1 29

Optional Exercise 2.2

Let’s practice writing Common Table Expressions (CTEs). Use optional exercise 2.1 without the LIMIT and ORDER BY statements as the code base to create a temporary result table named film_category. From the CTE film_category, select the category_id and aggregate the number of films in each category using the COUNT() function. Name the output column as films. Perform grouping by category_id. To check your result against the expected output, add the ORDER BY column category_id and LIMIT for the top 10 rows.

%%sql
WITH film_category AS (
    SELECT DISTINCT 
        category_id,
        film_id
    FROM
        fact_rental
)
SELECT
    category_id,
    COUNT(1) AS films
FROM
    film_category
GROUP BY
    category_id
ORDER BY
    category_id
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
category_id films
1 61
2 64
3 58
4 54
5 56
6 63
7 61
8 67
9 67
10 58
Expected Output

Note: Not all of the records are shown here.

category_id films
1 61
2 64
3 58
4 54
5 56
6 63

#### Optional Exercise 2.3

In this exercise, create a query with 2 CTEs. The first CTE uses the code for film_category as in optional exercise 2.2, and the second CTE is based on the code for the SELECT part of the query for optional exercise 2.2, naming the second CTE as film_category_count. Here is the scheme showing how to use two CTEs in the query:

WITH temp_results AS (
    SELECT
        column1,
        column2
    FROM table1
    WHERE column2 = 'value1'
),
temp_results_2 AS (
    SELECT
        *
    FROM table2
    INNER JOIN temp_results
    ON temp_results.column1 = table2.column1;
)
SELECT
    *
FROM temp_results_2

To calculate the average number of unique films from all categories, average_by_category, use the CTE film_category_count, averaging all the film counts with the AVG() function applied to films column.

%%sql
WITH film_category AS (
    SELECT DISTINCT 
        category_id,
        film_id
    FROM
        fact_rental
),
film_category_count AS (
    SELECT
        category_id,
        COUNT(1) AS films
    FROM
        film_category
    GROUP BY
        category_id
    ORDER BY
        category_id
)
SELECT
    AVG(films) AS average_by_category
FROM
    film_category_count;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
average_by_category
59.8750
Expected Output
average_by_category
59.8750

Optional Exercise 2.4

Use the query from the optional exercise 2.3. Wrap the last part into a CTE with the name films_average_by_category. From films_average_by_category select average_by_category, and also apply the FLOOR() and CEIL() functions to average_by_category to create separate columns for each function’s result. This will give you the final output.

%%sql
WITH film_category AS (
    SELECT DISTINCT 
        category_id,
        film_id
    FROM
        fact_rental
),
film_category_count AS (
    SELECT
        category_id,
        COUNT(1) AS films
    FROM
        film_category
    GROUP BY
        category_id
    ORDER BY
        category_id
),
films_average_by_category AS (
    SELECT
        AVG(films) AS average_by_category
    FROM
        film_category_count
)
SELECT
    average_by_category AS average,
    FLOOR(average_by_category) AS average_down,
    CEIL(average_by_category) AS average_up
FROM
    films_average_by_category;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
average average_down average_up
59.8750 59 60
Expected Output
average average_down average_up
59.875 59 60

2.3 - Subqueries

A subquery is a query nested inside another query. The inner queries will be executed first and their results will be passed into the outer queries in order to be executed.

SELECT
    *
FROM table1
WHERE column1 > (
    SELECT AVG(column1)
    FROM table1
);

Exercise 3

Write an SQL query using a subquery to get the film categories that have the number of films above the average rounded up calculated in the previous exercise.

Optional Exercise 3.1

Start by using the following two CTEs from the previous exercise 2: film_category and film_category_count. Then from the film_category_count CTE, select the average number of films, but enclose this result inside the CEIL function. Name this result as average_by_category.

%%sql
WITH film_category AS (
    SELECT DISTINCT 
        category_id,
        film_id
    FROM
        fact_rental
),
film_category_count AS (
    SELECT
        category_id,
        COUNT(1) AS films
    FROM
        film_category
    GROUP BY
        category_id
    ORDER BY
        category_id
)
SELECT
    CEIL(AVG(films)) AS average_by_category
FROM
    film_category_count;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
average_by_category
60
Expected Output
average_by_category
60

Optional Exercise 3.2

Use the same two CTEs film_category and film_category_count. Then, select all columns from film_category_count. Create a condition where you will compare the films column with the result of the main query that you created in the previous optional exercise 3.1, but now you will use it as a subquery and you won’t need to rename the column as average_by_category.

%%sql
WITH film_category AS (
    SELECT DISTINCT 
        category_id,
        film_id
    FROM
        fact_rental
),
film_category_count AS (
    SELECT
        category_id,
        COUNT(1) AS films
    FROM
        film_category
    GROUP BY
        category_id
    ORDER BY
        category_id
)
SELECT
    *
FROM
    film_category_count
WHERE
    films > (
        SELECT
            CEIL(AVG(films))
        FROM
            film_category_count
    );
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
7 rows affected.
category_id films
1 61
2 64
6 63
7 61
8 67
9 67
15 73
Expected Output
category_id films
1 61
2 64
6 63
7 61
8 67
9 67
15 73

Optional Exercise 3.3

In the previous optional exercise, you’ve got nearly the final result, but now you need to load the category names instead of the IDs.

This should give you the final result.

%%sql
WITH film_category AS (
    SELECT DISTINCT 
        fact_rental.category_id,
        dim_category.name AS category,
        film_id
    FROM
        fact_rental
        INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
),
film_category_count AS (
    SELECT
        category_id,
        category,
        COUNT(1) AS films
    FROM
        film_category
    GROUP BY
        category_id
)
SELECT
    category,
    films
FROM
    film_category_count
WHERE
    films > (
        SELECT
            CEIL(AVG(films))
        FROM
            film_category_count
    )
ORDER BY
    category;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
7 rows affected.
category films
Action 61
Animation 64
Documentary 63
Drama 61
Family 67
Foreign 67
Sports 73
Expected Output
category films
Action 61
Animation 64
Documentary 63
Drama 61
Family 67
Foreign 67
Sports 73

2.4 - CASE Statement

The CASE statement works like an if statement in programming. It will evaluate some conditions and return a value when the first condition is met. If no conditions are met it will return the value used in the ELSE clause.

SELECT
    CASE
        WHEN column1 < 0 THEN 'value1'
        WHEN column1 > 0 THEN 'value2'
        ...
        ELSE 'value3'
    END
    column2
FROM table_name;

Exercise 4

Write an SQL query to get the maximum purchase amount by customers on 2007-04-30 between 15:00 and 16:00. Obtain the customer’s full name in capital letters, the maximum purchase amount, and the payment date. Then, create a column called value_rate, and assign the low label if the amount is between 0 and 3, the mid label if it is between 3 and 6, and the high label if it is above 6. Sort by the maximum purchase amount in descending order and full name in ascending order.

Optional Exercise 4.1

Start by creating a query to extract from the fact_rental table these columns: the customer ID, the maximum amount named as max_amount (use function MAX()) and from payment_date column get only the date without hours. Use function DATE() naming the resulting column as payment_date. Constrain your results to the date times between ‘2007-04-30 15:00:00’ and ‘2007-04-30 16:00:00’. Remember to group by the customer identifier and by the payment_date(only date, without hours, so the DATE() function should be applied there as well).

Limit the output to 10 records just for this optional exercise.

%%sql
SELECT
    customer_id,
    MAX(amount) AS max_amount,
    DATE(payment_date) AS payment_date
FROM
    fact_rental
WHERE
    payment_date BETWEEN '2007-04-30 15:00:00'
    AND '2007-04-30 16:00:00'
GROUP BY
    customer_id,
    DATE(payment_date)
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
customer_id max_amount payment_date
555 7.99 2007-04-30
571 1.99 2007-04-30
174 4.99 2007-04-30
55 2.99 2007-04-30
186 2.99 2007-04-30
296 2.99 2007-04-30
82 8.99 2007-04-30
336 0.99 2007-04-30
538 0.99 2007-04-30
267 6.99 2007-04-30
Expected Output

Note: Not all of the records are shown here.

customer_id max_amount payment_date
555 7.99 2007-04-30
571 1.99 2007-04-30
174 4.99 2007-04-30

Optional Exercise 4.2

Use the previous query as a CTE max_amount_customer (without the LIMIT statement). Then, create a query expression following these steps:

This should give you the expected final result.

%%sql
WITH max_amount_customer AS (
    SELECT
        customer_id,
        MAX(amount) AS max_amount,
        DATE(payment_date) AS payment_date
    FROM
        fact_rental
    WHERE
        payment_date BETWEEN '2007-04-30 15:00:00'
        AND '2007-04-30 16:00:00'
    GROUP BY
        customer_id,
        DATE(payment_date)
)
SELECT
    CONCAT(
        UPPER(dim_customer.first_name),
        ' ',
        UPPER(dim_customer.last_name)
    ) AS full_name,
    max_amount,
    payment_date,
    CASE
        WHEN max_amount >= 0
        AND max_amount < 3 THEN 'low'
        WHEN max_amount >= 3
        AND max_amount < 6 THEN 'mid'
        WHEN max_amount >= 6 THEN 'high'
    END AS value_rate
FROM
    max_amount_customer
    INNER JOIN dim_customer ON max_amount_customer.customer_id = dim_customer.customer_id
ORDER BY
    max_amount DESC,
    full_name ASC;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
54 rows affected.
full_name max_amount payment_date value_rate
KATHRYN COLEMAN 8.99 2007-04-30 high
LOUIS LEONE 8.99 2007-04-30 high
ROY WHITING 8.99 2007-04-30 high
CARRIE PORTER 7.99 2007-04-30 high
DWIGHT LOMBARDI 7.99 2007-04-30 high
MARION SNYDER 7.99 2007-04-30 high
ALEXANDER FENNELL 6.99 2007-04-30 high
BESSIE MORRISON 6.99 2007-04-30 high
BRANDON HUEY 6.99 2007-04-30 high
DAVID ROYAL 6.99 2007-04-30 high
DERRICK BOURQUE 6.99 2007-04-30 high
GEORGE LINTON 6.99 2007-04-30 high
MARGIE WADE 6.99 2007-04-30 high
ROSA REYNOLDS 6.99 2007-04-30 high
BYRON BOX 5.99 2007-04-30 mid
LYDIA BURKE 5.99 2007-04-30 mid
STEPHANIE MITCHELL 5.99 2007-04-30 mid
ANGEL BARCLAY 4.99 2007-04-30 mid
ARLENE HARVEY 4.99 2007-04-30 mid
DIANA ALEXANDER 4.99 2007-04-30 mid
ERIK GUILLEN 4.99 2007-04-30 mid
FRED WHEAT 4.99 2007-04-30 mid
VANESSA SIMS 4.99 2007-04-30 mid
WILLARD LUMPKIN 4.99 2007-04-30 mid
YVONNE WATKINS 4.99 2007-04-30 mid
MONICA HICKS 3.99 2007-04-30 mid
SHEILA WELLS 3.99 2007-04-30 mid
STACY CUNNINGHAM 3.99 2007-04-30 mid
AGNES BISHOP 2.99 2007-04-30 low
ANNE POWELL 2.99 2007-04-30 low
DARYL LARUE 2.99 2007-04-30 low
DORIS REED 2.99 2007-04-30 low
EDWIN BURK 2.99 2007-04-30 low
HELEN HARRIS 2.99 2007-04-30 low
HERBERT KRUGER 2.99 2007-04-30 low
HOLLY FOX 2.99 2007-04-30 low
IRENE PRICE 2.99 2007-04-30 low
JESUS MCCARTNEY 2.99 2007-04-30 low
LESLIE SEWARD 2.99 2007-04-30 low
MARION OCAMPO 2.99 2007-04-30 low
MEGAN PALMER 2.99 2007-04-30 low
MELISSA KING 2.99 2007-04-30 low
MORRIS MCCARTER 2.99 2007-04-30 low
RAMONA HALE 2.99 2007-04-30 low
SHAWN HEATON 2.99 2007-04-30 low
VINCENT RALSTON 2.99 2007-04-30 low
AMANDA CARTER 1.99 2007-04-30 low
JOHNNIE CHISHOLM 1.99 2007-04-30 low
STEPHEN QUALLS 1.99 2007-04-30 low
ANITA MORALES 0.99 2007-04-30 low
JOSHUA MARK 0.99 2007-04-30 low
KEVIN SCHULER 0.99 2007-04-30 low
RACHEL BARNES 0.99 2007-04-30 low
TED BREAUX 0.99 2007-04-30 low
Expected Output

Note: Not all of the records are shown here.

full_name max_amount payment_date value_rate
KATHRYN COLEMAN 8.99 2007-04-30 high
LOUIS LEONE 8.99 2007-04-30 high
ROY WHITING 8.99 2007-04-30 high
CARRIE PORTER 7.99 2007-04-30 high
DWIGHT LOMBARDI 7.99 2007-04-30 high
MARION SNYDER 7.99 2007-04-30 high
ALEXANDER FENNELL 6.99 2007-04-30 high
BESSIE MORRISON 6.99 2007-04-30 high
BRANDON HUEY 6.99 2007-04-30 high
DAVID ROYAL 6.99 2007-04-30 high
DERRICK BOURQUE 6.99 2007-04-30 high
GEORGE LINTON 6.99 2007-04-30 high
MARGIE WADE 6.99 2007-04-30 high
ROSA REYNOLDS 6.99 2007-04-30 high
BYRON BOX 5.99 2007-04-30 mid
LYDIA BURKE 5.99 2007-04-30 mid
STEPHANIE MITCHELL 5.99 2007-04-30 mid
ANGEL BARCLAY 4.99 2007-04-30 mid
ARLENE HARVEY 4.99 2007-04-30 mid
DIANA ALEXANDER 4.99 2007-04-30 mid

2.5 - Pivot Tables

A pivot table is a term used to describe a rotated table with one or more categories as columns used to aggregate the items by a specific value.

For example, let’s say you want to pivot the following table by category while adding the value.

id category value
1 category1 1.0
1 category1 1.5
1 category2 1.0
1 category2 1.5
2 category1 2.0
2 category1 2.5
2 category2 2.0
2 category2 2.5
3 category1 3.0
3 category1 3.5
3 category2 3.0
3 category2 3.5

Then you should get the following result.

id category1 category2
1 2.5 2.5
2 4.5 4.5
3 6.5 6.5

Exercise 5

Based on rental orders, write an SQL query to get the customer’s full name, film category, and payment amount by customer and film category. Sort the results by the customer’s full name, film category, and payment amount in ascending order.

%%sql
SELECT
    CONCAT(
        dim_customer.first_name,
        ' ',
        dim_customer.last_name
    ) AS full_name,
    dim_category.name AS category,
    SUM(amount) AS amount
FROM
    fact_rental
    INNER JOIN dim_customer ON fact_rental.customer_id = dim_customer.customer_id
    INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
GROUP BY
    full_name,
    category
ORDER BY
    full_name,
    category,
    amount
LIMIT 30;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
30 rows affected.
full_name category amount
Aaron Selby Action 4.99
Aaron Selby Classics 7.98
Aaron Selby Comedy 7.98
Aaron Selby Drama 5.99
Aaron Selby Family 8.98
Aaron Selby Foreign 8.97
Aaron Selby Games 6.99
Aaron Selby Horror 10.98
Aaron Selby Music None
Aaron Selby Travel 25.94
Adam Gooch Action 2.99
Adam Gooch Animation 10.99
Adam Gooch Children 6.97
Adam Gooch Classics 3.99
Adam Gooch Comedy 8.99
Adam Gooch Family 1.98
Adam Gooch Foreign 13.98
Adam Gooch Games 10.98
Adam Gooch Horror 12.97
Adam Gooch Music 6.99
Adam Gooch New 4.99
Adam Gooch Sports 11.98
Adrian Clary Children 2.98
Adrian Clary Classics 2.99
Adrian Clary Drama 7.99
Adrian Clary Family 6.97
Adrian Clary Foreign 2.99
Adrian Clary Games 15.96
Adrian Clary Horror 3.99
Adrian Clary Music 17.97
Expected Output

Note: Not all of the records are shown here.

full_name category amount
Aaron Selby Action 4.99
Aaron Selby Classics 7.98
Aaron Selby Comedy 7.98
Aaron Selby Drama 5.99
Aaron Selby Family 8.98
Aaron Selby Foreign 8.97
Aaron Selby Games 6.99
Aaron Selby Horror 10.98
Aaron Selby Music None
Aaron Selby Travel 25.94
Adam Gooch Action 2.99
Adam Gooch Animation 10.99
Adam Gooch Children 6.97
Adam Gooch Classics 3.99
Adam Gooch Comedy 8.99
Adam Gooch Family 1.98
Adam Gooch Foreign 13.98
Adam Gooch Games 10.98
Adam Gooch Horror 12.97
Adam Gooch Music 6.99

Exercise 6

Using previous results, write an SQL query to create a pivot table that shows the total amount spent for each customer in each category. Also, fill the null values with a 0.

This approach ensures that each customer’s total spending in each category is accurately pivoted into separate columns.

%%sql
WITH customer_category_sum AS (
    SELECT
        CONCAT(
            dim_customer.first_name,
            ' ',
            dim_customer.last_name
        ) AS full_name,
        dim_category.name AS category,
        SUM(amount) AS amount
    FROM
        fact_rental
        INNER JOIN dim_customer ON fact_rental.customer_id = dim_customer.customer_id
        INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
    GROUP BY
        full_name,
        category
    ORDER BY
        full_name,
        category,
        amount
)
SELECT
    full_name,
    MAX(
        CASE
            WHEN category = 'Family' THEN amount
            ELSE 0
        END
    ) AS "Family",
    MAX(
        CASE
            WHEN category = 'Games' THEN amount
            ELSE 0
        END
    ) AS "Games",
    MAX(
        CASE
            WHEN category = 'Animation' THEN amount
            ELSE 0
        END
    ) AS "Animation",
    MAX(
        CASE
            WHEN category = 'Classics' THEN amount
            ELSE 0
        END
    ) AS "Classics",
    MAX(
        CASE
            WHEN category = 'Documentary' THEN amount
            ELSE 0
        END
    ) AS "Documentary",
    MAX(
        CASE
            WHEN category = 'Sports' THEN amount
            ELSE 0
        END
    ) AS "Sports",
    MAX(
        CASE
            WHEN category = 'New' THEN amount
            ELSE 0
        END
    ) AS "New",
    MAX(
        CASE
            WHEN category = 'Children' THEN amount
            ELSE 0
        END
    ) AS "Children",
    MAX(
        CASE
            WHEN category = 'Music' THEN amount
            ELSE 0
        END
    ) AS "Music",
    MAX(
        CASE
            WHEN category = 'Travel' THEN amount
            ELSE 0
        END
    ) AS "Travel"
FROM
    customer_category_sum
GROUP BY
    full_name
ORDER BY
    full_name
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
full_name Family Games Animation Classics Documentary Sports New Children Music Travel
Aaron Selby 8.98 6.99 0.00 7.98 0.00 0.00 0.00 0.00 0.00 25.94
Adam Gooch 1.98 10.98 10.99 3.99 0.00 11.98 4.99 6.97 6.99 0.00
Adrian Clary 6.97 15.96 0.00 2.99 0.00 6.98 0.00 2.98 17.97 0.00
Agnes Bishop 3.98 7.98 8.98 9.98 4.99 0.00 0.00 2.99 0.00 11.98
Alan Kahn 8.99 4.99 12.98 14.97 4.99 9.98 9.98 0.99 2.99 0.00
Albert Crouse 9.98 1.98 3.98 0.00 14.98 8.98 2.99 0.00 13.97 0.00
Alberto Henning 6.97 0.99 0.00 0.00 7.97 0.00 2.99 0.00 10.97 4.99
Alex Gresham 2.99 14.98 0.99 4.99 13.98 32.93 6.99 2.99 9.96 9.98
Alexander Fennell 23.95 5.98 17.97 2.99 0.00 2.99 2.99 11.97 8.98 4.99
Alfred Casillas 4.98 3.98 0.00 4.99 11.98 11.98 0.00 7.98 8.97 9.98
Expected Output

Note: Not all of the records are shown here.

full_name Family Games Animation Classics Documentary Sports New Children Music Travel
Aaron Selby 8.98 6.99 0 7.98 0 0 0 0 0 25.94
Adam Gooch 1.98 10.98 10.99 3.99 0 11.98 4.99 6.97 6.99 0
Adrian Clary 6.97 15.96 0 2.99 0 6.98 0 2.98 17.97 0
Agnes Bishop 3.98 7.98 8.98 9.98 4.99 0 0 2.99 0 11.98
Alan Kahn 8.99 4.99 12.98 14.97 4.99 9.98 9.98 0.99 2.99 0
Albert Crouse 9.98 1.98 3.98 0 14.98 8.98 2.99 0 13.97 0
Alberto Henning 6.97 0.99 0 0 7.97 0 2.99 0 10.97 4.99

3 - SQL Functions

3.1 - Working with Dates

Depending on the Database Management System (DBMS) that you use, there can be some differences with the function names to handle dates. As the DBMS that you are using here is MySQL, it provides the EXTRACT function.

The EXTRACT function is used to get the year, month, week, and other date information from a date or time value. Other DBMS such as PostgreSQL have the DATE_PART function for the same purpose.

%%sql
SELECT
    EXTRACT(SECOND FROM TIMESTAMP '2022-03-04 01:02:03') AS second_value,
    EXTRACT(MINUTE FROM TIMESTAMP '2022-03-04 01:02:03') AS minute_value,
    EXTRACT(DAY FROM TIMESTAMP '2022-03-04 01:02:03') AS day_value,
    EXTRACT(YEAR FROM TIMESTAMP '2022-03-04 01:02:03') AS year_value;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
second_value minute_value day_value year_value
3 2 4 2022

Exercise 7

Write an SQL query to get the customers who made a payment on 2007-04-30 between 15:00 and 16:00. Get the customer id and create a column that shows On time if the rental return was within the borrow time limit, and Late if it was overdue.

%%sql
SELECT
    fact_rental.customer_id,
    CASE
        WHEN (EXTRACT(HOUR FROM TIMEDIFF(fact_rental.return_date, fact_rental.rental_date))) > dim_film.rental_duration * 24 THEN 'Late'
        ELSE 'On time'
    END AS delivery    
FROM
    fact_rental
    INNER JOIN dim_customer ON fact_rental.customer_id = dim_customer.customer_id
    INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
WHERE
    fact_rental.payment_date BETWEEN '2007-04-30 15:00:00'
    AND '2007-04-30 16:00:00'
ORDER BY
    dim_customer.customer_id
;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
57 rows affected.
customer_id delivery
15 On time
30 Late
40 Late
41 Late
55 On time
76 On time
79 On time
82 Late
85 On time
96 On time
112 Late
124 Late
129 Late
131 Late
136 On time
151 On time
174 On time
178 Late
180 Late
186 On time
195 On time
217 Late
221 Late
225 Late
225 On time
243 Late
267 Late
296 On time
304 Late
314 Late
321 On time
332 Late
336 On time
363 Late
363 Late
366 Late
369 On time
373 Late
390 On time
414 On time
427 Late
428 On time
432 On time
439 Late
476 Late
503 On time
506 On time
529 On time
538 On time
555 Late
555 Late
571 Late
573 Late
576 Late
578 On time
579 On time
588 On time
Expected Output

Note: Not all of the records are shown here.

customer_id delivery
15 On time
30 Late
40 Late
41 Late
55 On time
76 On time
79 On time
82 Late
85 On time
96 On time

3.2 - SUBSTRING

The SUBSTRING function is used to get part of a string value by defining a start position and a length. Here is an example. 

SELECT
    column1,
    SUBSTRING(column1, 1, 3) AS part_column1
FROM
    table_name;

Example of the query result:

column1 part_column1
1value 1val
2value 2val
3value 3val

Exercise 8

Write an SQL query to get the initials of the staff using the first_name and last_name columns. Use CONCAT() function to join the substrings.

Example: John Lennon -> JL

%%sql
SELECT
    CONCAT(
        SUBSTRING(first_name, 1, 1),
        SUBSTRING(last_name, 1, 1)
    ) AS initials
FROM
    dim_staff;

 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
2 rows affected.
initials
MH
JS
Expected Output
initials
MH
JS

3.3 - Window Functions

Window functions aggregate the query results without reducing the number of rows returned by the query.

Considering data from the sales department, let’s say you want to get the maximum amount of units sold for every month without grouping the rows. For example, you have the following data.

id month units
1 1 1
2 1 2
3 1 3
4 2 2
5 2 4
6 2 6
7 3 4
8 3 8
9 3 12
10 4 8
11 4 16
12 4 24

If you apply this query to get the desired result:

SELECT
    id,
    month,
    units,
    MAX(units) OVER (
        PARTITION BY month
    ) AS units_maximum
FROM sales;

Then you should get the following output.

id month units units_maximum
1 1 1 3
2 1 2 3
3 1 3 3
4 2 2 6
5 2 4 6
6 2 6 6
7 3 4 12
8 3 8 12
9 3 12 12
10 4 8 24
11 4 16 24
12 4 24 24

Exercise 9

Write an SQL query to get the most paid movie by rating. Get the title of the movie, the rating, and the amount collected.

Optional Exercise 9.1

Join the fact_rental and dim_film tables through the film_id, and select the film title and rating from dim_film. Then take the SUM() of the fact_rental.amount column as amount. Group by title and rating. Limit your result to 10 rows.

%%sql
SELECT
    dim_film.title,
    dim_film.rating,
    SUM(fact_rental.amount) AS amount
FROM
    fact_rental
    INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
GROUP BY
    title,
    rating
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
title rating amount
Academy Dinosaur PG 33.79
Ace Goldfinger G 52.93
Adaptation Holes NC-17 34.89
Affair Prejudice G 83.79
African Egg G 47.89
Agent Truman PG 111.81
Airplane Sierra PG-13 82.85
Airport Pollock R 86.85
Alabama Devil PG-13 71.88
Aladdin Calendar NC-17 131.77
Expected Output

Note: Not all of the records are shown here.

title rating amount
Academy Dinosaur PG 33.79
Ace Goldfinger G 52.93
Adaptation Holes NC-17 34.89
Affair Prejudice G 83.79

Optional Exercise 9.2

Create a CTE named movies_amount_rating with the previous query (without the LIMIT statement). As query expression, select the title, rating, amount and use the RANK() window function to assign a rank to each row based on a partition over the rating and ordered by the amount in descending order. Name this last column as rank_movies. Limit your results to 10.

%%sql
WITH movies_amount_rating AS(
    SELECT
        dim_film.title,
        dim_film.rating,
        SUM(fact_rental.amount) AS amount
    FROM
        fact_rental
        INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
    GROUP BY
        title,
        rating
)
SELECT
    title,
    rating,
    amount,
    RANK() over (
        PARTITION BY rating
        ORDER BY
            amount DESC
    ) AS rank_movies
FROM
    movies_amount_rating
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
title rating amount rank_movies
Saturday Lambs G 190.74 1
Torque Bound G 169.76 2
Dogma Family G 168.72 3
Cat Coneheads G 159.73 4
Secrets Paradise G 139.82 5
Garden Island G 134.82 6
Easy Gladiator G 132.80 7
Heartbreakers Bright G 132.80 7
Midsummer Groundhog G 129.82 9
Beauty Grease G 125.79 10
Expected Output

Note: Not all of the records are shown here.

title rating amount rank_movies
Saturday Lambs G 190.74 1
Torque Bound G 169.76 2
Dogma Family G 168.72 3
Cat Coneheads G 159.73 4

Optional Exercise 9.3

Create a new CTE named movies_ranking with the previous query expression (do not include the LIMIT); you should have 2 CTEs at this point. As a query expression, select the title, rating, amount from movies_ranking; filter to get only the rank_movies equals to 1. This will give you the final result for this exercise.

%%sql
WITH movies_amount_rating AS(
    SELECT
        dim_film.title,
        dim_film.rating,
        SUM(fact_rental.amount) AS amount
    FROM
        fact_rental
        INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
    GROUP BY
        title,
        rating
),
movies_ranking AS (
    SELECT
        title,
        rating,
        amount,
        RANK() over (
            PARTITION BY rating
            ORDER BY
                amount DESC
        ) AS rank_movies
    FROM
        movies_amount_rating
)
SELECT
    title,
    rating,
    amount
FROM
    movies_ranking
WHERE
    rank_movies = 1;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
5 rows affected.
title rating amount
Saturday Lambs G 190.74
Telegraph Voyage PG 215.75
Innocent Usual PG-13 191.74
Velvet Terminator R 152.77
Zorro Ark NC-17 199.72
Expected Output
title rating amount
Saturday Lambs G 190.74
Telegraph Voyage PG 215.75
Innocent Usual PG-13 191.74
Velvet Terminator R 152.77
Zorro Ark NC-17 199.72

Exercise 10

Write an SQL query to get the 10 actors with the highest number of films produced. Get the first name, last name, number of films and rank.

Optional Exercise 10.1

Use the dim_actor as the main table for this query. Perform the following joins:

Then, select the distinct dim_actor.actor_id, dim_actor.first_name, dim_actor.last_name and dim_film.film_id combinations. Limit your results to 20 rows; To compare with the expected output, you can order by dim_actor.actor_id and dim_film.film_id.

%%sql
SELECT DISTINCT 
    dim_actor.actor_id,
    dim_actor.first_name,
    dim_actor.last_name,
    dim_film.film_id
FROM
    dim_actor
    INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
    INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
    INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
ORDER BY
    dim_actor.actor_id,
    dim_film.film_id
LIMIT 20;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
20 rows affected.
actor_id first_name last_name film_id
1 Penelope Guiness 1
1 Penelope Guiness 23
1 Penelope Guiness 25
1 Penelope Guiness 106
1 Penelope Guiness 140
1 Penelope Guiness 166
1 Penelope Guiness 277
1 Penelope Guiness 361
1 Penelope Guiness 438
1 Penelope Guiness 499
1 Penelope Guiness 506
1 Penelope Guiness 509
1 Penelope Guiness 605
1 Penelope Guiness 635
1 Penelope Guiness 749
1 Penelope Guiness 832
1 Penelope Guiness 939
1 Penelope Guiness 970
1 Penelope Guiness 980
2 Nick Wahlberg 3
Expected Output

Note: Not all of the records are shown here.

actor_id first_name last_name film_id
1 Penelope Guiness 1
1 Penelope Guiness 23
1 Penelope Guiness 25
1 Penelope Guiness 106

Optional Exercise 10.2

Create a CTE with the previous query and name it actor_film (without the ORDER BY and LIMIT statements). Then, in the query expression select the actor_id, first_name, last_name and count the number of films from the actor_film CTE result with the function COUNT(*) giving the column name films. Remember to group by the actor_id. Limit your results to 10 rows. To compare with the expected output, you can order by actor_id.

%%sql
WITH actor_film AS (
    SELECT DISTINCT 
        dim_actor.actor_id,
        dim_actor.first_name,
        dim_actor.last_name,
        dim_film.film_id
    FROM
        dim_actor
        INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
        INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
        INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
)
SELECT
    actor_id,
    first_name,
    last_name,
    COUNT(*) AS films
FROM
    actor_film
GROUP BY
    actor_id
ORDER BY
    actor_id
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
actor_id first_name last_name films
1 Penelope Guiness 19
2 Nick Wahlberg 24
3 Ed Chase 21
4 Jennifer Davis 19
5 Johnny Lollobrigida 28
6 Bette Nicholson 20
7 Grace Mostel 29
8 Matthew Johansson 19
9 Joe Swank 23
10 Christian Gable 22
Expected Output

Note: Not all of the records are shown here.

actor_id first_name last_name films
1 Penelope Guiness 19
2 Nick Wahlberg 24
3 Ed Chase 21
4 Jennifer Davis 19

Optional Exercise 10.3

Alongside the actor_film CTE, add the previous query as actor_film_count (without the ORDER BY and LIMIT statements). In the query expression, select the actor’s first and last name (first_name and last_name), also the count of films films and use the ROW_NUMBER() window function over a window ordered by the film count in descending order and by the first and last name in ascending order; name this column as actor_rank. Limit this result to 10.

%%sql
WITH actor_film AS (
    SELECT DISTINCT 
        dim_actor.actor_id,
        dim_actor.first_name,
        dim_actor.last_name,
        dim_film.film_id
    FROM
        dim_actor
        INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
        INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
        INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
),
actor_film_count AS (
SELECT
    actor_id,
    first_name,
    last_name,
    COUNT(*) AS films
FROM
    actor_film
GROUP BY
    actor_id
)
SELECT
    first_name,
    last_name,
    films,
    ROW_NUMBER() over (
        ORDER BY
            films DESC, first_name, last_name ASC
    ) AS actor_rank
FROM
    actor_film_count
LIMIT 10;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
first_name last_name films actor_rank
Gina Degeneres 42 1
Mary Keitel 40 2
Walter Torn 39 3
Matthew Carrey 37 4
Sandra Kilmer 36 5
Angela Witherspoon 35 6
Henry Berry 35 7
Scarlett Damon 35 8
Val Bolger 35 9
Vivien Basinger 35 10
Expected Output

Note: Not all of the records are shown here.

first_name last_name films actor_rank
Gina Degeneres 42 1
Mary Keitel 40 2
Walter Torn 39 3
Matthew Carrey 37 4
Sandra Kilmer 36 5
Angela Witherspoon 35 6

Optional Exercise 10.4

Create a third CTE statement named actors_rank using the previous query expression (without the LIMIT statement). Then, select all columns from the actors_rank result and filter by the top 10 actors with the help of the actor_rank column. Order by actor_rank in ascending way. This will give the final result for this exercise.

%%sql
WITH actor_film AS (
    SELECT DISTINCT 
        dim_actor.actor_id,
        dim_actor.first_name,
        dim_actor.last_name,
        dim_film.film_id
    FROM
        dim_actor
        INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
        INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
        INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
),
actor_film_count AS (
SELECT
    actor_id,
    first_name,
    last_name,
    COUNT(*) AS films
FROM
    actor_film
GROUP BY
    actor_id
),
actors_rank AS (
    SELECT
        first_name,
        last_name,
        films,
        ROW_NUMBER() over (
            ORDER BY
                films DESC, first_name, last_name ASC
        ) AS actor_rank
    FROM
        actor_film_count
)
SELECT
    *
FROM
    actors_rank
WHERE
    actor_rank <= 10
ORDER BY
    actor_rank ASC;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
first_name last_name films actor_rank
Gina Degeneres 42 1
Mary Keitel 40 2
Walter Torn 39 3
Matthew Carrey 37 4
Sandra Kilmer 36 5
Angela Witherspoon 35 6
Henry Berry 35 7
Scarlett Damon 35 8
Val Bolger 35 9
Vivien Basinger 35 10
Expected Output
first_name last_name films actor_rank
Gina Degeneres 42 1
Mary Keitel 40 2
Walter Torn 39 3
Matthew Carrey 37 4
Sandra Kilmer 36 5
Angela Witherspoon 35 6
Henry Berry 35 7
Scarlett Damon 35 8
Val Bolger 35 9
Vivien Basinger 35 10

3.4 - LAG

The LAG function is a window function used to get the value of a row that comes before the current row at a given offset. Have a look at the example:

SELECT
    id,
    column1
    LAG(column1) OVER (ORDER BY id) AS prev_column1
FROM table_name;

Query result:

id column1 prev_column1
1 a  
2 b a
3 c b

The LAG function can also take an optional second parameter which specifies the offset from the current row. By default, this offset is 1, but you can set it to any integer value. For example, LAG(column1, 2) OVER (ORDER BY id) would return the value from two rows before the current row.

Exercise 11

Filtering rental orders with payments, write an SQL query to get the amount spent by month of the customer ID 388. Get the total amount of the current month and, the total amount of the previous month, and calculate the difference between those values.

Optional Exercise 11.1

Taking the fact_rental table, use the EXTRACT() function to obtain the MONTH from the payment_date column. Then, name the new column as month. Use the SUM() function over the column amount and name the resulting column as amount. Filter the corresponding customer_id and where payment_date is NOT NULL. Group by month. Order by month to compare with the expected output.

%%sql
SELECT
    EXTRACT(MONTH FROM payment_date) AS month,
    SUM(amount) AS amount
FROM
    fact_rental
WHERE
    customer_id = 388
    AND payment_date IS NOT NULL
GROUP BY
    month
ORDER BY
    month
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
4 rows affected.
month amount
2 26.95
3 50.89
4 35.89
5 0.99
Expected Output

Note: The order of the records can be different.

month amount
2 26.95
3 50.89
4 35.89
5 0.99

Optional Exercise 11.2

With the previous query, create a CTE named total_payment_amounts_sum (do not include the ORDER BY statement). In the query expression, extract the month and amount columns. Then, use the LAG function and set the first parameter as the amount column and the second parameter as 1; this function will work over a window ordered by month and this should be named previous_month_amount. Take the difference between the result of the LAG function with the same parameters and over the same window, and the current value of amount. Set the name for this value as difference. This will give you the final result for this exercise.

%%sql
WITH total_payment_amounts_sum AS (
    SELECT
        EXTRACT(MONTH FROM payment_date) AS month,
        SUM(amount) AS amount
    FROM
        fact_rental
    WHERE
        customer_id = 388
        AND payment_date IS NOT NULL
    GROUP BY
        month
)
SELECT
    month,
    amount,
    LAG(amount, 1) OVER (
        ORDER BY
            month
    ) AS previous_month_amount,
    LAG(amount, 1) OVER (
        ORDER BY
            month
    ) - amount AS difference
FROM
    total_payment_amounts_sum;
 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
4 rows affected.
month amount previous_month_amount difference
2 26.95 None None
3 50.89 26.95 -23.94
4 35.89 50.89 15.00
5 0.99 35.89 34.90
Expected Output
month amount previous_month_amount difference
2 26.95 None None
3 50.89 26.95 -23.94
4 35.89 50.89 15.00
5 0.99 35.89 34.90

3.5 - LEAD

The LEAD function is a window function used to get the value of a row that comes after the current row at a given offset.

SELECT
    id,
    column1
    LEAD(column1) OVER (ORDER BY id) AS next_column1
FROM table_name;

Query result:

id column1 next_column1
1 a b
2 b c
3 c d

The LEAD function can also take an optional second parameter which specifies the offset from the current row. By default, this offset is 1, but you can set it to any integer value. For example, LEAD(column1, 2) OVER (ORDER BY id) would return the value from two rows after the current row.

#### Exercise 12

Filtering rental orders with payments, write an SQL query to get the amount spent by month of the customer_id 388. Get the total amount of the current month and the total amount of the next month (name it as next_month_amount), then calculate the difference between those values.

To complete this exercise, you can follow the same process as in the previous exercise, but change the LAG function with the LEAD function.

%%sql
WITH total_payment_amounts_sum AS (
    SELECT
        EXTRACT(MONTH FROM payment_date) AS month,
        SUM(amount) AS amount
    FROM
        fact_rental
    WHERE
        customer_id = 388
        AND payment_date IS NOT NULL
    GROUP BY
        month
)
SELECT
    month,
    amount,
    LEAD(amount, 1) OVER (
        ORDER BY
            month
    ) AS previous_month_amount,
    LEAD(amount, 1) OVER (
        ORDER BY
            month
    ) - amount AS difference
FROM
    total_payment_amounts_sum;

 * mysql+pymysql://:***@1264ded65b07:3306/sakila_star
4 rows affected.
month amount previous_month_amount difference
2 26.95 50.89 23.94
3 50.89 35.89 -15.00
4 35.89 0.99 -34.90
5 0.99 None None
Expected Output
month amount next_month_amount difference
2 26.95 50.89 23.94
3 50.89 35.89 -15.00
4 35.89 0.99 -34.90
5 0.99 None None

During this lab, you’ve written several SQL queries to manipulate data from the DVD rental sample database. Firstly, you applied different conditions to the WHERE clause using logical operators. Then, you built more complex queries using different techniques such as Common Table Expressions, CASE statements, and so on. Finally, you applied window functions to summarize data using a specific window. Overall, you now have different tools to process data using SQL in a desired way.