As a Data Engineer, you will have to manipulate data to process it and help the organization find valuable insights. To do so, you should learn the basics of SQL to work with data in different ways.
This time, you will be working with a DVD rental sample dataset where you will find information about stores, customers, and rented films. Below you will find the diagram with the relationships between the tables and a brief description of each one.
The purpose of this assignment is to answer business questions using SQL language to query the database. The idea is that you get the same answer that is shown in each exercise.
After this assignment you’ll be able to:
DATE_PART
function to manipulate date values.LAG
and LEAD
functions to access values from other rows when making a calculation.CASE
statement to return different values based on conditions.<a name=’1’></a>
You will be working with a modified version of the Sakila Sample Database, which is licensed under the New BSD license.
For learning purposes, let’s assume that the data belongs to Rentio, which is a fictitious company dedicated to renting movies to clients from all around the world. The company has a Dimensional Model created to perform analytical queries based on the data of the staff and the stores, including inventory, films, and paying customers. The management team would like to make informed decisions based on the available data.
Your task will be to design and build analytical queries based on Rentio’s Dimensional Model stored in their Data Warehouse.
<a name=’1.1’></a>
Rentio’s Dimensional Model includes the following tables.
fact_rental
: Contains the amount of the payment for a particular rental registry as the metric. This rental registry associates a film with its corresponding category, a customer, a rental date and a payment date. It is also related to the store where the film was rented and the staff member who rented it.dim_film
: Contains information about films including the title, language, rating, and release year.dim_customer
: Contains customer data including name, address, and email.dim_store
: Contains store data such as the manager and store address.dim_staff
: Contains staff data such as first name, last name, stores where they work, and staff status (active or not).dim_category
: Stores the relationship between the film and the category it belongs to. Contains the category_id
and name
for each category: Action, Comedy, Documentary, Sci-Fi, among others.dim_actor
: Contains the actor’s data such as first and last name.bridge_actor
: A bridge table relating each actor_id
with each rental_id
.Here you can find the entity-relationship model (ERM) of the star schema showing all the tables and how they are related:
To interact with SQL Databases within a JupyterLab notebook, you will use the SQL “magic” offered by the ipython-sql
extension. JupyterLab defines “magic” as special commands prefixed with %
. Here, we’ll employ the load_ext
magic to load the ipython-sql
extension.
# Loading the SQL module
%load_ext sql
The provided magic command loads the ipython-sql
extension, enabling connection to databases supported by SQLAlchemy. In this example, you’ll connect to an existing MySQL database. However, to proceed, it’s essential to obtain your credentials and establish a connection to the MySQL database, which is done in the src/env
file.
import os
import socket
from dotenv import load_dotenv
load_dotenv('./src/env', override=True)
DBHOST = socket.gethostname()
DBPORT = os.getenv('DBPORT')
DBNAME = os.getenv('DBNAME')
DBUSER = os.getenv('DBUSER')
DBPASSWORD = os.getenv('DBPASSWORD')
connection_url = f'mysql+pymysql://{DBUSER}:{DBPASSWORD}@{DBHOST}:{DBPORT}/{DBNAME}'
%sql {connection_url}
Some of the queries in this notebook are quite complicated. You will find optional exercises that will help you to reach the final query step-by-step.
WHERE
clause is used to filter data based on a condition. If you want to use multiple conditions you will need to use logical operators.
The AND
operator returns the records for those that meet all the conditions applied in the operator.
SELECT
*
FROM table_name
WHERE
column1 = 'value1'
AND column2 < 'value2'
AND column3 > 'value3';
The OR
operator returns the records for those that meet at least one of the conditions applied in the operator.
SELECT
*
FROM table_name
WHERE
column1 = 'value1'
OR column2 < 'value2'
OR column3 > 'value3';
The NOT
operator is used in a condition to reverse its result.
SELECT
*
FROM table_name
WHERE NOT column1 = 'value1';
<a name=’ex01’></a>
Write an SQL query to get the total amount obtained from the renting of Travel, Family, and Children films during the months of June and July of 2005. Present the results grouped and ordered by store id and category name.
FROM
. You will need to join fact_rental
and dim_category
tables by the category_id
column.WHERE
. To get the three different category names, you can pass an array ('Travel', 'Family', 'Children')
. The rental_date
should be BETWEEN
'2005-06-01'
and '2005-08-01'
. Note that you will need to use the AND
clause between those two conditions.store_id
and the category name and order by the same two columns.SELECT
clause to show store_id
, category name and SUM()
of the amount
.Note: Remember that a good practice when exploring your tables for the first time is to set a LIMIT
, mostly if you want to get all columns from a table with the *
wildcard.
%%sql
SELECT
store_id,
dim_category.name AS category_name,
SUM(amount) AS total_amount
FROM
fact_rental
INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
WHERE
dim_category.name IN ('Travel', 'Family', 'Children')
AND fact_rental.rental_date BETWEEN '2005-06-01' AND '2005-08-01'
GROUP BY
store_id,
dim_category.name
ORDER BY
store_id,
dim_category.name;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
6 rows affected.
store_id | category_name | total_amount |
---|---|---|
1 | Children | 1004.38 |
1 | Family | 1182.09 |
1 | Travel | 901.91 |
2 | Children | 973.39 |
2 | Family | 1144.00 |
2 | Travel | 1050.52 |
store_id | category_name | total_amount |
---|---|---|
1 | Children | 1004.38 |
1 | Family | 1182.09 |
1 | Travel | 901.91 |
2 | Children | 973.39 |
2 | Family | 1144.00 |
2 | Travel | 1050.52 |
A Common Table Expression (CTE) is used to create a temporary result set that can be referenced in another query. It only exists during the execution of a query.
WITH temp_results AS (
SELECT
column1,
column2
FROM table1
WHERE column2 = 'value1'
)
SELECT
*
FROM table2
INNER JOIN temp_results
ON temp_results.column1 = table2.column1;
Write an SQL query using a CTE to get the average number of films per category. Then, calculate the average rounded down and rounded up.
You can start by creating a query to extract the category_id
and film_id
from the fact_rental
. Use DISTINCT
and LIMIT
. To compare your results with the expected output, you can order by the category_id
and film_id
.
%%sql
SELECT DISTINCT
category_id,
film_id
FROM
fact_rental
ORDER BY
category_id,
film_id
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
category_id | film_id |
---|---|
1 | 19 |
1 | 21 |
1 | 29 |
1 | 56 |
1 | 67 |
1 | 97 |
1 | 105 |
1 | 111 |
1 | 115 |
1 | 126 |
Note: Not all of the records are shown here.
category_id | film_id |
---|---|
1 | 19 |
1 | 21 |
1 | 29 |
… | … |
Let’s practice writing Common Table Expressions (CTEs). Use optional exercise 2.1 without the LIMIT
and ORDER BY
statements as the code base to create a temporary result table named film_category
. From the CTE film_category
, select the category_id
and aggregate the number of films in each category using the COUNT()
function. Name the output column as films
. Perform grouping by category_id
. To check your result against the expected output, add the ORDER BY
column category_id
and LIMIT
for the top 10 rows.
%%sql
WITH film_category AS (
SELECT DISTINCT
category_id,
film_id
FROM
fact_rental
)
SELECT
category_id,
COUNT(1) AS films
FROM
film_category
GROUP BY
category_id
ORDER BY
category_id
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
category_id | films |
---|---|
1 | 61 |
2 | 64 |
3 | 58 |
4 | 54 |
5 | 56 |
6 | 63 |
7 | 61 |
8 | 67 |
9 | 67 |
10 | 58 |
Note: Not all of the records are shown here.
category_id | films |
---|---|
1 | 61 |
2 | 64 |
3 | 58 |
4 | 54 |
5 | 56 |
6 | 63 |
… | … |
#### Optional Exercise 2.3
In this exercise, create a query with 2 CTEs. The first CTE uses the code for film_category
as in optional exercise 2.2, and the second CTE is based on the code for the SELECT
part of the query for optional exercise 2.2, naming the second CTE as film_category_count
. Here is the scheme showing how to use two CTEs in the query:
WITH temp_results AS (
SELECT
column1,
column2
FROM table1
WHERE column2 = 'value1'
),
temp_results_2 AS (
SELECT
*
FROM table2
INNER JOIN temp_results
ON temp_results.column1 = table2.column1;
)
SELECT
*
FROM temp_results_2
To calculate the average number of unique films from all categories, average_by_category
, use the CTE film_category_count
, averaging all the film counts with the AVG()
function applied to films
column.
%%sql
WITH film_category AS (
SELECT DISTINCT
category_id,
film_id
FROM
fact_rental
),
film_category_count AS (
SELECT
category_id,
COUNT(1) AS films
FROM
film_category
GROUP BY
category_id
ORDER BY
category_id
)
SELECT
AVG(films) AS average_by_category
FROM
film_category_count;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
average_by_category |
---|
59.8750 |
average_by_category |
---|
59.8750 |
Use the query from the optional exercise 2.3. Wrap the last part into a CTE with the name films_average_by_category
. From films_average_by_category
select average_by_category
, and also apply the FLOOR()
and CEIL()
functions to average_by_category
to create separate columns for each function’s result. This will give you the final output.
%%sql
WITH film_category AS (
SELECT DISTINCT
category_id,
film_id
FROM
fact_rental
),
film_category_count AS (
SELECT
category_id,
COUNT(1) AS films
FROM
film_category
GROUP BY
category_id
ORDER BY
category_id
),
films_average_by_category AS (
SELECT
AVG(films) AS average_by_category
FROM
film_category_count
)
SELECT
average_by_category AS average,
FLOOR(average_by_category) AS average_down,
CEIL(average_by_category) AS average_up
FROM
films_average_by_category;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
average | average_down | average_up |
---|---|---|
59.8750 | 59 | 60 |
average | average_down | average_up |
---|---|---|
59.875 | 59 | 60 |
A subquery is a query nested inside another query. The inner queries will be executed first and their results will be passed into the outer queries in order to be executed.
SELECT
*
FROM table1
WHERE column1 > (
SELECT AVG(column1)
FROM table1
);
Write an SQL query using a subquery to get the film categories that have the number of films above the average rounded up calculated in the previous exercise.
Start by using the following two CTEs from the previous exercise 2: film_category
and film_category_count
. Then from the film_category_count
CTE, select the average number of films, but enclose this result inside the CEIL
function. Name this result as average_by_category
.
%%sql
WITH film_category AS (
SELECT DISTINCT
category_id,
film_id
FROM
fact_rental
),
film_category_count AS (
SELECT
category_id,
COUNT(1) AS films
FROM
film_category
GROUP BY
category_id
ORDER BY
category_id
)
SELECT
CEIL(AVG(films)) AS average_by_category
FROM
film_category_count;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
average_by_category |
---|
60 |
average_by_category |
---|
60 |
Use the same two CTEs film_category
and film_category_count
. Then, select all columns from film_category_count
. Create a condition where you will compare the films
column with the result of the main query that you created in the previous optional exercise 3.1, but now you will use it as a subquery and you won’t need to rename the column as average_by_category
.
%%sql
WITH film_category AS (
SELECT DISTINCT
category_id,
film_id
FROM
fact_rental
),
film_category_count AS (
SELECT
category_id,
COUNT(1) AS films
FROM
film_category
GROUP BY
category_id
ORDER BY
category_id
)
SELECT
*
FROM
film_category_count
WHERE
films > (
SELECT
CEIL(AVG(films))
FROM
film_category_count
);
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
7 rows affected.
category_id | films |
---|---|
1 | 61 |
2 | 64 |
6 | 63 |
7 | 61 |
8 | 67 |
9 | 67 |
15 | 73 |
category_id | films |
---|---|
1 | 61 |
2 | 64 |
6 | 63 |
7 | 61 |
8 | 67 |
9 | 67 |
15 | 73 |
In the previous optional exercise, you’ve got nearly the final result, but now you need to load the category names instead of the IDs.
ORDER BY
statements in both of the CTEs.INNER JOIN
statement into the CTE film_category
joining the fact_rental
table with the table dim_category
based on the category_id
. In the same CTE change the selection of category_id
to dim_category.category_id
and add dim_category.name category,
to pull the category name as a category
column.film_category_count
select also category
in addition to category_id
.*
with the category
and films
. You can also add ORDER BY
statement using category
column.This should give you the final result.
%%sql
WITH film_category AS (
SELECT DISTINCT
fact_rental.category_id,
dim_category.name AS category,
film_id
FROM
fact_rental
INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
),
film_category_count AS (
SELECT
category_id,
category,
COUNT(1) AS films
FROM
film_category
GROUP BY
category_id
)
SELECT
category,
films
FROM
film_category_count
WHERE
films > (
SELECT
CEIL(AVG(films))
FROM
film_category_count
)
ORDER BY
category;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
7 rows affected.
category | films |
---|---|
Action | 61 |
Animation | 64 |
Documentary | 63 |
Drama | 61 |
Family | 67 |
Foreign | 67 |
Sports | 73 |
category | films |
---|---|
Action | 61 |
Animation | 64 |
Documentary | 63 |
Drama | 61 |
Family | 67 |
Foreign | 67 |
Sports | 73 |
CASE
StatementThe CASE statement works like an if statement in programming. It will evaluate some conditions and return a value when the first condition is met. If no conditions are met it will return the value used in the ELSE clause.
SELECT
CASE
WHEN column1 < 0 THEN 'value1'
WHEN column1 > 0 THEN 'value2'
...
ELSE 'value3'
END
column2
FROM table_name;
Write an SQL query to get the maximum purchase amount by customers on 2007-04-30
between 15:00
and 16:00
. Obtain the customer’s full name in capital letters, the maximum purchase amount, and the payment date. Then, create a column called value_rate
, and assign the low
label if the amount is between 0 and 3, the mid
label if it is between 3 and 6, and the high
label if it is above 6. Sort by the maximum purchase amount in descending order and full name in ascending order.
Start by creating a query to extract from the fact_rental
table these columns: the customer ID, the maximum amount
named as max_amount
(use function MAX()
) and from payment_date
column get only the date without hours. Use function DATE()
naming the resulting column as payment_date
. Constrain your results to the date times between ‘2007-04-30 15:00:00’ and ‘2007-04-30 16:00:00’. Remember to group by the customer identifier and by the payment_date
(only date, without hours, so the DATE()
function should be applied there as well).
Limit the output to 10 records just for this optional exercise.
%%sql
SELECT
customer_id,
MAX(amount) AS max_amount,
DATE(payment_date) AS payment_date
FROM
fact_rental
WHERE
payment_date BETWEEN '2007-04-30 15:00:00'
AND '2007-04-30 16:00:00'
GROUP BY
customer_id,
DATE(payment_date)
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
customer_id | max_amount | payment_date |
---|---|---|
555 | 7.99 | 2007-04-30 |
571 | 1.99 | 2007-04-30 |
174 | 4.99 | 2007-04-30 |
55 | 2.99 | 2007-04-30 |
186 | 2.99 | 2007-04-30 |
296 | 2.99 | 2007-04-30 |
82 | 8.99 | 2007-04-30 |
336 | 0.99 | 2007-04-30 |
538 | 0.99 | 2007-04-30 |
267 | 6.99 | 2007-04-30 |
Note: Not all of the records are shown here.
customer_id | max_amount | payment_date |
---|---|---|
555 | 7.99 | 2007-04-30 |
571 | 1.99 | 2007-04-30 |
174 | 4.99 | 2007-04-30 |
… | … | … |
Use the previous query as a CTE max_amount_customer
(without the LIMIT
statement). Then, create a query expression following these steps:
dim_customer
table with the CTE result on the customer_id
column.CONCAT
and UPPER
functions are used in the SELECT
statement to get the full name in capital letters.max_amount
and payment_date
.CASE WHEN
clause and compare the max_amount
column with the corresponding values.max_amount
in a descending way while order by full_name
in an ascending way.This should give you the expected final result.
%%sql
WITH max_amount_customer AS (
SELECT
customer_id,
MAX(amount) AS max_amount,
DATE(payment_date) AS payment_date
FROM
fact_rental
WHERE
payment_date BETWEEN '2007-04-30 15:00:00'
AND '2007-04-30 16:00:00'
GROUP BY
customer_id,
DATE(payment_date)
)
SELECT
CONCAT(
UPPER(dim_customer.first_name),
' ',
UPPER(dim_customer.last_name)
) AS full_name,
max_amount,
payment_date,
CASE
WHEN max_amount >= 0
AND max_amount < 3 THEN 'low'
WHEN max_amount >= 3
AND max_amount < 6 THEN 'mid'
WHEN max_amount >= 6 THEN 'high'
END AS value_rate
FROM
max_amount_customer
INNER JOIN dim_customer ON max_amount_customer.customer_id = dim_customer.customer_id
ORDER BY
max_amount DESC,
full_name ASC;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
54 rows affected.
full_name | max_amount | payment_date | value_rate |
---|---|---|---|
KATHRYN COLEMAN | 8.99 | 2007-04-30 | high |
LOUIS LEONE | 8.99 | 2007-04-30 | high |
ROY WHITING | 8.99 | 2007-04-30 | high |
CARRIE PORTER | 7.99 | 2007-04-30 | high |
DWIGHT LOMBARDI | 7.99 | 2007-04-30 | high |
MARION SNYDER | 7.99 | 2007-04-30 | high |
ALEXANDER FENNELL | 6.99 | 2007-04-30 | high |
BESSIE MORRISON | 6.99 | 2007-04-30 | high |
BRANDON HUEY | 6.99 | 2007-04-30 | high |
DAVID ROYAL | 6.99 | 2007-04-30 | high |
DERRICK BOURQUE | 6.99 | 2007-04-30 | high |
GEORGE LINTON | 6.99 | 2007-04-30 | high |
MARGIE WADE | 6.99 | 2007-04-30 | high |
ROSA REYNOLDS | 6.99 | 2007-04-30 | high |
BYRON BOX | 5.99 | 2007-04-30 | mid |
LYDIA BURKE | 5.99 | 2007-04-30 | mid |
STEPHANIE MITCHELL | 5.99 | 2007-04-30 | mid |
ANGEL BARCLAY | 4.99 | 2007-04-30 | mid |
ARLENE HARVEY | 4.99 | 2007-04-30 | mid |
DIANA ALEXANDER | 4.99 | 2007-04-30 | mid |
ERIK GUILLEN | 4.99 | 2007-04-30 | mid |
FRED WHEAT | 4.99 | 2007-04-30 | mid |
VANESSA SIMS | 4.99 | 2007-04-30 | mid |
WILLARD LUMPKIN | 4.99 | 2007-04-30 | mid |
YVONNE WATKINS | 4.99 | 2007-04-30 | mid |
MONICA HICKS | 3.99 | 2007-04-30 | mid |
SHEILA WELLS | 3.99 | 2007-04-30 | mid |
STACY CUNNINGHAM | 3.99 | 2007-04-30 | mid |
AGNES BISHOP | 2.99 | 2007-04-30 | low |
ANNE POWELL | 2.99 | 2007-04-30 | low |
DARYL LARUE | 2.99 | 2007-04-30 | low |
DORIS REED | 2.99 | 2007-04-30 | low |
EDWIN BURK | 2.99 | 2007-04-30 | low |
HELEN HARRIS | 2.99 | 2007-04-30 | low |
HERBERT KRUGER | 2.99 | 2007-04-30 | low |
HOLLY FOX | 2.99 | 2007-04-30 | low |
IRENE PRICE | 2.99 | 2007-04-30 | low |
JESUS MCCARTNEY | 2.99 | 2007-04-30 | low |
LESLIE SEWARD | 2.99 | 2007-04-30 | low |
MARION OCAMPO | 2.99 | 2007-04-30 | low |
MEGAN PALMER | 2.99 | 2007-04-30 | low |
MELISSA KING | 2.99 | 2007-04-30 | low |
MORRIS MCCARTER | 2.99 | 2007-04-30 | low |
RAMONA HALE | 2.99 | 2007-04-30 | low |
SHAWN HEATON | 2.99 | 2007-04-30 | low |
VINCENT RALSTON | 2.99 | 2007-04-30 | low |
AMANDA CARTER | 1.99 | 2007-04-30 | low |
JOHNNIE CHISHOLM | 1.99 | 2007-04-30 | low |
STEPHEN QUALLS | 1.99 | 2007-04-30 | low |
ANITA MORALES | 0.99 | 2007-04-30 | low |
JOSHUA MARK | 0.99 | 2007-04-30 | low |
KEVIN SCHULER | 0.99 | 2007-04-30 | low |
RACHEL BARNES | 0.99 | 2007-04-30 | low |
TED BREAUX | 0.99 | 2007-04-30 | low |
Note: Not all of the records are shown here.
full_name | max_amount | payment_date | value_rate |
---|---|---|---|
KATHRYN COLEMAN | 8.99 | 2007-04-30 | high |
LOUIS LEONE | 8.99 | 2007-04-30 | high |
ROY WHITING | 8.99 | 2007-04-30 | high |
CARRIE PORTER | 7.99 | 2007-04-30 | high |
DWIGHT LOMBARDI | 7.99 | 2007-04-30 | high |
MARION SNYDER | 7.99 | 2007-04-30 | high |
ALEXANDER FENNELL | 6.99 | 2007-04-30 | high |
BESSIE MORRISON | 6.99 | 2007-04-30 | high |
BRANDON HUEY | 6.99 | 2007-04-30 | high |
DAVID ROYAL | 6.99 | 2007-04-30 | high |
DERRICK BOURQUE | 6.99 | 2007-04-30 | high |
GEORGE LINTON | 6.99 | 2007-04-30 | high |
MARGIE WADE | 6.99 | 2007-04-30 | high |
ROSA REYNOLDS | 6.99 | 2007-04-30 | high |
BYRON BOX | 5.99 | 2007-04-30 | mid |
LYDIA BURKE | 5.99 | 2007-04-30 | mid |
STEPHANIE MITCHELL | 5.99 | 2007-04-30 | mid |
ANGEL BARCLAY | 4.99 | 2007-04-30 | mid |
ARLENE HARVEY | 4.99 | 2007-04-30 | mid |
DIANA ALEXANDER | 4.99 | 2007-04-30 | mid |
… | … | … | … |
A pivot table is a term used to describe a rotated table with one or more categories as columns used to aggregate the items by a specific value.
For example, let’s say you want to pivot the following table by category
while adding the value
.
id | category | value |
---|---|---|
1 | category1 | 1.0 |
1 | category1 | 1.5 |
1 | category2 | 1.0 |
1 | category2 | 1.5 |
2 | category1 | 2.0 |
2 | category1 | 2.5 |
2 | category2 | 2.0 |
2 | category2 | 2.5 |
3 | category1 | 3.0 |
3 | category1 | 3.5 |
3 | category2 | 3.0 |
3 | category2 | 3.5 |
… | … | … |
Then you should get the following result.
id | category1 | category2 |
---|---|---|
1 | 2.5 | 2.5 |
2 | 4.5 | 4.5 |
3 | 6.5 | 6.5 |
… | … | … |
Based on rental orders, write an SQL query to get the customer’s full name, film category, and payment amount by customer and film category. Sort the results by the customer’s full name, film category, and payment amount in ascending order.
fact_rental
table as the main table for the queries. You will have to join with two additional tables in the following way:
fact_rental
with dim_customer
by the customer_id
.fact_rental
with dim_category
by the category_id
.SELECT
clause, use the CONCAT
function to join the first and last name of each customer.dim_category.name
as category
and sum the amount
column, naming it as amount
.full_name
and category
columns; order by full_name
, category
and amount
.%%sql
SELECT
CONCAT(
dim_customer.first_name,
' ',
dim_customer.last_name
) AS full_name,
dim_category.name AS category,
SUM(amount) AS amount
FROM
fact_rental
INNER JOIN dim_customer ON fact_rental.customer_id = dim_customer.customer_id
INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
GROUP BY
full_name,
category
ORDER BY
full_name,
category,
amount
LIMIT 30;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
30 rows affected.
full_name | category | amount |
---|---|---|
Aaron Selby | Action | 4.99 |
Aaron Selby | Classics | 7.98 |
Aaron Selby | Comedy | 7.98 |
Aaron Selby | Drama | 5.99 |
Aaron Selby | Family | 8.98 |
Aaron Selby | Foreign | 8.97 |
Aaron Selby | Games | 6.99 |
Aaron Selby | Horror | 10.98 |
Aaron Selby | Music | None |
Aaron Selby | Travel | 25.94 |
Adam Gooch | Action | 2.99 |
Adam Gooch | Animation | 10.99 |
Adam Gooch | Children | 6.97 |
Adam Gooch | Classics | 3.99 |
Adam Gooch | Comedy | 8.99 |
Adam Gooch | Family | 1.98 |
Adam Gooch | Foreign | 13.98 |
Adam Gooch | Games | 10.98 |
Adam Gooch | Horror | 12.97 |
Adam Gooch | Music | 6.99 |
Adam Gooch | New | 4.99 |
Adam Gooch | Sports | 11.98 |
Adrian Clary | Children | 2.98 |
Adrian Clary | Classics | 2.99 |
Adrian Clary | Drama | 7.99 |
Adrian Clary | Family | 6.97 |
Adrian Clary | Foreign | 2.99 |
Adrian Clary | Games | 15.96 |
Adrian Clary | Horror | 3.99 |
Adrian Clary | Music | 17.97 |
Note: Not all of the records are shown here.
full_name | category | amount |
---|---|---|
Aaron Selby | Action | 4.99 |
Aaron Selby | Classics | 7.98 |
Aaron Selby | Comedy | 7.98 |
Aaron Selby | Drama | 5.99 |
Aaron Selby | Family | 8.98 |
Aaron Selby | Foreign | 8.97 |
Aaron Selby | Games | 6.99 |
Aaron Selby | Horror | 10.98 |
Aaron Selby | Music | None |
Aaron Selby | Travel | 25.94 |
Adam Gooch | Action | 2.99 |
Adam Gooch | Animation | 10.99 |
Adam Gooch | Children | 6.97 |
Adam Gooch | Classics | 3.99 |
Adam Gooch | Comedy | 8.99 |
Adam Gooch | Family | 1.98 |
Adam Gooch | Foreign | 13.98 |
Adam Gooch | Games | 10.98 |
Adam Gooch | Horror | 12.97 |
Adam Gooch | Music | 6.99 |
… | … | … |
Using previous results, write an SQL query to create a pivot table that shows the total amount spent for each customer in each category. Also, fill the null values with a 0.
LIMIT
statement); name it as customer_category_sum
.full_name
of each customer. Then, for each category, use a MAX()
function combined with a CASE WHEN
statement to select the amount
for that category or return 0 if the category doesn’t match. For example, if the category is Family
, the statement should look like MAX(CASE WHEN category = 'Family' THEN amount ELSE 0 END) AS "Family"
.full_name
.This approach ensures that each customer’s total spending in each category is accurately pivoted into separate columns.
%%sql
WITH customer_category_sum AS (
SELECT
CONCAT(
dim_customer.first_name,
' ',
dim_customer.last_name
) AS full_name,
dim_category.name AS category,
SUM(amount) AS amount
FROM
fact_rental
INNER JOIN dim_customer ON fact_rental.customer_id = dim_customer.customer_id
INNER JOIN dim_category ON fact_rental.category_id = dim_category.category_id
GROUP BY
full_name,
category
ORDER BY
full_name,
category,
amount
)
SELECT
full_name,
MAX(
CASE
WHEN category = 'Family' THEN amount
ELSE 0
END
) AS "Family",
MAX(
CASE
WHEN category = 'Games' THEN amount
ELSE 0
END
) AS "Games",
MAX(
CASE
WHEN category = 'Animation' THEN amount
ELSE 0
END
) AS "Animation",
MAX(
CASE
WHEN category = 'Classics' THEN amount
ELSE 0
END
) AS "Classics",
MAX(
CASE
WHEN category = 'Documentary' THEN amount
ELSE 0
END
) AS "Documentary",
MAX(
CASE
WHEN category = 'Sports' THEN amount
ELSE 0
END
) AS "Sports",
MAX(
CASE
WHEN category = 'New' THEN amount
ELSE 0
END
) AS "New",
MAX(
CASE
WHEN category = 'Children' THEN amount
ELSE 0
END
) AS "Children",
MAX(
CASE
WHEN category = 'Music' THEN amount
ELSE 0
END
) AS "Music",
MAX(
CASE
WHEN category = 'Travel' THEN amount
ELSE 0
END
) AS "Travel"
FROM
customer_category_sum
GROUP BY
full_name
ORDER BY
full_name
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
full_name | Family | Games | Animation | Classics | Documentary | Sports | New | Children | Music | Travel |
---|---|---|---|---|---|---|---|---|---|---|
Aaron Selby | 8.98 | 6.99 | 0.00 | 7.98 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 25.94 |
Adam Gooch | 1.98 | 10.98 | 10.99 | 3.99 | 0.00 | 11.98 | 4.99 | 6.97 | 6.99 | 0.00 |
Adrian Clary | 6.97 | 15.96 | 0.00 | 2.99 | 0.00 | 6.98 | 0.00 | 2.98 | 17.97 | 0.00 |
Agnes Bishop | 3.98 | 7.98 | 8.98 | 9.98 | 4.99 | 0.00 | 0.00 | 2.99 | 0.00 | 11.98 |
Alan Kahn | 8.99 | 4.99 | 12.98 | 14.97 | 4.99 | 9.98 | 9.98 | 0.99 | 2.99 | 0.00 |
Albert Crouse | 9.98 | 1.98 | 3.98 | 0.00 | 14.98 | 8.98 | 2.99 | 0.00 | 13.97 | 0.00 |
Alberto Henning | 6.97 | 0.99 | 0.00 | 0.00 | 7.97 | 0.00 | 2.99 | 0.00 | 10.97 | 4.99 |
Alex Gresham | 2.99 | 14.98 | 0.99 | 4.99 | 13.98 | 32.93 | 6.99 | 2.99 | 9.96 | 9.98 |
Alexander Fennell | 23.95 | 5.98 | 17.97 | 2.99 | 0.00 | 2.99 | 2.99 | 11.97 | 8.98 | 4.99 |
Alfred Casillas | 4.98 | 3.98 | 0.00 | 4.99 | 11.98 | 11.98 | 0.00 | 7.98 | 8.97 | 9.98 |
Note: Not all of the records are shown here.
full_name | Family | Games | Animation | Classics | Documentary | Sports | New | Children | Music | Travel |
---|---|---|---|---|---|---|---|---|---|---|
Aaron Selby | 8.98 | 6.99 | 0 | 7.98 | 0 | 0 | 0 | 0 | 0 | 25.94 |
Adam Gooch | 1.98 | 10.98 | 10.99 | 3.99 | 0 | 11.98 | 4.99 | 6.97 | 6.99 | 0 |
Adrian Clary | 6.97 | 15.96 | 0 | 2.99 | 0 | 6.98 | 0 | 2.98 | 17.97 | 0 |
Agnes Bishop | 3.98 | 7.98 | 8.98 | 9.98 | 4.99 | 0 | 0 | 2.99 | 0 | 11.98 |
Alan Kahn | 8.99 | 4.99 | 12.98 | 14.97 | 4.99 | 9.98 | 9.98 | 0.99 | 2.99 | 0 |
Albert Crouse | 9.98 | 1.98 | 3.98 | 0 | 14.98 | 8.98 | 2.99 | 0 | 13.97 | 0 |
Alberto Henning | 6.97 | 0.99 | 0 | 0 | 7.97 | 0 | 2.99 | 0 | 10.97 | 4.99 |
… | … | … | … | … | … | … | … | … | … | … |
Depending on the Database Management System (DBMS) that you use, there can be some differences with the function names to handle dates. As the DBMS that you are using here is MySQL, it provides the EXTRACT
function.
The EXTRACT
function is used to get the year, month, week, and other date information from a date or time value. Other DBMS such as PostgreSQL have the DATE_PART
function for the same purpose.
%%sql
SELECT
EXTRACT(SECOND FROM TIMESTAMP '2022-03-04 01:02:03') AS second_value,
EXTRACT(MINUTE FROM TIMESTAMP '2022-03-04 01:02:03') AS minute_value,
EXTRACT(DAY FROM TIMESTAMP '2022-03-04 01:02:03') AS day_value,
EXTRACT(YEAR FROM TIMESTAMP '2022-03-04 01:02:03') AS year_value;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
1 rows affected.
second_value | minute_value | day_value | year_value |
---|---|---|---|
3 | 2 | 4 | 2022 |
Write an SQL query to get the customers who made a payment on 2007-04-30
between 15:00
and 16:00
. Get the customer id and create a column that shows On time
if the rental return was within the borrow time limit, and Late
if it was overdue.
fact_rental
table with:
dim_customer
table on the customer_id
.dim_film
table on the film_id
.At the SELECT
clause add dim_customer.customer_id
. Then use a CASE WHEN
statement to differentiate between the On time
and Late
rentals. In that statement, you will use the TIMEDIFF
function to compare the return and rental dates. Use the EXTRACT
function to get the hour from the previous result:
EXTRACT(HOUR FROM TIMEDIFF(fact_rental.return_date, fact_rental.rental_date))
Compare this with the rental_duration
from the dim_film
table. Take into account that extracting the hour from a TIMEDIFF
will give you a result in hours, while the rental duration is in days so you will have to multiply the rental duration by 24 to keep the same units.
fact_rental.payment_date
was BETWEEN 2007-04-30 15:00:00
and 2007-04-30 16:00:00
.dim_customer.customer_id
.%%sql
SELECT
fact_rental.customer_id,
CASE
WHEN (EXTRACT(HOUR FROM TIMEDIFF(fact_rental.return_date, fact_rental.rental_date))) > dim_film.rental_duration * 24 THEN 'Late'
ELSE 'On time'
END AS delivery
FROM
fact_rental
INNER JOIN dim_customer ON fact_rental.customer_id = dim_customer.customer_id
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
WHERE
fact_rental.payment_date BETWEEN '2007-04-30 15:00:00'
AND '2007-04-30 16:00:00'
ORDER BY
dim_customer.customer_id
;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
57 rows affected.
customer_id | delivery |
---|---|
15 | On time |
30 | Late |
40 | Late |
41 | Late |
55 | On time |
76 | On time |
79 | On time |
82 | Late |
85 | On time |
96 | On time |
112 | Late |
124 | Late |
129 | Late |
131 | Late |
136 | On time |
151 | On time |
174 | On time |
178 | Late |
180 | Late |
186 | On time |
195 | On time |
217 | Late |
221 | Late |
225 | Late |
225 | On time |
243 | Late |
267 | Late |
296 | On time |
304 | Late |
314 | Late |
321 | On time |
332 | Late |
336 | On time |
363 | Late |
363 | Late |
366 | Late |
369 | On time |
373 | Late |
390 | On time |
414 | On time |
427 | Late |
428 | On time |
432 | On time |
439 | Late |
476 | Late |
503 | On time |
506 | On time |
529 | On time |
538 | On time |
555 | Late |
555 | Late |
571 | Late |
573 | Late |
576 | Late |
578 | On time |
579 | On time |
588 | On time |
Note: Not all of the records are shown here.
customer_id | delivery |
---|---|
15 | On time |
30 | Late |
40 | Late |
41 | Late |
55 | On time |
76 | On time |
79 | On time |
82 | Late |
85 | On time |
96 | On time |
… | … |
SUBSTRING
The SUBSTRING
function is used to get part of a string value by defining a start position and a length. Here is an example.
SELECT
column1,
SUBSTRING(column1, 1, 3) AS part_column1
FROM
table_name;
Example of the query result:
column1 | part_column1 |
---|---|
1value | 1val |
2value | 2val |
3value | 3val |
… | … |
Write an SQL query to get the initials of the staff using the first_name
and last_name
columns. Use CONCAT()
function to join the substrings.
Example: John Lennon -> JL
%%sql
SELECT
CONCAT(
SUBSTRING(first_name, 1, 1),
SUBSTRING(last_name, 1, 1)
) AS initials
FROM
dim_staff;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
2 rows affected.
initials |
---|
MH |
JS |
initials |
---|
MH |
JS |
Window functions aggregate the query results without reducing the number of rows returned by the query.
Considering data from the sales department, let’s say you want to get the maximum amount of units sold for every month without grouping the rows. For example, you have the following data.
id | month | units |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 2 |
5 | 2 | 4 |
6 | 2 | 6 |
7 | 3 | 4 |
8 | 3 | 8 |
9 | 3 | 12 |
10 | 4 | 8 |
11 | 4 | 16 |
12 | 4 | 24 |
… | … | … |
If you apply this query to get the desired result:
SELECT
id,
month,
units,
MAX(units) OVER (
PARTITION BY month
) AS units_maximum
FROM sales;
Then you should get the following output.
id | month | units | units_maximum |
---|---|---|---|
1 | 1 | 1 | 3 |
2 | 1 | 2 | 3 |
3 | 1 | 3 | 3 |
4 | 2 | 2 | 6 |
5 | 2 | 4 | 6 |
6 | 2 | 6 | 6 |
7 | 3 | 4 | 12 |
8 | 3 | 8 | 12 |
9 | 3 | 12 | 12 |
10 | 4 | 8 | 24 |
11 | 4 | 16 | 24 |
12 | 4 | 24 | 24 |
… | … | … | … |
Write an SQL query to get the most paid movie by rating. Get the title of the movie, the rating, and the amount collected.
Join the fact_rental
and dim_film
tables through the film_id
, and select the film title
and rating
from dim_film
. Then take the SUM()
of the fact_rental.amount
column as amount
. Group by title
and rating
. Limit your result to 10 rows.
%%sql
SELECT
dim_film.title,
dim_film.rating,
SUM(fact_rental.amount) AS amount
FROM
fact_rental
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
GROUP BY
title,
rating
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
title | rating | amount |
---|---|---|
Academy Dinosaur | PG | 33.79 |
Ace Goldfinger | G | 52.93 |
Adaptation Holes | NC-17 | 34.89 |
Affair Prejudice | G | 83.79 |
African Egg | G | 47.89 |
Agent Truman | PG | 111.81 |
Airplane Sierra | PG-13 | 82.85 |
Airport Pollock | R | 86.85 |
Alabama Devil | PG-13 | 71.88 |
Aladdin Calendar | NC-17 | 131.77 |
Note: Not all of the records are shown here.
title | rating | amount |
---|---|---|
Academy Dinosaur | PG | 33.79 |
Ace Goldfinger | G | 52.93 |
Adaptation Holes | NC-17 | 34.89 |
Affair Prejudice | G | 83.79 |
… | … | … |
Create a CTE named movies_amount_rating
with the previous query (without the LIMIT
statement). As query expression, select the title
, rating
, amount
and use the RANK()
window function to assign a rank to each row based on a partition over the rating
and ordered by the amount
in descending order. Name this last column as rank_movies
. Limit your results to 10.
%%sql
WITH movies_amount_rating AS(
SELECT
dim_film.title,
dim_film.rating,
SUM(fact_rental.amount) AS amount
FROM
fact_rental
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
GROUP BY
title,
rating
)
SELECT
title,
rating,
amount,
RANK() over (
PARTITION BY rating
ORDER BY
amount DESC
) AS rank_movies
FROM
movies_amount_rating
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
title | rating | amount | rank_movies |
---|---|---|---|
Saturday Lambs | G | 190.74 | 1 |
Torque Bound | G | 169.76 | 2 |
Dogma Family | G | 168.72 | 3 |
Cat Coneheads | G | 159.73 | 4 |
Secrets Paradise | G | 139.82 | 5 |
Garden Island | G | 134.82 | 6 |
Easy Gladiator | G | 132.80 | 7 |
Heartbreakers Bright | G | 132.80 | 7 |
Midsummer Groundhog | G | 129.82 | 9 |
Beauty Grease | G | 125.79 | 10 |
Note: Not all of the records are shown here.
title | rating | amount | rank_movies |
---|---|---|---|
Saturday Lambs | G | 190.74 | 1 |
Torque Bound | G | 169.76 | 2 |
Dogma Family | G | 168.72 | 3 |
Cat Coneheads | G | 159.73 | 4 |
… | … | … | … |
Create a new CTE named movies_ranking
with the previous query expression (do not include the LIMIT
); you should have 2 CTEs at this point. As a query expression, select the title
, rating
, amount
from movies_ranking
; filter to get only the rank_movies
equals to 1. This will give you the final result for this exercise.
%%sql
WITH movies_amount_rating AS(
SELECT
dim_film.title,
dim_film.rating,
SUM(fact_rental.amount) AS amount
FROM
fact_rental
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
GROUP BY
title,
rating
),
movies_ranking AS (
SELECT
title,
rating,
amount,
RANK() over (
PARTITION BY rating
ORDER BY
amount DESC
) AS rank_movies
FROM
movies_amount_rating
)
SELECT
title,
rating,
amount
FROM
movies_ranking
WHERE
rank_movies = 1;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
5 rows affected.
title | rating | amount |
---|---|---|
Saturday Lambs | G | 190.74 |
Telegraph Voyage | PG | 215.75 |
Innocent Usual | PG-13 | 191.74 |
Velvet Terminator | R | 152.77 |
Zorro Ark | NC-17 | 199.72 |
title | rating | amount |
---|---|---|
Saturday Lambs | G | 190.74 |
Telegraph Voyage | PG | 215.75 |
Innocent Usual | PG-13 | 191.74 |
Velvet Terminator | R | 152.77 |
Zorro Ark | NC-17 | 199.72 |
Write an SQL query to get the 10 actors with the highest number of films produced. Get the first name, last name, number of films and rank.
Use the dim_actor
as the main table for this query. Perform the following joins:
dim_actor
and bridge_actor
table on the actor_id
.fact_rental
and bridge_actor
on the rental_id
.dim_film
and fact_rental
on the film_id
.Then, select the distinct dim_actor.actor_id
, dim_actor.first_name
, dim_actor.last_name
and dim_film.film_id
combinations. Limit your results to 20 rows; To compare with the expected output, you can order by dim_actor.actor_id
and dim_film.film_id
.
%%sql
SELECT DISTINCT
dim_actor.actor_id,
dim_actor.first_name,
dim_actor.last_name,
dim_film.film_id
FROM
dim_actor
INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
ORDER BY
dim_actor.actor_id,
dim_film.film_id
LIMIT 20;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
20 rows affected.
actor_id | first_name | last_name | film_id |
---|---|---|---|
1 | Penelope | Guiness | 1 |
1 | Penelope | Guiness | 23 |
1 | Penelope | Guiness | 25 |
1 | Penelope | Guiness | 106 |
1 | Penelope | Guiness | 140 |
1 | Penelope | Guiness | 166 |
1 | Penelope | Guiness | 277 |
1 | Penelope | Guiness | 361 |
1 | Penelope | Guiness | 438 |
1 | Penelope | Guiness | 499 |
1 | Penelope | Guiness | 506 |
1 | Penelope | Guiness | 509 |
1 | Penelope | Guiness | 605 |
1 | Penelope | Guiness | 635 |
1 | Penelope | Guiness | 749 |
1 | Penelope | Guiness | 832 |
1 | Penelope | Guiness | 939 |
1 | Penelope | Guiness | 970 |
1 | Penelope | Guiness | 980 |
2 | Nick | Wahlberg | 3 |
Note: Not all of the records are shown here.
actor_id | first_name | last_name | film_id |
---|---|---|---|
1 | Penelope | Guiness | 1 |
1 | Penelope | Guiness | 23 |
1 | Penelope | Guiness | 25 |
1 | Penelope | Guiness | 106 |
… | … | … | … |
Create a CTE with the previous query and name it actor_film
(without the ORDER BY
and LIMIT
statements). Then, in the query expression select the actor_id
, first_name
, last_name
and count the number of films from the actor_film
CTE result with the function COUNT(*)
giving the column name films
. Remember to group by the actor_id
. Limit your results to 10 rows. To compare with the expected output, you can order by actor_id
.
%%sql
WITH actor_film AS (
SELECT DISTINCT
dim_actor.actor_id,
dim_actor.first_name,
dim_actor.last_name,
dim_film.film_id
FROM
dim_actor
INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
)
SELECT
actor_id,
first_name,
last_name,
COUNT(*) AS films
FROM
actor_film
GROUP BY
actor_id
ORDER BY
actor_id
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
actor_id | first_name | last_name | films |
---|---|---|---|
1 | Penelope | Guiness | 19 |
2 | Nick | Wahlberg | 24 |
3 | Ed | Chase | 21 |
4 | Jennifer | Davis | 19 |
5 | Johnny | Lollobrigida | 28 |
6 | Bette | Nicholson | 20 |
7 | Grace | Mostel | 29 |
8 | Matthew | Johansson | 19 |
9 | Joe | Swank | 23 |
10 | Christian | Gable | 22 |
Note: Not all of the records are shown here.
actor_id | first_name | last_name | films |
---|---|---|---|
1 | Penelope | Guiness | 19 |
2 | Nick | Wahlberg | 24 |
3 | Ed | Chase | 21 |
4 | Jennifer | Davis | 19 |
… | … | … | … |
Alongside the actor_film
CTE, add the previous query as actor_film_count
(without the ORDER BY
and LIMIT
statements). In the query expression, select the actor’s first and last name (first_name
and last_name
), also the count of films films
and use the ROW_NUMBER()
window function over a window ordered by the film count in descending order and by the first and last name in ascending order; name this column as actor_rank
. Limit this result to 10.
%%sql
WITH actor_film AS (
SELECT DISTINCT
dim_actor.actor_id,
dim_actor.first_name,
dim_actor.last_name,
dim_film.film_id
FROM
dim_actor
INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
),
actor_film_count AS (
SELECT
actor_id,
first_name,
last_name,
COUNT(*) AS films
FROM
actor_film
GROUP BY
actor_id
)
SELECT
first_name,
last_name,
films,
ROW_NUMBER() over (
ORDER BY
films DESC, first_name, last_name ASC
) AS actor_rank
FROM
actor_film_count
LIMIT 10;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
first_name | last_name | films | actor_rank |
---|---|---|---|
Gina | Degeneres | 42 | 1 |
Mary | Keitel | 40 | 2 |
Walter | Torn | 39 | 3 |
Matthew | Carrey | 37 | 4 |
Sandra | Kilmer | 36 | 5 |
Angela | Witherspoon | 35 | 6 |
Henry | Berry | 35 | 7 |
Scarlett | Damon | 35 | 8 |
Val | Bolger | 35 | 9 |
Vivien | Basinger | 35 | 10 |
Note: Not all of the records are shown here.
first_name | last_name | films | actor_rank |
---|---|---|---|
Gina | Degeneres | 42 | 1 |
Mary | Keitel | 40 | 2 |
Walter | Torn | 39 | 3 |
Matthew | Carrey | 37 | 4 |
Sandra | Kilmer | 36 | 5 |
Angela | Witherspoon | 35 | 6 |
… | … | … | … |
Create a third CTE statement named actors_rank
using the previous query expression (without the LIMIT
statement). Then, select all columns from the actors_rank
result and filter by the top 10 actors with the help of the actor_rank
column. Order by actor_rank
in ascending way. This will give the final result for this exercise.
%%sql
WITH actor_film AS (
SELECT DISTINCT
dim_actor.actor_id,
dim_actor.first_name,
dim_actor.last_name,
dim_film.film_id
FROM
dim_actor
INNER JOIN bridge_actor ON dim_actor.actor_id = bridge_actor.actor_id
INNER JOIN fact_rental ON bridge_actor.rental_id = fact_rental.rental_id
INNER JOIN dim_film ON fact_rental.film_id = dim_film.film_id
),
actor_film_count AS (
SELECT
actor_id,
first_name,
last_name,
COUNT(*) AS films
FROM
actor_film
GROUP BY
actor_id
),
actors_rank AS (
SELECT
first_name,
last_name,
films,
ROW_NUMBER() over (
ORDER BY
films DESC, first_name, last_name ASC
) AS actor_rank
FROM
actor_film_count
)
SELECT
*
FROM
actors_rank
WHERE
actor_rank <= 10
ORDER BY
actor_rank ASC;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
10 rows affected.
first_name | last_name | films | actor_rank |
---|---|---|---|
Gina | Degeneres | 42 | 1 |
Mary | Keitel | 40 | 2 |
Walter | Torn | 39 | 3 |
Matthew | Carrey | 37 | 4 |
Sandra | Kilmer | 36 | 5 |
Angela | Witherspoon | 35 | 6 |
Henry | Berry | 35 | 7 |
Scarlett | Damon | 35 | 8 |
Val | Bolger | 35 | 9 |
Vivien | Basinger | 35 | 10 |
first_name | last_name | films | actor_rank |
---|---|---|---|
Gina | Degeneres | 42 | 1 |
Mary | Keitel | 40 | 2 |
Walter | Torn | 39 | 3 |
Matthew | Carrey | 37 | 4 |
Sandra | Kilmer | 36 | 5 |
Angela | Witherspoon | 35 | 6 |
Henry | Berry | 35 | 7 |
Scarlett | Damon | 35 | 8 |
Val | Bolger | 35 | 9 |
Vivien | Basinger | 35 | 10 |
LAG
The LAG
function is a window function used to get the value of a row that comes before the current row at a given offset. Have a look at the example:
SELECT
id,
column1
LAG(column1) OVER (ORDER BY id) AS prev_column1
FROM table_name;
Query result:
id | column1 | prev_column1 |
---|---|---|
1 | a | |
2 | b | a |
3 | c | b |
… | … | … |
The LAG
function can also take an optional second parameter which specifies the offset from the current row. By default, this offset is 1, but you can set it to any integer value. For example, LAG(column1, 2) OVER (ORDER BY id)
would return the value from two rows before the current row.
Filtering rental orders with payments, write an SQL query to get the amount spent by month of the customer ID 388
. Get the total amount of the current month and, the total amount of the previous month, and calculate the difference between those values.
Taking the fact_rental
table, use the EXTRACT()
function to obtain the MONTH
from the payment_date
column. Then, name the new column as month
. Use the SUM()
function over the column amount
and name the resulting column as amount
. Filter the corresponding customer_id
and where payment_date
is NOT NULL
. Group by month
. Order by month
to compare with the expected output.
%%sql
SELECT
EXTRACT(MONTH FROM payment_date) AS month,
SUM(amount) AS amount
FROM
fact_rental
WHERE
customer_id = 388
AND payment_date IS NOT NULL
GROUP BY
month
ORDER BY
month
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
4 rows affected.
month | amount |
---|---|
2 | 26.95 |
3 | 50.89 |
4 | 35.89 |
5 | 0.99 |
Note: The order of the records can be different.
month | amount |
---|---|
2 | 26.95 |
3 | 50.89 |
4 | 35.89 |
5 | 0.99 |
With the previous query, create a CTE named total_payment_amounts_sum
(do not include the ORDER BY
statement). In the query expression, extract the month
and amount
columns. Then, use the LAG
function and set the first parameter as the amount
column and the second parameter as 1; this function will work over a window ordered by month and this should be named previous_month_amount
.
Take the difference between the result of the LAG
function with the same parameters and over the same window, and the current value of amount
. Set the name for this value as difference
. This will give you the final result for this exercise.
%%sql
WITH total_payment_amounts_sum AS (
SELECT
EXTRACT(MONTH FROM payment_date) AS month,
SUM(amount) AS amount
FROM
fact_rental
WHERE
customer_id = 388
AND payment_date IS NOT NULL
GROUP BY
month
)
SELECT
month,
amount,
LAG(amount, 1) OVER (
ORDER BY
month
) AS previous_month_amount,
LAG(amount, 1) OVER (
ORDER BY
month
) - amount AS difference
FROM
total_payment_amounts_sum;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
4 rows affected.
month | amount | previous_month_amount | difference |
---|---|---|---|
2 | 26.95 | None | None |
3 | 50.89 | 26.95 | -23.94 |
4 | 35.89 | 50.89 | 15.00 |
5 | 0.99 | 35.89 | 34.90 |
month | amount | previous_month_amount | difference |
---|---|---|---|
2 | 26.95 | None | None |
3 | 50.89 | 26.95 | -23.94 |
4 | 35.89 | 50.89 | 15.00 |
5 | 0.99 | 35.89 | 34.90 |
LEAD
The LEAD
function is a window function used to get the value of a row that comes after the current row at a given offset.
SELECT
id,
column1
LEAD(column1) OVER (ORDER BY id) AS next_column1
FROM table_name;
Query result:
id | column1 | next_column1 |
---|---|---|
1 | a | b |
2 | b | c |
3 | c | d |
… | … | … |
The LEAD
function can also take an optional second parameter which specifies the offset from the current row. By default, this offset is 1, but you can set it to any integer value. For example, LEAD(column1, 2) OVER (ORDER BY id)
would return the value from two rows after the current row.
#### Exercise 12
Filtering rental orders with payments, write an SQL query to get the amount spent by month of the customer_id
388
. Get the total amount of the current month and the total amount of the next month (name it as next_month_amount
), then calculate the difference between those values.
To complete this exercise, you can follow the same process as in the previous exercise, but change the LAG
function with the LEAD
function.
%%sql
WITH total_payment_amounts_sum AS (
SELECT
EXTRACT(MONTH FROM payment_date) AS month,
SUM(amount) AS amount
FROM
fact_rental
WHERE
customer_id = 388
AND payment_date IS NOT NULL
GROUP BY
month
)
SELECT
month,
amount,
LEAD(amount, 1) OVER (
ORDER BY
month
) AS previous_month_amount,
LEAD(amount, 1) OVER (
ORDER BY
month
) - amount AS difference
FROM
total_payment_amounts_sum;
* mysql+pymysql://:***@1264ded65b07:3306/sakila_star
4 rows affected.
month | amount | previous_month_amount | difference |
---|---|---|---|
2 | 26.95 | 50.89 | 23.94 |
3 | 50.89 | 35.89 | -15.00 |
4 | 35.89 | 0.99 | -34.90 |
5 | 0.99 | None | None |
month | amount | next_month_amount | difference |
---|---|---|---|
2 | 26.95 | 50.89 | 23.94 |
3 | 50.89 | 35.89 | -15.00 |
4 | 35.89 | 0.99 | -34.90 |
5 | 0.99 | None | None |
During this lab, you’ve written several SQL queries to manipulate data from the DVD rental sample database. Firstly, you applied different conditions to the WHERE clause using logical operators. Then, you built more complex queries using different techniques such as Common Table Expressions, CASE
statements, and so on. Finally, you applied window functions to summarize data using a specific window. Overall, you now have different tools to process data using SQL in a desired way.