In this lab, you will learn how to transform a database table from the “One Big Table” (OBT) form into the First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF). This process is fundamental to database normalization, which helps reduce data redundancy and improve data integrity.
As a Data Engineer, you may not frequently need to perform data normalization from scratch, but understanding the steps involved in this process is crucial. Typically, you will encounter source databases that are already normalized, and your task will often involve denormalizing this data to make it useful for extracting insights or solving business questions. This lab focuses on the opposite process: taking a dataset that has been loaded as a One Big Table and normalizing it up to the third normal form, which is common in transactional systems.
Normalization is a database design technique that organizes tables to minimize redundancy and dependency. It involves dividing large tables into smaller, less redundant tables and defining relationships between them. The goal is to isolate data so that additions, deletions, and modifications can be made in a single table and then propagated through the rest of the database using defined relationships.
Some of the benefits of having a normalized set of tables are the following:
The dataset that you are going to use is related to the typical classicmodels
dataset that you have used but has been transformed to generate One Big Table. The original data comes in a multiline JSON file where each JSON object has the following structure:
{
"orderNumber": 10100,
"orderDate": "2003-01-06",
"requiredDate": "2003-01-13",
"shippedDate": "2003-01-10",
"status": "Shipped",
"comments": null,
"orderDetails": [
{
"productCode": "S18_1749",
"quantityOrdered": 30,
"priceEach": 136.00
},
{
"productCode": "S18_2248",
"quantityOrdered": 50,
"priceEach": 55.09
},
{
"productCode": "S18_4409",
"quantityOrdered": 22,
"priceEach": 75.46
},
{
"productCode": "S24_3969",
"quantityOrdered": 49,
"priceEach": 35.29
}
],
"customer": {
"customerName": "Online Diecast Creations Co.",
"contactLastName": "Young",
"contactFirstName": "Dorothy",
"phone": "6035558647",
"addressLine1": "2304 Long Airport Avenue",
"addressLine2": null,
"city": "Nashua",
"state": "NH",
"postalCode": "62005",
"country": "USA",
"salesRepEmployeeNumber": 1216.00,
"creditLimit": 114200.00
}
}
This dataset has already been uploaded into the database that you are going to use by following this schema:
As you can see, a schema named classicmodels_obt
with a table named orders
was added. Notice, that the orderDetails
and customer
fields have been saved as JSON objects directly in the database, they are structured as a dictionary with key-value pairs holding information about each order and customer.
To explore the data, let’s import all the necessary packages and SQL extensions for running the %sql
magic commands used in this notebook:
import os
import json
import pandas as pd
import psycopg2
from dotenv import load_dotenv
from sqlalchemy import create_engine
pd.set_option('display.max_columns', 30)
%load_ext sql
In the AWS console, go to CloudFormation where two stacks have already been deployed. One is associated with your Cloud environment with a name prefix aws-cloud9
; and another with an alphanumeric ID. Click on the alphanumeric ID stack and it will take you to another screen page with details about this stack. In the Outputs tab, you will see the key PostgresEndpoint
and its corresponding Value column. Copy the value and edit the ./src/env
file, replacing the placeholder <RDS-ENDPOINT>
with the endpoint value. Save changes to the file.
Execute the following cell to load the environment variables and connect to the database:
load_dotenv('./src/env', override=True)
DBHOST = os.getenv('DBHOST')
DBPORT = os.getenv('DBPORT')
DBNAME = os.getenv('DBNAME')
DBUSER = os.getenv('DBUSER')
DBPASSWORD = os.getenv('DBPASSWORD')
connection_url = f"postgresql+psycopg2://{DBUSER}:{DBPASSWORD}@{DBHOST}:{DBPORT}/{DBNAME}"
%sql {connection_url}
Explore the loaded data:
%%sql
select count(*) from classicmodels_obt.orders;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
1 rows affected.
count |
---|
326 |
%%sql
select * from classicmodels_obt.orders limit 3;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
3 rows affected.
ordernumber | orderdate | requireddate | shippeddate | status | comments | orderdetails | customer |
---|---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | [{'priceEach': 136.0, 'productCode': 'S18_1749', 'quantityOrdered': 30}, {'priceEach': 55.09, 'productCode': 'S18_2248', 'quantityOrdered': 50}, {'priceEach': 75.46, 'productCode': 'S18_4409', 'quantityOrdered': 22}, {'priceEach': 35.29, 'productCode': 'S24_3969', 'quantityOrdered': 49}] | {'city': 'Nashua', 'phone': '6035558647', 'state': 'NH', 'country': 'USA', 'postalCode': '62005', 'creditLimit': 114200.0, 'addressLine1': '2304 Long Airport Avenue', 'addressLine2': None, 'customerName': 'Online Diecast Creations Co.', 'contactLastName': 'Young', 'contactFirstName': 'Dorothy', 'salesRepEmployeeNumber': 1216.0} |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | [{'priceEach': 108.06, 'productCode': 'S18_2325', 'quantityOrdered': 25}, {'priceEach': 167.06, 'productCode': 'S18_2795', 'quantityOrdered': 26}, {'priceEach': 32.53, 'productCode': 'S24_1937', 'quantityOrdered': 45}, {'priceEach': 44.35, 'productCode': 'S24_2022', 'quantityOrdered': 46}] | {'city': 'Frankfurt', 'phone': '+49 69 66 90 2555', 'state': None, 'country': 'Germany', 'postalCode': '60528', 'creditLimit': 59700.0, 'addressLine1': 'Lyonerstr. 34', 'addressLine2': None, 'customerName': 'Blauer See Auto, Co.', 'contactLastName': 'Keitel', 'contactFirstName': 'Roland', 'salesRepEmployeeNumber': 1504.0} |
10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | [{'priceEach': 95.55, 'productCode': 'S18_1342', 'quantityOrdered': 39}, {'priceEach': 43.13, 'productCode': 'S18_1367', 'quantityOrdered': 41}] | {'city': 'NYC', 'phone': '2125551500', 'state': 'NY', 'country': 'USA', 'postalCode': '10022', 'creditLimit': 76400.0, 'addressLine1': '2678 Kingston Rd.', 'addressLine2': 'Suite 101', 'customerName': 'Vitachrome Inc.', 'contactLastName': 'Frick', 'contactFirstName': 'Michael', 'salesRepEmployeeNumber': 1286.0} |
The dataset has been generated into a One Big Table (OBT) from a multiline JSON file, consolidating all data into a single dataset for query processing. The orderdetails
and customer
fields are not basic data types, they are dictionaries with the following structure:
orderdetails
is a list of dictionaries/JSON objects about all the products under an individual order. Each entry on the list refers to a product, with keys on its product code, the quantity ordered and the unitary price.
customer
field contains a dictionary/JSON object where each key corresponds to a feature of the customer that placed the order, this includes information on personal details, contact, and location.
Now you will perform some basic transformations to the dataset to achieve the three different normal forms and separate the data in each form in a different schema.
Your first task is to create a First Normal Form (1NF) of this dataset and insert it into the database. For that, you will read the data using an SQL query and will transform the data using pandas
package.
The feature of the First Normal Form is that a relation in this form does not contain any multi-valued field. That means that each attribute/field/column contains only atomic or single values.
Given the current form that our dataset has, in order to achieve a 1NF you need to unnest the orderdetails
list and for each element, you will create a row. In addition, as each element is a JSON object or dictionary (which is a multi-valued field), you also have to unnest the data and create a new field or column for each key inside this dictionary. This last step needs to be replicated in the customer
JSON object field.
The final schema after the 1NF will look like the following image
2.1. Create the schema in which the relations in 1NF will be stored and then read the OBT dataset and save it as a pandas
dataframe.
%%sql
CREATE SCHEMA IF NOT EXISTS classicmodels_1nf;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
Have a look again at the original data:
result = %sql select * from classicmodels_obt.orders
df = result.DataFrame()
df.head()
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
326 rows affected.
ordernumber | orderdate | requireddate | shippeddate | status | comments | orderdetails | customer | |
---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | [{'priceEach': 136.0, 'productCode': 'S18_1749... | {'city': 'Nashua', 'phone': '6035558647', 'sta... |
1 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | [{'priceEach': 108.06, 'productCode': 'S18_232... | {'city': 'Frankfurt', 'phone': '+49 69 66 90 2... |
2 | 10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | [{'priceEach': 95.55, 'productCode': 'S18_1342... | {'city': 'NYC', 'phone': '2125551500', 'state'... |
3 | 10103 | 2003-01-29 | 2003-02-07 | 2003-02-02 | Shipped | None | [{'priceEach': 214.3, 'productCode': 'S10_1949... | {'city': 'Stavern', 'phone': '07-98 9555', 'st... |
4 | 10104 | 2003-01-31 | 2003-02-09 | 2003-02-01 | Shipped | None | [{'priceEach': 131.44, 'productCode': 'S12_314... | {'city': 'Madrid', 'phone': '(91) 555 94 44', ... |
2.2. Create a new flat table with the dictionaries extracted from the customer
field of the df
dataframe. You can use the pandas
json_normalize()
method over the customer
field for this purpose. This function will return a dataframe with the key-value pairs as columns of the new table.
Note: In the cells where you see the comments ### START CODE HERE ###
and ### END CODE HERE ###
you need to complete the code replacing all None
. The rest of the cells are already complete, you just need to review and run the code.
### START CODE HERE ### (1 line of code)
customers_df = pd.json_normalize(df['customer']) # @REPLACE EQUALS pd.None(None['None'])
### END CODE HERE ###
customers_df.head()
city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 |
1 | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 |
2 | NYC | 2125551500 | NY | USA | 10022 | 76400.0 | 2678 Kingston Rd. | Suite 101 | Vitachrome Inc. | Frick | Michael | 1286.0 |
3 | Stavern | 07-98 9555 | None | Norway | 4110 | 81700.0 | Erling Skakkes gate 78 | None | Baane Mini Imports | Bergulfsen | Jonas | 1504.0 |
4 | Madrid | (91) 555 94 44 | None | Spain | 28034 | 227600.0 | C/ Moralzarzal, 86 | None | Euro+ Shopping Channel | Freyre | Diego | 1370.0 |
city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber |
---|---|---|---|---|---|---|---|---|---|---|---|
Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 |
Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 |
NYC | 2125551500 | NY | USA | 10022 | 76400.0 | 2678 Kingston Rd. | Suite 101 | Vitachrome Inc. | Frick | Michael | 1286.0 |
Stavern | 07-98 9555 | None | Norway | 4110 | 81700.0 | Erling Skakkes gate 78 | None | Baane Mini Imports | Bergulfsen | Jonas | 1504.0 |
Madrid | (91) 555 94 44 | None | Spain | 28034 | 227600.0 | C/ Moralzarzal, 86 | None | Euro+ Shopping Channel | Freyre | Diego | 1370.0 |
pd.json_normalize
creates a DataFrame where each dictionary in the customer
column is flattened into a row. The index of the original DataFrame is preserved, which is crucial for maintaining the correct relationship between the original rows and the new flattened rows. It is time to concatenate the two datasets.
2.3. You need to drop the customer
column from the original dataframe, using drop()
method and specifying the column. You should keep inplace
argument equal to True
.
Then concatenate the dataframe df
with customers_df
. For that, you will use pd.concat
method with the axis
parameter set as 1. Using pd.concat
with axis=1
joins the DataFrames column-wise, aligning rows by their index. Since the index is preserved, each row in the flattened customers_df
dataframe aligns correctly with its corresponding row in the original dataframe.
### START CODE HERE ### (2 lines of code)
df.drop(columns='customer', inplace=True) # @REPLACE df.None(columns='None', inplace=True)
df = pd.concat([df, customers_df], axis=1) # @REPLACE EQUALS pd.None([df, None], axis=1)
### END CODE HERE ###
df.head()
ordernumber | orderdate | requireddate | shippeddate | status | comments | orderdetails | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | [{'priceEach': 136.0, 'productCode': 'S18_1749... | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 |
1 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | [{'priceEach': 108.06, 'productCode': 'S18_232... | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 |
2 | 10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | [{'priceEach': 95.55, 'productCode': 'S18_1342... | NYC | 2125551500 | NY | USA | 10022 | 76400.0 | 2678 Kingston Rd. | Suite 101 | Vitachrome Inc. | Frick | Michael | 1286.0 |
3 | 10103 | 2003-01-29 | 2003-02-07 | 2003-02-02 | Shipped | None | [{'priceEach': 214.3, 'productCode': 'S10_1949... | Stavern | 07-98 9555 | None | Norway | 4110 | 81700.0 | Erling Skakkes gate 78 | None | Baane Mini Imports | Bergulfsen | Jonas | 1504.0 |
4 | 10104 | 2003-01-31 | 2003-02-09 | 2003-02-01 | Shipped | None | [{'priceEach': 131.44, 'productCode': 'S12_314... | Madrid | (91) 555 94 44 | None | Spain | 28034 | 227600.0 | C/ Moralzarzal, 86 | None | Euro+ Shopping Channel | Freyre | Diego | 1370.0 |
Note: Some text is omitted.
ordernumber | orderdate | requireddate | shippeddate | status | comments | orderdetails | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | text… | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | text… | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 |
10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | text… | NYC | 2125551500 | NY | USA | 10022 | 76400.0 | 2678 Kingston Rd. | Suite 101 | Vitachrome Inc. | Frick | Michael | 1286.0 |
10103 | 2003-01-29 | 2003-02-07 | 2003-02-02 | Shipped | None | text… | Stavern | 07-98 9555 | None | Norway | 4110 | 81700.0 | Erling Skakkes gate 78 | None | Baane Mini Imports | Bergulfsen | Jonas | 1504.0 |
10104 | 2003-01-31 | 2003-02-09 | 2003-02-01 | Shipped | None | text… | Madrid | (91) 555 94 44 | None | Spain | 28034 | 227600.0 | C/ Moralzarzal, 86 | None | Euro+ Shopping Channel | Freyre | Diego | 1370.0 |
2.4. Add an ID column with each entry holding a unique identifier for each customer in addition to the customer’s name. (The code below is complete; no change is required).
df['customerNumber'] = pd.factorize(df['customerName'])[0] + 1
df.head()
ordernumber | orderdate | requireddate | shippeddate | status | comments | orderdetails | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | [{'priceEach': 136.0, 'productCode': 'S18_1749... | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
1 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | [{'priceEach': 108.06, 'productCode': 'S18_232... | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 |
2 | 10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | [{'priceEach': 95.55, 'productCode': 'S18_1342... | NYC | 2125551500 | NY | USA | 10022 | 76400.0 | 2678 Kingston Rd. | Suite 101 | Vitachrome Inc. | Frick | Michael | 1286.0 | 3 |
3 | 10103 | 2003-01-29 | 2003-02-07 | 2003-02-02 | Shipped | None | [{'priceEach': 214.3, 'productCode': 'S10_1949... | Stavern | 07-98 9555 | None | Norway | 4110 | 81700.0 | Erling Skakkes gate 78 | None | Baane Mini Imports | Bergulfsen | Jonas | 1504.0 | 4 |
4 | 10104 | 2003-01-31 | 2003-02-09 | 2003-02-01 | Shipped | None | [{'priceEach': 131.44, 'productCode': 'S12_314... | Madrid | (91) 555 94 44 | None | Spain | 28034 | 227600.0 | C/ Moralzarzal, 86 | None | Euro+ Shopping Channel | Freyre | Diego | 1370.0 | 5 |
Now that the customer
field has been transformed into atomic-valued fields, you need to do the same for the orderdetails
column.
2.5. Create a new row for each product in the same order using the pandas
explode()
method on orderdetails
column of the original df
dataframe. Make sure to set the ignore_index
parameter to True
.
### START CODE HERE ### (1 line of code)
df_exploded = df.explode('orderdetails', ignore_index=True) # @REPLACE EQUALS df.None('None', ignore_index=None)
### END CODE HERE ###
df_exploded.head()
ordernumber | orderdate | requireddate | shippeddate | status | comments | orderdetails | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | {'priceEach': 136.0, 'productCode': 'S18_1749'... | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | {'priceEach': 55.09, 'productCode': 'S18_2248'... | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | {'priceEach': 75.46, 'productCode': 'S18_4409'... | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | {'priceEach': 35.29, 'productCode': 'S24_3969'... | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | {'priceEach': 108.06, 'productCode': 'S18_2325... | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 |
ordernumber | orderdate | requireddate | shippeddate | status | comments | orderdetails | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | text… | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | text… | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | text… | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | text… | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | text… | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 |
The explode
function is used to transform each element of a list-like column into a separate row. Setting ignore_index=True
resets the index of the resulting dataframe, creating a new integer index that starts from 0.
2.6. Use again the json_normalize
over the orderdetails
column of the dataframe df_exploded
.
### START CODE HERE ### (1 line of code)
orderdetails_normalized = pd.json_normalize(df_exploded['orderdetails']) # @REPLACE EQUALS pd.None(None['None'])
### END CODE HERE ###
orderdetails_normalized.head()
priceEach | productCode | quantityOrdered | |
---|---|---|---|
0 | 136.00 | S18_1749 | 30 |
1 | 55.09 | S18_2248 | 50 |
2 | 75.46 | S18_4409 | 22 |
3 | 35.29 | S24_3969 | 49 |
4 | 108.06 | S18_2325 | 25 |
priceEach | productCode | quantityOrdered |
---|---|---|
136.00 | S18_1749 | 30 |
55.09 | S18_2248 | 50 |
75.46 | S18_4409 | 22 |
35.29 | S24_3969 | 49 |
108.06 | S18_2325 | 25 |
2.7. Finally, drop()
the orderdetails
column from the original df_exploded
dataframe keeping inplace
argument equal to True
. Then concat()
df_exploded
dataframe with the orderdetails_normalized
dataframe.
### START CODE HERE ### (2 lines of code)
df_exploded.drop(columns='orderdetails', inplace=True) # @REPLACE df_exploded.None(columns='None', inplace=True)
df_normalized = pd.concat([df_exploded, orderdetails_normalized], axis=1) # @REPLACE EQUALS pd.None([df_exploded, None], axis=1)
### END CODE HERE ###
# First Normal Form
df_normalized.head()
ordernumber | orderdate | requireddate | shippeddate | status | comments | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber | priceEach | productCode | quantityOrdered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 136.00 | S18_1749 | 30 |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 55.09 | S18_2248 | 50 |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 75.46 | S18_4409 | 22 |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 35.29 | S24_3969 | 49 |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 108.06 | S18_2325 | 25 |
ordernumber | orderdate | requireddate | shippeddate | status | comments | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber | priceEach | productCode | quantityOrdered |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 136.00 | S18_1749 | 30 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 55.09 | S18_2248 | 50 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 75.46 | S18_4409 | 22 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 35.29 | S24_3969 | 49 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 108.06 | S18_2325 | 25 |
The resulting dataframe df_normalized
should be in the Frist Normal Form now.
2.8. Now that you have atomic values in your dataset, let’s create an additional identifier for each product in each order. You will call it Order Line Number (orderlinenumber
) and in combination with the ordernumber
column, they will create a composite primary key.
df_normalized['orderlinenumber'] = df_normalized.groupby('ordernumber').cumcount() + 1
df_normalized.head()
ordernumber | orderdate | requireddate | shippeddate | status | comments | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber | priceEach | productCode | quantityOrdered | orderlinenumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 136.00 | S18_1749 | 30 | 1 |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 55.09 | S18_2248 | 50 | 2 |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 75.46 | S18_4409 | 22 | 3 |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 35.29 | S24_3969 | 49 | 4 |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 108.06 | S18_2325 | 25 | 1 |
2.9. With those transformations, you have finished your normalization up to 1NF. Let’s insert this dataset into your database. Drop the table if it has been loaded before to avoid an error.
%%sql
DROP TABLE IF EXISTS classicmodels_1nf.orders
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
Create table named orders
in the schema classicmodels_1nf
.
engine = create_engine(connection_url)
df_normalized.to_sql('orders', engine, schema='classicmodels_1nf', index=False)
996
Inspect the data that you just loaded.
%%sql
SELECT COUNT(*) FROM classicmodels_1nf.orders;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
1 rows affected.
count |
---|
2996 |
%%sql
SELECT * FROM classicmodels_1nf.orders LIMIT 10;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
10 rows affected.
ordernumber | orderdate | requireddate | shippeddate | status | comments | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber | priceEach | productCode | quantityOrdered | orderlinenumber |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 136.0 | S18_1749 | 30 | 1 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 55.09 | S18_2248 | 50 | 2 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 75.46 | S18_4409 | 22 | 3 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 35.29 | S24_3969 | 49 | 4 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 108.06 | S18_2325 | 25 | 1 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 167.06 | S18_2795 | 26 | 2 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 32.53 | S24_1937 | 45 | 3 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 44.35 | S24_2022 | 46 | 4 |
10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | NYC | 2125551500 | NY | USA | 10022 | 76400.0 | 2678 Kingston Rd. | Suite 101 | Vitachrome Inc. | Frick | Michael | 1286.0 | 3 | 95.55 | S18_1342 | 39 | 1 |
10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | NYC | 2125551500 | NY | USA | 10022 | 76400.0 | 2678 Kingston Rd. | Suite 101 | Vitachrome Inc. | Frick | Michael | 1286.0 | 3 | 43.13 | S18_1367 | 41 | 2 |
In the second normal form (2NF), data must already be in the first normal form (1NF), and all non-key attributes must be fully functionally dependent on the primary key.
Here, by using the ordernumber
and orderlinenumber
columns as a composite primary key, you will create a unique identifier for each order line.
Additionally, you have a separate identifier for customers along with their complete information. Since it’s unnecessary to include all customer details in the same row as the order, you only need the customer identifier in the order table. The complete customer information can be moved to a separate table that is fully functional and dependent only on the customer identifier.
The final schema after the 2NF transformations will look like this:
3.1. Create the classicmodels_2nf
schema to store our tables and then read the 1NF dataset and transform it to create a 2NF version.
%%sql
CREATE SCHEMA IF NOT EXISTS classicmodels_2nf;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
Have a look again at the data in the 1NF:
result = %sql select * from classicmodels_1nf.orders
df_orders = result.DataFrame()
df_orders.head()
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
2996 rows affected.
ordernumber | orderdate | requireddate | shippeddate | status | comments | city | phone | state | country | postalCode | creditLimit | addressLine1 | addressLine2 | customerName | contactLastName | contactFirstName | salesRepEmployeeNumber | customerNumber | priceEach | productCode | quantityOrdered | orderlinenumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 136.00 | S18_1749 | 30 | 1 |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 55.09 | S18_2248 | 50 | 2 |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 75.46 | S18_4409 | 22 | 3 |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | Nashua | 6035558647 | NH | USA | 62005 | 114200.0 | 2304 Long Airport Avenue | None | Online Diecast Creations Co. | Young | Dorothy | 1216.0 | 1 | 35.29 | S24_3969 | 49 | 4 |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | Frankfurt | +49 69 66 90 2555 | None | Germany | 60528 | 59700.0 | Lyonerstr. 34 | None | Blauer See Auto, Co. | Keitel | Roland | 1504.0 | 2 | 108.06 | S18_2325 | 25 | 1 |
3.2. Extract all the information related to customers and create a table with the unique values for each customer.
df_orders
dataframe (see the list in the cell below) and make a copy of it with the copy()
method naming it df_customers
.drop_duplicates
. Make sure that you put argument inplace
equal to True
.customer_columns = ['customerNumber',
'customerName',
'contactLastName',
'contactFirstName',
'phone',
'addressLine1',
'addressLine2',
'postalCode',
'city',
'state',
'country',
'creditLimit',
'salesRepEmployeeNumber',
]
### START CODE HERE ### (2 lines of code)
df_customers = df_orders[customer_columns].copy() # @REPLACE EQUALS None[None].None()
df_customers.drop_duplicates(inplace=True) # @REPLACE None.None(inplace=None)
### END CODE HERE ###
df_customers.head()
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | postalCode | city | state | country | creditLimit | salesRepEmployeeNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Online Diecast Creations Co. | Young | Dorothy | 6035558647 | 2304 Long Airport Avenue | None | 62005 | Nashua | NH | USA | 114200.0 | 1216.0 |
4 | 2 | Blauer See Auto, Co. | Keitel | Roland | +49 69 66 90 2555 | Lyonerstr. 34 | None | 60528 | Frankfurt | None | Germany | 59700.0 | 1504.0 |
8 | 3 | Vitachrome Inc. | Frick | Michael | 2125551500 | 2678 Kingston Rd. | Suite 101 | 10022 | NYC | NY | USA | 76400.0 | 1286.0 |
10 | 4 | Baane Mini Imports | Bergulfsen | Jonas | 07-98 9555 | Erling Skakkes gate 78 | None | 4110 | Stavern | None | Norway | 81700.0 | 1504.0 |
26 | 5 | Euro+ Shopping Channel | Freyre | Diego | (91) 555 94 44 | C/ Moralzarzal, 86 | None | 28034 | Madrid | None | Spain | 227600.0 | 1370.0 |
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | postalCode | city | state | country | creditLimit | salesRepEmployeeNumber |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Online Diecast Creations Co. | Young | Dorothy | 6035558647 | 2304 Long Airport Avenue | None | 62005 | Nashua | NH | USA | 114200.0 | 1216.0 |
2 | Blauer See Auto, Co. | Keitel | Roland | +49 69 66 90 2555 | Lyonerstr. 34 | None | 60528 | Frankfurt | None | Germany | 59700.0 | 1504.0 |
3 | Vitachrome Inc. | Frick | Michael | 2125551500 | 2678 Kingston Rd. | Suite 101 | 10022 | NYC | NY | USA | 76400.0 | 1286.0 |
4 | Baane Mini Imports | Bergulfsen | Jonas | 07-98 9555 | Erling Skakkes gate 78 | None | 4110 | Stavern | None | Norway | 81700.0 | 1504.0 |
5 | Euro+ Shopping Channel | Freyre | Diego | (91) 555 94 44 | C/ Moralzarzal, 86 | None | 28034 | Madrid | None | Spain | 227600.0 | 1370.0 |
3.3. Now that you have your customers’ dataset, that information can be dropped from the original dataframe. The only necessary column to keep is the customerNumber
as it helps to relate the orders with customers’ information. Create a list of the columns which you need to drop from the df_orders
dataframe:
customer_columns.pop(0)
customer_columns
['customerName',
'contactLastName',
'contactFirstName',
'phone',
'addressLine1',
'addressLine2',
'postalCode',
'city',
'state',
'country',
'creditLimit',
'salesRepEmployeeNumber']
3.4. Drop the customer_columns
from the dataframe df_orders
. The inplace
argument should be equal to True
.
### START CODE HERE ### (1 line of code)
df_orders.drop(columns=customer_columns, inplace=True) # @REPLACE None.None(columns=None, inplace=None)
### END CODE HERE ###
df_orders.head()
ordernumber | orderdate | requireddate | shippeddate | status | comments | customerNumber | priceEach | productCode | quantityOrdered | orderlinenumber | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 136.00 | S18_1749 | 30 | 1 |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 55.09 | S18_2248 | 50 | 2 |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 75.46 | S18_4409 | 22 | 3 |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 35.29 | S24_3969 | 49 | 4 |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 | 108.06 | S18_2325 | 25 | 1 |
ordernumber | orderdate | requireddate | shippeddate | status | comments | customerNumber | priceEach | productCode | quantityOrdered | orderlinenumber |
---|---|---|---|---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 136.00 | S18_1749 | 30 | 1 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 55.09 | S18_2248 | 50 | 2 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 75.46 | S18_4409 | 22 | 3 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 35.29 | S24_3969 | 49 | 4 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 | 108.06 | S18_2325 | 25 | 1 |
You have created 2NF version of the relations. The result of this normalization step is two datasets or entities:
orders
: Containing only the information about each order.customers
: Containing only the complete information about the customers.3.5. The two previous entities can be related through the customerNumber
field. Insert the two previous tables into the classicmodels_2nf
schema. You will need to drop the tables before that in case they have been added before.
%%sql
DROP TABLE IF EXISTS classicmodels_2nf.orders
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
%%sql
DROP TABLE IF EXISTS classicmodels_2nf.customers
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
df_orders.to_sql('orders', engine, schema='classicmodels_2nf', index=False)
996
df_customers.to_sql('customers', engine, schema='classicmodels_2nf', index=False)
98
Explore the tables.
%%sql
SELECT COUNT(*) FROM classicmodels_2nf.orders;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
1 rows affected.
count |
---|
2996 |
%%sql
SELECT COUNT(*) FROM classicmodels_2nf.customers;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
1 rows affected.
count |
---|
98 |
%%sql
SELECT * FROM classicmodels_2nf.customers limit 10;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
10 rows affected.
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | postalCode | city | state | country | creditLimit | salesRepEmployeeNumber |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Online Diecast Creations Co. | Young | Dorothy | 6035558647 | 2304 Long Airport Avenue | None | 62005 | Nashua | NH | USA | 114200.0 | 1216.0 |
2 | Blauer See Auto, Co. | Keitel | Roland | +49 69 66 90 2555 | Lyonerstr. 34 | None | 60528 | Frankfurt | None | Germany | 59700.0 | 1504.0 |
3 | Vitachrome Inc. | Frick | Michael | 2125551500 | 2678 Kingston Rd. | Suite 101 | 10022 | NYC | NY | USA | 76400.0 | 1286.0 |
4 | Baane Mini Imports | Bergulfsen | Jonas | 07-98 9555 | Erling Skakkes gate 78 | None | 4110 | Stavern | None | Norway | 81700.0 | 1504.0 |
5 | Euro+ Shopping Channel | Freyre | Diego | (91) 555 94 44 | C/ Moralzarzal, 86 | None | 28034 | Madrid | None | Spain | 227600.0 | 1370.0 |
6 | Danish Wholesale Imports | Petersen | Jytte | 31 12 3555 | Vinbæltet 34 | None | 1734 | Kobenhavn | None | Denmark | 83400.0 | 1401.0 |
7 | Down Under Souveniers, Inc | Graham | Mike | +64 9 312 5555 | 162-164 Grafton Road | Level 2 | None | Auckland | None | New Zealand | 88000.0 | 1612.0 |
8 | Rovelli Gifts | Rovelli | Giovanni | 035-640555 | Via Ludovico il Moro 22 | None | 24100 | Bergamo | None | Italy | 119600.0 | 1401.0 |
9 | Land of Toys Inc. | Lee | Kwai | 2125557818 | 897 Long Airport Avenue | None | 10022 | NYC | NY | USA | 114900.0 | 1323.0 |
10 | Cruz & Sons Co. | Cruz | Arnold | +63 2 555 3587 | 15 McCallum Street | NatWest Center #13-03 | 1227 MM | Makati City | None | Philippines | 81500.0 | 1621.0 |
%%sql
SELECT * FROM classicmodels_2nf.orders limit 10;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
10 rows affected.
ordernumber | orderdate | requireddate | shippeddate | status | comments | customerNumber | priceEach | productCode | quantityOrdered | orderlinenumber |
---|---|---|---|---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 136.0 | S18_1749 | 30 | 1 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 55.09 | S18_2248 | 50 | 2 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 75.46 | S18_4409 | 22 | 3 |
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 35.29 | S24_3969 | 49 | 4 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 | 108.06 | S18_2325 | 25 | 1 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 | 167.06 | S18_2795 | 26 | 2 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 | 32.53 | S24_1937 | 45 | 3 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 | 44.35 | S24_2022 | 46 | 4 |
10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | 3 | 95.55 | S18_1342 | 39 | 1 |
10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | 3 | 43.13 | S18_1367 | 41 | 2 |
The features of 3NF are the following:
In the orders
table you have the following columns: ordernumber
, orderdate
, requireddate
, shippeddate
, status
, comments
, customerNumber
, priceEach
, productCode
, quantityOrdered
, and orderlinenumber
.
Let’s identify Transitive Dependencies:
orderdate
, requireddate
, shippeddate
, status
, comments
, salesRepEmployeeNumber
, and customerNumber
are dependent on ordernumber
.productCode
, priceEach
, and quantityOrdered
depend on the composite key (ordernumber
, orderlinenumber
).You can see that columns related with order-level information such as dates, comments
or even customerNumber
depend only on ordernumber
, which given that is part of the composite key (ordernumber
, orderlinenumber
), make those order-level fields to also depend on the composite key, and in particular on the orderlinenumber
field. This is a transitive relationship that you should get rid of to achieve 3NF.
To transform the relations into 3NF, you need to separate the table into two: one for order-level information and another for order line items or order details.
ordernumber
orderdate
, requireddate
, shippeddate
, status
, comments
, salesRepEmployeeNumber
, customerNumber
.ordernumber
, orderlinenumber
)productCode
, priceEach
, quantityOrdered
By organizing the data this way, you can ensure that each non-prime attribute depends only on the primary key, achieving 3NF. Let’s do it and upload our tables in the third normal form.
The final result of the 3NF normalization is the following
4.1. Create the classicmodels_3nf
schema to store your transformed tables there.
%%sql
CREATE SCHEMA IF NOT EXISTS classicmodels_3nf;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
4.2. You will need to read the orders
and customers
tables from the classicmodels_2nf
schema. Although you will not make any further changes to the customers
table in this step, you will upload it into the 3NF schema to keep all your datasets in the same place.
Read the orders
table into the df_orders
pandas dataframe:
result = %sql select * from classicmodels_2nf.orders
df_orders = result.DataFrame()
df_orders.head()
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
2996 rows affected.
ordernumber | orderdate | requireddate | shippeddate | status | comments | customerNumber | priceEach | productCode | quantityOrdered | orderlinenumber | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 136.00 | S18_1749 | 30 | 1 |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 55.09 | S18_2248 | 50 | 2 |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 75.46 | S18_4409 | 22 | 3 |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 | 35.29 | S24_3969 | 49 | 4 |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 | 108.06 | S18_2325 | 25 | 1 |
Read the customers
table into df_customers
:
result = %sql select * from classicmodels_2nf.customers
df_customers = result.DataFrame()
df_customers.head()
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
98 rows affected.
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | postalCode | city | state | country | creditLimit | salesRepEmployeeNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Online Diecast Creations Co. | Young | Dorothy | 6035558647 | 2304 Long Airport Avenue | None | 62005 | Nashua | NH | USA | 114200.0 | 1216.0 |
1 | 2 | Blauer See Auto, Co. | Keitel | Roland | +49 69 66 90 2555 | Lyonerstr. 34 | None | 60528 | Frankfurt | None | Germany | 59700.0 | 1504.0 |
2 | 3 | Vitachrome Inc. | Frick | Michael | 2125551500 | 2678 Kingston Rd. | Suite 101 | 10022 | NYC | NY | USA | 76400.0 | 1286.0 |
3 | 4 | Baane Mini Imports | Bergulfsen | Jonas | 07-98 9555 | Erling Skakkes gate 78 | None | 4110 | Stavern | None | Norway | 81700.0 | 1504.0 |
4 | 5 | Euro+ Shopping Channel | Freyre | Diego | (91) 555 94 44 | C/ Moralzarzal, 86 | None | 28034 | Madrid | None | Spain | 227600.0 | 1370.0 |
4.3. In order to separate the classicmodels_2nf.orders
table into the orderdetails
and the order
table for the 3NF, let’s extract the columns columns_order_details
creating the orderdetails
dataframe as a copy from df_orders
:
columns_order_details = ["ordernumber", "orderlinenumber", "productCode", "quantityOrdered", "priceEach"]
### START CODE HERE ### (1 line of code)
df_orderdetails = df_orders[columns_order_details].copy() # @REPLACE EQUALS None[None].None()
### END CODE HERE ###
df_orderdetails.head()
ordernumber | orderlinenumber | productCode | quantityOrdered | priceEach | |
---|---|---|---|---|---|
0 | 10100 | 1 | S18_1749 | 30 | 136.00 |
1 | 10100 | 2 | S18_2248 | 50 | 55.09 |
2 | 10100 | 3 | S18_4409 | 22 | 75.46 |
3 | 10100 | 4 | S24_3969 | 49 | 35.29 |
4 | 10101 | 1 | S18_2325 | 25 | 108.06 |
ordernumber | orderlinenumber | productCode | quantityOrdered | priceEach |
---|---|---|---|---|
10100 | 1 | S18_1749 | 30 | 136.00 |
10100 | 2 | S18_2248 | 50 | 55.09 |
10100 | 3 | S18_4409 | 22 | 75.46 |
10100 | 4 | S24_3969 | 49 | 35.29 |
10101 | 1 | S18_2325 | 25 | 108.06 |
4.4. Create a list of the columns which you need to drop from the df_orders
dataframe - the ones associated with the orderdetails
table. You will keep the ordernumber
field as it is the key to relating the two tables:
columns_order_details.pop(0)
columns_order_details
['orderlinenumber', 'productCode', 'quantityOrdered', 'priceEach']
4.5. Create the df_orders
table with only the necessary columns, dropping the columns specified in the list columns_order_details
. Then drop the duplicated rows with the method drop_duplicates()
. You should keep inplace
argument equal to True
.
### START CODE HERE ### (2 lines of code)
df_orders.drop(columns=columns_order_details, inplace=True) # @REPLACE None.None(columns=None, inplace=None)
df_orders.drop_duplicates(inplace=True) # @REPLACE None.None(inplace=None)
### END CODE HERE ###
df_orders.head()
ordernumber | orderdate | requireddate | shippeddate | status | comments | customerNumber | |
---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 |
8 | 10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | 3 |
10 | 10103 | 2003-01-29 | 2003-02-07 | 2003-02-02 | Shipped | None | 4 |
26 | 10104 | 2003-01-31 | 2003-02-09 | 2003-02-01 | Shipped | None | 5 |
ordernumber | orderdate | requireddate | shippeddate | status | comments | customerNumber |
---|---|---|---|---|---|---|
10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | None | 1 |
10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 2 |
10102 | 2003-01-10 | 2003-01-18 | 2003-01-14 | Shipped | None | 3 |
10103 | 2003-01-29 | 2003-02-07 | 2003-02-02 | Shipped | None | 4 |
10104 | 2003-01-31 | 2003-02-09 | 2003-02-01 | Shipped | None | 5 |
Great! With those transformations you have achieved a 3NF from your initial OBT.
4.6. Let’s upload the data into the classicmodels_3nf
schema. Remember that you have three tables: customers
, orders
and orderdetails
.
%%sql
DROP TABLE IF EXISTS classicmodels_3nf.customers
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
%%sql
DROP TABLE IF EXISTS classicmodels_3nf.orders
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
%%sql
DROP TABLE IF EXISTS classicmodels_3nf.orderdetails
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
Done.
[]
df_customers.to_sql('customers', engine, schema='classicmodels_3nf', index=False)
98
df_orders.to_sql('orders', engine, schema='classicmodels_3nf', index=False)
326
df_orderdetails.to_sql('orderdetails', engine, schema='classicmodels_3nf', index=False)
996
Finally, you can take a look to each of the tables that have been stored in your database.
%%sql
select count(*) from classicmodels_3nf.customers;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
1 rows affected.
count |
---|
98 |
%%sql
select count(*) from classicmodels_3nf.orders;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
1 rows affected.
count |
---|
326 |
%%sql
select count(*) from classicmodels_3nf.orderdetails;
* postgresql+psycopg2://postgresuser:***@de-c4w1lab1-rds.c9g6s0k8ogtn.us-east-1.rds.amazonaws.com:5432/postgres
1 rows affected.
count |
---|
2996 |
In this lab, you have successfully transformed a One Big Table (OBT) into First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF). This process of normalization is crucial in designing efficient and reliable databases.
Throughout the lab, you have learned the importance of each normal form and the specific steps required to achieve them:
This lab has provided you with hands-on experience in transforming a dataset into a normalized form, highlighting the practical steps and considerations involved in database normalization. As you progress in your work as a Data Engineer, these skills will be invaluable in ensuring the quality and efficiency of the databases you design and maintain.