-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
40afdf5
commit 0cee092
Showing
1 changed file
with
314 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,314 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "cognitive-hours", | ||
"metadata": {}, | ||
"source": [ | ||
"# Keys in Real Data: Cabrillo Courses \n", | ||
"\n", | ||
"In this lab we will examine some trade-offs when addressing problems with data. This lab builds on the schema that we imported in the [Using Real Data: Cabrillo Courses](http://www.lifealgorithmic.com/content/intro-to-rdbms/labs/realdata_1_cabrillo_import.html) lab. This is the first time I've used Jupyter for this lab and I've discovered that it's considerably faster. \n", | ||
"\n", | ||
"As usual, start by running setup code in the next cell: " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "stupid-biodiversity", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%load_ext sql\n", | ||
"%config SqlMagic.autolimit=500\n", | ||
" \n", | ||
"import pandas as pd \n", | ||
"from sqlalchemy import create_engine" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "advised-queens", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 0: Import the Data (The Jupyter Way)\n", | ||
"\n", | ||
"Upload the CSV files that I provided into Jupyter Lab by dragging and dropping them from your computer to the files panel. The code in the next cell imports the CSVs and creates tables with corresponding names. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "occasional-vulnerability", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"datafiles = ['MasterCourseFile.csv', 'ProgramFile.csv', 'ProgramCourseFile.csv']\n", | ||
"url = 'sqlite:///real_data_2.sqlite3'\n", | ||
"engine = create_engine(url, echo=False)\n", | ||
"for f in datafiles:\n", | ||
" df = pd.read_csv(f)\n", | ||
" df.to_sql(f.replace('.csv',''), con=engine, if_exists='replace')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "boolean-fairy", | ||
"metadata": {}, | ||
"source": [ | ||
"**Much faster!!** Now verify that your tables are loaded using the query in the next cell:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "addressed-trustee", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"select * from ProgramFile" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "typical-bhutan", | ||
"metadata": {}, | ||
"source": [ | ||
"## Cleanup (if necessary) \n", | ||
"\n", | ||
"This cell enables you to re-run the whole notebook by removing the created tables:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "inside-organic", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"drop table if exists program_course; \n", | ||
"drop table if exists program; \n", | ||
"drop table if exists course;" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "established-spanking", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 1: Course Data \n", | ||
"\n", | ||
"The course data is luckily clean. Let's recreate the course table with a key: " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "conventional-walker", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"PRAGMA foreign_keys = ON;\n", | ||
"\n", | ||
"drop table if exists course; \n", | ||
"create table course (\n", | ||
" ControlNumber char(12) primary key,\n", | ||
" CourseID char(12),\n", | ||
" TOPCode char(6), \n", | ||
" CreditStatus char(1), \n", | ||
" MaximumUnits float,\n", | ||
" MinimumUnits float, \n", | ||
" SAMCode char(1), \n", | ||
" Date date\n", | ||
");\n", | ||
"\n", | ||
"insert into course (ControlNumber, CourseID, TOPCode,\n", | ||
" CreditStatus, MaximumUnits, MinimumUnits, \n", | ||
" SAMCode, Date)\n", | ||
" select `Control Number`, `Course ID`, `TOP Code`, \n", | ||
" `Credit Status`, `Maximum Units`, `Minimum Units`, \n", | ||
" `SAM Status`, `Issue/Update Date`\n", | ||
" from MasterCourseFile;" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "particular-heather", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 2: Keys in the Program Data \n", | ||
"\n", | ||
"This query shows the problem with the program data: " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "grave-broad", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"select `Program Control Number`, count(*) \n", | ||
" from ProgramFile\n", | ||
" group by `Program Control Number`\n", | ||
" having count(*) > 1;" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "generic-suicide", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"select * from ProgramFile\n", | ||
" where `Program Control Number` like ' ';" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "smooth-treasury", | ||
"metadata": {}, | ||
"source": [ | ||
"Some records don't have a control number. What can we do about it? Let's delete the rows and re-create the program table with a key." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "crazy-bowling", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"drop table if exists program;\n", | ||
"create table program ( \n", | ||
" ControlNumber char(5) primary key,\n", | ||
" Title varchar(64),\n", | ||
" TOPCode char(6), \n", | ||
" AwardType char(1), \n", | ||
" CreditType char(1), \n", | ||
" ApprovedDate date, \n", | ||
" Status varchar(16), \n", | ||
" InactiveDate date\n", | ||
");\n", | ||
"\n", | ||
"insert into program (ControlNumber, Title, TOPCode, AwardType, CreditType, ApprovedDate, Status, InactiveDate)\n", | ||
" select `Program Control Number`, `Title`, `TOP Code`, `Program Award`, \n", | ||
" `Credit Type`, IIF (`Approved Date` = '', NULL, `Approved Date`), \n", | ||
" TRIM(`Proposal Status`), IIF (`Inactive Date` = '', NULL, `Inactive Date`)\n", | ||
" from ProgramFile\n", | ||
" where `Program Control Number` != ' ';\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "voluntary-inspection", | ||
"metadata": {}, | ||
"source": [ | ||
"Note the `where` clause in the insert/select statement removes rows with empty control numbers." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "numerical-governor", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 3: Keys in the Intersection Table \n", | ||
"\n", | ||
"Now let's fix the `program_course` table to use our new key. This too will encounter problems with empty control numbers. Run this query to check for bad keys: " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "experimental-label", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"select `Program Control Number`, `Course Control Number`, count(*) \n", | ||
" from ProgramCourseFile \n", | ||
" group by `Program Control Number`, `Course Control Number`\n", | ||
" having count(*) > 1;" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ideal-pound", | ||
"metadata": {}, | ||
"source": [ | ||
"There are courses that map to empty programs. Again, we're going to discard this data because it's hard to figure out where these courses belong." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "hired-opportunity", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%sql sqlite:///real_data_2.sqlite3\n", | ||
"\n", | ||
"drop table if exists program_course;\n", | ||
"create table program_course (\n", | ||
" ProgramControlNumber char(5), \n", | ||
" CourseControlNumber char(12),\n", | ||
" primary key (ProgramControlNumber, CourseControlNumber),\n", | ||
" foreign key (ProgramControlNumber) \n", | ||
" references program (ControlNumber),\n", | ||
" foreign key (CourseControlNumber) \n", | ||
" references course (ControlNumber)\n", | ||
");\n", | ||
"\n", | ||
"insert into program_course (ProgramControlNumber, CourseControlNumber)\n", | ||
" select `Program Control Number`, `Course Control Number` \n", | ||
" from ProgramCourseFile\n", | ||
" where `Program Control Number` != ' '; " | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "english-deficit", | ||
"metadata": {}, | ||
"source": [ | ||
"## Turn In \n", | ||
"\n", | ||
"Download the `real_data_2.sqlite3` file and submit it on Canvas." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |