This repository has been archived by the owner on Apr 22, 2024. It is now read-only.

Create job manager, enable append tables, add metadata tables (#14, #33) #62

Merged
Changes from all commits (25 commits)
f6abfff
Add manager WIP
ssciolla Apr 2, 2020
bd5a9ff
Change name of environ so it is not gitignored
ssciolla Apr 2, 2020
4ee35b9
Fix start_time var bug; fix spacing; implement HOW_STARTED
ssciolla Apr 2, 2020
a441b9b
Reorder log messages
ssciolla Apr 2, 2020
7cfafe2
Merge with master
ssciolla Apr 3, 2020
0ffb2d2
Merge branch 'master' into issue-33-create-job-manager
ssciolla Apr 3, 2020
6b3ee30
Remove unused imports; add wait log message
ssciolla Apr 3, 2020
74695f2
Enable APPEND_TABLE_NAMES
ssciolla Apr 4, 2020
bf1e1b3
Move gql_queries.py
ssciolla Apr 4, 2020
827bc51
Add job_run/data_source_status tables; update run_jobs and inventory …
ssciolla Apr 6, 2020
6a86dcd
Finish merge with master after canvas_course_usage changes
ssciolla Apr 6, 2020
b7cc874
Tweak imports; add condition with warning checking for no data sources
ssciolla Apr 6, 2020
599c719
Add JOB_NAMES to env_blank.json
ssciolla Apr 6, 2020
a1e7c02
Tidy up, add comment to environ
ssciolla Apr 6, 2020
3cb86e9
Bump up add_meta_tables migration to 9
ssciolla Apr 6, 2020
663450e
Extend sleep time, since first start-up takes longer
ssciolla Apr 6, 2020
a8e0003
Update overview; add new entries to env.json table; add new job imple…
ssciolla Apr 6, 2020
879d767
Tweak some language; remove extra returns in doc
ssciolla Apr 6, 2020
baca3a4
Fix and update type hints in run_jobs.py
ssciolla Apr 6, 2020
2fd2642
Fix numbering in new README section
ssciolla Apr 6, 2020
a1bfb76
Add missing bracket; use __members__ instead of try/except with KeyError
ssciolla Apr 7, 2020
ef1af52
Fix enum in run method
ssciolla Apr 7, 2020
eadda68
Fixed bug; parse JSON strings if found in environment variable
ssciolla Apr 7, 2020
086ada3
Fix logged value if overridden
ssciolla Apr 7, 2020
34f728a
Add Union to imports, fixing Codacy issues
ssciolla Apr 7, 2020
2 changes: 1 addition & 1 deletion Dockerfile
@@ -17,6 +17,6 @@ COPY . /app/
ENV TZ=America/Detroit
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

CMD ["python", "inventory.py"]
CMD ["python", "run_jobs.py"]

# Done!
39 changes: 30 additions & 9 deletions README.md
@@ -3,12 +3,10 @@

## Overview

The course-inventory application is designed to gather current-term Canvas LMS data about courses, enrollments, and users in order to inform leadership at the University of Michigan about the status and utilization of Canvas courses. Currently, the application collects data from the Canvas API and the Unizin Data Warehouse and then stores the data in an external MySQL database. Tableau dashboards and other processes then consume that data to generate reports and visualizations.

The course-inventory application is designed to gather current-term Canvas LMS data about courses, enrollments, users, and course activity -- as well as data about the usage of technology for remote learning, including BlueJeans, Zoom, and MiVideo -- in order to inform leadership at the University of Michigan about the status and utilization of tools for teaching and learning. Currently, the application collects data from various APIs and data services managed by Unizin Consortium and then stores the data in an external MySQL database. Tableau dashboards and other processes then consume that data to generate reports and visualizations.

## Development


### Prerequisites

The sections below provide instructions for configuring, installing, using, and changing the application. Depending on the environment you plan to run the application in, you may also need to install some or all of the following:
@@ -19,7 +17,6 @@ The sections below provide instructions for configuring, installing, using, and

While performing any of the actions described below, use a terminal, text editor, or file utility as necessary. Some sample command-line instructions are provided for some steps.


### Configuration

To configure the application before installation and usage (see the next section), you must first perform a few steps, including the creation of a configuration file called `env.json`. Complete the following items in order.
@@ -50,6 +47,7 @@ To configure the application before installation and usage (see the next section
**Key** | **Description**
----- | -----
`LOG_LEVEL` | The minimum level for log messages that will appear in output. `INFO` or `DEBUG` is recommended for most use cases; see [Python's logging module](https://docs.python.org/3/library/logging.html).
`JOB_NAMES` | The names of one or more jobs (not case sensitive) that have been implemented and defined in `run_jobs.py` (see the **Implementing a New Job** section below).
`CANVAS_ACCOUNT_ID` | The Canvas instance root account ID number associated with the courses for which data will be collected.
`CANVAS_TERM_ID` | The Canvas instance term ID number that will be used to limit the query for Canvas courses.
`API_BASE_URL` | The base URL for making requests using the U-M API Directory; the default value should be correct.
@@ -69,11 +67,10 @@ To configure the application before installation and usage (see the next section
`UDW` | An object containing the necessary credential information for connecting to the Unizin Data Warehouse, where data will be pulled from.
`CREATE_CSVS` | A boolean (`true` or `false`) indicating whether CSVs should be generated by the execution.
`INVENTORY_DB` | An object containing the necessary credential information for connecting to a MySQL database, where output data will be inserted.

`APPEND_TABLE_NAMES` | An array of strings identifying tables that accumulate data and from which records should never be dropped programmatically.

### Installation & Usage


#### With Docker

This project provides a `docker-compose.yml` file to help simplify the development and testing process. Invoking `docker-compose` will set up MySQL and a database in a container, and then it will create a separate container for the job, which will ultimately insert records into the MySQL container's database.
@@ -112,7 +109,6 @@ Use `^C` to stop the running MySQL container, or -- if you used the detached fla

Data in the MySQL database will persist after the container is stopped, as MySQL data is stored in a volume that is mapped to a `.data/` directory in the project. To completely reset the database, delete the `.data` directory.


#### With a Virtual Environment

You can also set up the application using `virtualenv` by doing the following:
@@ -135,10 +131,9 @@ You can also set up the application using `virtualenv` by doing the following:

4. Run the application.
```
python inventory.py
python run_jobs.py
```


#### OpenShift Deployment

Deploying the application as a job using OpenShift and Jenkins involves several steps, which are beyond the scope of
@@ -164,6 +159,32 @@ this README. However, a few details about how the job is configured are provided
value: project_name
```

### Implementing a New Job

The application was designed with the goal of being extensible -- in order to aid collaboration, integrate new data sources, and satisfy new requirements. This is primarily made possible by enabling the creation of new jobs, which are managed by the `run_jobs.py` file (the starting point for Docker). When executed, the file will attempt to run all jobs named in the `JOB_NAMES` variable in `env.json`, but only jobs previously defined in the codebase will actually be executed.
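
As a rough illustration of that dispatch pattern, here is a minimal sketch. `run_jobs.py` itself is not part of the visible diff, so the helper name `run_job_loop` and the exact logging are assumptions; the enum-value format follows the steps documented below.
```
# Minimal sketch only; names other than ValidJobName and COURSE_INVENTORY are assumptions.
import importlib
import logging
from enum import Enum
from typing import Sequence

logger = logging.getLogger(__name__)


class ValidJobName(Enum):
    # value format: "package.module.entry_function"
    COURSE_INVENTORY = 'course_inventory.inventory.run_course_inventory'


def run_job_loop(job_names: Sequence[str]) -> None:
    for job_name in job_names:
        name_upper = job_name.upper()
        if name_upper not in ValidJobName.__members__:
            logger.warning(f'{job_name} is not a valid job name; skipping it')
            continue
        package_name, module_name, func_name = ValidJobName[name_upper].value.split('.')
        module = importlib.import_module(f'{package_name}.{module_name}')
        entry_func = getattr(module, func_name)
        data_sources = entry_func()  # each job returns a list of data source dictionaries
        logger.info(f'{name_upper} reported {len(data_sources)} data source(s)')
```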

Follow the steps below to implement a new job that can be executed from `run_jobs.py`. All the changes described below (minus the configuration changes) should be included in the pull request introducing the new job.

1. Place files used only by the new job within a separate, appropriately named package (e.g. `course_inventory` or `online_meetings`).

2. Make use of variables from the `env.json` configuration file by importing the `ENV` variable from `environ.py`.

3. Ensure you have one function or method defined that will kick off all other steps in the job, and have it return a list of dictionaries, with each naming a data source used during the job and providing a timestamp of when that data was updated (or collected).

These dictionaries will be used to create new records in the `data_source_status` table of the MySQL database. Each dictionary should have the following format:
```
{
"data_source_name": "SOME_DATA_SOURCE",
"data_updated_at": some_timestamp
}
```
If the data source provides a timestamp for the data, use that; otherwise, use the current time once all queries or requests to that data source have been made. For consistency, `some_timestamp` should be generated using [the `pandas` method `pd.to_datetime`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html), which accepts a number of time formats and objects and will return a `pd.Timestamp` object for single values. See the `run_course_inventory` entry function for the COURSE_INVENTORY job for an example, or the illustrative sketch after this list.

4. Add a new entry to the `ValidJobName` enumeration within `run_jobs.py`. The name (on the left) should be in all capitals. The value (on the right) should be a period-delimited path string, where the first element is the package name, the second is the module or file name, and the third is the name of the job's entry method or function. See `run_jobs.py` for examples.

5. If you are introducing a new data source, you also need to add an entry to the `ValidDataSourceName` enumeration. The name should be all capitals; the value has no meaning for the application, so `auto()` is sufficient.

6. Add the job name to the `JOB_NAMES` environment variable.
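
The sketch below ties steps 3 through 5 together using a hypothetical job. The package, module, function, and data source names are invented for illustration; only the dictionary format and the enumeration conventions come from the steps above.
```
# online_meetings/dummy_job.py -- a hypothetical module for illustration only
import time
from typing import Dict, Sequence, Union

import pandas as pd


def run_dummy_job() -> Sequence[Dict[str, Union[str, pd.Timestamp]]]:
    # ... gather data from the (hypothetical) DUMMY_API and store it here ...
    # This source supplies no timestamp of its own, so record the time at which
    # the requests finished, converted to a pd.Timestamp.
    return [
        {
            'data_source_name': 'DUMMY_API',
            'data_updated_at': pd.to_datetime(time.time(), unit='s', utc=True)
        }
    ]


# In run_jobs.py, the corresponding (hypothetical) enumeration entries might be:
#
# class ValidJobName(Enum):
#     DUMMY_JOB = 'online_meetings.dummy_job.run_dummy_job'
#
# class ValidDataSourceName(Enum):
#     DUMMY_API = auto()
```
Finally, per step 6, `"DUMMY_JOB"` would be added to the `JOB_NAMES` array in `env.json` so that `run_jobs.py` picks it up.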

### Database Management and Schema Changes

4 changes: 3 additions & 1 deletion config/env_blank.json
@@ -1,5 +1,6 @@
{
"LOG_LEVEL": "DEBUG",
"JOB_NAMES": ["COURSE_INVENTORY"],
"CANVAS_ACCOUNT_ID": 1,
"CANVAS_TERM_ID": 164,
"API_BASE_URL": "https://apigw.it.umich.edu",
@@ -30,5 +31,6 @@
"dbname": "course_inventory",
"user": "",
"password": ""
}
},
"APPEND_TABLE_NAMES": ["job_run", "data_source_status"]
}
File renamed without changes.
File renamed without changes.
67 changes: 38 additions & 29 deletions inventory.py → course_inventory/inventory.py
@@ -6,31 +6,23 @@
# third-party libraries
import pandas as pd
import psycopg2
from psycopg2.extensions import connection
from requests import Response
from umich_api.api_utils import ApiUtil

# local libraries
from db.db_creator import DBCreator
from canvas.published_date import FetchPublishedDate
from canvas.async_enroll_gatherer import AsyncEnrollGatherer
from gql_queries import queries as QUERIES
from canvas.canvas_course_usage import CanvasCourseUsage
from environ import ENV
from .async_enroll_gatherer import AsyncEnrollGatherer
from .canvas_course_usage import CanvasCourseUsage
from .gql_queries import queries as QUERIES
from .published_date import FetchPublishedDate


# Initialize settings and globals

logger = logging.getLogger(__name__)

try:
config_path = os.getenv("ENV_PATH", os.path.join('config', 'secrets', 'env.json'))
with open(config_path) as env_file:
ENV = json.loads(env_file.read())
except FileNotFoundError:
logger.error('Configuration file could not be found; please add env.json to the config directory.')

logging.basicConfig(level=ENV.get('LOG_LEVEL', 'DEBUG'),
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

ACCOUNT_ID = ENV.get('CANVAS_ACCOUNT_ID', 1)
TERM_ID = ENV['CANVAS_TERM_ID']

@@ -44,6 +36,7 @@

CREATE_CSVS = ENV.get('CREATE_CSVS', False)
INVENTORY_DB = ENV['INVENTORY_DB']
APPEND_TABLE_NAMES = ENV.get('APPEND_TABLE_NAMES', ['job_run', 'data_source_status'])


# Function(s)
Expand Down Expand Up @@ -138,8 +131,7 @@ def gather_course_data_from_api(account_id: int, term_id: int) -> pd.DataFrame:
return course_df


def pull_sis_user_data_from_udw(user_ids: Sequence[int]) -> pd.DataFrame:
udw_conn = psycopg2.connect(**ENV['UDW'])
def pull_sis_user_data_from_udw(user_ids: Sequence[int], conn: connection) -> pd.DataFrame:
users_string = ','.join([str(user_id) for user_id in user_ids])
user_query = f'''
SELECT u.canvas_id AS canvas_id,
@@ -150,13 +142,12 @@ def pull_sis_user_data_from_udw(user_ids: Sequence[int]) -> pd.DataFrame:
ON u.id=p.user_id
WHERE u.canvas_id in ({users_string});
'''
logger.info('Making user_dim query')
udw_user_df = pd.read_sql(user_query, udw_conn)
logger.info('Making user_dim and pseudonym_dim query against UDW')
udw_user_df = pd.read_sql(user_query, conn)
udw_user_df['sis_id'] = udw_user_df['sis_id'].map(process_sis_id, na_action='ignore')
# Found that the IDs are not necessarily unique, so dropping duplicates
udw_user_df = udw_user_df.drop_duplicates(subset=['canvas_id'])
logger.debug(udw_user_df.head())
udw_conn.close()
return udw_user_df


@@ -169,9 +160,9 @@ def process_sis_id(id: str) -> Union[int, None]:
return sis_id


def run_course_inventory() -> None:
def run_course_inventory() -> Sequence[Dict[str, Union[str, pd.Timestamp]]]:
logger.info("* run_course_inventory")
start = time.time()
logger.info('Making requests against the Canvas API')

# Gather course data
course_df = gather_course_data_from_api(ACCOUNT_ID, TERM_ID)
@@ -214,11 +205,30 @@ def run_course_inventory() -> None:
enroll_delta = time.time() - enroll_start
logger.info(f'Duration of process (seconds): {enroll_delta}')

# Record data source info for Canvas API
canvas_data_source = {
'data_source_name': 'CANVAS_API',
'data_updated_at': pd.to_datetime(time.time(), unit='s', utc=True)
}

udw_conn = psycopg2.connect(**ENV['UDW'])

# Pull SIS user data from Unizin Data Warehouse
udw_user_ids = user_df['canvas_id'].to_list()
sis_user_df = pull_sis_user_data_from_udw(udw_user_ids)
sis_user_df = pull_sis_user_data_from_udw(udw_user_ids, udw_conn)
user_df = pd.merge(user_df, sis_user_df, on='canvas_id', how='left')

# Record data source info for UDW
udw_meta_df = pd.read_sql('SELECT * FROM unizin_metadata;', udw_conn)
udw_update_datetime_str = udw_meta_df.iloc[1, 1]
udw_update_datetime = pd.to_datetime(udw_update_datetime_str, format='%Y-%m-%d %H:%M:%S.%f%z')
logger.info(f'Found canvasdatadate in UDW of {udw_update_datetime}')

udw_data_source = {
'data_source_name': 'UNIZIN_DATA_WAREHOUSE',
'data_updated_at': udw_update_datetime
}

# Produce output
num_course_records = len(course_df)
num_user_records = len(user_df)
@@ -250,11 +260,9 @@

# Empty tables (if any) in database, then migrate
logger.info('Emptying tables in DB')
db_creator_obj = DBCreator(INVENTORY_DB)
db_creator_obj = DBCreator(INVENTORY_DB, APPEND_TABLE_NAMES)
db_creator_obj.set_up()
if len(db_creator_obj.get_table_names()) > 0:
db_creator_obj.drop_records()
db_creator_obj.migrate()
db_creator_obj.drop_records()
db_creator_obj.tear_down()

# Insert gathered data
@@ -278,10 +286,11 @@
canvas_course_usage_df.to_sql('canvas_course_usage', db_creator_obj.engine, if_exists='append', index=False)
logger.info(f'Inserted data into canvas_course_usage table in {db_creator_obj.db_name}')

delta = time.time() - start
str_time = time.strftime("%H:%M:%S", time.gmtime(delta))
logger.info(f'Duration of run: {str_time}')
return [canvas_data_source, udw_data_source]


# Main Program

if __name__ == "__main__":
logging.basicConfig(level=ENV.get('LOG_LEVEL', 'DEBUG'))
run_course_inventory()
17 changes: 4 additions & 13 deletions create_db.py
@@ -1,30 +1,21 @@
# standard libraries
import json, logging, os

# third-party libraries
from sqlalchemy import create_engine
import logging

# local libraries
from db.db_creator import DBCreator

from environ import ENV

# Initializing settings and global variables

logger = logging.getLogger(__name__)

try:
config_path = os.getenv("ENV_PATH", os.path.join('config', 'secrets', 'env.json'))
with open(config_path) as env_file:
ENV = json.loads(env_file.read())
except FileNotFoundError:
logger.error('Configuration file could not be found; please add env.json to the config directory.')

DB_PARAMS = ENV['INVENTORY_DB']
APPEND_TABLE_NAMES = ENV.get('APPEND_TABLE_NAMES', ['job_run'])


# Main Program

if __name__ == '__main__':
logging.basicConfig(level=ENV.get('LOG_LEVEL', 'DEBUG'))
db_creator_obj = DBCreator(DB_PARAMS)
db_creator_obj = DBCreator(DB_PARAMS, APPEND_TABLE_NAMES)
db_creator_obj.set_up_database()
10 changes: 8 additions & 2 deletions db/db_creator.py
@@ -15,7 +15,12 @@

class DBCreator:

def __init__(self, db_params: Dict[str, str]) -> None:
def __init__(
self,
db_params: Dict[str, str],
append_table_names: Sequence[str] = []
) -> None:

self.db_name = db_params['dbname']
self.conn = None
self.conn_str = (
@@ -27,6 +32,7 @@ def __init__(self, db_params: Dict[str, str]) -> None:
f"/{db_params['dbname']}?charset=utf8"
)
self.engine = create_engine(self.conn_str)
self.append_table_names = append_table_names

def set_up(self) -> None:
logger.debug('set_up')
Expand All @@ -51,7 +57,7 @@ def drop_records(self) -> None:
logger.debug('drop_records')
self.conn.execute('SET FOREIGN_KEY_CHECKS=0;')
for table_name in self.get_table_names():
if 'yoyo' not in table_name:
if 'yoyo' not in table_name and table_name not in self.append_table_names:
logger.debug(f'Table Name: {table_name}')
self.conn.execute(f'DELETE FROM {table_name};')
logger.info(f'Dropped records in {table_name} in {self.db_name}')
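
For context, a brief usage sketch of the updated `DBCreator` follows. The connection parameters are placeholders modeled on `env_blank.json`; the real call sites are in `create_db.py` and `course_inventory/inventory.py` in this pull request.
```
# Placeholder connection parameters; see create_db.py and
# course_inventory/inventory.py in this pull request for the real call sites.
from db.db_creator import DBCreator

db_params = {
    'host': 'localhost',
    'port': '3306',
    'dbname': 'course_inventory',
    'user': 'admin',
    'password': 'secret'
}
append_table_names = ['job_run', 'data_source_status']

db_creator_obj = DBCreator(db_params, append_table_names)
db_creator_obj.set_up()
db_creator_obj.drop_records()  # empties every table except yoyo migration tables and the append tables
db_creator_obj.tear_down()
```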
33 changes: 33 additions & 0 deletions db/migrations/0009.add_meta_tables.py
@@ -0,0 +1,33 @@
#
# file: migrations/0009.add_meta_tables.py
#
from yoyo import step

__depends__ = {'0008.canvas_usage_table'}

step('''
CREATE TABLE IF NOT EXISTS job_run
(
id INTEGER NOT NULL UNIQUE AUTO_INCREMENT,
job_name VARCHAR(50) NOT NULL,
started_at DATETIME NOT NULL,
finished_at DATETIME NOT NULL,
PRIMARY KEY (id)
)
ENGINE=InnoDB
CHARACTER SET utf8mb4;
''')

step('''
CREATE TABLE IF NOT EXISTS data_source_status
(
id INTEGER NOT NULL UNIQUE AUTO_INCREMENT,
data_source_name VARCHAR(50) NOT NULL,
data_updated_at DATETIME NOT NULL,
job_run_id INTEGER NOT NULL,
PRIMARY KEY (id),
FOREIGN KEY (job_run_id) REFERENCES job_run(id) ON DELETE CASCADE ON UPDATE CASCADE
)
ENGINE=InnoDB
CHARACTER SET utf8mb4;
''')
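
To illustrate the foreign key created by this migration, here is a hedged sketch of how a `data_source_status` row might reference its `job_run` record. The connection URL and all values are invented placeholders; the real inserts are presumably handled by the job manager in `run_jobs.py`, which is not shown in this diff.
```
# Hypothetical values, only to illustrate the job_run <- data_source_status relationship.
from sqlalchemy import create_engine

# Placeholder connection URL; adjust the driver and credentials for your environment.
engine = create_engine('mysql+mysqldb://user:password@localhost:3306/course_inventory?charset=utf8')

with engine.connect() as conn:
    job_run_result = conn.execute(
        "INSERT INTO job_run (job_name, started_at, finished_at) "
        "VALUES ('COURSE_INVENTORY', '2020-04-07 10:00:00', '2020-04-07 10:05:00');"
    )
    # Each data_source_status record points back at the job run that produced it.
    conn.execute(
        "INSERT INTO data_source_status (data_source_name, data_updated_at, job_run_id) "
        f"VALUES ('CANVAS_API', '2020-04-07 10:05:00', {job_run_result.lastrowid});"
    )
```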
2 changes: 2 additions & 0 deletions docker-compose.yml
@@ -23,6 +23,8 @@ services:
dockerfile: Dockerfile
depends_on:
- mysql
environment:
- HOW_STARTED=DOCKER_COMPOSE
volumes:
- ${HOME}/secrets/course-inventory:/app/config/secrets
- ${HOME}/data/course-inventory:/app/data