This repository has been archived by the owner on Apr 22, 2024. It is now read-only.

Create job manager, enable append tables, add metadata tables (#14, #33) #62

Merged
Changes from all commits (25 commits)
f6abfff
Add manager WIP
ssciolla Apr 2, 2020
bd5a9ff
Change name of environ so it is not gitignored
ssciolla Apr 2, 2020
4ee35b9
Fix start_time var bug; fix spacing; implement HOW_STARTED
ssciolla Apr 2, 2020
a441b9b
Reorder log messages
ssciolla Apr 2, 2020
7cfafe2
Merge with master
ssciolla Apr 3, 2020
0ffb2d2
Merge branch 'master' into issue-33-create-job-manager
ssciolla Apr 3, 2020
6b3ee30
Remove unused imports; add wait log message
ssciolla Apr 3, 2020
74695f2
Enable APPEND_TABLE_NAMES
ssciolla Apr 4, 2020
bf1e1b3
Move gql_queries.py
ssciolla Apr 4, 2020
827bc51
Add job_run/data_source_status tables; update run_jobs and inventory …
ssciolla Apr 6, 2020
6a86dcd
Finish merge with master after canvas_course_usage changes
ssciolla Apr 6, 2020
b7cc874
Tweak imports; add condition with warning checking for no data sources
ssciolla Apr 6, 2020
599c719
Add JOB_NAMES to env_blank.json
ssciolla Apr 6, 2020
a1e7c02
Tidy up, add comment to environ
ssciolla Apr 6, 2020
3cb86e9
Bump up add_meta_tables migration to 9
ssciolla Apr 6, 2020
663450e
Extend sleep time, since first start-up takes longer
ssciolla Apr 6, 2020
a8e0003
Update overview; add new entries to env.json table; add new job imple…
ssciolla Apr 6, 2020
879d767
Tweak some language; remove extra returns in doc
ssciolla Apr 6, 2020
baca3a4
Fix and update type hints in run_jobs.py
ssciolla Apr 6, 2020
2fd2642
Fix numbering in new README section
ssciolla Apr 6, 2020
a1bfb76
Add missing bracket; use __members__ instead of try/except with KeyError
ssciolla Apr 7, 2020
ef1af52
Fix enum in run method
ssciolla Apr 7, 2020
eadda68
Fixed bug; parse JSON strings if found in environment variable
ssciolla Apr 7, 2020
086ada3
Fix logged value if overridden
ssciolla Apr 7, 2020
34f728a
Add Union to imports, fixing Codacy issues
ssciolla Apr 7, 2020
2 changes: 1 addition & 1 deletion Dockerfile
@@ -17,6 +17,6 @@ COPY . /app/
ENV TZ=America/Detroit
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

CMD ["python", "inventory.py"]
CMD ["python", "run_jobs.py"]

# Done!
39 changes: 30 additions & 9 deletions README.md
@@ -3,12 +3,10 @@

## Overview

The course-inventory application is designed to gather current-term Canvas LMS data about courses, enrollments, and users in order to inform leadership at the University of Michigan about the status and utilization of Canvas courses. Currently, the application collects data from the Canvas API and the Unizin Data Warehouse and then stores the data in an external MySQL database. Tableau dashboards and other processes then consume that data to generate reports and visualizations.

The course-inventory application is designed to gather current-term Canvas LMS data about courses, enrollments, users, and course activity -- as well as data about the usage of technology for remote learning, including BlueJeans, Zoom, and MiVideo -- in order to inform leadership at the University of Michigan about the status and utilization of tools for teaching and learning. Currently, the application collects data from various APIs and data services managed by Unizin Consortium and then stores the data in an external MySQL database. Tableau dashboards and other processes then consume that data to generate reports and visualizations.

## Development


### Prerequisites

The sections below provide instructions for configuring, installing, using, and changing the application. Depending on the environment you plan to run the application in, you may also need to install some or all of the following:
@@ -19,7 +17,6 @@ The sections below provide instructions for configuring, installing, using, and

While performing any of the actions described below, use a terminal, text editor, or file utility as necessary. Some sample command-line instructions are provided for some steps.


### Configuration

To configure the application before installation and usage (see the next section), you must first perform a few steps, including the creation of a configuration file called `env.json`. Complete the following items in order.
@@ -50,6 +47,7 @@ To configure the application before installation and usage (see the next section
**Key** | **Description**
----- | -----
`LOG_LEVEL` | The minimum level for log messages that will appear in output. `INFO` or `DEBUG` is recommended for most use cases; see [Python's logging module](https://docs.python.org/3/library/logging.html).
`JOB_NAMES` | The names of one or more jobs (not case sensitive) that have been implemented and defined in `run_jobs.py` (see the **Implementing a New Job** section below).
`CANVAS_ACCOUNT_ID` | The Canvas instance root account ID number associated with the courses for which data will be collected.
`CANVAS_TERM_ID` | The Canvas instance term ID number that will be used to limit the query for Canvas courses.
`API_BASE_URL` | The base URL for making requests using the U-M API Directory; the default value should be correct.
@@ -69,11 +67,10 @@ To configure the application before installation and usage (see the next section
`UDW` | An object containing the necessary credential information for connecting to the Unizin Data Warehouse, where data will be pulled from.
`CREATE_CSVS` | A boolean (`true` or `false`) indicating whether CSVs should be generated by the execution.
`INVENTORY_DB` | An object containing the necessary credential information for connecting to a MySQL database, where output data will be inserted.

`APPEND_TABLE_NAMES` | An array of strings identifying tables that accumulate data and from which records should never be dropped programmatically.

### Installation & Usage


#### With Docker

This project provides a `docker-compose.yml` file to help simplify the development and testing process. Invoking `docker-compose` will set up MySQL and a database in a container, and then it will create a separate container for the job, which will ultimately insert records into the MySQL container's database.
@@ -112,7 +109,6 @@ Use `^C` to stop the running MySQL container, or -- if you used the detached fla

Data in the MySQL database will persist after the container is stopped, as MySQL data is stored in a volume that is mapped to a `.data/` directory in the project. To completely reset the database, delete the `.data` directory.


#### With a Virtual Environment

You can also set up the application using `virtualenv` by doing the following:
@@ -135,10 +131,9 @@ You can also set up the application using `virtualenv` by doing the following:

4. Run the application.
```
python inventory.py
python run_jobs.py
```


#### OpenShift Deployment

Deploying the application as a job using OpenShift and Jenkins involves several steps, which are beyond the scope of
@@ -164,6 +159,32 @@ this README. However, a few details about how the job is configured are provided
value: project_name
```

### Implementing a New Job

The application was designed with the goal of being extensible -- in order to aid collaboration, integrate new data sources, and satisfy new requirements. This is primarily made possible by enabling the creation of new jobs, which are managed by the `run_jobs.py` file (the starting point for Docker). When executed, the file will attempt to run all jobs named in the `JOB_NAMES` variable in `env.json`, but only jobs previously defined in the codebase will actually be executed.
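
As a rough illustration of that dispatch pattern, here is a minimal sketch. `run_jobs.py` itself is not part of the visible diff, so the helper name `run_job_loop` and the exact logging are assumptions; the enum-value format follows the steps documented below.
```
# Minimal sketch only; names other than ValidJobName and COURSE_INVENTORY are assumptions.
import importlib
import logging
from enum import Enum
from typing import Sequence

logger = logging.getLogger(__name__)


class ValidJobName(Enum):
    # value format: "package.module.entry_function"
    COURSE_INVENTORY = 'course_inventory.inventory.run_course_inventory'


def run_job_loop(job_names: Sequence[str]) -> None:
    for job_name in job_names:
        name_upper = job_name.upper()
        if name_upper not in ValidJobName.__members__:
            logger.warning(f'{job_name} is not a valid job name; skipping it')
            continue
        package_name, module_name, func_name = ValidJobName[name_upper].value.split('.')
        module = importlib.import_module(f'{package_name}.{module_name}')
        entry_func = getattr(module, func_name)
        data_sources = entry_func()  # each job returns a list of data source dictionaries
        logger.info(f'{name_upper} reported {len(data_sources)} data source(s)')
```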

Follow the steps below to implement a new job that can be executed from `run_jobs.py`. All the changes described below (minus the configuration changes) should be included in the pull request introducing the new job.

1. Place files used only by the new job within a separate, appropriately named package (e.g. `course_inventory` or `online_meetings`).

2. Make use of variables from the `env.json` configuration file by importing the `ENV` variable from `environ.py`.

3. Ensure you have one function or method defined that will kick off all other steps in the job, and have it return a list of dictionaries, with each naming a data source used during the job and providing a timestamp of when that data was updated (or collected).

These dictionaries will be used to create new records in the `data_source_status` table of the MySQL database. Each dictionary should have the following format:
```
{
"data_source_name": "SOME_DATA_SOURCE",
"data_updated_at": some_timestamp
}
```
If the data source provides a timestamp for the data, use that; otherwise, use the current time once all queries or requests to that data source have been made. For consistency, `some_timestamp` should be generated using [the `pandas` method `pd.to_datetime`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html), which accepts a number of time formats and objects and will return a `pd.Timestamp` object for single values. See the `run_course_inventory` entry function for the COURSE_INVENTORY job for an example, or the illustrative sketch after this list.

4. Add a new entry to the `ValidJobName` enumeration within `run_jobs.py`. The name (on the left) should be in all capitals. The value (on the right) should be a period-delimited path string, where the first element is the package name, the second is the module or file name, and the third is the name of the job's entry method or function. See `run_jobs.py` for examples.

5. If you are introducing a new data source, you also need to add an entry to the `ValidDataSourceName` enumeration. The name should be all capitals; the value has no meaning for the application, so `auto()` is sufficient.

6. Add the job name to the `JOB_NAMES` environment variable.
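
The sketch below ties steps 3 through 5 together using a hypothetical job. The package, module, function, and data source names are invented for illustration; only the dictionary format and the enumeration conventions come from the steps above.
```
# online_meetings/dummy_job.py -- a hypothetical module for illustration only
import time
from typing import Dict, Sequence, Union

import pandas as pd


def run_dummy_job() -> Sequence[Dict[str, Union[str, pd.Timestamp]]]:
    # ... gather data from the (hypothetical) DUMMY_API and store it here ...
    # This source supplies no timestamp of its own, so record the time at which
    # the requests finished, converted to a pd.Timestamp.
    return [
        {
            'data_source_name': 'DUMMY_API',
            'data_updated_at': pd.to_datetime(time.time(), unit='s', utc=True)
        }
    ]


# In run_jobs.py, the corresponding (hypothetical) enumeration entries might be:
#
# class ValidJobName(Enum):
#     DUMMY_JOB = 'online_meetings.dummy_job.run_dummy_job'
#
# class ValidDataSourceName(Enum):
#     DUMMY_API = auto()
```
Finally, per step 6, `"DUMMY_JOB"` would be added to the `JOB_NAMES` array in `env.json` so that `run_jobs.py` picks it up.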

### Database Management and Schema Changes

4 changes: 3 additions & 1 deletion config/env_blank.json
@@ -1,5 +1,6 @@
{
"LOG_LEVEL": "DEBUG",
"JOB_NAMES": ["COURSE_INVENTORY"],
"CANVAS_ACCOUNT_ID": 1,
"CANVAS_TERM_ID": 164,
"API_BASE_URL": "https://apigw.it.umich.edu",
@@ -30,5 +31,6 @@
"dbname": "course_inventory",
"user": "",
"password": ""
}
},
"APPEND_TABLE_NAMES": ["job_run", "data_source_status"]
}
File renamed without changes.
File renamed without changes.
67 changes: 38 additions & 29 deletions inventory.py → course_inventory/inventory.py
@@ -6,31 +6,23 @@
# third-party libraries
import pandas as pd
import psycopg2
from psycopg2.extensions import connection
from requests import Response
from umich_api.api_utils import ApiUtil

# local libraries
from db.db_creator import DBCreator
from canvas.published_date import FetchPublishedDate
from canvas.async_enroll_gatherer import AsyncEnrollGatherer
from gql_queries import queries as QUERIES
from canvas.canvas_course_usage import CanvasCourseUsage
from environ import ENV
from .async_enroll_gatherer import AsyncEnrollGatherer
from .canvas_course_usage import CanvasCourseUsage
from .gql_queries import queries as QUERIES
from .published_date import FetchPublishedDate


# Initialize settings and globals

logger = logging.getLogger(__name__)

try:
config_path = os.getenv("ENV_PATH", os.path.join('config', 'secrets', 'env.json'))
with open(config_path) as env_file:
ENV = json.loads(env_file.read())
except FileNotFoundError:
logger.error('Configuration file could not be found; please add env.json to the config directory.')

logging.basicConfig(level=ENV.get('LOG_LEVEL', 'DEBUG'),
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

ACCOUNT_ID = ENV.get('CANVAS_ACCOUNT_ID', 1)
TERM_ID = ENV['CANVAS_TERM_ID']

@@ -44,6 +36,7 @@

CREATE_CSVS = ENV.get('CREATE_CSVS', False)
INVENTORY_DB = ENV['INVENTORY_DB']
APPEND_TABLE_NAMES = ENV.get('APPEND_TABLE_NAMES', ['job_run', 'data_source_status'])


# Function(s)
Expand Down Expand Up @@ -138,8 +131,7 @@ def gather_course_data_from_api(account_id: int, term_id: int) -> pd.DataFrame:
return course_df


def pull_sis_user_data_from_udw(user_ids: Sequence[int]) -> pd.DataFrame:
udw_conn = psycopg2.connect(**ENV['UDW'])
def pull_sis_user_data_from_udw(user_ids: Sequence[int], conn: connection) -> pd.DataFrame:
users_string = ','.join([str(user_id) for user_id in user_ids])
user_query = f'''
SELECT u.canvas_id AS canvas_id,
@@ -150,13 +142,12 @@ def pull_sis_user_data_from_udw(user_ids: Sequence[int]) -> pd.DataFrame:
ON u.id=p.user_id
WHERE u.canvas_id in ({users_string});
'''
logger.info('Making user_dim query')
udw_user_df = pd.read_sql(user_query, udw_conn)
logger.info('Making user_dim and pseudonym_dim query against UDW')
udw_user_df = pd.read_sql(user_query, conn)
udw_user_df['sis_id'] = udw_user_df['sis_id'].map(process_sis_id, na_action='ignore')
# Found that the IDs are not necessarily unique, so dropping duplicates
udw_user_df = udw_user_df.drop_duplicates(subset=['canvas_id'])
logger.debug(udw_user_df.head())
udw_conn.close()
return udw_user_df


@@ -169,9 +160,9 @@ def process_sis_id(id: str) -> Union[int, None]:
return sis_id


def run_course_inventory() -> None:
def run_course_inventory() -> Sequence[Dict[str, Union[str, pd.Timestamp]]]:
logger.info("* run_course_inventory")
start = time.time()
logger.info('Making requests against the Canvas API')

# Gather course data
course_df = gather_course_data_from_api(ACCOUNT_ID, TERM_ID)
@@ -214,11 +205,30 @@ def run_course_inventory() -> None:
enroll_delta = time.time() - enroll_start
logger.info(f'Duration of process (seconds): {enroll_delta}')

# Record data source info for Canvas API
canvas_data_source = {
'data_source_name': 'CANVAS_API',
'data_updated_at': pd.to_datetime(time.time(), unit='s', utc=True)
}

udw_conn = psycopg2.connect(**ENV['UDW'])

# Pull SIS user data from Unizin Data Warehouse
udw_user_ids = user_df['canvas_id'].to_list()
sis_user_df = pull_sis_user_data_from_udw(udw_user_ids)
sis_user_df = pull_sis_user_data_from_udw(udw_user_ids, udw_conn)
user_df = pd.merge(user_df, sis_user_df, on='canvas_id', how='left')

# Record data source info for UDW
udw_meta_df = pd.read_sql('SELECT * FROM unizin_metadata;', udw_conn)
udw_update_datetime_str = udw_meta_df.iloc[1, 1]
udw_update_datetime = pd.to_datetime(udw_update_datetime_str, format='%Y-%m-%d %H:%M:%S.%f%z')
logger.info(f'Found canvasdatadate in UDW of {udw_update_datetime}')

udw_data_source = {
'data_source_name': 'UNIZIN_DATA_WAREHOUSE',
'data_updated_at': udw_update_datetime
}

# Produce output
num_course_records = len(course_df)
num_user_records = len(user_df)
@@ -250,11 +260,9 @@

# Empty tables (if any) in database, then migrate
logger.info('Emptying tables in DB')
db_creator_obj = DBCreator(INVENTORY_DB)
db_creator_obj = DBCreator(INVENTORY_DB, APPEND_TABLE_NAMES)
db_creator_obj.set_up()
if len(db_creator_obj.get_table_names()) > 0:
db_creator_obj.drop_records()
db_creator_obj.migrate()
db_creator_obj.drop_records()
db_creator_obj.tear_down()

# Insert gathered data
@@ -278,10 +286,11 @@
canvas_course_usage_df.to_sql('canvas_course_usage', db_creator_obj.engine, if_exists='append', index=False)
logger.info(f'Inserted data into canvas_course_usage table in {db_creator_obj.db_name}')

delta = time.time() - start
str_time = time.strftime("%H:%M:%S", time.gmtime(delta))
logger.info(f'Duration of run: {str_time}')
return [canvas_data_source, udw_data_source]


# Main Program

if __name__ == "__main__":
logging.basicConfig(level=ENV.get('LOG_LEVEL', 'DEBUG'))
run_course_inventory()
17 changes: 4 additions & 13 deletions create_db.py
@@ -1,30 +1,21 @@
# standard libraries
import json, logging, os

# third-party libraries
from sqlalchemy import create_engine
import logging

# local libraries
from db.db_creator import DBCreator

from environ import ENV

# Initializing settings and global variables

logger = logging.getLogger(__name__)

try:
config_path = os.getenv("ENV_PATH", os.path.join('config', 'secrets', 'env.json'))
with open(config_path) as env_file:
ENV = json.loads(env_file.read())
except FileNotFoundError:
logger.error('Configuration file could not be found; please add env.json to the config directory.')

DB_PARAMS = ENV['INVENTORY_DB']
APPEND_TABLE_NAMES = ENV.get('APPEND_TABLE_NAMES', ['job_run'])


# Main Program

if __name__ == '__main__':
logging.basicConfig(level=ENV.get('LOG_LEVEL', 'DEBUG'))
db_creator_obj = DBCreator(DB_PARAMS)
db_creator_obj = DBCreator(DB_PARAMS, APPEND_TABLE_NAMES)
db_creator_obj.set_up_database()
10 changes: 8 additions & 2 deletions db/db_creator.py
@@ -15,7 +15,12 @@

class DBCreator:

def __init__(self, db_params: Dict[str, str]) -> None:
def __init__(
self,
db_params: Dict[str, str],
append_table_names: Sequence[str] = []
) -> None:

self.db_name = db_params['dbname']
self.conn = None
self.conn_str = (
@@ -27,6 +32,7 @@ def __init__(self, db_params: Dict[str, str]) -> None:
f"/{db_params['dbname']}?charset=utf8"
)
self.engine = create_engine(self.conn_str)
self.append_table_names = append_table_names

def set_up(self) -> None:
logger.debug('set_up')
Expand All @@ -51,7 +57,7 @@ def drop_records(self) -> None:
logger.debug('drop_records')
self.conn.execute('SET FOREIGN_KEY_CHECKS=0;')
for table_name in self.get_table_names():
if 'yoyo' not in table_name:
if 'yoyo' not in table_name and table_name not in self.append_table_names:
logger.debug(f'Table Name: {table_name}')
self.conn.execute(f'DELETE FROM {table_name};')
logger.info(f'Dropped records in {table_name} in {self.db_name}')
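
For context, a brief usage sketch of the updated `DBCreator` follows. The connection parameters are placeholders modeled on `env_blank.json`; the real call sites are in `create_db.py` and `course_inventory/inventory.py` in this pull request.
```
# Placeholder connection parameters; see create_db.py and
# course_inventory/inventory.py in this pull request for the real call sites.
from db.db_creator import DBCreator

db_params = {
    'host': 'localhost',
    'port': '3306',
    'dbname': 'course_inventory',
    'user': 'admin',
    'password': 'secret'
}
append_table_names = ['job_run', 'data_source_status']

db_creator_obj = DBCreator(db_params, append_table_names)
db_creator_obj.set_up()
db_creator_obj.drop_records()  # empties every table except yoyo migration tables and the append tables
db_creator_obj.tear_down()
```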
33 changes: 33 additions & 0 deletions db/migrations/0009.add_meta_tables.py
@@ -0,0 +1,33 @@
#
# file: migrations/0009.add_meta_tables.py
#
from yoyo import step

__depends__ = {'0008.canvas_usage_table'}

step('''
CREATE TABLE IF NOT EXISTS job_run
(
id INTEGER NOT NULL UNIQUE AUTO_INCREMENT,
job_name VARCHAR(50) NOT NULL,
started_at DATETIME NOT NULL,
finished_at DATETIME NOT NULL,
PRIMARY KEY (id)
)
ENGINE=InnoDB
CHARACTER SET utf8mb4;
''')

step('''
CREATE TABLE IF NOT EXISTS data_source_status
(
id INTEGER NOT NULL UNIQUE AUTO_INCREMENT,
data_source_name VARCHAR(50) NOT NULL,
data_updated_at DATETIME NOT NULL,
job_run_id INTEGER NOT NULL,
PRIMARY KEY (id),
FOREIGN KEY (job_run_id) REFERENCES job_run(id) ON DELETE CASCADE ON UPDATE CASCADE
)
ENGINE=InnoDB
CHARACTER SET utf8mb4;
''')
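
To illustrate the foreign key created by this migration, here is a hedged sketch of how a `data_source_status` row might reference its `job_run` record. The connection URL and all values are invented placeholders; the real inserts are presumably handled by the job manager in `run_jobs.py`, which is not shown in this diff.
```
# Hypothetical values, only to illustrate the job_run <- data_source_status relationship.
from sqlalchemy import create_engine

# Placeholder connection URL; adjust the driver and credentials for your environment.
engine = create_engine('mysql+mysqldb://user:password@localhost:3306/course_inventory?charset=utf8')

with engine.connect() as conn:
    job_run_result = conn.execute(
        "INSERT INTO job_run (job_name, started_at, finished_at) "
        "VALUES ('COURSE_INVENTORY', '2020-04-07 10:00:00', '2020-04-07 10:05:00');"
    )
    # Each data_source_status record points back at the job run that produced it.
    conn.execute(
        "INSERT INTO data_source_status (data_source_name, data_updated_at, job_run_id) "
        f"VALUES ('CANVAS_API', '2020-04-07 10:05:00', {job_run_result.lastrowid});"
    )
```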
2 changes: 2 additions & 0 deletions docker-compose.yml
@@ -23,6 +23,8 @@ services:
dockerfile: Dockerfile
depends_on:
- mysql
environment:
- HOW_STARTED=DOCKER_COMPOSE
volumes:
- ${HOME}/secrets/course-inventory:/app/config/secrets
- ${HOME}/data/course-inventory:/app/data