
Merge pull request #43 from ECRL/dev
ML back-end, workflows, database additions/encoding, and more
tjkessler authored Jan 6, 2020
2 parents ca4d76d + 7fcfcee commit 5a989e2
Showing 23 changed files with 1,809 additions and 279 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -2,22 +2,22 @@

# ECNet: scalable, retrainable and deployable machine learning projects for fuel property prediction

[![GitHub version](https://badge.fury.io/gh/tjkessler%2FECNet.svg)](https://badge.fury.io/gh/tjkessler%2FECNet)
[![GitHub version](https://badge.fury.io/gh/ecrl%2FECNet.svg)](https://badge.fury.io/gh/ecrl%2FECNet)
[![PyPI version](https://badge.fury.io/py/ecnet.svg)](https://badge.fury.io/py/ecnet)
[![status](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f/status.svg)](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/TJKessler/ECNet/master/LICENSE.txt)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/ECRL/ECNet/master/LICENSE.txt)
[![Documentation Status](https://readthedocs.org/projects/ecnet/badge/?version=latest)](https://ecnet.readthedocs.io/en/latest/?badge=latest)
[![Build Status](https://dev.azure.com/uml-ecrl/package-management/_apis/build/status/ECRL.ECNet?branchName=master)](https://dev.azure.com/uml-ecrl/package-management/_build/latest?definitionId=1&branchName=master)

**ECNet** is an open source Python package for creating scalable, retrainable and deployable machine learning projects with a focus on fuel property prediction. An ECNet __project__ is a collection of __pools__, where each pool contains a neural network selected from a group of __candidate__ neural networks. Candidates are chosen to represent pools based on their ability to optimize certain learning criteria (for example, performing optimally on unseen data). Each pool contributes a prediction derived from input data, and these predictions are averaged to produce the project's final prediction. Using multiple pools allows a project to learn from a variety of learning and validation sets, which can reduce the project's prediction error. Projects can be saved and reused at a later time, allowing for additional training and deployment as predictive models.
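
As a rough sketch of the pool-averaging scheme described above (illustrative only: `pools` is a hypothetical list of objects exposing a `use` method, not ECNet's actual project API):

```python
import numpy as np

def project_predict(pools: list, x: np.ndarray) -> np.ndarray:
    """Average each pool's prediction to form the project's prediction."""
    # each pool's selected candidate network predicts for the same inputs
    predictions = [pool.use(x) for pool in pools]
    # the project's final prediction is the element-wise mean
    return np.mean(predictions, axis=0)
```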

[T. Sennott et al.](https://doi.org/10.1115/ICEF2013-19185) have shown that neural networks can be applied to cetane number prediction with relatively little error. ECNet provides scientists an open source tool for predicting key fuel properties of potential next-generation biofuels, reducing the need for costly fuel synthesis and experimentation.

<p align="center">
<img align="center" src="docs/img/workflow_diagram.png" width="50%" height="50%">
</p>

Using ECNet, [T. Kessler et al.](https://doi.org/10.1016/j.fuel.2017.06.015) have increased the generalizability of neural networks to predict the cetane number for a variety of molecular classes represented in our [cetane number database](https://github.com/TJKessler/ECNet/tree/master/databases), and have increased the accuracy of neural networks for predicting the cetane number of underrepresented molecular classes through targeted database expansion.
Using ECNet, [T. Kessler et al.](https://doi.org/10.1016/j.fuel.2017.06.015) have increased the generalizability of neural networks to predict the cetane number for a variety of molecular classes represented in our [cetane number database](https://github.com/ECRL/ECNet/tree/master/databases), and have increased the accuracy of neural networks for predicting the cetane number of underrepresented molecular classes through targeted database expansion.

Future plans for ECNet include:
- distributed candidate training for GPUs
@@ -34,4 +34,4 @@ To contribute to ECNet, make a pull request. Contributions should include tests

To report problems with the software or feature requests, file an issue. When reporting problems, include information such as error messages, your OS/environment and Python version.

For additional support/questions, contact Travis Kessler ([email protected]), Hernan Gelaf-Romer (hernan_gelafromer@student.uml.edu) and/or John Hunter Mack ([email protected]).
For additional support/questions, contact Travis Kessler (Travis_Kessler@student.uml.edu) and/or John Hunter Mack ([email protected]).
568 changes: 568 additions & 0 deletions databases/ysi_database_v2.0.csv

Large diffs are not rendered by default.

568 changes: 568 additions & 0 deletions databases/ysi_database_v2.1.csv

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion ecnet/__init__.py
@@ -1,2 +1,2 @@
from ecnet.server import Server
__version__ = '3.2.3'
__version__ = '3.3.0'
263 changes: 147 additions & 116 deletions ecnet/models/mlp.py
@@ -2,184 +2,215 @@
# -*- coding: utf-8 -*-
#
# ecnet/models/mlp.py
# v.3.2.3
# Developed in 2019 by Travis Kessler <[email protected]>
# v.3.3.0
# Developed in 2020 by Travis Kessler <[email protected]>
#
# Contains the "MultilayerPerceptron" (feed-forward neural network) class
#

# Stdlib imports
from os import environ
from re import compile, IGNORECASE
from os import devnull, environ
import sys

# 3rd party imports
from tensorflow import get_default_graph, logging
from numpy import array
stderr = sys.stderr
sys.stderr = open(devnull, 'w')
from keras.backend import clear_session, reset_uids
from keras.callbacks import EarlyStopping
from keras.layers import Dense
from keras.losses import mean_squared_error
from keras.metrics import mae
from keras.models import load_model, Sequential
from keras.optimizers import Adam
sys.stderr = stderr
from h5py import File
from numpy import array, string_, zeros
from tensorflow import config, Tensor
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# ECNet imports
from ecnet.utils.logging import logger

environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
logging.set_verbosity(logging.ERROR)

config.experimental_run_functions_eagerly(True)
H5_EXT = compile(r'.*\.h5', flags=IGNORECASE)


class MultilayerPerceptron:
def check_h5(filename: str):
''' Ensures a given filename has an `.h5` extension
Args:
filename (str): filename to check
'''

if H5_EXT.match(filename) is None:
raise ValueError(
'Invalid filename/extension, must be `.h5`: {}'.format(
filename
)
)
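
For illustration, a few hypothetical calls to the `check_h5` helper defined above:

```python
check_h5('model.h5')    # passes: extension matches the `.h5` pattern
check_h5('MODEL.H5')    # passes: H5_EXT is compiled with IGNORECASE
check_h5('model.txt')   # raises ValueError: filename must end in `.h5`
```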


class MultilayerPerceptron(Model):

def __init__(self, filename: str='model.h5'):
'''MultilayerPerceptron object: fits neural network to supplied inputs
and targets
def __init__(self, filename: str = 'model.h5'):
''' MultilayerPerceptron: Feed-forward neural network; variable number
of layers, variable size/activations of layers; handles training,
saving/loading of models
Args:
filename (str): path to model save location (.h5 extension)
filename (str): filename/path for the model (default: `model.h5`)
'''

if H5_EXT.match(filename) is None:
raise ValueError(
'Invalid filename/extension, must be `.h5`: {}'.format(
filename
)
)
super(MultilayerPerceptron, self).__init__()
check_h5(filename)
self._filename = filename
clear_session()
self._model = Sequential(name=filename.lower().replace('.h5', ''))
self._layers = []

def add_layer(self, num_neurons: int, activation: str,
input_dim: int=None):
'''Adds a fully-connected layer to the model
input_dim: int = None):
''' add_layer: adds a layer to the MLP; layers are added sequentially;
first layer must have input dimensionality specified
Args:
num_neurons (int): number of neurons for the layer
activation (str): activation function for the layer (see Keras
activation function documentation)
input_dim (int): if not None (input layer), specifies input
dimensionality
num_neurons (int): number of neurons in the layer
activation (str): activation function used by the layer; refer to
https://www.tensorflow.org/api_docs/python/tf/keras/activations
for usable activation functions
input_dim (int): specify input dimensionality for the first layer;
all other layers depend on previous layers' size (number of
neurons), should be kept as `None` (default value)
'''

self._model.add(Dense(
if len(self._layers) == 0:
if input_dim is None:
raise ValueError('First layer must have input_dim specified')

self._layers.append(Dense(
units=num_neurons,
activation=activation,
input_shape=(input_dim,)
))
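
A hypothetical construction sketch using `add_layer` as documented above (layer sizes and activations are arbitrary illustrative choices, not ECNet defaults):

```python
model = MultilayerPerceptron(filename='cn_model.h5')
model.add_layer(32, 'relu', input_dim=15)  # first layer: input_dim is required
model.add_layer(32, 'relu')                # later layers infer size from the previous layer
model.add_layer(1, 'linear')               # single regression output
```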

def fit(self, l_x: array, l_y: array, v_x: array=None, v_y: array=None,
epochs: int=1500, lr: float=0.001, beta_1: float=0.9,
beta_2: float=0.999, epsilon: float=0.0000001, decay: float=0.0,
v: int=0, batch_size: int=32):
'''Fits neural network to supplied inputs and targets
def call(self, x: Tensor) -> Tensor:
''' call: used by Model.fit (parent) to perform feed-forward operations
Args:
l_x (numpy.array): learning input data
l_y (numpy.array): learning target data
v_x (numpy.array): if not None, periodic validation is performed w/
these inputs
v_y (numpy.array): if not None, periodic validation is performed w/
these targets
epochs (int): number of learning epochs if not validating, maximum
number of learning epochs if performing periodic validation
lr (float): learning rate for Adam optimizer
beta_1 (float): beta_1 value for Adam optimizer
beta_2 (float): beta_2 value for Adam optimizer
epsilon (float): epsilon value for Adam optimizer
decay (float): learning rate decay for Adam optimizer
v (int): verbose training, `0` for no printing, `1` for printing
batch_size (int): number of learning samples per batch
x (tf.Tensor): data fed into first layer
Returns:
tf.Tensor: data resulting from last layer
'''

self._model.compile(
loss=mean_squared_error,
optimizer=Adam(
lr=lr,
beta_1=beta_1,
beta_2=beta_2,
epsilon=epsilon,
decay=decay
),
metrics=[mae]
)
for layer in self._layers:
x = layer(x)
return x

def fit(self, l_x: array, l_y: array, v_x: array = None, v_y: array = None,
epochs: int = 1500, lr: float = 0.001, beta_1: float = 0.9,
beta_2: float = 0.999, epsilon: float = 0.0000001,
decay: float = 0.0, v: int = 0, batch_size: int = 32,
patience: int = 128) -> tuple:
''' fit: trains model using supplied data; may supply additional data
to use as validation set (determines learning cutoff); hyperparameters
for Adam optimization function, batch size, patience (if validating)
may be specified
Args:
l_x (np.array): learning input data; each sub-iterable is a sample
l_y (np.array): learning target data; each sub-iterable is a sample
v_x (np.array): validation input data (`None` for no validation)
v_y (np.array): validation target data (`None` for no validation)
epochs (int): number of training iterations, max iterations if
performing validation
lr (float): learning rate of Adam optimization fn
beta_1 (float): first moment estimate of Adam optimization fn
beta_2 (float): second moment estimate of Adam optimization fn
epsilon (float): number to prevent division by zero in Adam fn
decay (float): decay of learning rate in Adam optimization fn
v (int): whether Model.fit (parent) is verbose (1 True, 0 False)
batch_size (int): size of each training batch
patience (int): maximum number of epochs to wait for a better
validation loss; if none is found, training terminates and the
best weights are restored
Returns:
tuple: (list: learn losses, list: valid losses); the lists are of
equal length, and each element represents the loss at the
corresponding epoch
'''

self.compile(optimizer=Adam(lr=lr, beta_1=beta_1, beta_2=beta_2,
epsilon=epsilon,
decay=decay),
loss=MeanSquaredError())

if v_x is not None and v_y is not None:
history = self._model.fit(
l_x,
l_y,
validation_data=(v_x, v_y),
callbacks=[EarlyStopping(
monitor='val_loss',
patience=250,
verbose=v,
mode='min',
restore_best_weights=True
)],
epochs=epochs,
verbose=v,
batch_size=batch_size
)
epochs = len(history.history['loss'])

callback = EarlyStopping(monitor='val_loss', patience=patience,
restore_best_weights=True)
history = super().fit(l_x, l_y, batch_size=batch_size,
epochs=epochs, verbose=v,
callbacks=[callback],
validation_data=(v_x, v_y))
return (history.history['loss'], history.history['val_loss'])

else:
self._model.fit(
l_x,
l_y,
epochs=epochs,
verbose=v,
batch_size=batch_size
)

logger.log('debug', 'Training complete after {} epochs'.format(epochs),
call_loc='MLP')
history = super().fit(l_x, l_y, batch_size=batch_size,
epochs=epochs, verbose=v)
return (history.history['loss'], [None for _ in range(epochs)])
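
A minimal training sketch against the `fit` signature above; `X_train`, `y_train`, `X_val` and `y_val` are hypothetical NumPy arrays of samples:

```python
# with validation data: stops early once val_loss fails to improve
# for `patience` epochs, restoring the best weights found
learn_losses, valid_losses = model.fit(
    X_train, y_train, v_x=X_val, v_y=y_val,
    epochs=1500, patience=128, batch_size=32)

# without validation data: trains for exactly `epochs` iterations
learn_losses, _ = model.fit(X_train, y_train, epochs=300)
```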

def use(self, x: array) -> array:
'''Uses neural network to predict values for supplied data
''' use: uses the model to predict values for supplied data
Args:
x (numpy.array): input data to predict for
x (np.array): input data to predict for
Returns
numpy.array: predictions
Returns:
np.array: predicted values
'''

with get_default_graph().as_default():
return self._model.predict(x)
return self.predict(x)

def save(self, filename: str=None):
'''Saves neural network to .h5 file
def save(self, filename: str = None):
''' save: saves the model weights, architecture to either the filename/
path specified when object was created, or new, supplied filename/path
filename (str): if None, uses MultilayerPerceptron._filename;
otherwise, saves to this file
Args:
filename (str): new filepath if different than init filename/path
'''

if filename is None:
filename = self._filename
if H5_EXT.match(filename) is None:
raise ValueError(
'Invalid filename/extension, must be `.h5`: {}'.format(
filename
)
)
self._model.save(filename)
check_h5(filename)
self.save_weights(filename, save_format='h5')
input_size = self.layers[0].get_config()['batch_input_shape'][1]
layer_sizes = [l.get_config()['units'] for l in self.layers]
layer_activ = [l.get_config()['activation'] for l in self.layers]
with File(filename, 'a') as hf:
hf['mlp_input_size'] = input_size
hf['mlp_layer_sizes'] = layer_sizes
hf['mlp_layer_activ'] = string_(layer_activ)
hf.close()
logger.log('debug', 'Model saved to {}'.format(filename),
call_loc='MLP')

def load(self, filename: str=None):
'''Loads neural network from .h5 file
def load(self, filename: str = None):
''' load: loads a saved model, restoring the architecture/weights;
loads from filename/path specified during object initialization,
unless new filename/path specified
Args:
filename (str): path to .h5 model file
filename (str): new filepath if different than init filename/path
'''

if filename is None:
filename = self._filename
self._model = load_model(filename)
with File(filename, 'r') as hf:
input_size = hf.get('mlp_input_size').value
layer_sizes = hf.get('mlp_layer_sizes').value
layer_activ = hf.get('mlp_layer_activ').value
hf.close()
self.add_layer(layer_sizes[0], layer_activ[0].decode('ascii'),
input_size)
for idx, layer in enumerate(layer_sizes[1:]):
self.add_layer(layer, layer_activ[idx].decode('ascii'))
self.build(input_shape=(None, input_size))
self.load_weights(filename)
logger.log('debug', 'Model loaded from {}'.format(filename),
call_loc='MLP')
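
Taken together, a hypothetical save/load round trip (filenames and data are illustrative):

```python
model.save('trained_model.h5')      # writes weights plus architecture metadata

restored = MultilayerPerceptron()
restored.load('trained_model.h5')   # rebuilds the layers, then restores weights
predictions = restored.use(X_test)  # X_test: hypothetical input array
```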