
Merge pull request #43 from ECRL/dev
ML back-end, workflows, database additions/encoding, and more
tjkessler authored Jan 6, 2020
2 parents ca4d76d + 7fcfcee commit 5a989e2
Showing 23 changed files with 1,809 additions and 279 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -2,22 +2,22 @@

# ECNet: scalable, retrainable and deployable machine learning projects for fuel property prediction

[![GitHub version](https://badge.fury.io/gh/tjkessler%2FECNet.svg)](https://badge.fury.io/gh/tjkessler%2FECNet)
[![GitHub version](https://badge.fury.io/gh/ecrl%2FECNet.svg)](https://badge.fury.io/gh/ecrl%2FECNet)
[![PyPI version](https://badge.fury.io/py/ecnet.svg)](https://badge.fury.io/py/ecnet)
[![status](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f/status.svg)](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/TJKessler/ECNet/master/LICENSE.txt)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/ECRL/ECNet/master/LICENSE.txt)
[![Documentation Status](https://readthedocs.org/projects/ecnet/badge/?version=latest)](https://ecnet.readthedocs.io/en/latest/?badge=latest)
[![Build Status](https://dev.azure.com/uml-ecrl/package-management/_apis/build/status/ECRL.ECNet?branchName=master)](https://dev.azure.com/uml-ecrl/package-management/_build/latest?definitionId=1&branchName=master)

**ECNet** is an open source Python package for creating scalable, retrainable and deployable machine learning projects with a focus on fuel property prediction. An ECNet __project__ is a collection of __pools__, where each pool contains a neural network selected from a group of __candidate__ neural networks. Candidates are chosen to represent pools based on their ability to optimize certain learning criteria (for example, performing optimally on unseen data). Each pool contributes a prediction derived from input data, and these predictions are averaged to produce the project's final prediction. Using multiple pools allows a project to learn from a variety of learning and validation sets, which can reduce the project's prediction error. Projects can be saved and reused at a later time, allowing for additional training and deployment as predictive models.
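
As a rough sketch of the pool-averaging scheme described above (illustrative only: `pools` is a hypothetical list of objects exposing a `use` method, not ECNet's actual project API):

```python
import numpy as np

def project_predict(pools: list, x: np.ndarray) -> np.ndarray:
    """Average each pool's prediction to form the project's prediction."""
    # each pool's selected candidate network predicts for the same inputs
    predictions = [pool.use(x) for pool in pools]
    # the project's final prediction is the element-wise mean
    return np.mean(predictions, axis=0)
```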

[T. Sennott et al.](https://doi.org/10.1115/ICEF2013-19185) have shown that neural networks can be applied to cetane number prediction with relatively little error. ECNet provides scientists an open source tool for predicting key fuel properties of potential next-generation biofuels, reducing the need for costly fuel synthesis and experimentation.

<p align="center">
<img align="center" src="docs/img/workflow_diagram.png" width="50%" height="50%">
</p>

Using ECNet, [T. Kessler et al.](https://doi.org/10.1016/j.fuel.2017.06.015) have increased the generalizability of neural networks to predict the cetane number for a variety of molecular classes represented in our [cetane number database](https://github.com/TJKessler/ECNet/tree/master/databases), and have increased the accuracy of neural networks for predicting the cetane number of underrepresented molecular classes through targeted database expansion.
Using ECNet, [T. Kessler et al.](https://doi.org/10.1016/j.fuel.2017.06.015) have increased the generalizability of neural networks to predict the cetane number for a variety of molecular classes represented in our [cetane number database](https://github.com/ECRL/ECNet/tree/master/databases), and have increased the accuracy of neural networks for predicting the cetane number of underrepresented molecular classes through targeted database expansion.

Future plans for ECNet include:
- distributed candidate training for GPUs
@@ -34,4 +34,4 @@ To contribute to ECNet, make a pull request. Contributions should include tests

To report problems with the software or feature requests, file an issue. When reporting problems, include information such as error messages, your OS/environment and Python version.

For additional support/questions, contact Travis Kessler ([email protected]), Hernan Gelaf-Romer (hernan_gelafromer@student.uml.edu) and/or John Hunter Mack ([email protected]).
For additional support/questions, contact Travis Kessler (Travis_Kessler@student.uml.edu) and/or John Hunter Mack ([email protected]).
568 changes: 568 additions & 0 deletions databases/ysi_database_v2.0.csv

Large diffs are not rendered by default.

568 changes: 568 additions & 0 deletions databases/ysi_database_v2.1.csv

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion ecnet/__init__.py
@@ -1,2 +1,2 @@
from ecnet.server import Server
__version__ = '3.2.3'
__version__ = '3.3.0'
263 changes: 147 additions & 116 deletions ecnet/models/mlp.py
@@ -2,184 +2,215 @@
# -*- coding: utf-8 -*-
#
# ecnet/models/mlp.py
# v.3.2.3
# Developed in 2019 by Travis Kessler <[email protected]>
# v.3.3.0
# Developed in 2020 by Travis Kessler <[email protected]>
#
# Contains the "MultilayerPerceptron" (feed-forward neural network) class
#

# Stdlib imports
from os import environ
from re import compile, IGNORECASE
from os import devnull, environ
import sys

# 3rd party imports
from tensorflow import get_default_graph, logging
from numpy import array
stderr = sys.stderr
sys.stderr = open(devnull, 'w')
from keras.backend import clear_session, reset_uids
from keras.callbacks import EarlyStopping
from keras.layers import Dense
from keras.losses import mean_squared_error
from keras.metrics import mae
from keras.models import load_model, Sequential
from keras.optimizers import Adam
sys.stderr = stderr
from h5py import File
from numpy import array, string_, zeros
from tensorflow import config, Tensor
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# ECNet imports
from ecnet.utils.logging import logger

environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
logging.set_verbosity(logging.ERROR)

config.experimental_run_functions_eagerly(True)
H5_EXT = compile(r'.*\.h5', flags=IGNORECASE)


class MultilayerPerceptron:
def check_h5(filename: str):
''' Ensures a given filename has an `.h5` extension
Args:
filename (str): filename to check
'''

if H5_EXT.match(filename) is None:
raise ValueError(
'Invalid filename/extension, must be `.h5`: {}'.format(
filename
)
)
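
For illustration, a few hypothetical calls to the `check_h5` helper defined above:

```python
check_h5('model.h5')    # passes: extension matches the `.h5` pattern
check_h5('MODEL.H5')    # passes: H5_EXT is compiled with IGNORECASE
check_h5('model.txt')   # raises ValueError: filename must end in `.h5`
```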


class MultilayerPerceptron(Model):

def __init__(self, filename: str='model.h5'):
'''MultilayerPerceptron object: fits neural network to supplied inputs
and targets
def __init__(self, filename: str = 'model.h5'):
''' MultilayerPerceptron: Feed-forward neural network; variable number
of layers, variable size/activations of layers; handles training,
saving/loading of models
Args:
filename (str): path to model save location (.h5 extension)
filename (str): filename/path for the model (default: `model.h5`)
'''

if H5_EXT.match(filename) is None:
raise ValueError(
'Invalid filename/extension, must be `.h5`: {}'.format(
filename
)
)
super(MultilayerPerceptron, self).__init__()
check_h5(filename)
self._filename = filename
clear_session()
self._model = Sequential(name=filename.lower().replace('.h5', ''))
self._layers = []

def add_layer(self, num_neurons: int, activation: str,
input_dim: int=None):
'''Adds a fully-connected layer to the model
input_dim: int = None):
''' add_layer: adds a layer to the MLP; layers are added sequentially;
first layer must have input dimensionality specified
Args:
num_neurons (int): number of neurons for the layer
activation (str): activation function for the layer (see Keras
activation function documentation)
input_dim (int): if not None (input layer), specifies input
dimensionality
num_neurons (int): number of neurons in the layer
activation (str): activation function used by the layer; refer to
https://www.tensorflow.org/api_docs/python/tf/keras/activations
for usable activation functions
input_dim (int): specify input dimensionality for the first layer;
all other layers depend on previous layers' size (number of
neurons), should be kept as `None` (default value)
'''

self._model.add(Dense(
if len(self._layers) == 0:
if input_dim is None:
raise ValueError('First layer must have input_dim specified')

self._layers.append(Dense(
units=num_neurons,
activation=activation,
input_shape=(input_dim,)
))
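
A hypothetical construction sketch using `add_layer` as documented above (layer sizes and activations are arbitrary illustrative choices, not ECNet defaults):

```python
model = MultilayerPerceptron(filename='cn_model.h5')
model.add_layer(32, 'relu', input_dim=15)  # first layer: input_dim is required
model.add_layer(32, 'relu')                # later layers infer size from the previous layer
model.add_layer(1, 'linear')               # single regression output
```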

def fit(self, l_x: array, l_y: array, v_x: array=None, v_y: array=None,
epochs: int=1500, lr: float=0.001, beta_1: float=0.9,
beta_2: float=0.999, epsilon: float=0.0000001, decay: float=0.0,
v: int=0, batch_size: int=32):
'''Fits neural network to supplied inputs and targets
def call(self, x: Tensor) -> Tensor:
''' call: used by Model.fit (parent) to perform feed-forward operations
Args:
l_x (numpy.array): learning input data
l_y (numpy.array): learning target data
v_x (numpy.array): if not None, periodic validation is performed w/
these inputs
v_y (numpy.array): if not None, periodic validation is performed w/
these targets
epochs (int): number of learning epochs if not validating, maximum
number of learning epochs if performing periodic validation
lr (float): learning rate for Adam optimizer
beta_1 (float): beta_1 value for Adam optimizer
beta_2 (float): beta_2 value for Adam optimizer
epsilon (float): epsilon value for Adam optimizer
decay (float): learning rate decay for Adam optimizer
v (int): verbose training, `0` for no printing, `1` for printing
batch_size (int): number of learning samples per batch
x (tf.Tensor): data fed into first layer
Returns:
tf.Tensor: data resulting from last layer
'''

self._model.compile(
loss=mean_squared_error,
optimizer=Adam(
lr=lr,
beta_1=beta_1,
beta_2=beta_2,
epsilon=epsilon,
decay=decay
),
metrics=[mae]
)
for layer in self._layers:
x = layer(x)
return x

def fit(self, l_x: array, l_y: array, v_x: array = None, v_y: array = None,
epochs: int = 1500, lr: float = 0.001, beta_1: float = 0.9,
beta_2: float = 0.999, epsilon: float = 0.0000001,
decay: float = 0.0, v: int = 0, batch_size: int = 32,
patience: int = 128) -> tuple:
''' fit: trains model using supplied data; may supply additional data
to use as validation set (determines learning cutoff); hyperparameters
for Adam optimization function, batch size, patience (if validating)
may be specified
Args:
l_x (np.array): learning input data; each sub-iterable is a sample
l_y (np.array): learning target data; each sub-iterable is a sample
v_x (np.array): validation input data (`None` for no validation)
v_y (np.array): validation target data (`None` for no validation)
epochs (int): number of training iterations, max iterations if
performing validation
lr (float): learning rate of Adam optimization fn
beta_1 (float): first moment estimate of Adam optimization fn
beta_2 (float): second moment estimate of Adam optimization fn
epsilon (float): number to prevent division by zero in Adam fn
decay (float): decay of learning rate in Adam optimization fn
v (int): whether Model.fit (parent) is verbose (1 True, 0 False)
batch_size (int): size of each training batch
patience (int): maximum number of epochs to wait for a better
validation loss; if none is found, training terminates and the
best weights are restored
Returns:
tuple: (list: learn losses, list: valid losses); the lists are of
equal length, and each element represents the loss at the
corresponding epoch
'''

self.compile(optimizer=Adam(lr=lr, beta_1=beta_1, beta_2=beta_2,
epsilon=epsilon,
decay=decay),
loss=MeanSquaredError())

if v_x is not None and v_y is not None:
history = self._model.fit(
l_x,
l_y,
validation_data=(v_x, v_y),
callbacks=[EarlyStopping(
monitor='val_loss',
patience=250,
verbose=v,
mode='min',
restore_best_weights=True
)],
epochs=epochs,
verbose=v,
batch_size=batch_size
)
epochs = len(history.history['loss'])

callback = EarlyStopping(monitor='val_loss', patience=patience,
restore_best_weights=True)
history = super().fit(l_x, l_y, batch_size=batch_size,
epochs=epochs, verbose=v,
callbacks=[callback],
validation_data=(v_x, v_y))
return (history.history['loss'], history.history['val_loss'])

else:
self._model.fit(
l_x,
l_y,
epochs=epochs,
verbose=v,
batch_size=batch_size
)

logger.log('debug', 'Training complete after {} epochs'.format(epochs),
call_loc='MLP')
history = super().fit(l_x, l_y, batch_size=batch_size,
epochs=epochs, verbose=v)
return (history.history['loss'], [None for _ in range(epochs)])
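
A minimal training sketch against the `fit` signature above; `X_train`, `y_train`, `X_val` and `y_val` are hypothetical NumPy arrays of samples:

```python
# with validation data: stops early once val_loss fails to improve
# for `patience` epochs, restoring the best weights found
learn_losses, valid_losses = model.fit(
    X_train, y_train, v_x=X_val, v_y=y_val,
    epochs=1500, patience=128, batch_size=32)

# without validation data: trains for exactly `epochs` iterations
learn_losses, _ = model.fit(X_train, y_train, epochs=300)
```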

def use(self, x: array) -> array:
'''Uses neural network to predict values for supplied data
''' use: uses the model to predict values for supplied data
Args:
x (numpy.array): input data to predict for
x (np.array): input data to predict for
Returns
numpy.array: predictions
Returns:
np.array: predicted values
'''

with get_default_graph().as_default():
return self._model.predict(x)
return self.predict(x)

def save(self, filename: str=None):
'''Saves neural network to .h5 file
def save(self, filename: str = None):
''' save: saves the model weights, architecture to either the filename/
path specified when object was created, or new, supplied filename/path
filename (str): if None, uses MultilayerPerceptron._filename;
otherwise, saves to this file
Args:
filename (str): new filepath if different than init filename/path
'''

if filename is None:
filename = self._filename
if H5_EXT.match(filename) is None:
raise ValueError(
'Invalid filename/extension, must be `.h5`: {}'.format(
filename
)
)
self._model.save(filename)
check_h5(filename)
self.save_weights(filename, save_format='h5')
input_size = self.layers[0].get_config()['batch_input_shape'][1]
layer_sizes = [l.get_config()['units'] for l in self.layers]
layer_activ = [l.get_config()['activation'] for l in self.layers]
with File(filename, 'a') as hf:
hf['mlp_input_size'] = input_size
hf['mlp_layer_sizes'] = layer_sizes
hf['mlp_layer_activ'] = string_(layer_activ)
hf.close()
logger.log('debug', 'Model saved to {}'.format(filename),
call_loc='MLP')

def load(self, filename: str=None):
'''Loads neural network from .h5 file
def load(self, filename: str = None):
''' load: loads a saved model, restoring the architecture/weights;
loads from filename/path specified during object initialization,
unless new filename/path specified
Args:
filename (str): path to .h5 model file
filename (str): new filepath if different than init filename/path
'''

if filename is None:
filename = self._filename
self._model = load_model(filename)
with File(filename, 'r') as hf:
input_size = hf.get('mlp_input_size').value
layer_sizes = hf.get('mlp_layer_sizes').value
layer_activ = hf.get('mlp_layer_activ').value
hf.close()
self.add_layer(layer_sizes[0], layer_activ[0].decode('ascii'),
input_size)
for idx, layer in enumerate(layer_sizes[1:]):
self.add_layer(layer, layer_activ[idx].decode('ascii'))
self.build(input_shape=(None, input_size))
self.load_weights(filename)
logger.log('debug', 'Model loaded from {}'.format(filename),
call_loc='MLP')
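
Taken together, a hypothetical save/load round trip (filenames and data are illustrative):

```python
model.save('trained_model.h5')      # writes weights plus architecture metadata

restored = MultilayerPerceptron()
restored.load('trained_model.h5')   # rebuilds the layers, then restores weights
predictions = restored.use(X_test)  # X_test: hypothetical input array
```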