Add Config class (#126)
* Add thinc.config.Config class

* Update azure pipelines

* Fix Python2.7 support

* Add catalogue dependency

* Update test_config

* Import registry

* Use catalogue for Optimizer

* Fix load optimizer from config

* Add config.from_str method

* Add registry class

* Add config.md docs

* Use JSON values in config

* Remove currently not set initializers and wires registries

* Add SimpleEmbed class

* Register layers

* Register FeatureExtractor and LayerNorm layers

* Tweak config docs
honnibal authored Dec 9, 2019
1 parent 3ce59db commit 2526a5a
Showing 14 changed files with 599 additions and 31 deletions.
115 changes: 115 additions & 0 deletions docs/config.md
@@ -0,0 +1,115 @@
# Configuration files

You can describe Thinc models and experiments using a configuration format
based on Python's built-in `configparser` module. Thinc adds a few conventions
on top of the built-in format to support non-string values and nested objects,
and to integrate with Thinc's *registry system*.

## Example config


```
[some_section]
key1 = "string value"
another_key = 1.0
# Comments, naturally
third_key = ["values", "are parsed with", "json.loads()"]
some_other_key =
    {
        "multiline values?": true
    }
# Describe nested sections with a dot notation in the section names.
[some_section.subsection]
# This will be moved, producing:
# config["some_section"]["subsection"] = {"hi": true, "bye": false}
hi = true
bye = false
[another_section]
more_values = "yes!"
null_values = null
interpolation = ${some_section:third_key}
```

The config format has two main differences from the built-in `configparser`
module's behaviour:

* JSON-formatted values. Thinc passes all values through `json.loads()` to
interpret them. You can use atomic values like strings, floats, integers,
or booleans, or you can use complex objects such as lists or maps.

* Structured sections. Thinc uses a dot notation to build nested sections. If
  you have a section named `[outer_section.subsection]`, Thinc will parse that
  into a nested structure, placing `subsection` within `outer_section`.

## Registry integration

Thinc's registry system lets you map string keys to functions. For instance,
let's say you want to define a new optimizer. You would define a function that
constructs it and add it to the appropriate registry, like so:

```python
import thinc

@thinc.registry.optimizers.register("my_cool_optimizer.v1")
def make_my_optimizer(learn_rate, gamma):
    return MyCoolOptimizer(learn_rate, gamma)

# Later you can retrieve your function by name:
create_optimizer = thinc.registry.optimizers.get("my_cool_optimizer.v1")
```

The registry lets you refer to your function by string name, which is
often more convenient than passing around the function itself. This is
especially useful for configuration files: you can provide the name of your
function and the arguments in the config file, and you'll have everything you
need to rebuild the object.

Since this is a common workflow, the registry system provides a shortcut for
it: the `registry.make_from_config()` function. To use it, you just need to
follow a simple convention in your config file.

If a section contains a key beginning with `@`, the `registry.make_from_config()`
function will interpret the rest of that key as the name of a registry. The
value will be interpreted as the name to look up in that registry. The rest of
the section will be passed to your function as arguments. Here's a simple example:

```
[optimizer]
@optimizers = "my_cool_optimizer.v1"
learn_rate = 0.001
gamma = 1e-8
```

The `registry.make_from_config()` function will fetch your
`make_my_optimizer` function from the `optimizers` registry, call it with the
`learn_rate` and `gamma` arguments, and place the result under the key
`"optimizer"`.
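To make the lookup-and-call step concrete, here is a toy sketch with a plain dict standing in for Thinc's registry. The names (`OPTIMIZERS`, `resolve_section`) are illustrative, not Thinc's API:

```python
# A hypothetical registry mapping string names to constructor functions.
OPTIMIZERS = {
    "my_cool_optimizer.v1": lambda learn_rate, gamma: {"learn_rate": learn_rate, "gamma": gamma},
}

def resolve_section(section):
    # Find the "@registry" key, look up the constructor by its value,
    # and call it with the remaining keys as keyword arguments.
    id_keys = [key for key in section if key.startswith("@")]
    if not id_keys:
        return section
    func = OPTIMIZERS[section[id_keys[0]]]
    kwargs = {k: v for k, v in section.items() if not k.startswith("@")}
    return func(**kwargs)

section = {"@optimizers": "my_cool_optimizer.v1", "learn_rate": 0.001, "gamma": 1e-8}
optimizer = resolve_section(section)
# optimizer == {"learn_rate": 0.001, "gamma": 1e-08}
```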

You can even use the `registry.make_from_config()` function to build recursive
structures. Let's say your optimizer supports some sort of fancy visualization
plug-in that Thinc has never heard of. All you would need to do is create a new
registry, named something like `visualizers`, and register a constructor
function, such as `my_visualizer.v1`. You would also make a new version of your
optimizer constructor that accepts the new value. Now you can describe the
visualizer plug-in in your config and use it as an argument to your optimizer:

```
[optimizer]
@optimizers = "my_cool_optimizer.v2"
learn_rate = 0.001
gamma = 1e-8
[optimizer.visualizer]
@visualizers = "my_visualizer.v1"
format = "jpeg"
host = "localhost"
port = "8080"
```

The `optimizer.visualizer` section will be placed under the
`optimizer` object, using the key `visualizer` (see "structured sections"
above). The `registry.make_from_config()` function will build the visualizer
first, so that the result is ready to pass to the optimizer.
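Spelled out with hypothetical constructor functions standing in for the registered ones, the build order for the nested config above works out to:

```python
# Hypothetical constructors standing in for the registered functions.
def make_my_visualizer(format, host, port):
    return {"format": format, "host": host, "port": port}

def make_my_optimizer(learn_rate, gamma, visualizer=None):
    return {"learn_rate": learn_rate, "gamma": gamma, "visualizer": visualizer}

# The innermost section is resolved first, then the result is passed to
# the outer constructor under the key "visualizer".
visualizer = make_my_visualizer(format="jpeg", host="localhost", port="8080")
optimizer = make_my_optimizer(learn_rate=0.001, gamma=1e-8, visualizer=visualizer)
```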
1 change: 1 addition & 0 deletions requirements.txt
@@ -5,6 +5,7 @@ preshed>=1.0.1,<3.1.0
blis>=0.4.0,<0.5.0
srsly>=0.0.6,<1.1.0
wasabi>=0.0.9,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third-party dependencies
numpy>=1.7.0
plac>=0.9.6,<1.2.0
1 change: 1 addition & 0 deletions setup.py
@@ -221,6 +221,7 @@ def setup_package():
"blis>=0.4.0,<0.5.0",
"wasabi>=0.0.9,<1.1.0",
"srsly>=0.0.6,<1.1.0",
"catalogue>=0.0.7,<1.1.0",
# Third-party dependencies
"numpy>=1.7.0",
"plac>=0.9.6,<1.2.0",
1 change: 1 addition & 0 deletions thinc/__init__.py
@@ -5,3 +5,4 @@
import numpy # noqa: F401

from .about import __name__, __version__ # noqa: F401
from ._registry import registry
71 changes: 71 additions & 0 deletions thinc/_registry.py
@@ -0,0 +1,71 @@
import catalogue


class registry(object):
    optimizers = catalogue.create("thinc", "optimizers", entry_points=True)
    schedules = catalogue.create("thinc", "schedules", entry_points=True)
    layers = catalogue.create("thinc", "layers", entry_points=True)

    @classmethod
    def get(cls, name, key):
        if not hasattr(cls, name):
            raise ValueError("Unknown registry: %s" % name)
        reg = getattr(cls, name)
        func = reg.get(key)
        if func is None:
            raise ValueError("Could not find %s in %s" % (key, name))
        return func

    @classmethod
    def make_optimizer(cls, name, args, kwargs):
        func = cls.optimizers.get(name)
        return func(*args, **kwargs)

    @classmethod
    def make_schedule(cls, name, args, kwargs):
        func = cls.schedules.get(name)
        return func(*args, **kwargs)

    @classmethod
    def make_initializer(cls, name, args, kwargs):
        func = cls.initializers.get(name)
        return func(*args, **kwargs)

    @classmethod
    def make_layer(cls, name, args, kwargs):
        func = cls.layers.get(name)
        return func(*args, **kwargs)

    @classmethod
    def make_combinator(cls, name, args, kwargs):
        func = cls.combinators.get(name)
        return func(*args, **kwargs)

    @classmethod
    def make_transform(cls, name, args, kwargs):
        func = cls.transforms.get(name)
        return func(*args, **kwargs)

    @classmethod
    def make_from_config(cls, config, id_start="@"):
        """Unpack a config dictionary, creating objects from the registry
        recursively.
        """
        id_keys = [key for key in config.keys() if key.startswith(id_start)]
        if len(id_keys) >= 2:
            raise ValueError("Multiple registry keys in config: %s" % id_keys)
        elif len(id_keys) == 0:
            return config
        else:
            getter = cls.get(id_keys[0].replace(id_start, ""), config[id_keys[0]])
            args = []
            kwargs = {}
            for key, value in config.items():
                if isinstance(value, dict):
                    value = cls.make_from_config(value, id_start=id_start)
                if isinstance(key, int) or key.isdigit():
                    args.append((int(key), value))
                elif not key.startswith(id_start):
                    kwargs[key] = value
            args = [value for key, value in sorted(args)]
            return getter(*args, **kwargs)
39 changes: 39 additions & 0 deletions thinc/config.py
@@ -0,0 +1,39 @@
from __future__ import unicode_literals

import configparser
import json
from pathlib import Path


class Config(dict):
    def __init__(self, data=None):
        dict.__init__(self)
        if data is None:
            data = {}
        self.update(data)

    def interpret_config(self, config):
        for section, values in config.items():
            parts = section.split(".")
            node = self
            for part in parts:
                node = node.setdefault(part, {})
            for key, value in values.items():
                node[key] = json.loads(config.get(section, key))

    def from_str(self, text):
        config = configparser.ConfigParser(
            interpolation=configparser.ExtendedInterpolation()
        )
        config.read_string(text)
        for key in list(self.keys()):
            self.pop(key)
        self.interpret_config(config)
        return self

    def from_bytes(self, byte_string):
        return self.from_str(byte_string.decode("utf8"))

    def from_disk(self, path):
        with Path(path).open("r", encoding="utf8") as file_:
            text = file_.read()
            return self.from_str(text)
23 changes: 22 additions & 1 deletion thinc/i2v.py
@@ -2,5 +2,26 @@
from __future__ import unicode_literals

from .neural._classes.hash_embed import HashEmbed # noqa: F401
from .neural._classes.embed import Embed # noqa: F401
from .neural._classes.embed import Embed, SimpleEmbed # noqa: F401
from .neural._classes.static_vectors import StaticVectors # noqa: F401
from ._registry import registry


@registry.layers.register("HashEmbed.v1")
def make_HashEmbed(outputs, rows, column, seed=None):
    return HashEmbed(outputs, rows, seed=seed, column=column)


@registry.layers.register("SimpleEmbed.v1")
def make_SimpleEmbed(outputs, rows, column):
    return SimpleEmbed(outputs, rows, column)


@registry.layers.register("EmbedAndProject.v1")
def make_EmbedAndProject(outputs, rows, column):
    return Embed(outputs, rows, column)


@registry.layers.register("StaticVectors.v1")
def make_StaticVectors(outputs, spacy_name, column, drop_factor=0.0):
    return StaticVectors(nO=outputs, lang=spacy_name, column=column)
11 changes: 11 additions & 0 deletions thinc/misc.py
@@ -7,3 +7,14 @@
from .neural._classes.feature_extracter import FeatureExtracter # noqa: F401
from .neural._classes.function_layer import FunctionLayer # noqa: F401
from .neural._classes.feed_forward import FeedForward # noqa: F401
from ._registry import registry


@registry.layers.register("FeatureExtractor.v1")
def make_FeatureExtractor(attrs):
    return FeatureExtracter(attrs)


@registry.layers.register("LayerNorm.v1")
def make_LayerNorm(outputs=None, child=None):
    return LayerNorm(nO=outputs, child=child)
65 changes: 57 additions & 8 deletions thinc/neural/_classes/embed.py
@@ -34,6 +34,63 @@ def LSUVinit(model, X, y=None):
return X


@describe.attributes(
    nM=Dimension("Vector dimensions"),
    nV=Dimension("Number of vectors"),
    vectors=Weights(
        "Embedding table", lambda obj: (obj.nV, obj.nM), _uniform_init(-0.1, 0.1)
    ),
    d_vectors=Gradient("vectors"),
)
class SimpleEmbed(Model):
    name = "simple-embed"

    def __init__(self, nO, nV=None, **kwargs):
        Model.__init__(self, **kwargs)
        self.column = kwargs.get("column", 0)
        self.nO = nO
        self.nV = nV

    def predict(self, ids):
        if ids.ndim == 2:
            ids = ids[:, self.column]
        ids = ids.copy()
        ids[ids >= self.nV] = 0
        return self.vectors[ids]

    def begin_update(self, ids, drop=0.0):
        if ids.ndim == 2:
            ids = ids[:, self.column]
        mask = self.ops.get_dropout_mask(ids.shape[0], drop)
        if mask is not None:
            ids = ids * (mask > 0)
        ids[ids >= self.nV] = 0
        vectors = self.vectors[ids]

        def finish_update(gradients, sgd=None):
            if hasattr(self.ops.xp, "scatter_add"):
                self.ops.xp.scatter_add(self.d_vectors, ids, gradients)
            else:
                self.ops.xp.add.at(self.d_vectors, ids, gradients)
            if sgd is not None:
                sgd(self._mem.weights, self._mem.gradient, key=self.id)
            return None

        return vectors, finish_update

    @contextlib.contextmanager
    def use_params(self, params):
        backup = None
        weights = self._mem.weights
        if self.id in params:
            param = params[self.id]
            backup = weights.copy()
            weights[:] = param
        yield
        if backup is not None:
            weights[:] = backup


@describe.on_data(LSUVinit)
@describe.attributes(
nM=Dimension("Vector dimensions"),
@@ -53,14 +110,6 @@ def LSUVinit(model, X, y=None):
class Embed(Model):
name = "embed"

# @property
# def input_shape(self):
# return (self.nB,)

# @property
# def output_shape(self):
# return (self.nB, self.nO)

@check.arg(1, is_int)
def __init__(self, nO, nM=None, nV=None, **kwargs):
Model.__init__(self, **kwargs)
