Support different backends for nn #1264

Closed
albertz opened this issue Oct 26, 2022 · 5 comments

albertz commented Oct 26, 2022

Edit: What is referred to as rc.nn here was the RETURNN-common nn API. We decided to move this API over to RETURNN itself, where we just call it the "frontend API". Also see #1120, specifically about PyTorch.

Edit: This issue is now part of #1120.

While thinking about how to integrate rc.nn better into RETURNN (#1185), specifically how to construct the layers directly instead of going through the net dict, and also thinking about PyTorch in RETURNN (#1120) and how rc.nn could be useful for PyTorch as well, I came to the conclusion that we should design nn in a way that makes it easy to switch out the backend. My current thoughts:

Currently, only the RETURNN net dict is supported, plus a limited number of RETURNN layers used directly in TF eager mode.

Everything works via the nn.make_layer function, which takes a RETURNN layer dict, just as you would normally write it in the RETURNN net dict.

nn.make_layer is already too RETURNN-specific. I'm thinking about a dedicated lower-level API in rc.nn which can be switched out for different backends. This is similar to what Keras had some time ago, where they defined all the low-level functions they used, for both TF and Theano. Our _generated_layers.py is almost already like that. We may want to clean it up a bit more and not really expose all RETURNN layers.
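
To make the idea more concrete, here is a minimal sketch of what such a switchable low-level backend API could look like. All names (Backend, TorchBackend, ReturnnNetDictBackend, get_backend) are hypothetical and only illustrate the pattern; the layer dicts are heavily simplified.

```python
from abc import ABC, abstractmethod
from typing import Any


class Backend(ABC):
    """Set of low-level ops that every backend must provide."""

    @abstractmethod
    def matmul(self, a: Any, b: Any) -> Any:
        ...

    @abstractmethod
    def reduce_sum(self, x: Any, *, axis: int) -> Any:
        ...


class TorchBackend(Backend):
    """Eager PyTorch implementation: computes values immediately."""

    def matmul(self, a, b):
        import torch
        return torch.matmul(a, b)

    def reduce_sum(self, x, *, axis: int):
        import torch
        return torch.sum(x, dim=axis)


class ReturnnNetDictBackend(Backend):
    """Builds RETURNN layer dicts instead of computing values (simplified)."""

    def matmul(self, a, b):
        return {"class": "dot", "from": [a, b]}

    def reduce_sum(self, x, *, axis: int):
        return {"class": "reduce", "mode": "sum", "from": x, "axis": axis}


_active_backend: Backend = TorchBackend()


def get_backend() -> Backend:
    """Higher-level nn functions would call low-level ops only through this."""
    return _active_backend
```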

All the logic in NameCtx is maybe also only needed for the RETURNN net dict backend, because it is almost entirely about how the RETURNN net dict is constructed in the end and how layer names are figured out. Some of the nn.Dim names (descriptions) also use it, but that is an aspect I'm not totally happy with anyway.

We also wanted to abstract ExternData/Data/Dim and make them framework-independent anyway (#1165). They basically already are; they just need to be cleaned up a bit and moved.
(Besides that, the Dim internals should also be cleaned up. See #975.)

TensorFlow directly could mean both graph mode and eager mode.
PyTorch is eager mode.

In the case of eager mode, we should take extra care that it is efficient to run. We never really optimized this much, since so far the code only created the graph once and the runtime would later operate on the graph alone. But in eager mode, it gets executed again and again. The RETURNN net dict would be way too much overhead for eager mode, and for the other backends we need to be careful as well. I'm not sure whether Data is already too much overhead; probably it needs to be optimized.

And then, nn.Tensor is another wrapper. Currently it is its own data structure, but we might make it an alias to torch.Tensor or tensorflow.Tensor and somehow attach our meta information to it (all the Data stuff, specifically the Dims). I'm not sure. In any case, the overhead of rc.nn must be really minimal, otherwise it would not really be attractive.

In the case of PyTorch, I was thinking about using their named dimensions and keeping a global dimension tag register, making sure that all names are unique. That way you can always get the reference to the Dim object given a pure torch.Tensor.
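
A rough, purely illustrative sketch of that global dimension-tag register idea, assuming PyTorch named tensors work as needed (see the later comment below on why named tensors turned out not to be viable). The Dim class and its methods here are hypothetical, not the actual rc.nn API.

```python
from typing import Dict, Optional

import torch


class Dim:
    """Dimension tag with a globally unique name, kept in a global register."""

    _registry: Dict[str, "Dim"] = {}

    def __init__(self, name: str, size: Optional[int] = None):
        assert name not in Dim._registry, f"dim name {name!r} must be unique"
        self.name = name
        self.size = size
        Dim._registry[name] = self

    @classmethod
    def from_name(cls, name: str) -> "Dim":
        return cls._registry[name]


batch_dim = Dim("batch")
feature_dim = Dim("feature", 32)

# A pure named torch.Tensor; the Dim objects can be recovered from the
# dim names alone, without any extra wrapper around the tensor.
x = torch.zeros(8, 32, names=("batch", "feature"))
dims = [Dim.from_name(name) for name in x.names]
assert dims == [batch_dim, feature_dim]
```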

In principle, I think it should be possible to make this efficient and keep the overhead minimal, so that PyTorch and TF eager mode really are also potential backends.

For now, we should just keep such potential plans in mind when thinking about the internal design or the API of nn. It should not be too RETURNN- or TF-specific, except maybe for ExternData, Data and the Dims.


So, effectively, what advantage would rc.nn provide over just using PyTorch directly?

  • The Dim object, and very consistent use of it.
    • This allows for cleaner code in many parts, less errors, easier debugging.
    • This also allows for automatic batching, like jax.vmap, in an efficient and straightforward way.
  • Some amount of optimization in nn.Loop could make it more efficient.
  • Support for different backends. JAX or TF graph mode are probably slightly faster. For debugging, PyTorch or TF eager mode can be used, and later it can be switched to TF or JAX.

Related:


albertz commented Oct 26, 2022

Note, I added the first-release tag, but only so that we think about necessary API changes now, because once we have the first stable release, changing the API would be much more difficult.


albertz commented Nov 11, 2022

Another aspect of eager vs graph (symbolic) mode:

I think for __call__ and other functions that get executed, this is all fine.

However, in __init__, there is an important difference. In either case, it is executed only once. With symbolic computation, representing some value, e.g. based on a parameter, such as weight-normalized parameters, this is totally fine and the right thing to do for symbolic execution. However, in the case of eager execution, executing it only once is not helpful. E.g. in PyTorch, weight normalization uses _forward_pre_hooks to recalculate it again and again; see the sketch below.
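
For illustration, a simplified sketch of that eager-mode mechanism: the effective weight is recomputed by a forward pre-hook on every call, rather than once in __init__. This mirrors the idea behind torch.nn.utils.weight_norm but is not its actual implementation.

```python
import torch


def _recompute_weight(module: torch.nn.Module, inputs):
    # weight = g * v / ||v||, recomputed before every forward call
    v, g = module.weight_v, module.weight_g
    module.weight = g * v / v.norm(dim=1, keepdim=True)


layer = torch.nn.Linear(4, 3, bias=False)
# Reparametrize: v and g become the actual parameters, weight is derived.
layer.weight_v = torch.nn.Parameter(layer.weight.detach().clone())
layer.weight_g = torch.nn.Parameter(layer.weight.detach().norm(dim=1, keepdim=True))
del layer.weight  # from now on a plain attribute, set by the hook
layer.register_forward_pre_hook(_recompute_weight)

x = torch.randn(2, 4)
y = layer(x)  # the hook runs first, so the weight reflects the current v and g
```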

So far we only defined parameters in __init__, and maybe their initial values (nn.init.ParamInit) or maybe things like weight decay. This is fine for both eager and symbolic mode.

However, for any computation depending on a parameter which can potentially change, we need to think about this. It's not clear yet how to solve this. This becomes relevant for example for weight norm (#91).

(Edit: Maybe it's not so much of a problem when we wrap nn.Tensor anyway. It can then be either the tensor directly (when inside some __call__) or a symbolic representation (when inside some __init__, or all the time for TF graph mode). Post edit: This might make the logic way too complex. We should think of simpler solutions.)

Edit: I moved this to its own separate issue: rwth-i6/returnn_common#250


albertz commented Feb 14, 2023

In case of PyTorch, I was thinking about using their named dimensions, and somehow keeping a global dimension tag register, by making sure that all names are unique.

Note on this: Named tensors in PyTorch are quite incomplete (e.g. pytorch/pytorch#94586) and their development has stopped (pytorch/pytorch#60832). So using torch.Tensor directly is not really an option; we need our own wrapper class (Data).
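
A minimal sketch of what such a wrapper class could look like: a raw torch.Tensor plus one Dim tag per axis, with axes addressed via Dim objects instead of integer indices. Names here are illustrative only, not the actual RETURNN Tensor/Data API.

```python
from typing import List, Optional, Sequence

import torch


class Dim:
    def __init__(self, name: str, dimension: Optional[int] = None):
        self.name = name
        self.dimension = dimension


class Data:
    """Wraps a raw torch.Tensor together with one Dim tag per axis."""

    def __init__(self, raw: torch.Tensor, dims: Sequence[Dim]):
        assert raw.ndim == len(dims)
        self.raw = raw
        self.dims: List[Dim] = list(dims)

    def reduce_mean(self, axis: Dim) -> "Data":
        # Axes are addressed via Dim objects, never via raw integer indices.
        i = self.dims.index(axis)
        return Data(self.raw.mean(dim=i), [d for d in self.dims if d is not axis])


batch_dim, time_dim, feat_dim = Dim("batch"), Dim("time"), Dim("feature", 512)
x = Data(torch.randn(8, 20, 512), [batch_dim, time_dim, feat_dim])
y = x.reduce_mean(axis=time_dim)  # works regardless of where the time axis sits
assert [d.name for d in y.dims] == ["batch", "feature"]
```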


albertz commented Feb 27, 2023

There is quite some overlap with the PyTorch backend issue (#1120). We decided to define most of the core nn functions as a "frontend API" directly in RETURNN, and the RETURNN Data (renamed to Tensor) is supposed to be the main tensor class for this frontend API (already prepared in #1261). This frontend API, including the implementations for RETURNN layers and PyTorch, is part of RETURNN itself. So it makes more sense to move this issue over to RETURNN. Edit: Done.

albertz transferred this issue from rwth-i6/returnn_common Feb 27, 2023

albertz commented Mar 16, 2023

I think we can close this issue here, as most frontend-API-related discussion happens in #1120.
