
Restore compatibility with models handling fewer tokens #104

Merged: 19 commits into main on Jan 14, 2025

Conversation

willdumm (Contributor)

This PR addresses #102. It still needs a unit test (started) and ideally at least a plan for how to apply the neutral model when we add additional tokens (there's some discussion in the issue about this).

from netam.common import aa_idx_tensor_of_str_ambig, force_spawn
from netam.models import TransformerBinarySelectionModelWiggleAct
from netam.dnsm import DNSMBurrito, DNSMDataset


def test_aa_idx_tensor_of_str_ambig():
    input_seq = "ACX"
-   expected_output = torch.tensor([0, 1, MAX_AA_TOKEN_IDX], dtype=torch.int)
+   expected_output = torch.tensor([0, 1, 20], dtype=torch.int)
Contributor

Variable please.

@matsen matsen left a comment (Contributor)

Nice! A few comments / questions here.

netam/models.py (Outdated)

        # Here we're ignoring sites containing tokens whose index is at or
        # beyond the embedding dimension.
        if "embedding_dim" in self.hyperparameters:
            consider_sites = aa_idxs < self.hyperparameters["embedding_dim"]
Contributor

I'm not sure how to read consider_sites. I think these are meaningful_sites? or aa_sites?

I guess I'm a little confused about what this section of code is doing. It seems like we're making a big empty tensor, partially filling it, then trimming it back again.

Contributor Author

I renamed it to model_valid_sites.

This code applies the model to an amino acid string, considering only the entries in that string that the model was trained to handle (marked by model_valid_sites). We construct a big result tensor of NaNs, feed the amino acid string to the model with the sites containing tokens the model can't handle stripped out, and then write the model's outputs back into the result tensor at the sites that were actually fed to the model.

For instance, if we have a heavy chain sequence QQQQ^, the model will see QQQQ, but the output will have first dimension of size 5, matching the input sequence length, and the last output row will be NaNs.

If we add a start token in the future, we can still feed sequences containing it to our model. For instance, given 1QQQQ, an old model that doesn't handle the new 1 token will see QQQQ, but the returned output will contain NaNs at the first site and will have first dimension of size 5, matching the input sequence length.
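
A minimal sketch of that flow (the helper name, the model call signature, and out_dim are illustrative assumptions, not the actual netam API):

```python
import torch


def apply_model_masking_invalid_tokens(model, aa_idxs, out_dim):
    """Hypothetical helper: run `model` only at sites whose token index the
    model was trained on; all other rows of the result stay NaN."""
    # Sites whose token index fits inside the model's embedding.
    model_valid_sites = aa_idxs < model.hyperparameters["embedding_dim"]
    # The result spans the full input length, pre-filled with NaNs.
    result = torch.full((aa_idxs.shape[0], out_dim), float("nan"))
    # Feed only the valid sites to the model, then scatter its outputs back
    # into the rows for those sites.
    valid_output = model(aa_idxs[model_valid_sites].unsqueeze(0)).squeeze(0)
    result[model_valid_sites] = valid_output
    return result
```

So for QQQQ^ the result has five rows and the row for ^ stays NaN, and for 1QQQQ it would be the first row that stays NaN.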

Contributor

Great!

Let's make a free function that encapsulates the test

if "embedding_dim" in self.hyperparameters:

with a nice name that we can look for when we strip this if out. I'm anticipating that we'll have a release in which we assume that all models have an embedding_dim.

(In other news, we could just go in and add an embedding dim into all the old models and assume they have them, right?)
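
For instance, something along these lines (the name here is hypothetical, and the thread below ends up resolving this differently, by giving the Single model an embedding_dim):

```python
def model_embedding_dim_is_known(hyperparameters) -> bool:
    """Hypothetical free function wrapping the back-compat check, so it's easy
    to find and delete once every serialized model carries embedding_dim."""
    return "embedding_dim" in hyperparameters
```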

Contributor Author

Do you mean a function that takes hyperparameters and aa_idxs and returns model_valid_sites?

The only subclass that doesn't have embedding_dim right now is the Single model. I could just add an embedding dimension for that model that's pinned to MAX_AA_TOKEN_IDX so that it always grows with new tokens. Then I'd be able to remove this test.

Would that be preferable to making a free function?

Contributor Author

I addressed this by adding an embedding_dim to the Single model.
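
Roughly, that resolution looks like the following sketch (the class is a stand-in, MAX_AA_TOKEN_IDX's real module isn't shown in this thread, and the + 1 assumes the valid-site test is aa_idxs < embedding_dim as in the excerpt above):

```python
MAX_AA_TOKEN_IDX = 20  # placeholder value; netam defines the real constant


class SingleModelSketch:
    """Stand-in for the Single model, with an embedding_dim pinned to the
    token set size so the model always grows when new tokens are added."""

    @property
    def hyperparameters(self):
        return {
            # + 1 so the highest token index still satisfies
            # aa_idxs < embedding_dim in the valid-site check.
            "embedding_dim": MAX_AA_TOKEN_IDX + 1,
        }
```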

netam/sequences.py (Outdated; review thread resolved)
tests/test_backward_compat.py (review thread resolved)
@willdumm (Contributor Author)

willdumm commented Jan 14, 2025

This now requires the companion PR https://github.com/matsengrp/dnsm-experiments-1/pull/77, which reverts the DASM output dimension to 20.

@willdumm willdumm marked this pull request as ready for review on January 14, 2025 at 20:06
@willdumm willdumm changed the title from "102 token back compat" to "Restore compatibility with models handling fewer tokens" on Jan 14, 2025
@matsen matsen left a comment (Contributor)

one little pending suggestion!

(The pending suggestion is the same "free function" comment on netam/models.py quoted above.)

@willdumm willdumm merged commit 5cac227 into main Jan 14, 2025
2 checks passed
@willdumm willdumm deleted the 102-token-back-compat branch January 14, 2025 22:43