support migrations between different file ID managers #49

dlqqq · 2022-11-09T02:36:43Z

Description

Adds two methods to the base File ID manager that all implementations should override:
- export_rows(): a static method that returns a generator yielding a tuple of [id, path]
- import_rows(): an instance method that writes a tuple yielded by export_rows() to the database
Both ArbitraryFIDM and LocalFIDM support migrations between each other by default. Upon initialization, both implementations write their module name and class name to a "database manifest", which allows for subsequent reflection by the other FIDM. If a manifest already exists and it has a different module or class name, we perform a migration from the previous FIDM to the current FIDM. The migration strategy is roughly as follows:
- Read the old FIDM's module and class name from the manifest and store it.
- Backup the existing database by renaming it with the current timestamp, e.g. 2022-11-09T01:55:32-file_id_manager.db. Bind this to backup_db_path.
- define PrevManager = getattr(importlib.import_module(prev_module), prev_classname)
- finally, migrate the rows over:
```
for row in PrevManager.export_rows(backup_db_path):
  self.import_row(row)
```

The key principle is that the previous FIDM defines how to export the rows in its database, and the current FIDM defines how to import the rows into its database.

Limitations

Currently, when switching FIDMs, we rely on the fact that the previous FIDM writes its own manifest to the database file. If custom FIDMs forget to do this, the migration will fail. TODO: maybe move the manifest creation logic into BaseFileIdManager.__init__()?
Requires the previous FIDM implementation to use SQLite and not write some other database file format (e.g. MariaDB, PostgresQL) to the path. I think this is constraint is OK to impose.

Open questions

How do we communicate the creation of a backup to the user more clearly? I thought about adding a server log, but this could easily get lost in the noise.

Related issues

Closes Cannot switch file ID managers with same database #28.

dlqqq · 2022-11-09T02:38:33Z

I will need consensus on whether we're OK with imposing the constraint that all File ID manager implementations use SQLite. It's not clear if a general migration strategy would be possible otherwise. Furthermore, if we impose this constraint, we can merge the duplicate __init__() logic in ArbitraryFIDM and LocalFIDM into BaseFIDM, and then perform migrations for all custom FIDMs automatically (as long as custom FIDMs meaningfully implement the import and export methods).

codecov-commenter · 2022-11-09T03:14:33Z

Codecov Report

Base: 85.74% // Head: 85.48% // Decreases project coverage by -0.25% ⚠️

Coverage data is based on head (6fca139) compared to base (50dec2e).
Patch coverage: 86.99% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #49      +/-   ##
==========================================
- Coverage   85.74%   85.48%   -0.26%     
==========================================
  Files           4        4              
  Lines         519      620     +101     
  Branches       68       86      +18     
==========================================
+ Hits          445      530      +85     
- Misses         52       61       +9     
- Partials       22       29       +7

Impacted Files	Coverage Δ
jupyter_server_fileid/manager.py	`89.16% <86.99%> (-1.31%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

dlqqq · 2022-11-09T04:00:19Z

Interesting edge case when migrating from a local FIDM to arbitrary FIDM on Windows. Essentially, local FIDM knows that this is a local Windows path, and calls os.path.normpath() on it to map it to lowercase (since there's no distinction b/w upper and lower case in Windows filesystems).

However, arbitrary FIDM doesn't make any assumptions of the local filesystem and is case sensitive even when running on Windows, and hence is not able to return a relative path since it sees the lower-case content root as being different from the content root.

_______________________ test_migrate_local_to_arbitrary _______________________

fid_db_path = 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\test_migrate_local_to_arbitrar0\\data\\fileidmanager_test.db'
jp_root_dir = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_migrate_local_to_arbitrar0/root_dir')
test_path = 'test_path', test_path_child = 'test_path/child'

    def test_migrate_local_to_arbitrary(fid_db_path, jp_root_dir, test_path, test_path_child):
        local = LocalFileIdManager(db_path=fid_db_path, root_dir=str(jp_root_dir))
        local.con.execute("PRAGMA journal_mode = off")
    
        id_1 = local.index(test_path)
        id_2 = local.index(test_path_child)
        del local
        arbitrary = ArbitraryFileIdManager(db_path=fid_db_path, root_dir=str(jp_root_dir))
    
>       assert arbitrary.get_path(id_1) == test_path
E       AssertionError: assert 'c:/users/runneradmin/appdata/local/temp/pytest-of-runneradmin/pytest-0/test_migrate_local_to_arbitrar0/root_dir/test_path' == 'test_path'
E         - test_path
E         + c:/users/runneradmin/appdata/local/temp/pytest-of-runneradmin/pytest-0/test_migrate_local_to_arbitrar0/root_dir/test_path

dlqqq · 2022-11-09T04:17:52Z

I just patched the test to fix the above error. I'm not sure if this edge case is worth handling, given that the motivation is unclear (downgrading on a local filesystem) and that the workaround is as simple as providing a normalized contents root on Windows.

kevin-bates

Hi @dlqqq. I like the idea of adding import and export functionality, but don't think performing implicit migrations, one row at a time, is necessarily a good idea.

jupyter_server_fileid/manager.py

kevin-bates · 2022-11-09T18:46:05Z

jupyter_server_fileid/manager.py

+            self.log.info(
+                "ArbitraryFileIdManager : Database from different File ID manager detected. Migrating tables and indices."
+            )
+            self._migrate_database()


Could this lead to thrashing? For example, I use two classes of FIDs (e.g., LFID and AFID) configured on my system, neither of which have bothered to configure the db path (plausible if the LFID created the Files table since its schema is a superset of AFID), then won't the AFID trigger migration when it runs, and vice versa when the LFID starts?

I really think this kind of functionality should be in the user's face and performed by a CLI tool that also does import (from a source), export (to a destination), and migration using both. With a CLI tool, you can also support any database type or storage type and no need for a Manifest table. In addition, you could support migration between databases of the same type (e.g., SQLite).

If you're too strict about class names in the manifest, you could also run into issues where subclasses are treated as distinct classes, despite the fact that their schemas (and base functionality) are identical.

Yes, the migration is being done automatically at the present moment.

I really think this kind of functionality should be in the user's face and performed by a CLI tool that also does import (from a source), export (to a destination), and migration using both.

Wait, but what would import and export do? These CLI options don't make sense to me. Importing requires somewhere to export to, and vice versa.

I do agree that migrations should be made explicit and only done via a CLI tool. This allows us to convey the backup database path to the end user more easily. However, I'm wary of the old behavior this would lead to:

start server with ArbitraryFIDM

stop server and restart with LocalFIDM

File ID extension fails to load, but this can get easily lost in the logs

all extensions depending on File ID are now broken and we could lose data since we're no longer able to index files

Do we know of a way to stop the server from starting if the server extension fails? That way, we can force the user to either migrate their database or switch back to their old FIDM.

With a CLI tool, you can also support any database type or storage type and no need for a Manifest table. In addition, you could support migration between databases of the same type (e.g., SQLite). If you're too strict about class names in the manifest, you could also run into issues where subclasses are treated as distinct classes, despite the fact that their schemas (and base functionality) are identical.

Right, but without a manifest table, how would we force the server extension to fail at extension load time rather than at run time? We need some reflection in the database itself to tell users to migrate at load time rather than giving them some unreadable SQLite error at run time.

Wait, but what would import and export do? These CLI options don't make sense to me. Importing requires somewhere to export to, and vice versa.

Correct. Import and Export need sources and destinations and it would be up to the CLI tool to define the types of sources (FIDMs, csv files, jsons, etc.). Migration essentially reverses those, where the source is exported from and the destination is imported to and would minimally support FIDM "source types". I.e., the initial implementation of the tool would take FIDM classes from which instances are instantiated and used. Support for other types could be added as necessary.

Regarding the failure scenario bullets...
Why does the FileID service fail to load when switching to the LFIDM - is it because the schema from the AFIDM is not sufficient and they're both configured to use the same database file (which strikes me as an administration bug)?
Won't every call to the FID service then fail and wouldn't that only go unnoticed if the calling application ignored exceptions? I hope callers are robust enough to recognize exceptions, but I might be missing something here.

Correct. Import and Export need sources and destinations and it would be up to the CLI tool to define the types of sources (FIDMs, csv files, jsons, etc.).

Sorry if I'm not understanding you fully, but it almost sounds like you want two additional CLI options in addition to jupyter server fileid migrate:

jupyter server fileid import

jupyter server fileid export

My question is what would these options do differently than migrate? It seems importing a file ID database and exporting to a different one is the exact same as migrate.

Why does the FileID service fail to load when switching to the LFIDM - is it because the schema from the AFIDM is not sufficient and they're both configured to use the same database file (which strikes me as an administration bug)?

Yeah, so it'll fail at runtime because the database schemas are different. While this is partially an administration bug, there is a legitimate use case for migrations regardless. Right now, if you want to switch FIDM implementations, you're forced to use a separate DB file, meaning you just lost all the file IDs you previously indexed. Thus you also broke all extensions that depend on file ID.

Won't every call to the FID service then fail and wouldn't that only go unnoticed if the calling application ignored exceptions? I hope callers are robust enough to recognize exceptions, but I might be missing something here.

Right, but there remains the issue of how to communicate to the user that "because you switched file ID managers, the best way to fix these errors is to migrate your database". I would prefer the server to just not start if the file ID database hasn't been migrated rather than have the user try and debug this on their own. That way, we force the user to take one of three paths:

Revert back to the old File ID manager

Perform a migration with the CLI tool

Drop the old database

I don't want people to have to scratch their heads and figure out why file ID isn't working after switching implementations.

Sorry if I'm not understanding you fully, but it almost sounds like you want two additional CLI options

I'm just saying that since migration is the "pipe between export and import" a CLI tool could offer different sources. What we've been primarily talking about are sources of FIDM instances (since they front the database), but I'm saying that you could also support different output destinations where the CLI can store exported data, and different input sources where the CLI tool can get data for importing.

So I could perform an equivalent migration by exporting from FIDM1 to a file exported.json and importing from exported.json to FIDM2. That is, the CLI tool manages the "pipe" since it is the caller of FIDM1.export_rows() and FIDM2.import_rows(). This can then provide users the ability to have more choices for how they want to migrate (or backup) their FID databases.

One application for this may be that the user needs to perform some transformations on the exported rows (e.g., exported.json) prior to their import as part of their "migration plan".

We don't necessarily need to expose other sources and split migrate into export and import right away but should have that kind of thing in mind during implementation so that "the pipe" is extensible such that a user could manually intervene between "export" and "import" as part of their migration.

Thinking a bit more on this, the first non-FIDM source type to support would probably want to be csv, since you want the ability to read rows from a file and JSON doesn't really lend itself to that format (that I'm aware of). .csv would also be immediately usable in spreadsheet applications in which the transformations could be performed, etc.

I don't think we need to support import/export to some secondary file format in this PR. Let's limit the scope of this PR for just migrating between SQLite DBs. However, these are great ideas we should definitely think about implementing later down the road.

tests/test_manager.py

jupyter_server_fileid/manager.py

support migrations between different file ID managers

b364fa5

dlqqq added the enhancement New feature or request label Nov 9, 2022

dlqqq requested review from ellisonbg, davidbrochart and kevin-bates November 9, 2022 02:36

dlqqq added 2 commits November 9, 2022 02:42

change timestamp format to support Windows

fe5e0ea

normalize path when importing row

acd0e2a

normalize case in local to arbitrary test

b02ce6d

kevin-bates reviewed Nov 9, 2022

View reviewed changes

tests/test_manager.py Outdated Show resolved Hide resolved

dlqqq added 3 commits December 5, 2022 18:29

normalize case in root_dir validator for AFIM

6ab34e9

make export_rows return a dict, rename import_row => import_rows

4975d71

add pass statements to abstract methods

6fca139

kevin-bates reviewed Dec 5, 2022

View reviewed changes

jupyter_server_fileid/manager.py Show resolved Hide resolved

kevin-bates reviewed Dec 5, 2022

View reviewed changes

jupyter_server_fileid/manager.py Show resolved Hide resolved

dlqqq mentioned this pull request Dec 6, 2022

create SQLite mixin #53

Open

dlqqq mentioned this pull request Feb 16, 2023

jupyter-fileid script needs click #54

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support migrations between different file ID managers #49

support migrations between different file ID managers #49

dlqqq commented Nov 9, 2022 •

edited

Loading

dlqqq commented Nov 9, 2022

codecov-commenter commented Nov 9, 2022 •

edited by codecov bot

Loading

dlqqq commented Nov 9, 2022

dlqqq commented Nov 9, 2022

kevin-bates left a comment

kevin-bates Nov 9, 2022

dlqqq Nov 10, 2022 •

edited

Loading

kevin-bates Nov 10, 2022

dlqqq Nov 10, 2022

kevin-bates Nov 10, 2022

kevin-bates Nov 10, 2022

dlqqq Nov 11, 2022

support migrations between different file ID managers #49

Are you sure you want to change the base?

support migrations between different file ID managers #49

Conversation

dlqqq commented Nov 9, 2022 • edited Loading

Description

Limitations

Open questions

Related issues

dlqqq commented Nov 9, 2022

codecov-commenter commented Nov 9, 2022 • edited by codecov bot Loading

Codecov Report

dlqqq commented Nov 9, 2022

dlqqq commented Nov 9, 2022

kevin-bates left a comment

Choose a reason for hiding this comment

kevin-bates Nov 9, 2022

Choose a reason for hiding this comment

dlqqq Nov 10, 2022 • edited Loading

Choose a reason for hiding this comment

kevin-bates Nov 10, 2022

Choose a reason for hiding this comment

dlqqq Nov 10, 2022

Choose a reason for hiding this comment

kevin-bates Nov 10, 2022

Choose a reason for hiding this comment

kevin-bates Nov 10, 2022

Choose a reason for hiding this comment

dlqqq Nov 11, 2022

Choose a reason for hiding this comment

dlqqq commented Nov 9, 2022 •

edited

Loading

codecov-commenter commented Nov 9, 2022 •

edited by codecov bot

Loading

dlqqq Nov 10, 2022 •

edited

Loading