-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support migrations between different file ID managers #49
base: main
Are you sure you want to change the base?
Conversation
I will need consensus on whether we're OK with imposing the constraint that all File ID manager implementations use SQLite. It's not clear if a general migration strategy would be possible otherwise. Furthermore, if we impose this constraint, we can merge the duplicate |
Codecov ReportBase: 85.74% // Head: 85.48% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #49 +/- ##
==========================================
- Coverage 85.74% 85.48% -0.26%
==========================================
Files 4 4
Lines 519 620 +101
Branches 68 86 +18
==========================================
+ Hits 445 530 +85
- Misses 52 61 +9
- Partials 22 29 +7
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Interesting edge case when migrating from a local FIDM to arbitrary FIDM on Windows. Essentially, local FIDM knows that this is a local Windows path, and calls However, arbitrary FIDM doesn't make any assumptions of the local filesystem and is case sensitive even when running on Windows, and hence is not able to return a relative path since it sees the lower-case content root as being different from the content root. _______________________ test_migrate_local_to_arbitrary _______________________
fid_db_path = 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\test_migrate_local_to_arbitrar0\\data\\fileidmanager_test.db'
jp_root_dir = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_migrate_local_to_arbitrar0/root_dir')
test_path = 'test_path', test_path_child = 'test_path/child'
def test_migrate_local_to_arbitrary(fid_db_path, jp_root_dir, test_path, test_path_child):
local = LocalFileIdManager(db_path=fid_db_path, root_dir=str(jp_root_dir))
local.con.execute("PRAGMA journal_mode = off")
id_1 = local.index(test_path)
id_2 = local.index(test_path_child)
del local
arbitrary = ArbitraryFileIdManager(db_path=fid_db_path, root_dir=str(jp_root_dir))
> assert arbitrary.get_path(id_1) == test_path
E AssertionError: assert 'c:/users/runneradmin/appdata/local/temp/pytest-of-runneradmin/pytest-0/test_migrate_local_to_arbitrar0/root_dir/test_path' == 'test_path'
E - test_path
E + c:/users/runneradmin/appdata/local/temp/pytest-of-runneradmin/pytest-0/test_migrate_local_to_arbitrar0/root_dir/test_path |
I just patched the test to fix the above error. I'm not sure if this edge case is worth handling, given that the motivation is unclear (downgrading on a local filesystem) and that the workaround is as simple as providing a normalized contents root on Windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dlqqq. I like the idea of adding import
and export
functionality, but don't think performing implicit migrations, one row at a time, is necessarily a good idea.
self.log.info( | ||
"ArbitraryFileIdManager : Database from different File ID manager detected. Migrating tables and indices." | ||
) | ||
self._migrate_database() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this lead to thrashing? For example, I use two classes of FIDs (e.g., LFID and AFID) configured on my system, neither of which have bothered to configure the db path (plausible if the LFID created the Files table since its schema is a superset of AFID), then won't the AFID trigger migration when it runs, and vice versa when the LFID starts?
I really think this kind of functionality should be in the user's face and performed by a CLI tool that also does import
(from a source), export
(to a destination), and migration
using both. With a CLI tool, you can also support any database type or storage type and no need for a Manifest
table. In addition, you could support migration between databases of the same type (e.g., SQLite).
If you're too strict about class names in the manifest, you could also run into issues where subclasses are treated as distinct classes, despite the fact that their schemas (and base functionality) are identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the migration is being done automatically at the present moment.
I really think this kind of functionality should be in the user's face and performed by a CLI tool that also does import (from a source), export (to a destination), and migration using both.
Wait, but what would import
and export
do? These CLI options don't make sense to me. Importing requires somewhere to export to, and vice versa.
I do agree that migrations should be made explicit and only done via a CLI tool. This allows us to convey the backup database path to the end user more easily. However, I'm wary of the old behavior this would lead to:
- start server with ArbitraryFIDM
- stop server and restart with LocalFIDM
- File ID extension fails to load, but this can get easily lost in the logs
- all extensions depending on File ID are now broken and we could lose data since we're no longer able to index files
Do we know of a way to stop the server from starting if the server extension fails? That way, we can force the user to either migrate their database or switch back to their old FIDM.
With a CLI tool, you can also support any database type or storage type and no need for a
Manifest
table. In addition, you could support migration between databases of the same type (e.g., SQLite). If you're too strict about class names in the manifest, you could also run into issues where subclasses are treated as distinct classes, despite the fact that their schemas (and base functionality) are identical.
Right, but without a manifest table, how would we force the server extension to fail at extension load time rather than at run time? We need some reflection in the database itself to tell users to migrate at load time rather than giving them some unreadable SQLite error at run time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, but what would import and export do? These CLI options don't make sense to me. Importing requires somewhere to export to, and vice versa.
Correct. Import and Export need sources and destinations and it would be up to the CLI tool to define the types of sources (FIDMs, csv files, jsons, etc.). Migration essentially reverses those, where the source is exported from and the destination is imported to and would minimally support FIDM "source types". I.e., the initial implementation of the tool would take FIDM classes from which instances are instantiated and used. Support for other types could be added as necessary.
Regarding the failure scenario bullets...
Why does the FileID service fail to load when switching to the LFIDM - is it because the schema from the AFIDM is not sufficient and they're both configured to use the same database file (which strikes me as an administration bug)?
Won't every call to the FID service then fail and wouldn't that only go unnoticed if the calling application ignored exceptions? I hope callers are robust enough to recognize exceptions, but I might be missing something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Import and Export need sources and destinations and it would be up to the CLI tool to define the types of sources (FIDMs, csv files, jsons, etc.).
Sorry if I'm not understanding you fully, but it almost sounds like you want two additional CLI options in addition to jupyter server fileid migrate
:
jupyter server fileid import
jupyter server fileid export
My question is what would these options do differently than migrate
? It seems importing a file ID database and exporting to a different one is the exact same as migrate.
Why does the FileID service fail to load when switching to the LFIDM - is it because the schema from the AFIDM is not sufficient and they're both configured to use the same database file (which strikes me as an administration bug)?
Yeah, so it'll fail at runtime because the database schemas are different. While this is partially an administration bug, there is a legitimate use case for migrations regardless. Right now, if you want to switch FIDM implementations, you're forced to use a separate DB file, meaning you just lost all the file IDs you previously indexed. Thus you also broke all extensions that depend on file ID.
Won't every call to the FID service then fail and wouldn't that only go unnoticed if the calling application ignored exceptions? I hope callers are robust enough to recognize exceptions, but I might be missing something here.
Right, but there remains the issue of how to communicate to the user that "because you switched file ID managers, the best way to fix these errors is to migrate your database". I would prefer the server to just not start if the file ID database hasn't been migrated rather than have the user try and debug this on their own. That way, we force the user to take one of three paths:
- Revert back to the old File ID manager
- Perform a migration with the CLI tool
- Drop the old database
I don't want people to have to scratch their heads and figure out why file ID isn't working after switching implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I'm not understanding you fully, but it almost sounds like you want two additional CLI options
I'm just saying that since migration is the "pipe between export and import" a CLI tool could offer different sources. What we've been primarily talking about are sources of FIDM instances (since they front the database), but I'm saying that you could also support different output destinations where the CLI can store exported data, and different input sources where the CLI tool can get data for importing.
So I could perform an equivalent migration by exporting from FIDM1 to a file exported.json
and importing from exported.json
to FIDM2. That is, the CLI tool manages the "pipe" since it is the caller of FIDM1.export_rows()
and FIDM2.import_rows()
. This can then provide users the ability to have more choices for how they want to migrate (or backup) their FID databases.
One application for this may be that the user needs to perform some transformations on the exported rows (e.g., exported.json
) prior to their import as part of their "migration plan".
We don't necessarily need to expose other sources and split migrate
into export
and import
right away but should have that kind of thing in mind during implementation so that "the pipe" is extensible such that a user could manually intervene between "export" and "import" as part of their migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking a bit more on this, the first non-FIDM source type to support would probably want to be csv
, since you want the ability to read rows from a file and JSON doesn't really lend itself to that format (that I'm aware of). .csv
would also be immediately usable in spreadsheet applications in which the transformations could be performed, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to support import/export to some secondary file format in this PR. Let's limit the scope of this PR for just migrating between SQLite DBs. However, these are great ideas we should definitely think about implementing later down the road.
Description
export_rows()
: a static method that returns a generator yielding a tuple of[id, path]
import_rows()
: an instance method that writes a tuple yielded byexport_rows()
to the databaseArbitraryFIDM
andLocalFIDM
support migrations between each other by default. Upon initialization, both implementations write their module name and class name to a "database manifest", which allows for subsequent reflection by the other FIDM. If a manifest already exists and it has a different module or class name, we perform a migration from the previous FIDM to the current FIDM. The migration strategy is roughly as follows:2022-11-09T01:55:32-file_id_manager.db
. Bind this tobackup_db_path
.PrevManager = getattr(importlib.import_module(prev_module), prev_classname)
The key principle is that the previous FIDM defines how to export the rows in its database, and the current FIDM defines how to import the rows into its database.
Limitations
Currently, when switching FIDMs, we rely on the fact that the previous FIDM writes its own manifest to the database file. If custom FIDMs forget to do this, the migration will fail. TODO: maybe move the manifest creation logic into
BaseFileIdManager.__init__()
?Requires the previous FIDM implementation to use SQLite and not write some other database file format (e.g. MariaDB, PostgresQL) to the path. I think this is constraint is OK to impose.
Open questions
Related issues