Address tag identification integrity issues #164

Closed
lewismc opened this issue Dec 21, 2022 · 15 comments
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed
Milestone

Comments

@lewismc
Member

lewismc commented Dec 21, 2022

Currently we use the filename as the primary identifier to determine 1) an initial tag submission, and 2) a new submission for an existing tag. If we encounter the same file (by name) then we simply increment the submission number.
This logic is error-prone because it ignores all file semantics apart from the filename. This task will determine additional criteria that prevent duplicate file entries and ensure the integrity of tag submissions.
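A minimal sketch of the failure mode described above. It is not the project's actual ingestion code: `register_by_filename` and the in-memory `submissions` mapping are hypothetical stand-ins for the real lookup.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical in-memory stand-in for the submission records:
# filename -> number of submissions seen so far.
submissions = defaultdict(int)


def register_by_filename(path: str) -> int:
    """Return the submission number, using the filename as the only key."""
    name = PurePosixPath(path).name
    submissions[name] += 1
    return submissions[name]


# Two unrelated files that happen to share a name are merged into one tag ...
first = register_by_filename("/lab_a/tag.txt")
second = register_by_filename("/lab_b/tag.txt")

# ... while a renamed copy of an existing file is treated as a brand new tag.
renamed = register_by_filename("/lab_a/tag_copy.txt")
```

Both outcomes are wrong for the same reason: nothing inside the file is consulted.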

@lewismc lewismc changed the title Improve detection of tag reuse under different submission Address tag identification integrity issues Dec 21, 2022
@lewismc
Member Author

lewismc commented Dec 21, 2022

@tagtuna can you please suggest some lookup logic we can use to determine file semantics? Thank you

@lewismc lewismc added bug Something isn't working help wanted Extra attention is needed labels Dec 21, 2022
@lewismc
Member Author

lewismc commented Jan 15, 2023

Currently tag_id is always defined as

tag_id integer NOT NULL

I propose that we assign tag_id based on the instrument_name from the global metadata. This is a fundamental change to the way we determine and assign tag_id. It would change the column type from the above to

tag_id character varying(50) NOT NULL
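A rough sketch of what deriving tag_id from the global metadata could look like. The header syntax (`:name = "value"`), the sample values, and the helper names `parse_global_metadata` and `tag_id_from_metadata` are assumptions made for illustration; only the 50-character limit comes from the proposed column type.

```python
import re

MAX_TAG_ID_LENGTH = 50  # mirrors the proposed 50-character column

# Assumed header syntax for a global attribute line: :name = "value"
_ATTRIBUTE = re.compile(r'^\s*:(\w+)\s*=\s*"([^"]*)"')


def parse_global_metadata(lines):
    """Collect name/value pairs from lines that look like global attributes."""
    metadata = {}
    for line in lines:
        match = _ATTRIBUTE.match(line)
        if match:
            metadata[match.group(1)] = match.group(2)
    return metadata


def tag_id_from_metadata(metadata):
    """Use instrument_name as tag_id, rejecting missing or oversized values."""
    name = metadata.get("instrument_name", "").strip()
    if not name:
        raise ValueError("instrument_name is missing from the global metadata")
    if len(name) > MAX_TAG_ID_LENGTH:
        raise ValueError(f"instrument_name exceeds {MAX_TAG_ID_LENGTH} characters")
    return name


# Made-up header lines, for illustration only.
header = [
    '  :instrument_name = "example_tag_001"',
    '  :platform = "example species"',
]
tag_id = tag_id_from_metadata(parse_global_metadata(header))
```

Rejecting a missing or oversized instrument_name at parse time keeps the failure close to the submitted file rather than surfacing later as a database error.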

@lewismc
Member Author

lewismc commented Jan 15, 2023

This also means that we always have a filename, regardless of whether the submitted filename is different. This will be used when we export raw data.

@lewismc
Member Author

lewismc commented Feb 28, 2023

At today's meeting we decided that this issue will be worked on after #191 and #194.

@vtsontos

This matter also relates to dataset versioning (the same tag dataset potentially reprocessed with data and/or metadata changes) and to multi-track data scenarios. As stated in the notes above, the filename is currently the sole criterion for deciding whether a dataset is ingested with a new tag_id. We should consider expanding the criteria to include a limited set of key eTUFF metadata elements, and should additionally suggest a filename convention that includes some indication of the version number.
Versioning also raises the procedural question of whether and how to delete prior versions of a given tag dataset that are already in the system and may need to be replaced (or not, if maintaining the versioning history and all versions of the data is important to the user).
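To make the filename-convention idea concrete, here is a small sketch. The `_v<N>` suffix is purely a hypothetical convention chosen for illustration, as is the helper name `split_version`; no convention has been agreed in this thread.

```python
import re

# Hypothetical convention: "<stem>_v<N>.<ext>", e.g. "tuna_123_v2.txt".
_VERSIONED = re.compile(r"^(?P<stem>.+)_v(?P<version>\d+)\.(?P<ext>[A-Za-z0-9]+)$")


def split_version(filename):
    """Return (unversioned filename, version), or (filename, None) if unversioned."""
    match = _VERSIONED.match(filename)
    if match is None:
        return filename, None
    return f"{match['stem']}.{match['ext']}", int(match["version"])
```

With this convention, `tuna_123_v2.txt` and `tuna_123_v3.txt` both resolve to the same base name, so they could be grouped as versions of one dataset rather than as separate tags.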

@lewismc
Member Author

lewismc commented Apr 16, 2023

@vtsontos can you please outline exactly what the key/limited eTUFF metadata elements are? If you want to combine defining metadata characteristics with file name convention as well, then please advise. Thanks

@lewismc lewismc self-assigned this Apr 16, 2023
@lewismc lewismc added this to the 0.11.0 milestone Apr 16, 2023
@vtsontos

vtsontos commented Apr 16, 2023 via email

@tagtuna
Contributor

tagtuna commented Apr 17, 2023

@lewismc @vtsontos A little late to the party, but this is my take

  1. Per our last discussion on the status quo, we need to keep track of both the full file path (or URL) and the filename alone. Right now the increment of tag_id is based on the full file path instead of just the filename.
  2. The key metadata fields that can assist with versioning are:
  • instrument_name (unique name, made clear to the user that this is the primary identifier)
  • serial_number (device internal ID)
  • ptt (satellite platform ID)
  • platform (species code/common name that the device was deployed on)
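The four fields listed above could be grouped into a single identity record, sketched below. The class name `TagIdentity`, the helper `changed_fields`, and the sample values are hypothetical; only the field names come from the list.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class TagIdentity:
    """The four metadata fields proposed for identifying a tag dataset."""
    instrument_name: str                  # primary identifier
    serial_number: Optional[str] = None   # device internal ID
    ptt: Optional[str] = None             # satellite platform ID
    platform: Optional[str] = None        # species code/common name


def changed_fields(old: TagIdentity, new: TagIdentity) -> list:
    """Names of the identity fields whose values differ between two records."""
    return [f.name for f in fields(TagIdentity)
            if getattr(old, f.name) != getattr(new, f.name)]


# Made-up values: the same hardware redeployed on a different animal.
first_deployment = TagIdentity("dep_2021_A", "SN100", "12345", "tuna")
second_deployment = TagIdentity("dep_2022_B", "SN100", "12345", "shark")
diff = changed_fields(first_deployment, second_deployment)
```

Reporting which fields changed, rather than a bare yes/no, gives a reviewer something to act on when a submission looks like a near match.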

@tagtuna
Contributor

tagtuna commented Apr 17, 2023

As I understand it, some form of file checksum combined with the filename and the four metadata attributes above can be a first-cut way to determine whether this is a new file warranting a new tag_id. Is this a good characterization?
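One possible reading of that first-cut check, as a sketch only. The choice of SHA-256, the three outcome labels, and the `known` mapping are assumptions; the thread has not settled how the checksum, filename, and metadata should be weighed against each other.

```python
import hashlib

METADATA_KEYS = ("instrument_name", "serial_number", "ptt", "platform")


def content_checksum(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def identity_key(metadata: dict) -> tuple:
    """Tuple of the four identifying attributes (None where absent)."""
    return tuple(metadata.get(key) for key in METADATA_KEYS)


def classify(known: dict, filename: str, data: bytes, metadata: dict) -> str:
    """Classify a submission against `known`.

    `known` maps identity_key -> {checksum: filename of first submission}.
    """
    key = identity_key(metadata)
    checksum = content_checksum(data)
    if key not in known:
        known[key] = {checksum: filename}
        return "new tag_id"
    if checksum in known[key]:
        # Same identity and same bytes: a re-upload, whatever the filename.
        return "duplicate"
    known[key][checksum] = filename
    return "new submission"
```

In this sketch the filename is recorded but never decides the outcome, so a renamed copy of an already ingested file is reported as a duplicate instead of creating a new tag.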

@lewismc
Member Author

lewismc commented Apr 19, 2023

Yes @tagtuna I will work on this right now. Thanks

@lewismc
Member Author

lewismc commented Apr 20, 2023

@tagtuna we have noticed that serial_number is not present in some of the older files we have. Moving forward, can we rely on serial_number always being present, or should the ingestion logic always check for it potentially being absent? If it is absent, does this mean that we could incorrectly correlate a data file with an existing dataset? Please advise. Thanks
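A defensive comparison that tolerates absent fields might look like the sketch below. The function name `compare_identity`, the status labels, and the sample values are hypothetical, and treating instrument_name as the one mandatory field is an assumption.

```python
OPTIONAL_FIELDS = ("serial_number", "ptt", "platform")


def compare_identity(existing: dict, incoming: dict):
    """Compare two metadata dicts and return (status, fields).

    status is "different", "conflict", or "match". For "match", fields lists
    the optional attributes that could not be checked because one side lacks
    them; for "conflict", it holds the first attribute that disagreed.
    """
    name = incoming.get("instrument_name")
    if not name or name != existing.get("instrument_name"):
        return "different", []
    unverified = []
    for field in OPTIONAL_FIELDS:
        old, new = existing.get(field), incoming.get(field)
        if old is None or new is None:
            unverified.append(field)  # cannot confirm or refute
        elif old != new:
            return "conflict", [field]
    return "match", unverified


# Made-up records: an older file without serial_number.
existing = {"instrument_name": "A", "serial_number": "S1", "ptt": "1", "platform": "tuna"}
legacy = {"instrument_name": "A", "ptt": "1", "platform": "tuna"}
status, unverified = compare_identity(existing, legacy)
```

Surfacing the unverified fields lets the ingestion step flag a weaker match for review instead of silently correlating the file with an existing dataset.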

@lewismc
Member Author

lewismc commented Apr 20, 2023

@tagtuna can you specify exactly which features define a dataset? Thank you

@tagtuna
Contributor

tagtuna commented Apr 20, 2023 via email

@renato2099
Collaborator

It's definitely an oversight on my part that some files missed the serial
number. We do require having the serial number as a must-have metadata
attribute.

👍

However, only instrument_name is unique and the reliable way to connect any
files to a particular instrument in a given deployment.

Our discussion on including additional metadata attributes is meant to provide
other clues. Take this example: the same hardware was first used on a tuna
(recovered) and then reused on a shark.

I see ... although the instrument_name is expected to be unique across deployments, it "might" not be the case due to humans in the loop.

One thing that is not clear to me @tagtuna is what we consider a "dataset". I guess it is a set of instrument_names (with their corresponding files) plus additional metadata (species, location, etc.), or are we considering a dataset to be a single instrument_name plus its corresponding files?

@tagtuna
Contributor

tagtuna commented Apr 25, 2023 via email


4 participants