Address tag identification integrity issues #164

lewismc · 2022-12-21T15:36:58Z

Currently we use the filename as the primary identifier to determine 1) an initial tag submission, and 2) a new submission for an existing tag. If we encounter the same file (by name) then we simply increment the submission number.
This logic is error-prone because it ignores all file semantics apart from the filename. This task will determine additional criteria that prevent duplicate file entries and ensure the integrity of tag submissions.
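
A minimal sketch of the failure mode described above, assuming a hypothetical in-memory `submissions` map in place of the real database and a hypothetical `register_by_filename` helper (neither name comes from the codebase). It only illustrates why keying on the filename alone conflates unrelated files:

```python
# Hypothetical in-memory stand-in for the submission records: filename -> count.
submissions = {}


def register_by_filename(filename: str, content: bytes) -> int:
    """Return the submission number, keyed on the filename alone."""
    # `content` is deliberately ignored; that is the weakness being illustrated.
    submissions[filename] = submissions.get(filename, 0) + 1
    return submissions[filename]


first = register_by_filename("tag_001.txt", b"deployment A data")
# A different file that happens to share the name is treated as a resubmission.
second = register_by_filename("tag_001.txt", b"unrelated deployment B data")
print(first, second)
```

The second call is counted as submission 2 of the same tag even though the content is unrelated.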

lewismc · 2022-12-21T16:29:14Z

@tagtuna can you please suggest some lookup logic we can use to determine file semantics? Thank you

lewismc · 2023-01-15T21:57:22Z

Currently tag_id is always defined as

tag_id integer NOT NULL

I am going to propose that we assign a tag_id based on the instrument_name from the global metadata. This is a fundamental change to the way we determine and assign tag_id. It would change the column type from the above to

tag_id character varying(50) NOT NULL

lewismc · 2023-01-15T21:58:01Z

This also means that we always have a filename regardless of whether the submitted filename is different. This will be used when we do export of raw data.

lewismc · 2023-02-28T15:07:30Z

At today's meeting we decided that this issue will be worked on after #191 and #194.

vtsontos · 2023-04-13T21:17:21Z

This matter also relates to the issue of dataset versioning (the same tag dataset potentially reprocessed with data and/or metadata changes) and to multi-track data scenarios. As stated in the notes above, the filename is currently the sole criterion for deciding whether a dataset is ingested with a new tag_id. We should consider expanding this to include a limited set of key eTUFF metadata elements, and should additionally suggest a filename convention that includes some indication of version number.
Versioning also raises the procedural question of whether and how to delete prior versions of a given tag dataset that are already in the system and may need to be replaced (or kept, if maintaining the version history and all versions of the data is important to the user).

lewismc · 2023-04-16T18:23:58Z

@vtsontos can you please outline exactly what the key/limited eTUFF metadata elements are? If you want to combine defining metadata characteristics with file name convention as well, then please advise. Thanks

vtsontos · 2023-04-16T19:08:32Z

Including Tim, as he is probably best positioned to define these. Thanks, Vardis


tagtuna · 2023-04-17T02:02:44Z

@lewismc @vtsontos A little late to the party, but this is my take

Per our last discussion on the status quo, we need to keep track of both the full file path (or URL) and the bare filename. Right now the increment of tag_id is based on the full file path rather than just the filename.
The key metadata fields that can help with versioning are:

instrument_name (unique name, made clear to the user that this is the primary identifier)
serial_number (device internal ID)
ptt (satellite platform ID)
platform (species code or common name of the animal the device was deployed on)

tagtuna · 2023-04-17T02:09:53Z

My understanding is that some form of file checksum, combined with the filename and the four metadata attributes above, can be a first-cut way to determine whether this is a new file warranting a new tag_id. Is this a good characterization?
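
A rough sketch of that first-cut check, under stated assumptions: SHA-256 is chosen here only as one possible "form of file checksum", and `SubmissionKey`, `make_key`, `is_new_file` and the `seen` set are hypothetical names used for illustration, not existing code. The metadata values in the usage lines are made up.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubmissionKey:
    """Filename, content checksum and the four metadata attributes listed above."""
    filename: str
    checksum: str
    instrument_name: str
    serial_number: Optional[str]
    ptt: Optional[str]
    platform: Optional[str]


def make_key(filename: str, content: bytes, metadata: dict) -> SubmissionKey:
    # instrument_name is treated as mandatory; the other attributes may be absent.
    return SubmissionKey(
        filename=filename,
        checksum=hashlib.sha256(content).hexdigest(),
        instrument_name=metadata["instrument_name"],
        serial_number=metadata.get("serial_number"),
        ptt=metadata.get("ptt"),
        platform=metadata.get("platform"),
    )


def is_new_file(key: SubmissionKey, seen: set) -> bool:
    """Return True (and remember the key) only the first time a key is seen."""
    if key in seen:
        return False
    seen.add(key)
    return True


# Usage with made-up values.
seen = set()
meta = {"instrument_name": "tag-A", "serial_number": "SN123", "ptt": "40001", "platform": "tuna"}
print(is_new_file(make_key("a.txt", b"raw bytes", meta), seen))      # first sighting
print(is_new_file(make_key("a.txt", b"raw bytes", meta), seen))      # exact duplicate
print(is_new_file(make_key("a.txt", b"changed bytes", meta), seen))  # same name, new content
```

Whether a changed checksum should lead to a new tag_id or to a new version of an existing one is a policy decision this sketch does not settle.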

lewismc · 2023-04-19T03:41:33Z

Yes @tagtuna I will work on this right now. Thanks

lewismc · 2023-04-20T04:42:24Z

@tagtuna we have noticed that serial_number is not present in some of the older files we have. Moving forward, can we rely on serial_number always being present, or should the ingestion logic always check for it potentially being absent? If it is absent, does this mean that we could incorrectly correlate a data file with an existing dataset? Please advise. Thanks

lewismc · 2023-04-20T05:15:33Z

@tagtuna can you specify exactly which features define a dataset? Thank you

tagtuna · 2023-04-20T06:03:17Z

It's definitely an oversight on my part that some files are missing the serial number. We do require the serial number as a must-have metadata attribute. However, only instrument_name is unique and is the reliable way to connect any file to a particular instrument in a given deployment. The point of including additional metadata attributes is to provide other clues. Take this example: the same hardware was first used on a tuna (recovered) and then reused on a shark. Let's say the client failed to provide a unique instrument_name for these two deployments. We may be able to distinguish them by seeing that the tag was deployed on a shark in the second event.
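
The tuna/shark example can be sketched as a comparison of the secondary attributes, assuming metadata is available as plain dictionaries. `conflicting_fields` and all values below are hypothetical. An attribute missing from either record (such as serial_number in older files) is treated as "no evidence" rather than as a conflict:

```python
def conflicting_fields(existing: dict, incoming: dict,
                       fields=("serial_number", "ptt", "platform")) -> list:
    """List secondary attributes present in both records with different values."""
    return [
        name for name in fields
        if existing.get(name) is not None
        and incoming.get(name) is not None
        and existing[name] != incoming[name]
    ]


# Same hardware and (mistakenly) the same instrument_name, but a different animal.
tuna = {"instrument_name": "tag-A", "serial_number": "SN123", "platform": "tuna"}
shark = {"instrument_name": "tag-A", "serial_number": "SN123", "platform": "shark"}
# Legacy record with no serial_number at all.
legacy = {"instrument_name": "tag-A", "platform": "tuna"}

print(conflicting_fields(tuna, shark))   # platform differs, hinting at a second deployment
print(conflicting_fields(tuna, legacy))  # nothing to compare on serial_number, no conflict
```

A non-empty result would be a cue to flag the submission for review instead of silently attaching it to the existing dataset.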


renato2099 · 2023-04-23T18:22:22Z

It's definitely an oversight on my part that some files missed the serial
number. We do require having the serial number as a must-have metadata
attribute.

👍

However, only instrument_name is unique and the reliable way to connect any
files to a particular instrument in a given deployment.

Our discussion on including additional metadata attributes are to provide
other clues. Take this example: the same hardware was first used on a tuna
(recovered) and then reused on a shark

I see ... although the instrument_name is expected to be unique across deployments, it "might" not be the case due to humans in the loop.

One thing that is not clear to me @tagtuna is what we consider a "dataset". I guess it is a set of instrument_names (with their corresponding files) + additional metadata (species, location, etc.), or are we considering a dataset to be a single instrument_name + its corresponding files?

tagtuna · 2023-04-25T00:19:51Z

A dataset is a set of 1) instrument_name + 2) additional metadata + 3) "returned" data files. This corresponds to a particular deployment of the hardware (instrument_name) on a studied animal (additional metadata). By "returned" data files, I mean recorded observations that were retrieved via satellite transmission or downloaded via computer cables. A track made up of (lat, lon) pairs is available either when (a) it is recorded directly and returned as a specific data file (e.g., GPS fixes) or (b) it is calculated after the end of a deployment using different bits of the returned data files (referred to as "geolocation"). As you can see, if a track is calculated, multiple methods or input parameters can be used to drive that estimation process. In any case, having different or new tracks does not create a new "dataset". New tracks only add to an existing dataset or update the track information that already exists. Does that make sense?
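
One way to picture this definition is the data structure below. It is a sketch only: the `Dataset` and `Track` classes, the `datasets` registry and the choice to key tracks by estimation method are all assumptions made for illustration, not the project's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    method: str      # e.g. "gps" or the name of a geolocation method (assumed labels)
    positions: list  # (lat, lon) pairs


@dataclass
class Dataset:
    instrument_name: str                              # 1) the deployment's identifier
    metadata: dict                                    # 2) additional metadata
    data_files: list = field(default_factory=list)    # 3) "returned" data files
    tracks: dict = field(default_factory=dict)        # method -> Track


# Hypothetical registry: one dataset per instrument_name.
datasets = {}


def add_track(instrument_name: str, track: Track) -> Dataset:
    """Attach a track to an existing dataset; never create a new dataset."""
    dataset = datasets[instrument_name]  # raises KeyError for an unknown deployment
    dataset.tracks[track.method] = track  # new method adds, same method updates
    return dataset


datasets["tag-A"] = Dataset("tag-A", {"platform": "tuna"}, ["tag-A_raw.csv"])
add_track("tag-A", Track("gps", [(10.0, -30.0)]))
add_track("tag-A", Track("geolocation-v1", [(10.1, -30.2)]))
add_track("tag-A", Track("geolocation-v1", [(10.2, -30.1)]))  # re-run updates the track
print(len(datasets), sorted(datasets["tag-A"].tracks))
```

However many tracks are added or recalculated, the registry still holds a single dataset for the deployment.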


lewismc changed the title ~~Improve detection of tag reuse under different submission~~ Address tag identification integrity issues Dec 21, 2022

lewismc added this to ICCAT Product Drive Phase 2 (2022-10-15 --> 2023-05-27) Dec 21, 2022

lewismc added the bug and help wanted labels Dec 21, 2022

lewismc self-assigned this Apr 16, 2023

lewismc added this to the 0.11.0 milestone Apr 16, 2023

tagtuna mentioned this issue Apr 18, 2023

Scenarios for the acquisition of data files and file versioning #238

Closed

lewismc mentioned this issue May 19, 2023

ISSUE-238 Scenarios for the acquisition of data files and file versioning #265

Merged

lewismc modified the milestones: 0.11.0, 0.13.0 May 19, 2023

lewismc closed this as completed Jun 12, 2023

github-project-automation bot moved this to ✅ Done in ICCAT Product Drive Phase 2 (2022-10-15 --> 2023-05-27) Jun 12, 2023
