When do we need media.csv? #275

Closed
peterdesmet opened this issue Sep 6, 2023 · 1 comment
Labels
camtrapdp/camtraptor (to be decided if this is related to camtrapdp or camtraptor), enhancement (new feature or request)

Comments

peterdesmet (Member) commented Sep 6, 2023

media.csv is often the largest file to read, which is why read_camtrap_dp() initially had an option to skip it while reading data. We plan to remove that parameter, both to avoid complexity in downstream functions and to spare users from having to make this choice.

Alternative options to speed up reading:

  1. Only select relevant columns from media (with col_select)
  2. Only read media.csv in functions that actually need it (mostly write_ functions)
  3. A combination of the two
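As a rough illustration of option 1, here is a minimal, language-neutral sketch in Python (standard library only) of keeping just a subset of columns while reading a media table. It is not the package's R code: the helper `read_selected_columns()` and the sample rows are made up for this example. In R, the equivalent would be the `col_select` argument mentioned above.

```python
import csv
import io


def read_selected_columns(handle, columns):
    """Read a CSV from an open handle and keep only the named columns.

    Hypothetical helper for illustration; not part of camtraptor or camtrapdp.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Each row is reduced to the requested columns, in the requested order.
    return [{c: row[c] for c in columns} for row in reader]


# Made-up two-row media table, standing in for a real media.csv.
sample = io.StringIO(
    "mediaID,deploymentID,captureMethod,timestamp,filePath,exifData\n"
    "m1,d1,activityDetection,2020-01-01T10:00:00+00:00,a/1.jpg,{}\n"
    "m2,d2,timeLapse,2020-01-02T11:00:00+00:00,a/2.jpg,{}\n"
)

media = read_selected_columns(
    sample, ["mediaID", "deploymentID", "captureMethod", "timestamp"]
)
```

Note that this sketch still parses every field of every row and only saves memory afterwards; whether a real reader also saves parsing time depends on how it implements column selection.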

For documentation, here's when I think we need media columns:

| column | write_camtrap_dp() | write_dwc() | other functions |
| --- | --- | --- | --- |
| mediaID | yes | yes | yes (for reference) |
| deploymentID | yes | yes (for join) | yes (for join) |
| captureMethod | yes | yes | potentially (as filter) |
| timestamp | yes | yes | yes (for join) |
| filePath | yes | yes | potentially |
| filePublic | yes | yes | potentially |
| fileName | yes | yes (for sorting) | unlikely |
| fileMediatype | yes | yes | potentially (as filter) |
| exifData | yes | no | unlikely |
| favorite | yes | yes | unlikely |
| mediaComments | yes | yes | unlikely |

So a potential solution could be:

  1. read_camtrap_dp() uses col_select and only reads mediaID, deploymentID, captureMethod and timestamp. This will speed up this function.
  2. filter_ functions applied to deployments and observations also filter the media (on deploymentID, mediaID or a timestamp that falls between eventStart and eventEnd).
  3. write_ functions read the full media.csv (which can still be found using the $directory) and join it with the (potentially filtered) media already in memory. This will slow down those functions.
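The three steps above can be sketched as follows, again as a language-neutral Python illustration rather than the package's R implementation. All function names (`filter_by_deployment()`, `filter_by_event()`, `rejoin_full()`) and the sample rows are assumptions made up for this example. The "join" in step 3 is modelled as keeping the full rows whose mediaID survived filtering.

```python
from datetime import datetime

# Made-up full media table, standing in for media.csv on disk (step 3).
full_media = [
    {"mediaID": "m1", "deploymentID": "d1",
     "timestamp": "2020-01-01T10:00:00+00:00", "filePath": "a/1.jpg"},
    {"mediaID": "m2", "deploymentID": "d1",
     "timestamp": "2020-01-01T12:00:00+00:00", "filePath": "a/2.jpg"},
    {"mediaID": "m3", "deploymentID": "d2",
     "timestamp": "2020-01-02T09:00:00+00:00", "filePath": "b/3.jpg"},
]

# Step 1: the slim media kept in memory after reading (subset of columns).
slim_keys = ("mediaID", "deploymentID", "timestamp")
slim_media = [{k: row[k] for k in slim_keys} for row in full_media]


def filter_by_deployment(media, deployment_ids):
    """Step 2a: keep media belonging to the given deployments."""
    ids = set(deployment_ids)
    return [m for m in media if m["deploymentID"] in ids]


def filter_by_event(media, deployment_id, event_start, event_end):
    """Step 2b: keep media of one deployment whose timestamp falls
    between event_start and event_end (inclusive)."""
    start = datetime.fromisoformat(event_start)
    end = datetime.fromisoformat(event_end)
    return [
        m for m in media
        if m["deploymentID"] == deployment_id
        and start <= datetime.fromisoformat(m["timestamp"]) <= end
    ]


def rejoin_full(filtered, full):
    """Step 3: recover all columns for the media that survived filtering."""
    keep = {m["mediaID"] for m in filtered}
    return [row for row in full if row["mediaID"] in keep]


kept = filter_by_event(
    slim_media, "d1", "2020-01-01T09:30:00+00:00", "2020-01-01T10:30:00+00:00"
)
to_write = rejoin_full(kept, full_media)
```

The cost described in step 3 shows up here as the second pass over `full_media`: the write path has to read the complete file again even though only a subset of rows is written.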

I don't know what the performance gain would be, or whether users are more likely to be patient with read_camtrap_dp() or with the write_ functions. What is certain is that far more users will use read_camtrap_dp() than the write_ functions, so any speed gain there benefits more users. I would wait and see if we hear about performance issues before considering the approach described above.

peterdesmet added the enhancement (new feature or request) label Sep 6, 2023
peterdesmet added the camtrapdp/camtraptor (to be decided if this is related to camtrapdp or camtraptor) label Mar 6, 2024
peterdesmet (Member, Author) commented

Making media conditional adds complexity, so I would always read media (as is done in https://inbo.github.io/camtrapdp/reference/read_camtrapdp.html).
