Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eurosat #122

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Eurosat #122

wants to merge 17 commits into from

Conversation

Prateek0xeo
Copy link

#104
md5 mismatch issue solved

@cregouby
Copy link
Contributor

Hello @Prateek0xeo,

Thanks a lot for this contribution !

I have a few comment on it :

  • We do not want to host archive files in the code repository, as it would make the package too heavy, up to the point it would not pass the CRAN check. Could you remove those zip and tar.gz file from the pull-request please ?
  • the check fails with information that should be fixed :
==> devtools::check()

══ Documenting ══════════════════════════════════════════════════════════
ℹ Updating torchvision documentation
ℹ Loading torchvision
Avis : Objects listed as exports, but not present in namespace:
• inat2021_dataset
✖ dataset-eurosat.R:28: @export must be a single line, not 5.
ℹ The first line is "Download the EuroSAT dataset"
✖ transforms-magick.R:56: S3 method `get_image_size.magick-image` needs
  @export or @exportS3method tag.
Writing NAMESPACE
...
── R CMD check results ────────────────────── torchvision 0.6.0.9000 ────
Duration: 47.6s

❯ checking Rd \usage sections ... WARNING
  Documented arguments not in \usage in Rd file 'eurosat_dataset.Rd':
    ‘i’
  
  Functions with \usage entries need to have the appropriate \alias
  entries, and all their arguments documented.
  The \usage entries must correspond to syntactically valid R code.
  See chapter ‘Writing R documentation files’ in the ‘Writing R
  Extensions’ manual.

Could you please fix those two as well ?

Thanks a lot in advance.

@Prateek0xeo
Copy link
Author

Sure! Thank you for the feedback, I'll improve upon the PR and make sure that all other Pull requests i make also reflect the same.

@cregouby
Copy link
Contributor

Then checking the use of ds <- eurosat_dataset(download = TRUE) I get an SSL error from the server certificate:

utils::download.file(self$resources$url, destfile = zip_file,  :
  URL 'https://madm.dfki.de/files/sentinel/EuroSAT.zip': statut 'SSL peer certificate or SSH remote key was not OK'

Thus libcurl do not download the zip file and the dataloader fails.

Error is confirmed by online SSL checkers :
image

@cregouby
Copy link
Contributor

I've contact the website owner for a fix / workaround.

@Prateek0xeo
Copy link
Author

@cregouby I have implemented all the changes you mentioned.
i have now removed the data files from the PR and improved Documents and ran the check cmd which gave no errors.
Pls check the PR once and tell me if there is anything else left unchecked so that i can move forward with rest of the Dataset loaders

@Prateek0xeo
Copy link
Author

Prateek0xeo commented Jan 17, 2025

Hello @cregouby,

I referred to the PyTorch repository and tried implementing the new URL and MD5 checksum provided in the code, but it still didn't resolve the issue. While the SSL certificate problem was fixed there, the dataset still doesn't load as expected.

I also took inspiration from the spam-loader review by @dfalbel and applied similar improvements to this PR.

@cregouby
Copy link
Contributor

Hello @Prateek0xeo

I've recieved an answer from the website administrator:

Dear Christophe,
First of all, thank you for reaching out and for your interest in our dataset.
Unfortunately, this is not the first time this issue has occurred, so I decided to host the dataset on Zenodo as well.
For a permanent fix, I would recommend using the version hosted on Zenodo (https://zenodo.org/records/7711810).
If you have any questions or need further information, please feel free to contact me.
Kind regards,

I can see in the meantime that you switch to huggingface dataset, which fails in my case with a wrong MD5

> ds <- eurosat_dataset(download = TRUE)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1173  100  1173    0     0   1123      0  0:00:01  0:00:01 --:--:--  1124
Erreur dans self$download() : 
  Downloaded file has an invalid MD5 checksum. Please try again.
Called from: self$download()

We can find the download failure root cause in the zip file :

Browse[1]> readLines(zip_file)
[1] "Found. Redirecting to https://cdn-lfs.hf.co/repos/fc/1d/fc1dee780dee1dae2ad48856d0961ac6aa5dfcaaaa4fb3561be4aedf19b7ccc7/8ebea626349354c5328b142b96d0430e647051f26efc2dc974c843f25ecf70bd?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27EuroSAT.zip%3B+filename%3D%22EuroSAT.zip%22%3B&response-content-type=application%2Fzip&Expires=1737474034&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczNzQ3NDAzNH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9mYy8xZC9mYzFkZWU3ODBkZWUxZGFlMmFkNDg4NTZkMDk2MWFjNmFhNWRmY2FhYWE0ZmIzNTYxYmU0YWVkZjE5YjdjY2M3LzhlYmVhNjI2MzQ5MzU0YzUzMjhiMTQyYjk2ZDA0MzBlNjQ3MDUxZjI2ZWZjMmRjOTc0Yzg0M2YyNWVjZjcwYmQ%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=buJTsafUfhVxwvReQoUSJcNl3VwU74g1YghMxdzBwM6G25WA3-eyL3M6edcgNDBDCEBGjaoJ026dpGx9RZ%7EXsyirvlbOdND95c1jOqIZU5Y3Lh67bSvZcbbSbgGOjq5OvHvKbMSmXBWvPGzG3Ody8Z8Fm%7ExN9mhW6jqwi6LaQ84pH2VW-bWLLXTyrzexXmrmY2FBplfM3ZGD%7EmiTx1JHu97SbY7S8x9GYAPW16vV6w%7EVIWJ5KztnTpF%7Ea78q8tYWasUttVF9q9SbMw2DJMLq3s7MyFZlxXnO6cSM4jMN1fowTxbjbV5-O6rSpf0jp0-FtpK5DfVQWEvTAnBZ3ZOg%7Ew__&Key-Pair-Id=K3RPWS32NSSJCE"

So see my remarks inline.

R/dataset-eurosat.R Outdated Show resolved Hide resolved
R/dataset-eurosat.R Outdated Show resolved Hide resolved
Copy link
Contributor

@cregouby cregouby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @Prateek0xeo

Thanks for submitting,
Please fix the remarks, and fix the tests.

question : as the huggingface dataset already includes a train / test / val split, is there a way to make those split available here with an option to the dataset like in mnist_dataset() ?
suggestion My proposal would be

#' @param split (character, optional): If `train` (default), creates the training dataset. Otherwise
#'   value should be either `test` or `val` for rerpectively test set or validation set.

tests/testthat/test-dataset-eurosat.R Outdated Show resolved Hide resolved
R/dataset-eurosat.R Outdated Show resolved Hide resolved
@Prateek0xeo
Copy link
Author

@cregouby I have already tried the data hosted on zenedo it is not working for me. I'll try to improvise on the huggingface dataset and make it work.

@Prateek0xeo
Copy link
Author

source("C:/Coding/torchvision/torchvision/R/dataset-eurosat.R")
ds <- eurosat_dataset(download = TRUE)
trying URL 'https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT.zip?download=true'
Content type 'application/zip' length 94280567 bytes (89.9 MB)
downloaded 89.9 MB

sample <- ds[1]
print(sample$x)
, , 1

       [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]

[1,] 0.5843137 0.5764706 0.5725490 0.5764706 0.5843137 0.5921569 0.5843137
[2,] 0.5843137 0.5843137 0.5843137 0.5803922 0.5882353 0.5921569 0.5803922

the dataloader seems to be working now I'll push the code after self review

@Prateek0xeo
Copy link
Author

@cregouby In this commit dataloader is working. I am still working on figuring out the splits functionalilty until then pls review the new commit.

@Prateek0xeo
Copy link
Author

Prateek0xeo commented Jan 20, 2025

Hello @cregouby
I contacted the dataset owner and received this message

Hi Prateek,

All of the datasets in https://huggingface.co/datasets/torchgeo/ are intended for use with the TorchGeo library. You can use it like so:

from torchgeo.datasets import EuroSAT

train_dataset = EuroSAT('data', split='train', download=True)
val_dataset = EuroSAT('data', split='val', download=True)
test_dataset = EuroSAT('data', split='test', download=True)
See https://torchgeo.readthedocs.io/en/latest/api/datasets.html#eurosat for documentation.

So i implemented the functionality using these URL Endpoints
List the split names for this dataset

curl -X GET
"https://datasets-server.huggingface.co/splits?dataset=torchgeo%2Feurosat"

but the Val split was not working and throwing an error
Screenshot 2025-01-20 210917

So i have contacted the Owner again for this issue https://huggingface.co/datasets/torchgeo/eurosat/discussions/2
The new PR should work once the dataset host is fixed
Pls go through the new commit and comment for any change

@Prateek0xeo
Copy link
Author

@cregouby there is a problem with image decoding with this dataset
ERROR - The dataset API returned a value for row$image that is not a valid Base64-encoded string, and the decoding step failed. This issue likely arises from the structure or format of the dataset returned by the API
Because of this i can't access the image.

@cregouby
Copy link
Contributor

cregouby commented Jan 21, 2025

Hello @Prateek0xeo

1 - The new download only gets the first 100 sample, which is not what we want. We want the full train dataset, i.e. 32.5k rows.
2 - You should not replace the previous test, but rather add the new tests in the same file.
3 - Image are jpg images, so the decoding cannot be made by base64. You should use jpeg:: that is already imported by the package.

I get the following error running the tests :

[ FAIL 2 | WARN 1 | SKIP 0 | PASS 3 ]

── Warning (test-dataset-eurosat.R:18:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
Skipping row 1 due to image decoding failure: I can only decode base64 strings
Backtrace:
    ▆
 1. ├─train_ds[1] at test-dataset-eurosat.R:18:3
 2. └─torch:::`[.dataset`(train_ds, 1) at test-dataset-eurosat.R:18:3
 3.   └─x$.getitem(y)
 4.     └─base::tryCatch(...) at torchvision_prateek/R/dataset-eurosat.R:82:5
 5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.           └─value[[3L]](cond)

── Failure (test-dataset-eurosat.R:19:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
!is.null(sample$x) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Image should not be null

── Failure (test-dataset-eurosat.R:20:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
!is.null(sample$y) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Label should not be null
[ FAIL 2 | WARN 1 | SKIP 0 | PASS 3 ]

In order to ensure the test runs correctly on all platforms, you should activate the github actions in the setup of your git repository, and then activate the R-CMD-check workflow.
image

So that for each or your commit, you will see test results over all platforms. All must be green, or you have to fix code.

@cregouby
Copy link
Contributor

Hello @Prateek0xeo
There is no need to change something in th e.github/workflow/ folder as everything is already here.
The setup I mentionned is to be done in your github repository.

@Prateek0xeo
Copy link
Author

Prateek0xeo commented Jan 22, 2025 via email

tests/testthat/test-dataset-eurosat.R Outdated Show resolved Hide resolved
tests/testthat/test-dataset-eurosat.R Outdated Show resolved Hide resolved
tests/testthat/test-dataset-eurosat.R Outdated Show resolved Hide resolved
tests/testthat/test-dataset-eurosat.R Outdated Show resolved Hide resolved
@Prateek0xeo
Copy link
Author

Prateek0xeo commented Jan 23, 2025

@cregouby This new commit has passed all tests.
I have used text files to split the zip.
train file -16200
test file - 5400
val file - 5400
this has been verfied

ds <- eurosat_dataset(root = "./data/eurosat", split = "val", download = TRUE)
Downloading EuroSAT ZIP...
trying URL 'https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT.zip?download=true'
Content type 'application/zip' length 94280567 bytes (89.9 MB)
downloaded 89.9 MB

EuroSAT ZIP downloaded.
Extracting EuroSAT ZIP...
Extraction complete.
Downloading split text file: https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/eurosat-val.txt?download=true
trying URL 'https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/eurosat-val.txt?download=true'
Content type 'text/plain; charset=utf-8' length 104659 bytes (102 KB)
downloaded 102 KB

sample <- ds[1]
print(dim(sample$x))
[1] 64 64 3
print(sample$y)
[1] "Highway"
sample <- ds[3000]
print(dim(sample$x))
[1] 64 64 3
print(sample$y)
[1] "PermanentCrop"

other that this I still have to run R-CMD check
i have merged the dataloader with my main branch and trying to run tests now.
i am running into some pronlems with th yaml file
but i will give you a full update when it is done

@Prateek0xeo
Copy link
Author

@cregouby all R-CMD-check are passed

@cregouby
Copy link
Contributor

Hello @Prateek0xeo

Can you review my 5 suggested changes ? https://github.com/mlverse/torchvision/pull/122/files/de833e1abb1976fa7af915c6feb490c06128785e

Copy link
Author

@Prateek0xeo Prateek0xeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestions are implemented

  1. the text file data is length is checked for exact amount of rows (3 suggestions)
    2.check added for tensor shape and tensor dtype of each x and y object (1 suggestion)
    What is the 5th suggestion? @cregouby

@cregouby
Copy link
Contributor

Hello @Prateek0xeo

Tests fails for me :

[ FAIL 2 | WARN 0 | SKIP 0 | PASS 5 ]

── Failure (test-dataset-eurosat.R:26:3): eurosat_dataset API handles splits correctly via API ──────────────────────────────────────────────────────────────────────────────────────────────
Train dataset should have exactly 16200 samples not equal to 16200.
1/1 mismatches
[1] 100 - 16200 == -16100

── Error (test-dataset-eurosat.R:28:3): eurosat_dataset API handles splits correctly via API ────────────────────────────────────────────────────────────────────────────────────────────────
Error in `initialize(...)`: Invalid split. Choose one of 'train', 'validation', or 'test'.
Backtrace:1. └─torch (local) eurosat_dataset(root = temp_root, split = "val", download = TRUE) at test-dataset-eurosat.R:28:3
 2.   └─d$new(...)
 3.     └─torchvision (local) initialize(...)
[ FAIL 2 | WARN 0 | SKIP 0 | PASS 5 ]

but it should'd be that hard to fix...

For readability, I would recommand to use the most advanced testthat capability. I.e. prefer expect_length() over expect_true(length()) and expect_match() instead of expect_true(!is.null()) and (expect_equal(type_of(), "character").
Please note that we never expect typeof() == as type are additive, but rather expect_is() which test for inheritance.

I would recommand you to mimic the tests that are performed on MNIST in tests/testthat/test-dataset-mnist.R L3-L28 for a good coverage (and specifically the ability to get torch_tensors out of transform = transform_to_tensor )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants