Mapping between old and new filenames #6

Closed
junsukchoe opened this issue Dec 28, 2020 · 13 comments

Comments

@junsukchoe

Hello,

It seems that the file names of ImageNetV2 have been changed.
Could you provide the mappings between old and new filenames?

Thanks!

@Vaishaal
Collaborator

Hello! What are the file name changes you are seeing? I had changed the filenames in the public release temporarily and rolled it back. Could you check now to see if the file names are still different?

@expectopatronum

Hi!
It looks like not only the file names have changed, but also the number of files.

I downloaded the dataset from here on December 17th 2020. After unpacking, it consisted of 3 directories (imagenet-matched-frequency-format-val, imagenet-threshold-0.7-format-val, imagenet-top-images-format-val), each containing 1000 directories with names like n03461385. The directories in imagenet-matched-frequency-format-val, for example, each contain the images 0.jpeg ... 19.jpeg (which amounts to 20,000 images instead of the mentioned 10,000).

I downloaded the dataset again 1 or 2 days ago and all of a sudden the directory imagenet-matched-frequency-format-val contains directories 0 ... 999 and each of them contains 10 images each with names like 7e4a8987a9a330189cc38c4098b1c57ac301713f.jpeg.
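
To check which of the two layouts described above a given download has, the class directories and the images in each can be counted. This is only a minimal sketch: `summarize_layout` and `IMAGE_SUFFIXES` are hypothetical names, not part of any ImageNetV2 tooling, and it assumes the unpacked directory contains one sub-directory per class.

```python
from collections import Counter
from pathlib import Path

# Assumption: images in the release use one of these extensions.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def summarize_layout(root):
    """Return (number of class dirs, total images, Counter of images per dir)."""
    root = Path(root)
    class_dirs = [p for p in root.iterdir() if p.is_dir()]
    per_dir = Counter()
    total = 0
    for class_dir in class_dirs:
        # Count only regular files that look like images.
        n_images = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        per_dir[n_images] += 1
        total += n_images
    return len(class_dirs), total, per_dir
```

Going by the description above, the December download would show 1000 directories with 20 images each, and the later download 1000 directories with 10 images each.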

At first I thought I had mixed something up, but I had documented everything in December when I first downloaded it, and even my browser remembered that I downloaded it from exactly the same URL.

So, what's going on? Could you clarify what the correct version is (I assume the latter with 10k images)? But where do the additional images in the directory I downloaded in December come from?

Best regards
Verena

@Vaishaal
Collaborator

Hi @expectopatronum & @junsukchoe

The current dataset release (the one you can download right now, with 10k images) is the correct one. We had a mixup with our S3 bucket in October 2020 and all our files got deleted, and we re-uploaded the dataset to the same locations.

The long names like "7e4a8987a9a330189cc38c4098b1c57ac301713f" are our internal candidate ids and were added to the release to allow you to merge the images with the data structures/labels found in this repository and in our other project: https://github.com/modestyachts/evaluating_machine_accuracy_on_imagenet. You were right @junsukchoe, this is indeed a change in our current release compared with our old release from before October 2020.

The extra 10k images are duplicates so you can ignore them!

I can dig up the exact mapping between the filenames of the old release (from before October 2020) and the new release if you need it!

Thanks,
Vaishaal Shankar

@m-Just

m-Just commented Mar 12, 2021

Hi! The new directory names 0, 1, ..., 999 cause trouble when using torchvision.datasets.ImageFolder, which sorts the names as strings into 0, 1, 10, 100, ..., 999, an order different from the original one. To get around this, I zero-padded all directory names to 4 digits and it worked. In Python:

import os, glob

for path in glob.glob('../dataset/imagenetv2*'):
    if os.path.isdir(path):
        for subpath in glob.glob(f'{path}/*'):
            dirname = subpath.split('/')[-1]
            os.rename(subpath, '/'.join(subpath.split('/')[:-1]) + '/' + dirname.zfill(4))
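
The ordering problem described above can be reproduced without touching the files, since ImageFolder assigns class indices by sorting the directory names as strings. The sketch below shows the resulting mismatch and builds a lookup from the assigned index back to the number in the directory name. `imagefolder_index_to_label` is a hypothetical helper, and it assumes each directory name is the intended integer class label.

```python
def imagefolder_index_to_label(class_names):
    """Map the index a string sort assigns to the integer in the dir name.

    Element i of the result is the intended label of the directory that
    ends up at position i after sorting the names as strings.
    """
    return [int(name) for name in sorted(class_names)]


# Unpadded names, as in the current release: string order != numeric order.
unpadded = [str(i) for i in range(1000)]
index_to_label = imagefolder_index_to_label(unpadded)
print(index_to_label[:4])  # [0, 1, 10, 100]

# Zero-padded names, as produced by the renaming snippet above.
padded = [str(i).zfill(4) for i in range(1000)]
print(imagefolder_index_to_label(padded) == list(range(1000)))  # True
```

If renaming the directories is not an option, a lookup like `index_to_label` could instead be applied to the targets, for example through ImageFolder's `target_transform` argument.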

@sbelharbi

hi @Vaishaal

just downloaded the dataset imagenetv2-threshold0.7 using this link from here.
when untarred, there is only one folder, imagenetv2-threshold0.7-format-val.
and as others mentioned, the files have names like faa7b8da1c2a3f0fee1814d01d1afffb4b5952f7.jpeg.

you wrote above: "I can dig up the exact mapping between the filenames old release (from Pre October 2020) and the new release if you need it!"

any news on the mapping?

@expectopatronum did you find a way around this?

i really appreciate your help
thanks

@Vaishaal
Collaborator

The tar.gz should have 1000 sub-folders, which correspond to the 1000 ImageNet classes (https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a)

Is this not what you see?

@Vaishaal
Collaborator

If you are using PyTorch, you can use https://github.com/modestyachts/ImageNetV2_pytorch to load the dataset.

@sbelharbi

hi,
the issue has nothing to do with the dataset itself.
the issue is that the file names have changed, which causes a problem in another repo that uses the old names.
a file used to be named 0.jpeg, for example, and now it is faa7b8da1c2a3f0fee1814d01d1afffb4b5952f7.jpeg.
someone made additional annotations based on the old naming.
since, as you said above, you changed the naming system, i was wondering if you have the mapping from the old names to the new names.

thanks

@Vaishaal
Collaborator

Oh, I did not realize there was a dependency on the filenames! We actually lost the old version of the dataset because the newer version with the candidate ids allows us to associate each image in the release with the rest of the metadata we've released in https://github.com/modestyachts/ImageNetV2.

If you have a copy of the old dataset lying around I can probably generate the mapping quite easily but right now I don't have access to the old dataset.
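
For anyone who does still have a copy of the old release, one possible way to build such a mapping is to match files by a hash of their contents. This is only a sketch under a strong assumption: that the same image is byte-identical in both releases, which the thread does not confirm. It does not assume that the new 40-character file names are themselves content hashes. `content_index` and `map_old_to_new` are hypothetical names.

```python
import hashlib
from pathlib import Path


def content_index(root):
    """Map SHA-1 of file bytes -> path relative to root, for files under root."""
    root = Path(root)
    index = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            # If several files share the same bytes, keep the first path seen.
            index.setdefault(digest, path.relative_to(root).as_posix())
    return index


def map_old_to_new(old_root, new_root):
    """Return ({old relative path: new relative path}, [unmatched old paths])."""
    old_root = Path(old_root)
    new_index = content_index(new_root)
    mapping, unmatched = {}, []
    # Walk every old file so that duplicate old images are all mapped.
    for path in sorted(old_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(old_root).as_posix()
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        if digest in new_index:
            mapping[rel] = new_index[digest]
        else:
            unmatched.append(rel)
    return mapping, unmatched
```

If many files end up in `unmatched`, the images were probably re-encoded between releases, and a comparison of decoded pixels would be needed instead of a byte-level hash.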

@sbelharbi

i don't have the old dataset, but the author of the additional annotation, @junsukchoe, might.

thanks

@ludles

ludles commented Oct 17, 2021

@m-Just Thank you for your snippet! It solves the problem. I made the following minor adjustments to make it more robust across operating systems (Windows 10 uses a different path separator from Linux).

import glob
import os

for path in glob.glob('../dataset/imagenetv2*'):
    if os.path.isdir(path):
        for subpath in glob.glob(f'{path}/*'):
            dirname = os.path.basename(subpath)
            # zero-pad the class directory name to 4 digits, e.g. 7 -> 0007
            os.rename(subpath, os.path.join(os.path.dirname(subpath), dirname.zfill(4)))

@tangtaogo

So what's the mapping between old and new filenames? Why not just keep it consistent with the original val set?

@Vaishaal
Collaborator

Ah sorry, we lost the old filenames. You can use the ImageNetV2 PyTorch dataloader (https://github.com/modestyachts/ImageNetV2_pytorch) if you'd like code that loads the dataset correctly, so that it is compatible with ImageNet-Val.
