This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

"Reproducing our work" does not specify set of languages and snapshots #26

Open
leezu opened this issue Apr 1, 2021 · 2 comments

leezu commented Apr 1, 2021

README.md provides python -m cc_net --config reproduce --dump 2019-09 as an example to reproduce the cc_net corpus, which relies on

cc_net/cc_net/mine.py

Lines 172 to 191 in 242e10d

REPRODUCE_CONFIG = Config(
    config_name="reproduce",
    dump="2019-09",
    mined_dir="reproduce",
    pipeline=["fetch_metadata", "keep_lang", "keep_bucket", "split_by_lang"],
    metadata="https://dl.fbaipublicfiles.com/cc_net/1.0.0",
    # Optional filtering:
    # It won't change much the execution speed, but decreases the disk requirement.
    # Restrict languages
    lang_whitelist=["fr"],
    # Restrict perplexity buckets
    # Top languages have been split in perplexity buckets according
    # to a Wikipedia trained LM.
    # The buckets from low perplexity (good) to high (bad) are:
    # ["head", "middle", "tail"]
    # Languages without a LM have only one bucket "all".
    # It won't change much the execution speed, but decreases the disk requirement.
    keep_bucket=["head", "all"],
    mine_num_processes=1,
)

The combination of dump 2019-09 and the French language yields only a small corpus. Because the metadata files are accessible only via https://dl.fbaipublicfiles.com/cc_net/1.0.0, it is impossible to list the underlying S3 bucket to obtain a complete list of available languages and dumps. It would therefore be helpful if you could provide the complete list in the README.
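To make the request concrete, here is a minimal sketch of how the two filter fields would be widened once the available values are known. `ReproduceConfig` is a hypothetical stand-in that copies only some of the fields quoted above, not the real `Config` class from cc_net/cc_net/mine.py, and the extra language codes are placeholders: whether metadata exists for them is exactly what this issue asks to have documented.

```python
from typing import NamedTuple, Sequence


class ReproduceConfig(NamedTuple):
    """Hypothetical stand-in holding a subset of the fields quoted in this issue."""

    config_name: str = "reproduce"
    dump: str = "2019-09"
    mined_dir: str = "reproduce"
    metadata: str = "https://dl.fbaipublicfiles.com/cc_net/1.0.0"
    lang_whitelist: Sequence[str] = ("fr",)
    keep_bucket: Sequence[str] = ("head", "all")
    mine_num_processes: int = 1


def widen(base: ReproduceConfig, langs: Sequence[str], buckets: Sequence[str]) -> ReproduceConfig:
    """Return a copy of `base` with different language and bucket filters."""
    # NamedTuple instances are immutable, so _replace builds a new config.
    return base._replace(lang_whitelist=list(langs), keep_bucket=list(buckets))


base = ReproduceConfig()
# "en" and "de" are placeholder codes; their availability is not confirmed here.
wider = widen(base, ["fr", "en", "de"], ["head", "middle", "all"])
print(wider.lang_whitelist, wider.keep_bucket)
```

Without a published list of languages and dumps, the only way to find valid values for these fields is trial and error.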

@mome1024

https://dl.fbaipublicfiles.com/cc_net/1.0.0
Is the website accessible?

I can't access this website properly.
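One way to narrow down what "can't access" means is to look at the HTTP status the server returns. The helper below is a rough sketch using only the Python standard library; `probe` is a hypothetical name, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen, timeout=10.0):
    """Return the HTTP status code for a HEAD request, or None if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None
```

Calling `probe("https://dl.fbaipublicfiles.com/cc_net/1.0.0")` would then distinguish an unreachable host (`None`) from a server that responds with an error code. Since the issue above notes that the bucket cannot be listed, an error status for the bare base URL would not by itself show that the individual metadata files are gone.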

