This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
README.md provides
python -m cc_net --config reproduce --dump 2019-09
as an example to reproduce the cc_net corpus, which relies on cc_net/cc_net/mine.py, lines 172 to 191 in 242e10d:
# It won't change much the execution speed, but decreases the disk requirement.
# Restrict languages
lang_whitelist=["fr"],
# Restrict perplexity buckets
# Top languages have been split in perplexity buckets according
# to a Wikipedia trained LM.
# The buckets from low perplexity (good) to high (bad) are:
# ["head", "middle", "tail"]
# Languages without a LM have only one bucket "all".
# It won't change much the execution speed, but decreases the disk requirement.
keep_bucket=["head", "all"],
mine_num_processes=1,
)
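To make the effect of the two optional filters in that excerpt concrete, here is a minimal sketch of the selection they describe. It is not cc_net's implementation: the `keep_document` helper and the `language` / `bucket` record fields are hypothetical names chosen for illustration, and only the values `["fr"]` and `["head", "all"]` come from the quoted configuration.

```python
from typing import Iterable, Optional

# Values copied from the quoted configuration excerpt.
LANG_WHITELIST = ["fr"]
KEEP_BUCKET = ["head", "all"]


def keep_document(
    doc: dict,
    lang_whitelist: Optional[Iterable[str]] = LANG_WHITELIST,
    keep_bucket: Optional[Iterable[str]] = KEEP_BUCKET,
) -> bool:
    """Return True if a document record passes both optional filters.

    Assumption: each record carries a "language" code and a "bucket" label
    ("head", "middle", "tail", or "all" for languages without a LM).
    A filter set to None is treated as disabled.
    """
    if lang_whitelist is not None and doc.get("language") not in set(lang_whitelist):
        return False
    if keep_bucket is not None and doc.get("bucket") not in set(keep_bucket):
        return False
    return True


# Hypothetical records, used only to show which ones survive the filters.
docs = [
    {"language": "fr", "bucket": "head"},
    {"language": "fr", "bucket": "tail"},
    {"language": "en", "bucket": "head"},
]
kept = [d for d in docs if keep_document(d)]
print(kept)
```

With both filters active, only French documents in the "head" (or "all") bucket remain, which is consistent with the small corpus described in this issue.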
The combination of dump 2019-09 and the French language provides only a small corpus. Because the metadata files are only accessible via https://dl.fbaipublicfiles.com/cc_net/1.0.0, it is impossible to list the underlying S3 bucket to obtain a complete list of available languages and dumps. It would therefore be helpful if you could provide the complete list in your README.
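Until such a list exists, one possible workaround is to probe candidate dump names one by one and see which URLs respond. The sketch below is only an illustration under loud assumptions: the `{base}/{dump}/{file_name}` URL layout, the `candidate_urls`, `url_exists`, and `probe` helpers, the second dump name, and the file name `example.json.gz` are all hypothetical and are not taken from the cc_net documentation.

```python
import urllib.request
from typing import Callable, Dict, Iterable, List

BASE_URL = "https://dl.fbaipublicfiles.com/cc_net/1.0.0"


def candidate_urls(base: str, dumps: Iterable[str], file_name: str) -> List[str]:
    """Build one candidate URL per dump.

    Assumption: files live under {base}/{dump}/{file_name}; the real layout
    of the metadata may differ.
    """
    return [f"{base.rstrip('/')}/{dump}/{file_name}" for dump in dumps]


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the server answers with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 404, connection failures, and timeouts.
        return False


def probe(
    urls: Iterable[str], exists: Callable[[str], bool] = url_exists
) -> Dict[str, bool]:
    """Map each candidate URL to whether it appears to exist."""
    return {url: exists(url) for url in urls}


# Offline demonstration with a stand-in checker, so no request is sent here.
# Omit the `exists` argument to perform real HEAD requests with url_exists.
urls = candidate_urls(BASE_URL, ["2019-09", "2019-13"], "example.json.gz")
result = probe(urls, exists=lambda url: "/2019-09/" in url)
for url, found in result.items():
    print("found  " if found else "missing", url)
```

Probing like this only confirms guesses; it cannot discover dumps or languages nobody thought to try, which is why a published list would still be preferable.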