"Reproducing our work" does not specify set of languages and snapshots #26

leezu · 2021-04-01T17:19:28Z

README.md provides python -m cc_net --config reproduce --dump 2019-09 as an example for reproducing the cc_net corpus. That command relies on REPRODUCE_CONFIG in

cc_net/cc_net/mine.py

Lines 172 to 191 in 242e10d

    REPRODUCE_CONFIG = Config(
        config_name="reproduce",
        dump="2019-09",
        mined_dir="reproduce",
        pipeline=["fetch_metadata", "keep_lang", "keep_bucket", "split_by_lang"],
        metadata="https://dl.fbaipublicfiles.com/cc_net/1.0.0",
        # Optional filtering:
        # It won't change much the execution speed, but decreases the disk requirement.
        # Restrict languages
        lang_whitelist=["fr"],
        # Restrict perplexity buckets
        # Top languages have been split in perplexity buckets according
        # to a Wikipedia trained LM.
        # The buckets from low perplexity (good) to high (bad) are:
        # ["head", "middle", "tail"]
        # Languages without a LM have only one bucket "all".
        # It won't change much the execution speed, but decreases the disk requirement.
        keep_bucket=["head", "all"],
        mine_num_processes=1,
    )
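
To make the effect of the two filters concrete, here is a minimal sketch of how lang_whitelist and keep_bucket narrow the corpus, and how the config could be widened. It is an illustration under assumptions, not cc_net's actual code: the Config class below is a stand-in limited to the fields quoted above, the keeps helper and the document keys "language" and "bucket" are hypothetical, and "de" is only an example language whose metadata may or may not exist for this dump.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    """Stand-in for cc_net's Config, limited to the fields quoted above."""
    config_name: str = "base"
    dump: str = "2019-09"
    mined_dir: str = "mined"
    pipeline: Sequence[str] = ()
    metadata: Optional[str] = None
    lang_whitelist: Sequence[str] = ()
    keep_bucket: Sequence[str] = ()
    mine_num_processes: int = 1


REPRODUCE_CONFIG = Config(
    config_name="reproduce",
    dump="2019-09",
    mined_dir="reproduce",
    pipeline=["fetch_metadata", "keep_lang", "keep_bucket", "split_by_lang"],
    metadata="https://dl.fbaipublicfiles.com/cc_net/1.0.0",
    lang_whitelist=["fr"],
    keep_bucket=["head", "all"],
    mine_num_processes=1,
)


def keeps(config: Config, doc: dict) -> bool:
    """Hypothetical model of the keep_lang and keep_bucket steps.

    Assumption: an empty list means "no restriction", and each document
    carries "language" and "bucket" keys.
    """
    lang_ok = not config.lang_whitelist or doc["language"] in config.lang_whitelist
    bucket_ok = not config.keep_bucket or doc["bucket"] in config.keep_bucket
    return lang_ok and bucket_ok


# Widen the selection: one more (example) language and the "middle" bucket.
wider = REPRODUCE_CONFIG._replace(
    lang_whitelist=["fr", "de"],
    keep_bucket=["head", "middle", "all"],
)
```

With the quoted defaults only French documents in the "head" or "all" bucket survive, which is why the resulting corpus is small. Widening either list only helps if metadata for those languages was actually published, which is the open question in this issue.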

The combination of dump 2019-09 and the French language yields only a small corpus. Because the metadata files are only accessible via https://dl.fbaipublicfiles.com/cc_net/1.0.0, it is impossible to list the underlying S3 bucket to obtain a complete list of available languages and dumps. It would therefore be helpful if you could provide the complete list in your README.
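
Until such a list exists, one workaround is to probe candidate paths one by one, since a bucket that cannot be listed can still answer requests for individual files. The sketch below is only that: the file layout under the metadata URL is not documented in this thread, so candidate relative paths must be supplied by the caller, and the helper names url_exists and probe are hypothetical.

```python
import urllib.request
from typing import Callable, Iterable, List

BASE_URL = "https://dl.fbaipublicfiles.com/cc_net/1.0.0"


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request for `url` succeeds with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers HTTPError (4xx/5xx), URLError and socket timeouts.
        return False


def probe(
    candidate_paths: Iterable[str],
    exists: Callable[[str], bool] = url_exists,
    base_url: str = BASE_URL,
) -> List[str]:
    """Return the candidate relative paths that exist under base_url.

    The candidates are guesses made by the caller; this function does not
    know, and does not assume, how the metadata files are named.
    """
    found = []
    for path in candidate_paths:
        url = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
        if exists(url):
            found.append(path)
    return found
```

The exists parameter is injectable so the logic can be checked offline with a fake checker before sending real requests. Probing only confirms guesses, so it is no substitute for an authoritative list in the README.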

mome1024 · 2021-09-16T09:17:51Z

https://dl.fbaipublicfiles.com/cc_net/1.0.0
Is the website accessible?

I can't access this website properly.

mome1024 · 2021-09-16T09:20:17Z

https://dl.fbaipublicfiles.com/cc_net/1.0.0
Is the website accessible?

I can't access this website properly.
