
Would the author be interested in scaling the dataset up to the terabyte level next? #36

Open
ArrogantL opened this issue Dec 14, 2019 · 2 comments

Comments

@ArrogantL

No description provided.

@ArrogantL
Author

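Common Crawl contains more than seven years of web crawl data, including raw web page data, extracted metadata, and extracted text.
It includes a large amount of Chinese text that could be extracted.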
[1]Buck C, Heafield K, Van Ooyen B. N-gram Counts and Language Models from the Common Crawl[C]//LREC. 2014, 2: 4.
[2]Smith J R, Saint-Amand H, Plamada M, et al. Dirt cheap web-scale parallel text from the common crawl[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, 1: 1374-1383.
[3]Spiegler S. Statistics of the common crawl corpus 2012[R]. Technical report, SwiftKey, 2013.
[4]Mühleisen H, Bizer C. Web Data Commons-Extracting Structured Data from Two Large Web Corpora[J]. LDOW, 2012, 937: 133-145.
[5]Bizer C, Eckert K, Meusel R, et al. Deployment of rdfa, microdata, and microformats on the web–a quantitative analysis[C]//International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013: 17-32.
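
As a rough illustration of what "extracting Chinese text" from already-extracted plain text might involve, here is a minimal sketch. It is not the method used by this repository or by any of the papers above. The helper names `cjk_ratio` and `keep_chinese` and the 0.5 threshold are hypothetical choices made for this example, and the check only looks at the CJK Unified Ideographs block, so it cannot tell Chinese apart from Japanese kanji.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+4E00..U+9FFF covers the basic CJK Unified Ideographs (assumption: enough for a first pass).
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def keep_chinese(lines, threshold=0.5):
    """Yield lines whose CJK ratio is at least `threshold` (an arbitrary cutoff)."""
    for line in lines:
        if cjk_ratio(line) >= threshold:
            yield line
```

A real pipeline over terabytes of crawl data would probably also need language identification, deduplication, and quality filtering; this sketch only shows the simplest character-based filter.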

@brightmart
Owner

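I'm interested. Could you add me through the QQ group so we can work on this together? You are welcome to join the group 中文预训练模型transform (Chinese pre-trained model transform), group number: 836811304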
