Would the author be interested in scaling the dataset up to the terabyte level as a next step? #36

ArrogantL · 2019-12-14T02:36:41Z

No description provided.

ArrogantL · 2019-12-14T03:36:20Z

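Common Crawl contains more than seven years of web crawl data, including raw web page data, extracted metadata, and extracted text.
It includes a large amount of Chinese text that could be extracted.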
[1] Buck C, Heafield K, Van Ooyen B. N-gram Counts and Language Models from the Common Crawl[C]//LREC. 2014, 2: 4.
[2] Smith J R, Saint-Amand H, Plamada M, et al. Dirt cheap web-scale parallel text from the common crawl[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, 1: 1374-1383.
[3] Spiegler S. Statistics of the common crawl corpus 2012[R]. Technical report, SwiftKey, 2013.
[4] Mühleisen H, Bizer C. Web Data Commons-Extracting Structured Data from Two Large Web Corpora[J]. LDOW, 2012, 937: 133-145.
[5] Bizer C, Eckert K, Meusel R, et al. Deployment of rdfa, microdata, and microformats on the web–a quantitative analysis[C]//International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013: 17-32.

brightmart · 2019-12-14T03:43:27Z

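I'm interested. Could you add me through the QQ group so we can work on this together? You're welcome to join the QQ group 中文预训练模型transform (Chinese pre-trained models); group number: 836811304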
