
[OSS101] Task 3: Construction of Dataset and Method Implementation for Named Entity Recognition in the Open Source Community #59

Open
PureNatural opened this issue May 13, 2024 · 4 comments

Comments

@PureNatural
Collaborator

Description

This task aims to construct a named entity recognition (NER) dataset for the open-source community and to implement corresponding methods. By collecting and annotating textual data from the open-source community, especially content containing named entities, you will create a dataset for training and evaluating NER models. You will also explore and implement NER methods, such as rule-based, statistical, or deep learning approaches, to improve the performance and applicability of the models.

The relevant code and dataset for this task need to be provided in the repository.
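
As a rough illustration of the rule-based end of the approaches listed above, here is a minimal sketch of a regex tagger. The entity types (`VERSION`, `LICENSE`, `LANGUAGE`), the patterns, and the function name `rule_based_ner` are assumptions made up for this example; the task does not prescribe them, and a real baseline would need a much richer rule set.

```python
import re

# Hypothetical entity types and patterns, chosen only for illustration.
PATTERNS = [
    ("VERSION", re.compile(r"\bv?\d+\.\d+(?:\.\d+)?\b")),
    ("LICENSE", re.compile(r"\b(?:MIT|Apache-2\.0|GPL-3\.0|BSD-3-Clause)\b")),
    ("LANGUAGE", re.compile(r"\b(?:Python|Rust|Go|Java|TypeScript)\b")),
]


def rule_based_ner(text):
    """Return non-overlapping (start, end, label, surface) spans in text order."""
    candidates = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), label, m.group()))
    # Earlier spans win; when two spans start together, the longer one wins.
    candidates.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    spans, last_end = [], 0
    for span in candidates:
        if span[0] >= last_end:  # skip anything overlapping an accepted span
            spans.append(span)
            last_end = span[1]
    return spans


if __name__ == "__main__":
    sample = "Requires Python 3.10 and ships under the MIT license since v2.1.0."
    for start, end, label, surface in rule_based_ner(sample):
        print(label, surface, start, end)
```

A tagger like this is mostly useful as a baseline or as a pre-annotation step to speed up manual labeling; statistical or deep learning models would then be trained and evaluated on the corrected annotations.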

@YeexiaoZheng

Do we need to obtain the dataset for this openperf project ourselves?

@YeexiaoZheng

Or can we use any existing open-source dataset as our experimental dataset?

@PureNatural
Collaborator Author

Or can we use any existing open-source dataset as our experimental dataset?

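You need to build the dataset yourselves, using text data from the open-source ecosystem. For example, you can fetch the README documents of open-source repositories and extract the entities they contain. The description of each repository can be obtained through the official GitHub REST API, and GitHub event log data can be obtained from https://www.gharchive.org/, which includes the comment text associated with all issues, PRs, and commits.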
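
The two data sources mentioned above could be accessed roughly as follows. This is a sketch under stated assumptions, not a prescribed pipeline: it uses the GitHub REST endpoint `GET /repos/{owner}/{repo}/readme` (which returns the file base64-encoded) and assumes GH Archive's hourly gzipped JSON-lines files served from `data.gharchive.org`. All function names are hypothetical, and the set of event types to keep is a choice made for this example.

```python
import base64
import gzip
import io
import json
import urllib.request

GITHUB_API = "https://api.github.com"
GHARCHIVE = "https://data.gharchive.org"  # assumed download host for hourly dumps

# Event types whose payload carries a comment body (a choice for this sketch).
COMMENT_EVENTS = {
    "IssueCommentEvent",
    "CommitCommentEvent",
    "PullRequestReviewCommentEvent",
}


def readme_url(owner, repo):
    """REST endpoint that returns a repository's README as JSON."""
    return f"{GITHUB_API}/repos/{owner}/{repo}/readme"


def decode_readme(api_response):
    """The README endpoint returns the file body base64-encoded in `content`."""
    raw = base64.b64decode(api_response["content"])
    return raw.decode("utf-8", errors="replace")


def fetch_readme(owner, repo):
    """Network call; unauthenticated requests are rate-limited by GitHub."""
    request = urllib.request.Request(
        readme_url(owner, repo),
        headers={"Accept": "application/vnd.github+json",
                 "User-Agent": "oss-ner-sketch"},
    )
    with urllib.request.urlopen(request) as response:
        return decode_readme(json.load(response))


def archive_url(date, hour):
    """One archive per UTC hour, e.g. 2024-05-13-0.json.gz (hour not zero-padded)."""
    return f"{GHARCHIVE}/{date}-{hour}.json.gz"


def comment_texts(gz_bytes):
    """Yield (repo_name, comment_body) for comment events in one hourly archive."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue
            event = json.loads(line)
            if event.get("type") not in COMMENT_EVENTS:
                continue
            body = event.get("payload", {}).get("comment", {}).get("body")
            if body:
                yield event.get("repo", {}).get("name"), body
```

To process a real hour of data, download the bytes from `archive_url("2024-05-13", 0)` with `urllib.request.urlopen(...).read()` and pass them to `comment_texts`. Hourly archives are large, so filtering by repository or sampling before annotation is probably worthwhile.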

The final dataset should look similar to the image below:
image

You can define the entity types yourselves. In short, any named entity task set in the open-source community context is acceptable.
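
Since the entity types are left open, one possible way to store annotations is a token-per-row BIO layout, as commonly used for NER corpora. The sketch below converts character-level spans into such rows. The layout is an assumption and may differ from the image referenced above; the sentence, the labels `LIBRARY` and `PLATFORM`, and the name `spans_to_bio` are invented for illustration.

```python
import re


def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans to whitespace-token BIO tags."""
    rows = []
    for token in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if token.start() < end and token.end() > start:
                prefix = "B-" if token.start() <= start else "I-"
                tag = prefix + label
                break
        rows.append((token.group(), tag))
    return rows


if __name__ == "__main__":
    sentence = "Fix crash in TensorFlow Lite on Android"  # made-up example text
    lib_start = sentence.index("TensorFlow Lite")
    os_start = sentence.index("Android")
    annotations = [
        (lib_start, lib_start + len("TensorFlow Lite"), "LIBRARY"),
        (os_start, os_start + len("Android"), "PLATFORM"),
    ]
    for token, tag in spans_to_bio(sentence, annotations):
        print(f"{token}\t{tag}")
```

Whitespace tokenization keeps the sketch short; text from issues and READMEs contains code, URLs, and punctuation, so a tokenizer suited to that material would likely be needed in practice.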

@PureNatural
Collaborator Author

You can refer to this paper:
https://aclanthology.org/2020.acl-main.443/
