Thoughts on the competition topic #1
Here is a brief summary of the work so far: my approach connects the Papers with Code data (API: https://paperswithcode.com/) with the Semantic Scholar Academic Graph (API: https://api.semanticscholar.org/api-docs/). Given the title of the academic paper behind an open-source dataset, we can find its citation network in the Academic Graph. The pipeline is: the paper's citation network ➡️ details of each citing author, paper title, and citation date (by year) ➡️ per-year citation counts ➡️ analysis of the dataset's long-term usage patterns.
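A minimal sketch of this pipeline, assuming the public Semantic Scholar Graph API (`/paper/search` and `/paper/{id}/citations`); the ImageNet title below is only an illustrative placeholder, and in practice the titles would come from the Papers with Code dataset records:

```python
# Sketch: resolve a dataset's paper by title via the Semantic Scholar Graph
# API, page through its citing papers, and count citations per year.
from collections import Counter

import requests

API = "https://api.semanticscholar.org/graph/v1"

def find_paper_id(title: str) -> str:
    """Look up a paper by title and return the top-match Semantic Scholar ID."""
    r = requests.get(f"{API}/paper/search",
                     params={"query": title, "fields": "title", "limit": 1})
    r.raise_for_status()
    return r.json()["data"][0]["paperId"]

def citations_per_year(paper_id: str) -> Counter:
    """Page through the citing papers and tally them by publication year."""
    counts, offset = Counter(), 0
    while True:
        r = requests.get(f"{API}/paper/{paper_id}/citations",
                         params={"fields": "title,year,authors",
                                 "limit": 1000, "offset": offset})
        r.raise_for_status()
        body = r.json()
        for item in body["data"]:
            year = item["citingPaper"].get("year")
            if year is not None:
                counts[year] += 1
        if "next" not in body:  # last page reached
            break
        offset = body["next"]
    return counts

if __name__ == "__main__":
    pid = find_paper_id("ImageNet: A Large-Scale Hierarchical Image Database")
    print(sorted(citations_per_year(pid).items()))
```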
For the competition, though, I personally feel that analyzing only the per-year citation counts of open-source datasets is a bit thin in terms of workload;
Agreed, analysis alone is indeed a bit light on workload. Beyond the per-year citation counts you mentioned, could we try building and analyzing more complex citation networks to extract richer information? For example:
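As one hedged sketch of what such a richer network analysis could look like, using networkx (the edge list here is placeholder data standing in for real citation pairs from the API):

```python
# Sketch: load citation edges (citing paper -> cited paper) into a directed
# graph and compute structural signals beyond raw yearly counts.
import networkx as nx

# (citing_paper_id, cited_paper_id) pairs -- placeholder data
edges = [("p1", "dataset_paper"), ("p2", "dataset_paper"), ("p3", "p1")]

G = nx.DiGraph(edges)

# In-degree = how often each paper is cited within the collected network.
in_degree = dict(G.in_degree())

# PageRank as a rough influence score over the whole network.
influence = nx.pagerank(G)

print(in_degree, influence, sep="\n")
```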
I've also read a few papers recently; we could try bringing in HyperCRX, the open-source data visualization and analysis tool for GitHub, together with its related metrics such as Activity and OpenRank.
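As a possible starting point — my assumption here is that OpenDigger, the data source behind HyperCRX, serves these metrics as month-keyed static JSON files; the endpoint pattern and repo name below should be verified against the OpenDigger docs:

```python
# Sketch: pull a month-keyed OpenDigger metric (e.g. openrank or activity)
# for a repo, assuming the static-JSON endpoint pattern.
import requests

def opendigger_metric(repo: str, metric: str = "openrank") -> dict:
    url = f"https://oss.x-lab.info/open_digger/github/{repo}/{metric}.json"
    r = requests.get(url)
    r.raise_for_status()
    return r.json()  # e.g. {"2023-01": 12.3, "2023-02": 13.1, ...}

print(opendigger_metric("X-lab2017/open-digger", "openrank"))
```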
Doesn't the analysis of open-source datasets need to incorporate domain knowledge about the datasets? For instance, datasets in the GIS field differ considerably from datasets in the CS field, and their usage patterns and long-term trends also evolve somewhat differently.
Personally, I think it is worth separating open-source datasets by domain before analyzing them.
My previous research focused on the ecosystem development of open-source GIS, and open-source ecosystems in different fields develop quite differently.
Here is the abstract of that paper:
Analyzing datasets by domain is indeed worthwhile, but the workload would be huge; also, at the moment we only extract a single field from the citation network, the paper's date, while a large number of other fields go unused.
The work done so far analyzes representative datasets selected from five domains: text, image, audio, medical, and video.
The main task is to figure out how to make use of the other citation-network fields returned by the API, which would flesh out the workload.
Fair point, we do need to weigh everything together and assess the workload.
Since the chosen datasets all cover different modalities, would it make sense to take the currently hot topic of multimodal large models as an entry point?
Right, I'm currently working with datasets of different modalities, but how should multimodal large models be brought in?
We could build a detailed taxonomy of multimodal large-model applications and run a preliminary analysis of how the corresponding research literature cites multimodal datasets.
Or we could go deeper and do a fine-grained analysis building on your earlier study of datasets' long-term usage patterns (at yearly granularity), as in the sketch below.
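A small sketch of that fine-grained idea, cross-tabulating citing papers by year and dataset modality; the `records` list is placeholder data standing in for citation-network output:

```python
# Sketch: count citing papers per (year, modality) to see how dataset usage
# shifts over time across modalities.
import pandas as pd

records = [
    {"year": 2021, "modality": "image"},
    {"year": 2021, "modality": "text"},
    {"year": 2022, "modality": "image"},
    {"year": 2022, "modality": "audio"},
]

df = pd.DataFrame(records)
usage = df.pivot_table(index="year", columns="modality",
                       aggfunc="size", fill_value=0)
print(usage)
```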
Roughly what is our technical route for visualizing the data analysis?
I dug into this today. We can use Python data-analysis and visualization libraries such as pandas, matplotlib, seaborn, plotly, and bokeh to compute statistics and plot charts over our existing data. We could also create interactive reports and dashboards directly in Looker Studio, without writing complex code. Alternatively, commercial visualization tools such as Tableau or Microsoft Power BI can be used to analyze CSV data downloaded from GitHub. A small example with the first option is sketched below.
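A minimal sketch of the pandas + matplotlib route, plotting per-year citation counts; `counts` is placeholder output from the citation-counting step:

```python
# Sketch: turn per-year citation counts into a simple line chart.
import matplotlib.pyplot as plt
import pandas as pd

counts = {2018: 120, 2019: 180, 2020: 260, 2021: 310, 2022: 290}

series = pd.Series(counts).sort_index()
ax = series.plot(marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Citations")
ax.set_title("Yearly citations of a dataset's paper")
plt.tight_layout()
plt.savefig("citations_per_year.png")
```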
Very thorough, but we still need to tailor the plan and approach to the kinds of datasets we currently have.
True, a fine-grained plan is more meaningful.
I currently have a preliminary idea: to expand further on my previous work.
Here is my work at bench2024: X-lab2017/open-research#296
Title: Evaluating Long-Term Usage Patterns of Open Source Datasets: A Citation Network Approach
Abstract:
The evaluation of datasets serves as a fundamental basis for tasks in evaluatology. Evaluating the usage patterns of datasets has a significant impact on the selection of appropriate datasets. Many renowned Open Source datasets are well-established and have not been updated for many years, yet they continue to be widely used by a large number of researchers. Because of this characteristic, conventional Open Source metrics (e.g., numbers of stars, issues, and activity) drawn from the activity logs of their GitHub repositories are insufficient for evaluating long-term usage patterns.
Researchers often encounter significant challenges in selecting appropriate datasets due to the lack of insight into how these datasets are being utilized. To address this challenge, this paper proposes establishing a connection between Open Source datasets and the citation networks of their corresponding academic papers. By constructing the citation network of the corresponding academic paper, we can obtain rich graph-structured information, such as citation dates, authors, and more. Using this information, we can evaluate the long-term usage patterns of the associated Open Source dataset.
Furthermore, this paper conducts extensive experiments based on ten major dataset categories (Texts, Images, Videos, Audio, Medical, 3D, Graphs, Time Series, Tabular, and Speech) to demonstrate that the proposed method effectively evaluates the long-term usage patterns of Open Source datasets. Additionally, the insights gained from the experimental results can serve as a valuable reference for future researchers in selecting appropriate datasets for their work.