doc(website): add zh doc for openrank #1191

Merged
merged 1 commit into from
Feb 13, 2023
2 changes: 1 addition & 1 deletion docs/_sidebar.md
Original file line number Diff line number Diff line change
@@ -2,5 +2,5 @@
- [Workflow](/workflow.md)
- [Data Description](/data.md)
- Metrics
- [OpenRank](/metrics/openrank.md)
- [Global OpenRank](/metrics/global_openrank.md)
- [Project OpenRank](/metrics/project_openrank.md)
6 changes: 3 additions & 3 deletions docs/data.md
@@ -8,15 +8,15 @@ The data source comes from [GH Archive](https://www.gharchive.org/) which is a p

### Database

To support high-speed analysis of such big data, we parse the raw data into a well-defined structure and import it into a [ClickHouse](https://clickhouse.tech/) server, an open source column-oriented database management system capable of generating analytical reports in real time using SQL queries. The Clickhouse database version on our server is 22.8.
To support high-speed analysis of such big data, we parse the raw data into a well-defined structure and import it into a [ClickHouse](https://clickhouse.tech/) server, an open source column-oriented database management system capable of generating analytical reports in real time using SQL queries. The ClickHouse database version on our server is 22.8.

### Data Schema in Database

The database table offered by the `Clickhouse` server is described in the [data description](https://github.com/X-lab2017/open-digger/blob/master/docs/assets/data_description.csv). It lists 120+ rows of features parsed from the raw GHArchive datasets. Check the data descriptions to decide which features you want to work with.
The database table offered by the `ClickHouse` server is described in the [data description](https://github.com/X-lab2017/open-digger/blob/master/docs/assets/data_description.csv). It lists 120+ rows of features parsed from the raw GHArchive datasets. Check the data descriptions to decide which features you want to work with.

### User Guide for Database Service

For detailed documentation on Clickhouse SQL usage, check out the [SQL reference](https://clickhouse.tech/docs/en/).
For detailed documentation on ClickHouse SQL usage, check out the [SQL reference](https://clickhouse.tech/docs/en/).

### FAQ

4 changes: 2 additions & 2 deletions docs/index.html
@@ -3,9 +3,9 @@

<head>
<meta charset="UTF-8">
<title>GitHub Analysis Report</title>
<title>OpenDigger</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="description" content="GitHub Analysis Report">
<meta name="description" content="OpenDigger, a data analysis tool for open source world">
<meta name="viewport"
content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
<link rel="stylesheet" href="//unpkg.com/docsify/lib/themes/vue.css">
14 changes: 7 additions & 7 deletions docs/metrics/openrank.md → docs/metrics/global_openrank.md
@@ -1,26 +1,26 @@
# OpenRank
# Global OpenRank

![Type](https://img.shields.io/badge/Type-Index-blue) ![From](https://img.shields.io/badge/From-X--lab-blue) ![For](https://img.shields.io/badge/For-Repo/Developer-blue)

## Definition

OpenRank is an index introduced by X-lab. The original idea of OpenRank comes from Frank; read the [blog](https://blog.frankzhao.cn/how_to_measure_open_source_3) for the details of this index.
Global OpenRank is an index introduced by X-lab. The original idea of global OpenRank comes from Frank; read the [blog](https://blog.frankzhao.cn/how_to_measure_open_source_3) for the details of this index.

OpenRank is a downstream index of `activity`: it partially uses the `activity` index to construct a collaborative network of all GitHub repos and developers. The network model is:
Global OpenRank is a downstream index of `activity`: it partially uses the `activity` index to construct a collaborative network of all GitHub repos and developers. The network model is:

![OpenRankUML](https://www.plantuml.com/plantuml/png/SoWkIImgAStDuUBAJInGI4ajIyt9BqWjKgZcKb0eIymfJLMmjLF8AyrDIYtYgeKeAaejo2_EBCalgiIb2c6CZQwk7R86AuN4v9BCiioIIYukXzIy5A3D0000)

In the implementation of OpenRank, we use the `activity` index as the relationship weight between developers and repositories, construct the global network for every month, and calculate the OpenRank of every node in the network. However, we do not use `square` to calculate the `activity` in OpenRank, because `square` is used to bring community size into account, and in a global collaborative network the community size is already implied by the network structure.
In the implementation of global OpenRank, we use the `activity` index as the relationship weight between developers and repositories, construct the global network for every month, and calculate the global OpenRank of every node in the network. However, we do not use `square` to calculate the `activity` in global OpenRank, because `square` is used to bring community size into account, and in a global collaborative network the community size is already implied by the network structure.

Unlike PageRank, the value of each node does not depend entirely on the network structure; it also partially depends on the node's value in the previous month. Every developer and repository therefore inherits part of its OpenRank value, which also reflects long-term value in open source.
Unlike PageRank, the value of each node does not depend entirely on the network structure; it also partially depends on the node's value in the previous month. Every developer and repository therefore inherits part of its global OpenRank value, which also reflects long-term value in open source.

## Code

We do not open source the OpenRank calculation code in OpenDigger, since this is a network index and depends on a Neo4j database. But we do export the result of each month to the ClickHouse server, so you can still access the OpenRank index through the [code](https://github.com/X-lab2017/open-digger/blob/master/src/metrics/indices.ts#L21).
We do not open source the global OpenRank calculation code in OpenDigger, since this is a network index and depends on a Neo4j database. But we do export the result of each month to the ClickHouse server, so you can still access the global OpenRank index through the [code](https://github.com/X-lab2017/open-digger/blob/master/src/metrics/indices.ts#L21).

## Parameters

There are several parameters used in the OpenRank algorithm.
There are several parameters used in the global OpenRank algorithm.

| Parameter Name | Value | Description | Note |
| :------------- | :---- | :---------- | :--- |
3 changes: 3 additions & 0 deletions docs/zh-cn/_sidebar.md
@@ -1,4 +1,7 @@
- [Contributing Guide](/zh-cn/CONTRIBUTING.md)
- [Workflow](/zh-cn/workflow.md)
- [Data Description](/zh-cn/data.md)
- Metrics
- [Global OpenRank](/zh-cn/metrics/global_openrank.md)
- [Project OpenRank](/zh-cn/metrics/project_openrank.md)
- [Projects and Events](/zh-cn/events.md)
21 changes: 9 additions & 12 deletions docs/zh-cn/data.md
@@ -8,25 +8,22 @@

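### Database

To meet the need for fast queries over large-scale data, we parse the GitHub log data into structured data and import it into [ClickHouse](https://clickhouse.tech/), an open source, column-oriented, high-performance real-time analytics database. The Clickhouse server version currently used by this project is 20.8.7.15
To meet the need for fast queries over large-scale data, we parse the GitHub log data into structured data and import it into [ClickHouse](https://clickhouse.tech/), an open source, column-oriented, high-performance real-time analytics database. The ClickHouse server version currently used by this project is 22.8

### Data Schema

The structure of the data table in the `Clickhouse` server is shown in the [data description table](https://github.com/X-lab2017/open-digger/blob/master/docs/assets/data_description.csv). The table lists more than 120 rows of data columns; use it to decide which data you want to analyze and how.
The structure of the data table in the `ClickHouse` server is shown in the [data description table](https://github.com/X-lab2017/open-digger/blob/master/docs/assets/data_description.csv). The table lists more than 120 rows of data columns; use it to decide which data you want to analyze and how.

### User Guide for Database Service

For detailed usage of Clickhouse SQL, see the [Clickhouse SQL documentation](https://clickhouse.tech/docs/en/).
For detailed usage of ClickHouse SQL, see the [Clickhouse SQL documentation](https://clickhouse.tech/docs/en/).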

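### Examples
### FAQ

Below is a fairly simple SQL statement that queries data from the Clickhouse database. More examples can be found in the SQL components of the global analysis and the case studies.
- Q: Can OpenDigger support finer-grained analysis of open source projects, such as assignment, reaction, and Issue label events?

* Pull Request review comment data for an organization
- A: Not at present. Assignment (assign), reaction, and Issue label events are not part of the GitHub log data, so OpenDigger does not contain this data. Developers can, however, obtain such detailed data through the GitHub API for their own analysis.

```
SELECT actor_id, actor_login, repo_id, repo_name, issue_id, action, created_at
FROM github_log.events
WHERE type='PullRequestReviewCommentEvent' AND repo_name LIKE '{org}/%'
ORDER BY created_at ASC
```
- Q: Why are the metrics published by OpenDigger not fully accurate?

- A: OpenDigger relies on the GHArchive service and the archived GitHub log data, so some data may be lost because of service stability or similar issues. The metrics OpenDigger provides are therefore well suited to observing trends in a project's metrics, but they are not exact results.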
39 changes: 39 additions & 0 deletions docs/zh-cn/metrics/global_openrank.md
@@ -0,0 +1,39 @@
# Global OpenRank

![Type](https://img.shields.io/badge/类型-指标-blue) ![From](https://img.shields.io/badge/来自-X--lab-blue) ![For](https://img.shields.io/badge/用于-项目/开发者-blue)

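## Definition

Global OpenRank is an open source metric introduced by X-lab and proposed by Dr. Shengyu Zhao (赵生宇). For the algorithm details of global OpenRank, see [this blog post](https://blog.frankzhao.cn/how_to_measure_open_source_3).

Global OpenRank is a downstream metric of the `activity` metric. It draws on `activity` to build a collaboration network between all projects and developers on GitHub. The network model is: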

![OpenRankUML](https://www.plantuml.com/plantuml/png/SoWkIImgAStDuIhEpimhI2nAp5L8IKrBBCqfSSlFA_5Bp4rLS0nI2F1H2FLEp5HmzkFYoaqiK7Ywf-5f_yGN3QqArLmA2lu5gNb1YNdP2hPs2i-cRdZQi8Uh5gBkoUx9JtTDngC8OP2DhguTJBsLmhCjkrziR-PoICrB0JeE0000)

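In the implementation of global OpenRank, the `activity` metric is used as the weight of the edges between developers and repositories. This yields a global collaboration network from which the global OpenRank value of every node is calculated for each month. Unlike `activity`, however, we do not take the square root of a developer's weighted activity value. The square root in the `activity` metric brings the number of community participants (community size) into the calculation, but in a collaboration network that variable is already implicit in the network structure.

Unlike traditional PageRank, the global OpenRank value of each node depends not only on the collaboration network structure of the current month but also partly on the node's global OpenRank value in the previous month. In other words, every developer and repository node in the global collaboration network partially inherits its historical OpenRank value, which reflects how open source values long-term contribution.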
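
The inheritance described above can be sketched as a linear blend of last month's value and the value derived from the current month's network. This is only an illustration: the function name `blend_openrank`, the `network_value` input, and the linear form itself are assumptions, since the actual calculation code is not open sourced in OpenDigger. The two ratios are the developer and repository inheritance ratios listed in the parameter table of this document.

```python
# Hypothetical sketch: the linear blend is an assumption, not OpenDigger's code.
DEVELOPER_INHERIT = 0.5  # developer inheritance ratio (parameter table)
REPO_INHERIT = 0.3       # repository inheritance ratio (parameter table)


def blend_openrank(previous: float, network_value: float, inherit_ratio: float) -> float:
    """Mix last month's value with the value derived from this month's network."""
    if not 0.0 <= inherit_ratio <= 1.0:
        raise ValueError("inherit_ratio must be between 0 and 1")
    return inherit_ratio * previous + (1.0 - inherit_ratio) * network_value


if __name__ == "__main__":
    # Same history and same network contribution, different node types.
    print("developer:", blend_openrank(4.0, 2.0, DEVELOPER_INHERIT))
    print("repository:", blend_openrank(4.0, 2.0, REPO_INHERIT))
```

Under this assumed form, a developer node with the same inputs stays closer to its previous value than a repository node, which matches the stated intent that developers depend more on historical value.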

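## Code

Because global OpenRank is implemented as a graph metric on top of the Neo4j database, we have not fully open sourced its calculation code in OpenDigger. However, we import each month's calculation results into the ClickHouse database, so the global OpenRank values of projects and developers can still be accessed through the OpenDigger [code](https://github.com/X-lab2017/open-digger/blob/master/src/metrics/indices.ts#L21).

## Parameters

The parameters used in the calculation of global OpenRank are as follows:

| Parameter Name | Value | Description | Note |
| :------------- | :---- | :---------- | :--- |
| OpenRank default value | 1.0 | Default value for new nodes in the collaboration network, such as developers and repositories that newly join the network | |
| Developer inheritance ratio | 0.5 | How much a developer node depends on its OpenRank from the previous month | The algorithm holds that, compared with repositories, a developer's value should reflect long-term value in open source more strongly, so developers depend more heavily on their historical value |
| Repository inheritance ratio | 0.3 | How much a repository node depends on its OpenRank from the previous month | |
| OpenRank decay coefficient | 0.85 | Decay ratio applied to the OpenRank of developer and repository nodes that are inactive in the current month | An OpenRank value is not cleared just because a developer or repository is inactive for a single month |
| OpenRank minimum value | 0.1 | When a node's OpenRank decays below this value, the node's OpenRank is cleared | |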
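
To make the decay coefficient and the minimum value concrete, the sketch below simulates a node that stays inactive month after month. It is an illustration under stated assumptions: the helper names `decay_inactive` and `months_until_cleared` are hypothetical, and "cleared" is modeled as setting the value to `0.0`.

```python
# Hypothetical sketch of the inactive-node decay rule from the parameter table.
DECAY = 0.85     # OpenRank decay coefficient
MIN_VALUE = 0.1  # OpenRank minimum value


def decay_inactive(value: float, decay: float = DECAY, min_value: float = MIN_VALUE) -> float:
    """Apply one month of decay; clear the value once it falls below the minimum."""
    decayed = value * decay
    return decayed if decayed >= min_value else 0.0


def months_until_cleared(start: float = 1.0) -> int:
    """Count the inactive months needed before the value is cleared."""
    months = 0
    value = start
    while value > 0.0:
        value = decay_inactive(value)
        months += 1
    return months


if __name__ == "__main__":
    print(months_until_cleared(1.0))
```

Starting from the default value of 1.0, the value is still about 0.103 after 14 inactive months and falls below 0.1 in the 15th, so a single inactive month only reduces a node's value rather than erasing it.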

## CodePen Example

<iframe height="600" style="width: 100%;" scrolling="no" title="OpenDigger - [X-lab] OpenRank/Activity/Bus Factor" src="https://codepen.io/frank-zsy/embed/bGjyqQj?default-tab=js%2Cresult&editable=true&type=openrank" frameborder="no" loading="lazy" allowtransparency="true" allowfullscreen="true">
See the Pen <a href="https://codepen.io/frank-zsy/pen/bGjyqQj">
OpenDigger - [X-lab] OpenRank/Activity/Bus Factor</a> by Frank Zhao (<a href="https://codepen.io/frank-zsy">@frank-zsy</a>)
on <a href="https://codepen.io">CodePen</a>.
</iframe>
48 changes: 48 additions & 0 deletions docs/zh-cn/metrics/project_openrank.md
@@ -0,0 +1,48 @@
# Project OpenRank

![Type](https://img.shields.io/badge/类型-指标-blue) ![From](https://img.shields.io/badge/来自-X--lab-blue) ![For](https://img.shields.io/badge/用于-开发者-blue)

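## Definition

Project OpenRank is an open source metric introduced by X-lab and proposed by Dr. Shengyu Zhao (赵生宇). For the algorithm details of project OpenRank, see [this blog post](https://blog.frankzhao.cn/openrank_in_project/).

Similar to the way global OpenRank is calculated, this algorithm builds a network from collaboration data inside a project, such as Issues and PRs. The network model is: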

![Project OpenRank](https://www.plantuml.com/plantuml/png/SoWkIImgAStDuU8gpixCAqWiIinLI4bDIopDAN7BpolnIynDLN0CKWZmKGZrJinKSFRZuifDB51ukgVXQV_45msj2jLS2Wh-1QbvGObvsGgsTWhFfcvush27gnQYxidkoKztJIQWABEuk3ILW9g2qfoS-ABKmjBKuX8yIX7kij7LjOEQRANmRClk5zkRqMHHp4Ge0ki1Au2w7YZrTEEy9xlwk90rO5VXa9gN0WnD0000)

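## Code

The implementation of the project OpenRank algorithm has not yet been open sourced in OpenDigger, but the underlying Neo4j [plugin project](https://github.com/X-lab2017/openrank-neo4j-gds) used to calculate generic OpenRank is already open source, and everyone is welcome to use it.

## Parameters

Project OpenRank is more complex than global OpenRank and involves more parameters: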

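| Parameter Name | Value | Description | Note |
| :------------- | :---- | :---------- | :--- |
| Developer/repository OpenRank default value | 1.0 | Default OpenRank of developers and repositories in the network, such as developers who newly join the community or new repositories | |
| Issue OpenRank default value | 2.0 | Default OpenRank of Issue nodes in the network | |
| Unmerged PR OpenRank default value | 3.0 | Default OpenRank of unmerged PR nodes in the network | |
| Merged PR OpenRank default value | 5.0 | Default OpenRank of merged PR nodes in the network | |
| Developer/repository inheritance ratio | 0.15 | Share of last month's historical OpenRank, or of the initial OpenRank, that developer/repository nodes inherit | In project OpenRank, a developer's value should depend more on activity in the current month |
| Issue/PR inheritance ratio | 0.8 | Share of last month's historical OpenRank, or of the initial OpenRank, that Issue/PR nodes inherit | The value of an Issue/PR should be relatively stable and depend more on its own value |
| OpenRank decay coefficient | 0.8 | Decay ratio applied to the OpenRank of developers/Issues/PRs that are inactive in the current month | The OpenRank of a node should not drop to zero immediately after one inactive month |
| OpenRank minimum value | 0.1 | When a node's OpenRank decays below this value, it is set to zero | |
| Ratio of OpenRank flowing from Issue/PR nodes to the repository node along belongs-to edges | 0.1 | Share of an Issue/PR node's OpenRank that is transferred to the repository node | |
| Ratio of OpenRank flowing from the repository node to Issue/PR nodes along belongs-to edges | Average | The repository node's OpenRank is distributed evenly across all Issue/PR nodes | |
| Ratio of OpenRank flowing from Issue/PR nodes to developer nodes along activity edges | 0.9 | Share of an Issue/PR node's OpenRank that is transferred to developer nodes | |
| Ratio of OpenRank flowing from developer nodes to Issue/PR nodes along activity edges | 1.0 | Share of a developer node's OpenRank that is transferred to Issue/PR nodes | |
| Activity ratio of the `open` action | 0.5 | Share of an Issue/PR's value that is transferred to its author | The author of an Issue/PR first receives 50% of its value; the remaining 50% goes to the other participants |
| Weights of the `open`/`comment`/`review`/`close` actions | 2/1/1/2 | Weights of each event type when calculating activity edge weights | |
| 👍/❤️/🚀 reaction weights | 2/3/4 | Reaction weights used to calculate the initial OpenRank of an Issue/PR | The initial OpenRank of an Issue/PR is determined by the reactions that developers in the community add to it |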
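
Two rows of the table above lend themselves to a small sketch: the action weights (2/1/1/2) and the 0.1/0.9 split of an Issue/PR node's OpenRank between the repository and developers. The code is a hedged illustration only. The function names, the `(developer, action)` event tuples, and the choice to sum action weights per developer are assumptions, because the project OpenRank implementation is not open sourced in OpenDigger.

```python
# Hypothetical sketch: names and the summing rule are assumptions for illustration.
from collections import defaultdict

# open / comment / review / close action weights from the parameter table
ACTION_WEIGHTS = {"open": 2, "comment": 1, "review": 1, "close": 2}
TO_REPO_RATIO = 0.1        # Issue/PR -> repository along belongs-to edges
TO_DEVELOPERS_RATIO = 0.9  # Issue/PR -> developers along activity edges


def activity_edge_weights(events):
    """Sum the action weights per developer for one Issue/PR."""
    weights = defaultdict(int)
    for developer, action in events:
        weights[developer] += ACTION_WEIGHTS[action]
    return dict(weights)


def split_outflow(node_value: float):
    """Return (repository share, developer share) of an Issue/PR node's value."""
    return node_value * TO_REPO_RATIO, node_value * TO_DEVELOPERS_RATIO


if __name__ == "__main__":
    events = [("alice", "open"), ("bob", "comment"), ("bob", "review"), ("alice", "close")]
    print(activity_edge_weights(events))
    print(split_outflow(5.0))
```

The sketch deliberately leaves out the 50% author share and the reaction weights, because the table does not specify how they combine with the edge weights.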

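## CodePen Example

> Because of computation cost, we do not generate project OpenRank for every project. Currently supported projects include all projects in X-lab [XSOSI](https://github.com/X-lab2017/open-digger/blob/master/notebook/community_analysis/xlab.ipynb) and all projects in the [Alibaba Open Source Developer Contribution Leaderboard](https://opensource.alibaba.com/collection/contribution_leaderboard).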

<iframe height="600" style="width: 100%;" scrolling="no" title="OpenDigger - [X-lab] OpenRank/Activity/Bus Factor" src="https://codepen.io/frank-zsy/embed/abjMXBV?default-tab=js%2Cresult&editable=true" frameborder="no" loading="lazy" allowtransparency="true" allowfullscreen="true">
See the Pen <a href="https://codepen.io/frank-zsy/pen/bGjyqQj">
OpenDigger - [X-lab] OpenRank/Activity/Bus Factor</a> by Frank Zhao (<a href="https://codepen.io/frank-zsy">@frank-zsy</a>)
on <a href="https://codepen.io">CodePen</a>.
</iframe>