The main work here focuses on data collection; the data-analysis parts of the papers are not reproduced. The overall approach is to take each paper as a unit and reproduce its data-collection methodology.
This paper provides only one data-collection tool, ipfs-crawler; details are in the ipfs-crawler section (actively crawling DHT servers) below.
This paper describes a passive way to gather data: the measurement nodes disguise themselves as normal peers and collect data about the peers connected to them. Two tools are used, go-ipfs (kubo) and hydra-booster; details on both are given below:
Using these tools, the paper monitors IPFS over several time windows and analyses churn, network size, client/agent versions, and so on.
Open questions:
The experimental parameters behind the H1, H2, and Hydra curves in Fig. 3 are unclear; our reproduction with hydra and kubo does not match the results in Fig. 3:
In Fig. 3 (P3) the number of peers keeps rising, whereas locally it climbs to a certain level (6k+) and then stops growing or even drops. The suspicion is that remote peers actively disconnect, but it is unclear why our node does not then connect to other peers instead.
The main idea is to establish connections to peers exactly as kubo does, and then listen to the peers' Bitswap requests (want_have) to record which CIDs each peer is asking for, for later analysis.
The paper's tool is not open source, but it should not be hard to build on top of the kubo source:
Change the log level of kubo's bitswap subsystems to debug:
tang@ubuntu:~$ ipfs log level bitswap debug
Changed log level of 'bitswap' to 'debug'
tang@ubuntu:~$ ipfs log level bitswap-client debug
Changed log level of 'bitswap-client' to 'debug'
tang@ubuntu:~$ ipfs log level bitswap-server debug
Changed log level of 'bitswap-server' to 'debug'
tang@ubuntu:~$ ipfs log level bitswap_network debug
Changed log level of 'bitswap_network' to 'debug'
Logs like the following should then appear:
2023-04-27T13:48:01.932Z DEBUG bitswap_network network/ipfs_impl.go:427 bitswap net handleNewStream from 12D3KooWDfrUc9KWYphepLsoGvFYqmHaahjBAKj2iFmY2nFDY2Wy
Based on the log message, the corresponding source code can be located:
// handleNewStream receives a new stream from the network.
func (bsnet *impl) handleNewStream(s network.Stream) {
    defer s.Close()

    if len(bsnet.receivers) == 0 {
        _ = s.Reset()
        return
    }

    reader := msgio.NewVarintReaderSize(s, network.MessageSizeMax)
    for {
        received, err := bsmsg.FromMsgReader(reader)
        if err != nil {
            if err != io.EOF {
                _ = s.Reset()
                for _, v := range bsnet.receivers {
                    v.ReceiveError(err)
                }
                log.Debugf("bitswap net handleNewStream from %s error: %s", s.Conn().RemotePeer(), err)
            }
            return
        }

        p := s.Conn().RemotePeer()
        ctx := context.Background()
        log.Debugf("bitswap net handleNewStream from %s", s.Conn().RemotePeer())
        bsnet.connectEvtMgr.OnMessage(s.Conn().RemotePeer())
        atomic.AddUint64(&bsnet.stats.MessagesRecvd, 1)
        for _, v := range bsnet.receivers {
            v.ReceiveMessage(ctx, p, received)
        }
    }
}
It should be possible to start from this handler to record the Bitswap information we need; I have not dug any deeper than this.
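For reference, a minimal sketch of how the want_have entries could be pulled out of a decoded message, e.g. from inside handleNewStream or from a Receiver's ReceiveMessage. The import paths assume the go-bitswap message package; depending on the kubo version the same types may live under go-libipfs or boxo instead, so treat the paths (and the package name wantwatch) as assumptions, not the paper's tooling.

package wantwatch

import (
    "fmt"

    bsmsg "github.com/ipfs/go-bitswap/message"
    pb "github.com/ipfs/go-bitswap/message/pb"
    "github.com/libp2p/go-libp2p/core/peer"
)

// logWantHaves records every non-cancel want_have entry in a decoded Bitswap
// message; a real crawler would append (timestamp, peer, CID) to disk instead
// of printing.
func logWantHaves(p peer.ID, m bsmsg.BitSwapMessage) {
    for _, e := range m.Wantlist() {
        if e.WantType == pb.Message_Wantlist_Have && !e.Cancel {
            fmt.Printf("peer %s want_have %s\n", p, e.Cid)
        }
    }
}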
Open question on the exact Bitswap flow: why does the paper say that if we receive "want_have c" and do not reply "have c", we cannot enter that peer's session? And if we never enter the session, what data do we miss?
This paper mainly uses three datasets:
- First Dataset: a crawl of all DHT servers, collected with nebula-crawler (see the nebula-crawler section on actively crawling DHT servers).
- Second Dataset: backend data from all ipfs.io gateways, which we cannot obtain.
- Third Dataset: essentially a publication & retrieval experiment: publish an arbitrary file on node A, then measure on node B how long it takes to retrieve it, splitting the cost into peer-lookup time and data-transfer time. After calibrating the clocks of the two machines, the per-phase costs of publication & retrieval should be recoverable from the logs (a rough sketch follows).
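A rough sketch of this experiment, assuming a kubo daemon is running on both machines and the ipfs CLI is on PATH. This is not the paper's tooling, and the file paths are placeholders.

package main

import (
    "fmt"
    "os/exec"
    "strings"
    "time"
)

func main() {
    // Node A: publish an arbitrary file. `ipfs add -Q` prints only the final CID.
    // Note that add returns before the DHT provide necessarily finishes; if your
    // kubo version has `ipfs routing provide`, timing that call may match the
    // paper's notion of "publication" more closely.
    start := time.Now()
    out, err := exec.Command("ipfs", "add", "-Q", "/tmp/random.bin").Output()
    if err != nil {
        panic(err)
    }
    cid := strings.TrimSpace(string(out))
    fmt.Printf("published %s in %s\n", cid, time.Since(start))

    // Node B (run separately, with the CID from node A): time the retrieval.
    // Splitting the total into provider-lookup vs. transfer time requires the
    // daemon's debug logs, as noted above.
    start = time.Now()
    if err := exec.Command("ipfs", "get", "-o", "/tmp/retrieved.bin", cid).Run(); err != nil {
        panic(err)
    }
    fmt.Printf("retrieved %s in %s\n", cid, time.Since(start))
}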
github: https://github.com/trudi-group/ipfs-crawler
ipfs-crawler starts from the bootstrappers and keeps fetching the k-buckets of not-yet-visited nodes until no unvisited peer remains, much like a DFS/BFS traversal. Because Kademlia, for robustness, keeps only reachable peers in the k-buckets, this crawl cannot reach peers running in DHT client mode (e.g., peers behind NATs).
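A minimal BFS sketch of this crawl strategy (not ipfs-crawler's actual code); fetchKBuckets is a hypothetical stand-in for the repeated FIND_NODE queries the real crawler issues against each peer's routing table.

package main

import "fmt"

// fetchKBuckets is hypothetical: in the real crawler this would connect to the
// peer and enumerate its routing table via DHT queries.
func fetchKBuckets(peerID string) []string {
    return nil
}

func crawl(bootstrappers []string) map[string]bool {
    visited := make(map[string]bool)
    queue := append([]string(nil), bootstrappers...)

    for len(queue) > 0 {
        p := queue[0]
        queue = queue[1:]
        if visited[p] {
            continue
        }
        visited[p] = true

        // Every peer learned from p's k-buckets that we have not yet visited
        // becomes a new crawl target; DHT clients (e.g. peers behind NATs)
        // never show up here because servers do not keep them in their buckets.
        for _, next := range fetchKBuckets(p) {
            if !visited[next] {
                queue = append(queue, next)
            }
        }
    }
    return visited
}

func main() {
    fmt.Println(len(crawl([]string{"bootstrap-peer-1"})))
}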
Through this tool we can obtain:
- For every peer (DHT server):
  - NodeID
  - MultiAddrs
  - reachable
  - agent_version

  Example:
{
    "NodeID": "12D3KooWSSWpPrUnhC6MbpwGka2AiLASW3nvsdtUYHLZf7L8LQkT",
    "MultiAddrs": [
        "/ip4/127.0.0.1/udp/4001/quic",
        "/ip4/137.184.46.49/tcp/4001",
        "/ip4/127.0.0.1/tcp/4001",
        "/ip4/10.244.3.195/tcp/4001",
        "/ip4/137.184.46.49/udp/39162/quic",
        "/ip6/::1/udp/4001/quic",
        "/ip4/10.244.3.195/udp/4001/quic",
        "/ip6/::1/tcp/4001"
    ],
    "reachable": true,
    "agent_version": "kubo/0.14.0/e0fabd6"
}
Using the IPs in MultiAddrs, IPFS-Measurement-SIGCOMM22 analyses where peers are located, and IPFS-Churn-ICDCSW22 tracks how agent_version changes over time; with the data above these analyses are not hard to reproduce (a small sketch follows after the connectivity data below).
PS. I have not yet checked the source to see whether failed connection attempts are retried; this affects both the correctness of reachable and the coverage of the crawl, so it is worth looking into later.
- Connectivity between peers (derived from each peer's k-buckets). Example:
SOURCE;TARGET;ONLINE;TIMESTAMP
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWPDwrVMpG89PFRgnDdfZwHMpNLriXJVgLtSvq87pmi1vD;true;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWT1VEz9Tcfhbzh3BzF1ATdhzRakymQcwbX6gzknvkbzdA;true;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWB1WfNwMsvJpjviHWALskh8PHcGx5fNwETYEnp4ui83gX;true;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWDQcxLMH6JMp1GNfTwtP9N2nkozpY5owNUwTs6YrcvTqF;false;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWDjD6HWosiMHGvUXTQ17jde7dQCzkfDU2Ymx9gM32C7EY;true;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWG3PSxjHaWRg2kaxitmgN1kQ99KJ2eHTVRgCLmhQWmt6a;false;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWREYKz8bi8stnAJ3c7r512jcXU7nryZuZxAJoDfW35Vse;true;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWD7KTTJtA4UxpG7qLn45X4xUybnYoTPU349YzQCDYrTTn;false;2023-04-24T21:52:01+0000
12D3KooWDMz98CMBY8ESou22GXRnDpW3gZcah7LpDVmSuwWndHm8;12D3KooWPBWpwc96AZQzcU4ko8er2yhvDhdLJVX8qJphm86ZrxiZ;false;2023-04-24T21:52:01+0000
......
The data below was obtained by crawling continuously for 10 hours, with crawls running back-to-back (each one starts as soon as the previous one finishes), 160 crawls in total.
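As referenced above, a small sketch of the agent_version tally, assuming the crawler's peer records have been flattened to one JSON object per line with the field names from the example record; the file name visited_peers.ndjson is hypothetical, and the real ipfs-crawler output may need a different loader.

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
)

type peerRecord struct {
    NodeID       string   `json:"NodeID"`
    MultiAddrs   []string `json:"MultiAddrs"`
    Reachable    bool     `json:"reachable"`
    AgentVersion string   `json:"agent_version"`
}

func main() {
    f, err := os.Open("visited_peers.ndjson") // hypothetical file name
    if err != nil {
        panic(err)
    }
    defer f.Close()

    counts := make(map[string]int)
    sc := bufio.NewScanner(f)
    sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // peer records can be long
    for sc.Scan() {
        var p peerRecord
        if err := json.Unmarshal(sc.Bytes(), &p); err != nil {
            continue // skip malformed lines
        }
        if p.Reachable {
            counts[p.AgentVersion]++
        }
    }
    for version, n := range counts {
        fmt.Printf("%-40s %d\n", version, n)
    }
}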
github: https://github.com/ipfs/kubo
“Kubo was the first IPFS implementation and is the most widely used one today. Implementing the Interplanetary Filesystem - the Web3 standard for content-addressing, interoperable with HTTP.”
As explained in IPFS-ICDCS22, ipfs maintains a set of connected peers through the swarm; when retrieving a CID it first asks the connected peers over Bitswap whether they have the content, and only falls back to a DHT lookup if none of them do.
ipfs peers therefore keep a certain number of connections to each other, bounded by the LowWater and HighWater settings of the connection manager; according to IPFS-Churn-ICDCSW22, these two values are the only settings it changed for the experiment.
Reproduction:
- As described in the paper, set LowWater & HighWater to 18000 and 20000:
tang@ubuntu:~$ cat ~/.ipfs/config
.....
"Swarm": {
"AddrFilters": null,
"ConnMgr": {
"LowWater": 18000,
"HighWater": 20000
},
......
- Start the ipfs daemon:
tang@ubuntu:~$ ipfs daemon
Initializing daemon...
Kubo version: 0.19.1
Repo version: 13
System version: amd64/linux
Golang version: go1.19.8
......
Daemon is ready
- The connected peers can now be listed:
tang@ubuntu:~$ ipfs swarm peers
/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ
/ip4/104.207.147.161/udp/4001/quic-v1/p2p/12D3KooWQApQJG3pc3iNLuzaGGkJdQkzGNMe6i2FSqCJcywpFM17
/ip4/109.123.240.146/udp/4001/quic-v1/p2p/12D3KooWPzJcDSFQrBSgrdPbSAwTX5x6aWKWAoAzuEj3NBb43iKD
/ip4/136.244.83.93/udp/4001/quic-v1/p2p/12D3KooWHkWRWLN9CWTSXdDWHWPuzLbgT1EPh8LtLpZzMwBCeLNo
/ip4/139.178.91.71/udp/4001/quic/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN
......
- Details about a specific peer can also be obtained via its PeerID:
tang@ubuntu:~$ ipfs id 12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9
{
    "ID": "12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9",
    "PublicKey": "CAESIEHpj124ydTz8zJMuj7CVTLOj8ggmJcM87HcYUeQRQeC",
    "Addresses": [
        "/ip4/127.0.0.1/tcp/4001/p2p/12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9",
        "/ip4/127.0.0.1/udp/4001/quic/p2p/12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9",
        "/ip4/172.33.1.5/tcp/4001/p2p/12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9",
        "/ip4/172.33.1.5/udp/4001/quic/p2p/12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9",
        "/ip4/83.228.248.13/tcp/4001/p2p/12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9",
        "/ip4/83.228.248.13/udp/4001/quic/p2p/12D3KooWEFfELn8766a7DQaPmtCwRQMTxo1ibGjordKqXSS1xfS9"
    ],
    "AgentVersion": "kubo/0.17.0/4485d6b",
    "ProtocolVersion": "ipfs/0.1.0",
    "Protocols": [
        "/ipfs/bitswap",
        "/ipfs/bitswap/1.0.0",
        "/ipfs/bitswap/1.1.0",
        "/ipfs/bitswap/1.2.0",
        "/ipfs/id/1.0.0",
        "/ipfs/id/push/1.0.0",
        "/ipfs/ping/1.0.0",
        "/libp2p/autonat/1.0.0",
        "/libp2p/circuit/relay/0.1.0",
        "/libp2p/circuit/relay/0.2.0/hop",
        "/libp2p/circuit/relay/0.2.0/stop",
        "/libp2p/dcutr",
        "/p2p/id/delta/1.0.0",
        "/x/"
    ]
}
From this information we can derive all the data used in the paper: connection durations, changes in ipfs versions, changes in the number of connections, and so on. A simple sampler sketch is given below.
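A minimal sketch of such a sampler, assuming the ipfs CLI is on PATH and talks to the local daemon. It only records the connection count; persisting the full multiaddr list per tick would additionally give per-peer session lengths.

package main

import (
    "fmt"
    "os/exec"
    "strings"
    "time"
)

func main() {
    for {
        out, err := exec.Command("ipfs", "swarm", "peers").Output()
        if err != nil {
            fmt.Println("ipfs swarm peers failed:", err)
        } else {
            // One multiaddr per line; the count is the current number of connections.
            peers := strings.Fields(strings.TrimSpace(string(out)))
            fmt.Printf("%s connected=%d\n", time.Now().Format(time.RFC3339), len(peers))
            // Writing the full list to disk here, and feeding each PeerID to
            // `ipfs id`, would also yield agent-version history over time.
        }
        time.Sleep(30 * time.Second)
    }
}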
Problems observed:
With the settings from P2 of the paper (LowWater=18000, HighWater=20000), after starting the ipfs daemon the number of connected peers does not, as in the paper, grow monotonically and converge at 10000+.
In our reproduction the number of connected peers peaks at around 6000+.
- LowWater=18000, HighWater=20000
- ulimit -n is 655350, so file handles are not the bottleneck
- the node can be found by another peer via its PeerID, so it is running in DHT server mode
It is unclear why this differs from the paper's results.
github: https://github.com/libp2p/hydra-booster
hydra-booster is roughly equivalent to running multiple kubo instances: each instance is called a head and has its own PeerID, so the deployment can connect to as many distinct peers as possible.
Problems observed:
Its numbers likewise do not match the paper's.
tang@ubuntu:~/clone_file/hydra-booster$ go run ./main.go -name hydra_0 -port-begin 4002 -nheads 5 -httpapi-addr 192.168.0.107:7779
......
[NumHeads: 0, Uptime: 5s, MemoryUsage: 62 MB, PeersConnected: 146, TotalUniquePeersSeen: 157, BootstrapsDone: 5, ProviderRecords: 0,
......
[NumHeads: 0, Uptime: 3m0s, MemoryUsage: 182 MB, PeersConnected: 1500, TotalUniquePeersSeen: 3716, BootstrapsDone: 5, ProviderRecords: 0,
......
[NumHeads: 0, Uptime: 6m0s, MemoryUsage: 236 MB, PeersConnected: 2261, TotalUniquePeersSeen: 5333, BootstrapsDone: 5, ProviderRecords: 357501,
......
[NumHeads: 0, Uptime: 15m2s, MemoryUsage: 355 MB, PeersConnected: 3199, TotalUniquePeersSeen: 8565, BootstrapsDone: 5, ProviderRecords: 366893
......
[NumHeads: 0, Uptime: 1h50m6s, MemoryUsage: 335 MB, PeersConnected: 3525, TotalUniquePeersSeen: 11632, BootstrapsDone: 5, ProviderRecords: 534816,
......
[NumHeads: 0, Uptime: 2h56m16s, MemoryUsage: 402 MB, PeersConnected: 3517, TotalUniquePeersSeen: 12179, BootstrapsDone: 5, ProviderRecords: 539413,
......
[NumHeads: 0, Uptime: 2h57m42s, MemoryUsage: 551 MB, PeersConnected: 732, TotalUniquePeersSeen: 12247, BootstrapsDone: 5, ProviderRecords: 539413,
......
[NumHeads: 0, Uptime: 3h25m12s, MemoryUsage: 399 MB, PeersConnected: 840, TotalUniquePeersSeen: 14654, BootstrapsDone: 5, ProviderRecords: 541867,
......
[NumHeads: 0, Uptime: 3h50m31s, MemoryUsage: 333 MB, PeersConnected: 839, TotalUniquePeersSeen: 14752, BootstrapsDone: 5, ProviderRecords: 542010,
......
PeersConnected grows only slowly after reaching 3k+, and after about 3 hours it drops sharply, which is seriously at odds with the paper's data.
It is unclear whether newer ipfs releases have introduced new policies against crawlers like hydra-booster.
github: https://github.com/dennis-tra/nebula
Quite similar to ipfs-crawler (actively crawling DHT servers) and essentially the same in principle; I have not yet looked into how it differs from ipfs-crawler in the details (performance, monitoring, etc.).
The ipfs swarm:
- How does it find peers?
- When does it decide to connect, and when to disconnect?
- Does hydra-booster follow the same rules?