其实很简单,这些sentinel redis instance之间唯一的通信方式就是通过tcp通信, 而目前由于sentinel的所有网络通信都是由redisAsyncCommand这个命令异步执行的, 所以只要grep redisAsyncCommand即可list出所有操作。
正如之前提到的,sentinel instance通过config file指定也好,runtime config也好,monitor管辖了很多 所有master instance,对于这些master instance以及这些master的所有slave instance,每个instance建立 了两个tcp连接,一个cc,一个pc。而与此同时,对于每个master instance而言,有other sentine也在 monitor该master instance,sentinel一一与这些other sentinel建立一个cc连接。
先说建立连接时会用到的交互,这部分内容对sentinel与redis,sentinel与sentinel来讲,是通用的。
/* src/sentinel.c */
1676 void sentinelSendAuthIfNeeded(sentinelRedisInstance *ri, redisAsyncContext *c) {
1677 char *auth_pass = (ri->flags & SRI_MASTER) ? ri->auth_pass :
1678 ri->master->auth_pass;
1679
1680 if (auth_pass) {
1681 if (redisAsyncCommand(c, sentinelDiscardReplyCallback, NULL, "AUTH %s",
1682 auth_pass) == REDIS_OK) ri->pending_commands++;
1683 }
与sentinelSendAuthIfNeeded相关的配置项如下,
/* sentinel.conf */
# sentinel auth-pass <master-name> <password>
#
# Set the password to use to authenticate with the master and slaves.
# Useful if there is a password set in the Redis instances to monitor.
#
# Note that the master password is also used for slaves, so it is not
# possible to set a different password in masters and slaves instances
# if you want to be able to monitor these instances with Sentinel.
#
# However you can have Redis instances without the authentication enabled
# mixed with Redis instances requiring the authentication (as long as the
# password set is the same for all the instances requiring the password) as
# the AUTH command will have no effect in Redis instances with authentication
# switched off.
#
# Example:
#
# sentinel auth-pass mymaster MySUPER--secret-0123passw0rd
先定义一下bucket这个概念,一个redis master instance以及与他同步的所有redis slave instance这样一组 redis instance称之为一个bucket.
这段配置的含义是,即使这个配置文件只指定了master sentinelRedisInstance的auth passwd, 但是会自动扩散到slave sentinelRedisInstance,主要是为了方便, 为了配合这个做法,才限制如果要和sentinel配合使用, 则同属一个bucket的master和slave role的redis instance必须用相同的auth密码. 也就可以看到*auth_pass = (ri->flags & SRI_MASTER) ? ri->auth_pass : ri->master->auth_pass;就是因为如此.
可以看到此举是针对master或者slave role的sentinelRedisInstance执行的,而sentinelSendAuthIfNeeded在 该sentinelRedisInstance为sentinel role时也会调用,并且此时会走 *auth_pass = (ri->flags & SRI_MASTER) ? ri->auth_pass : ri->master->auth_pass;后半部分的逻辑,那么通过 sentinelSendAuthIfNeeded这个函数给sentinel instance发送ri->master->auth_pass auth信息,有用吗? 会对sentinel instance产生什么影响。在这个问题很简单,因为sentinel instance在启动的时候加载了一个自定义的 响应命令的子集sentinelcmds,这个sentinelcmds list里面根本就没有auth这个cmd,所以,auth命令发送给sentinel instance, 会被直接无视,没有任何影响。后续会介绍sentinelcmds相关逻辑。
/* src/sentinel.c */
1686 /* Use CLIENT SETNAME to name the connection in the Redis instance as
1687 * sentinel-<first_8_chars_of_runid>-<connection_type>
1688 * The connection type is "cmd" or "pubsub" as specified by 'type'.
1689 *
1690 * This makes it possible to list all the sentinel instances connected
1691 * to a Redis servewr with CLIENT LIST, grepping for a specific name format. */
1692 void sentinelSetClientName(sentinelRedisInstance *ri, redisAsyncContext *c, char *type) {
1695 snprintf(name,sizeof(name),"sentinel-%.8s-%s",server.runid,type);
1696 if (redisAsyncCommand(c, sentinelDiscardReplyCallback, NULL,
1697 "CLIENT SETNAME %s", name) == REDIS_OK)
注释说的很清楚了,CLIENT SETNAME是让在remote redis instance或者sentinel instance按照此命令参数指定的具有规则的名字来命名 这些cc或者pc连接。以便在这些instance上执行CLIENT LIST cmd获取到client list后,可以通过grep相关pattern来筛选过滤. TODO, 由于sentinel长时间运行下,可以会产生连接泄露,也许是与某些配置项太小有关系,但是目前不清楚具体原因,希望通过 CLIENT LIST来排查,但是还是上面这个sentinelcmds list子集的问题,需要sentinel同时加载CLIENT LIST,CLIENT SETNAME, 才会让debug成为可能,所以其实现在CLIENT SETNAME cmd发送给sentinel instance,其实是被pass掉的
然后sentinel对redis或者sentinel instance的ping操作以及sentinelPingReplyCallback中检查到instance处于 BUSY状态时采取SCRIPT KILL操作,这部分内容对sentinel与redis,sentinel与sentinel来讲,也是通用的。
/* src/sentinel.c */
2327 int sentinelSendPing(sentinelRedisInstance *ri) {
2328 int retval = redisAsyncCommand(ri->cc,
2329 sentinelPingReplyCallback, NULL, "PING");
2062 void sentinelPingReplyCallback(redisAsyncContext *c, void *reply, void *privdata) {
2084 if (strncmp(r->str,"BUSY",4) == 0 &&
2085 (ri->flags & SRI_S_DOWN) &&
2086 !(ri->flags & SRI_SCRIPT_KILL_SENT))
2087 {
2088 if (redisAsyncCommand(ri->cc,
2089 sentinelDiscardReplyCallback, NULL,
2090 "SCRIPT KILL") == REDIS_OK)
除了上面提到instance之间的通用的交互方式之外,接下来分开说一下不通用的部分,
先说sentinel与redis instance之间的交互.
sentinel与redis instance之间,
-
先说通过cc进行的,
-
info操作之前也讲过,是通过master或者slave role的sentinelRedisInstance的cc连接进行的。
/* src/sentinel.c */ 2344 void sentinelSendPeriodicCommands(sentinelRedisInstance *ri) { 2378 if ((ri->flags & SRI_SENTINEL) == 0 && 2379 (ri->info_refresh == 0 || 2380 (now - ri->info_refresh) > info_period)) 2381 { 2382 /* Send INFO to masters and slaves, not sentinels. */ 2383 retval = redisAsyncCommand(ri->cc, 2384 sentinelInfoReplyCallback, NULL, "INFO");
-
sentinelSendSlaveOf里有一个transaction,几个相关的命令在里面一并执行, 这些命令会在master或者slave role的sentinelRedisInstance的cc连接上执行.
/* src/sentinel.c */ 3403 int sentinelSendSlaveOf(sentinelRedisInstance *ri, char *host, int port) { 3426 retval = redisAsyncCommand(ri->cc, 3427 sentinelDiscardReplyCallback, NULL, "MULTI"); 3428 if (retval == REDIS_ERR) return retval; 3429 ri->pending_commands++; 3430 3431 retval = redisAsyncCommand(ri->cc, 3432 sentinelDiscardReplyCallback, NULL, "SLAVEOF %s %s", host, portstr); 3433 if (retval == REDIS_ERR) return retval; 3434 ri->pending_commands++; 3435 3436 retval = redisAsyncCommand(ri->cc, 3437 sentinelDiscardReplyCallback, NULL, "CONFIG REWRITE"); 3438 if (retval == REDIS_ERR) return retval;
-
-
再说通过pc进行的,
-
master或者slave role的sentinelRedisInstance的pc连接(这个连接就是从 当前sentinel instance连接到remote master或者slave instance)创建之后, 不可忽略的一个重要操作就是SUBSCRIBE SENTINEL_HELLO_CHANNEL这个频道。
/* src/sentinel.c */ 1706 void sentinelReconnectInstance(sentinelRedisInstance *ri) { 1735 if ((ri->flags & (SRI_MASTER|SRI_SLAVE)) && ri->pc == NULL) { 1757 retval = redisAsyncCommand(ri->pc, 1758 sentinelReceiveHelloMessages, NULL, "SUBSCRIBE %s", 1759 SENTINEL_HELLO_CHANNEL);
值得注意的是,可以看到sentinel与sentinel之间并不会直接订阅对方,但是后续会提到的, 我们配合sentinel的方案中,我们的listener是直接订阅了所有的sentinel instance的, 即sentinel instance的pubsub的消息来源渠道并不对外开放。怎么做到不开放后续会解释。而是通过 sentinelEvent方法向外部广播sentinel内部正在发生什么的时候内部使用。 关于sentinelEvent方法后续也会详细介绍.
-
再说sentinel与sentinel instance之间,
-
通过cc进行的,
-
之前讲到过,sentinel与sentinel instance之间会通过在通向其他sentinel instance的cc连接上执行 SENTINEL is-master-down-by-addr命令来沟通master的S_DOWN状态, 并且存储到本地other sentinel sentinelRedisInstance struct的SRI_MASTER_DOWN状态中,供后续统计。
/* src/sentinel.c */ 3193 void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) { 3197 di = dictGetIterator(master->sentinels); 3198 while((de = dictNext(di)) != NULL) { 3199 sentinelRedisInstance *ri = dictGetVal(de); 3224 retval = redisAsyncCommand(ri->cc, 3225 sentinelReceiveIsMasterDownReply, NULL, 3226 "SENTINEL is-master-down-by-addr %s %s %llu %s", 3227 master->addr->ip, port, 3228 sentinel.current_epoch, 3229 (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ? 3230 server.runid : "*");
-
至此列出了几乎所有的的sentinel与redis instance之间以及sentinel与sentinel instance之间的交互方式,除了一个例外, 下一章会讲一下,一个重要的但是比较特殊的交互方式, hello msg.可以简单的说,这其实还是一个sentinel与sentinel instance, sentinel与redis instance之间都会有的交互方式,但是具体交互方式又很不相同。
-
先讲一下sentinel instance send hello msg的常规方式
/* src/sentinel.c */ 3919 void sentinelHandleRedisInstance(sentinelRedisInstance *ri) { 3923 sentinelSendPeriodicCommands(ri); 2344 void sentinelSendPeriodicCommands(sentinelRedisInstance *ri) { 2389 } else if ((now - ri->last_pub_time) > SENTINEL_PUBLISH_PERIOD) { 2390 /* PUBLISH hello messages to all the three kinds of instances. */ 2391 sentinelSendHello(ri);
/* src/redis.c */ 1063 int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) { 1242 /* Run the Sentinel timer if we are in sentinel mode. */ 1243 run_with_period(100) { 1244 if (server.sentinel_mode) sentinelTimer(); 1245 }
可以看到sentinelSendHello的执行是随着sentinelHandleRedisInstance这个sentinelTimer定期执行逻辑执行的. 作用于三种role的sentinelRedisInstance的cc连接上,预计100ms一次。但是有个限制条件就是 距离sentinel三种role的sentinelRedisInstance上次ri->last_pub_time更新已经超过SENTINEL_PUBLISH_PERIOD, SENTINEL_PUBLISH_PERIOD默认为2s.ri->last_pub_time后续马上会提到。
-
然后看一下hello msg的格式
sentinel_ip,sentinel_port,sentinel_runid,current_epoch, master_name,master_ip,master_port,master_config_epoch.
可以看到这个用逗号分隔的msg里含有以下几种信息,
-
sentinel_ip,sentinel_port,sentinel_runid这些都是广播关于当前sentinel的信息,让other sentinel发现自己的存在。 注意到sentinel runid信息是间接给vote逻辑用的,但是hello msg跟vote逻辑没有直接关系。
-
current_epoch是在global sentinel struct里保存的一个全局epoch信息,后续会详细解释其用途。
-
master_name,master_ip,master_port 之前提到过,此send hello msg的逻辑作用于三种role中任意role的sentinelRedisInstance上, 所以此处的master_xx是指当任意role的sentinelRedisInstance对应的master sentinelRedisInstance的ip,port信息。
-
master_config_epoch是当前sentinelRedisInstance对应的master sentinelRedisInstance的config_epoch信息, 这个epoch后续也会解释其用途。
有几个重要的问题值得提起,
-
可以看到的是,这个hello msg的各个组成部分实际上是从master sentinelRedisInstance struct中获取的相关config信息, 而这些master sentinelRedisInstance struct实际上是当前sentinel的管辖下的master instance到当前sentinel的env的映射而已, 所以这些信息都是当前sentinel的主观视角的信息而已,保证这些信息的时效性不在此处. 这些master config信息尽可能及时被更新的逻辑后续会提到。
-
hello msg是从本地的master slave sentinel三种role的sentinelRedisInstance发起的, 也就是说其实slave sentinel role的sentinelRedisInstance发起的 hello msg其实是同对应的master role的sentinelRedisInstance的hello msg是重复的。 但是注意cc link这个渠道不一样,每个sentinelRedisInstance向外广播的渠道是当前sentinel与这个 sentinelRedisInstance所指向的remote master或slave redis instance或者sentinel instance的之间建立的cc连接。 暂且先不说这些instance对hello msg的处理有何不同,后续会马上提到。
hello msg通过publish cmd不断向外send广播出去,
-
这个广播既发给了master和slave redis instance, 很好理解,通过这些redis instance的pubsub广播渠道曲线到达other sentinel instance, 因为正如上面提到的这一组sentinel中每个sentinel instance都SUBSCRIBE了所有这一组sentinel管辖下 的master和slave instance的SENTINEL_HELLO_CHANNEL channel。
-
同时还直接发给了sentinel instance, 这一点很蹊跷,后续会讲到sentinel instance对通过publish cmd发送hello msg给他的处理方式。
-
-
接下来,详细解释一下sentinelSendHello逻辑
/* src/sentinel.c */ 2250 int sentinelSendHello(sentinelRedisInstance *ri) { 2239 /* Send an "Hello" message via Pub/Sub to the specified 'ri' Redis 2240 * instance in order to broadcast the current configuraiton for this 2241 * master, and to advertise the existence of this Sentinel at the same time. 2242 * 2243 * The message has the following format: 2244 * 2245 * sentinel_ip,sentinel_port,sentinel_runid,current_epoch, 2246 * master_name,master_ip,master_port,master_config_epoch. 2247 * 2248 * Returns REDIS_OK if the PUBLISH was queued correctly, otherwise 2249 * REDIS_ERR is returned. */ 2250 int sentinelSendHello(sentinelRedisInstance *ri) { 2251 char ip[REDIS_IP_STR_LEN]; 2252 char payload[REDIS_IP_STR_LEN+1024]; 2253 int retval; 2254 char *announce_ip; 2255 int announce_port; 2256 sentinelRedisInstance *master = (ri->flags & SRI_MASTER) ? ri : ri->master; 2257 sentinelAddr *master_addr = sentinelGetCurrentMasterAddress(master); 2258 2259 if (ri->flags & SRI_DISCONNECTED) return REDIS_ERR; 2260 2261 /* Use the specified announce address if specified, otherwise try to 2262 * obtain our own IP address. */ 2263 if (sentinel.announce_ip) { 2264 announce_ip = sentinel.announce_ip; 2265 } else { 2266 if (anetSockName(ri->cc->c.fd,ip,sizeof(ip),NULL) == -1) 2267 return REDIS_ERR; 2268 announce_ip = ip; 2269 } 2270 announce_port = sentinel.announce_port ? 2271 sentinel.announce_port : server.port; 2272 2273 /* Format and send the Hello message. */ 2274 snprintf(payload,sizeof(payload), 2275 "%s,%d,%s,%llu," /* Info about this sentinel. */ 2276 "%s,%s,%d,%llu", /* Info about current master. */ 2277 announce_ip, announce_port, server.runid, 2278 (unsigned long long) sentinel.current_epoch, 2279 /* --- */ 2280 master->name,master_addr->ip,master_addr->port, 2281 (unsigned long long) master->config_epoch); 2282 retval = redisAsyncCommand(ri->cc, 2283 sentinelPublishReplyCallback, NULL, "PUBLISH %s %s", 2284 SENTINEL_HELLO_CHANNEL,payload); 2285 if (retval != REDIS_OK) return REDIS_ERR; 2286 ri->pending_commands++; 2287 return REDIS_OK; 2288 }
-
可以看到如果sentinelRedisInstance处于SRI_DISCONNECTED,则会直接返回REDIS_ERR
-
hello msg中sentinel_ip, sentinel_port信息是可以单独从配置文件指定的即announce_host,announce_port。 好处是在docker container的net为bridge mode下,sentinel hello msg机制也可以工作。
-
master_xx这些config是从(ri->flags & SRI_MASTER) ? ri : ri->master;这样的sentinelRedisInstance中 通过sentinelGetCurrentMasterAddress获取的。
sentinelgetcurrentmasteraddress这样一种获取master config的方式值得说一下,
/* src/sentinel.c */ 1297 /* Return the current master address, that is, its address or the address 1298 * of the promoted slave if already operational. */ 1299 sentinelAddr *sentinelGetCurrentMasterAddress(sentinelRedisInstance *master) { 1300 /* If we are failing over the master, and the state is already 1301 * SENTINEL_FAILOVER_STATE_RECONF_SLAVES or greater, it means that we 1302 * already have the new configuration epoch in the master, and the 1303 * slave acknowledged the configuration switch. Advertise the new 1304 * address. */ 1305 if ((master->flags & SRI_FAILOVER_IN_PROGRESS) && 1306 master->promoted_slave && 1307 master->failover_state >= SENTINEL_FAILOVER_STATE_RECONF_SLAVES) 1308 { 1309 return master->promoted_slave->addr; 1310 } else { 1311 return master->addr; 1312 } 1313 }
可以看到,
-
这个master sentinelRedisInstance的flags如果处于SRI_FAILOVER_IN_PROGRESS状态
-
并且master->promoted_slave为真,
-
并且master->failover_state >= SENTINEL_FAILOVER_STATE_RECONF_SLAVES,
则表示该promoted_slave所对应的redis instance已经响应了slave of no one的命令摒弃了与old master之间的sync关系, 此时当前sentinel就开始广播这一虽然是阶段性但确是里程碑性质的成果, 虽然此时failover还在继续中,但是最重要的一步已经完成. 再重提一下sentinelAbortFailover进行的前提条件,
/* src/sentinel.c */ 3900 void sentinelAbortFailover(sentinelRedisInstance *ri) { 3901 redisAssert(ri->flags & SRI_FAILOVER_IN_PROGRESS); 3902 redisAssert(ri->failover_state <= SENTINEL_FAILOVER_STATE_WAIT_PROMOTION);
可以看到sentinelAbortFailover会redisAssert(ri->failover_state <= SENTINEL_FAILOVER_STATE_WAIT_PROMOTION),而 SENTINEL_FAILOVER_STATE_WAIT_PROMOTION刚好是SENTINEL_FAILOVER_STATE_RECONF_SLAVES这个状态的前一个状态,到达 SENTINEL_FAILOVER_STATE_RECONF_SLAVES则表示不能再abort failover,进入sentinelFailoverReconfNextSlave之后该次failover无论 如何都必须继续完成,所谓必须完成的相关逻辑在sentinelFailoverDetectEnd,即使输出了+failover-end-for-timeout messge, 该次failover也一定会走+failover-end的逻辑完成,之前将failover流程的时候已经提到过了
可以看到此处就将failover成果通过upgrade config的方式第一时间广播出去,对提高sentinel方案的容错性有很大的好处, 因为hello msg中master config epoch高的upgrade config一定会获得other sentinel的直接认同 (除了比较config_epoch之外不需要任何前置确认信息), 只要有一个sentinel instance将这份高epoch的config持久化下来,这份config就会强制生效了。除非后续有新的config来覆盖它, 否则redis instance之间一定会达到这个config所定义的拓扑状态,值得注意的是, config_epoch的的作用范围以及config_epoch每次变更是局限在一个master的范围内的.
-
-
-
继续来看sentinel给send hello msg这一PUBLISH async cmd注册的sentinelPublishReplyCallback函数。
同样返回REDIS_ERR在sentinelSendHello表示async cmd根本就没有queued correctly。 可以注意到的是,在sentinelSendHello里并没有直接更新ri->last_pub_time, 更新是在sentinelPublishReplyCallback函数里完成的, 如果reply不为error的情况下才会更新ri->last_pub_time,具体如下,
/* src/sentinel.c */ 2099 /* This is called when we get the reply about the PUBLISH command we send 2100 * to the master to advertise this sentinel. */ 2101 void sentinelPublishReplyCallback(redisAsyncContext *c, void *reply, void *privdata) { 2102 sentinelRedisInstance *ri = c->data; 2103 redisReply *r; 2104 REDIS_NOTUSED(privdata); 2105 2106 if (ri) ri->pending_commands--; 2107 if (!reply || !ri) return; 2108 r = reply; 2109 2110 /* Only update pub_time if we actually published our message. Otherwise 2111 * we'll retry again in 100 milliseconds. */ 2112 if (r->type != REDIS_REPLY_ERROR) 2113 ri->last_pub_time = mstime(); 2114 }
关于ri->last_pub_time,这个参数详细提一下,其限制作用之前已经提过了, 通过now - ri->last_pub_time) > SENTINEL_PUBLISH_PERIOD这个判断来限制调用sentinelSendHello的频率, 而且sentinelSendHello有且仅有那样一个入口。 所以要改变sentinelSendHello的行为,则就只能通过变更ri->last_pub_time来控制。
但是什么情况下更新ri->last_pub_time,上面讲到的只是正常情况下的一种情况, 下面还有一种情况下,为了尽快publish变更出去, 会将当前的ri->last_pub_time减掉SENTINEL_PUBLISH_PERIOD+1这样一个时间间隔,那么 下次循环就会立即执行此publish操作。
具体细节如下,
/* src/sentinel.c */ 2290 /* Reset last_pub_time in all the instances in the specified dictionary 2291 * in order to force the delivery of an Hello update ASAP. */ 2292 void sentinelForceHelloUpdateDictOfRedisInstances(dict *instances) { 2293 dictIterator *di; 2294 dictEntry *de; 2295 2296 di = dictGetSafeIterator(instances); 2297 while((de = dictNext(di)) != NULL) { 2298 sentinelRedisInstance *ri = dictGetVal(de); 2299 if (ri->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1)) 2300 ri->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1); 2301 } 2302 dictReleaseIterator(di); 2303 } 2304 2305 /* This function forces the delivery of an "Hello" message (see 2306 * sentinelSendHello() top comment for further information) to all the Redis 2307 * and Sentinel instances related to the specified 'master'. 2308 * 2309 * It is technically not needed since we send an update to every instance 2310 * with a period of SENTINEL_PUBLISH_PERIOD milliseconds, however when a 2311 * Sentinel upgrades a configuration it is a good idea to deliever an update 2312 * to the other Sentinels ASAP. */ 2313 int sentinelForceHelloUpdateForMaster(sentinelRedisInstance *master) { 2314 if (!(master->flags & SRI_MASTER)) return REDIS_ERR; 2315 if (master->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1)) 2316 master->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1); 2317 sentinelForceHelloUpdateDictOfRedisInstances(master->sentinels); 2318 sentinelForceHelloUpdateDictOfRedisInstances(master->slaves); 2319 return REDIS_OK; 2320 }
可以看到sentinelForceHelloUpdateForMaster会在master sentinelRedisInstance这个struct上执行 该master->last_pub_time减少操作以提前下次send hello msg。
sentinelForceHelloUpdateForMaster的调用时机如下,
/* src/sentinel.c */ 1789 /* Process the INFO output from masters. */ 1790 void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) { 1944 /* Handle slave -> master role switch. */ 1945 if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) { 1946 /* If this is a promoted slave we can change state to the 1947 * failover state machine. */ 1948 if ((ri->flags & SRI_PROMOTED) && 1949 (ri->master->flags & SRI_FAILOVER_IN_PROGRESS) && 1950 (ri->master->failover_state == 1951 SENTINEL_FAILOVER_STATE_WAIT_PROMOTION)) 1952 { 1958 ri->master->config_epoch = ri->master->failover_epoch; 1959 ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES; 1960 ri->master->failover_state_change_time = mstime(); 1961 sentinelFlushConfig(); 1962 sentinelEvent(REDIS_WARNING,"+promoted-slave",ri,"%@"); 1963 sentinelEvent(REDIS_WARNING,"+failover-state-reconf-slaves", 1964 ri->master,"%@"); 1965 sentinelCallClientReconfScript(ri->master,SENTINEL_LEADER, 1966 "start",ri->master->addr,ri->addr); 1967 sentinelForceHelloUpdateForMaster(ri->master);
调用时机还是在之前提到的那个关键步骤,failover_state被提升为SENTINEL_FAILOVER_STATE_RECONF_SLAVES 之后立即执行sentinelForceHelloUpdateForMaster,提前下一次send hello msg到下个timer循环,尽快将新的config广播出去。 使得其他sentinel尽快更新自己的config为该upgrade之后的config.
至此关于当前sentinel instance send hello msg以及send hello msg callback已经讲完了.
-
但是other sentinel instance怎么收到hello msg以及怎么处理hello msg还没有讲
/* src/sentinel.c */ 1706 void sentinelReconnectInstance(sentinelRedisInstance *ri) { 1756 /* Now we subscribe to the Sentinels "Hello" channel. */ 1757 retval = redisAsyncCommand(ri->pc, 1758 sentinelReceiveHelloMessages, NULL, "SUBSCRIBE %s", 1759 SENTINEL_HELLO_CHANNEL);
-
在SUBSCRIBE master,slave的redis instance的时候,给该channel的pubsub消息注册了一个回调函数sentinelReceiveHelloMessages。 这就是通过pubsub渠道间接获取其他sentinel的hello msg并处理的机制。
/* src/sentinel.c */ 2209 /* This is our Pub/Sub callback for the Hello channel. It's useful in order 2210 * to discover other sentinels attached at the same master. */ 2211 void sentinelReceiveHelloMessages(redisAsyncContext *c, void *reply, void *privdata) { 2212 sentinelRedisInstance *ri = c->data; 2213 redisReply *r; 2214 REDIS_NOTUSED(privdata); 2215 2216 if (!reply || !ri) return; 2217 r = reply; 2218 2219 /* Update the last activity in the pubsub channel. Note that since we 2220 * receive our messages as well this timestamp can be used to detect 2221 * if the link is probably disconnected even if it seems otherwise. */ 2222 ri->pc_last_activity = mstime(); 2223 2224 /* Sanity check in the reply we expect, so that the code that follows 2225 * can avoid to check for details. */ 2226 if (r->type != REDIS_REPLY_ARRAY || 2227 r->elements != 3 || 2228 r->element[0]->type != REDIS_REPLY_STRING || 2229 r->element[1]->type != REDIS_REPLY_STRING || 2230 r->element[2]->type != REDIS_REPLY_STRING || 2231 strcmp(r->element[0]->str,"message") != 0) return; 2232 2233 /* We are not interested in meeting ourselves */ 2234 if (strstr(r->element[2]->str,server.runid) != NULL) return; 2235 2236 sentinelProcessHelloMessage(r->element[2]->str, r->element[2]->len); 2237 }
有几个逻辑,
-
sentinelReceiveHelloMessages在检查reply合法性之前,即只要有reply,则更新ri->pc_last_activity, ri->pc_last_activity主要是用于判断pc连接是否需要reconnect的。如果距离上次更新ri->pc_last_activity 超过3倍SENTINEL_PUBLISH_PERIOD则需要重连。这也就是pc_last_activity的全部作用。
-
如果该hello msg是当前sentinel发出去的,则也忽略。
-
最后处理hello msg的函数是sentinelProcessHelloMessage,后续会详细解释。
-
-
那么直接发送给other sentinel instance的hello msg消息,other sentinel是怎么处理的呢?
谈到这个问题,不得不说一下sentinelcmds的相关机制。
/* src/sentinel.c */ 385 void sentinelCommand(redisClient *c); 386 void sentinelInfoCommand(redisClient *c); 387 void sentinelSetCommand(redisClient *c); 388 void sentinelPublishCommand(redisClient *c); 389 void sentinelRoleCommand(redisClient *c); 390 391 struct redisCommand sentinelcmds[] = { 392 {"ping",pingCommand,1,"",0,NULL,0,0,0,0,0}, 393 {"sentinel",sentinelCommand,-2,"",0,NULL,0,0,0,0,0}, 394 {"subscribe",subscribeCommand,-2,"",0,NULL,0,0,0,0,0}, 395 {"unsubscribe",unsubscribeCommand,-1,"",0,NULL,0,0,0,0,0}, 396 {"psubscribe",psubscribeCommand,-2,"",0,NULL,0,0,0,0,0}, 397 {"punsubscribe",punsubscribeCommand,-1,"",0,NULL,0,0,0,0,0}, 398 {"publish",sentinelPublishCommand,3,"",0,NULL,0,0,0,0,0}, 399 {"info",sentinelInfoCommand,-1,"",0,NULL,0,0,0,0,0}, 400 {"role",sentinelRoleCommand,1,"l",0,NULL,0,0,0,0,0}, 401 {"shutdown",shutdownCommand,-1,"",0,NULL,0,0,0,0,0} 402 }; 410 /* Perform the Sentinel mode initialization. */ 411 void initSentinel(void) { 412 unsigned int j; 413 414 /* Remove usual Redis commands from the command table, then just add 415 * the SENTINEL command. */ 416 dictEmpty(server.commands,NULL); 417 for (j = 0; j < sizeof(sentinelcmds)/sizeof(sentinelcmds[0]); j++) { 418 int retval; 419 struct redisCommand *cmd = sentinelcmds+j; 420 421 retval = dictAdd(server.commands, sdsnew(cmd->name), cmd); 422 redisAssert(retval == DICT_OK); 423 }
可以看到initSentinel这个sentinel相对redis server特有的初始化函数,首先将 server.commands这个dict清空,然后重新加载了一批在sentinelcmds list里定义的响应命令的list。 也就是说sentinel instance摒弃了redis server原有的所有cmd,sentinel instance单独只响应 sentinelcmds list中命令,这个sentinelcmds list又分三类,
-
原封不动加载的redis server提供的已有命令, pingCommand,subscribeCommand,unsubscribeCommand,psubscribeCommand,punsubscribeCommand,shutdownCommand 可以看到响应ping和shutdown命令,以及subscribe相关的订阅,批量订阅,取消订阅,取消批量订阅都是 同redis server一样的响应逻辑。
-
sentinel特有的命令, sentinelCommand,这个就是prefix为sentinel的那一系列命令,如sentinel is-master-down-by-addr,sentinel masters 等的响应逻辑。
-
被sentinel override的命令, sentinelPublishCommand, sentinelInfoCommand,sentinelRoleCommand
此处介绍一下被override的sentinelPublishCommand。
/* src/sentinel.c */ 3027 /* Our fake PUBLISH command: it is actually useful only to receive hello messages 3028 * from the other sentinel instances, and publishing to a channel other than 3029 * SENTINEL_HELLO_CHANNEL is forbidden. 3030 * 3031 * Because we have a Sentinel PUBLISH, the code to send hello messages is the same 3032 * for all the three kind of instances: masters, slaves, sentinels. */ 3033 void sentinelPublishCommand(redisClient *c) { 3034 if (strcmp(c->argv[1]->ptr,SENTINEL_HELLO_CHANNEL)) { 3035 addReplyError(c, "Only HELLO messages are accepted by Sentinel instances."); 3036 return; 3037 } 3038 sentinelProcessHelloMessage(c->argv[2]->ptr,sdslen(c->argv[2]->ptr)); 3039 addReplyLongLong(c,1); 3040 }
有几个值得注意的地方,
-
这个publish命令响应函数仅仅用来响应其他sentinel instance发送的hello msg,对于除SENTINEL_HELLO_CHANNEL 这个channel之外的msg, 返回addReplyError。除此之外调用sentinelProcessHelloMessage这个真正处理msg的逻辑, sentinelProcessHelloMessage这个函数存在的好处是和常归的hello msg处理流程做到了代码共用。
-
通过override的做法,做到了在send hello msg给master slave redis instance以及sentinel instance的时候共用 一个逻辑。
-
所以可以看到上面就是sentinel instance send hello msg,以及remote instance对其的响应的两种不同的流程。
-
-
最后关于hello msg来看一下,两种不同的流程处理hello msg时共用的sentinelProcessHelloMessage的逻辑。
/* src/sentinel.c */ 2121 void sentinelProcessHelloMessage(char *hello, int hello_len) { 2122 /* Format is composed of 8 tokens: 2123 * 0=ip,1=port,2=runid,3=current_epoch,4=master_name, 2124 * 5=master_ip,6=master_port,7=master_config_epoch. */ 2125 int numtokens, port, removed, master_port; 2126 uint64_t current_epoch, master_config_epoch; 2127 char **token = sdssplitlen(hello, hello_len, ",", 1, &numtokens); 2128 sentinelRedisInstance *si, *master; 2129 2130 if (numtokens == 8) { 2131 /* Obtain a reference to the master this hello message is about */ 2132 master = sentinelGetMasterByName(token[4]); 2133 if (!master) goto cleanup; /* Unknown master, skip the message. */ 2134 2135 /* First, try to see if we already have this sentinel. */ 2136 port = atoi(token[1]); 2137 master_port = atoi(token[6]); 2138 si = getSentinelRedisInstanceByAddrAndRunID( 2139 master->sentinels,token[0],port,token[2]); 2140 current_epoch = strtoull(token[3],NULL,10); 2141 master_config_epoch = strtoull(token[7],NULL,10); 2142 2143 if (!si) { 2144 /* If not, remove all the sentinels that have the same runid 2145 * OR the same ip/port, because it's either a restart or a 2146 * network topology change. */ 2147 removed = removeMatchingSentinelsFromMaster(master,token[0],port, 2148 token[2]); 2149 if (removed) { 2150 sentinelEvent(REDIS_NOTICE,"-dup-sentinel",master, 2151 "%@ #duplicate of %s:%d or %s", 2152 token[0],port,token[2]); 2153 } 2154 2155 /* Add the new sentinel. */ 2156 si = createSentinelRedisInstance(NULL,SRI_SENTINEL, 2157 token[0],port,master->quorum,master); 2158 if (si) { 2159 sentinelEvent(REDIS_NOTICE,"+sentinel",si,"%@"); 2160 /* The runid is NULL after a new instance creation and 2161 * for Sentinels we don't have a later chance to fill it, 2162 * so do it now. */ 2163 si->runid = sdsnew(token[2]); 2164 sentinelFlushConfig(); 2165 } 2166 } 2200 /* Update the state of the Sentinel. */ 2201 if (si) si->last_hello_time = mstime(); 2202 }
-
可以看到先行判断收到的hello msg以逗号分隔后是否为8部分。如果不是,则丢弃。
-
如果用hello msg中的master_name通过sentinelGetMasterByName去sentinel.masters管辖下 的所有master信息中查找是否存在该master_name,如果找不到,即未知的master,则会被直接忽略, 此处也就是master信息不会通过hello msg的广播机制共享给其他sentinel的原因。
-
如果能够在sentinel.masters找到该master,则先行从master sentinelRedisInstance struct的 master->sentinels中删除重复的sentinel sentinelRedisInstance(如果相应的sentinel sentinelRedisInstance存在的话). 并输出了-dup-sentinel msg。然后在重新创建sentinel sentinelRedisInstance并挂载在master下。并输出+sentinel msg。 由于自动发现的sentinel的创建sentinel sentinelRedisInstance的runid就是在此处填充的,没有其他的机会填充。
-
最后可以看到此处更新了sentinel sentinelRedisInstance的last_hello_time属性。last_hello_time属性 目前仅用于addReplySentinelRedisInstance这个被用于各种"sentinel masters"这类的info逻辑中的函数里。 记录了该sentinel sentinelRedisInstance所对应的远程sentinel instance的上一条hello msg是什么时候到达的。
关于sentinelResetMaster的部分以及epoch变更的部分,后续会详细解释。
-
自此,hello msg的所有相关流程介绍完成。
介绍一个epoch相关的细节,epoch其实是几种,epoch只是一个统称,epoch的作用 关乎vote,关乎upgrade config的传播,所以算是一个比较复杂的逻辑。
分别在以下数据结构里,
/* src/sentinel.c */
118 typedef struct sentinelRedisInstance {
122 uint64_t config_epoch; /* Configuration epoch. */
176 char *leader;
180 uint64_t leader_epoch; /* Epoch of the 'leader' field. */
181 uint64_t failover_epoch; /* Epoch of the currently started failover. */
196 /* Main state. */
197 struct sentinelState {
198 uint64_t current_epoch; /* Current epoch. */
这四个epoch之间以及他们与vote之间有着千丝万缕的联系,分开看每一个都不完整。同时顺带着会讲leader字段以及vote的逻辑。
分阶段讲吧,
-
初始化
/* src/sentinel.c */ 410 /* Perform the Sentinel mode initialization. */ 411 void initSentinel(void) { 425 /* Initialize various data structures. */ 426 sentinel.current_epoch = 0; 896 sentinelRedisInstance *createSentinelRedisInstance(char *name, int flags, char *hostname, int port, int quorum, sentinelRedisInstance *master) { 936 ri->config_epoch = 0; 973 /* Failover state. */ 974 ri->leader = NULL; 975 ri->leader_epoch = 0; 976 ri->failover_epoch = 0;
可以明显看出来的是,current_epoch是在global sentinel struct上的一个属性,没有什么歧义。
/* sentinel current-epoch is a global state valid for all the masters. */
而对于config_epoch,failover_epoch,leader_epoch来说,暂时则比较不确定,具体是作用于什么role的sentinelRedisInstance上。
-
sentinelHandleConfiguration里的一段逻辑关于epoch可以用来预热
/* src/sentinel.c */ 1391 } else if (!strcasecmp(argv[0],"current-epoch") && argc == 2) { 1392 /* current-epoch <epoch> */ 1393 unsigned long long current_epoch = strtoull(argv[1],NULL,10); 1394 if (current_epoch > sentinel.current_epoch) 1395 sentinel.current_epoch = current_epoch; 1396 } else if (!strcasecmp(argv[0],"config-epoch") && argc == 3) { 1397 /* config-epoch <name> <epoch> */ 1398 ri = sentinelGetMasterByName(argv[1]); 1399 if (!ri) return "No such master with specified name."; 1400 ri->config_epoch = strtoull(argv[2],NULL,10); 1401 /* The following update of current_epoch is not really useful as 1402 * now the current epoch is persisted on the config file, but 1403 * we leave this check here for redundancy. */ 1404 if (ri->config_epoch > sentinel.current_epoch) 1405 sentinel.current_epoch = ri->config_epoch; 1406 } else if (!strcasecmp(argv[0],"leader-epoch") && argc == 3) { 1407 /* leader-epoch <name> <epoch> */ 1408 ri = sentinelGetMasterByName(argv[1]); 1409 if (!ri) return "No such master with specified name."; 1410 ri->leader_epoch = strtoull(argv[2],NULL,10);
-
关于配置/* current-epoch */,如果大于sentinel.current_epoch,则更新sentinel.current_epoch
-
关于配置/* config-epoch */,通过name去找master,如果找到则将该master sentinelRedisInstance 的config_epoch置为配置数,并且如果该config_epoch大于sentinel.current_epoch,则更新sentinel.current_epoch.
-
对于/* leader-epoch */同样也是先去找到master并且将该master sentinelRedisInstance的config_epoch 置为配置数。
-
-
从addReplySentinelRedisInstance捕风捉影
/* src/sentinel.c */ 2410 /* Redis instance to Redis protocol representation. */ 2411 void addReplySentinelRedisInstance(redisClient *c, sentinelRedisInstance *ri) { 2509 /* Only masters */ 2510 if (ri->flags & SRI_MASTER) { 2511 addReplyBulkCString(c,"config-epoch"); 2512 addReplyBulkLongLong(c,ri->config_epoch); 2513 fields++; 2578 /* Only sentinels */ 2579 if (ri->flags & SRI_SENTINEL) { 2584 addReplyBulkCString(c,"voted-leader"); 2585 addReplyBulkCString(c,ri->leader ? ri->leader : "?"); 2586 fields++; 2588 addReplyBulkCString(c,"voted-leader-epoch"); 2589 addReplyBulkLongLong(c,ri->leader_epoch); 2590 fields++;
从这里来看,config_epoch只在master sentinelRedisInstance才会被通过info信息传达出去, leader以及leader_epoch只在sentinel sentinelRedisInstance上会被传达出去。
接下来,就是sentinelCheckObjectivelyDown会常态化的去从每个master sentinelRedisInstance的角度 去检查挂载在master下的所有sentinel sentinelRedisInstance的SRI_MASTER_DOWN的数量,与quorum进行判断, 并判定master sentinelRedisInstance是否应该置为SRI_O_DOWN状态。此处就是quorum用来进行大多数统计的第一处逻辑。
-
sentinelAskMasterStateToOtherSentinels常态化的ask
/* src/sentinel.c */ 3193 void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) { 3197 di = dictGetIterator(master->sentinels); 3198 while((de = dictNext(di)) != NULL) { 3199 sentinelRedisInstance *ri = dictGetVal(de); 3200 mstime_t elapsed = mstime() - ri->last_master_down_reply_time; 3204 /* If the master state from other sentinel is too old, we clear it. */ 3205 if (elapsed > SENTINEL_ASK_PERIOD*5) { 3206 ri->flags &= ~SRI_MASTER_DOWN; 3207 sdsfree(ri->leader); 3208 ri->leader = NULL; 3209 } 3216 if ((master->flags & SRI_S_DOWN) == 0) continue; 3222 /* Ask */ 3223 ll2string(port,sizeof(port),master->addr->port); 3224 retval = redisAsyncCommand(ri->cc, 3225 sentinelReceiveIsMasterDownReply, NULL, 3226 "SENTINEL is-master-down-by-addr %s %s %llu %s", 3227 master->addr->ip, port, 3228 sentinel.current_epoch, 3229 (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ? 3230 server.runid : "*");
可以看到在该sentinel sentinelRedisInstance的存储的remote sentinel instance对该master的评估SRI_MASTER_DOWN信息, 如果距离上次更新到现在超过5倍SENTINEL_ASK_PERIOD时间,则直接摒弃掉该状态。如果该master sentinelRedisInstance处于 SRI_S_DOWN,则暂时放弃从该master挂载下的所有sentinel sentinelRedisInstance去ask这一行为。
再来看一下,在此阶段,other sentinel instance对于is-master-down-by-addr的响应逻辑。
/* src/sentinel.c */ 2628 void sentinelCommand(redisClient *c) { 2657 } else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) { 2658 /* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>*/ 2666 if (c->argc != 6) goto numargserr; 2667 if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != REDIS_OK || 2668 getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL) 2669 != REDIS_OK) 2670 return; 2671 ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters, 2672 c->argv[2]->ptr,port,NULL); 2673 2674 /* It exists? Is actually a master? Is subjectively down? It's down. 2675 * Note: if we are in tilt mode we always reply with "0". */ 2676 if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) && 2677 (ri->flags & SRI_MASTER)) 2678 isdown = 1; 2679 2680 /* Vote for the master (or fetch the previous vote) if the request 2681 * includes a runid, otherwise the sender is not seeking for a vote. */ 2682 if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) { 2683 leader = sentinelVoteLeader(ri,(uint64_t)req_epoch, 2684 c->argv[5]->ptr, 2685 &leader_epoch); 2686 } 2687 2688 /* Reply with a three-elements multi-bulk reply: 2689 * down state, leader, vote epoch. */ 2690 addReplyMultiBulkLen(c,3); 2691 addReply(c, isdown ? shared.cone : shared.czero); 2692 addReplyBulkCString(c, leader ? leader : "*"); 2693 addReplyLongLong(c, (long long)leader_epoch);
-
对于c->argv[4],这个参数,被用来填充req_epoch这个变量。但是由于 strcasecmp(c->argv[5]->ptr,"*")为0的原因,填充后的req_epoch并没有派上用场。
-
同样由于strcasecmp(c->argv[5]->ptr,"*")为0的原因,leader_epoch并不会被填充, leader也会不被赋值,所以addReplyLongLong返回的leader_epoch又是无意义的初始值。
-
另外从此处可以看到sentinel instance对于未知的master的另外一部分处理逻辑,会用is-master-down-by-addr的 去当前sentinel.masters去找,如果没找到,则isdown永远为0,即对该master是否down掉并不表达意见。 当然如果是已知的master,并且该master sentinelRedisInstance处于SRI_S_DOWN状态,则回复isdown为1,表达出自己 已有的对master instance的SRI_S_DOWN状态的判断。
再来看一下,此阶段当前sentinel收到other sentinel的reply之后的callback逻辑。
/* src/sentinel.c */ 3148 /* Receive the SENTINEL is-master-down-by-addr reply, see the 3149 * sentinelAskMasterStateToOtherSentinels() function for more information. */ 3150 void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) { 3151 sentinelRedisInstance *ri = c->data; 3152 redisReply *r; 3153 REDIS_NOTUSED(privdata); 3154 3155 if (ri) ri->pending_commands--; 3156 if (!reply || !ri) return; 3157 r = reply; 3158 3159 /* Ignore every error or unexpected reply. 3160 * Note that if the command returns an error for any reason we'll 3161 * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */ 3162 if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 && 3163 r->element[0]->type == REDIS_REPLY_INTEGER && 3164 r->element[1]->type == REDIS_REPLY_STRING && 3165 r->element[2]->type == REDIS_REPLY_INTEGER) 3166 { 3167 ri->last_master_down_reply_time = mstime(); 3168 if (r->element[0]->integer == 1) { 3169 ri->flags |= SRI_MASTER_DOWN; 3170 } else { 3171 ri->flags &= ~SRI_MASTER_DOWN; 3172 } 3185 } 3186 }
在此阶段,sentinelReceiveIsMasterDownReply的用途就仅仅只是用来收集上面提到的reply响应的isdown信息, 并记录到该master sentinelRedisInstance下相应的sentinel sentinelRedisInstance的SRI_MASTER_DOWN中。 并且更新了该sentinel sentinelRedisInstance的ri->last_master_down_reply_time属性。
可以看出来此阶段通过is-master-down-by这个命令沟通的信息有限。
-
-
发起start failover
如果该master sentinelRedisInstance处于SRI_O_DOWN状态,则会进入sentinelStartFailover的流程。
/* src/sentinel.c */ 3460 void sentinelStartFailover(sentinelRedisInstance *master) { 3461 redisAssert(master->flags & SRI_MASTER); 3462 3463 master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START; 3464 master->flags |= SRI_FAILOVER_IN_PROGRESS; 3465 master->failover_epoch = ++sentinel.current_epoch; 3471 master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
-
可以看到此处涉及两个epoch, 首先将++global sentinel current_epoch++,并赋给了try-failover的 master sentinelRedisInstance的failover_epoch,
-
此处更新了master->failover_start_time,此处就是failover_start_time的一处更新逻辑, 并且更新时伴随着rand()%SENTINEL_MAX_DESYNC的逻辑。
-
此处sentinel.current_epoch的更新逻辑,是当前sentinel发起的一次主动更新逻辑。
-
-
start failover之后,sentinelAskMasterStateToOtherSentinels的更完整用途。
-
就是sentinelAskMasterStateToOtherSentinels由于SENTINEL_ASK_FORCED这个flag加持,就ask得更频繁一些。 并且由于(master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ? server.runid : "*", 此处ask的runid参数带上了当前sentinel instance的runid而具有了意义。
-
从other sentinel reply的角度来讲,则不免由sentinelVoteLeader这个逻辑进入了选举流程, 进入了选举流程则,local req_epoch, leader, leader_epoch则真正派上了用场。
/* src/sentinel.c */ 2628 void sentinelCommand(redisClient *c) { 2657 } else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) { 2680 /* Vote for the master (or fetch the previous vote) if the request 2681 * includes a runid, otherwise the sender is not seeking for a vote. */ 2682 if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) { 2683 leader = sentinelVoteLeader(ri,(uint64_t)req_epoch, 2684 c->argv[5]->ptr, 2685 &leader_epoch); 2686 } 2687 2688 /* Reply with a three-elements multi-bulk reply: 2689 * down state, leader, vote epoch. */ 2690 addReplyMultiBulkLen(c,3); 2691 addReply(c, isdown ? shared.cone : shared.czero); 2692 addReplyBulkCString(c, leader ? leader : "*"); 2693 addReplyLongLong(c, (long long)leader_epoch);
此处为other sentinel响应is-master-down-by-addr命令时的vote逻辑,other sentinel执行sentinelVoteLeader采用了 作为参数的当前sentinel的current_epoch(也即为当前sentinel发起的这轮failover的failover_epoch), 来评估自己的投票选择,评估二字为何,Vote for the master (or fetch the previous vote)就是解释。 可以看到此阶段此处other sentinel进入sentinelVoteLeader并不要求other sentinel对该 master sentinelRedisInstance有除了role之外的任何要求。
此处会岔开去讲sentinelVoteLeader这个重头戏,等会回过来头讲此阶段下当前sentinel的reply callback的逻辑。
-
-
other sentinel sentinelVoteLeader reply给当前sentinel
接下来这一段内容中,我临时设身处地成other sentinel的角度,接下来的这段内容会临时将other sentinel称为当前sentinel。 将other sentinel切换到第一人称视角。为了表达方便。
/* src/sentinel.c */ 3243 char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) { 3244 if (req_epoch > sentinel.current_epoch) { 3249 sentinel.current_epoch = req_epoch; 3253 } 3254 3255 if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch) 3256 { 3257 mstime_t time_since_last_vote = mstime() - master->failover_start_time; 3266 if (time_since_last_vote > master->failover_timeout || 3267 strcasecmp(req_runid,server.runid) == 0 || 3268 master->leader == NULL) { 3269 sdsfree(master->leader); 3270 master->leader = sdsnew(req_runid); 3271 } 3272 master->leader_epoch = sentinel.current_epoch; 3276 /* If we did not voted for ourselves, set the master failover start 3277 * time to now, in order to force a delay before we can start a 3278 * failover for the same master. */ 3279 if (strcasecmp(master->leader,server.runid)) { 3280 mstime_t last_time = master->failover_start_time; 3281 master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC; 3286 } 3287 } 3288 3289 *leader_epoch = master->leader_epoch; 3290 return master->leader ? sdsnew(master->leader) : NULL; 3291 }
-
如果该req_epoch大于当前的sentinel.current_epoch,则更新当前sentinel的sentinel.current_epoch。 此处为被动更新sentinel.current_epoch的一个逻辑。
-
如果master sentinelRedisInstance的leader_epoch小于该req_epoch并且当前sentinel的sentinel.current_epoch 不大于req_epoch,其实可以看到由于上面的逻辑,根本没有小于的可能。
如果上面两个条件满足则,
-
上面两个条件满足则考虑更新该master sentinelRedisInstance的leader信息, 将该req_runid参数赋给master sentinelRedisInstance的leader属性。 例外就是如果距离该master sentinelRedisInstance的failover_start_time还没有超过failover_timeout这么长时间, 此处就是failover_start_time的一个限制逻辑。 则坚持上次的投票意见不变,例外就是投票给自己,投票给自己其实不是sentinelVoteLLeader在这个阶段的作用。 投票给自己后续阶段会解释。
-
更新master->leader_epoch为当前的sentinel.current_epoch
-
并且如果我们并不是投票给了自己,则还有一个约束条件就是去更新master->failover_start_time, 限制自己下次vote或者start failover的时间。此处就是failover_start_time的一个更新逻辑, 可以看到更新failover_start_time伴随着一个rand()%SENTINEL_MAX_DESYNC的逻辑。 这是一个比较无力的failover_start_time的desync逻辑。 至此,failover_start_time的两处更新逻辑以及一处限制逻辑都已经讲到了,并且两处 更新failover_start_time都伴随着一个rand()%SENTINEL_MAX_DESYNC的逻辑.
-
刚好提一下failover_start_time的其他限制逻辑。
-
此处是sentinelStartFailoverIfNeededs的一个前置条件
/* src/sentinel.c */ 3491 int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) { 3500 /* Last failover attempt started too little time ago? */ 3501 if (now - master->failover_start_time < 3502 master->failover_timeout*2) 3503 { 3504 if (master->failover_delay_logged != master->failover_start_time) { 3505 time_t clock = (master->failover_start_time + 3506 master->failover_timeout*2) / 1000; 3507 char ctimebuf[26]; 3508 3509 ctime_r(&clock,ctimebuf); 3510 ctimebuf[24] = '\0'; /* Remove newline. */ 3511 master->failover_delay_logged = master->failover_start_time; 3512 redisLog(REDIS_WARNING, 3513 "Next failover delay: I will not start a failover before %s", 3514 ctimebuf); 3515 } 3516 return 0;
此处是sentinelStartFailoverIfNeededs的一个前置条件,如果距离上次更新该 master sentinelRedisInstance的failover_start_time还没超过2倍failover_timeout,则直接return, 则暂时不进入start failover.
-
参与sentinelFailoverWaitStart的election_timeout逻辑
/* src/sentinel.c */ 3632 /* ---------------- Failover state machine implementation ------------------- */ 3633 void sentinelFailoverWaitStart(sentinelRedisInstance *ri) { 3651 /* Abort the failover if I'm not the leader after some time. */ 3652 if (mstime() - ri->failover_start_time > election_timeout) { 3653 sentinelEvent(REDIS_WARNING,"-failover-abort-not-elected",ri,"%@ %llu", 3654 (unsigned long long) ri->failover_epoch); 3655 3656 sentinelAbortFailover(ri);
-
无论如何最终,
- 最后通过return值以及填充leader_epoch参数的这俩个方式,将此次投票信息返回出去。
整个过程就是将同意发起start failover的sentinel instance在is-master-down-by-addr在参数中 带上的自身的++后的current_epoch信息,也即传递到other sentinel的sentinelVoteLeader调用时的req_epoch参数。 为什么同意呢,
-
因为当前sentinel中该master sentinelRedisInstance的leader_epoch小于该值,
-
并且当前sentinel的current_epoch还没有超前于该req_epoch, 此处的current_epoch检查逻辑蔑视了比它小的req_epoch要求更新投票信息的请求.
同意req_epoch的意思也就是,此时需要更新投票信息,包括该master sentinelRedisInstance的leader以及leader_epoch属性。
-
leader信息不更新之前讲过特例,如果该master sentinelRedisInstance的leader距离上次变更还未超过一次failover_timeout的时间。
-
leader_epoch更新就是为了避免在同一个leader_epoch下变更leader信息。
注意此处的leader和leader_epoch信息是存储在master sentinelRedisInstance中的。
至此临时的other sentinel的第一视角结束。
-
-
vote reply callback sentinelReceiveIsMasterDownReply的更完整的作用
Reply with a three-elements multi-bulk reply: down state, leader, vote epoch
other sentinel vote reply信息中带上了leader以及vote epoch信息。
/* src/sentinel.c */ 3150 void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) { 3151 sentinelRedisInstance *ri = c->data; 3158 3159 /* Ignore every error or unexpected reply. 3160 * Note that if the command returns an error for any reason we'll 3161 * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */ 3162 if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 && 3163 r->element[0]->type == REDIS_REPLY_INTEGER && 3164 r->element[1]->type == REDIS_REPLY_STRING && 3165 r->element[2]->type == REDIS_REPLY_INTEGER) 3166 { 3173 if (strcmp(r->element[1]->str,"*")) { 3174 /* If the runid in the reply is not "*" the Sentinel actually 3175 * replied with a vote. */ 3176 sdsfree(ri->leader); 3182 ri->leader = sdsnew(r->element[1]->str); 3183 ri->leader_epoch = r->element[2]->integer; 3184 } 3185 } 3186 }
此处当前sentinel对other sentinel的投票reply信息的处理是将leader,leader_epoch信息, 直接存入sentinel sentinelRedisInstance的leader和leader_epoch中。 当然此处的sentinel sentinelRedisInstance是挂载在当前正在进行failover的master sentinelRedisInstance下。 可以看到在当前sentinel以及other sentinel中对于leader和leader_epoch是 存储在不同role的sentinelRedisInstance中的。
-
start failover之后,正式开始failover的流程之前,叫做wait start failover,
/* src/sentinel.c */ 3633 void sentinelFailoverWaitStart(sentinelRedisInstance *ri) { 3634 char *leader; 3635 int isleader; 3636 3637 /* Check if we are the leader for the failover epoch. */ 3638 leader = sentinelGetLeader(ri, ri->failover_epoch); 3639 isleader = leader && strcasecmp(leader,server.runid) == 0; 3640 sdsfree(leader); 3641 3644 if (!isleader && !(ri->flags & SRI_FORCE_FAILOVER)) { 3645 int election_timeout = SENTINEL_ELECTION_TIMEOUT; 3646 3649 if (election_timeout > ri->failover_timeout) 3650 election_timeout = ri->failover_timeout; 3651 /* Abort the failover if I'm not the leader after some time. */ 3652 if (mstime() - ri->failover_start_time > election_timeout) { 3656 sentinelAbortFailover(ri); 3657 } 3658 return;
可以看到此处会从当前sentinel来统计vote情况,上一步的vote reply的vote信息已经存储在 当前master sentinelRedisInstance挂载下的sentinel sentinelRedisInstance中了, 如果选举一直失败,则此阶段过一段时间会进入election timeout状态.
详细介绍一下sentinelGetLeader,
-
先看前半部分,统计已有的vote,看是否有winner
3316 /* Scan all the Sentinels attached to this master to check if there 3317 * is a leader for the specified epoch. 3318 * 3319 * To be a leader for a given epoch, we should have the majority of 3320 * the Sentinels we know (ever seen since the last SENTINEL RESET) that 3321 * reported the same instance as leader for the same epoch. */ /* src/sentinel.c */ 3322 char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) { 3333 counters = dictCreate(&leaderVotesDictType,NULL); 3335 voters = dictSize(master->sentinels)+1; /* All the other sentinels and me. */ 3336 3337 /* Count other sentinels votes */ 3338 di = dictGetIterator(master->sentinels); 3339 while((de = dictNext(di)) != NULL) { 3340 sentinelRedisInstance *ri = dictGetVal(de); 3341 if (ri->leader != NULL && ri->leader_epoch == epoch) { 3342 sentinelLeaderIncr(counters,ri->leader); 3348 } 3349 } 3350 dictReleaseIterator(di); 3351 3355 di = dictGetIterator(counters); 3356 while((de = dictNext(di)) != NULL) { 3357 uint64_t votes = dictGetUnsignedIntegerVal(de); 3358 3359 if (votes > max_votes) { 3360 max_votes = votes; 3361 winner = dictGetKey(de); 3362 } 3363 } 3364 dictReleaseIterator(di);
可以看到此处就是all the Sentinels attached to this master 统计这些sentinel sentinelRedisInstance的leader leader_epoch信息和given epoch参数是否吻合, 并sentinelLeaderIncr累加到counters dict中。 这个given epoch其实就是master sentinelRedisInstance的failover_epoch了,不一定是当前的sentinel.current_epoch, 可能此时当前sentinel的current_epoch已经由于当前failover接下来的failover又++了
接着开始统计counters这个dict,将投票最多的runid记录到winner中,将该投票记录到max_votes中.
-
再看后半部分,统计已有的vote,看是否有winner
/* src/sentinel.c */ 3322 char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) { 3366 /* Count this Sentinel vote: 3367 * if this Sentinel did not voted yet, either vote for the most 3368 * common voted sentinel, or for itself if no vote exists at all. */ 3369 if (winner) 3370 myvote = sentinelVoteLeader(master,epoch,winner,&leader_epoch); 3371 else 3372 myvote = sentinelVoteLeader(master,epoch,server.runid,&leader_epoch); 3373 3374 if (myvote && leader_epoch == epoch) { 3375 uint64_t votes = sentinelLeaderIncr(counters,myvote); 3376 3377 if (votes > max_votes) { 3378 max_votes = votes; 3379 winner = myvote; 3380 } 3381 } 3382 3383 voters_quorum = voters/2+1; 3384 if (winner && (max_votes < voters_quorum || max_votes < master->quorum)) 3385 winner = NULL; 3386 3387 winner = winner ? sdsnew(winner) : NULL; 3388 sdsfree(myvote); 3389 dictRelease(counters); 3390 return winner;
此处是sentinelVoteLeader的另外一个入口,
-
如果在统计中,没有一个winner出现,则当前sentinel通过sentinelVoteLeader投给自己。
-
当然如果当前sentinel已经sentinelVoteLeader给自己过了,则此处并不会重复计入counters, counters是dict,这个dict的key就是sentinel instance的runid。
-
此处投票给自己并非之前给员外讲过的羊群效应,因为vote信息并没有广而告知,并没有在sentinel之间互相广播传播。 投票信息是从other sentienl往当前sentinel汇总的。仅仅算是当前sentinel的自己的一点私心而已. 当然其实也是很慷慨的,如果有other sentinel在当前sentinel 通过此处的sentinelVoteLeader逻辑投票之前已经赢得了时间获得了投票并且已经反馈到了当前sentinel, 至少当前sentinel这一票肯定会投给他。
-
最终统计vote情况,需要大于大多数,也需要大于master->quorum。此处也就是quorum的另外一处大多数统计的用途。 如果不满足,则即使有winner也会被清空。 此处sentinelGetLeader一次统计不成功,会再次统计,一直重试,直到SENTINEL_ELECTION_TIMEOUT,大概10s.
-
-
关于leader以及leader_epoch上面大致已经介绍完了。
-
failover中SENTINEL_FAILOVER_STATE_WAIT_PROMOTION状态
/* src/sentinel.c */ 1790 void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) { 1945 if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) { 1946 /* If this is a promoted slave we can change state to the 1947 * failover state machine. */ 1948 if ((ri->flags & SRI_PROMOTED) && 1949 (ri->master->flags & SRI_FAILOVER_IN_PROGRESS) && 1950 (ri->master->failover_state == 1951 SENTINEL_FAILOVER_STATE_WAIT_PROMOTION)) 1952 { 1958 ri->master->config_epoch = ri->master->failover_epoch; 1959 ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
-
可以看到当前sentinel的master sentinelRedisInstance的config_epoch被更新为master sentinelRedisInstance的 failover_epoch,虽然是在借助slave sentinelRedisInstance的情况下完成的,但是此举也就是认定config upgrade的这一重要步骤。
-
此处master config_epoch upgrade之后,新的master config_epoch以及promoted slave的ip和port信息 以及当前sentinel.current_epoch就会不断通过send hello msg从当前sentinel广播出去, 即使当前sentinel都还未真正生效此变更,因为还不到当前sentinel变更的时候,
-
可以解释为,此举之后当前sentinel的failover行为是不可逆的,一定要成功,即使当前sentinel真的crash了, 那么这个upgrade config也由于广播出去了,会被其他sentinel最终fix生效。 但是当前sentinel还需要在目前的视角上做一些事情,所以还不到变更的时机。
-
-
other sentinel收到hello msg的处理逻辑sentinelProcessHelloMessage
/* src/sentinel.c */ 2121 void sentinelProcessHelloMessage(char *hello, int hello_len) { 2122 /* Format is composed of 8 tokens: 2123 * 0=ip,1=port,2=runid,3=current_epoch,4=master_name, 2124 * 5=master_ip,6=master_port,7=master_config_epoch. */ 2126 uint64_t current_epoch, master_config_epoch; 2129 2130 if (numtokens == 8) { 2132 master = sentinelGetMasterByName(token[4]); 2133 if (!master) goto cleanup; /* Unknown master, skip the message. */ 2134 2135 /* First, try to see if we already have this sentinel. */ 2137 master_port = atoi(token[6]); 2138 si = getSentinelRedisInstanceByAddrAndRunID( 2139 master->sentinels,token[0],port,token[2]); 2140 current_epoch = strtoull(token[3],NULL,10); 2141 master_config_epoch = strtoull(token[7],NULL,10); 2168 /* Update local current_epoch if received current_epoch is greater.*/ 2169 if (current_epoch > sentinel.current_epoch) { 2170 sentinel.current_epoch = current_epoch; 2174 } 2175 2176 /* Update master info if received configuration is newer. */ 2177 if (master->config_epoch < master_config_epoch) { 2178 master->config_epoch = master_config_epoch; 2179 if (master_port != master->addr->port || 2180 strcmp(master->addr->ip, token[5])) 2181 { 2182 sentinelAddr *old_addr; 2183 2191 old_addr = dupSentinelAddr(master->addr); 2192 sentinelResetMasterAndChangeAddress(master, token[5], master_port);
可以看到此处有几个更新逻辑。
-
如果hello msg 的current_epoch,大于sentinel.current_epoch,则更新sentinel.current_epoch, 这里是sentinel.current_epoch被动更新的又一处逻辑。此处也就是current_epoch的最后一处更新逻辑.
-
如果hello msg的master_config_epoch大于master->config_epoch,则此处更新master->config_epoch
-
在master config_epoch变更的情况下,如果master ip port和当前不匹配,则做 sentinelResetMasterAndChangeAddress切换更新master info。
-
-
other sentinel在获取该hello msg之后,以及当前sentinel在准备好switch状态后,共用一个逻辑sentinelResetMaster
-
sentinelResetMaster清空了master sentinelRedisInstance的leader信息。不保留已有的投票信息
-
将failover_start_time置为0,去掉了之前提到的failover_start_time的影响。
-
将failover_state置为初始SENTINEL_FAILOVER_STATE_NONE值。
-
清空了promoted_slave信息。
-
关于leader_epoch信息是完全保留的,小于等于该leader_epoch的vote请求不再会有 有意义的更新的vote信息返回,即返回null.
-
sentinelResetMasterAndChangeAddress在sentinelResetMaster之后立马switch了master->addr,更新了master info。
-
至此,关于epoch和vote的细节解释完成。