How is etcd important for coreos setup? #658

sheerun · 2016-02-29T11:03:15Z

Hey,

I see Apollo uses consul for service discovery and that the services are deployed with ansible. So how important are the initial etcd cluster and the fleet daemon? Isn't the only role of CoreOS currently to provide a consistent base system with systemd support? Can we just disable etcd and fleet?

tayzlor · 2016-02-29T11:40:07Z

At the moment, there isn't much dependency on etcd. The only role it really plays right now is in OS upgrades: the cluster takes an etcd-lock to ensure that the machines in the cluster are not all rebooted at once. In the future we could do more dynamic bootstrapping of the machines, allowing them to discover each other better (rather than relying on ansible host / group variables to bootstrap the cluster at launch time). When we ported the ansible configuration over to CoreOS we deliberately didn't go the whole hog on making everything fully dynamic, just to try and ease the transition.
If we did move to a more dynamic cluster bootstrap, this would make it more scalable from the outset.

We'd be up for the discussion. It depends on how much we want to couple the project to CoreOS as the only place we can deploy to.

sheerun · 2016-02-29T12:03:34Z

Hmm.

I'd vote for removing the dynamic part from the core setup and focusing on the dynamic aspect of the mesos cluster. Rolling updates can probably be handled well at the terraform level (or even with a script that taints and redeploys master machines one by one, including a graceful mesos shutdown). I find that less magical and more explicit than a rolling update in an etcd cluster.

All in all, instead of handling discoverability in two places (etcd, consul) and fleet setup in two places (fleetd, mesos), I'd focus on consul and mesos and make the base setup more static. In particular, I can see value in deploying slaves with CentOS 7 or Ubuntu 16.04 LTS as the base system (both support systemd).

sheerun · 2016-02-29T12:17:10Z

I think the only requirements for a host system for Apollo should be: network, storage (like glusterfs or ceph), systemd, and docker support. All the rest can be handled with proper docker services in systemd, no?

wallies · 2016-02-29T15:55:40Z

We currently support rolling updates for different parts of the platform via two ansible playbooks.
https://github.com/Capgemini/Apollo/blob/devel/rolling-upgrade-maintenance.yml updates different pieces of software on the masters and slaves, e.g. docker, weave, consul etc. We do this on the slaves using mesos maintenance primitives and by rolling one host at a time.
There is also https://github.com/Capgemini/Apollo/blob/devel/rolling-upgrade-mesoscluster.yml to upgrade mesos and marathon on the mesos master servers.
Yes, you are right, Apollo should handle platform-level things, e.g. storage, network, docker, but as we are currently transitioning to CoreOS and systemd, we have to deal with the fact that CoreOS utilises etcd and fleet to bootstrap a cluster and for OS updates via the channels.

sheerun · 2016-02-29T16:33:43Z

Well, I don't see how etcd and fleetd play a role in bootstrapping the cluster.

There are two levels of clustering here:

1. The cluster of host CoreOS machines (the configuration is the same for masters and slaves, except for a role flag used by ansible). Currently you use the architecture known as Small Cluster, where etcd runs on each node. For some reason you also enable the fleet daemon, but you don't use it.
2. The cluster of mesos masters and slaves. It is discovered independently of the cluster of CoreOS machines via consul and can exist without etcd / fleet / flannel. It's bootstrapped by ansible scripts that read the "role" flag set in the cloud-setup config.

According to the CoreOS documentation, "The update service is an optional hosted service provided by CoreOS and is not included in a standard CoreOS cluster", and it starts at $995/month, which is a pretty sad requirement for an open source product. Of course you could set up something open source like core roller, but it's not an officially supported solution, so it's questionable for production deployment.

On the other hand, tainting instances in terraform and re-deploying them serially with a script is a far more straightforward solution, and one you could potentially use at the same time as the ansible scripts that update containers: apollo upgrade would take care of tainting instances one by one and running the ansible upgrade scripts.
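
To make the serial taint-and-redeploy idea concrete, here is a minimal sketch that only builds the list of terraform commands and does not run them. The resource addresses are hypothetical and depend on the actual terraform config, and the mesos graceful shutdown and ansible upgrade steps are deliberately left out.

```python
from typing import List


def rolling_redeploy_plan(resources: List[str]) -> List[List[str]]:
    """Build the terraform commands to replace instances one at a time.

    `resources` are terraform resource addresses (hypothetical names here).
    Each instance is tainted and then re-applied on its own, so only one
    machine is being replaced at any point.
    """
    plan = []
    for address in resources:
        plan.append(["terraform", "taint", address])
        plan.append(["terraform", "apply", "-target=" + address])
    return plan


if __name__ == "__main__":
    # Hypothetical addresses; the real ones depend on the terraform config.
    masters = ["aws_instance.mesos-master.0", "aws_instance.mesos-master.1"]
    for command in rolling_redeploy_plan(masters):
        print(" ".join(command))
```

Keeping the plan as plain data means a wrapper script could print it for review first and only then execute each step, checking the exit status before moving on to the next machine.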

Other than the coreupdate feature you don't use etcd and fleetd for anything. They just add complexity to handle, so why bother?

sheerun · 2016-02-29T17:03:46Z

OK, I see etcd is only used in one place: to bootstrap the consul cluster in https://github.com/Capgemini/Apollo/blob/devel/roles/consul/templates/consul-discovery.service.j2

But you already have this information (the IP addresses of the master nodes) from the terraform output. You don't need etcd to discover the initial set of servers.

Even if you don't want to use them, there's a free discovery service hosted by atlas, as described in https://www.consul.io/docs/guides/bootstrapping.html

wallies · 2016-02-29T20:52:56Z

You are right, we don't need fleet; I will raise a ticket to disable it.
You don't need the update service to upgrade CoreOS machines. You can still configure the upgrade strategy to something like best-effort or etcd-lock, but I do see your point that we don't really need this running all the time; it could run just when doing updates of the servers.
We used to use atlas to bootstrap the consul cluster, but we moved away from it. I will talk to the team about different options for etcd and consul bootstrapping. I think we can disable it after boot and only start it when needed.
As for using terraform taint, this won't really work, as a new instance started from the ami or digitalocean droplet id will have the old CoreOS image and will still need updating via the CoreOS channel and rebooting via ansible scripts. For mesos masters, though, this could work with terraform taint and ansible upgrade scripts run with serial.

sheerun · 2016-02-29T21:52:30Z

I think digital ocean automatically updates the coreos-stable channel, and you can just explicitly update the ami on aws if you decide to upgrade.

As far as initial bootstrapping goes, I'd just use outputs.master_ips from terraform.tfstate as a parameter to the ansible scripts.
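
A rough sketch of that idea, under some loud assumptions: the state is taken to expose the output at outputs.master_ips (the exact layout of terraform.tfstate varies between terraform versions), the value is either a list or a comma-separated string, and the ansible variable name consul_servers is hypothetical. The addresses in the sample are made up.

```python
import json
from typing import Any, Dict, List


def master_ips(state: Dict[str, Any]) -> List[str]:
    """Extract the master_ips output from a parsed terraform state.

    Assumes the output sits at outputs.master_ips, either as a bare value
    or wrapped as {"value": ...}, and is a list or a comma-separated string.
    """
    output = state["outputs"]["master_ips"]
    if isinstance(output, dict):
        output = output["value"]
    if isinstance(output, str):
        output = output.split(",")
    return [ip.strip() for ip in output if ip.strip()]


def ansible_extra_vars(ips: List[str]) -> str:
    """Render the IPs as a JSON string for ansible-playbook --extra-vars.

    The variable name consul_servers is hypothetical.
    """
    return json.dumps({"consul_servers": ips})


if __name__ == "__main__":
    # Inline sample standing in for the parsed contents of terraform.tfstate.
    sample = {"outputs": {"master_ips": "10.0.0.1, 10.0.0.2,10.0.0.3"}}
    print(ansible_extra_vars(master_ips(sample)))
```

The resulting JSON string could then be handed to ansible-playbook via --extra-vars, so the consul servers get their initial peer list without any discovery step.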

enxebre · 2016-03-01T09:57:24Z

Can't see the use case for fleet at the minute.
Although we are not really making use of etcd at the moment and it kind of overlaps with consul, as we are going CoreOS I wonder if it could be a helpful and cleaner solution for upgrades or autoscaling.
Anyway, if we keep it we should run the quorum on the masters and proxy agents on the slaves, as is recommended for production.

tayzlor · 2016-03-01T09:57:28Z

Raised #659 to turn off fleet - agreed it's not necessary. I don't think there is a simple answer (at the moment) on the etcd issue as it could depend on how we would like to do autoscaling of instances in the future.

I would agree that at the moment it's not that necessary, but I'm not fully sold (yet) on the idea of handling the updates solely via terraform taint etc...

If we were to do much more dynamic autoscaling of instances, etcd could come into its own in a situation like that, where we can dynamically adjust and configure the cluster at scale time without the need for interaction with an external service (e.g. ansible). It might be worth exploring how we would tackle autoscaling of instances first before making a decision on whether to ditch etcd or not. It could be that it turns out autoscaling would be real easy just hooking straight into a launch config with ansible and we don't need etcd. In that case I'd be in favour of removing it for simplicity.

If we were to keep etcd we should move to the "production cluster" setup where we'd have consensus achieved by the "master" node set (3/5 servers) and each slave would be an etcd worker.
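
For reference, the reason 3 or 5 servers are the usual sizes comes down to majority arithmetic. This is a small illustration of that arithmetic only, not of how etcd is configured.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, quorum(size), failures_tolerated(size))
```

A 3-member set keeps working with 1 member down and a 5-member set with 2 down, while a 4-member set still only tolerates 1, which is why even sizes add no resilience.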

sheerun · 2016-03-01T16:10:36Z

What about an alternative approach to deploying? Instead of using ansible to provision hosts, just generate .service files for fleet and push them to the cluster. This way auto-scaling would work out of the box.

tayzlor mentioned this issue Mar 1, 2016

disable fleet #659

Closed
