Swarm2k and Swarm3k lessons learned

Here's a summary of what we learned from these experiments:

  • For a large set of workers, managers require a lot of CPU. CPU usage spikes whenever the Raft recovery process kicks in.
  • If the leading manager dies, it's better to stop Docker on that node and wait until the cluster becomes stable again with n-1 managers.
  • Keep snapshot reservation as small as possible. The default Docker Swarm configuration will do. Persisting Raft snapshots uses extra CPU.
  • Thousands of nodes require a huge set of resources to manage, both in terms of CPU and network bandwidth. Try to keep services and the managers' topology geographically compact.
  • Hundreds of thousands of tasks require high-memory nodes.
  • For now, a maximum of 500-1000 nodes is recommended for stable production setups.
  • If managers seem to be stuck, wait; they'll recover eventually.
  • The advertise-addr parameter is mandatory for Routing Mesh to work.
  • Put your compute nodes as close to your data nodes as possible. The overlay network is great, but it requires tuning the Linux network configuration on all hosts to work at its best.
  • Docker Swarm Mode is robust. There were no task failures, even with an unpredictable network connecting this huge cluster together.
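
The advice about running with n-1 managers rests on Raft's majority rule, which can be sketched with a little arithmetic. This is only an illustration: the function names are hypothetical, and the manager counts in the loop are examples rather than the numbers used in the experiments.

```python
def raft_quorum(managers: int) -> int:
    """Smallest number of managers that forms a Raft majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """Managers that can be lost while a majority still remains."""
    return managers - raft_quorum(managers)


if __name__ == "__main__":
    # Illustrative manager counts only (assumption, not from the experiments).
    for n in (3, 5, 7):
        remaining = n - 1  # the leading manager has died
        print(
            f"{n} managers: quorum {raft_quorum(n)}, "
            f"tolerates {tolerated_failures(n)} failure(s); "
            f"with {remaining} left: tolerates {tolerated_failures(remaining)}"
        )
```

With an odd number of managers, losing one leaves an even-sized group that can absorb one failure fewer than before, which is one reason to let the cluster settle before making further changes.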

For Swarm3k, we would like to thank all the heroes: @FlorianHeigl; @jmaitrehenry from PetalMD; @everett_toews from Rackspace, Internet Thailand; @squeaky_pl, @neverlock, @tomwillfixit from Demonware; @sujaypillai from Jabil; @pilgrimstack from OVH; @ajeetsraina from Collabnix; @AorJoa and @PNgoenthai from Aiyara Cluster; @GroupSprint3r, @toughIQ, @mrnonaki, @zinuzoid from HotelQuickly; @_EthanHunt_; @packethost from Packet.io; @ContainerizeT-ContainerizeThis, The Conference; @_pascalandy from FirePress; @lucjuggery from TRAXxs; @alexellisuk; @svega from Huli; @BretFisher; @voodootikigod from Emerging Technology Advisors; @AlexPostID; @gianarb from ThumpFlow; @Rucknar, @lherrerabenitez; @abhisak from Nipa Technology; and @djalal from NexwayGroup.

We would also like to thank Sematext again for the best-in-class Docker monitoring system, and DigitalOcean for providing us with all the resources.
