Throughout this book, we’ve focused on the technical side of microservices: how to design, deploy, and operate services. But it’d be a mistake to examine the technical nature of microservices alone. People implement software, and building great software is as much about effective communication, alignment, and collaboration as implementation choices.
A microservice architecture is great for getting things done. It allows you to build new services and capabilities rapidly and independently of existing functionality. Conversely, it increases the scope and complexity of day-to-day tasks, such as operations, security, and on-call support. It can significantly change an organization’s technical strategy. It demands a strong culture of ownership and accountability from engineers. Achieving this culture, while minimizing friction and increasing pace, is vital to a successful microservice implementation.
In this chapter, we’ll begin by discussing team formation in software engineering and the principles that make teams effective. We’ll then examine different models for engineering team structure and how they apply to microservice development. Lastly, we’ll explore recommended practices for governance and engineering culture within microservice teams. Throughout the chapter, we’ll touch on and explain how to mitigate some common pitfalls of microservice development.
Although you might not currently work as an engineering manager, a team lead, or a director, we think it’s essential to understand how these dynamics — and the choices you and your organization make — impact the pace and quality of microservice development.
Splitting engineers into independent teams is a natural outcome of organizational growth. Doing so is necessary to help an organization scale effectively, as limiting team size has several benefits:
Small, independent teams can typically move faster than large teams. They also gel faster and gain effectiveness more quickly. On the other hand, splitting engineers into distinct teams can also cause new problems:
Microservices can exacerbate these divisions. Different teams will likely no longer work on the same shared body of code. Teams will have different, competing priorities — and be less likely to have a global understanding of the application.
Building an effective engineering organization beyond a small group of people — and developing great software products — is a balancing act between these two tension points: autonomy and collaboration. If boundaries between teams overlap and ownership is unclear, tension can increase; conversely, independent teams still need to collaborate to deliver the whole application.
It can be difficult to separate cause and effect in organizations that have successfully built microservice applications. Was the development of fine-grained services a logical outcome of their organizational structure and the behavior of their teams? Or did that structure and behavior arise from their experiences building fine-grained services?
The answer is: a bit of both! A long-running system isn’t only an accumulation of features requested, designed, and built. It also reflects the preferences, opinions, and objectives of its builders and operators. This indicates that structure — what teams work on, what goals they set, and how they interact — will have a significant impact on how successfully you build and run a microservice application.
Conway’s Law expresses this relationship between team and system:
…organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations…
“Constrained” might suggest that these communication structures limit and constrict the effective development of a system. But the inverse of the rule is also true: you can take advantage of changes to team structure to produce a desired architecture. Team structure and microservice architecture are symbiotic: both can and should influence each other. This is a powerful technique, which we’ll consider throughout this chapter.
At a macro level, it’s best to think of teams as units of achievement and communication. They’re how stuff gets done and how people relate to each other within an organization. To realize benefits from microservices and adequately manage their complexity, your teams will need to adopt new working principles and practices, rather than using the same techniques they used to build monoliths.
There’s no single right, perfect way to organize your teams. You’ll always suffer from constraints: headcount, budget, personalities, skill sets, and priorities. Sometimes you can hire to fill a gap; sometimes you can’t. The nature of your application and business domain will demand different approaches and skills. Your organization may be limited in its capacity to change. The best approach we’ve found is to guide the formation of teams using a small set of shared principles: ownership, autonomy, and end-to-end responsibility.
Teams with a strong sense of ownership have high intrinsic motivation and exercise a considerable degree of responsibility for the area they own. Because microservice applications are typically long-lived, teams that have long-term ownership of an area support the evolution of that code while developing deep understanding and knowledge.
In a monolithic application, ownership is typically n:1. Many teams own one service: the monolith. This ownership is often split between different layers (such as frontend and backend) or between functional areas (such as orders and payments). In a microservice application, ownership is usually 1:n, meaning a team might own many services. Figure 13.2 depicts these ownership models.
As an organization’s codebase grows and the makeup of the engineering team fluctuates, the risk of code that no one knows — or code that no one can fix when it breaks — increases. Clear ownership helps you avoid this risk by placing natural, reasonable bounds on a team’s knowledge while ensuring that ownership is the responsibility of a group, not individual developers.
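One way to keep ownership explicit is to record it as data and check it automatically. The following is a minimal sketch, assuming a hypothetical mapping from team names to the services each team claims; the team and service names, and the `find_ownership_problems` helper, are invented for illustration.

```python
from collections import defaultdict


def find_ownership_problems(team_services, all_services):
    """Return (orphans, contested) for a 1:n team-to-service ownership map.

    team_services: dict mapping a team name to the services it claims to own.
    all_services:  iterable of every service known to run in production.
    """
    owners = defaultdict(set)
    for team, services in team_services.items():
        for service in services:
            owners[service].add(team)

    # A service nobody claims is at risk of becoming code that no one knows.
    orphans = sorted(s for s in all_services if not owners.get(s))
    # A service claimed by several teams has unclear ownership.
    contested = {s: sorted(t) for s, t in owners.items() if len(t) > 1}
    return orphans, contested


# Hypothetical example data.
team_services = {
    "payments": ["payment-gateway", "ledger"],
    "orders": ["order-api", "ledger"],
}
all_services = ["payment-gateway", "ledger", "order-api", "notifications"]

orphans, contested = find_ownership_problems(team_services, all_services)
```

A check like this could run whenever a service is added or a team is reorganized, so that gaps in ownership surface before something breaks.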
It’s not coincidental that these three principles reflect some of the principles of microservices themselves. Teams that can work autonomously — with limited dependencies on other teams — can work with less friction. These types of teams are highly aligned but loosely coupled.
Autonomy is important for scale. For an engineering manager, it’s exhausting to control the work of multiple teams (not to mention, disempowering for the teams themselves); instead, you can empower teams to self-manage.
A development team should own the full ideate-build-run loop of a product. With control over what’s being built, a team can make rational, local priority decisions; experiment; and achieve a short cycle time between coming up with an idea and validating that idea with real code and users.
Most software spends significantly longer in operation than it ever spent being built. But many software engineers focus on the build stage, throwing code over the fence for a separate team to run it. This ultimately results in poorer quality and slower delivery. How software operates — how you observe its behavior in the real world — should feed back into improving that software (figure 13.3). Without responsibility for operation, this information is often lost. This tenet is also central to the DevOps movement.
End-to-end responsibility correlates closely with autonomy and ownership:
In this section, we’ll explore two approaches for structuring teams — by function or across function — and their benefits and disadvantages in developing microservices.
The latter approach is a natural fit with microservices development.
Traditionally, many engineering organizations have been grouped along horizontal, functional lines: backend engineers, frontend engineers, designers, testers, product (or project) management, and sysadmin/ops. Figure 13.4 illustrates this type of organization. In other cases, teams or individuals may move between any number of time-bounded projects.
This approach optimizes for expertise:
Now imagine you’re building a new feature. This functional approach almost looks like a chain: the analyst team gathers requirements, engineers build backend services, testing windows are scheduled with the QA team, and sysadmins deploy the service. You can see that this approach involves a high coordination burden: delivering a feature relies on synchronization across several independent teams (figure 13.5). This approach fails to meet our three principles for effective organization.
No team has clear ownership of business outcomes or value — they’re only cogs in the value chain. As such, ownership of individual services is unclear: once a project is finished, who maintains the services that were built? How are these iterated on, improved, or discarded? Work allocation based on projects tends to shortchange long-term thinking and encourages ownership of code by individual engineers, which you want to avoid.
These teams are tightly coupled, not autonomous. Their priorities are set elsewhere, and every time work crosses a team boundary, the chance increases that a team will be blocked and development will be hampered. This leads to long lead times, rework, quality issues, and delays. Without alignment to the system architecture they’re building, the team will be unable to evolve their application without being encumbered by other teams.
A project-oriented approach isn’t conducive to long-term responsibility for the code produced or for the quality of a product. If the team is only together for a time-bound project, they might hand off their code to another department to run the application, so the original team won’t be able to iterate on their original ideas and implementation. The organization will also fail to realize benefits from knowledge retention in the original team.
A new team also requires time to normalize productive working behaviors: the longer people work together, the better the team gels, and the more effective it becomes. A team that stays together longer will maintain a longer period of high performance.
Lastly, this approach also risks the formation of silos — teams diverge in goals and become incapable of effective, empathetic collaboration. Hopefully you’ve never worked someplace where the relationship between test and dev, or dev and ops, is almost adversarial, but it’s been known to happen.
Ultimately, it’s unlikely that a functional, project-oriented organization will deliver a microservice application without incurring significant friction and substantial cost.
By optimizing for expertise, the functional approach aims to eliminate duplicated work and skill-based inefficiencies, in turn reducing overall cost. But this can cause gridlock: increasing friction and reducing your speed in achieving organizational goals. This isn’t great — your microservice architecture was meant to increase pace and reduce friction.
Let’s look at an alternative. Instead of grouping by function, you can work cross-functionally. A cross-functional team is made up of people with different specialties and roles intended to achieve a specific business goal. You could call these teams market-driven: they might aim toward a specific, long-term mission; build a product; or connect directly with the needs of their end customer. Figure 13.6 depicts a typical cross-functional team.
Compared to the functional approach, a cross-functional team can be more closely aligned with the end goal of the team’s activity. The multidisciplinary nature of the team is conducive to ownership. By taking on end-to-end responsibility for specification, deployment, and operation, the team can work autonomously to deliver features. The team gains clear accountability by taking on a mission that has a meaningful impact on the business’s success. Day-to-day partnership between different specialists eliminates silos, as team members share ownership for the ultimate product of the team’s work.
Designing these teams to be long-lived (for example, at least six months) is also beneficial. A long-lived team builds rapport, which increases their effectiveness, and shared knowledge, which increases their ability to optimize and improve the system under development. They also take long-term responsibility for the operation of the microservice application, rather than handing it off to another team.
The cross-functional, end-to-end approach to structuring teams is advantageous to microservice development:
This approach is common in modern web enterprises and is often cited as a reason for their success. For example, Amazon’s CTO described the company’s approach to architecture in 2006:
In the fine grained services approach that we use at Amazon, services do not only represent a software structure but also the organizational structure. The services have a strong ownership model, which combined with the small team size is intended to make it very easy to innovate. In some sense you can see these services as small startups within the walls of a bigger company. Each of these services require a strong focus on who their customers are, regardless whether they are externally or internally.
—Werner Vogels
Perhaps most importantly, a well-formed cross-functional team will be faster at delivering features than a group of functional teams, as lines of communication are shorter, coordination is local, and team members are aligned. The cross-functional approach prioritizes pace — but not at the expense of quality!
A cross-functional team should have a mission. A mission is inspirational: it gives the team something to strive toward but also sets the boundaries of a team’s responsibilities. Determining what a team is (and isn’t) responsible for encourages autonomy and ownership while helping other teams align with each other. A mission is usually a business problem; for example, a growth team might aim to maximize recurring spend by customers, whereas a security team might aim to protect its codebase and data from known and novel threats. Based on this mission, each team prioritizes its own roadmap in collaboration with relevant partners within the business. Cross-cutting initiatives are driven by product or technical leadership.
If your company offers a range of small products, each of which a team of 7 ± 3 people can productively work on, each team can be responsible for one product (figure 13.7). This isn’t the case in many companies, such as those that offer a large, complex product that requires the effort of multiple teams.
For larger-scale scenarios, bounded contexts (covered in chapter 4) are an effective starting point for setting loose boundaries for different teams in an organization. They also have the benefit of creating teams that map closely to business teams within the enterprise; for example, a warehouse product team will interact closely with warehouse operations. Figure 13.8 illustrates a possible model for teams within SimpleBank.
Forming teams that own services in specific bounded contexts makes use of the inverse version of Conway’s Law: if systems reflect the organizational structure that produces them, then you can attain a desirable system architecture by first shaping the structure and responsibilities of your organization.
As with services themselves, the right boundaries between teams may not always be obvious. We keep two general rules in mind:
Although we’ve advocated strongly for end-to-end ownership, it isn’t always practical. For example, the underlying infrastructure — or microservice platform — of a large company is typically complex and requires a joined-up roadmap and dedicated effort, rather than loose collaboration between DevOps specialists spread across distinct teams.
As we outlined earlier in the book, building a microservice platform (deployment processes, chassis, tooling, and monitoring) is vital to sustainably and rapidly building a great microservice application. When you first start working with microservices, the team building the application will usually own the task of building the platform too (figure 13.9). Over time, this platform will need to serve the needs of multiple teams, at which stage you might establish a platform team (figure 13.10).
Depending on the needs of your company and your technical choices, you might split this platform team further (figure 13.11) to distinguish core infrastructural concerns (such as cloud management and security) from specific microservice platform concerns (such as deployment and cluster operation). This is especially common in companies that operate their own infrastructure, rather than using a cloud provider.
In an even larger engineering organization, these tiers might be separated further; for example, different platform teams might focus on deployment tools, observability, or inter-service communication. This is also illustrated in figure 13.11.
The three-tier model shown in the figure provides economies of scale and specialization. This isn’t a service relationship, where teams log tickets to each other. Instead, the output of each tier is a “product” that enables teams in the layer above to be more effective and productive.
The DevOps movement has been a strong influence on microservice approaches. A DevOps mentality — breaking down the barriers between build and runtime — is vital for doing microservices well, as deploying and operating multiple applications increases the cost and complexity of operational work. This movement encourages a “you build it, you run it” mindset; a team that takes responsibility for the operational lifetime of their services will build a better, more stable and more reliable application. This includes being on-call — ready to answer alerts — for your production services.
For example, in the three-tier model:
This on-call model is illustrated in figure 13.12.
Of the many changes that microservices bring, this may be the most difficult to roll out: engineers are likely to resist being on-call, even for their own code. A successful on-call rotation should be
In this model, we split alerts across teams, because running software at scale is complex. Operational effort might be beyond the scope or knowledge of engineers within any one team. Many operational tasks — such as operating an Elasticsearch cluster, deploying a Kafka instance, or tuning a database — require specific expertise that would be unreasonable to expect product engineers to gain uniformly. Operational work also runs at a cadence different from the pace of product delivery.
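The split of alerts across teams can be sketched as a simple routing table. This is an illustrative sketch only: the tiers follow the three-tier model described earlier, but the component names, the `ON_CALL_OWNER` table, the tier each component is assigned to, and the fallback rule are all assumptions rather than a description of any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    component: str  # the service or piece of infrastructure that fired the alert
    message: str


# Hypothetical ownership table: component -> team that is on call for it.
ON_CALL_OWNER = {
    "order-api": "orders-team",              # product tier
    "deploy-pipeline": "platform-team",      # microservice platform tier
    "kafka-cluster": "infrastructure-team",  # core infrastructure tier
}


def route(alert, fallback="platform-team"):
    """Pick the team whose on-call engineer should be paged for an alert.

    Unknown components go to a fallback team so that no alert is dropped.
    """
    return ON_CALL_OWNER.get(alert.component, fallback)
```

Keeping the table explicit makes it easy to see which team carries which operational burden, and to rebalance it as the platform evolves.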
The right choice for an on-call model that balances responsibility and expertise will depend on the types of applications you build, the throughput of those applications, and the underlying architecture you choose. If you’re interested in learning more, Increment recently published an in-depth review (https://increment.com/on-call/who-owns-on-call/) of on-call approaches used at Google, PagerDuty, Airbnb, and other organizations.
Although autonomous teams increase development pace, they have two downsides:
You can mitigate these issues. We’ve had success applying Spotify’s model of chapters and guilds. These are communities of practice:
Figure 13.13 depicts this model.
Similarly, some organizations use matrix management to establish a formal identity for functional units. This adds a line of management responsibility (head of QA, head of design…) for functions, at the cost of building a more complicated management structure.
Either approach works well to disseminate knowledge and develop shared working practices. This helps to prevent the isolation that can arise in highly autonomous teams, ensuring teams remain aligned technically and culturally. Cross-pollination of ideas, solutions, and techniques also supports people moving between teams and reduces organization-level bus factor risks.
It’s also important to strike a balance between team lifetime and team fluidity. In the long run, regularly rotating engineers between teams helps to share knowledge and skills and is a good complement for the chapter and guild model.
The scale of change in a microservice application can be tremendous. It can be difficult to keep up! It’s unreasonable to expect any engineer to have a deep understanding of all services and how they interact, especially because the topology of those services may change without warning. Likewise, grouping people into independent teams can be detrimental to forming a global perspective. These factors lead to some interesting cultural implications:
Good engineering practices can help you avoid these problems. In this section, we’ll walk through some of the practices that your teams should follow when building and maintaining services.
Take a moment and consider the type of build items you might work on day to day. If you’re on a product team, the items in your backlog are primarily functional additions or changes. You want to launch a new feature; support a new request from a customer; enter a new market; and so on. As such, you build and change microservices in response to these new functional requirements. And, thankfully, microservices are intended to ensure your application is flexible in the face of change.
But functional requirements — changes from your business domain — aren’t the only driver of change in services. Each microservice will change for many reasons (figure 13.14):
All this change increases complexity. For example, instead of tracking security vulnerabilities against a single monolithic application, you need to ensure your tooling supports static analysis and alerting across several applications (and likely several distinct programming languages and frameworks). Every new service generates more work.
Alternatively, some microservice practitioners have advocated immutable services — once a service is considered mature, put it under feature freeze, and add new services if change is required. There’s a tricky cost-benefit decision here: is the risk of breaking a service through modification more than the cost of building a new service? It’s a difficult question to answer definitively and will depend on both your business context and appetite for risk.
Microservice applications evolve over time: teams build new services; decommission existing services; refactor existing functionality; and so on. The faster pace and more fluid environment that microservices enable change the role of architects and technical leads.
Architects have an important role to play in guiding the scope and overall shape of an application. But they need to perform that role without becoming a bottleneck. A prescriptive and centralized approach to major technical decisions doesn’t always work well in a microservice application:
That doesn’t mean that architecture isn’t useful or necessary. An architect should have a global perspective and make sure the global needs of the application are met, guiding its evolution so that
The best starting point for architecture is to set principles. Principles are guidelines (or sometimes rules) that teams should follow to achieve higher level goals. They inform team practice. Figure 13.15 illustrates this model.
For example, if your product goal is to sell to privacy- and security-sensitive enterprises, you might set principles of compliance with recognized external standards, data portability, and clear tracking of personal information. If your goal is to enter a new market, you might mandate flexibility around regional requirements, design for multiple cloud regions, and out-of-the-box support for i18n (figure 13.16).
Principles are flexible. They can and should change to reflect the priorities of the business and the technical evolution of your application. For example, early development might prioritize validating product-market fit, whereas a more mature application might require a focus on performance and scalability.
Several day-to-day practices support this evolutionary approach to architecture, such as design review, an inner-source model, and living documentation. We’ll discuss them over the next few sections.
A tricky decision you’ll face is which languages to use to write microservices. Although microservices provide for technical freedom, using a wide range of languages and frameworks can increase risk:
In practice, you’ll always encounter scenarios where you need to pick a different language, for example to meet specialist feature or performance needs. Java would be ill-suited to writing systems infrastructure, just as Ruby doesn’t have the depth of scientific and machine learning libraries available to Python. In these scenarios, it’s important to share the development of services in new languages/frameworks across many team members to reduce bus factor risk: rotate team members, pair program, write documentation, and mentor new engineers.
Picking a single primary language, or a small set, allows you to better optimize practices and approach for that language. The creation of service templates, chassis, and/or exemplars will naturally ease development in your favored language, leading more developers to write services using it. Lowering friction this way creates a virtuous circle. Even if you don’t explicitly choose a favored language, this can happen organically (although it’ll take longer).
Applying open source principles to microservice code can help to alleviate contention and technical isolation while improving knowledge sharing. As we mentioned earlier, each team in a microservice organization typically owns multiple services. But each service you run in production must have a clear owner: a team that takes long-term responsibility for that service’s functionality, maintenance, and stability.
That doesn’t mean those people must be the only contributors to that service. Other teams might need to tweak functionality to meet their needs or fix defects. If the same group of people had to make all these changes, every change would be at the mercy of that group’s priorities, which in turn would slow other teams down.
Instead, an inner-source model — open source within your organization — balances ownership and visibility:
This model (figure 13.17) closely resembles most open source projects, where a core group of committers make most commits and key decisions, and others can submit changes for approval. Imagine an engineer on Team A needs to make a change to a service that Team B owns. They could argue for the priority of their change against everything else on Team B’s backlog, or they could pull the code, make the change themselves, and submit a pull request for Team B to review.
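The review rule at the heart of this model can be expressed in a few lines. The sketch below assumes a hypothetical `SERVICE_OWNERS` table, hypothetical team membership, and a simplified pull request record; it isn't tied to any code-hosting tool's API, and the rule that authors can't approve their own changes is an added assumption.

```python
from dataclasses import dataclass, field

# Hypothetical ownership table: service -> owning team.
SERVICE_OWNERS = {"accounts": "team-b"}

# Hypothetical team membership.
TEAM_MEMBERS = {
    "team-a": {"alice"},
    "team-b": {"bob", "carol"},
}


@dataclass
class PullRequest:
    service: str
    author: str
    approvals: set = field(default_factory=set)


def can_merge(pr):
    """Anyone may propose a change, but the owning team must approve it."""
    owners = TEAM_MEMBERS[SERVICE_OWNERS[pr.service]]
    # Assumption: an author cannot approve their own change, even as an owner.
    reviewers = pr.approvals - {pr.author}
    return bool(reviewers & owners)
```

For example, a pull request that alice (Team A) opens against the accounts service can't be merged until bob or carol (Team B) approves it.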
This approach has three benefits:
Each new microservice is a blank slate. Each service will have different performance characteristics; might be written in a different language; might require new infrastructure; and so on. A new feature might be possible to write in several ways: as a new service, as many services, or within an existing service. This freedom is terrific, but a lack of oversight can result in
A few methods can help you get around this issue. In chapter 7, we discussed using service chassis and service exemplars as best practice starting points. But that’s only a partial solution.
In our own company — comparable to practices at Uber and Criteo — we follow a design review process. For any new service or substantial new feature, the engineer responsible produces a design document (we call this an RFC, or request for comments) and asks for feedback from a group of reviewers, both in and outside of their own team. Table 13.1 outlines the sections in a typical design review document.
| Section | Purpose |
| --- | --- |
| Problem & Context | What technical and/or business problem does this feature solve? Why are we doing this? |
| Solution | How are you intending to solve this problem? |
| Dependencies & Integration | How does it interact with existing or planned services/functionality/components? |
| Interfaces | What operations might this service expose? |
| Scale & Performance | How does the feature scale? What are the rough operational costs? |
| Reliability | What level of reliability are you aiming for? |
| Redundancy | Backups, restores, deployment, fallbacks |
| Monitoring & Instrumentation | How will you understand this service’s behavior? |
| Failure Scenarios | How will you mitigate the impact of possible failures? |
| Security | Threat model, protection of data, and so on |
| Rollout | How will you launch this feature? |
| Risks & Open Questions | What risks have you identified? What don’t you know? |
This process catches suboptimal design decisions early in the development cycle. Although writing a document may seem like extra effort, having a semiformal prompt to consider service design tends to result in faster overall development, as the team brings to light the full range of considerations and tradeoffs before committing to an implementation direction.
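A lightweight check can help keep design documents consistent. The sketch below assumes, purely for illustration, that each RFC is a plain-text file in which every section name from table 13.1 appears on a line of its own; the `missing_sections` helper and the sample draft are hypothetical.

```python
REQUIRED_SECTIONS = [
    "Problem & Context",
    "Solution",
    "Dependencies & Integration",
    "Interfaces",
    "Scale & Performance",
    "Reliability",
    "Redundancy",
    "Monitoring & Instrumentation",
    "Failure Scenarios",
    "Security",
    "Rollout",
    "Risks & Open Questions",
]


def missing_sections(rfc_text):
    """Return the required sections that don't appear as a line in the RFC."""
    present = {line.strip() for line in rfc_text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s not in present]


# Hypothetical, unfinished draft.
draft = """Problem & Context
Customers cannot export their statements.

Solution
Add a statement export service.
"""

todo = missing_sections(draft)  # sections the author still needs to write
```

A check like this is only a prompt: the value of the review comes from the discussion it triggers, not from ticking off headings.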
As we’ve mentioned, it’s difficult to keep a microservice architecture in your head. The scale of a microservice application demands that your team invest time in documentation. For each service, we recommend a four-layered approach: overviews, contracts, runbooks, and metadata. Table 13.2 details these four layers.
This documentation should be discoverable in a registry — a single website where details for all services are available. Good microservice documentation serves many purposes:
Many tools exist for writing project documentation, such as MkDocs (www.mkdocs.org). You could combine them with service metadata approaches, as described in table 13.2, to build a microservice registry.
As a service owner or an architect, you’ll often want to get an overarching view of the state of your application to answer questions like
At the time of this writing, few tools exist in the wild that combine this information to make it readily available. When it’s available, it’s typically spread across multiple locations:
Similar information might be kept in spreadsheets or architectural diagrams, which, sadly, are often out of date.
A recent presentation from John Arthorne at Shopify proposed embedding a file, service.yml, in each code repository and using that as a source of service metadata. This is a promising idea, but at the time of this writing, you’ll need to roll your own.
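To make the idea concrete, the sketch below shows what aggregating such metadata could look like. It assumes the contents of each repository's service.yml have already been parsed into a dictionary; the field names (`name`, `owner`, `language`, `depends_on`) and the sample services are hypothetical and aren't taken from the presentation.

```python
# Hypothetical metadata, as it might look after parsing each service.yml.
SERVICES = [
    {"name": "orders", "owner": "orders-team", "language": "python",
     "depends_on": ["accounts"]},
    {"name": "accounts", "owner": "accounts-team", "language": "go",
     "depends_on": []},
    {"name": "fees", "owner": "orders-team", "language": "python",
     "depends_on": ["accounts", "orders"]},
]


def services_owned_by(services, team):
    """List the names of the services a team owns."""
    return sorted(s["name"] for s in services if s["owner"] == team)


def dependents_of(services, name):
    """List the services that declare a dependency on the given service."""
    return sorted(s["name"] for s in services if name in s["depends_on"])


def languages_in_use(services):
    """Count services per implementation language."""
    counts = {}
    for s in services:
        counts[s["language"]] = counts.get(s["language"], 0) + 1
    return counts
```

Because the metadata lives next to the code, it can be updated in the same pull request as the change it describes, which makes it less likely to drift out of date than a spreadsheet or diagram.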
Forming, growing, and improving engineering teams is a broad topic, and in this chapter we’ve only scratched the surface. If you’re interested in learning more, we recommend the following books as good places to start:
We’ve covered a lot of ground in this chapter. Choosing a microservice engineering approach is great for getting things done and empowering engineers, but changing your technical foundation is only half the battle. Any system is deeply intertwined with the people building it — successful, sustainable development requires close collaboration, communication, and rigorous and responsible engineering practices.
In the end, people deliver software. Getting the best product out requires getting the best out of your team.