Chapter 3. Physical interfaces - Systems beyond Retail E-Commerce

(warehouse operations)

This chapter describes a possible architecture for running an actual warehouse, with a focus on integrating hardware interfaces over wired and wireless networks. It serves as a demonstration of multiple connected devices cooperating to provide the functionality of a semi-automated or fully automated warehouse, integrated with the other subsystems of an e-commerce supply chain.

3.0 Introduction

We are living in interesting times of robotics, artificial intelligence, self-driving cars and autonomous drones. The R&D departments of the leading technology companies and their top software engineers are mostly focused on (deep) machine learning and artificial intelligence. Systems are shifting more and more towards cloud environments, and they need to be designed differently in order to fully leverage the improved availability and fault tolerance. On top of that, the cost savings provided by the managed services of the various cloud providers make it possible to build applications easily and get them to customers faster than ever before. Startups often write their business plans assuming (almost) global availability of their services. Reliability and responsiveness at such scale is required and ultimately achievable only by using distributed systems at some level of the architecture or infrastructure. In my opinion, distributed systems are the only viable way to handle the increasing computing and data storage requirements of applications supporting multiple geographical markets in a cost-effective way.

Unfortunately, not all the systems currently being built are about ML, AI and robotics. Such research projects, and their applications, require substantial funding, which can be generated by "less" scientifically and technologically challenging businesses such as e-commerce. A great example of this is Amazon, whose platform enabled the company to become one of the leading cloud solution providers and innovators.

E-commerce: a solved problem?!

So these days, developing, maintaining and improving systems that belong to the e-commerce ecosystem does not sound very interesting, and many software engineers wouldn't see it as challenging at all. You can often hear phrases such as "E-commerce is a solved problem" or "There is nothing interesting about e-commerce systems". Why would someone develop their own product when there are so many SaaS solutions on the market? In addition, there are many companies providing off-the-shelf products that solve either e-commerce as a whole or just a small part of the whole chain of operations. So eventually you can integrate multiple solutions together and get support for the whole e-commerce ecosystem. Commercial products often offer open source community editions as well, but they are mainly focused on the e-shop, customer relationship management and simplistic product management implementations. This will get you going if you are not trying to build a big technology company where one of the main goals is to gain a competitive advantage.

So it is a solved problem if we consider only the moment when the transaction of buying a product from a seller happens, but there is much more to it. The so-called e-shop part of the system which customers face is just the tip of the iceberg. In the broader view, an effective and successful e-commerce ecosystem with multi-market aspirations needs to be integrated with many more or less independent systems which allow the customers to receive what they have ordered and paid for in a reasonable time. The complexity comes when you start to think about it in the context of distributed systems and a global market. Another important trait is the ability to innovate and evolve quickly, with agile development processes and daily releases. That is really difficult to achieve with a couple of monolithic systems integrated together, and it will definitely be difficult to compete with the other players. In recent years the e-commerce market has been increasingly dominated by the big players who compete over the global market and deliver packages in different markets across the world. So if we are trying to compete with those players, the task in front of us starts to be a bit more interesting.

Commonly available solutions can be run in a local datacenter or in the cloud, but their scalability, reliability, availability and resiliency properties are limited. Most of these applications focus on one specific area of e-commerce and they usually don't directly deal with the problems of procurement, warehousing and fulfilment of the orders. The activities and systems responsible for this are generally called the supply chain, and they fall under the commerce domain as we don't yet have the means of teleporting stock from one place to another. Supply chain platforms are usually built as big ERP systems, as they can even handle manufacturing processes and logistics. Here the captains of the software industry (such as SAP, IBM and Oracle) come in with their off-the-shelf solutions.

I would agree that the e-shop part and the order lifecycle are not very interesting if you exclude the scalability challenges, but you get into an additional set of problems once you consider the supply chain as well. There is plenty of room for innovation in terms of autonomous systems and optimisation in logistics and warehousing.

The main question here is how to build a global e-commerce platform using distributed systems, as you would end up with multiple warehouses in multiple markets all around the world. Let's start with a brief description of the e-commerce platform to see what bits are missing in the whole process that ends with the courier arriving at your door. We cannot leave out a quick walkthrough of the common architectures and example architectural building blocks which we can use in the architecture of any system. Then we will dive deeper into the supply chain and fulfilment domain with a focus on the user and physical interfaces, uncovering the problems we are going to face and proposing an example solution. All that while repeating the important concepts and practices to be aware of when building distributed systems. We will focus on the integration of different hardware devices which you can expect to be used in a warehouse, but the practices are generally applicable. Hopefully I have now convinced you that this domain can be interesting and that there are problems here you would like to solve.

Warning

From my personal experience, returns and exchange processes are often underestimated, as everyone focuses on the outbound process and forgets to properly define the required workflow. It can seem simple, but there is plenty of information to be tracked in order to process returns and exchanges. There are inspections and quality checks of the goods so that refunds can be approved, and so on. I personally prefer to begin with a semi-automated approach rather than trying to fully automate the process and encode every possible state in the system. Once a shipment leaves the warehouse, there is an incredible number of things that can go wrong. If there is uncertainty about how such a process should work, or it seems over-complicated, it makes sense to involve some more or less manual steps. It often involves communication with distributors and in some cases manufacturers, and if you have millions of products the complexity increases drastically.

What’s behind E-Commerce pages

Let's first define what the basic parts of a retail e-commerce system are and follow up with the supply chain systems. From that we are going to see how broad the domain actually is. We need to understand how the systems in the e-commerce ecosystem interact with each other and what kind of information they exchange. That will help us to understand the functional and non-functional requirements for our warehouse system.

Basic parts of a retail e-commerce system:

  • Product Inventory Management (PIM) - management of product information and picture assets (sometimes prices and stock)

  • Customer Relationship Management (CRM) - customer support, sales tracking, refunds

  • Content Management System (CMS) - usually with e-commerce extensions; maintenance of the product pages, lists, search, checkout flows and my-account pages

  • E-shop application - provides the web pages and the whole customer experience, basket maintenance and payment integration; often part of the CMS

These parts of the system have their own set of problems, as they need to be able to handle a large number of concurrent users on days like Black Friday. The biggest strain is put on the customer-facing services which serve the product information needed for displaying the web pages. The latency of the responses to the users is the most important factor here, so caching on multiple layers is expected. This area is well known if you have been doing e-commerce for some time, and we will focus on caching in another chapter. Our goal here is to go deeper and look at the systems which actually allow physical delivery of the orders.

The increased number of requests affects the rest of the supply chain infrastructure as well as the e-commerce subsystem. But the maximum throughput of a warehouse is limited by the actual physical boundaries of the real world. What I mean by that is that the fulfilment system will be fulfilling the orders the same way as on any other day, and the orders from such an event will eventually be processed over the next couple of days or weeks. Some seasonal workers are usually hired for such events in order to temporarily increase the throughput of the fulfilment centres, but the system itself shouldn't be affected at all by the number of incoming orders if it is designed well. On the other hand, the problems can easily start to manifest themselves when the warehouse system runs as a monolithic application where the resources are shared between all the services required for the warehouse operations. The warehouse should be able to perform its duties completely on its own, regardless of the number of orders coming in. For a well-designed system it should ideally only be a case of scaling up some additional nodes in order to process more order placements concurrently. Ideally our system should also be able to give feedback about its current workload, so the e-shop part of the system can dynamically adjust the estimated delivery times for the goods. That is a very important principle which is generally applicable to the integration of two systems or two nodes of a distributed system. The term for this feedback between producer and consumer is backpressure. Let's uncover more such principles as we go.
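
To make the idea concrete, here is a minimal sketch of backpressure using Akka Streams, assuming Akka 2.6 where an implicit ActorSystem provides the materializer; submitToFulfilment and the numbers are purely illustrative, not part of any real system described here.

[source,scala]
----
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object BackpressureSketch extends App {
  implicit val system: ActorSystem = ActorSystem("orders")
  import system.dispatcher

  // Hypothetical call into a slow downstream fulfilment service.
  def submitToFulfilment(orderId: Int): Future[Int] =
    Future { Thread.sleep(50); orderId }

  // mapAsync limits the number of in-flight calls; demand propagates upstream,
  // so the source is pulled only as fast as the fulfilment stage can keep up.
  Source(1 to 10000)
    .mapAsync(parallelism = 8)(submitToFulfilment)
    .runWith(Sink.foreach(id => println(s"accepted order $id")))
}
----

In a queue-based integration the same effect is achieved with bounded buffers and consumer-driven polling; the essential point is that the consumer, not the producer, dictates the pace.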

In this chapter we will look at a possible implementation of a set of microservices which together provide the functionality necessary for a warehouse management system (WMS). Such a system should be able to provide an independent API, ideally offered as a service on its own, so it can even be used for supplying retail stores and other third-party vendors for different kinds of e-commerce platforms. We will see how such a system can benefit from the use of microservice architecture and what the caveats are.

E-Commerce Retail Ecosystem

Let's define the e-commerce ecosystem as the set of all systems which need to be integrated together in order to provide the full customer experience. We can split it into smaller sets which map not only onto different teams and business units but could even be provided by whole specialized companies.

  • Customer platform - Provides the direct interaction with the customers. Its main purpose is the web and mobile applications which provide the e-shop implementation where customers can browse products and make purchases. The after-sales support such as customer care, returns, refunds and exchanges belongs to this platform as well, along with recommendation and search services and services which provide direct communication with the customers. Payment integrations ensure that the orders are paid for before they are handed over to the fulfilment centres. Notification services maintain communication and marketing channels for the customers. There is usually some sort of CMS where product pages and campaigns can be easily maintained. In short, all the parts previously mentioned as the basic parts of the retail e-commerce system, with the exception of product management: for a really big ecosystem the static product data and its maintenance must be solved completely separately.

  • Product platform - The set of applications responsible for the maintenance and enrichment of all the product data. The heart of such platforms is a Product Inventory Management system which provides most of this functionality out of the box. This is the master data which will be used by all other platforms to get important information about the different products and to maintain their classification and all their assets. In case different seller channels are provided, this platform is also responsible for maintaining the seller-specific product information.

  • Fulfilment platform - The simple view of the responsibilities of the fulfilment platform is to receive stock into a warehouse and dispatch stock out of the warehouse to another warehouse, customer, distributor or manufacturer. The most important part is tracking the inventory across different locations inside a warehouse and, of course, tracking the expiry dates of perishable goods. For that it needs to provide applications which support and track the work of the operators inside the warehouse. That means it needs to provide interfaces which can run on different devices such as computers, handhelds and mobile devices, and it even has to control various hardware devices. The order fulfilment itself consists of splitting the orders between the available warehouses and deciding which products can be packed together based on their dimensions and weight. It is also responsible for assigning the orders to the pickers in a way that ensures the orders are picked and packed in time so they can be handed over to logistics for delivery. It must be able to deal with cancellations and changes of orders as well as missing and damaged stock at every step of the process, and it needs to handle transfers of stock between warehouses.

  • Logistics platform - A system for management of the available means of transportation such as cars, trucks and mopeds, and their drivers. Management of the available delivery and pickup options and the capacity for those options. It provides additional services such as collecting money on delivery, inspection of returned goods, collection and disposal of old appliances, and installation of the delivered goods. It plans delivery routes, maintains hubs from where additional goods can be delivered, provides tracking information and execution of deliveries, and handles the return of undelivered orders and delivery retries.

  • Seller (Market) platform - The set of systems which gives third parties the possibility to sell their stock as part of the e-commerce platform, so they can use the consumer platform and the supply chain facilities in order to sell their stock. Those systems are responsible for setting the prices of the seller goods and monitoring the available stock in the warehouse. They provide the capability to plan deliveries of stock from sellers to warehouses and the collection of unsold stock. They also deal with the maintenance of the unique assets of the products sold as seller stock and provide integration with the product platform.

  • Analytics platform - Provides the insight and the important reports for planning. It collects metrics and events from all other platforms so it can help managers create better strategic plans and improve the working processes in all the other platforms. For example, it can provide the throughput of a fulfilment centre and additional input for the models which can then be used for more precise estimation of delivery times.

  • Finance platform - Deals with invoices between the suppliers, sellers and other interested parties, and with reconciliation of the received stock. It is responsible for tracking the prices of the different products using methods such as moving average landed cost (a small sketch of such a calculation follows this list). The pricing usually depends on the tracking capabilities of the warehouse. It deals with seller and consignment stock, which is in both cases stock that is in the warehouse but belongs to someone else. In the case of consignment stock, the stock is paid for only when it is sold. In the case of seller stock there are only fees for holding the stock inside the warehouse and fulfilling the orders with such stock. It would be responsible for all the accounting and eventually even for the salaries of the employees. It can even provide payment gateway capabilities.
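
As an illustration of the kind of calculation involved, here is a small sketch of a weighted (moving) average cost recomputed when new stock is received. The names are made up, and a real landed cost would also include freight, duties and similar charges.

[source,scala]
----
// Sketch: moving average cost recalculated on every receipt of stock.
final case class CostPosition(quantity: Long, averageUnitCost: BigDecimal)

def receive(current: CostPosition, receivedQty: Long, landedUnitCost: BigDecimal): CostPosition = {
  val totalQty   = current.quantity + receivedQty
  val totalValue = current.averageUnitCost * current.quantity + landedUnitCost * receivedQty
  CostPosition(totalQty, totalValue / totalQty)
}

// e.g. 100 units at 10.00 on hand, 50 more received at 13.00 -> new average 11.00
----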

There are of course additional systems which can be part of the whole platform and which together form the company's ERP system, such as procurement (ordering stock from suppliers and distributors). Unfortunately it is not possible to describe all those systems in one chapter, so we will focus on the fulfilment / warehouse platform, which sits in the middle and interacts directly or indirectly with all the mentioned systems. As the main focus of this chapter is physical interfaces, the fulfilment platform is the most interesting one for this field.

Is E-Commerce Boring?

The greatest example of all is Amazon. Its global e-commerce ecosystem shows that e-commerce itself was one of the main drivers behind the creation of a big technology company. Many innovations came from Amazon, such as their data centres, which were built in order to support their own platform. Believe it or not, the e-commerce market is still evolving and more and more projects and platforms are being built, not only by startups but even by retail-chain companies like Walmart. Another example is Alibaba in China. The fight over the global market has already started, and if you plan to compete you simply cannot rely on off-the-shelf solutions. The main reason for developing your own platform is the fact that you can design the whole architecture according to the Reactive Manifesto. This gets even more interesting given that the main goal is to target multiple markets and ultimately the whole world.

There is a huge space for automation and innovation, but we will mainly focus on the software part. One example is Amazon's autonomous delivery using drones, which is currently only being tested; another is Ocado, a grocery retailer who built its own platform with a fully automated warehousing system operated by robots. Ideally the warehousing software should be designed in a way that allows the human operators to be easily replaced by robots. That means a warehousing system must maintain much more information about the inventory and the environment in which it is stored. Currently even the physical warehouse must be built in a way that allows the robots to operate smoothly, so it is still more expensive to build a fully automated warehouse than to employ humans.

Those platforms have been built using microservices and a reactive architecture approach. Other good examples of companies who are using microservices, some of which even compete in e-commerce, are Netflix, eBay, Groupon, Zalando and Uber. So it seems that if you think about challenging those companies, the adoption of the reactive principles and microservice architecture is inevitable.

Another interesting part of this domain are the optimisation problems, which are not related to the architecture of the system but which you will need to deal with (a small packing sketch follows the list):

  • Knapsack problem - packing the packages, utilisation of the warehouse space

  • Shortest paths - picking paths, optimising the fulfilment of the orders and reducing processing times

  • Scheduling problems - for the docks inside warehouses with incoming shipments

  • Recommender subsystems

  • Prediction of demand

  • A/B testing in a live environment - testing the effectiveness of new workflows based on newer models

  • Scheduling the picking

  • Picking wave generation

  • Machine learning - training models based on the picking effectiveness in order to achieve better results
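
As a taste of the first item on the list, here is a small sketch of the classic first-fit decreasing heuristic applied to packing order items into parcels by volume. All names are illustrative, and a real packer would also consider weight, fragility and carton dimensions.

[source,scala]
----
final case class Item(sku: String, volume: Double)

final class Parcel(val capacity: Double) {
  val items = scala.collection.mutable.ArrayBuffer.empty[Item]
  def used: Double = items.map(_.volume).sum
  def fits(i: Item): Boolean = used + i.volume <= capacity
}

// First-fit decreasing: sort items by descending volume, put each item into the
// first parcel it fits, and open a new parcel when none fits.
def packParcels(items: Seq[Item], parcelCapacity: Double): Seq[Parcel] = {
  val parcels = scala.collection.mutable.ArrayBuffer.empty[Parcel]
  for (item <- items.sortBy(-_.volume)) {
    parcels.find(_.fits(item)).getOrElse {
      val p = new Parcel(parcelCapacity); parcels += p; p
    }.items += item
  }
  parcels.toSeq
}
----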

3.1 Distributed vs. Localised

Most of the available solutions are built using the classical N-tier architecture, which has served us well for a long time and still has its place in many use cases. In my experience, individual microservices are most of the time still designed using this approach, just with a reduced number of layers. The 3-tier architecture, as its name suggests, separates the system into 3 layers:

  1. Presentation

  2. Logic

  3. Database

Figure 3-1. N-tier Architecture

The presentation layer is the part of the application which provides the GUI to the user, typically a web server serving the web pages. The logic layer holds the business logic, which is then fed into the presentation layer and uses the third layer for the persistence of any important data. If we want to stay in the e-commerce space, this is mainly data about customers, orders and products. The logic layer in such applications is implemented as a so-called monolith. All the functionality of the whole application is divided into separate internal components, but everything runs as one executable within one process. There is no communication over the network except for databases, caches, messaging and external systems. A lot has already been written about the downsides of such applications. One of the main issues is that they heavily depend on one single database, and because of that they are difficult to scale. That can be solved by using a NoSQL database which provides horizontal, linear scaling, but that brings the additional complexity of updating your model and queries to work with a NoSQL store. The benefits of using NoSQL databases are in performance, scalability and replication of the data, which takes us into the realm of distributed systems.

Tip

Serverless architecture is now emerging and is on offer from multiple cloud providers. The most common one is AWS Lambda, and there are solutions such as Google Cloud Functions and Azure Functions. Those products represent the smallest possible architecture of a microservice: basically a simple function which executes some business logic or a transformation on given inputs and provides some outputs, either in a request/response fashion or asynchronously. It can connect to other services and it is incredibly cheap, plus there are zero maintenance requirements. The main differences are in the supported languages and the available computing power. One of the drawbacks is that an SLA is not currently available.

Microservices have emerged as a natural evolution of Service Oriented Architecture and the DevOps principles: splitting the monolithic layers into groups of independent services which communicate through a common API. Ideally that provides more scalability and resiliency for the system as a whole, but it also helps with the development lifecycle and provides more flexibility. Distinct services can be developed by different teams or people, which not only allows scaling of the system itself but also improves the speed of continuous delivery. Because of that, it is a good choice for a platform which must be available across different regions and is maintained and developed by teams working in different timezones. So the way forward is to build the platform in a way which satisfies all the requirements for a truly reactive system: it must be Responsive, Resilient, Elastic and Message Driven. This has to hold for the whole platform as well as for its parts, as described in [Link to Come].

These principles are not new. In the fin-tech industry, systems have relied heavily on asynchronous messaging for a long time. Multiple systems are connected together using messaging technologies such as an Enterprise Service Bus (ESB) and persistent queues. Systems work as pipelines, processing messages as input and emitting messages as output. That doesn't mean a reactive system can't be built using synchronous RPC calls and RESTful interfaces for the integration between multiple services. The systems are connected on multiple levels, and on a specific level they work together to provide an external API. They are connected by message brokers whose responsibility is to route messages to the corresponding topics or into systems on a different level. You can imagine it as an Akka actor hierarchy, but with the actors representing whole systems in this case. (Akka is a framework for building distributed, message-driven applications: http://akka.io)

Figure 3-2. Actor Hierarchy vs. System Hierarchy

The first diagram shows a simple hierarchy of actors used for performing different workflows inside a warehouse. Parent actors represent types of workflows and aggregate information about these workflows. Each child represents an unfinished workflow.

The second diagram describes a similar approach to architecture, where whole systems / microservices communicate with each other using messages. These messages are routed between different queues or topics depending on the chosen technology. Within those services an internal Akka hierarchy can exist.
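
A minimal sketch of the workflow hierarchy described above, using classic Akka actors; the workflow type, the message names and the single picking step are all illustrative.

[source,scala]
----
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Messages - illustrative only.
final case class StartWorkflow(id: String)
final case class WorkflowFinished(id: String)

// One child actor per in-flight workflow instance.
class PickingWorkflow(id: String) extends Actor {
  def receive: Receive = {
    case WorkflowFinished(_) => context.stop(self)
    // ... the individual picking steps would be handled here ...
  }
}

// Parent actor: represents a workflow type, creates a child per unfinished
// workflow and keeps aggregate information about them.
class PickingSupervisor extends Actor {
  private var inFlight = Map.empty[String, ActorRef]

  def receive: Receive = {
    case StartWorkflow(id) =>
      val child = context.actorOf(Props(new PickingWorkflow(id)), s"picking-$id")
      inFlight += id -> child
    case msg @ WorkflowFinished(id) =>
      inFlight.get(id).foreach(_ ! msg)
      inFlight -= id
  }
}

object WarehouseApp extends App {
  val system  = ActorSystem("warehouse")
  val picking = system.actorOf(Props[PickingSupervisor](), "picking")
  picking ! StartWorkflow("order-42")
}
----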

Reactive and Distributed

In order to provide scalability, resiliency and availability across multiple regions we will end up building a distributed system ourselves and relying on other distributed systems such as databases, queues and cluster managers. It is definitely worth mentioning the age-old fallacies of distributed computing, which are very often overlooked by software engineers.

[Link to Come]

So node failures and network partitions happen, and that is a fact which needs to be dealt with. A reactive system needs to handle this and recover from failure. In order to provide resiliency and availability, the system needs to have redundant nodes and replicas of the data stores which serve as failover. Only that way can such a system provide high availability. Those requirements force us into building a distributed system. Communication over the network is inevitable, as one single machine can fail and can only scale vertically.

One of the most important principles for understanding distributed systems is Brewer's theorem, more commonly called the CAP theorem. The name is derived from the first letters of the three main properties of a distributed data store. The theorem says that a distributed data store can guarantee at most 2 of those 3 properties at the same time. The properties are as follows:

  • Consistency - the system behaves as if there is a single copy of the data, meaning that the latest write is immediately available for reading in all parts of the system

  • Availability - the system is able to successfully serve requests without errors

  • Partition Tolerance - parts of the system can sustain the fact that they cannot communicate with each other (broken network connections, dropped messages) and are able to recover from such a state

In order to maintain consistency the system must be able to communicate with all the replicas of the data in the system. So if we want a strongly consistent system and a network partition happens, the nodes on one side of the partition need to be removed and writes cannot be accepted there, as that would cause inconsistency in the system. That is one possible approach to achieving consistency while sacrificing availability. This behaviour can be fine-tuned and it heavily depends on how the partitions are handled.

This must be kept in mind when designing the whole system, so that you can make the right decisions about which properties of the system are the most important. Of course, when designing actual systems there is more to consider; moreover, the theorem is based on a single register and is a bit simplistic, as described in [brewer2012]. In a complex system one event can affect multiple services and trigger multiple subsequent events. Availability is commonly exchanged for high availability, which means that the system may be unavailable, but only for short periods of time, for example until it detects that a leader is gone and a new one is re-elected. During that time writes can't be accepted, but the rest of the system can still be operational with degraded functionality.

Message Driven

As another main trait of reactive systems is being message driven, we went for a message queue which is used for publishing events into different topics; services can subscribe to those topics if they need specific information. This provides fully asynchronous communication and helps to decouple the services.

The isolation of the microservices is very important. The main goal is to have each microservice operate as independently as possible, unless the microservice is a process manager (in the DDD terminology) and orchestrates multiple microservices in order to achieve a complex process. The isolation gives us plenty of benefits. It allows us to run multiple instances of a microservice so we can scale and be more resilient. Another benefit is that the development of each service can easily be parallelised, and if it is done correctly it is easy to test. The ideal is that other services can still operate if one of the services goes down, but that is only realistic for services which are not part of the critical path. There are usually some services that are crucial for the others, and if they go down you need to wait until they are back. That sounds like a single point of failure, but such services will be designed according to the Reactive Manifesto, so they shouldn't really go down completely.

The other extreme scenario is when a Kafka or Cassandra cluster gets destroyed, for example because you are running them on AWS spot instances. In an instant you lose the services, and from there we are entering the realm of disaster recovery, which you can read more about in the infrastructure chapter.

Message Delivery Guarantees

This is very important when modelling the protocol which is used for the communication between the microservices. No one wants to build a system where messages get lost and operations have to be retried; in the worst case it would cause inconsistency between the communicating systems. The common guarantees are listed below (a small de-duplication sketch follows the list):

  • At Most Once - The message is sent only once, so it can be delivered at most once. If it gets lost on the way because of a network partition, or the processing of the message fails for some reason, the message won't be delivered again. This actually means there is no guarantee that sending the message will result in a successful delivery.

  • Exactly Once - The message is delivered to the recipient exactly once. In practice this is usually approximated by at-least-once delivery combined with de-duplication or idempotent processing on the consumer side.

  • At Least Once - The same message can be delivered multiple times. The message can still be lost on the way, but if that happens the delivery is retried until it is acknowledged.
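
The following sketch shows the consumer side of an at-least-once setup, de-duplicating on a message id; the message shape is made up, and the set of processed ids would of course have to be durable in a real system.

[source,scala]
----
// At-least-once delivery means the same message may arrive more than once,
// so processing is made idempotent by de-duplicating on a message id.
final case class MovementMessage(messageId: String, sku: String, qty: Int,
                                 from: String, to: String)

class IdempotentMovementHandler(applyMovement: MovementMessage => Unit) {
  // In-memory here for brevity; a real system would persist this set.
  private val processedIds = scala.collection.mutable.Set.empty[String]

  def handle(msg: MovementMessage): Unit =
    if (processedIds.add(msg.messageId)) applyMovement(msg)
    // else: duplicate redelivery - safe to acknowledge and drop
}
----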

Distributed Monolith

There are two types of dependencies between microservices. The first is the shared code and libraries used; the second is the actual runtime dependencies, where one service can't operate without another service. The first one is what the term Distributed Monolith refers to.

The term refers to a common anti-pattern caused by coupling microservices through binary dependencies, as described in this talk by Ben Christensen: https://www.microservices.com/talks/dont-build-a-distributed-monolith/.

One of the best practices in software development is the DRY principle: Don't Repeat Yourself. Creating libraries that share code and models is generally good, and it is what we have been taught to do as software engineers. But it can be problematic when practising the microservice architecture. It doesn't necessarily mean that you can't use any libraries or factor some common code out, but it must be done in a way that doesn't affect the ability to evolve the services independently. In a microservice architecture some duplication of code between microservices is acceptable. If you want to, you can publish really small, precisely focused libraries which help you speed up the development of a new microservice.

One example of such a small library could be helper methods for writing messages to and reading messages from Kafka in a way that provides all the necessary metadata for correct serialization and deserialization using the chosen format, which can be JSON, Protobuf, Avro or others. That gives the developers a substantial speed-up as they reuse the same pattern. It also brings us to thinking about how to define the protocols and communication between the microservices, as we need to maintain the contracts.
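
A sketch of what such a small shared library might look like, using the plain Kafka producer client and a naive JSON envelope; the Envelope type and the hand-rolled serialization are illustrative only, and a real implementation would plug in a proper JSON / Avro / Protobuf serializer.

[source,scala]
----
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Minimal envelope carrying the metadata needed to deserialize the payload later.
final case class Envelope(eventType: String, schemaVersion: Int, payloadJson: String)

object EventPublisher {
  // Naive serialization for the sketch; it does not escape the payload.
  def serialize(e: Envelope): String =
    s"""{"eventType":"${e.eventType}","schemaVersion":${e.schemaVersion},"payload":${e.payloadJson}}"""

  def producer(bootstrapServers: String): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    new KafkaProducer[String, String](props)
  }

  def publish(p: KafkaProducer[String, String], topic: String, key: String, e: Envelope): Unit =
    p.send(new ProducerRecord[String, String](topic, key, serialize(e)))
}
----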

One of the arguments I have heard lately is that any microservice architecture is a distributed monolith because the communication between the services works as RPC (remote procedure calls). Almost complete isolation is desirable but not always fully achievable. Even with such dependencies between services, splitting a monolith into multiple parts gives you benefits: parallel development and better means of scalability for each single service. The more dependencies there are, the more complications there are for deployment, maintenance and development. An important drawback is that runtime dependencies make integration testing more difficult, as mentioned before.

Service Communications

The communication mechanism used between the services needs to be standardized so the services can exchange information in an effective manner. Ideally one way of communication is used between the majority of the services. One of the main selling points of the microservice architecture is the encouragement of independent development and the freedom to choose the internal implementation and technologies, as long as all the functional and non-functional requirements are met. There are services which need to use multiple types of integration: the services on the boundaries which need to provide a user interface or APIs for external services. They are consumed by third parties, so a generally well-known approach needs to be used in order to ease the integration; these days the most common one is REST / JSON over HTTP. The common methods used for integrating services together are as follows:

  • Synchronous REST services

  • Synchronous SOAP services

  • Synchronous JSON / XML over HTTP

  • Synchronous RPC (gRPC, JSON-RPC)

  • Asynchronous - queue / topic based, via message brokers

The message definitions, or the content of the synchronous calls, need to be defined as part of the public API. They must be well documented so they can be shared between multiple services and teams, so that the format of the commands is known as well as the format of the events which are emitted in response to the commands. Once the application is split into multiple independent components, the shared code and model definitions cannot be maintained the same way we are used to. It is desirable that the commands and events are described using a schema definition language and published so they are easily accessible for consumption by other services. This has been done before as part of the Web Services specification using the SOAP protocol: the WSDL exposed by a service provides a full description of all the available endpoints and the definitions of the messages and their values. Current object serialization libraries, which were built to solve a different problem of reliably and quickly serializing data, also use schemas for the description of the objects, and they have implementations for different languages, which supports the microservice approach as you don't want to limit yourself to one specific language. Those serialization libraries are designed for use within distributed systems, as their main purpose is to encode state transferred over the network. They are used for event sourcing as well, since the serialized state can be easily and quickly stored in any database without the need to change a database schema. For use in distributed systems they provide additional features such as built-in support for schema evolution.

Schema versioning and evolution is another interesting topic in distributed systems and microservice architectures. The most important thing to consider is backward and forward compatibility. Forward compatibility means that a service is able to extract the required information from a message and ignore the bits it does not expect, and this behaviour must be completely independent of when the service was compiled. Backward compatibility means that a service can't remove existing endpoints or change the format of the response / event messages. It takes into account the fact that there are multiple instances of one service: for example, in order to achieve zero-downtime deployments there are instances of a service containing new features which expect new fields to be included in the messages, but also older instances of the same service which must still be able to process the new format of the message.

So the main rule for message evolution is that you can only add fields and never remove them (a small sketch of such an additive change follows below). This can be relaxed even further if we consider the fact that many of the services communicate only internally within a bigger system, so the situation is slightly different from versioning a public API consumed by third parties. For a public API, versioning makes sense if you want to improve your API and provide more features to your customers. You shouldn't really ever break the API and force your customers to make changes if they are satisfied with the features. Making a breaking change to a microservice which would then require a coordinated release of additional services is, in my mind, equal to releasing a new service, or even worse. Versioning makes sense in terms of the release process and distinguishing different versions of the service which contain different features, but a true microservice shouldn't really have that many versions which radically change its API.
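
A tiny sketch of such an additive, compatible change: version 2 of an event only adds an optional field with a default, so old messages still decode and old consumers simply ignore the new field. The event and field names are made up.

[source,scala]
----
// Version 1 of the event was:
//   final case class ItemPicked(orderId: String, sku: String, qty: Int)
// Version 2 only adds an optional field with a default value, which keeps the
// change backward and forward compatible for JSON-style payloads.
final case class ItemPicked(orderId: String, sku: String, qty: Int,
                            pickedBy: Option[String] = None)
----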

3.2 High Level System Design

Let's first quickly summarise the main responsibilities of the system which we are building:

  • Inventory Tracking

  • Outgoing Shipments (customer orders, transfers between warehouses, recalled stock)

  • Incoming Shipments Processing (market, consignment, retail stock)

  • Reverse Logistics (returns, exchanges, undelivered orders)

  • Product Information Management

Each of those responsibilities is represented by actual physical movements of inventory, done either by operators or by automated systems. The processing of the movements is the core functionality of the whole system, as the movement events are the main source of information for the tracking. There are multiple properties of the inventory which must be tracked; it is not only the overall amount of stock and its location. In order to provide an API for third parties which can be integrated into their platforms, the system needs to know who the owner of a specific piece of stock is, because the same products can be sold for different prices by different sellers. There are multiple ways stock can get into the warehouse, with different rules for who must pay for it. One example of the finance aspect and reconciliation is when the warehouse receives stock owned by the operator for its own use; it then depends on the contract between the supplier and the warehouse operator how it is paid for, because the stock needs to be inbounded and checked to confirm that the counts and quality match the ordered amount. Another case is consignment stock, which still belongs to the supplier and is paid for only when it is sold to the end customer or returned. In order to support the after-sales processes, the date when the stock was inbounded must be recorded. That is important when supplier or manufacturer warranties apply, and the length of storage must be tracked as well. For expensive products such as electronics, computers and smartphones which have unique identifiers, those identifiers need to be tracked too in order to prevent fraud. Various other things come into the picture when there are perishable goods, for which you need to track use-by / expiry dates and deal with environmental conditions such as temperature and humidity. Product information such as dimensions and weight is crucial in order to optimise storage and make decisions about how to split and deliver the shipments to the end customers.

In a logistics centre the products keep moving between locations constantly, and for each movement the system needs to know: what is being moved, the quantity, the unique identifier if it applies, the state of the product, the product classification, to whom it belongs, where it is being moved, when the movement happened, and who has done it (a user id or robot identifier). A lot of concurrent operations are happening which need to be logged and processed continuously in real time. The real-time stock levels are what allow the fulfilment of the orders. On top of the movement information, further metrics are built which allow estimation of processing times and help even for items which are not currently in stock. That allows better predictions of the available delivery times of the orders. In order to gather all this data, many devices and sensors need to be used as entry points for data input into the system. The devices described in this chapter represent only a fraction of the possibilities which can be used in a logistics centre, but the principles are generally applicable.
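
Put together, a single movement record could look roughly like the following sketch; the field names and types are illustrative only.

[source,scala]
----
import java.time.Instant

// Sketch of the information listed above for a single inventory movement.
final case class InventoryMovement(
  sku: String,                  // what is being moved (classification lives in the product master data)
  quantity: Int,
  serialNumber: Option[String], // unique identifier, when it applies
  condition: String,            // state of the product, e.g. "sellable", "damaged"
  owner: String,                // retailer, seller or consignment supplier
  fromLocation: String,
  toLocation: String,           // where it is being moved
  occurredAt: Instant,          // when the movement happened
  actorId: String               // user id or robot identifier of who has done it
)
----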

Move It!

The initial focus is on an implementation for a single logistics centre, omitting the additional complexity of supporting multiple warehouses integrated together. There are definitely more ways to solve this problem and we will get to them in the last part of this chapter. Building a system for one warehouse is difficult enough, and it is a good example for discussing and demonstrating the principles of reactive applications and microservice architecture.

The classic approach is to build one system on top of one big RDBMS database which is backed up and replicated to provide resiliency. It has been done before and there are off-the-shelf solutions available. Such applications are usually built as monoliths: one big fat application internally divided into components and connected to the database. The architecture corresponds to the N-tier architecture mentioned previously. You can certainly achieve responsiveness, as that is one of the traits provided by disciplined programming practices and baked into the internals of the implementation itself. What I mean by that is that the components are part of the same process, so specific errors can be reported quickly. The application needs to always respond in a reasonable time, even if it reports a failure. Of course the components within the application should be as independent as possible and they shouldn't cascade errors to other components, effectively taking down the whole system. This holds for both microservice / SOA and monolithic applications.

Resiliency is harder to achieve with the monolithic approach, as one of the main requirements for it is the ability to replicate. With only one instance of the application, resiliency goes out of the window. Resiliency can be achieved by having hot / cold swaps of the application ready and running on different machines, so we can handle common hardware failures. This brings additional costs and maintenance overhead, as appropriate infrastructure such as load balancers must be put in place. The system should still be able to respond even under error conditions, for example when a database or one of the external integrations is unavailable but not essential for its function. If it can still provide a subset of its functionality, it is better if it stays up.

Elasticity is pretty difficult to achieve in the case of monolithic applications. Multiple instances of the same application can run in parallel, which is good for resiliency as well, but it adds complexity in terms of load balancing the requests across those nodes, and it gets even more difficult where there is state which needs to be replicated as well. Even if the instances are completely stateless, an RDBMS will become a bottleneck at some point. The database itself can still scale vertically, but that is not something you would like to do dynamically at runtime.

The monolithic approach also doesn't give us the flexibility and extensibility we need, as development is not very fast. In such big systems, where optimisation is involved, it is important to test the processes and measure the throughput of the warehouse. It is not uncommon that one of the requirements is that the system supports switching between different algorithms for pick-path optimization and order batch generation. Think of the Strategy pattern from Java applied not only to algorithms but to whole workflows as well.

If you assume that the warehouses which the system must support can be bigger than what one RDBMS instance can handle, the use of distributed NoSQL databases and microservices is the next step in order to build a reactive system. It adds a great deal of complexity, as the system must deal with consistency and message delivery guarantees correctly. It is not that there is a need to build databases from scratch, but it is important to design the system while knowing the caveats and implement a system which behaves deterministically under boundary conditions. Another possible solution would be to partition the warehouse, let's say based on areas, use a dedicated database for each of the areas, and aggregate values across those partitions. That would also result in building our own distributed system, as we would need to ensure consistency between the aggregated values and the values in each partition.

Our system has almost equal requirements for read and write throughput. The amount of stock in different places changes constantly, which means a substantial number of writes is required. The reads consist of queries for the purpose of inventory checks and order fulfilment. It is essential to have real-time knowledge of whether there is enough stock to fulfil orders or whether specific stock is running out. The operators must be able to identify the association of order items with specific locations within their process. Another set of queries must provide various aggregated values for specific stock, grouped by supplier, expiry date, lot or location. The tracked information can be distributed into multiple subsystems, as you don't always need to know immediately where the products from a specific supplier are located. There are different ways to approach the problem of tracking the inventory, each giving a different set of benefits and problems.

Even with an SQL database and a fully transactional system, it can't be guaranteed that the database will precisely reflect the state of the real world. Operators are going to make mistakes, such as putting the wrong number of items into the wrong place. That is going to produce discrepancies between the real world and the state of the system. It is a fact that there are going to be lost, damaged and misplaced items regardless of how the system is implemented. Repair mechanisms and processes need to be put in place in order to make the state of the system consistent with the physical amounts of stock. That is necessary regardless of the approach chosen.

Looking at the tracking system purely in terms of satisfying the main functional requirement, knowing which stock is where, it is just a data store which contains information about the stock levels. There is business logic involved in terms of which items can be put in which location based on their properties such as value, weight and so on, but that is not important now. We have multiple choices for how to build it, based on how we store the data and which type of datastore we choose.

First we need to choose an appropriate datastore, which depends on the way the data is going to be stored. That differs for each microservice, and we have already touched on the possible solutions. The system can either be built in the classic CRUD way, storing the current state and mutating it in place, or using an event store, storing each event as a single record.

Datastore types:

  1. SQL Database

Pros: consistency guarantees, support for transactions, support for joins and easily accessible aggregated values

Cons: not easily scalable, single point of failure

  2. NoSQL Database

NoSQL databases differ in terms of how they operate - whether they are schema-less or purely key-value based.

Pros: resiliency, high availability, scalability

Cons: eventual consistency, complex maintenance, limited query / aggregation / join operations, no support for transactions

Ways of storing the data:

  1. Classic Create, Read, Update, Delete (CRUD) - rows in tables

Pros: simpler schema evolution, direct mapping between the current state and the objects

Cons: more difficult testing, as we deal with mutable state

  2. Event Sourced (ES) - events stored as a journal

Pros: append-only (only CREATE and READ), improved performance, historical state, fast reads

Cons: development overhead when aggregating and querying the data, complex schema evolution, more data stored, the need for snapshots and / or a data retention policy if we don't assume infinite storage for the journal, slower writes

Now we can combine these two approaches and we get 4 options: CRUD + SQL, CRUD + NoSQL, ES + SQL, ES + NoSQL.
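
To make the ES option concrete, here is a minimal sketch in which movements are appended as events and the current stock level of a location is derived by folding over its journal; the event names are illustrative and a real implementation would add snapshots.

[source,scala]
----
// Events appended to a location's journal.
sealed trait LocationEvent
final case class StockAdded(sku: String, qty: Int)   extends LocationEvent
final case class StockRemoved(sku: String, qty: Int) extends LocationEvent

// Current state is never mutated in place; it is derived from the events.
final case class LocationState(stock: Map[String, Int] = Map.empty) {
  def apply(e: LocationEvent): LocationState = e match {
    case StockAdded(sku, qty)   => copy(stock.updated(sku, stock.getOrElse(sku, 0) + qty))
    case StockRemoved(sku, qty) => copy(stock.updated(sku, stock.getOrElse(sku, 0) - qty))
  }
}

def replay(journal: Seq[LocationEvent]): LocationState =
  journal.foldLeft(LocationState())(_ apply _)
----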

The main problem at hand is how to ensure the atomicity and consistency of the movement operations and their isolation. An SQL database really looks like a reasonable choice, but as mentioned before it has its own limitations in terms of the throughput and the amount of data it can handle. So it is really a question of scale. The most difficult part is actually estimating the required performance and deciding whether a traditional SQL database will be able to handle the load on Black Friday or around the Christmas period. That doesn't mean it is completely unreasonable to use an SQL database if you know that the scale is limited. The scalability can be addressed by using the CRUD approach with an appropriate NoSQL datastore which provides the scalability, such as Apache Cassandra. In that case transactions can't be relied on anymore, but again it is a question of tradeoffs, and the system can be fine-tuned to allow for a small percentage of conflicts.

One of the requirements for the system is the need for historical data, so the causes of discrepancies can be investigated and resolved. Auditing of the operations done by specific users of the system is another desirable piece of functionality. It is definitely possible to implement an audit trail as part of the business logic, or using triggers and stored procedures in the database, but with ES you get all this information as part of the architectural design. So everything starts to point towards the ES + NoSQL solution.

Additional complexity in the movement processing is added by the generation of picking waves, when an operator or robot must go and pick an item in order to fulfil an order. A possible solution for that could be a locking mechanism for the items in a specific location. That is one way to prevent a scenario where two operators would go to pick the same item from the same location when there is only one. Ideally the approach used for picking the stock for the orders should be independent of the tracking system, but the tracking system is the main source of the data.

Imagine the following scenario: an operator is moving an item from one location to another because his manager wants some popular stock moved closer to the packing stations in order to reduce processing times. At the same time, an order which contains the same item is being scheduled, and the scheduling algorithm decides to block the item in the same location. There is a race condition between the system trying to block the item in that specific location and the fact that the last item in that location is being moved by the operator.

A decision needs to be made about how restrictive the system should be in terms of allowing the operations:

  • the system is always right - before updating the database, the system always checks that the source location contains the object being moved and adds it to the destination atomically within a transaction; the operator is requesting an operation

  • the operator is always right - the system simply accepts the movement with its quantities and no validation that the locations contain the stock is made; the operator is telling the system that a specific event has happened

The first approach seems ideal, but it requires transactions and possibly additional reads before writes. There will be many such operations happening across the whole warehouse, so the database would need to process a lot of these requests. The biggest drawback is that this approach is synchronous in nature: the operator needs to wait until the movement is confirmed. That is certainly achievable, but it can become problematic if there are multiple services involved in the operation, because they add latency and the system uses more resources.

Let's not get into the details of how the counts are maintained; either there must be support for conditional writes in the database itself, or the current state must be read and checked to confirm that the required number of items is available. With transactions all is good. If the blocking system wins, the operator gets the information that the item has been blocked, so he needs to put it back. If the operator's request wins, his movement is successful and the blocking algorithm gets an error back and needs to pick another location or reschedule the fulfilment of that order. That is how an RDBMS-based system would solve this (a sketch of the conditional-write variant follows below).
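
A sketch of the conditional-write variant of the "system is always right" approach, without full transactions: the move only succeeds if the source location still holds the expected quantity, in the spirit of the compare-and-set operations many datastores offer. The LocationStore interface is hypothetical.

[source,scala]
----
trait LocationStore {
  def quantity(location: String, sku: String): Int
  // Succeeds only if the stored quantity still equals `expected`.
  def compareAndSet(location: String, sku: String, expected: Int, updated: Int): Boolean
  def add(location: String, sku: String, qty: Int): Unit
}

def tryMove(store: LocationStore, sku: String, qty: Int, from: String, to: String): Boolean = {
  val available = store.quantity(from, sku)
  if (available < qty) false                                                    // reject: stock not recorded at source
  else if (!store.compareAndSet(from, sku, available, available - qty)) false   // lost the race, caller retries
  else { store.add(to, sku, qty); true }                                        // destination increment can be a blind write
}
----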

The use of transactions and an RDBMS is a well-known and straightforward approach; it would ensure strong consistency of the system and would make the implementation of a reservation mechanism for order items simpler.

Using the same approach with a weaker consistency model and without transactions, the same behaviour is still achievable if the reads and writes are sequentialised and not applied concurrently from different nodes, as they only apply to the state of a single entity.

The second approach allows for higher throughput and for distributing the load across multiple instances, assuming that the operations are sent as events, stored, and the state of the database becomes eventually consistent. It means the movements are processed completely asynchronously in a fire-and-forget manner. This doesn't rule out the use of an SQL database, but it is a bit pointless, as we are trying to find a solution with better support for scalability and resilience. The ordering of the events matters in this case, as a blocking operation can fail when the stock is physically at the location but the system hasn't processed the movement yet. One of the assumptions is that only one operator can interact with a specific location at a time. That suggests transactions wouldn't be required, as the state of a given location cannot be modified concurrently.

In my mind, the priority for the logistics centre is to process the orders it has already received, without any external dependency, even if new orders can't be received and some parts of the system are down. Availability here is crucial, as the workers need to go and make the movements. The system should still perform validations and display warnings about missing stock, because otherwise the operators will create even more discrepancies. This can be achieved by using an independent and resilient queue which is responsible for the durability of the movements and ensures delivery and processing of the messages. That adds additional latency, but if the system doesn't wait for the response it does not matter. It means the operations inside the warehouse can still progress as long as the movements are being recorded, even if they are not necessarily processed immediately. That assumes some graceful degradation of the service if the tracking system goes down.

If the tracking system or other systems required for validation, such as the static data about the locations and products moved, stop responding because they have become unavailable, a reactive service should be able to resolve the situation. By degrading the consistency requirement, the process could be allowed to skip the validation completely. That still lets the operators pick the orders and ship them out of the warehouse. Once the tracking system comes back up, the movements are processed and it should become consistent with the real state of the warehouse. This assumes that the system will recover and the movements will be processed - scheduling of new orders should be stopped in the meantime - but it is still better than fully stopping the operations in the warehouse, and it gives the troubled system time to recover.

A third approach starts to form: the system checks the current state at the point of the operation but still proceeds with the movement, immediately indicating that there is something wrong with the state of that specific location so that another process can be scheduled to resolve the discrepancy. This minimises the possibility of a wrong movement while increasing the overall throughput of the system.
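
A minimal sketch of that policy could look as follows, assuming hypothetical MovementStore, StockView and DiscrepancyScheduler abstractions: the movement is always recorded, and a failed validation only produces a warning and schedules a stock check.

import java.util.Optional;

/** Movement reported by an operator; the field names are illustrative. */
record StockMovement(String operatorId, String sku, int quantity,
                     String fromLocation, String toLocation) {}

/** Durable, append-only store for reported movements (assumed). */
interface MovementStore { void append(StockMovement movement); }

/** Possibly stale read model of the tracked stock (assumed). */
interface StockView { int quantityAt(String location, String sku); }

/** Schedules a physical stock check for a location (assumed). */
interface DiscrepancyScheduler { void scheduleCheck(String location, String sku); }

final class MovementHandler {

    private final MovementStore movements;
    private final StockView stockView;
    private final DiscrepancyScheduler discrepancies;

    MovementHandler(MovementStore movements, StockView stockView,
                    DiscrepancyScheduler discrepancies) {
        this.movements = movements;
        this.stockView = stockView;
        this.discrepancies = discrepancies;
    }

    /** The operator is always right: record first, then warn instead of rejecting. */
    Optional<String> handle(StockMovement movement) {
        movements.append(movement);

        int tracked = stockView.quantityAt(movement.fromLocation(), movement.sku());
        if (tracked < movement.quantity()) {
            discrepancies.scheduleCheck(movement.fromLocation(), movement.sku());
            return Optional.of("Tracked quantity at " + movement.fromLocation()
                    + " is lower than expected; a stock check has been scheduled.");
        }
        return Optional.empty();
    }
}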

So the main requirement is that the movements can happen all the time and the work of the operators is not blocked by the system - the system must be as available as possible. The system should also prevent failures of order fulfilment caused by missing or misplaced stock - it must provide consistency. Looking at those requirements, it seems that consistency and availability are the defining traits of the system in terms of the CAP theorem. Just as important is the performance of the system in terms of how many concurrent movements it can handle.

By choosing the NoSQL approach with multiple instances, partition tolerance can't be ignored. Even if we assume that the instances of the logistics system run on site and don't require an internet connection for inventory tracking, or that the system is collocated within one network of a data centre in one specific availability zone, a partition is highly unlikely, but if it happens we need to deal with it. The simplest approach would be to shut the whole system down and stop the operations when a partition is detected - that would be similar to losing an RDBMS, we would completely sacrifice availability, and I wouldn't call such a system partition tolerant. It is neither reactive nor resilient. It needs to be evaluated how often such a situation can happen. Unfortunately, the likelihood of such things happening is proportional to the current load of the system, so it usually breaks when you need the system the most. In the case of a partition, it would be better to use a strategy which shuts down the minority of the nodes, preventing the movements from happening on that part of the system in order to retain overall consistency.

Using technologies such as Akka Persistence and Akka Cluster Sharding, the state of an arbitrarily big warehouse can be loaded into memory, which ensures very quick processing times. This has its own drawbacks: in case of a node failure the benefit of having the data in memory disappears, and operations over the affected locations become slower until the entities are recovered. Another benefit is that it easily supports Event Sourcing. The Akka split brain resolver provides the necessary tools for handling a network partition. Two independent clusters can be formed and the system would retain full availability, but the same entities would be created on each side of the partition and the state of those entities could become corrupted. That is not as bad as it may seem, as the state of the locations can get corrupted during daily operations anyway, so there must already be processes for correcting such a corrupted state as long as it recovers into a working one. The possibility of adding more nodes provides the required elasticity.

The most important reactive traits of a warehousing system are resiliency and responsiveness. As long as all the movements are tracked correctly and any failure is well communicated to the operators of the warehouse so they can react properly, everything is fine. This suggests that the elasticity of the inventory tracking system is less important. That is true to some extent, if you expect that the physical warehouse itself is bounded in terms of how many locations it can contain and that there is only a limited number of workers who can perform movements. That can obviously change if the people are replaced by robots and the throughput of the whole system increases rapidly.

The main reason for choosing the reactive and microservice-oriented architecture path in this case is the ease of changing and extending the system with new functionality, as well as resiliency. It's much easier to redesign different parts of the system if they are not part of one monolithic application and don't share the same database / schema / key space, and it is possible to scale the development much better. On the other hand, it brings a lot of complexity and overhead, so it is very important to put in place processes for communication and overall technical oversight of the whole platform. Standardization of the development and CI/CD processes is a must, as is forming a team of site reliability engineers and architects who set patterns and evaluate the architectures of new microservices to see whether they satisfy the requirements of a reactive system. How to set up such a structure and put those mechanisms in place is described in [fowler2017].

From my personal experience I would suggest aiming for reasonably sized services. One can say that starting with microservices from the very beginning is a premature optimization, as the knowledge of the domain and of the potential choke points in the system can be limited at that stage. It is also not obvious how to extract functionality effectively without actually losing performance or some guarantees, given the integration overhead. As the system grows and evolves and the domain and technical understanding improves, the services can be split further to become truly micro. That is why the presence of domain experts from the beginning of such projects is very important, as well as spending enough time on planning and designing the system. Equally important is the definition of reusable patterns to be used across the platform, especially for the communication between the services. The more expertise you have, the better you can split the system, as you will see where the potential benefits are to be gained.

Dividing the Domain

The process of dividing the warehousing domain into subdomains and designing the system accordingly has been done many times already. There are a lot of domain experts, as this is one of the most commonly solved problems; software engineers have been building these services since the beginnings of computing. It is very important to understand how the system works together as a whole. The services and teams should be as independent as possible, but they still need to exchange information with each other, so some sort of indirect dependency is always there.

The use of domain-driven design seems like a logical choice. It is a good match, as a big system with many responsibilities is being modelled. DDD techniques such as Event Storming are extremely helpful for correctly describing the responsibilities of each bounded context and subsequently the services it consists of. It combines well with an Event Sourced architecture and libraries such as Akka Persistence which allow the events to be persisted easily. DDD's ubiquitous language provides a recipe in the form of commands and events, which can be encoded directly into the programming language or schema of choice. Obviously, reaching the final state is not instant - it is actually pretty time consuming - but it is well worth it. Software engineers in general want to get their hands dirty with writing code quickly and underestimate the value of time spent in the inception of a project. It is necessary to go through multiple iterations of this process and be prepared for changes in the requirements, as they always happen. Even the business owners need to understand that discussion is required to find the compromises and the best solutions.
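
As a small illustration of encoding the ubiquitous language directly into the language of choice, the commands and events for an order could be sketched as plain Java types. The names and fields here are illustrative, not a fixed schema.

import java.time.Instant;
import java.util.List;

// commands express the intent spoken in the business language...
sealed interface OrderCommand permits PlaceOrder, CancelOrder {}
record PlaceOrder(String orderId, List<String> orderLineIds) implements OrderCommand {}
record CancelOrder(String orderId, String reason) implements OrderCommand {}

// ...and events record the facts those commands produced
sealed interface OrderEvent permits OrderPlaced, OrderCancelled {}
record OrderPlaced(String orderId, List<String> orderLineIds, Instant at) implements OrderEvent {}
record OrderCancelled(String orderId, String reason, Instant at) implements OrderEvent {}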

The core domain splits into multiple bounded contexts:

  • orders

  • fulfilment

  • inventory tracking

  • product management

  • inbound workflows

  • outbound workflows

Supporting domains:

  • users and authorization

  • reporting tools

  • management tools for monitoring the workers

architecture 04
Figure 3-2. Naive Bounded Context Split

Initially the microservices themselves map directly to the different bounded contexts, but as the development goes further, more functionality gets extracted into separate services.

Everything inside the warehouse is about stock and its movements. Let's mention the set of entities which can span multiple bounded contexts and have their own representation in each: locations, orders and order lines (Sales, Procurement, Return, Exchange, Transfer etc.), products, delivery addresses, unique instances of stock and obviously the users who do the work as operators or access the system for various purposes.

There are additional entities which the system must maintain to support the fulfilment of the orders, such as order batches, which represent groups of orders. Ideally those batches should be generated in real time, on demand, when the operators request a new unit of work. That helps with the prioritisation and optimisation of the warehouse throughput. The priority comes from the service level agreements for the promised delivery times when a third party logistics provider is involved, so the batches are generated from the highest priority orders at a given point in time. The batching approach allows for better optimisation in terms of which locations an order will be picked from and whether it will be picked by multiple operators and merged. It is more effective to split the orders down to the order line level. This representation of inventory is then grouped based on where the inventory is located inside the warehouse and the available picking areas. The resulting entity is called a pick list and defines a chunk of fulfillable work for one worker; a sketch of how such pick lists could be generated follows. The system also needs to provide a mechanism for reserving the stock for orders which need to be fulfilled and for maintaining overall stock inventory levels.
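
A minimal sketch of such pick list generation could look like this, assuming an already prioritised list of order lines, a known picking area per line and an arbitrary chunk size per operator; all of these are illustrative assumptions.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

record OrderLine(String orderId, String sku, int quantity, String pickingArea) {}
record PickList(String pickingArea, List<OrderLine> lines) {}

final class PickListGenerator {

    private static final int MAX_LINES_PER_PICK_LIST = 20; // assumed operator capacity

    /** Groups prioritised order lines by picking area and cuts them into pick lists. */
    static List<PickList> generate(List<OrderLine> prioritisedLines) {
        Map<String, List<OrderLine>> byArea = new LinkedHashMap<>();
        for (OrderLine line : prioritisedLines) {
            byArea.computeIfAbsent(line.pickingArea(), area -> new ArrayList<>()).add(line);
        }

        List<PickList> pickLists = new ArrayList<>();
        for (Map.Entry<String, List<OrderLine>> entry : byArea.entrySet()) {
            List<OrderLine> lines = entry.getValue();
            for (int i = 0; i < lines.size(); i += MAX_LINES_PER_PICK_LIST) {
                int end = Math.min(i + MAX_LINES_PER_PICK_LIST, lines.size());
                pickLists.add(new PickList(entry.getKey(), List.copyOf(lines.subList(i, end))));
            }
        }
        return pickLists;
    }
}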

So we can see that there is more going on and we can add more services, which are shown in Figure 3-3.

architecture 05
Figure 3-3. Additional Services

I recommend reading the books by Eric Evans [evans2003] and Vaughn Vernon [vaughn2013], which explain the concepts and techniques of domain-driven design in great detail. They contain plenty of good practices which are generally applicable to software development and can be used extensively when building a microservice architecture as well as modularized monolithic applications.

Types of Services

One of the well-known good practices in software development is the single responsibility principle, which says that a specific function or method should have one well-defined task to perform. That is also one of the main principles for building reactive systems, as it applies not only to functions and components but to the microservice level as well. Each microservice should do one thing and should do it well. That is mainly relevant from the business point of view, as the different microservices cover different parts of the business functionality and work together; what I mean is that a service usually has to be more than a single function in order to provide some business value. From the architectural point of view we have different options for how to build the services, and we can classify those options in multiple ways. One way is based on how they deal with state, which divides them into two well-known groups. It is not the only way to divide microservices - we can again look at DDD and the concepts of Aggregates, Entities, Process Managers etc. to distinguish different types of services even further. And lastly, we can classify the microservices by their general responsibility within the whole platform.

Stateless Microservices

Think of these services as a processing pipeline: you don't need to care where the request came from, you just need to respond to that specific request with the results of a computation. They may need to make a call to another service such as a database or an external API. Such services just encapsulate specific business logic, so they don't store any session information and can be used, for example, for solving computation-intensive optimisation problems such as scheduling trucks for next month given the current schedule and the set of trucks to be scheduled (assuming that the problem is solvable on a single machine).

They are really easy to scale, as there is no state specific to a single instance of the service, so we can dynamically spin up more instances and put a load balancer in front of them. Obviously it is not as simple as that, as there is some operational overhead required in terms of instance management. In order to scale dynamically, the nodes need to be registered dynamically with the load balancers, and they need to be able to discover all the services they depend on. This works well for resiliency too, because the nodes can die at any time; running multiple instances in parallel protects the system from the loss of nodes. The services need to report their health to the load balancers and cluster managers so these can decide whether a node should be killed and replaced by a new one when some sort of failure has happened.

architecture 02
Figure 3-4. Stateless Microservice

The figure shows an example of such a stateless node, which itself is designed using an N-tier architecture.

Stateful Microservices

Stateful microservices are mainly, but not only, the data stores which provide persistence for other microservices, and different kinds of caches. They are used by both stateless and stateful services for storing data in an asynchronous way. The architecture diagram can actually look exactly the same as for a stateless service; the only difference is that the stateful service holds some information in memory which is specific to that instance. In that case the information needs to be replicated across all the instances of the service, or the infrastructure needs to ensure that the requests end up on the correct instance.

Simply speaking, the response of a stateful service to a command, and the events it emits, depend on the previous state. Because of the internal state it is more difficult to scale them. If you create multiple nodes of a stateful service, the requests will get different responses depending on the state of the specific instance they hit. So the approach taken with the stateless services is not usable, as the load balancer must send the requests to the same node. That is still solvable using, for example, sticky sessions, but if you have a complicated structure of microservices and more of them are stateful, it is very difficult to propagate the stickiness to all of them. Without the stickiness there would be inconsistencies and errors, as the requests would hit different nodes at random. Another problem with state held on the nodes is that when a node goes down you lose all the data specific to that node. That can be temporary results of computations or session-specific data for a user, forcing them to re-login.

We definitely need to run multiple instances of the same service in order to achieve resiliency. So stateful services need additional mechanisms around them which ensure the routing of the requests, or ensure the replication / partitioning of the data, in order to work correctly. If those services need to exchange information at runtime, they become clustered services - a distributed application in its own right.

A way of removing session state from a service is to introduce a caching mechanism such as Redis or Memcached which stores the information so that it can be accessed from the stateless nodes, although whether this works depends on the nature of the information which needs to be shared.

architecture 03
Figure 3-5. Distributed Services

The figure shows the difference between the architecture of a stateless and a stateful microservice.

One reason to use a clustered service is when the goal is to build a highly performing service which holds its state loaded in memory and scales easily, as new nodes can be added or removed on the fly. Think of persistent Akka actors which can run on different nodes at the same time. The system needs to prevent actors for the same entity from running on different nodes. We can only prevent that from happening if the nodes exchange information about their current state and know which node should be processing which request.

The solution to this problem is partitioning the data; the commonly used term is sharding. Sharding is a method for distributing data across multiple machines. Given a set of n nodes, we are able to split the data across them evenly using a hashing function and the number of active nodes in the cluster. In the case of actors, these can be FSMs which represent, for example, a Sales Order and its state. The state is already loaded in memory, which provides faster processing times, as the events don't need to be re-read from the database each time a command is received. By using this approach the system is able to route commands to the instance which is currently responsible for the specific data, and this can change dynamically as the total count of nodes changes.
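
The core of the idea can be sketched in a few lines: a stable hash of the entity identifier decides which shard, and therefore which node, owns the entity. Real frameworks such as Akka Cluster Sharding additionally rebalance shards when nodes join or leave; this sketch only shows the routing decision.

final class ShardRouter {

    private final int numberOfShards;

    ShardRouter(int numberOfShards) {
        this.numberOfShards = numberOfShards;
    }

    /** Every command for the same entity id always lands on the same shard. */
    int shardFor(String entityId) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(entityId.hashCode(), numberOfShards);
    }
}

Calling shardFor("sales-order-42") always routes commands for that order to the same shard; the number of shards is usually fixed and higher than the number of nodes, so it is shards, not individual entities, that get rebalanced.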

Another good use case for sharding / request stickiness is when total ordering of the events is required and you can't allow concurrent updates of the database from multiple sources, because that would cause data inconsistency and race conditions. A sharded system provides a point of serialisation of the updates for commands coming from different sources. Processing writes and updates can be problematic when we are setting the values of counters, as the current value can constantly change and processing those updates concurrently can corrupt the values. With both SQL and NoSQL databases you can't simply allow updates of the same data coming from different sources at the same time; it is very easy to break the ordering and consistency guarantees through race conditions. Imagine that the system stores the commands for moving items in and out of a location and the database updates mutate a counter holding the total amount. If the write itself contains the total value and the movements are processed in parallel, it can easily happen that the total count ends up inconsistent.

Another problem manifests itself when you can't use an SQL database and you need to perform mutations over multiple rows / entities in one operation. An example of this is the movement of an item from one location to another, as the state of both locations needs to be updated atomically. Distributed transactions are really difficult to implement and they are not recognised as a good practice, but we don't need a distributed transaction. One way to solve it is to use a stateful process which ensures that both operations are performed - it wouldn't provide atomicity and the system would be eventually consistent. Even better would be to use Event Sourcing in this case and only store the fact that the movement has happened; the state of the locations would then be computed as the sum of all the movements which affected that location. Think of a Persistence Query in terms of Akka Persistence. That would also be eventually consistent, but much more scalable and resilient.
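
A sketch of such a read model could look like this: the stock at a location is computed as a fold over all the movements which touched it, instead of mutating a shared counter. The event shape is an assumption made for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

record ItemsMoved(String fromLocation, String toLocation, String sku, int quantity) {}

final class LocationProjection {

    /** Quantity of each SKU currently tracked at the given location. */
    static Map<String, Integer> stockAt(String locationId, List<ItemsMoved> movements) {
        Map<String, Integer> stock = new HashMap<>();
        for (ItemsMoved movement : movements) {
            if (locationId.equals(movement.toLocation())) {
                stock.merge(movement.sku(), movement.quantity(), Integer::sum);
            }
            if (locationId.equals(movement.fromLocation())) {
                stock.merge(movement.sku(), -movement.quantity(), Integer::sum);
            }
        }
        return stock;
    }
}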

It is important that the commands (updates) in the system have an opposite command which reverts the state of an entity back to the state before the original command was applied. With that in place we can implement a process manager, a mechanism which replaces distributed transactions. A saga orchestrates a set of commands which need to be applied, and in case one of them fails it is responsible for sending the reverse commands to all interested parties in order to roll back the whole action.
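
A very reduced sketch of such a saga could look as follows, assuming every step exposes both its command and its compensating command; a real process manager would of course persist its progress and send the commands asynchronously.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

interface SagaStep {
    void execute() throws Exception;   // e.g. reserve stock
    void compensate();                 // e.g. release the stock reservation
}

final class Saga {

    /** Runs the steps in order; on failure, compensates the completed ones in reverse order. */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (Exception failure) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }
}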

Clustered services bring another set of complex problems to deal with. It is difficult to handle deployments, automatic scaling and partition tolerance of the cluster; that needs to be supported at both the implementation and the infrastructure level. By infrastructure level I mean service discovery solutions which are able to provide the addresses of the other nodes in the cluster to newly started nodes, and which manage deployments of newer versions by draining the requests from the old nodes and slowly shifting them onto the newly deployed ones. On the other hand, if done right, it provides additional benefits such as the possibility of zero-downtime deployments.

Entity Management Services

These are services responsible for a specific entity or group of related entities which provide the master data for the system. Such a service can be implemented as a simple CRUD (Create, Read, Update, Delete) application with some additional functionality like validation and support for different queries and filters.

An example of an EMS could be the management of the product information stored inside the warehouse. The information held about a product is only a subset of the information required for displaying the product on an e-shop site: weight, packaged dimensions, a short description, a picture and the classification of the product. Another example could be the management of the locations inside a warehouse and their definitions. Generally speaking, these services are relatively simple in terms of the logic involved; their main purpose is to provide maintenance capabilities for the static data requested by other services, or to store data for auditing or reporting purposes. All kinds of systems need to maintain static data.

The data input can be provided manually through a user interface, by an automated process in batches / files, or in real time by other services using a programmatic API. Most commonly, entity management services provide a thin layer of validation on top of a datastore. They can also provide some sort of visualisation of the managed data, for example a layout of the warehouse based on the location definitions.

An important factor to consider is the model of the data which needs to be stored. Is it dynamic, does it change often, what sort of queries must it support? An SQL schema isn't always the best fit.

An SQL database is the best choice if the service needs to support different types of queries and the data model is well defined and doesn't vary or change very often. Now that many SQL databases support JSON fields, it can be a good fit even when the data has a dynamic nature.

It obviously depends on the amount of data which needs to be stored - if it is limited, everything is fine.

If high availability is required, read replicas can be added. That is going to increase the replication latency and, because of that, the possibility of inconsistencies for short periods of time. If such latency is not a problem for the requirements at hand, this is definitely a way to go. In case high read throughput is required, an additional caching layer can be added, but that again increases the effective replication latency.

The latency is fine when the system is updating a new version of thumbnail images - if an old thumbnail is displayed for a while, that is OK. A counterexample is the evaluation of access to a specific resource, like renting a movie on Amazon Video, where the information must be available as fast as possible.

In the case of huge amounts of data, like audit logs which need to remain quickly accessible, or when the amount of static data is bigger than a single SQL database can handle, a NoSQL datastore must be used. Another benefit, as mentioned before, is its schemaless nature; the usual drawback is the limited support for different kinds of queries.

The architecture of these services is mostly a 3-tier architecture, depending on what kind of APIs are required. A second option would be an event sourced version of such a service if there is business logic involved, for example for orders, which have a more complicated lifecycle and multiple states and can consist of further underlying entities such as order lines. Such services simply accept commands and update the state of a specific entity. There can be business logic involved in the sense that some of the commands are allowed only if the entity is in a specific state. They are usually implemented as finite state machines and they emit events which can be consumed by other services, but at that point they start to become orchestration services, as the produced events can cause further operations to happen.
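
As a small sketch, an order lifecycle modelled as a finite state machine where a command is only accepted in certain states could look like this; the states and transitions are illustrative only.

enum OrderState { ACCEPTED, PICKING, PACKED, SHIPPED, CANCELLED }

final class OrderFsm {

    private OrderState state = OrderState.ACCEPTED;

    OrderState state() { return state; }

    void startPicking() {
        require(state == OrderState.ACCEPTED, "picking can only start for accepted orders");
        state = OrderState.PICKING;
    }

    void cancel() {
        require(state == OrderState.ACCEPTED || state == OrderState.PICKING,
                "orders that are already packed can no longer be cancelled");
        state = OrderState.CANCELLED;
    }

    private static void require(boolean condition, String message) {
        if (!condition) {
            throw new IllegalStateException(message);
        }
    }
}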

The tracking system itself is an example of such a service: there is only a small amount of business logic, but the amount of data and the required throughput are high.

Adapter Services

Another type of service often used in different kinds of architectures is the adapter service. As the name suggests, adapters live at the boundaries of our core system and are responsible for abstracting away the different implementations of the APIs of external services. Such services are also used for integration with the physical interfaces such as printers, conveyor belts, scanners and other hardware devices available within the warehouse over WiFi or Bluetooth, using TCP/IP or WebSockets in order to provide a bi-directional communication channel to the different hardware devices.

Internally they should just be processing requests to external services, possibly pushing events to or pulling them from a queue and calling external APIs. The fact that these services are placed in the cloud and don't have direct access to the warehouse local network, where some of the controlling systems live, brings additional challenges. There must be a way of opening connections from within the warehouse and a way of keeping such connections open in order to push commands to the hardware devices. These services should be completely stateless, without a need for access to any persistent storage. At most they might need to be associated with a data store which contains static data required for the mapping between the external and internal representation.

An example can be an external logistics provider which generates its own identifiers and those need to be mapped to the ids in our system. These ids should be included directly as part of the entity model and stored there, so the adapter itself doesn't need to keep any state. It can contain some business logic for mapping different types of data from a generic model to a specific request. Another example can be an insurance aggregator where there are multiple providers, each with a different API.

The adapter service architecture is a good use case for serverless / lambda functions. They just take some input and produce output - they can be one-way or two-way, only sending requests to or receiving them from the external systems, or both. The lambda functions can contain the business validation rules and possibly an integration with a data store when static data is needed for the mapping.
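
A reduced sketch of such an adapter as an AWS Lambda handler could look as follows; the event shape, the provider payload and the returned tracking identifier are assumptions made purely for illustration.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class CarrierBookingAdapter
        implements RequestHandler<CarrierBookingAdapter.ShipmentRequested, String> {

    /** Internal event pushed to the adapter (assumed shape). */
    public static class ShipmentRequested {
        public String orderId;
        public String deliveryAddress;
        public double weightKg;
    }

    @Override
    public String handleRequest(ShipmentRequested event, Context context) {
        // map the internal model to the provider's payload; the actual HTTP call to the
        // provider's API would happen here and return the provider's tracking identifier
        String providerPayload = String.format(
                "{\"reference\":\"%s\",\"address\":\"%s\",\"weight\":%.2f}",
                event.orderId, event.deliveryAddress, event.weightKg);
        context.getLogger().log("booking shipment: " + providerPayload);
        return "assumed-provider-tracking-id";
    }
}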

Tip

If you have a Java background, Apache Camel is a good example of a technology for integration with multiple services. The adapter services are one of the main places to use it, as in an adapter service you need to deal with different types of endpoints and technologies your platform is not standardized on. It is supported by Akka. If your technology of choice is [Link to Come], there is a similar project which provides connectors to different types of databases and queues (AWS Lambda) called [Link to Come].

Orchestrations Services

This is another group of services, which need to use the master data from other microservices (mainly the previously mentioned EMSes) in order to perform more complex operations. These services may need to trigger further actions on the external services represented by the adapter services and may need to wait for responses from other services - allowing small logical blocks to be composed together. These services provide the logic and orchestrate multiple services together. An example use of an orchestration service is a complex workflow inside the warehouse:

Sample part of an inbound workflow:

  • Box of stock arrives in the warehouse

  • Purchase Order Lookup - identification of the items inside the order

  • Pallet is associated with a specific tracking number inside the inventory tracking - a call to the inventory tracking system

  • Pallet is moved to a receive workstation

  • Operator performs quality checks and moves the stock into a designated area

So these services must maintain an internal state in order to track the different stages of the performed workflow for reporting purposes and to provide the functionality to stop and resume the workflows based on circumstances. Implementation-wise, they need to process different commands, validate them against the current state and update the state of the process. They are logically implemented as FSMs with a state for each stage in the workflow and are, in fact, specialised versions of DDD's Process Managers. Unlike classic database-transaction-style Sagas, the point of these workflows is not to provide reverse operations in order to safely roll back a set of operations which must be processed together successfully. Their main goal is to instruct the operators / robots what to do and to track the current progress of the operations. This kind of workflow can even be fully implemented as a client application which only orchestrates the different API calls. With a powerful API available, such workflows can be built easily, which makes the system very flexible for experiments and optimisation, as the different steps can be changed and moved around easily.

If we go to the extreme and say that the operators inside the warehouse act as services which are part of the platform, where the interaction with them happens through the GUI, then there is not much difference from the process managers which orchestrate processes such as the evaluation of an order placement. Order placement consists of a set of operations which need to be performed - check that the stock is available, place the order, decrease the available stock, reserve a logistics delivery and pass the order forward in the fulfilment pipeline. Any of those operations can legitimately fail - there is not enough stock, a delivery slot is missing because one of the delivery vans broke down, or there is a busy period and the warehouse operators can't cope with the load so it wouldn't be possible to deliver in time. In this case the process doesn't always need to be rolled back automatically by cancelling the customer order and refunding the money.

Even though full automation is possible when handled carefully, an order can stay in the accepted state waiting to be fulfilled - new stock needs to be procured or there is a delivery coming the next day - and the order needs to be converted into a backorder. In that case events must be emitted in order to inform the customer / sales team that the order is going to be delayed, which may eventually lead to the customer cancelling the order. This is actually a good real-life example of how the services should communicate with each other: the services should be honest and communicate errors, delays and availability to their callers within the architecture so the requesting service can deal with it.

Another important piece of functionality in such services would be timers, which can automatically trigger events or cause the FSM to wait in a state only for a specific amount of time. An example can be the reservation of a seat on a plane, which you can do before you pay; the seat is then reserved for a given period of time so the checkout process can be finished.
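
An in-memory sketch of such a timer is shown below; a persistent FSM would rather use a durable timer so the expiration survives a restart, but the idea is the same. The names and the expiration command are illustrative assumptions.

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class ReservationTimer {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Schedules an expiration command; cancel the returned future when payment completes. */
    ScheduledFuture<?> holdSeat(String reservationId, Duration hold,
                                Consumer<String> expireReservation) {
        return scheduler.schedule(() -> expireReservation.accept(reservationId),
                hold.toMillis(), TimeUnit.MILLISECONDS);
    }
}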

These services fall into the stateful category because of the need to retain state. It is possible to implement them on top of an SQL database using transactions, but it gets more complicated when a NoSQL database is used and transactions are not available. Let's leave out the usual discussion about the required throughput, latency and amount of data. Here, more than ever, it is important to process the events in the order in which they happen, because some of the events are no longer valid once the state of the process changes. So if the events are processed out of order, it can happen that neither of them is deemed valid and both are rejected - effectively losing the information that something has happened.

For a warehouse workflow where there is one instance per specific workflow performed by one operator, it is simple to preserve the order of the events, as the probability of introducing race conditions is pretty low given that the operations must physically happen.

These services can overlap with the previously defined entity services in terms of the functionality provided to the system, as the maintenance of some of the entities requires state. The order itself is a good example of that. It doesn't mean that all the state related to one unique order needs to be maintained by one system; microservices can have their own representations which better fit their needs.

A generic implementation of such a service is provided by AWS under the name Simple Workflow Service, where workflows are organised as sets of different tasks which can be performed by either programs or people - https://aws.amazon.com/swf/

3.3 User Interfaces

The WMS requires different types of user interfaces for the various tools that support the processes performed by the personnel. They must be considered an equally important part of the whole system as the APIs and backend services. The architecture is mostly concerned with the backend services and the frontend part tends to get overlooked, so let's at least summarise some of the important things about the user interfaces and user-related operations in this short section.

The main goal is to create a working system which supports processes done by people, and from experience we as software engineers and technology lovers often forget that. Depending on the technology choices, we may even be forced to create dedicated services whose sole purpose is to serve user interfaces. The different types of UIs and services involved require a mechanism for central authentication, which results in adding more microservices. These new services simplify securing the access to the interfaces and affect the whole architecture.

There are multiple types of UIs required across the WMS, so let's list some of them:

  • Dashboards (Throughput, Monitoring Workforce)

  • Inventory Management

  • Location Management

  • Order Management

  • Product Management

  • Reports - Inventory Checks

  • Workflows performed at workstations using computers

  • Workflows performed across the warehouse on the go using handheld devices

We can split them into two main categories: workflows for the operators, which the standard workforce will have access to, and the back office tools, which will be available to the managers and trained specialised teams dealing with resolving stock discrepancies, reassigning tasks, checking damaged stock and other actions.

In a classic monolithic application the user interface is implemented as one of the layers in the multi-tier architecture and separated into a dedicated subproject or repository. Another option is to provide a UI directly as part of a microservice in order to expose its functionality and make it accessible through a web UI, so it can be tested easily and provide a nice way of configuring and maintaining the service. Web applications can be microservices of their own, as they can be deployed independently and consume the APIs provided by the other services. In case a microservice maintains a specific entity and a set of operations on top of that entity, it can be desirable to provide a lightweight UI which gives the possibility to manually update and create the entity data for management purposes. The benefit of doing so is that the service becomes more usable and testable by less technical members of the team such as project managers. It also provides additional possibilities for the operations teams, so they can deal with misconfigurations of some static data without understanding the chosen datastore technology and changing the data there. The web pages can be served directly alongside the service's API itself.

On the other hand, this gives us less flexibility, as the UI is tied to the whole deployable unit and small UI changes require a complete build, packaging and deployment of a new version of the service. That can be OK in some cases but problematic in others. So the rule of thumb would be to always split the user interface application completely from the microservice and communicate through APIs. Such an architecture allows the UI changes to be deployed independently, although you can gain some short-term benefits in terms of improved development velocity if the UI is coupled together with the backend service. In any case I would recommend that the UIs are developed by the team which is responsible for the service, because that team has a complete understanding of the service and so can implement the UI correctly. If the microservice requires some sort of management / back office UI and the design requirements are low, it is good to leave it to the same team.

Let's put the security of the APIs aside for a while and assume that the API Gateway secures the access to the system's public API. The services within the platform can communicate freely, as they live within a VPC and are not accessible directly from the outside world.

The applications need to support a range of devices such as workstation computers, handheld devices, mobile devices and tablets. As we are implementing back office applications which will not be used by the customers, it makes sense to simplify them as much as possible, so the visual design and UX of the tools used by the operators is not the priority. One of the good things is that the types of devices and their OS versions are limited and standardised, as this is enforced by the IT department; it also makes the maintenance of the workstation computers and all the handheld devices easier. However, that doesn't mean that substantial development time doesn't need to be spent on those applications. The important thing to keep in mind is that we are creating tools which must support the required operations. If you go for an off-the-shelf solution, the software itself will have the processes already defined and will dictate what the operators should do. The correct approach is that the software should support what the operators need to do.

Web Applications

When developing web applications as part of a system built on top of the Reactive Manifesto principles, those applications need to be responsive too; otherwise we would throw away the benefits which come mainly from the responsiveness and asynchronicity of the services. That means the web application UIs need to leverage JavaScript and asynchronous calls in order to provide the best user experience. It can be done whether we use server side rendering, single page applications or component-oriented web applications. Ideally you want to employ a mechanism for asynchronous updates flowing directly into the web applications. Our system is already built as message-driven, and those messages and emitted events can be directly transformed into notifications and pushed to the clients. We have multiple options for achieving that. All the following approaches start with the client requesting a web page over HTTP and loading the JavaScript contained on that page.

AJAX Polling / Long Polling

  • The JavaScript on the web page is loaded and asynchronously requests data from the server at regular intervals.

  • The server responds to each AJAX request immediately and the JavaScript updates the page.

  • Long polling keeps the request open until it times out or the server responds with updated data. Immediately after a request finishes, the JavaScript creates a new request which waits for further updates.

Server-Sent Events (SSE)

  • the JavaScript on the web page opens an event stream connection

  • the server sends events to the client whenever new data is available

WebSocket

  • the JavaScript on the web page opens a full-duplex (two-way) TCP connection

  • the server can send data to the client over that connection, and the client can send data to the server

Any of these solutions would do, but the best one is WebSockets, as it provides real-time full-duplex communication, so we can even use that connection for pushing information back to the connected microservice.

Single Page Applications

Single page applications take a different approach than the good old web applications. Only one classic web page is served from a web server to the client - hence the name SPA. That single page contains the JavaScript libraries and other required assets, which get loaded into the browser engine, and from then on everything is handled by the browser. All the displayed pages are rendered directly on the client. The browser manipulates the DOM and the URL dynamically, so the behaviour is the same as using a classic web application. This gives a performance boost to the whole user interface, as everything happens locally and the client doesn't need to go back and forth to the server to retrieve the static assets. It still needs to request the business data from the server and incorporate it into the rendered pages, but that can be done asynchronously, giving immediate feedback to the user about what is happening and updating the page as the responses come back from the server.

There are many different JavaScript frameworks which allow the development of SPAs; let's mention the most popular ones: AngularJS, React with Redux, Google's Polymer and Ember.js. The mechanisms for updating the data on the pages are still request / response based, so the techniques described before apply to SPAs as well. Another benefit of SPAs is that they are completely isolated and some of the business logic and state can be contained entirely within the application. The development of such an application can be completely separated from the APIs, and the deployment and releases can be independent as well. The application can simply be served from a lightweight web server such as NGINX and make calls through the API Gateway. That gives us great flexibility, as we can build completely different applications which use the same API. How much business logic is contained in the SPA depends on the APIs. It can use pure data APIs for the management services - in that case the server part can be completely stateless and the application's purpose is only to view and update the data. It is different for the operator workflows, which are composed of a set of actions; there we can implement a finite state machine either on the server side or inside the SPA.

If we want to go to the extreme, there is even the possibility of using the internal caches and web storage (https://www.w3.org/TR/webstorage/), which allows us to run the SPA in an offline mode. That's obviously not good enough for the management type of UIs, but it can be a useful approach for the workflow-related UIs. If the internet connection goes down temporarily, the user can still perform the workflow and the data is pushed once it is done. It requires the workflow logic to be implemented as part of the SPA, but that gives great flexibility. The operator workflows are in many cases really simple, so implementing the workflow logic inside a single page application with only a few endpoints for pushing the results of the workflow could actually make sense and make the system even more resilient, given that the application can also keep a WebSocket connection open to be updated when a relevant change happens.

architecture 06
Figure 3-6. Split into multiple services

Native Applications

The use of native applications comes into play when the web browsers are not able to access some specific hardware; their access to the host computer is limited for obvious security reasons. Native applications are still quite common on handheld devices but can be considered for the workstations as well. One example where the implementation of a native application is required is described in the next section: the handheld needs to connect to a Bluetooth printer and print labels, and that just can't be done from a web application. There are different handheld devices on the market, but in my opinion the operating system of choice is Android. The Android SDK provides the necessary APIs to do any required integration and allows sockets to be opened for direct communication. Let's look at an example of using a Bluetooth socket to send a printer command in a proprietary format:

Example 3-1. Example Printing From Handheld
public Boolean printLpn(Activity activity, String lpn, Boolean qc, BluetoothSocket btSocket) {
    try {
        Log.d(TAG, String.format("[printLpn] lpn: %s; qc: %s", lpn, String.valueOf(qc)));
        OutputStream outStream = btSocket.getOutputStream();
        // the "33" prefix and the trailing newlines are part of the proprietary printer format
        String print = "33" + printerCommand(lpn, qc) + "\n\n\n\n";
        byte[] command = print.getBytes();
        outStream.write(command);
        return true;
    } catch (IOException exception) {
        Log.e(TAG, "[printLpn] failed to send the command to the printer", exception);
        return false;
    }
}

The code which maintains the socket is part of the Android Activity class. This just illustrates the simplicity of communicating through sockets using Java IO; the actually difficult part is identifying the protocol and the correct command.

The reactive principles hold for the native applications as well. It is great when you have RESTful APIs which can handle huge loads of requests, but if you have a poorly written native application which can't communicate asynchronously and receive push notifications, the usability of the system will suffer greatly.

The need for native applications argues heavily for splitting the frontends into independent units treated as microservices, or at least as units of deployment of their own. They are still tied to the available APIs, similarly to SPAs. If there are RESTful APIs available for all the commands and queries, the user interfaces can be developed completely independently. One of the benefits of native applications is that they can hold all the state and workflow logic themselves, only utilising the endpoints provided by multiple services and using local storage. That way they can work very differently from web applications or SPAs, as the state on the handheld is persistent.

One of the main problems with native applications is that they add additional maintenance costs. The distribution of newer versions of the code is required and a whole lot of infrastructure must be set up. For example, in the Android ecosystem the APKs must be published to an artifact repository so they can be downloaded from there, because publishing the applications on Google Play is not desired when the applications are for internal use only. There is the possibility of hosting a private, company-internal Google Play, but whether that is worth it depends on how many applications are being used and developed. This brings additional work to the IT teams who are responsible for updates of the handheld devices and installation of new versions - that is where the mobile web applications have their main benefit over the native ones.

Security And Users

The access to the application and its features needs to be restricted. In a monolith it is simpler to implement authentication and authorisation as it is centralised: the components are coupled within one application, user management is usually part of the application itself, and the security layer is built on top of this component.

Unfortunately, there are multiple services in our architecture which need to be accessed by the users directly or which communicate with each other. For the case of users accessing the web applications, a valid approach is to rely on an API Gateway. It should use one service which provides an authentication mechanism and another service which provides authorisation. The gateway should be able to evaluate that a request is authorised and only then proxy it to its destination. The APIs of the specific applications then just accept any request that arrives and do not deal with security at all. User-specific information can be extracted inside the API Gateway and provided as additional parameters or headers on each request.

One of the approaches, in the case of the HTTP protocol, is that an authorization header must be present for the secured endpoints. The authentication mechanism must be backed by credentials stored in a database and then evaluated. There are established options for dealing with authorization, such as OAuth and SAML. If we split the UI into multiple independent parts, it presents an additional challenge, as the users want to access the different UIs without having to log in multiple times, so single sign-on functionality is required. The previously mentioned protocols deal with single sign-on as well, but they are not the only way of implementing it.

A specific service for the maintenance of the user accounts is required, but usually such services already exist. One example could be Atlassian Crowd, which provides connectors for managing users in multiple directories such as Active Directory and LDAP. With such solutions it again depends on the scaling capabilities and how many users and requests they can handle.

Session Cookies

Session cookie based authentication is a possible solution as well, but it affects how the API Gateway evaluates the authorization, or whether it defers it to another service. After the user logs in, a session id is generated and the associated information is stored on the server side. The security can still be evaluated on the API Gateway with access to the user session store, which is a good use case for Redis or Memcached as they are key / value based stores. There are multiple problems with this approach, as it adds extra roundtrips for each request going through the gateway, because the session store must be completely decoupled from the other services. For an application which only provides a REST API it adds state which needs to be maintained, and that is something we want to avoid if possible. Coupling the session information with each respective UI service would complicate the scaling, as the requests would need to stick to the instances which contain that specific state. A common approach is to orchestrate calls to multiple services, so the user information needs to be propagated along. Using cookies is not completely seamless either. Losing a session results in forcing the user to re-login, which is relatively harmless as it is an easy scenario to recover from. The cookies need to be encrypted using a secret shared across the services.

Json Web Tokens

A JSON Web Token (JWT) is a mechanism which provides a way of sharing data in the form of JSON documents between multiple parties, based on digital signatures. Tokens can be signed with a secret using an HMAC algorithm or with RSA. After the user logs in, a JWT is returned and must be stored by the browser to be used in future requests as part of the Authorization header. An important thing to keep in mind is that the token and the data stored in it are included in every request, so if the token data is too big it can affect the latency of the requests. For increased security the tokens can be encrypted as well, and they should not contain any sensitive information. The main benefit is that the tokens are self-contained and can carry information about the subject and its roles, so the service does not need to make any external calls to an authorization or session service to find out what the user has access to. The gateway or the target service can immediately deny access if the token is too old or does not provide the required permission for performing a specific action. So this approach does not have any problems with scaling or with propagating the information to the downstream services.

JWT has a built-in expiration mechanism. After the token expires, a new one needs to be issued, which means the user needs to re-login. Depending on the use case, on the web or in a native application, it is possible to issue a token which never expires - a user usually logs in once on a native application, but in our case the devices are shared between users, so that option doesn't apply. In common scenarios the token should be refreshed automatically before it expires, as long as it is being used. It doesn't make sense to do it on each request, as that would increase the overall latency. The refresh operation takes an existing valid token and extends the expiration date.
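
As a small illustration of why the tokens are self-contained, here is a sketch of issuing and checking an HMAC-signed JWT using only the standard library. A real system should use an established JWT library with proper JSON handling and also validate the exp claim; the claims and helper names here are assumptions.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class JwtSketch {

    private static final Base64.Encoder URL_ENCODER = Base64.getUrlEncoder().withoutPadding();

    /** Builds header.payload.signature with illustrative sub and exp claims. */
    static String issue(String subject, long expiresAtEpochSeconds, byte[] secret) throws Exception {
        String header = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";
        String payload = "{\"sub\":\"" + subject + "\",\"exp\":" + expiresAtEpochSeconds + "}";
        String signingInput = encode(header) + "." + encode(payload);
        return signingInput + "." + URL_ENCODER.encodeToString(hmac(signingInput, secret));
    }

    /** Checks only the signature; a real check would also evaluate the exp claim. */
    static boolean signatureIsValid(String token, byte[] secret) throws Exception {
        int lastDot = token.lastIndexOf('.');
        String signingInput = token.substring(0, lastDot);
        byte[] providedSignature = Base64.getUrlDecoder().decode(token.substring(lastDot + 1));
        // constant-time comparison to avoid timing attacks
        return MessageDigest.isEqual(providedSignature, hmac(signingInput, secret));
    }

    private static String encode(String json) {
        return URL_ENCODER.encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] hmac(String input, byte[] secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        return mac.doFinal(input.getBytes(StandardCharsets.UTF_8));
    }
}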

As a JWT can be stored inside cookies, both approaches are vulnerable to XSRF (Cross-Site Request Forgery), where the token can be used to replay requests or to apply requests with a changed payload, triggered without the knowledge of the user. Similarly, if the web application allows JavaScript to be injected through a form without sanitisation, malicious code can be executed; this is called XSS (Cross-Site Scripting). Such code has access to the cookies, so the attacker is able to execute arbitrary requests. These vulnerabilities are important to consider, as they are commonly overlooked.

3.4 Hardware Devices

Inside a fulfilment centre there are different types of devices and the system must be able to integrate with them. You can't expect them to provide REST endpoints, as most of these devices were built in the previous century, and since they work reliably their vendors have no reason to update their APIs. You will be lucky if you can integrate using SOAP-based web services, and you should expect file-based contracts using CSV files. Other options are proprietary protocols built on top of TCP or UDP, so the communication goes directly over network sockets.

Another problem is that these devices are connected to the local network inside the warehouse, which makes this even more complicated. The microservices are running in the cloud and are accessible over HTTP, but the hardware devices need two-way communication. No one wants some sort of polling mechanism where each device keeps calling your services to find out whether there is something for it to do. Another requirement is that the devices are uniquely identified and the system is aware of their logical location, which allows the commands to be routed accordingly. So we want to achieve push-based communication.

All the services are currently running in the cloud and the handheld devices communicate over HTTP using REST or are served web pages directly, as described in the previous section. For them the communication is mostly synchronous and request/response based, as the operators simply propagate their actions into the system. If there is a change in the service they are communicating with, they only get the update the next time they receive a response. This can be problematic in cases where, for example, a picker is picking some items and meanwhile the customer decides to cancel the order or a specific item. There is a race condition in the system: either the picker picks the item first, or it gets cancelled before he gets to it. We are kind of fine if the picker gets to the location, tries to pick the already cancelled item and the system knows about it - the action fails and the system tells the picker to put it back. But what about the case where there is a reasonable distance to the picking location and the system already knows that the item has been cancelled, yet the picker still walks there? The client application doesn't find out until it sends another request. In this case it would be much more useful to push that information to the client. That sounds much more reactive, doesn't it?

That is exactly a job for WebSockets. The client opens the connection with an HTTP request and does not need to expose itself publicly. The same mechanism can be used for any other client to which we want to push information.
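A minimal sketch of such a client, assuming Akka HTTP on the handheld side and a hypothetical wss://api.example.com/ws/picking endpoint exposed by the API gateway; the registration message format is an assumption as well:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.ws.{Message, TextMessage, WebSocketRequest}
import akka.stream.scaladsl.{Flow, Sink, Source}

object HandheldPushClient extends App {
  implicit val system: ActorSystem = ActorSystem("handheld")

  // Messages pushed by the server, e.g. "item 123 in order 987 was cancelled".
  val pushed = Sink.foreach[Message] {
    case TextMessage.Strict(text) => println(s"server push: $text")
    case _                        => () // ignore streamed/binary frames in this sketch
  }

  // Identify the device once, then keep the connection open (Source.maybe never completes).
  val outgoing = Source.single(TextMessage("register:handheld-42")).concat(Source.maybe[Message])

  // The device dials out, so nothing inside the warehouse has to be reachable from the internet.
  Http().singleWebSocketRequest(
    WebSocketRequest("wss://api.example.com/ws/picking"), // assumed gateway endpoint
    Flow.fromSinkAndSource(pushed, outgoing)
  )
}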

Warning

Security aspect: all of this is fine as long as the communication goes through a secured API gateway, ideally inside a VPN. Security experts will surely feel uneasy about this setup. It is good practice that databases and other infrastructure services are not directly accessible outside the VPC or the private network of the cloud.

Another problem arises when there are other systems placed inside the warehouse that are responsible for managing some of the hardware. They operate directly on the local network, and you really do not want to make them accessible from outside the warehouse.

That is problematic, especially when you are running resilient services with two or more instances at a time, where it is difficult to provide a static IP for a specific service. For that reason it makes sense to run part of the system locally within the warehouse. This leads us towards a hybrid cloud solution with some services running inside the warehouse itself. That makes complete sense: isolating the warehouse subsystem into the fulfilment centre brings lower latency and keeps the warehouse operating even if the internet connection is down or there is a problem with the cloud part of the services. Even when no sales orders are coming in, there are still plenty of operations that can be performed, such as transfers between warehouses and inbound of inventory. Given that the system is event-driven and a queuing mechanism is used, the events are published onto Kafka and propagated upstream once the internet connection is re-established or the failing services are recovered.

Barcodes & Printers

Warehouses are full of inventory and locations. Almost everyone has been to IKEA, so you can imagine to some extent what a warehouse looks like. Barcodes are still the main and cheapest identifier of products, so it is a logical decision to use barcodes for identifying the locations within a warehouse as well. It avoids manual input by operators, which is very error prone, and one of the main goals of the warehouse system is to track the location of inventory precisely. There are plenty of different barcode standards, but in my recent experience the devices on the market have no problems supporting them.

Barcodes must be printed on a regular basis inside a fulfilment centre. A variety of printers is used across the warehouse for this, and don't forget that our services are still purely cloud-based.

For example, inventory is delivered into the warehouse on pallets unloaded from delivery trucks. An operator needs to recognise which order is arriving, put a label on the pallet and clear the dock quickly so that other stock can be unloaded. For such a process, using big printers connected to a PC or directly to the network is not very practical and would slow things down.

That's where portable Bluetooth printers come in. You could try to push the printer command from the cloud to the device that has the Bluetooth printer connected, but it is simpler to contain the printer integration solely inside the native application. That is actually one of the strongest arguments for using native applications on handheld devices: they give you broader APIs for hardware integrations. Meanwhile the state of the process is still kept in the cloud, so it can be resumed if interrupted.

Another printing case is when stock leaves the warehouse, either to be delivered as a sales order to a customer or as a transfer to a different warehouse. Airway labels and additional tracking identifiers must be printed out and stuck on the package. In some cases you also need to print additional labels when moving stock in bulk.

Generally the way to go is to open a socket to the printer and send a printer command using the correct protocol. As you would expect, printers from different vendors use different command standards, so you have to deal with that as well, unless you go for an off-the-shelf solution that gives you all of this out of the box behind a nicer API. Designing the labels themselves is a tedious job, but there are tools that can help with that.
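A sketch of the raw-socket approach, assuming a label printer that accepts ZPL on the conventional raw-printing port 9100; the IP address and the label content are examples only:

import java.io.OutputStream
import java.net.Socket
import java.nio.charset.StandardCharsets

object RawPrinting {
  def printLabel(printerIp: String, zpl: String): Unit = {
    val socket = new Socket(printerIp, 9100) // raw printing port used by many label printers
    try {
      val out: OutputStream = socket.getOutputStream
      out.write(zpl.getBytes(StandardCharsets.US_ASCII))
      out.flush() // the printer starts printing as soon as the closing ^XZ arrives
    } finally socket.close()
  }

  // A trivial label: a single Code 128 barcode carrying a tote identifier.
  printLabel("10.0.12.34", "^XA^FO50,50^BCN,100,Y,N,N^FDTOTE-000123^FS^XZ")
}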

Now that we understand why and where printers are used across the warehouse, we need to integrate them into our microservices architecture. One of the problems is that, as mentioned before, there are different kinds of printers. We can rule out the Bluetooth ones, as they are the responsibility of the handheld native application. Printing over the network is easy: there are plenty of existing libraries that can help, and opening a network socket to the printer and sending the command is all we need. There can be cases where a printer is shared between multiple workplaces; for those we might need to implement a queue to sequentialize the processing, although many printers already provide this functionality themselves. Luckily most of the workflows in the warehouse work as pipelines, one step at a time, so the operator can wait until the label is printed.

Unfortunately, not all printers directly support network access. If they did, printing would be much easier: we would only need a service responsible for forwarding printer commands to the correct printer, based on an identifier assigned to the workstation or the printer.

We now have a couple of options for dealing with this at the architectural level. Bear in mind that we have already split the workflow processes into different parts, and printing may need to be triggered from any of those services. It therefore makes sense to create a dedicated microservice that deals with all of these problems and translates print requests into the printer-specific protocols. The next question is where to place this microservice: should it live in the cloud, or can we address the network printers directly by putting an additional service inside the warehouse? This brings back the idea of the hybrid cloud, with latency and isolation of this subsystem as the main benefits.

It feels a bit odd to send the print command to a server application that then sends the printer command back, when the printer is connected directly to the workstation computer being used. But printing locally from the workstation would again require a native application integration in order to print the labels properly.

printers 01
Figure 3-7. Hybrid Cloud Printing
1

The local printer service is installed in the local datacenter inside the warehouse. It opens a WebSocket connection to the printer service in the cloud and sends keep-alive requests to keep the connection open and to check that the service is available. The cloud printer service then uses this connection to forward printer commands.

2

The workstation uses the existing workflow to propagate the actions happening at the workstation. Once it reaches the step where printing is needed, the workstation sends a request to the cloud, and the service orchestrating the workflow forwards the command to the printer service together with the identifier of the workstation. The printer service maintains the active set of workstation printers and responds with an error back to the workflow if the printer is not connected.

3

The local printer service holds the routing table between workstation identifiers and IP addresses on the local network, so it can forward the commands to the appropriate station. It also checks that the printers are connected, so it can report an error back to the printer service in the cloud when they are not.

4

Printers that support direct network printing receive the commands directly from the local printer service; this communication stays on the local network.

5

Some printers are connected directly to a workstation, do not provide an Ethernet port and do not support direct network printing. For such cases additional software needs to be installed on the workstation machine, which is further described in the slightly different approach below.

printers 02
Figure 3-8. Cloud Printing with clients
1

Each printer is connected to a workstation, and client software is installed to communicate directly with the printer service in the cloud. The client receives commands from the service and maintains the two-way communication. It registers itself with the printer service, so the service knows that the specific workstation is active.

2

The client also checks whether the printer is active and forwards the printer commands to it. It runs as a daemon started with the workstation, as you want to simplify the process for the operators as much as possible. The client only acts as a proxy forwarding commands to the printer and checking its state. It needs to support different kinds of commands, as they are specific to the printers used.

3

The operator performs an action, usually submitting a form to confirm that the required step has been done. The submission is processed by the corresponding workflow service, which updates its state, generates a printing request and contacts the printer service.

4

The workflow waits until the printer service acknowledges that it has received the command. This can result in an immediate error if the workstation is not registered with the printer service, in which case the error is immediately propagated back to the operator.

Warning

Opening and maintaining connections from clients inside your local network to services in the cloud can be tricky. The connection passes through multiple points in the infrastructure, such as the API gateway, load balancers, routers and the services themselves, and each of them can have different timeouts configured. It can happen that a service goes down and the client never finds out that the connection has been closed. This needs to be handled and tested carefully. It is necessary to implement an automatic reconnection mechanism on the clients and to send keep-alive messages. If keep-alive responses are not received within a given timeout, the client needs to close the connection and try to reconnect. The server needs to track the number of open connections to prevent resource leaks.

Conveying Systems

Let's start with a quick explanation of what a conveying system is. The smallest and most common one you can imagine is the belt that conveys your groceries to the cashier in a supermarket. Another good example is the airport belt that brings you your luggage. The use case in a warehouse is quite similar: it moves items quickly across the facility and lets the operators focus on more complex tasks.

As the warehouse management system architect you can expect to be provided with a system that deals with the actual control of the hardware itself, but you should be prepared for an uneasy integration. It must work reliably, because errors have an immediate impact inside the warehouse: totes can go to the wrong destination or the whole flow can freeze. The layout of the conveying system is installation specific. What is very important is that it does not need to be continuous, meaning that not all destinations are reachable from all stations, since the conveyor belt can only move forward.

In the specific integration I was involved in, the conveyor control system needs to know the destination of a tote before it is placed on the belt. Cameras on the conveyor read the barcode on each tote and look up whether there is an existing order for it. If you put a tote on the conveyor without an order, it gets stopped because the system does not know where it should go. Another error case is when the destination is not reachable from the current station; the tote then stops or gets delivered to an error station. All of this causes delays, and delays cost money, because you can miss delivery deadlines for some orders and upset your customers.

The communication must work both ways, as the conveyor accepts orders and then sends back information about the delivery of totes to their destinations. It is also possible to dynamically add stations to existing orders.

In this case the integration choice was a two-way TCP connection with a defined protocol to be implemented. The packets contained sequence numbers and were confirmed by a response from the server. That is very important, because it really affects how the integration can be built: it is not possible to parallelize the communication, which means only one instance can communicate directly with the conveyor control system. Another consequence is that state must be kept within the service in order to maintain the continuous sequence of packets.

This is an ideal use case for a queue. It helps with sequentialization and gives us at-least-once delivery, so we can be sure that all commands are propagated to the control system once the acknowledgements have been received. We face similar problems as with the printer integrations regarding where to place the services. In this specific case there was an additional requirement to provide a static IP for the conveyor controller system to connect to, which is problematic when you have load-balanced services in the cloud.
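A sketch of the single-writer idea under these constraints: exactly one consumer instance reads the command queue and owns the TCP connection, so the sequence numbers stay continuous. The host, port, topic name and the framing (length prefix, sequence number, payload) are assumptions, not the vendor's actual protocol:

import java.io.DataOutputStream
import java.net.Socket
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object ConveyorWriter extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "kafka:9092")
  props.put("group.id", "conveyor-writer") // a single member in the group => a single writer
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("conveyor-commands"))

  val socket   = new Socket("10.0.20.5", 7000) // conveyor controller on the warehouse LAN
  val out      = new DataOutputStream(socket.getOutputStream)
  var sequence = 0

  while (true) {
    for (record <- consumer.poll(Duration.ofSeconds(1)).asScala) {
      sequence += 1
      val payload = record.value().getBytes("US-ASCII")
      out.writeInt(payload.length + 4) // length prefix covering sequence + payload
      out.writeInt(sequence)           // continuous sequence number required by the protocol
      out.write(payload)
      out.flush()
      consumer.commitSync()            // at-least-once: commit only after the command went out
    }
  }
}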

conveyor 01
Figure 3-9. Simple Conveyor Integration
1

The operator at the workstation performs a workflow that requires putting a tote onto the conveyor belt. The state of the workflow is updated via request/response over HTTP. Once the action is confirmed, the operator knows that the tote can be placed on the conveyor belt.

2

The workflow automatically creates a request on the dashboard service, and it is the responsibility of the dashboard service to accept the command, respond to the workflow that the command has been registered, and trigger the creation of an order inside the conveyor controller. This communication can be done as JSON over HTTP or as a message on a specific topic.

3

The dashboard exposes WebSocket endpoints to which the conveyor service inside the warehouse connects, maintaining an open WebSocket connection in the same way as the printers do in the previous part. The only difference is that the conveyor service sends more information to the dashboard service in order to confirm the active orders and to update existing orders with delivery information or errors for a specific order.

4

There is two-way communication between the conveyor service and the conveyor controller application. The conveyor service sits on a specific IP in the local network and opens a port for TCP communication to which the conveyor controller system can connect. In the same way the conveyor controller has an open port to which the conveyor service can connect. These connections are kept open and both sides actively try to reconnect once a connection is lost. The protocol provides keep-alive packets, which also advance the sequence number, to keep the connections alive when there is no other communication.

An improved implementation would use queues instead of WebSockets, provided we are able to connect directly to Kafka from within the warehouse LAN.

conveyor 02
Figure 3-10. Conveying Using Queues
1

The workflow step is finished and the operator puts the tote onto the conveyor belt.

2

The workflow emits an event on a topic to which the conveyor dashboard is subscribed. The dashboard part of the service translates the event into a specific command which can then be sent to the conveyor service.

3

The commands, mainly for placing orders, go into a queue which is accessible by the conveyor service. This replaces the WebSocket communication from the previous example with queues, which gives us more resiliency: the technology itself provides at-least-once delivery guarantees, and the client can reprocess the same message multiple times until it is successfully confirmed.

4

Two-way communication between the conveyor service and the conveyor controller system, exchanging orders and their confirmations over the TCP protocol described earlier.

5

Information about successful deliveries of totes to their destinations and confirmations of processed commands go back to the dashboard in order to provide the current state of the conveyor system.

6

A dashboard is available for managing the conveyor belt system. It displays the active orders and provides the ability to abort them, shows the topology of the conveyor system and allows the available stations in the system to be opened or closed.

Cubing and Dimensioning Systems

The last interesting device you would need to integrate with when implementing a warehouse management system is the cubi scan. A Cubing and Dimensioning System (CDS) is used to take the cubic dimensions and weight of the packaged items coming in. This information is then used to classify the products, to quickly decide where they can be stored and how they will be shipped to the customer. It is also used to optimise warehouse utilisation and to set limits on what can fit where in terms of dimensions and weight. Unlike the previous two devices, this one is only responsible for pushing data into the system, so we need to provide a reliable interface to receive that data. Again we need to integrate devices that are connected to the local network inside the warehouse.

Integration possibilities for a Cubing and Dimensioning System:

  • TCP connection over LAN - directly from the hardware itself

  • CSV file export on a host machine - through software client on a workstation

  • CSV file export on a FTP server - through software client on a workstation

The CDS itself can be operated without any workstation, but then the input of user-defined data is limited. If an application on the workstation is used, the operators can provide additional fields, which is helpful in many cases.

We have multiple options for solving this problem. As before, we can create a client that is installed on the local machine. What comes to mind is a reader of the CSV records that pushes the information into the system; we transform the information in the CSV into another format and either use JSON over HTTP or commit the JSON onto a queue. We need to keep in mind that there will be more of these devices. We could use an FTP server to which all the workstations upload their CSV files, but that would be a single point of failure for all the devices, so we would like to avoid it.

That leaves us with processing the files directly on the workstation machines. The main disadvantage is an additional executable that needs to be distributed and installed. Fortunately the set of fields is predefined and the transformation from CSV into another format is very easy to write, as the sketch below illustrates.
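A minimal sketch of such a workstation client using the JDK HTTP client; the export file location, the column order and the Product Service URL are assumptions for illustration:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.{Files, Paths}

import scala.jdk.CollectionConverters._

object CdsUploadClient extends App {
  val client     = HttpClient.newHttpClient()
  val exportFile = Paths.get("C:/cds/export.csv")                        // assumed export location
  val endpoint   = URI.create("https://product-service/api/dimensions") // assumed Product Service URL

  // Skip the header row, turn each record into JSON and upsert it into the Product Service.
  Files.readAllLines(exportFile).asScala.drop(1).foreach { line =>
    val Array(sku, length, width, height, weight) = line.split(",").map(_.trim)
    val json =
      s"""{"sku":"$sku","lengthMm":$length,"widthMm":$width,"heightMm":$height,"weightG":$weight}"""

    val request = HttpRequest.newBuilder(endpoint)
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(json)) // PUT as an upsert: re-sending a row is harmless
      .build()

    client.send(request, HttpResponse.BodyHandlers.discarding())
  }
}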

cubiscan 01
Figure 3-11. CDS integration
1

The CDS is connected to the workstation host computer with a USB cable, and the CDS-specific software is installed on the workstation machine. The main system is completely isolated from the CDS hardware, which uses its own proprietary protocol. The data is periodically exported into a CSV file.

2

Additional client software, which is part of the warehouse system, must be installed on the machine in order to read the CSV file and establish a connection with the Product Service API. The client application is very simple and takes only a couple of parameters, such as where to read the CSV file from and the URL of the microservice if we are using a REST endpoint.

3

The CDS application periodically exports its data into the file, and the client periodically reads the exported CSV file and pushes the information into the Product Service. In this case the CDS is the source of truth for the data, which means the updates can be processed as upserts inside the service, and we do not need to care if we process the same row from the CSV multiple times. The client application simply sends an HTTP request with the payload.

As said before, the disadvantage of a custom client on the workstation machine is that someone needs to maintain that software and ensure it is installed properly. On the other hand, fulfilment centres have their own IT department, so this should not be a problem. The frequency of updates needs to balance the number of requests the Product Service can handle against the latency that is required. Within the warehouse process this data needs to be processed quickly, as the cubi scan step actually blocks receiving the items into the warehouse, so an export every couple of minutes is needed. One advantage is that the data is backed up on each machine; unfortunately, if the whole Product Service were lost, recovering the data would mean aggregating the CSV files from all workstation machines, but that is a pretty unlikely event.

Tip

Existing libraries can be used to implement the integration with the hardware devices. If you come from a JVM-based background I would recommend Apache Camel. It is a great integration middleware which provides support for XML, SOAP, JSON over HTTP, REST and CSV, plus connectors to various databases, messaging solutions and even Akka. It has out-of-the-box support for bulkheading, retries and other common patterns for achieving reliability and resiliency. Using it for the adapter services gives great flexibility and can speed up development considerably.
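As a rough sketch of how compact such an adapter can be, a Camel route that polls the CDS export directory and forwards each CSV row to the Product Service; the directory, the endpoint URL and the assumption that the file and HTTP components are on the classpath are all illustrative:

import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext

object CsvAdapterRoute extends App {
  val context = new DefaultCamelContext()

  context.addRoutes(new RouteBuilder {
    override def configure(): Unit =
      from("file:/data/cds?noop=true")              // poll the exported CSV files (assumed path)
        .split().tokenize("\n")                      // one CSV row per exchange
        .to("http://product-service/api/dimensions") // forward each row to the Product Service
  })

  context.start()
}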

architecture 07
Figure 3-12. Added integrations with multiple services

Our diagram now contains the additional services that deal with the integration of the hardware devices into our architecture. In general they listen to events on their topics, relay the commands to the hardware devices and report failures.

3.5 External Interfaces

There are many external services that the warehouse system needs to communicate with. The most obvious ones are finance, procurement, logistics and the consumer platform that submits the orders to be fulfilled. In most cases these integrations must work both ways. A simple example is the hand-over of a package by the fulfilment centre: a call creates a logistics order in the logistics system, and as the delivery of the package progresses, the logistics system needs to update the warehouse system about the current state of the package.

The external systems are completely out of our control and have their own communication standards. One could argue that this is the same as integrating with any other microservice in our system, but that is not the case. Within our system there is a pre-agreed protocol and common communication patterns are established: JSON over HTTP and publish/subscribe topics with JSON, Protobuf or Avro messages, or possibly other protocols.

When you need to integrate with an existing system, it may offer different integration methods:

  • CSV Files exchanges

  • XML SOAP

  • Restful JSON

  • JSON over HTTP

  • TCP connection

  • Queues / Database - less often

For that we need a set of microservices that deal with those integrations. Their sole purpose is to translate from the internal platform protocol into the external protocol and vice versa. Ideally such services are completely stateless, without any need to access data. Unfortunately that is not always the case, as a mapping between the external and internal identifiers is often required. One way to solve this is to include those identifiers in the entities that take part in the transactions, so they are contained in the events triggering the calls to the external services. In an ideal world a microservice should have all the information it needs to perform its duties without making any calls to another microservice, but there are obviously exceptions to this.

There is a set of important patterns that help with controlling the propagation of failures and dealing with external integrations.

Bulkheading

This is a well-known pattern whose purpose is to isolate the parts of an application into failure zones, so that a failure in one zone does not propagate to the other parts. That makes the application more resilient. It can be applied to the whole system as well as to single microservices.

This can be achieved at multiple levels depending on the technology used; Akka makes a really simple example.

In Akka, bulkheading is achieved in two ways. First, failure zones are defined by the actor hierarchies, and those hierarchies deal with the propagation of failures through supervision strategies. That gives us a powerful way to define how errors should be propagated between the actors.

A supervision strategy in Akka is defined as a mapping from different types of exceptions to a specific directive. The directive specifies what the parent will do when an exception is thrown by its child actor (a minimal sketch follows the list):

  • Resume - resumes the actor's operation by ignoring the exception and skipping the faulty message

  • Restart - the child actor is restarted and loses its state

  • Stop - the child actor is terminated

  • Escalate - The exception is propagated to the parent of the actor which received the exception
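A minimal sketch of such a strategy, using classic Akka actors; the worker, the exception types and the directive choices are illustrative, not a prescription:

import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Escalate, Restart, Resume, Stop}
import scala.concurrent.duration._

// Hypothetical worker doing the blocking printer I/O.
class PrinterWorker extends Actor {
  def receive: Receive = { case _ => () /* open socket, write the command */ }
}

// The parent maps exception types to directives; each child is its own failure zone.
class PrinterSupervisor extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: java.net.SocketTimeoutException => Resume   // skip the faulty message, keep state
      case _: java.io.IOException             => Restart  // recreate the child, it loses its state
      case _: IllegalStateException           => Stop     // terminate the child for good
      case _: Exception                       => Escalate // let this actor's own parent decide
    }

  private val worker = context.actorOf(Props(new PrinterWorker), "printer-worker")

  def receive: Receive = { case cmd => worker forward cmd }
}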

The most important part of bulkheading in Akka are the dispatchers. A dispatcher handles the assignment of threads to actors so they can perform their tasks, and it is backed by a dedicated thread pool. Sets of actors can be configured to use different dispatchers, which gives us isolated zones of computation that do not affect the performance of other computations, as they do not share the same threads.
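A sketch of a dedicated dispatcher for blocking I/O; the dispatcher name, pool size and the worker actor are assumptions:

import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

object BulkheadedSystem extends App {
  // Hypothetical actor doing blocking printer I/O (same role as in the previous sketch).
  class PrinterWorker extends Actor {
    def receive: Receive = { case _ => () }
  }

  // A dedicated dispatcher defined in configuration: only actors assigned to it compete
  // for its small thread pool, so a stuck printer cannot starve the rest of the system.
  val config = ConfigFactory.parseString(
    """
    printer-io-dispatcher {
      type = Dispatcher
      executor = "thread-pool-executor"
      thread-pool-executor { fixed-pool-size = 8 }
      throughput = 1
    }
    """).withFallback(ConfigFactory.load())

  val system   = ActorSystem("warehouse", config)
  val printing = system.actorOf(Props(new PrinterWorker).withDispatcher("printer-io-dispatcher"))
}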

Another strategy provided by Akka is the BackoffSupervisor pattern, which restarts an actor with progressively longer intervals, giving an external dependency of that actor time to become available again.

You can even think of whole microservices as bulkheading zones, as ideally the failure of one microservice should not affect the functionality of another. That does not mean every microservice can fully process all requests; it means that it responds with the information that it is temporarily unable to fulfil its duties while waiting for the failing service to come back up.

Tip

There are multiple types of dispatchers in Akka which are worth investigating. Akka itself provides support for dealing with many of the problems described in this chapter and documents the solutions in greater depth, so I recommend reading the documentation at akka.io.

Graceful Degradation / Fallback

Any system is only as resilient as the parts from which it is built. A single simple failure can stop the system working as a whole, which is exactly what a resilient system should prevent. If a simple device stops working it is not a big problem, but more critical systems or machines, such as nuclear plants or aircraft, require the ability to maintain limited functionality even when a part of the system is not working. The main purpose is to prevent a whole system from stopping because of a small failure. That is a general pattern which needs to be considered when building microservices.

Let's imagine a scenario where a customer of a video streaming platform wants to watch a paid live event, such as the final of the Champions League. There are multiple services responsible for different parts of the user experience, and one of them is responsible for evaluating access to the content. Ordering and processing the payment for the event goes through fine, but because of a failure in the datacenter the access database is not reachable, and it will take a while until this part of the system is fully functional again. In that case we do not want to block the customer from watching, as they have already paid for the content. Instead, the system can run in a degraded state and give the user temporary access for a short period of time.

When services depend on each other and call each other to get data they need for their operation, a set of default values can be used when the service that holds the correct mapping is not available. That provides a fallback mechanism for an unavailable dependency. This is not always possible, but every service should know what to do when its dependency is down. Another possibility is a secondary datastore used as a buffer until the main storage comes back up. Even when a service is under high load and unable to process all requests, the fact that the process does not crash and a percentage of the requests is processed successfully is a form of graceful degradation.
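A minimal sketch of such a fallback for the streaming example above; the service trait and the decision to default to "allow" are assumptions:

import scala.concurrent.{ExecutionContext, Future}

object DegradedAccess {
  // Hypothetical client for the entitlement service from the streaming example.
  trait AccessService {
    def hasAccess(userId: String, eventId: String): Future[Boolean]
  }

  // If the dependency fails, fall back to temporary access instead of blocking a paying customer.
  def canWatch(access: AccessService, userId: String, eventId: String)
              (implicit ec: ExecutionContext): Future[Boolean] =
    access.hasAccess(userId, eventId).recover {
      case _: Exception => true // degraded mode: allow for a short grace period
    }
}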

Throttling

When we know we are calling a non-reactive service that cannot correctly process a high number of concurrent requests, we need to limit the rate of requests we send. This can be done statically or dynamically.

This can be achieved in multiple ways. We can dedicate a limited number of threads to making those external calls, which limits the number of concurrent requests. But even with a limited number of threads, the rate of calls can still be too high for the external system to cope with.

Static throttling means setting a rate in configuration which defines how many calls the client may make in a given period. That requires a counter and adds state which needs to be maintained. The dynamic approach is to measure response times and adjust the number of requests so that all of them can be fulfilled in time. The sketch below shows static throttling.
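A sketch of static throttling using Akka Streams; the rates, the burst size and the legacy call are illustrative assumptions:

import akka.actor.ActorSystem
import akka.stream.ThrottleMode
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.Future
import scala.concurrent.duration._

object ThrottledCalls extends App {
  implicit val system: ActorSystem = ActorSystem("throttling")
  import system.dispatcher

  // Hypothetical call to the legacy, non-reactive system.
  def callLegacySystem(id: Int): Future[Unit] = Future(println(s"sent command $id"))

  // Static throttling: at most 50 calls per second, with bounded concurrency on top.
  Source(1 to 1000)
    .throttle(elements = 50, per = 1.second, maximumBurst = 10, ThrottleMode.Shaping)
    .mapAsync(parallelism = 4)(callLegacySystem)
    .runWith(Sink.ignore)
}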

Back Pressure

Back pressure is an important concept for maintaining the throughput of communication. Imagine the following scenario: a producer sends messages to a consumer, streamed or batched, it does not matter. The producer produces an infinite stream of messages and the consumer must process them in a timely manner. The consumer has a limited buffer or queue in which messages can wait until processing resources become available. Everything is fine as long as the consuming side can process messages at least as fast as new ones are produced; otherwise the messages start to queue up. The problem starts when the consumer becomes overloaded and cannot process all the messages in time, which means it would need to skip or drop messages, inevitably breaking the delivery guarantees. In some cases that is acceptable; for example, when streaming video it is fine to lose some frames for a while. But when the events are more significant and the system cannot afford to lose them, because the data would be corrupted by a missing message, we need to deal with it. So what to do? The consumer needs a feedback channel for telling the producer that it is not able to consume fast enough. This feedback channel needs to be asynchronous so that it does not directly block the consumption of messages.

A protocol must be agreed in advance. It can be as simple as sending a message back which the producer interprets as "stop sending". If a service cannot process messages in a timely manner, it will inevitably become unavailable, which can cause problems for other services that depend on it, and the error can cascade through multiple services. It is important that the service can signal that it is unable to respond to all requests, so the other services can handle that situation correctly. With a more fine-grained protocol the consumer can tell the producer how many requests it can handle, and the producing service only accepts up to the amount it knows the downstream system will be able to process, preventing message loss.
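A tiny demonstration of the concept using Akka Streams, where demand signalling is built in; the event source and the simulated slow processing are assumptions:

import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}

object BackPressureDemo extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure")

  // A fast producer and a deliberately slow consumer. The small buffer with the
  // backpressure strategy propagates demand upstream instead of dropping events.
  Source(1 to Int.MaxValue)
    .buffer(size = 16, OverflowStrategy.backpressure)
    .map { event => Thread.sleep(100); event } // simulated slow processing
    .runWith(Sink.foreach(event => println(s"processed $event")))
}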

The services in the system must be designed with support for back pressure in order to provide a truly reactive system. Obviously, we cannot rely on external services to provide such capabilities.

Circuit Breakers

The name of this pattern comes from electrical engineering: a circuit breaker is a device which protects an electrical circuit from excess current. The same idea is behind this pattern, except that it protects the external systems being called, and it can equally be used within an application or towards other microservices in your system. It can even be seen as a specific form of the throttling described before. The goal is clear: we want to break the circuit, meaning we stop making calls to the external system when some condition is met. We achieve this with an additional check of whether the breaker is open or closed before the external call is made. If the circuit breaker is open we immediately reply to the caller that the request has failed, giving the external service time to recover. We need to monitor the number of failures returned by the external service; the simple approach is to count failures over a period of time and open the circuit breaker when a preset number of errors is reached.
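A sketch using Akka's own akka.pattern.CircuitBreaker (referenced below); the thresholds and the external call are assumptions:

import akka.actor.ActorSystem
import akka.pattern.CircuitBreaker

import scala.concurrent.Future
import scala.concurrent.duration._

object LogisticsCalls extends App {
  implicit val system: ActorSystem = ActorSystem("breakers")
  import system.dispatcher

  // Hypothetical call to the external logistics provider.
  def reserveDelivery(orderId: String): Future[String] = Future(s"reserved $orderId")

  val breaker = new CircuitBreaker(
    system.scheduler,
    maxFailures  = 5,          // open after 5 consecutive failures
    callTimeout  = 2.seconds,  // a call slower than this counts as a failure
    resetTimeout = 30.seconds  // after 30s go half-open and let one trial call through
  )

  // While the breaker is open this fails immediately, giving the external service time to recover.
  def protectedReserve(orderId: String): Future[String] =
    breaker.withCircuitBreaker(reserveDelivery(orderId))
}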

Akka has its own implementation of circuit breakers, described at http://doc.akka.io/docs/akka/current/common/circuitbreaker.html.

I cannot forget Netflix and their Hystrix, which provides implementations of most of the patterns mentioned above and includes isolation of execution on different threads: https://github.com/Netflix/Hystrix

All the mentioned patterns can be used within a single application as well as for communication between microservices. They are an essential part of this chapter, as most external service integrations can be expected to require them.

External Warehouse Integrations

In a warehouse the main external services to integrate with are finance, logistics and procurement, and we need to add specific adapter services to support those integrations. We said before that these services should be completely stateless, but it makes sense for them to maintain caches of information specific to the external system if we want to decrease latency, as calls through to an external system can be costly. Products are not part of an external system, as they can be maintained completely inside a warehouse, but if we want to support multiple warehouses it makes sense to create a central database of products in the cloud which can be shared by all of them.

  • Logistics - two way communication for placing delivery/retrieval/transfer orders and tracking statuses of such orders

  • Finance - needs information about received and dispatched stock; mostly a one-way export of data into the finance system

  • Procurement - propagation of the purchase orders and information about them into the warehouse

architecture 08
Figure 3-13. Added integrations with multiple services

These adapter services are connected to a message topic and proxy the translated requests or events to the external systems. They can also provide call-through capabilities for looking up information in external systems, which is actually quite dangerous depending on the SLAs of those services; these need to be carefully evaluated. The worst outcome is that your customers cannot place orders because an external service is unavailable. An example could be an external logistics provider which requires every order to be reserved in its system before delivery. If such a system goes down, the order placement logic can fall back to accepting the order without the reservation. The only consequence is that the delivery date cannot be guaranteed, but we can expect the order to be delivered within a couple of days, so the quality of service is automatically degraded while the order can still be processed. Other information is relevant as well, depending on the priority of the order and the current throughput of the warehouse. In some cases it makes sense to degrade the delivery time of the order, or to put it in a queue and wait for some time for the external service to come back up, assuming that it will.

3.6 Beyond Single System

The ultimate feature of a global fulfilment platform is the ability to dispatch and deliver shipments all over the world, regardless of the location of the stock, within a couple of hours. This is not currently achievable easily because of the physical limitations of the delivery methods and the specifics of the world's markets: different business requirements driven by differences in legal systems, cultural habits, available providers and market-specific stock features such as supported GSM bands, types of electrical sockets and so on.

A single physical fulfilment centre is not an efficient solution for big regions, as the delivery cost would be too high and the delivery times too long in most cases. The logical step is to connect multiple depots, warehouses, fulfilment centres and logistics centres together and, by integrating them, achieve better service for the customers. It also opens ways for cost optimisation within the region, which results in a competitive advantage. So far a system which can receive and fulfil orders has been described; the goal now is to add support for additional warehouses within that existing system and to support multiple distinct physical locations.

There are two main approaches to evolving our existing system to solve this:

  1. Add support for multiple warehouses inside the system (in each microservice which needs it)

  2. Extend / refactor the existing services to support multiple warehouses

In either case a new layer of microservices needs to be added so the system can work with information from multiple warehouses and make it usable for the clients.

Note

Support for multiple warehouses is a big architectural decision which may seem to have been omitted until this subchapter. This was done in order to simplify the system for learning purposes, as it is easier to explain things from the bottom up. Discussing the set of problems and possible solutions further hopefully brings additional value for the reader.

An important thing to consider, as mentioned before, is the nature of the environment used for running the services. There are three options:

  1. Local - all services run in the warehouse datacenter, typically a monolithic application

  2. Hybrid - some services run in the warehouse datacenter, some in the cloud

  3. Cloud - all the services are deployed in a datacenter which is not directly in the warehouse

Regardless of the approach, the locality of the data and the latency of responses are important for the operation of the warehouse. This means that the services for multiple warehouses would inevitably live in different datacenters across the world, just like the depots.

Supporting the Customer

Our warehouse system maintains information about the various stock levels for different purposes, and it can provide aggregated values based on the different attributes of the products. By default the values for each product are maintained in different buckets describing the state of the items (e.g. On Hand, Pickable, Reserved, Damaged) with additional information about the items' owners. That is mainly to satisfy the finance and procurement requirements and to support consignment and seller stock. The tracking system, based on the movements, allows querying the availability of specific stock, as this information is required internally for successful fulfilment of the orders.

The warehouse system has an internal locking service for reserving stock, but most systems would probably want to reserve stock in a warehouse directly when a customer puts it in the basket on an e-shop site. This behaviour simplifies the checkout procedure, if you trust the assumption that a reservation means the stock is available and the order will be fulfilled. As we already know, it is not that simple, and this should be dealt with in another service in the consumer platform. To implement such a feature, the stock availability information must be exposed to the other systems so it can be used to implement a reservation mechanism without any significant impact on the warehouse system.

Customers always want to know the possible delivery options and their price. Recently, with the customer-first approach, faster methods of delivery have emerged. These methods are available only in certain locations, especially cities, and the customer's address is required to evaluate the delivery options. To provide the customer with such options reliably, the information about stock types and availability in different locations must be easily accessible.

These new features define a new set of requirements: knowing exactly what inventory is available and where, and which delivery options are available. The first option is to provide the new functionality as part of the existing services; the other, embracing the microservice approach, is to create a set of new ones. In any case these features require a lot of data to be moved around and are read heavy, considering that it is desirable to display availability on the e-shop product detail pages and to do deeper checks throughout the checkout process. Caches must be used to answer such queries in reasonable times.

With data about a single warehouse or multiple warehouses stored in multiple places, there are two ways to make that data accessible:

Pull based - querying a provided API. Existing queries can be exposed within an external warehouse API to provide the required information on demand from the source.

Push based - a stream of events to which the upstream service can subscribe. Events are published when stock levels are updated, either as increments or as actual values, whenever the overall availability is affected. Not all movements within the warehouse directly affect the availability of the stock for the end customer.

Creating a fully centralised system which aggregates the data from multiple warehouses in order to provide globally available stock information is not that important for the customer, which means it is not necessary to provide real-time global availability of the stock. On the other hand, that does not mean a customer cannot order stock from a different continent through a different entry point, if delivery options are available.

Data Aggregation

Aggregation of different types of data is one of the most common use cases within a service which manages data. SQL databases have good support built into the language, and while the performance of the queries depends on the correct design, they generally perform well as long as the operations do not involve too many joins and the keys are properly indexed. There is no simpler way to aggregate the stock from different locations than sum and group by:

Example 3-2. Example SQL Aggregation
SELECT s.sku, SUM(s.available) FROM stock s GROUP BY s.sku;

This simple approach is only possible if the data is available in one database. Let's estimate the number of rows required: there are hundreds of millions of different products, which could be stocked in somewhere between a handful and tens of warehouses within a region, and there are lower tens of regions.

If we use Amazon as a benchmark (the values do not include different variants of the products):

Table 3-1. Number of products Amazon sells (2015, export-x.com)

Site           Products [million]
Amazon.com     488
Amazon.co.uk   261
Amazon.fr      237
Amazon.co.jp   168
Amazon.it      165

Reference - https://export-x.com/2015/12/11/how-many-products-does-amazon-sell-2015/

The upper boundary for a single warehouse containing all the products is in the units of billions of rows, if we assume there are different variants of the same products (clothing sizes and so on). That gives us tens of billions of records to provide all the stock levels within a region.

This suggests that managing these values is achievable within a per-warehouse SQL database, but things get more complicated once the whole region comes into play.

Another important concern is how often those values change: new inventory comes into the warehouse and existing inventory is lost, damaged, sold or returned to its owner. All these events increment or decrement the amount of available stock, so the values change constantly. The consistency of the stock availability will be highly eventual, given that the data would be synchronised from multiple systems and the amounts change constantly.

Generally, aggregating state into an intermediary service which does not fully own that state and duplicates the data is an anti-pattern. It is always difficult to synchronise data and ensure its consistency, as the implementations of distributed databases can only confirm. Even assuming we can fit the data into a single table in an SQL database and get reasonable query response times, additional read replicas would be required to satisfy the high demand for queries every time a product page is displayed.

Note

From my experience, people often think that aggregating state into a single centralised system can shield downstream services which are not fully reactive from high load and provide better overall availability of the whole system. It also suggests that if the data is in one single database, the latency of queries to that system will be lower than when the requests are proxied directly to the original sources of the data. When using a distributed database that is exactly what happens under the hood anyway: a specific node is hit and, based on the query, additional requests are dispatched to other nodes of the database, the same as having a proxy service calling a set of downstream services and aggregating the results. If all the downstream services live in the same datacenter there is no loss in performance, but querying services in different datacenters definitely adds latency and increases the probability of a service being unavailable. The synchronisation is hard to implement and, even when done correctly, introduces latency which ends up in an eventually consistent system.

One approach would be, for each display of a product detail page, to fire a set of requests checking the availability in every single warehouse and return a result as soon as the first response with available stock comes back. This is terribly inefficient and slow, given that the request comes from a customer device, so making one request to one system which can provide an overall value seems like a much better idea. In both cases the use of caches is a very reasonable thing to do.

A precomputed value per product type per region would provide a much faster response. The number of rows would be maintainable within one table, and the query performance can be improved by adding read replicas or introducing a cache, at the cost of increasing the inconsistency even more. The downside is that the value would change even more frequently, as it reflects the events in all the warehouses and the orders being placed.

In the case of a NoSQL database without support for joins and aggregation or grouping, the only way to provide the aggregated values is to do the processing in the application layer, writing a service which does it for us. Such a service would be required anyway when aggregating data from multiple SQL databases.

One approach could be to periodically query all the products from all the different warehouse systems and store them in a database. That is obviously a terrible solution which would provide data with high latency and put a performance strain on the warehouse systems.

With a NoSQL database it would depend on the functionality provided for aggregations, which is usually pretty limited, especially if we want an overall value for a specific product within a region. So again, additional processing logic must be implemented outside the database itself.

When using Cassandra with a model where the stock is maintained at the location level, a materialized view can provide a partition per specific product, which can then be used to compute aggregated values from within that partition:

Example 3-3. Example CQL Aggregation
CREATE TABLE IF NOT EXISTS stock_by_location (
    location text,
    sku text,
    quantity int,
    reserved int,
    PRIMARY KEY ((location), sku)
);

CREATE MATERIALIZED VIEW IF NOT EXISTS stock_by_sku AS
  SELECT
    sku,
    location,
    quantity,
    reserved
  FROM stock_by_location
  WHERE sku IS NOT NULL
  AND location IS NOT NULL
  PRIMARY KEY ((sku), location);

SELECT sku, sum(quantity), sum(reserved) FROM stock_by_sku WHERE sku='1234567890';

Incremental Approach

Another approach is to use asynchronous messages representing increments and decrements, or publishing the actual value of the stock in a specific warehouse. In that case the dependency between the warehouse systems and the service providing the overall values is minimal. How the events are produced depends on the implementation: each time a write to the tracking system is done, an event can be produced. This maps nicely onto the existing system, as it already uses movement events. The stream of movements from different warehouses is consumed by one service which updates the value counters in its own central database. Depending on the chosen approach to processing the messages, race conditions and out-of-order processing can occur, which is an important consideration when throughput matters: if the order needs to be preserved, the messages need to be processed sequentially, although the work can still be distributed when partitioned by a specific key, depending on the granularity of the events. Using incremental updates seems simpler, as the overall sum will eventually be correct regardless of the order. The problem with this approach is that the transferred information is only partial; without the whole log of events it is not possible to accurately reconstruct the actual value. So either the system must provide the capability to replay all the events, or publishing the current state instead of the increment is the way to go. The value-based approach does not allow simply incrementing and decrementing a single counter, which means that separate values per warehouse still need to be maintained in the database.

An example solution can leverage Kafka to provide the whole history of events, together with ordering guarantees if necessary. The publishing mechanism can rely on external API calls to reduce latency, but the queue-based option has the benefit of asynchronous processing. Events published onto a Kafka topic can be processed using Kafka Streams, and the aggregated values can then be streamed directly into the data store where they are available for queries. An interesting Kafka feature is compacted topics, which always hold the last value for a specific key; in our case this could be the combination of warehouse and product identifier, and it can be used for rehydration after a failure of the aggregating system or for replication to a different datacenter.
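A sketch of the incremental aggregation with the Kafka Streams Scala DSL; the topic names, the key format ("warehouse:sku") and the import paths (which vary slightly between Kafka versions) are assumptions:

import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object StockAggregator extends App {
  val builder = new StreamsBuilder()

  // "stock-movements" is keyed by "warehouse:sku" and carries signed quantity deltas.
  builder
    .stream[String, Long]("stock-movements")
    .groupByKey
    .reduce(_ + _)        // running total per warehouse and SKU
    .toStream
    .to("stock-levels")   // compacted topic keeping only the latest value per key

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stock-aggregator")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")

  new KafkaStreams(builder.build(), props).start()
}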

Does the system actually need to know the exact value of the available stock? E-shops do not display explicit availability numbers, as their competitors could use this information against them; it is enough to indicate whether the item is in stock. A threshold can be set based on the popularity of the stock, so that a warning like "fewer than 10 items available" can be shown to the user. This indicates the likelihood of the order placement failing, if backorders are not considered, as the order can fail at placement time when there is not enough stock. The fact that the values are not always consistent does not matter that much, and if the exact value is only propagated for stock below the threshold, it can be cached very effectively.

It is all about fine-tuning the system to decrease the percentage of orders that cannot be fulfilled because of unavailable stock and to increase customer satisfaction. During the checkout process it makes more sense to do deep checks and directly query the source system responsible for the stock from which the order would be fulfilled.

There are exceptions, as always. One common scenario is limited offers, where a counter must be maintained independently within the e-shop service, as it is based on the number of orders placed for a specific item. This can be further improved by integrating information about the lead times of products, i.e. how long it takes to get them from a manufacturer, and keeping this information up to date. Such information can be used to implement an efficient system for backorders: the stock is not actually available in any warehouse, but the system is able to accept orders while indicating to the customer that the delivery date will be delayed.

Order Differentiation

The complexities of the different delivery options and their availability have been ignored in the previous chapters, but they need to be dealt with. The delivery and splitting options affect the price for the customer: more shipments mean more deliveries, which is more expensive, just as prioritised fast delivery is (usually) more expensive both for the customer and for the provider. If 2-hour or next-day delivery options are available, the processing throughput and the capacity of specific delivery slots provided by the logistics centre must be tracked; not everything can be delivered within the next two hours. The system needs to provide this information, together with the splitting logic and stock availability, so that the consumers of the fulfilment API can use it to improve the customer service, and the e-shop can provide accurate information about possible order deliveries and fulfil them as agreed.

Let's say each fulfilment centre has a different set of delivery methods available; they should not vary much, but they can differ per region and per depot. Obviously, for 2-hour delivery in Manchester only items in a fulfilment centre within reach are eligible. So the available delivery methods must be provided and depend on the delivery address given by the customer, which translates into the distance from a specific fulfilment centre against which the delivery methods can be evaluated.

Another service must provide the logic for getting this information from all the independent systems responsible for maintaining the stock, knowing the different endpoints it needs to call in order to get the available methods. In a scatter-and-gather manner it queries multiple endpoints and aggregates the responses, which are then presented to the customer. There are not that many kinds of delivery methods and variations, so maintaining these values is not problematic; they can be stored directly as internal state of the new service, and even a separate service for maintaining the different delivery types, shared across all the centres, makes sense. Importantly, the available delivery methods do not change often; at most they are enabled or disabled for specific regions or facilities.
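A minimal scatter-and-gather sketch; the client trait and the decision to let a failing centre contribute nothing are assumptions:

import scala.concurrent.{ExecutionContext, Future}

object DeliveryOptions {
  // Hypothetical client for a single fulfilment centre's delivery-methods endpoint.
  trait FulfilmentCentreClient {
    def availableMethods(postcode: String): Future[Set[String]]
  }

  // Scatter-gather: ask every centre in parallel and merge the answers;
  // a failing centre simply contributes nothing instead of failing the whole query.
  def forAddress(centres: Seq[FulfilmentCentreClient], postcode: String)
                (implicit ec: ExecutionContext): Future[Set[String]] =
    Future.traverse(centres) { centre =>
      centre.availableMethods(postcode).recover { case _ => Set.empty[String] }
    }.map(_.flatten.toSet)
}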

The different shipment packages are standardised and do not impact the delivery as much. So another service will be responsible for dividing the basket into smaller units based on the availability in the different warehouses. The criteria would be the size and weight of the ordered items and how they fit into the different packages. That requires information about the products and their physical and logical classification, which is maintained within the product management system. This is the information needed to provide the final delivery options and compute the delivery costs. The best customer experience is to offer the possibility of grouping the items based on their availability.

There are multiple strategies for dealing with this, and a different set of business rules will apply to different items; for example, big items like home appliances will always be delivered separately. Two common strategies are listed below, followed by a small sketch of the splitting logic.

  • Least number of shipments (usually a longer delivery time) - wait until all items are available, optimising the cost of delivery

  • Shipping based on availability - multiple shipments at different times, optimising the delivery time
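
Here is a minimal sketch of the second strategy, assuming a hypothetical item record that already carries the warehouse able to supply it. The real splitting logic would also take package sizes, weights and the product classification into account.

    # Sketch only: the item structure and grouping rule are simplified assumptions.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class BasketItem:
        sku: str
        warehouse: str        # centre that currently has the item available
        bulky: bool = False   # e.g. home appliances always ship separately

    def split_into_shipments(items: list[BasketItem]) -> list[list[BasketItem]]:
        """Ship-on-availability: one shipment per warehouse, bulky items alone."""
        shipments: list[list[BasketItem]] = []
        by_warehouse: dict[str, list[BasketItem]] = defaultdict(list)
        for item in items:
            if item.bulky:
                shipments.append([item])   # business rule: always separate
            else:
                by_warehouse[item.warehouse].append(item)
        shipments.extend(by_warehouse.values())
        return shipments

    if __name__ == "__main__":
        basket = [
            BasketItem("kettle", "manchester"),
            BasketItem("toaster", "manchester"),
            BasketItem("fridge", "leeds", bulky=True),
        ]
        for shipment in split_into_shipments(basket):
            print([i.sku for i in shipment])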

Multiple Warehouses

New features must be added to the existing set of services to provide the functionality described above. Following the microservices approach, this means new services need to be added to make all the parts of the system work together with external services and other clients. The new microservice API must provide the options for splitting and delivery, so it is mainly going to be an orchestration service which implements specific business logic and utilises information from other services - mainly from the systems within the same geographical region and from the product management systems. There are multiple ways of adding support for more warehouses. The most naive approach would be to factor the notion of a warehouse into almost all existing microservices and enrich their models with this information. Multiple instances of those services can then be deployed and integrated together.

The new functionality which needs to be provided by the new layer of services (a sketch of such an API surface follows the list):

  • Order-Related Requests - placement, cancellation, updates

  • Grouping / Delivery Requests - availability of delivery methods from a specific warehouse to a destination address, current throughput of the system

  • Availability of Stock - current stock in the warehouse; can include predictions based on stock coming into the warehouse and provide an estimated delivery date
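
One way to picture that layer is as a thin orchestration facade. The class and method names below are purely illustrative and only show the shape of the three groups of requests listed above; they are not a real API.

    # Sketch only: names and payload shapes are illustrative, not a real API.
    from dataclasses import dataclass

    @dataclass
    class DeliveryOption:
        method: str
        warehouse: str
        estimated_days: int

    class FulfilmentApi:
        """Hypothetical facade over the warehouse-facing microservices."""

        def place_order(self, order_id: str, items: list[dict]) -> None:
            """Order-related requests: placement, cancellation, updates."""
            raise NotImplementedError

        def cancel_order(self, order_id: str) -> None:
            raise NotImplementedError

        def delivery_options(self, items: list[dict], address: str) -> list[DeliveryOption]:
            """Grouping / delivery requests for a destination address."""
            raise NotImplementedError

        def stock_availability(self, sku: str, warehouse: str) -> int:
            """Current stock, possibly enriched with inbound-stock predictions."""
            raise NotImplementedError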

There are two possible ways the system can evolve:

  • Multitenant System - a cloud-based system where most of the services run inside the cloud and support the operations of multiple warehouses within the same region

  • Facility-Specific System - cloud-based / hybrid systems dedicated to a single facility

A fully multitenant system means that the same services can serve requests for any warehouse inside the platform, and each cluster of the same microservice will use the same datastore. By summing all the data we could get an eventually consistent global view of it. Depending on the chosen datastore, the data would end up partitioned across multiple regions or facilities.

The problem with this approach is that it dramatically increases the throughput and the amount of data each single service has to handle as additional warehouses are added.

Global aggregation of the data makes sense for reporting purposes, but that is a job for a data warehouse solution where latency is not a big problem. It is only logical to split the system into regions, which allows the platform to guarantee similar delivery times and to satisfy the regulations of the market in the given region. Market specifics are an additional set of information which needs to be maintained; they are directly connected to the product information, so they fall within the Product Inventory Management.

In order to provide resiliency through redundancy, the system will end up with multiple replicas of the data inside different data centres. And in order to achieve good latency, the services in the cloud must be co-located with the physical locations of the fulfilment centres. The same goes for the new services, which must provide even better latency and therefore need to be deployed into multiple regions.

For either approach there must be a set of new services which provide access to or aggregate the stock data. These new services will be responsible for providing access to the data required for order placement and order lifecycle management.

Processing the Requests

The requests destined for the various services within the warehouse will come from the clients of the API - the consumer platform and the operations inside the warehouse. Additionally, the services will be calling each other internally. An important part of a microservices architecture is therefore the API gateway. Such a service is required regardless of whether the system is multitenant.

The most important responsibility of such a service is routing requests to their destination on a specific microservice - for example, to get information about the throughput and the available delivery methods, which differ based on the warehouse and the destination address. A deep query of the stock can then be done once the item is in the basket and the customer proceeds to checkout, querying the available data directly from a specific warehouse. Another example is an operator in a warehouse doing their job step by step and sending commands to the system which update its internal state.

For that purpose it needs to provide discovery functionality - it must be able to find out where the different services are running. That can be done with a static routing table or dynamically, by the microservices registering themselves after they have been successfully redeployed. This sort of functionality can also be used for different deployment strategies such as blue/green, or for draining requests from an older version of the system to a newer one. A/B testing can be implemented as part of this service as well, as can a fallback to other regions when services are not available in a specific region.
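
A static routing table is the simplest form of this. The sketch below uses invented service names and addresses, with a registration hook that a dynamic discovery mechanism could call after a redeployment.

    # Sketch only: service names, versions and addresses are invented for illustration.
    class RoutingTable:
        """Very small static/dynamic routing table for an API gateway."""

        def __init__(self) -> None:
            self._routes: dict[str, list[str]] = {}

        def register(self, service: str, address: str) -> None:
            """Called statically at start-up or dynamically after a redeploy."""
            self._routes.setdefault(service, []).append(address)

        def resolve(self, service: str) -> str:
            """Pick an instance; a real gateway would load-balance and health-check."""
            instances = self._routes.get(service)
            if not instances:
                raise LookupError(f"no instance registered for {service}")
            return instances[0]

    if __name__ == "__main__":
        table = RoutingTable()
        table.register("stock-tracking", "http://10.0.1.10:8080")
        table.register("stock-tracking", "http://10.0.1.11:8080")  # blue/green candidate
        print(table.resolve("stock-tracking"))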

It should also provide a security layer for the APIs and ensure that unauthorized requests do not even reach the services. It must be highly distributed or stateless so that it is not a single point of failure for the whole system. Additionally, it can provide throttling functionality which limits the requests coming into the system to prevent the failure of a whole subsystem within a region.
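
Throttling can be as simple as a token bucket kept per client or per region. The sketch below is a single-process illustration and ignores the shared state a real, distributed gateway would need.

    # Sketch only: a per-client token bucket; a real gateway would share this
    # state across instances (e.g. in a distributed cache).
    import time

    class TokenBucket:
        def __init__(self, rate_per_second: float, capacity: int) -> None:
            self.rate = rate_per_second
            self.capacity = capacity
            self.tokens = float(capacity)
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False   # reject or queue the request instead of overloading

    if __name__ == "__main__":
        bucket = TokenBucket(rate_per_second=5, capacity=10)
        print([bucket.allow() for _ in range(12)])   # the last requests get throttled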

Depending on the chosen architecture, there is also a need for a service which knows how to route orders to the different logistic centres that will process them. Think of implementing an e-shop which simply delivers from any warehouse all over the world - the request must then end up at least in the region of the facility which will process that order. That can be something as simple as an adapter which receives the order, checks its association with a specific fulfilment centre and then publishes the order to a specific topic or makes a call to the corresponding internal API. This can be implemented as a feature of the API gateway or as a dedicated microservice inside the order management.
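
Such an adapter can stay very small. The sketch below assumes a hypothetical lookup of the fulfilment centre assignment and a publish() stand-in for putting the order on a region-specific topic.

    # Sketch only: the assignment lookup and the topic publisher are assumptions.
    def assigned_fulfilment_centre(order: dict) -> str:
        """Stand-in for the logic that associates an order with a centre/region."""
        return order.get("fulfilment_centre", "eu-west-manchester")

    def publish(topic: str, message: dict) -> None:
        """Stand-in for a message broker client (topic, queue, ...)."""
        print(f"published to {topic}: {message['order_id']}")

    def route_order(order: dict) -> None:
        """Adapter: receive the order, resolve the centre, forward it."""
        centre = assigned_fulfilment_centre(order)
        publish(f"orders.{centre}", order)

    if __name__ == "__main__":
        route_order({"order_id": "ORD-42", "fulfilment_centre": "eu-west-manchester"})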

Inventory Management

For providing information about the available stock, the existing platform already tracks the inventory levels in one system, and this information is available to the other microservices within the platform. The straightforward solution is to group the movements by location level and provide the aggregated information.

So having one distributed database which maintains all the stock globally is conceivable, using warehouse identifiers as partition keys for the tables used for tracking the inventory and for the order-related data, so that we know which orders are meant to be fulfilled in which warehouses.
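
To illustrate, here is a sketch of aggregating fine-grained movements into per-warehouse stock levels, keyed by a (warehouse, sku) pair that would act as the partition key in such a datastore. The movement records are invented for the example.

    # Sketch only: movement records and the key shape are illustrative.
    from collections import defaultdict

    movements = [
        {"warehouse": "MAN1", "sku": "SKU-1", "quantity": +100},  # goods received
        {"warehouse": "MAN1", "sku": "SKU-1", "quantity": -3},    # picked for an order
        {"warehouse": "LDS1", "sku": "SKU-1", "quantity": +40},
    ]

    def aggregate_stock(movements: list[dict]) -> dict[tuple[str, str], int]:
        """Group movements by (warehouse, sku) - the candidate partition key."""
        totals: dict[tuple[str, str], int] = defaultdict(int)
        for move in movements:
            totals[(move["warehouse"], move["sku"])] += move["quantity"]
        return dict(totals)

    if __name__ == "__main__":
        print(aggregate_stock(movements))   # {('MAN1', 'SKU-1'): 97, ('LDS1', 'SKU-1'): 40}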

The tracking system uses location addresses and already supports different location types, so locations can be composed together to create specific areas - virtual or physical. A fulfilment centre can be a new location type, represented by an additional prefix in the location addresses. The changes to the tracking data model needed to span multiple systems are minimal, as are the changes to the workflows which interact with it. From the tracking point of view the logic is not affected at all - the major change is the number of movements the system needs to process and the latency resulting from geographically distant storage locations sharing the same datacenter. The latency of the movement operations matters for avoiding race conditions in scenarios where stock is moved between locations while orders are fulfilled in parallel. The increased latency will pose a challenge for the processing of the orders as stock is allocated to specific locations and picked by the workers. Depending on the implementation, the tracking of the stock can become more error prone unless transactions can be used. This is not a big problem as long as it is possible to ensure the locality of the requests, in the sense that a specific warehouse will be writing and reading data within a set of nodes living in the closest datacenter. Those requests will come mainly from the operators within the warehouse; another set of requests will come from the other services which operate on a similar set of data related to a single location. So it makes sense to put these services inside the same datacenter rather than taking a hybrid approach of having some systems at the fulfilment centre and some globally.
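
As a small illustration of the prefixed addresses, the sketch below assumes a "FC-AISLE-SHELF-BIN" convention; the format itself is an assumption, the point is only that the fulfilment centre becomes one more segment which existing logic can ignore or use.

    # Sketch only: the address format "FC-AISLE-SHELF-BIN" is an assumed convention.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LocationAddress:
        fulfilment_centre: str   # new top-level segment, e.g. "MAN1"
        aisle: str
        shelf: str
        bin: str

        @classmethod
        def parse(cls, raw: str) -> "LocationAddress":
            centre, aisle, shelf, bin_ = raw.split("-")
            return cls(centre, aisle, shelf, bin_)

        def __str__(self) -> str:
            return f"{self.fulfilment_centre}-{self.aisle}-{self.shelf}-{self.bin}"

    if __name__ == "__main__":
        address = LocationAddress.parse("MAN1-A03-S12-B07")
        print(address.fulfilment_centre, str(address))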

The distributed data (the stock tracking data in this case) will then be replicated into the other datacentres and will be available for queries - effectively creating a copy of all the data within each datacenter. That increases the requirements on every single datacenter, and it is actually not fully necessary, because it provides information at the most fine-grained level, which still needs to be aggregated - the consumer platforms do not need to know the exact stock count for each location within each warehouse; they only need access to the aggregated values. In the case of a classic relational database with a master/replica architecture, these values should be completely accurate, with a slight lag if read replicas are used. But even with an RDBMS, the aggregations would be implemented as views or by using triggers to compute the aggregated values.

So even though it is theoretically possible to have one distributed database, the amount of data would be so huge that the performance of the aggregation queries would most likely be too low for anything other than producing reports. Additionally, it could affect the operations of the warehouses which use the same datastore and need low query latency. It would also heavily increase the expectations of what the central system can handle as more warehouses are added while the business grows. This practically rules out the use of off-the-shelf solutions and even of a traditional SQL database, as they would not provide the required elasticity.

Order Management

The next important part would be the Sales / Procurement Order Management services - again, the orders must be partitioned by the locations where they are going to be fulfilled. It would be possible to provide a fulfilment service API without explicitly specifying which warehouse is responsible for processing the order, by simply ordering a set of items. But at some point those orders must be assigned to a specific fulfilment centre, and it is going to be very difficult to provide any guarantees on the delivery times of such orders.

Consider a customer order consisting of 3 items, each of which is going to be processed in a different data centre. This means that the state of the order is not going to be mutated from within a single datacenter. That inevitably adds latency to the updates, as the state of the order can differ depending on the region from which the query is made. But does it actually matter that much, when the status updates during order processing require physical actions to happen inside the warehouse?

Consider a scenario where the customer decides to cancel the whole order and this information needs to be propagated to all the fulfilment centres. At this point the cancel operation can fail because the order has already been dispatched but the global state has not been updated yet. In this case the failure means the sub-order will be dispatched and the customer may or may not return the stock, which will cost money.
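
A sketch of that fan-out is shown below, with a hypothetical per-centre cancel call that is rejected when the sub-order is already dispatched; the resulting partial outcome is exactly what the surrounding process (refunds, returns) has to handle.

    # Sketch only: sub-order state and the per-centre cancel call are assumptions.
    SUB_ORDERS = {                      # sub-order id -> (centre, state)
        "ORD-42/1": ("MAN1", "PICKING"),
        "ORD-42/2": ("LDS1", "DISPATCHED"),
        "ORD-42/3": ("BRU1", "PLACED"),
    }

    def cancel_at_centre(sub_order_id: str) -> bool:
        """Stand-in for the call to one fulfilment centre; fails once dispatched."""
        _, state = SUB_ORDERS[sub_order_id]
        return state != "DISPATCHED"

    def cancel_order(order_id: str) -> dict[str, bool]:
        """Fan the cancellation out and report the partial result per sub-order."""
        results = {}
        for sub_order_id in SUB_ORDERS:
            if sub_order_id.startswith(order_id + "/"):
                results[sub_order_id] = cancel_at_centre(sub_order_id)
        return results

    if __name__ == "__main__":
        # ORD-42/2 was already dispatched, so its cancellation is reported as failed
        print(cancel_order("ORD-42"))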

The order system can be multitenant as well and work in the same way as described for the tracking system - one system which tracks the different states of the orders. There are no obvious requirements for aggregating those orders, as their state would mostly be accessed directly by the order id and can simply be looked up. It would then depend on the data model whether an order is warehouse specific - as it was initially - or whether it can be distributed across multiple fulfilment centres. Either way the solution will end up with the same performance and replication problems as the tracking system.

All the services within the warehouse will be operating on data specific to that one warehouse, so parametrisation of the fulfilment service for picking would be required; more importantly, the logic for processing and creating the picking waves must be warehouse specific as well. This is still fine, as the data are local to the physical centre and the services and datastore are deployed in the same datacenter.

Another important service is the one which provides the static model of the warehouse and defines the different types of locations with their instances, physical positions and dimensions. The system maintaining this information can easily support multiple warehouses, as it is mostly static data which does not change daily. This data is crucial, as the model of a warehouse with its dimensions is used by the optimisation algorithms and serves as a basis for the picking algorithms. The layout of the warehouse does not change very often, so eventual consistency and replication lag across regions do not cause any problems. The data needs to be highly available to the fulfilment service so that orders can be processed successfully. Such a service is actually a candidate for deployment within a warehouse datacenter in the case of a hybrid deployment.

The integration of external services can cause many challenges for the global approach, as not all the warehouses can necessarily use the same logistic provider. That would mean building a logistics platform which provides an over-arching point of integration for the fulfilment centre implementation and can support multiple providers that are available only in specific regions. This will also impact the delivery options available to the customers.

The workflow services would not need any changes; they should have instances inside the warehouse so the workflows can be performed without a connection to the internet. They do not rely on the layout of the warehouse or its specific state. Multitenancy does, however, affect the requirements of the user service. It must be able to handle many more users, as all the employees will be logging in and accessing the platform through the user service microservice.

The users can be warehouse specific or global, depending on the decision taken. An instance of the user service per warehouse can be enough. That would mean that managers accessing multiple warehouses need accounts created inside multiple systems, but it depends on whether that will actually happen. Having a central service which supports multiple warehouses is achievable as well, as long as the users are co-located in the correct datacentre. Users would get another dimension of access, and it could be as granular as setting a different set of rules per warehouse.

Integrating Multiple Systems

The second approach is to have the warehouses running independently in isolated deployments, so most of the services do not need any changes as they are already designed to support one warehouse. No big changes are required beyond the deployment process and making the services parametrised.

An additional benefit of this approach is that the internal system architecture of the warehouses does not matter as long as they implement the same API, which can then be used by the customer layer built on top of them. This adds flexibility to the overall system and makes it possible to integrate different existing solutions together.

On the other hand, the fact that there are multiple systems within one platform, built on different technologies by different teams, makes maintenance very difficult. It obviously presents the risk that the different versions of the services start to diverge at some point, which can end up in an unmaintainable codebase, but that depends solely on what is more beneficial for the growing business. Using completely different solutions integrated together should be just a temporary measure until the staff is trained on the standard system, and I would recommend standardisation at some level.

If the system is built from the same services which are only parametrised to process requests for a specific warehouse, we get more flexibility in terms of elasticity and resilience. The rollouts of new service versions will be completely isolated per warehouse and can be scaled based on the specific warehouse requirements. The same goes for failures - if there is a fatal error in the system, it is isolated to only one subsystem. This also provides a nicer way of testing. If the services are shared, such testing is still possible by using feature flags implemented on the request level - so specific users can test new versions of the workflows in the warehouse to actually measure whether they are more effective.
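
A request-level feature flag can be as simple as a lookup keyed by user and warehouse. The flag names and the rollout rule below are invented for illustration.

    # Sketch only: flag definitions and the targeting rule are illustrative.
    FEATURE_FLAGS = {
        "new-picking-wave-logic": {
            "warehouses": {"MAN1"},        # roll out to one warehouse first
            "users": {"operator-17"},      # plus a few named test users
        },
    }

    def is_enabled(flag: str, user_id: str, warehouse: str) -> bool:
        """Decide per request whether the new workflow version should be used."""
        config = FEATURE_FLAGS.get(flag)
        if config is None:
            return False
        return warehouse in config["warehouses"] or user_id in config["users"]

    def create_picking_wave(user_id: str, warehouse: str) -> str:
        if is_enabled("new-picking-wave-logic", user_id, warehouse):
            return "wave created with the new algorithm"
        return "wave created with the current algorithm"

    if __name__ == "__main__":
        print(create_picking_wave("operator-17", "LDS1"))  # enabled via user targeting
        print(create_picking_wave("operator-02", "LDS1"))  # falls back to current logic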

In the following diagram we have added the 2 additional microservices which will aggregate the information. They are clustered, as it is expected that they will need to handle a lot of requests from the various customer platforms. Those platforms will be the consumers of the overall public fulfilment API.

Figure 3-14. Multiple Systems Support

Globally distributed systems are difficult to maintain, as network partitions happen more often compared to deployments within a single datacenter. The latency of data replication within such a system is much higher, and it must be carefully evaluated whether it is beneficial to distribute the data or to partition the system into smaller isolated subsystems which live within a single datacenter and use other data centres as failover. The main point of splitting the system between multiple regions is to decrease the latency experienced by users and to increase the availability of the system as a whole by providing redundancy of the subsystems in distinct data centres across multiple regions.

3.7 Summary

This chapter should provide enough information about the caveats and the domain knowledge needed for building a WMS, with a focus on the hardware devices you would need to integrate when implementing such a system. We discussed the motivation for creating a new warehousing and e-commerce system with a focus on how the business side works. You have learned about important patterns and concepts for creating a reactive system and for using the microservice architecture to divide the responsibilities of the system into more or less independent components. We have highlighted the importance of the user interfaces and how they tie into a reactive system so they can provide the best user experience. We have described the architecture of a system which is able to track the inventory for a whole warehouse and provides flexibility for future features thanks to splitting the domain into isolated bounded contexts. We have proposed solutions to the problems you can face when integrating hardware devices (printers, handhelds, conveying systems) in a hybrid cloud environment where you can expect a lot of communication over the network. In the end we discussed how to evolve the system to support multiple warehouses.

An important takeaway comes from what I have learned working on multiple projects and from what has been mentioned, in one form or another, on various blogs about building microservices. I was not able to find the exact quote, but it originates from Martin Fowler's blog at https://martinfowler.com, so let me paraphrase it: "If you are not clever enough to build a monolithic system with independent components, don't think that a microservice architecture will solve your problems." The microservices architecture is a tradeoff and adds substantial complexity which needs to be dealt with correctly. Hopefully this chapter helped you to understand that complexity.

Where to next?

Plenty of important parts of the system have been left out, especially the approach to authentication and security across the microservices. The use of HTTPS and tokens such as JWT, and the use of OAUTH2, are the industry standard now, as is securing the external APIs with API keys instead of direct username/password authentication. Within the scope of one chapter it was not possible to go into the details. The choices for securing, or not securing, the internal APIs will be discussed in another chapter related to the infrastructure.

Another important part of such systems are caches and caching, which is not that important for the warehouse system, as the number of users it needs to support is fairly limited. Where it does make sense is caching the product information that is useful at the various workflow steps performed by human workers - the parts of the system which provide storage layout information, static product information (pictures) or descriptions of the quality-check processes can be cached using Redis, Memcache or other solutions. For the stock levels it is more important to operate on consistent and accurate data, and caches would only increase the possibility of inaccurate data and discrepancies across the warehouse.

Caches will be discussed in more detail in the e-commerce chapter (Chapter XXX) and security in the infrastructure chapter (Chapter XXX).
