TensorFlow Serving is a project by Google for deploying trained models to production environments. It offers the following advantages:
- Low-latency inference
- Concurrent request handling, providing good throughput
- Model version management, so models can be swapped without downtime in production
These advantages make TensorFlow Serving a great tool for deployment to the cloud. Models are served over gRPC, a remote procedure call system from Google. Since most production environments run on Ubuntu, the easiest way to install TensorFlow Serving is by using apt-get, as follows:
sudo apt-get install tensorflow-model-server
It can also be compiled from source in other environments. Due to the prevalence of Docker for deployment, it is often easier to build TensorFlow Serving as an Ubuntu-based Docker image. For installation-related guidance, please visit https://www.tensorflow.org/serving/setup.
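As a sketch of the Docker route, the following shows the versioned directory layout that TensorFlow Serving expects, along with a run of the official tensorflow/serving image. The model name my_model and the path under /tmp are placeholders for illustration:

```shell
# TensorFlow Serving expects each model under a numeric version
# subdirectory; it serves the highest version it finds.
# Create a placeholder layout for a hypothetical model called my_model:
mkdir -p /tmp/serving_models/my_model/1
# The saved_model.pb file and variables/ directory produced by the
# export step would go inside the version directory above.

# The official tensorflow/serving image can then serve that directory
# (8500 is the default gRPC port, 8501 the default REST port):
# docker run -p 8500:8500 -p 8501:8501 \
#   --mount type=bind,source=/tmp/serving_models/my_model,target=/models/my_model \
#   -e MODEL_NAME=my_model -t tensorflow/serving
```

Mounting the model directory into the container keeps the image generic: the same tensorflow/serving image can serve any SavedModel without rebuilding.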
The following figure shows the architectural diagram of TensorFlow Serving:
Once the model has been trained and validated, it can be pushed to the model repository. TensorFlow Serving will start serving the model based on its version number. A client can then query the server using the TensorFlow Serving client. To use the client from Python, install the TensorFlow Serving API, as follows:
sudo pip3 install tensorflow-serving-api
The preceding command installs the client component of TensorFlow Serving, which can then be used to make inference calls to the server.