Howard et al. proposed a technique for faster inference based on this learning paradigm. The following illustration shows inference running on mobile devices for various models; the models produced with this technique can also be served from the cloud:
There are three ways to apply convolution, as shown in the following diagram:
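The three variants can be sketched in plain NumPy. The following is an illustrative sketch (the channel counts and kernel size are assumptions, not taken from the diagram) contrasting a standard convolution, a depthwise convolution, and a pointwise (1 x 1) convolution at a single spatial position:

```python
import numpy as np

k, c_in, c_out = 3, 8, 16
patch = np.random.randn(k, k, c_in)  # one k x k receptive field of the input

# Standard convolution: every output channel mixes all input channels
# across the full k x k window in a single step.
w_std = np.random.randn(k, k, c_in, c_out)
out_std = np.einsum("ijc,ijco->o", patch, w_std)  # shape (c_out,)

# Depthwise convolution: one k x k filter per input channel,
# filtering spatially but never mixing channels.
w_dw = np.random.randn(k, k, c_in)
out_dw = np.einsum("ijc,ijc->c", patch, w_dw)     # shape (c_in,)

# Pointwise convolution: 1 x 1 filters that only mix channels,
# with no spatial extent.
w_pw = np.random.randn(c_in, c_out)
out_pw = out_dw @ w_pw                            # shape (c_out,)

print(out_std.shape, out_dw.shape, out_pw.shape)
```

Chaining the depthwise and pointwise steps, as in the last two blocks, is what gives the replacement for the standard convolution discussed next.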
A standard convolution can be replaced with a depthwise convolution followed by a 1 x 1 pointwise convolution (together, a depthwise separable convolution), as follows:
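The saving from this replacement comes from the parameter (and multiply-accumulate) count. The following is a minimal sketch in plain Python; the kernel size and channel counts are illustrative assumptions:

```python
def standard_conv_params(k, c_in, c_out):
    # A standard convolution mixes space and channels in one step:
    # c_out filters, each of shape k x k x c_in.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel.
    depthwise = k * k * c_in
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

k, c_in, c_out = 3, 128, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

For a 3 x 3 kernel, the ratio works out to 1/c_out + 1/9, roughly an 8-9x reduction here, which is what makes these models practical for mobile inference.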
The following graph shows the linear dependence of accuracy on the number of operations performed:
The following graph shows how accuracy depends on the number of parameters; the parameter count is plotted on a logarithmic scale:
From the preceding discussion, it is clear that quantization speeds up model inference. In the next section, we will see how TensorFlow Serving can be used to serve models in production.
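Before moving on, the quantization idea mentioned above can be sketched in a few lines. This is a simplified, framework-free illustration of symmetric 8-bit weight quantization; real toolchains (for example, the TensorFlow Lite converter) additionally handle zero points, per-channel scales, and calibration:

```python
import numpy as np

def quantize_int8(w):
    # One scale for the whole tensor, chosen so the largest
    # absolute weight maps to 127.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max rounding error:", np.abs(w - w_hat).max())  # bounded by scale / 2
```

Storing `int8` values cuts the model to a quarter of its `float32` size, and integer arithmetic is typically faster on mobile hardware, at the cost of the small rounding error shown.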