Once you understand how convolutional layers work, pooling layers are quite easy to grasp. A pooling layer typically operates on every input channel independently, so the output depth is the same as the input depth. Alternatively, you may pool over the depth dimension, as we will see next; in that case the image's spatial dimensions (height and width) remain unchanged, but the number of channels is reduced. Let's look at a more formal description of how a pooling layer works:
Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. You must define its size, the stride, and the padding type, just as before. However, a pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or the mean.
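The aggregation step can be sketched in a few lines of plain Python. This is a minimal illustration, not a real layer: the receptive-field values below are made up, and a real network would apply this to every window position in every channel.

```python
# Hypothetical 2 x 2 receptive field of input activations (flattened):
receptive_field = [1.0, 5.0, 3.0, 2.0]

# Max pooling keeps only the largest input in the field.
max_output = max(receptive_field)

# Average pooling keeps the mean of the inputs instead.
mean_output = sum(receptive_field) / len(receptive_field)

print(max_output, mean_output)  # 5.0 2.75
```

Note that neither aggregation involves any learned weights, which is exactly why pooling layers add no parameters to the model.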
The goal of pooling is to subsample (shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters; fewer parameters also helps limit the risk of overfitting during training. Reducing the input image size additionally makes the neural network tolerate a small amount of image shift. In the following example, we use a 2 x 2 pooling kernel and a stride of 2 with no padding. Only the max input value in each kernel window makes it to the next layer; the other inputs are dropped:
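The example above can be sketched as follows, using plain Python on a single channel. The 4 x 4 input values are made up for illustration; a multi-channel input would simply be pooled one channel at a time in the same way.

```python
def max_pool_2x2(image):
    """Max pooling with a 2 x 2 kernel, stride 2, no padding.

    `image` is a list of rows holding one channel's activations.
    """
    pooled = []
    for i in range(0, len(image) - 1, 2):        # step down by the stride
        row = []
        for j in range(0, len(image[0]) - 1, 2):  # step right by the stride
            window = [image[i][j],     image[i][j + 1],
                      image[i + 1][j], image[i + 1][j + 1]]
            row.append(max(window))  # only the max survives; the rest are dropped
        pooled.append(row)
    return pooled

image = [[1, 3, 2, 1],
         [4, 8, 5, 3],
         [7, 2, 1, 0],
         [6, 5, 9, 4]]

print(max_pool_2x2(image))  # [[8, 5], [7, 9]]
```

Notice that the 4 x 4 input shrinks to 2 x 2: each output retains only one of the four values in its window, which is where the memory and computation savings come from.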
As a rule of thumb, choose the kernel size and stride so that stride_length * x + filter_size <= input_layer_size, where x is the number of strides the kernel takes; this keeps the kernel from sliding past the input boundary and is a safe default for most CNN-based network designs.
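Under that constraint, with no padding the kernel can take x strides, giving x + 1 output positions per dimension. A small sketch (the helper name and sizes below are my own, for illustration):

```python
def valid_output_size(input_size, filter_size, stride):
    # With no padding, the kernel takes x strides such that
    # stride * x + filter_size <= input_size, so there are x + 1 outputs.
    return (input_size - filter_size) // stride + 1

# A 28-pixel-wide input pooled with a 2 x 2 kernel and stride 2:
out = valid_output_size(28, 2, 2)
print(out)  # 14

# Sanity check of the rule of thumb: the last stride stays in bounds.
assert 2 * (out - 1) + 2 <= 28
```

With these settings each pooling layer halves the height and width, so the feature map area (and thus memory use) drops by a factor of four.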