AlexNet is a modern architecture, from 2012, that works on RGB images and has far more convolutional and fully connected neurons than its predecessors. It is similar in spirit to LeNet-5, but AlexNet is bigger and deeper: by 2012 processing power had grown considerably, and GPUs were used to train the network, which produced remarkably good results.
The following diagram depicts an AlexNet:
We start with a 224 x 224 x 3 RGB image and apply a convolution with a stride of 4 and a filter of 11 x 11 x 3 (where we don't show "x 96", for the reason mentioned previously). After this convolution, we have 55 x 55 x 96; the number of third-dimensional channels dramatically increases to 96, which is equal to the number of filters.
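That first step can be checked with the standard output-size formula, floor((n + 2p - f) / s) + 1. The padding value here is an assumption (p = 2), chosen because it reproduces the quoted 55 x 55 output; with no padding, 224 minus 11 does not divide evenly by the stride of 4:

```python
# Output size of one spatial dimension after a convolution:
# floor((n + 2p - f) / s) + 1.
# p = 2 is an assumed padding that reproduces the 55 x 55 output.
def out_size(n, f, s, p=0):
    return (n + 2 * p - f) // s + 1

print(out_size(n=224, f=11, s=4, p=2))  # 55
```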
After this, we apply max pooling with a stride of 2 and a 3 x 3 filter, which simply shrinks the first two dimensions but leaves the third one untouched, giving us an output of 27 x 27 x 96. Applying a same convolution with 5 x 5 filters, we obtain an output of 27 x 27 x 256.
The total number of channels has now increased to 256. To shrink the first two dimensions but leave the third dimension at 256, we use a max pooling layer with a stride of 2, giving 13 x 13 x 256. We then use a same convolution with dimensions of 3 x 3 x 384, increasing the third dimension further and producing 13 x 13 x 384.
Applying another same convolution with dimensions of 3 x 3 x 384, we again obtain 13 x 13 x 384, leaving the number of channels as it was. We then apply a same convolution with dimensions of 3 x 3 x 256, leading to a decrease in the number of channels, since we obtain 13 x 13 x 256. After that, we use max pooling with a stride of 2 and dimensions of 3 x 3 to decrease the first two dimensions even more, to 6 x 6, while leaving the third dimension, 256, untouched.
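The whole convolution/pooling stack above can be traced with the output-size formula floor((n + 2p - f) / s) + 1. This is a minimal plain-Python sketch, not a trainable model; the padding values are assumptions chosen to reproduce the same convolutions and the sizes quoted in the text:

```python
def out_size(n, f, s, p=0):
    """One spatial dimension after a conv/pool: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# (name, filter, stride, assumed padding, output channels);
# None means a pooling layer, which keeps the channel count.
layers = [
    ("conv1", 11, 4, 2, 96),    # 224 -> 55
    ("pool1",  3, 2, 0, None),  # 55  -> 27
    ("conv2",  5, 1, 2, 256),   # same convolution: 27 -> 27
    ("pool2",  3, 2, 0, None),  # 27  -> 13
    ("conv3",  3, 1, 1, 384),   # same: 13 -> 13
    ("conv4",  3, 1, 1, 384),   # same: 13 -> 13
    ("conv5",  3, 1, 1, 256),   # same: 13 -> 13
    ("pool3",  3, 2, 0, None),  # 13  -> 6
]

n, c = 224, 3
for name, f, s, p, out_c in layers:
    n = out_size(n, f, s, p)
    c = out_c if out_c is not None else c
    print(f"{name}: {n} x {n} x {c}")

print(n * n * c)  # 9216 units after flattening the 6 x 6 x 256 volume
```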
Let's understand why we choose each of these values and dimensions:
We begin with the first two dimensions, 224 x 224, and the final output has dimensions of 6 x 6, whereas the number of channels began at 3 and concluded at 256.
This is a typical technique in convolution architectures. We tend to decrease the first two dimensions, but increase the number of channels significantly, because we want to capture more features.
Then, we'll have 3 hidden layers of neurons; the first layer has 9,216 neurons (the flattened 6 x 6 x 256 volume), while the following two layers have 4,096 each. In the end, we'll try to predict 1,000 classes rather than just 10. This architecture gave really great results for the first time and, therefore, convinced a lot of people that deep learning works well for image problems. It has approximately 60,000,000 parameters, roughly 1,000 times more than LeNet-5.
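The figure of roughly 60 million parameters can be sanity-checked by summing weights and biases layer by layer. This sketch counts only the standard convolutional and fully connected layers, using the filter sizes discussed above:

```python
def conv_params(f, c_in, c_out):
    # Each filter has f * f * c_in weights, plus one bias per filter.
    return f * f * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # Fully connected: one weight per input-output pair, plus biases.
    return n_in * n_out + n_out

total = (
    conv_params(11, 3, 96)      # conv1
    + conv_params(5, 96, 256)   # conv2
    + conv_params(3, 256, 384)  # conv3
    + conv_params(3, 384, 384)  # conv4
    + conv_params(3, 384, 256)  # conv5
    + fc_params(9216, 4096)     # first fully connected layer
    + fc_params(4096, 4096)     # second fully connected layer
    + fc_params(4096, 1000)     # 1,000-class output layer
)
print(total)  # roughly 62 million, dominated by the fully connected layers
```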