Chapter 2: Face and Audio Recognition Using Siamese Networks

A siamese network is a special type of neural network and one of the simplest and most widely used one-shot learning algorithms. It consists of two symmetrical neural networks that share the same weights and architecture and are joined at the end by an energy function, E.
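To make the idea of weight sharing concrete, here is a minimal sketch in plain Python. It assumes each branch is a single linear layer followed by tanh and that the energy function is the Euclidean distance; the names embed, energy, siamese_forward, and shared_weights are illustrative and not taken from any library.

```python
import math

def embed(x, weights):
    """One branch of the network: a single linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def energy(a, b):
    """Energy function E: here, the Euclidean distance between two embeddings."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def siamese_forward(x1, x2, weights):
    """Pass both inputs through the same weights, then compare the embeddings."""
    return energy(embed(x1, weights), embed(x2, weights))

# Hypothetical 2x3 weight matrix; both branches use this one object.
shared_weights = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]

x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 3.0]   # identical to x1
x3 = [-3.0, 0.5, 2.0]  # different from x1

print(siamese_forward(x1, x2, shared_weights))  # identical inputs give 0.0
print(siamese_forward(x1, x3, shared_weights))  # different inputs give a positive distance
```

Because both inputs go through the same weights, swapping them does not change the energy, and identical inputs always map to the same embedding.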
The contrastive loss function can be expressed as follows:

$$\text{Contrastive loss} = Y \, E^2 + (1 - Y) \, \max(\text{margin} - E,\ 0)^2$$

In the preceding equation, Y is the true label: 1 when the two inputs are similar and 0 when they are dissimilar. E is our energy function, which can be any distance measure. The margin term enforces a constraint: when two inputs are dissimilar and their distance already exceeds the margin, the pair incurs no loss.
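As a rough illustration, the loss for a single pair can be sketched in a few lines of Python. This assumes the commonly used form Y·E² + (1 − Y)·max(margin − E, 0)², with E being a distance; the function name contrastive_loss and the default margin of 1.0 are our own choices for the example.

```python
def contrastive_loss(y, e, margin=1.0):
    """Contrastive loss for one pair.

    y      -- 1 if the pair is similar, 0 if it is dissimilar
    e      -- energy (distance) between the two embeddings
    margin -- distance beyond which dissimilar pairs incur no loss (assumed default)
    """
    similar_term = y * e ** 2                              # pulls similar pairs together
    dissimilar_term = (1 - y) * max(margin - e, 0.0) ** 2  # pushes dissimilar pairs apart
    return similar_term + dissimilar_term

print(contrastive_loss(1, 0.5))  # similar pair: 0.25
print(contrastive_loss(0, 0.5))  # dissimilar pair inside the margin: 0.25
print(contrastive_loss(0, 1.5))  # dissimilar pair beyond the margin: 0.0
```

The last call shows the role of the margin: once a dissimilar pair is farther apart than the margin, it contributes nothing to the loss.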
The energy function tells us how similar the two inputs are. It can be any similarity or distance measure, such as Euclidean distance or cosine similarity.
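The two measures mentioned here behave differently, which the following sketch tries to show. The helper names euclidean_distance and cosine_similarity are our own, and the vectors are made up for the example. Note that cosine_similarity as written assumes neither vector is all zeros.

```python
import math

def euclidean_distance(a, b):
    """Smaller values mean the inputs are more similar."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_similarity(a, b):
    """Larger values mean the inputs are more similar (assumes non-zero vectors)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 1.0]  # orthogonal to a

print(euclidean_distance(a, b))  # 1.0
print(cosine_similarity(a, b))   # 1.0
print(cosine_similarity(a, c))   # 0.0
```

Euclidean distance shrinks as inputs become more similar, while cosine similarity grows, so a loss that expects a distance needs the similarity converted first, for example as 1 minus the cosine similarity.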
The input to a siamese network must be a pair, (X₁, X₂), along with a binary label, Y ∈ {0, 1}, stating whether the pair is a genuine pair (the same) or an impostor pair (different).
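One simple way to build such labeled pairs from a class-labeled dataset is sketched below. It assumes every unordered combination of two samples is used, with Y = 1 for genuine pairs to match the label convention used for the loss; the function make_pairs and the file names are hypothetical.

```python
from itertools import combinations

def make_pairs(samples):
    """Build (x1, x2, y) triples from (item, class_label) samples.

    y is 1 for a genuine pair (same class) and 0 for an impostor pair.
    """
    pairs = []
    for (x1, c1), (x2, c2) in combinations(samples, 2):
        y = 1 if c1 == c2 else 0
        pairs.append((x1, x2, y))
    return pairs

# Hypothetical file names, used only for illustration.
samples = [
    ("alice_01.jpg", "alice"),
    ("alice_02.jpg", "alice"),
    ("bob_01.jpg", "bob"),
]

pairs = make_pairs(samples)
for pair in pairs:
    print(pair)
```

With many classes, impostor pairs far outnumber genuine ones under this scheme, so in practice the impostor pairs are often subsampled to keep the two kinds balanced.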
Siamese networks have a wide range of applications; they have been combined with various architectures to perform tasks such as human action recognition, scene change detection, and machine translation.
