Convolutional neural networks (CNNs) have revolutionized the field of machine learning in recent years, especially in the field of image classification. The various applications and advances in the understanding and explainability of these models also allow us to understand their faults: spatial problems or so-called “adversarial” attacks (contradictory).
Picasso — Guernica (source Wikiart)
For example, we can use the analogy of the faces of Guernica by Picasso to illustrate the inability of CNN models to understand the spatial relationships between objects in images.
If you had to transcribe in French what a CNN interprets when you submit a face image to it, it would be something like this:
“If in the image we find:
— One or more eyes,
— A nose,
— A mouth,
— Eyebrows...
So it's a face! ”
At first glance, these criteria seem sufficient to correctly detect a real human face. However, if our image contains The characteristics of a face, but with a slightly original layout (4 eyes, 2 mouths, a nose on the left, vertical eyebrows...), should we really let the algorithm recognize the image as a face?
We could also discuss adversarial attacks ” which are the haunt of these models. These are disturbances that are almost invisible to the naked eye, but can radically deceive the network.
To overcome these problems, Geoffrey Hinton and his team proposed in 2017 an improvement to this type of model by introducing capsules to replace neurons in a standard neural network.
The capsules increase neurons by retaining a information vector rather than a simple quantity (scalar). This vector retains much more details: the pose, the texture, the location...
By using a process similar to conventional CNN, the capsule network can thus learn more complex patterns.
Going back to our face detection example, such a network can learn the spatial relationship between eyes, nose, and mouth, and could in theory even include pose (face orientation). So, the transcript in French could give this:
“If in the image we find:
— Two eyes, to the left or right of another eye, Above the nose
— A nose, oriented vertically, in the center, under the eyes
— A mouth, oriented horizontally, under the nose and eyes
— A pair Of eyebrows oriented horizontally, above the eyes ...
So it's a face! ”
Depending on the information retained in the capsule vector, the network may in theory be sensitive or not sensitive to the characteristics that can be interpreted from this information.
Since then, numerous papers have been published on the subject to propose improvements, adaptations to various use cases (other than 2D), as well as the sensitivity of capsules to “adversarial” attacks.
We recommend that you read this article available on medium.com for more details on how they work and how they work: https://medium.com/ai³-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b