Surprisingly, deep-learning computer vision algorithms often fail to classify images because they focus mainly on textures rather than shapes.
If you look at a photo of a cat, you will most likely recognize the animal, regardless of whether it is ginger or striped - or even if the photo is black and white, stained, faded or worn. You will probably also spot the cat when it curls up behind a pillow or jumps onto a table, appearing as nothing more than a blurry shape. You have naturally learned to recognize cats in almost any situation. Machine vision systems based on deep neural networks, by contrast, can sometimes match human performance at cat-recognition tasks under fixed conditions, yet are easily confused by images that differ even slightly from the ones they know, or that contain noise or too much grain.
And now German researchers have discovered an unexpected reason for this: where people pay attention to the shapes of depicted objects, deep-learning computer vision clings to the objects' textures.
The study, presented in May at the International Conference on Learning Representations, highlights the sharp contrast between how people and machines "think," and illustrates how far off our intuitions about how AI works can be. It may also tell us something about why our own vision evolved the way it did.
Elephant cats and clock airplanes
Deep learning algorithms work by running thousands of images through a neural network, each labeled, for example, as containing a cat or not. The system looks for patterns in this data, which it then uses to assign the best label to an image it has never encountered before. The network's architecture loosely resembles the structure of the human visual system, with connected layers that let it extract increasingly abstract features from an image. But the process by which the system builds the associations that lead to a correct answer is a black box that people can only try to interpret after the fact. "We tried to understand what leads to the success of these deep-learning computer vision algorithms, and why they are so vulnerable," said Thomas Dietterich, a computer scientist at Oregon State University who was not involved in the study.
Some researchers prefer to study what happens when they try to trick a network by subtly altering an image. They have found that small changes can cause the system to mislabel an image, while large changes may not force the label to change at all. Meanwhile, other experts trace changes within the system to analyze how individual neurons respond to an image, compiling an "activation atlas" of the features the system has learned.
But a group of scientists from the laboratories of the computational neuroscientist Matthias Bethge and the psychophysicist Felix Wichmann at the University of Tübingen in Germany took a more qualitative approach. Last year, the team reported that when trained on images altered by a particular kind of noise, a network became better than people at recognizing equally noisy pictures. However, the same images, modified in a slightly different way, completely confused the network, even though to people the new distortion looked almost the same as the old one.
To explain this result, the researchers considered which qualities of an image change most when even a little noise is added. The obvious answer is texture. "The shape of an object remains more or less intact even if you add a lot of noise for a long time," said Robert Geirhos, a graduate student in the Bethge and Wichmann labs and the study's lead author. But "the local structure of an image is distorted very quickly when a small amount of noise is added." So they devised a clever way to test how the visual systems of machines and people process images.
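As a rough illustration of that asymmetry (a toy sketch, not the authors' experiment), a few lines of NumPy can show that pixel noise corrupts an image's fine-grained, texture-like content far faster than its coarse, shape-like content:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a bright square (the shape) filled with a fine
# checkerboard pattern (the texture) on a dark background.
img = np.zeros((64, 64))
img[16:48, 16:48] = 0.5
yy, xx = np.mgrid[0:64, 0:64]
img[16:48, 16:48] += 0.25 * ((yy + xx) % 2)[16:48, 16:48]

noisy = img + rng.normal(0, 0.3, img.shape)

def local_texture(a):
    # High-frequency content: differences between neighboring pixels.
    return np.diff(a, axis=1)

def global_shape(a, k=8):
    # Low-frequency content: averages over k-by-k blocks.
    return a.reshape(64 // k, k, 64 // k, k).mean(axis=(1, 3))

def rel_err(clean, corrupted):
    return np.linalg.norm(corrupted - clean) / np.linalg.norm(clean)

texture_err = rel_err(local_texture(img), local_texture(noisy))
shape_err = rel_err(global_shape(img), global_shape(noisy))

print(f"texture distortion: {texture_err:.2f}")
print(f"shape distortion:   {shape_err:.2f}")
assert texture_err > shape_err  # noise hurts texture first
```

The same additive noise barely moves the block averages that carry the square's outline, while the pixel-to-pixel differences that carry the checkerboard texture are swamped.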
Geirhos, Bethge and their colleagues created images with two conflicting cues, taking the shape from one object and the texture from another: for example, a cat silhouette painted with the gray texture of elephant skin, a bear assembled from aluminum cans, or an airplane silhouette filled with overlapping clock faces. Shown hundreds of such images, people labeled them by their shapes - cat, bear, airplane - almost every time, as intended. Four different classification algorithms, however, leaned the other way, producing labels that reflected the objects' textures: elephant, cans, clock.
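The idea behind these cue-conflict stimuli can be sketched in a few lines (a simplified construction for illustration; the study itself used style transfer rather than a hard silhouette mask):

```python
import numpy as np

rng = np.random.default_rng(2)

# Take the silhouette (shape) from one source and fill it with the
# texture of another. Both "sources" here are stand-ins.
shape_mask = np.zeros((32, 32), dtype=bool)   # pretend "cat" silhouette
shape_mask[8:24, 6:26] = True

# Pretend "elephant skin": random positive values standing in for texture.
elephant_texture = 0.2 + 0.8 * rng.random((32, 32))

cue_conflict = np.where(shape_mask, elephant_texture, 0.0)

# Inside the silhouette, every pixel carries the foreign texture...
assert np.array_equal(cue_conflict[shape_mask], elephant_texture[shape_mask])
# ...while the outline itself is still the original shape.
assert np.array_equal(cue_conflict > 0, shape_mask)
```

A shape-biased observer looking at `cue_conflict` sees the silhouette; a texture-biased one sees only elephant skin.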
"This changes our understanding of how feedforward deep neural networks - without extra tweaks, after standard training - recognize images," said Nikolaus Kriegeskorte, a computational neuroscientist at Columbia University who was not involved in the study.
At first glance, an AI's preference for textures over shapes may seem strange, but it makes sense. "Texture is, in a way, shape at high resolution," said Kriegeskorte. And that scale is easier for the system to latch onto: the number of pixels carrying texture information vastly exceeds the number of pixels that make up an object's boundary, and the network's very first layers are devoted to detecting local features such as lines and edges. "That's exactly what texture is," said John Tsotsos, a computer vision specialist at York University in Toronto who was not involved in the study. "A grouping of segments that all line up the same way, for example."
Geirhos and his colleagues showed that these local features are enough for a network to carry out classification. Bethge and another author of the study, the postdoc Wieland Brendel, made that point rigorous in a paper also presented at the May conference. In that work, they built a deep learning system that operated much the way classification algorithms did before deep learning took over - on the "bag of features" principle. The algorithm breaks an image into small patches, as current models do (including those Geirhos used in his experiment), but then, instead of gradually integrating the information to extract higher-level features, it immediately guesses at the contents of each patch ("this patch contains evidence of a bicycle, this one evidence of a bird"). It simply adds up these per-patch decisions to identify the object ("if more patches contain evidence of a bicycle, then it is a bicycle"), paying no attention to the spatial relationships among the patches. And yet it recognized objects with unexpectedly high accuracy.
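The voting step at the heart of such a model is simple enough to sketch in a few lines (an illustrative toy, not the paper's implementation; the patch labels here are made up):

```python
from collections import Counter

# Hypothetical per-patch evidence: each patch of an image has been
# labeled independently with the class it most resembles.
patch_votes = [
    "bicycle", "bird", "bicycle", "bicycle",
    "bird", "bicycle", "background", "bicycle",
]

def bag_of_features_classify(votes):
    # Sum the per-patch decisions. The spatial arrangement of the
    # patches plays no role, so shuffling them gives the same answer.
    counts = Counter(v for v in votes if v != "background")
    label, _ = counts.most_common(1)[0]
    return label

print(bag_of_features_classify(patch_votes))  # -> bicycle
```

Because only the counts matter, such a classifier could never tell a correctly assembled bicycle from the same parts scrambled at random - yet, as the paper showed, this kind of evidence-summing already goes a surprisingly long way.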
"This work challenges the assumption that deep learning is doing something completely different" from earlier models, Brendel said. "Obviously a big leap was made. I'm just saying it wasn't as big as some had hoped."
According to Amir Rosenfeld, a postdoc at York University and the University of Toronto who was not involved in the study, "there is a big difference between what we think neural networks should be doing and what they actually do," including how well they manage to reproduce human behavior.
Brendel spoke in the same vein. It is easy to assume that neural networks will solve problems the way people do, he said. "But we constantly forget that other methods exist."
A shift towards a more human view of things
Modern deep learning methods can integrate local features, such as textures, into more global patterns, such as shapes. "What these papers show, unexpectedly and very convincingly, is that although the architecture permits classifying standard images by shape, this does not happen automatically if you simply train the network on them," said Kriegeskorte.
Geirhos wanted to see what would happen if the team forced the models to ignore texture. The team took the images traditionally used to train classification algorithms and repainted them in different artistic styles, stripping them of useful texture information. When they retrained each model on the new images, the systems began to rely on larger, more global patterns and showed a shape bias much more like that of people.
And the algorithms then became better at classifying noisy images, even though they had never been trained on such distortions. "The shape-recognizing network became more robust for free," said Geirhos. "This suggests that the right bias for certain tasks - in our case, a propensity to use shapes - helps generalize knowledge to new conditions."
This also suggests that in humans such a bias could have formed naturally, since relying on shape is a more robust way to recognize what we see in new or noisy conditions. People live in a three-dimensional world where objects are visible from many angles under many different conditions, and where our other senses, such as touch, can complement object recognition when needed. So it makes sense for our vision to prioritize shape over texture. In addition, some psychologists have found a link between language, learning and a tendency to rely on shape: children who were taught to pay more attention to shape when learning certain categories of words went on to develop much larger vocabularies of nouns than others.
This work is a reminder that "data exerts a stronger influence on a model's biases than we thought," said Wichmann. This is not the first time researchers have run into the problem: face-recognition programs, automated résumé screeners and other neural networks have been shown to give too much weight to unexpected features because of biases deeply embedded in the data they are trained on. Removing unwanted biases from a model's decision-making has proved difficult, but Wichmann said the new work demonstrates that it is possible in principle, which he finds encouraging.
Nevertheless, even Geirhos's shape-focused models can be fooled by adding too much noise to images or by changing particular pixels, which means they still have a long way to go before matching human vision. In the same vein, new work by Rosenfeld, Tsotsos and Markus Solbach, a graduate student in Tsotsos's lab, demonstrates that machine learning algorithms cannot capture the similarity of different images the way people do. Still, such studies "help pinpoint exactly which important aspects of the human brain these models do not yet reproduce," said Kriegeskorte. And Wichmann said that "in some cases, it may be more important to examine the data set."
Sanja Fidler, a computer scientist at the University of Toronto who was not involved in the study, agrees. "It's our job to design smart data," she said. She and her colleagues are exploring how auxiliary tasks can help neural networks improve at their main task. Inspired by Geirhos's findings, they recently trained an image classification algorithm not only to recognize the objects themselves but also to identify which pixels belong to their contours. And the network automatically got better at recognizing objects. "If you are given only one task, the result is selective attention and blindness to many other things," Fidler said. "If I give you several tasks, you learn about different things, and that may not happen. It's the same with these algorithms." Solving various tasks helps them "develop a bias toward different kinds of information," much as happened in Geirhos's experiment with shapes and textures.
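The multi-task setup described here can be sketched in miniature (a toy NumPy illustration under assumed shapes and made-up data, not Fidler's actual model): one shared feature extractor feeds both a class-label head and a per-pixel contour head, and their losses are simply added, so the shared features must serve both tasks.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.normal(size=(16,))           # flattened toy input "image"
W_shared = rng.normal(size=(8, 16))  # shared feature extractor
W_cls = rng.normal(size=(3, 8))      # object-class head (3 classes)
W_ctr = rng.normal(size=(16, 8))     # per-pixel contour head

features = np.tanh(W_shared @ x)

# Classification loss: cross-entropy against a one-hot class label.
logits = W_cls @ features
probs = np.exp(logits - logits.max())
probs /= probs.sum()
y_cls = np.array([0.0, 1.0, 0.0])
loss_cls = -np.sum(y_cls * np.log(probs))

# Auxiliary contour loss: squared error against a 0/1 contour mask.
contour_pred = W_ctr @ features
y_ctr = (rng.random(16) > 0.8).astype(float)
loss_ctr = np.mean((contour_pred - y_ctr) ** 2)

# During training, gradients of this combined loss would flow into
# W_shared from BOTH heads, pushing the shared features to carry
# information useful for both tasks.
total_loss = loss_cls + loss_ctr
print(f"total loss: {total_loss:.3f}")
```

The key design choice is the single shared extractor: if each head had its own features, the auxiliary contour task could not influence what the classifier learns to attend to.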
All this research is "a very exciting step toward deepening our understanding of what is happening in deep learning, and perhaps it will help us overcome the limitations we are seeing," said Dietterich. "That's why I love this series of papers."