A gentle introduction to computer vision
The history of computer vision has always been intertwined with our pursuit of artificial intelligence. The research groups that began investigating machine intelligence in the late 1950s were also concerned with how computers could mimic the human ability to see, and thus acquire a high-level understanding of the physical world around them.
Admittedly, it would be quite irresponsible to build self-driving cars that cannot see their surroundings. And how willing would you be to undergo augmented reality-assisted surgery with a computer that sees well only 99 percent of the time? In a highly interconnected society that uploads close to 50,000 pictures to Instagram every second, how useful would a blind, marketing-oriented intelligent algorithm be?
The first computer vision algorithms
To appreciate the challenges that scientists encountered when designing the first computer vision algorithms, one has to understand how differently humans and computers perceive their environment. For humans, the world is a three-dimensional space made intelligible through their five senses, with vision being the ability to perceive light in the continuous visual spectrum. Computers, on the other hand, at least in their current form, perceive only a binary reality, as their electrical components can detect nothing more than the presence or absence of current.
The first step towards bridging this interpretational gap was made in 1959, when Kirsch and his group introduced what is today the common concept of a digital image. For a computer, an image is simply a two-dimensional table in which each cell is a pixel holding the intensity of light or color at a particular point. The toughest problems in this area thereafter focused on how this two-dimensional information can be used to reconstruct three-dimensional reality. To this end, scientists devised techniques that drew on geometry, computer engineering and physics, and thus shaped today’s advanced field of digital image processing.
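To make the idea concrete, here is a minimal sketch, using NumPy and Pillow (library choices of ours, not mentioned above), of how a grayscale picture looks to a computer: nothing more than a two-dimensional table of intensity values.

```python
# A minimal sketch of the "image as a two-dimensional table" idea.
# NumPy and Pillow are assumed here purely for illustration.
import numpy as np
from PIL import Image

# Load a picture and convert it to grayscale ("L" mode = one intensity channel).
img = Image.open("example.jpg").convert("L")   # "example.jpg" is a placeholder path

# The same picture, as the computer sees it: a 2D array of pixel intensities (0-255).
pixels = np.asarray(img)

print(pixels.shape)          # (height, width)
print(pixels[0, 0])          # intensity of the top-left pixel
print(pixels[10:13, 10:13])  # a small 3x3 patch of the table
```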
However, seeing is one thing and understanding is another. Computer vision is one of many examples in our technological history where scientists have tried to reverse-engineer processes performed by living organisms. Animal vision was, and to this day remains, a largely unexplored, complex process, and it took lengthy experimentation, as well as a bit of luck, to realize that animal neurons react to the presence of edges in their visual field. This observation was enough to inspire the bottom-up approach that computer vision still uses today: computers gradually build their understanding of a scene by detecting dots and edges, constructing simple geometric surfaces and, finally, combining these surfaces into intelligible objects.
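As a rough illustration of the first stage of this bottom-up pipeline, the sketch below highlights edges with a Sobel filter via SciPy; the specific filter is our choice, as the text above does not prescribe one.

```python
# A small sketch of the first stage of the bottom-up pipeline: edge detection.
# The Sobel operator is one classic choice among many.
import numpy as np
from scipy import ndimage

def edge_magnitude(gray: np.ndarray) -> np.ndarray:
    """Approximate edge strength at every pixel of a grayscale image."""
    gx = ndimage.sobel(gray, axis=1)  # horizontal intensity changes
    gy = ndimage.sobel(gray, axis=0)  # vertical intensity changes
    return np.hypot(gx, gy)           # gradient magnitude: large near edges

# Example: a synthetic image with a bright square on a dark background.
img = np.zeros((64, 64))
img[16:48, 16:48] = 255.0

edges = edge_magnitude(img)
print(edges.max(), edges[0, 0])  # strong response at the square's border, zero in flat regions
```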
Why is deep learning useful?
Successes in computer vision arrived hand in hand with the rise of deep learning, and this was no coincidence. Human vision is a complex process that requires the collaboration of the retina, the optic nerve and the brain; it would therefore be unrealistic to pursue computer vision of human-level accuracy without first equipping traditional image analysis algorithms with some form of intelligence.
Perhaps the primary use of learning-based computer vision algorithms is the automatic labelling of pictures, which enables visual search in huge repositories of images such as Google Photos. What is remarkable in this case is that a computer acquires the conceptual understanding that allows it to recognize people, places and objects just by observing raw pixel values.
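As an illustrative sketch of automatic labelling, the snippet below runs a pretrained convolutional network from torchvision on a single picture and prints its most likely label. The library, the model and the need for a recent torchvision version are assumptions made here for illustration; any pretrained classifier would serve.

```python
# A sketch of automatic image labelling with a pretrained network.
# torchvision (>= 0.13) and ResNet-50 are illustrative assumptions.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT            # pretrained ImageNet weights
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()                    # resizing/normalization expected by the model

img = Image.open("photo.jpg").convert("RGB")         # "photo.jpg" is a placeholder path
batch = preprocess(img).unsqueeze(0)                 # add a batch dimension

with torch.no_grad():
    probs = model(batch).softmax(dim=1)[0]

top = probs.argmax().item()
print(weights.meta["categories"][top], probs[top].item())  # predicted label and its confidence
```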
Performance of algorithms
The striking performance of today’s algorithms can be largely attributed to the introduction of convolutional neural networks. Traditional feed-forward neural networks were recognized early on as incapable of solving vision-related problems, due to the high complexity of assigning one neuron to each pixel value. This approach was not just inefficient; it also contradicted our understanding of how neurons in the human brain work: when looking at a picture, we don’t break it into pixels and assign one neuron to each of them. In reality, each neuron has a receptive field, meaning that it observes a specific area of the visual space, an observation that convolutional networks leverage to reduce the complexity of training.
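A quick way to see why receptive fields matter is to count parameters. The sketch below, written with PyTorch and with illustrative layer sizes of our own choosing, compares a fully connected layer that wires every pixel of a 224×224 RGB image to 128 neurons against a convolutional layer whose 128 filters each look at only a 3×3 neighbourhood.

```python
# Contrasting "one weight per pixel" with local receptive fields.
# The image size and layer widths are illustrative choices only.
import torch.nn as nn

def n_params(layer: nn.Module) -> int:
    return sum(p.numel() for p in layer.parameters())

# Fully connected: every one of the 224*224*3 input values connects to each of 128 neurons.
dense = nn.Linear(224 * 224 * 3, 128)

# Convolutional: 128 filters, each looking only at a 3x3 receptive field over 3 channels.
conv = nn.Conv2d(in_channels=3, out_channels=128, kernel_size=3)

print(f"fully connected parameters: {n_params(dense):,}")  # ~19.3 million
print(f"convolutional parameters:   {n_params(conv):,}")   # ~3.6 thousand
```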
A more recent and controversial area is that of artificial image generation. Enabled by specialized neural network architectures such as generative adversarial networks and variational autoencoders, it is possible today for a computer to generate high-quality images that appear to be real but are synthesized after the algorithm has analysed huge repositories of images. The potential in this area is enormous, with computers already generating artistic paintings and image editing reaching near-perfect quality. Perhaps more imminent, however, is the danger of deepfakes: realistic images that can be used for deceptive advertising, propaganda or the targeted defamation of individuals.
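To give a flavour of the adversarial setup behind such generators, here is a deliberately tiny sketch in PyTorch; the layer sizes and image dimensions are illustrative assumptions, not a recipe for a production model.

```python
# A minimal sketch of the adversarial pair behind GANs; sizes are illustrative only.
import torch
import torch.nn as nn

latent_dim = 100         # size of the random "noise" vector the generator starts from
image_dim = 28 * 28      # a small grayscale image, flattened

# Generator: noise -> synthetic image
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),   # pixel values in [-1, 1]
)

# Discriminator: image -> probability that it is real rather than generated
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

# One forward pass: draw noise, synthesize a fake image, ask the discriminator to judge it.
noise = torch.randn(1, latent_dim)
fake_image = generator(noise)
print(discriminator(fake_image))  # during training the two networks are optimized against each other
```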
Accelerating AI in computer vision
Although artificial intelligence has offered us a variety of techniques, such as Natural Language Processing (NLP) and speech recognition, computer vision is the field that has revolutionized applications from the user’s perspective. In contrast to solutions that focus on cloud-based processing of huge amounts of data, computer vision algorithms shift the processing focus to personal devices, a practice we often call edge computing.
The abundance and quality of computer vision applications would not have been possible if it weren’t for the recent acceleration in the hardware capabilities of personal devices. The adoption of Graphics Processing Units (GPUs) for parallelizing deep learning algorithms was essential in making high-quality image processing on personal devices feasible, while companies like Nvidia aim to bring even further sophistication and customization to on-chip hardware architectures. The failure of Google Goggles, which closely resembled today’s products such as Google Lens and Google Photos, is a telling example of an application introduced to the market too early to offer a satisfying user experience.
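For completeness, here is a tiny sketch, assuming PyTorch as the framework, of how that GPU acceleration looks in practice: the same convolution runs unchanged on a CPU or a GPU, with the hardware handling the parallelism.

```python
# A tiny sketch of how a deep learning framework offloads work to a GPU when available.
# PyTorch is an assumption here; the text does not name a specific framework.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The same convolution runs unchanged on either device; the GPU simply parallelizes it.
conv = torch.nn.Conv2d(3, 64, kernel_size=3).to(device)
batch = torch.randn(8, 3, 224, 224, device=device)   # a batch of 8 RGB images

features = conv(batch)
print(device, features.shape)
```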
It is hard to foresee whether the computer vision of the future will progress by advancing deep learning techniques, designing more efficient hardware, or devising a technique that eliminates today’s need for labelled data. One thing, however, is clear: bringing artificial intelligence closer to the physical world by equipping it with “eyes” was a decisive step in the AI revolution we are experiencing today.