Have you ever wondered how the Google Photos app clusters photos by events and places and offers face detection, how visual search lets you shop for similar products using a reference image taken with your camera, how Facebook can identify your friend’s face from only a few tagged pictures, how driverless cars detect obstacles, pedestrians, other vehicles, road signs, and stop lights, or how computers can detect cancer with accuracy rivaling that of human doctors?
For many decades, people dreamed of all this, but now it has become reality.
For example, if we see a dog’s image, we can easily tell that it’s a picture of an adorable dog without even thinking. But do you know how a computer sees that image?
An image contains many pixels (tiny dots of color) arranged in rows and columns. A computer, however, does not understand pixels as dots of color; it only understands numbers. To convert colors to numbers, the computer represents the image using various color models. We won’t go into the details here.
Even after converting an image into numbers, a computer still cannot recognize what the image is about. This is where artificial intelligence comes in.
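To make this concrete, here is a minimal sketch of how an image looks to a computer, using NumPy and the common RGB color model (the 2x2 image and its pixel values are made up purely for illustration):

```python
import numpy as np

# A tiny 2x2 RGB image: each pixel is three numbers (red, green, blue), 0-255.
image = np.array([
    [[255,   0,   0], [  0, 255,   0]],   # red pixel, green pixel
    [[  0,   0, 255], [255, 255, 255]],   # blue pixel, white pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): 2 rows, 2 columns, 3 color channels
print(image[0, 0])  # [255 0 0] -- the computer sees only these numbers
```

A real photograph is the same idea at a larger scale: a 32x32 color image, for instance, is just a grid of 32 × 32 × 3 = 3,072 numbers.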
Computer vision is a field of artificial intelligence that trains computers to interpret and understand the visual world: in this case, understanding the image of a dog.
In computer vision, several methods are used to evaluate inputs and obtain outputs. Techniques such as image classification, object detection, object tracking, semantic segmentation, and image segmentation are applied separately or in combination. Here we will try to understand the image classification technique.
Image classification is the process of predicting a specific class, or label, for something that is defined by a set of data points. For example, a picture will be classified as a daytime or nighttime shot, images of cars and motorcycles will be automatically placed into their own groups, etc.
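The daytime/nighttime example can be sketched with a toy rule-based classifier that predicts a label from mean pixel brightness. This is only an illustration of what "predicting a class" means; the threshold of 128 is an arbitrary assumption, and real classifiers learn their decision rules from data:

```python
import numpy as np

def classify_day_night(image, threshold=128):
    """Toy classifier: label an image 'day' or 'night' by mean brightness.
    The threshold of 128 is an arbitrary value chosen for illustration."""
    return "day" if image.mean() > threshold else "night"

bright = np.full((32, 32, 3), 220, dtype=np.uint8)  # a mostly white image
dark   = np.full((32, 32, 3),  30, dtype=np.uint8)  # a mostly black image

print(classify_day_night(bright))  # day
print(classify_day_night(dark))    # night
```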
There are countless categories, or classes, into which a specific image can be classified. Imagine doing this manually: comparing images and grouping similar ones by shared characteristics, without necessarily knowing in advance what you are looking for. Obviously, this is a burdensome task. To make it even more so, assume that there are hundreds of thousands of images. It quickly becomes apparent that an automated system is needed to do this efficiently.
Many image classification tasks involve photographs of objects. Two popular examples include the CIFAR-10 and CIFAR-100 datasets that have photographs to be classified into 10 and 100 classes respectively.
Deep Learning for Image Classification:
To understand the recent progress in computer vision technology, we need to dive into the algorithms it relies on. Modern computer vision relies on deep learning, a specific subset of machine learning that uses algorithms to glean insights from data. Machine learning, in turn, is a branch of artificial intelligence, the broader foundation of both technologies.
Deep learning represents a more effective way to do computer vision: it uses a specific algorithm called a convolutional neural network (CNN). Neural networks are used to extract patterns from the provided data samples. These algorithms are inspired by our understanding of how the brain functions, in particular the interconnections between neurons in the cerebral cortex.
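The core operation of a CNN is convolution: sliding a small filter over the image to detect local patterns such as edges. Here is a minimal NumPy sketch using a hand-written vertical-edge filter; note that a real CNN learns its filter values during training rather than having them written by hand:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter over the image ('valid' convolution, no padding)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A grayscale image with a vertical edge: dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# A simple vertical-edge filter; a trained CNN would learn values like these.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

response = convolve2d(image, kernel)
print(response)  # nonzero where the edge is, zero in the flat regions
```

Stacking many such filters, layer after layer, lets the network build up from edges to textures to whole object parts.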
According to Kaz Sato, Staff Developer Advocate at Google Cloud Platform, “A neural network is a function that learns the expected output for a given input from training datasets.”
One of the great challenges of developing CNNs is adjusting the weights of the individual neurons to extract the right features from images. The process of adjusting these weights is called “training” the neural network.
In the beginning, the CNN starts off with random weights. During training, the developers provide the neural network with a large dataset of images annotated with their corresponding classes (cat, horse, dog, etc.). The ConvNet processes each image with its random weights and then compares its output with the image’s correct label. If the network’s output does not match the label, which is likely at the beginning of training, the network makes a small adjustment to the weights of its neurons so that the next time it sees the same image, its output will be a bit closer to the correct answer.
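This adjust-and-repeat loop can be sketched on a toy single-neuron example with NumPy. The four "pixel" values, the learning rate, and the number of steps are all made-up illustration values; a real CNN repeats the same idea across millions of weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one "image" flattened to 4 pixel values, with correct label 1.0.
x = np.array([0.5, 0.1, 0.9, 0.3])
label = 1.0

w = rng.normal(size=4)   # start with random weights
lr = 0.1                 # learning rate: how big each adjustment is

initial_output = 1 / (1 + np.exp(-np.dot(w, x)))

for step in range(100):
    output = 1 / (1 + np.exp(-np.dot(w, x)))       # sigmoid "prediction"
    error = output - label                          # compare with the label
    w -= lr * error * output * (1 - output) * x     # nudge weights toward the answer

print(initial_output, "->", output)  # the prediction moves closer to 1.0
```

Each pass nudges the weights in the direction that shrinks the error, which is exactly the "small adjustment" described above, repeated many times over the whole dataset.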
After training the CNN, the developers use a test dataset to verify its accuracy. The test dataset is a set of labeled images that were not part of the training process. Each image is run through the ConvNet, and the output is compared to the actual label of the image. Essentially, the test dataset evaluates how good the neural network has become at classifying images it has not seen before.
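At its core, this evaluation is just counting how often the predicted label matches the true one. The labels below are hypothetical, made up to show the computation:

```python
import numpy as np

# Hypothetical true labels and model predictions for 8 test images
# (class indices into a 10-class problem like CIFAR-10).
true_labels      = np.array([3, 5, 1, 1, 9, 0, 7, 3])
predicted_labels = np.array([3, 5, 2, 1, 9, 0, 7, 8])

accuracy = np.mean(predicted_labels == true_labels)
print(f"test accuracy: {accuracy:.2%}")  # 75.00% (6 of 8 correct)
```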
By now you must be wondering what these datasets actually are, right?
Let’s introduce the CIFAR-10 dataset. It is a set of images that can be used to teach a computer how to recognize objects, and it is commonly used to train machine learning and computer vision algorithms. CIFAR stands for the Canadian Institute For Advanced Research.
It is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class.
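Assuming TensorFlow is installed, the dataset can be loaded in one line with its built-in Keras helper (note that the first call downloads roughly 170 MB):

```python
# Downloads CIFAR-10 on first run (~170 MB).
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print(x_train.shape)  # (50000, 32, 32, 3): 50,000 training images, 32x32, RGB
print(x_test.shape)   # (10000, 32, 32, 3): 10,000 test images
print(sorted(set(y_train.flatten().tolist())))  # class indices 0 through 9
```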
Since the images in CIFAR-10 are low-resolution (32x32), this dataset can allow researchers to quickly try different algorithms to see what works. Various kinds of convolutional neural networks tend to be the best at recognizing the images in CIFAR-10.
Similarly, there is another dataset, CIFAR-100.
It is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
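The Keras loader exposes the fine/coarse distinction through its `label_mode` argument (again assuming TensorFlow is installed; the first call downloads the dataset):

```python
# Downloads CIFAR-100 on first run.
from tensorflow.keras.datasets import cifar100

# label_mode selects the 100 "fine" classes or the 20 "coarse" superclasses.
(_, y_fine), _ = cifar100.load_data(label_mode="fine")
(_, y_coarse), _ = cifar100.load_data(label_mode="coarse")

print(len(set(y_fine.flatten().tolist())))    # 100 fine classes
print(len(set(y_coarse.flatten().tolist())))  # 20 superclasses
```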
These are very small images, much smaller than a typical photograph, and the dataset is intended for computer vision research.
On CIFAR-10, it is relatively straightforward to achieve around 80% classification accuracy. Top performance on the problem is achieved by deep convolutional neural networks, with classification accuracy above 90% on the test dataset.
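To tie the pieces together, here is one possible small ConvNet for 32x32 RGB inputs and 10 classes, written with Keras. The architecture (filter counts, layer sizes) is an illustrative assumption, not a top-performing model, and the training step is omitted; the forward pass below simply shows that an untrained network already produces 10 class probabilities:

```python
import numpy as np
from tensorflow.keras import layers, models

# A small illustrative ConvNet for CIFAR-10-sized inputs.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # learn 32 local filters
    layers.MaxPooling2D((2, 2)),                    # downsample feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),         # one probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Forward pass on one random "image": 10 probabilities summing to 1.
probs = model.predict(np.random.rand(1, 32, 32, 3).astype("float32"))
print(probs.shape)  # (1, 10)
```

Training this model on CIFAR-10 would be a matter of calling `model.fit` with the loaded training images and labels.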
The above content focuses on image classification and the deep learning architecture used for it. But there is more to computer vision than classification tasks: the detection, segmentation, and localization of classified objects are equally important.
Hopefully you now have an idea of how a computer sees and recognizes an image. Obviously, the process is very complex, much as it is in humans.
Thanks to advancements in artificial intelligence and computational power, computer vision technology has taken a huge leap toward integration in our daily lives. (Forbes expected the computer vision market to reach USD 49 billion by 2022.)