The success of Deep Learning applied to automated visual perception problems has been one of the main drivers of the current wave of the AI revolution. An Artificial Neural Network architecture called AlexNet took the academic community and industry by surprise when it showed a vast accuracy improvement in the 2012 ImageNet competition. The competition focused on programming computers to recognize object categories from photos alone.
Since this critical event, researchers have tried many different Deep Network architectures, further improving object recognition accuracy. In certain specialized applications they have even surpassed human capability.
What about more general visual capabilities, such as accurately recognizing the posture and/or motions of an object, describing a given scene as a phrase or a sentence, or figuring out the surrounding 3D structure so that a car or a robot can navigate by itself? These other ‘Visual AI’ tasks mimicking human visual perception have been investigated for many decades already, but real-world adoption has been slow outside a few successful areas, due to low accuracy and limited generalizability.
After the initial breakthrough in 2012, the revolution has been spreading from the task of object recognition to a wider scope of image understanding tasks, with similar degrees of success and further promise for everyday life and businesses.
Early Computer Vision
An academic area called ‘computer vision’ has existed for several decades within computer science and engineering. It strives to automate human and animal visual processing, such as recognizing objects and their movements and deriving a high-level semantic understanding of a scene. While these functions of visual perception seem effortless to humans, programming computers to perform the same tasks is extremely challenging. Visual perception would be one of the most crucial elements to automate if we want to build a ‘thinking machine’ that can take over a wide range of human tasks.
The performance of computer vision algorithms has gradually improved by adapting existing theories and algorithms from signal processing, statistical modeling, mathematical optimization, and computational geometry.
The first approach was to explicitly model how humans and animals process visual signals, and to program computers to mimic the modeled behavior. For example, studies of brain anatomy and animal behavior indicate that the brain first extracts low-level visual elements such as edges, colors, and textures before recognizing objects. Motion perception appeared to be processed differently, in brain regions close to motor control. Algorithms developed from these findings have been successful and found real-world applications, including military target recognition, recognizing clinically meaningful anomalies in medical imaging, and modeling human movements for generating computer animations. However, a model that achieved acceptable accuracy under certain conditions often did not fare well when applied to slightly different scenes or environments. The models often had too many parameters to adjust, or lacked the means to model certain image elements.
Machine Learning Takes Over
To overcome these limitations, the academic community began adopting approaches based on image-data statistics in the 1990s, because statistical models estimated directly from natural scenes represent similar kinds of images more efficiently. Many ‘Statistical Machine Learning’ classification algorithms were developed; among the popular classifiers were the Support Vector Machine, Boosting, and the Random Forest. The Artificial Neural Network was one of the earliest such algorithms and attracted strong initial attention and adoption, but it later lost popularity to the relative success of the algorithms mentioned above. Almost all of these modern Machine Learning approaches contributed to the improvement of computer vision algorithms.
These algorithms are very general and improve accuracy well, as long as enough samples are provided to estimate the model parameters. One major drawback of this approach is that its accuracy depends greatly on the choice of attributes (or features) of the data that are fed into the classifiers. It has often been the role of computer vision researchers to find the best features, either from model assumptions about the problem or simply through trial and error. This step, called ‘feature engineering’, was the main driver of the gradual performance improvement of computer vision algorithms before AlexNet. But it is time consuming and demands long rounds of trials, with no guarantee that the newly invented features will be optimal.
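A minimal sketch of this classical pipeline, using only NumPy: a hand-crafted feature (here, a simple intensity histogram — one of many possible engineered features) is computed by the researcher, then handed to a generic classifier. The nearest-centroid classifier below is a deliberately tiny stand-in for the SVMs and Boosting methods mentioned above; the data is toy data, not a real vision benchmark.

```python
import numpy as np

def histogram_feature(image, bins=8):
    """Hand-crafted feature: a normalized intensity histogram.
    'image' is assumed to be a 2-D array of values in [0, 255]."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

class NearestCentroid:
    """A minimal stand-in for a classical classifier (SVM, Boosting, ...)."""
    def fit(self, features, labels):
        self.labels_ = sorted(set(labels))
        self.centroids_ = {
            c: np.mean([f for f, l in zip(features, labels) if l == c], axis=0)
            for c in self.labels_
        }
        return self

    def predict(self, feature):
        # Assign the label whose centroid is closest in feature space.
        return min(self.labels_,
                   key=lambda c: np.linalg.norm(feature - self.centroids_[c]))

# Toy data: "dark" vs "bright" 4x4 images.
rng = np.random.default_rng(0)
dark = [rng.integers(0, 100, (4, 4)) for _ in range(5)]
bright = [rng.integers(150, 256, (4, 4)) for _ in range(5)]
X = [histogram_feature(im) for im in dark + bright]
y = ["dark"] * 5 + ["bright"] * 5

clf = NearestCentroid().fit(X, y)
print(clf.predict(histogram_feature(rng.integers(0, 100, (4, 4)))))  # -> "dark"
```

The key point is the division of labor: the quality of `histogram_feature` (the engineered part) caps what any classifier downstream can achieve, which is exactly why feature engineering consumed so much research effort.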
Deep Learning Breakthrough
As everybody knows by now, Deep Learning took over both the R&D activities and the industrial applications of Visual AI. Three major forces led to its success: the abundance of internet-originated visual data, rapidly improving computational power at falling cost, and new algorithmic breakthroughs in Artificial Neural Network technology. Deep Learning is essentially a Machine Learning technology, with one critical difference: it does not require manual ‘feature engineering’, because both the features and the classifier are learned automatically by the Deep Learning training algorithm.
Deep Networks consist of many layers, so the roles of feature extractor and classifier can be handled within the same unified structure.
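That unified structure can be sketched in a few lines of NumPy. The weights below are random placeholders for values that training would actually learn, and the fully-connected layers are a simplification of real convolutional architectures; the point is only the shape of the computation: the hidden layers play the role of the feature extractor, and the final layer plays the role of the classifier, all in one network.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Randomly initialized weights stand in for what training would learn.
W1 = rng.normal(size=(16, 8))   # layer 1: raw pixels -> low-level features
W2 = rng.normal(size=(8, 4))    # layer 2: low-level -> higher-level features
W3 = rng.normal(size=(4, 3))    # layer 3: features -> 3 class scores

def forward(pixels):
    h1 = relu(pixels @ W1)      # learned "feature extraction" ...
    h2 = relu(h1 @ W2)          # ... happens in the hidden layers,
    return softmax(h2 @ W3)     # while the last layer acts as the classifier.

probs = forward(rng.random(16))  # a flattened 4x4 "image"
print(probs)                     # probabilities over 3 classes, summing to 1
```

Because every stage is differentiable, one training algorithm (backpropagation) adjusts all the layers at once — which is what removes the manual feature engineering step.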
This approach translates to several unique strengths that set it apart from previous technologies:
(1) High accuracy, getting even better: after AlexNet in 2012, object recognition accuracy has improved even further as new network architectures have been developed, in some cases surpassing human capability.
(2) Functional scalability: the common Deep Neural Network framework can still be specialized to handle vastly different tasks, which facilitates unified software (Caffe, TensorFlow, Torch, etc.) and hardware (GPU + CUDA) platforms.
(3) Machines learn from machines (transfer learning): image features learned for recognizing thousands of object categories (as in ImageNet) can be re-purposed as part of different networks performing different visual tasks.
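Transfer learning can be sketched in the same NumPy style. In this toy example the "pretrained" feature extractor is just a random projection (a real one would come from a network trained on ImageNet-scale data), and the new task and labels are made up; what matters is the pattern: the feature layers are frozen, and only a small new classifier layer is trained on the new task's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend W1 came from a network pretrained on a large dataset
# (e.g. ImageNet-style object recognition); here it is just random.
W1 = rng.normal(size=(16, 8))

def frozen_features(x):
    """The re-used feature extractor: its weights are never updated."""
    return np.maximum(x @ W1, 0)

# New task: train ONLY a fresh final layer on a small labelled set.
X = rng.random((20, 16))                  # 20 tiny flattened "images"
y = (X.mean(axis=1) > 0.5).astype(float)  # toy binary labels
w = np.zeros(8)                           # the new classifier's weights

feats = np.array([frozen_features(x) for x in X])
for _ in range(500):                      # plain gradient descent
    p = 1 / (1 + np.exp(-feats @ w))      # sigmoid predictions
    w -= 0.1 * feats.T @ (p - y) / len(y) # logistic-loss gradient

accuracy = np.mean(((1 / (1 + np.exp(-feats @ w))) > 0.5) == y)
```

Because only the small final layer is trained, far less labelled data is needed for the new task — which is what makes transfer learning so attractive in practice.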
▶   The contents are protected by copyright laws and the copyrights are owned by the creator.
▶   Re-use or reproduction as well as commercial use of the contents without prior consent is strictly prohibited.
Dr. Hankyu Moon works as a Director at Samsung SDSRA’s AI Research Group. Prior to joining Samsung SDSRA he worked at NEC Research Institute and HRL Laboratories as a research scientist. His R&D career in Visual AI started when he joined the Center for Automation Research, University of Maryland College Park, in 1996 as a Ph.D. student. He is very excited about the current AI boom, and is taking the initiative within his team to further broaden the capabilities of Visual AI by leveraging the power of Deep Learning to serve diverse business use cases.