
GPU Acceleration of Real Time Viola-Jones Face Detection

Adrian Wong Yoong Wai, Shahirina Mohd Tahir, Yoong Choon Chang

Information & Communication Technology, MIMOS Berhad, Technology Park Malaysia, Kuala Lumpur, Malaysia


Abstract—Face detection is a stepping stone for facial processing systems such as face recognition: its task is to determine the face regions in an input frame for applications like surveillance and law enforcement. However, face detection is a computationally expensive process, so accelerating it can improve the performance of the whole system. The latest Graphics Processing Unit (GPU) technology, programmed via the Compute Unified Device Architecture (CUDA), has proven its capability to accelerate computation-intensive algorithms and improve overall system performance. Thus, in this paper, a GPU acceleration of a frontal face detection system utilizing the Viola-Jones algorithm based on Adaptive Boosting (AdaBoost), using OpenCV and CUDA, is presented. Experimental results show that the proposed GPU acceleration of face detection achieves a speedup of up to 18 times compared to the conventional Central Processing Unit (CPU) version of the algorithm while maintaining its detection accuracy.

Index Terms—Face Detection, Viola-Jones, OpenCV, CUDA, GPU Computing, Real Time

I. INTRODUCTION

Recent development in the field of video surveillance has been moving toward more advanced vision applications such as face detection and recognition. Human face detection is often the primary step in applications such as face recognition and human-computer interfaces, with the task of locating faces in a video sequence and returning the size and location of each face. Over the past decade, the field of face detection has made huge leaps in detection speed and accuracy, driven by demand and potential applications. In particular, the pivotal work by Viola and Jones [1] made face detection practically achievable with a high detection rate by using a well-trained classifier [5]. It has been widely used in applications such as surveillance and video conferencing. These applications share a time-critical, interactive nature, which makes it essential for the system to fulfill real-time constraints. For instance, a smart video surveillance system aims to analyze uninterrupted 24/7 video streams from multiple IP CCTV cameras in different locations so that no important event requiring immediate action is missed. Such a system requires a great amount of computing power to meet its latency constraints without dropping frames. However, face detection is a computationally expensive task for which software solutions on the Central Processing Unit (CPU) provide only limited frame rates, for example 1.78 frames per second (fps) [6]. The speed of face detection also influences the performance of further processing such as face recognition, since detection time increases linearly with the size of the images. Therefore, it would be beneficial to accelerate the face detection process.

Over the past few years, the Graphics Processing Unit (GPU) has emerged as a source of massive computing power that, compared to a classic CPU, can also be applied to general-purpose computation. Today, GPUs are used not only for rendering 2D and 3D graphics but also for other tasks, specifically parallel computation. Compute Unified Device Architecture (CUDA) [8] is a C-based programming technology developed by NVidia to exploit the GPU's parallel computing capabilities for speeding up computationally intensive processes.

In this paper, a GPU-accelerated implementation of a Viola-Jones frontal face detection system based on Adaptive Boosting (AdaBoost), using OpenCV and CUDA, is presented. The detection speed of the GPU-enabled implementation is compared with that of the CPU implementation. In addition, the performance of the face detector on different GPU cards is evaluated experimentally.

The remainder of the paper is organized as follows. Section II describes previous work on accelerating face detection. Sections III and IV present the Viola-Jones face detection algorithm and the proposed GPU implementation of face detection, respectively. Experimental results and analysis for the CPU and GPUs are discussed in Section V. Lastly, conclusions and future work are provided in Section VI.

II. RELATED WORK

Research in human face processing has gained popularity since the late nineties. Researchers have proposed numerous approaches for identifying human faces [1], [3], [4]. Among these approaches, Viola-Jones face detection [5], which is based on AdaBoost, showed promising results, running at 15 frames per second (fps) on images of resolution 384×288, faster than previous approaches while retaining detection accuracy. Later, Lienhart [13] proposed a set of rotated Haar-like features that enriches the simple features of the original Viola-Jones algorithm.

However, face detection is a time-consuming task, especially as the size of the image to be processed increases. To date, much work has been reported in the literature on speeding up object detection, particularly for faces, to accommodate the demand from applications like real-time video surveillance. A software approach that uses multi-threading in the optimized OpenCV [2] implementation on a CPU-based system achieves 1.78 fps on VGA-size images [6] and 14.2 fps on smaller images of resolution 256×192 [11]. Alternatively, a hardware approach can accelerate the computationally intensive algorithm, depending on the application design. For instance, Theocharides et al. [7] proposed an ASIC architecture that heavily exploits the parallelism of the Viola-Jones algorithm by parallelizing accesses to image data. It achieves a computation rate of 52 fps, but the input image resolution is not mentioned. Cho et al. [9] presented an FPGA-based face detection system with Haar classifiers that uses buffers and special frame grabbers to accelerate processing and obtains 6.55 fps for VGA images; this implementation computes three features in parallel. A more recent parallelized FPGA implementation reaches up to 16.08 fps by computing up to eight feature classifiers in parallel [10].

2015 IEEE International Conference on Control System, Computing and Engineering, 27 - 29 November 2015, Penang, Malaysia

978-1-4799-8252-3/15/$31.00 ©2015 IEEE

In recent years, NVidia's GPU computing has become popular for parallelization because of its architecture. Several works already address the acceleration of the Viola-Jones face detection algorithm using CUDA. For instance, early GPU implementations of the algorithm run at 2.8 fps on an NVidia GTX285 and 4.3 fps on dual NVidia GTX295 cards for images of resolution 640×480 [6]. Another GPU-accelerated implementation was proposed by Hefenbrock et al. [12], where a stream-based multi-GPU implementation on 4 identical cards achieves 15.2 fps for VGA images. However, the integral image computation was not parallelized to maximize performance. Furthermore, Li-chao Sun et al. [14] presented a real-time face detection system based on the Viola-Jones cascade classifier on the CUDA platform. Their experimental results show that the CUDA program running on an NVidia GTX 570 graphics card achieves 9.35 fps for a VGA input image, a 6 times speedup over the CPU version. In addition, Shivashankar J. Bhutekar et al. [15] proposed a technique that processes images for face detection and recognition in parallel on an NVIDIA GeForce GTX 770 GPU. Their Viola-Jones face detection runs at approximately 3 fps on an image of 700×580 pixels in the CUDA framework, compared to 0.71 fps on the CPU. Lastly, Hadi Santoso et al. [16] presented an architectural design for a parallel, multiple-face detection technique based on the Viola-Jones framework. The design was tested with a 320×240 pixel image containing 4 faces on an Intel 4-core processor and achieved 20 fps, compared to 0.3 fps for the serial CPU version.

Figure 1 : Flow diagram of proposed face detection system.

Despite all these improvements, we still aim to improve the detection speed of the face detector to meet the benchmark of 25 fps for real-time applications by using the latest GPUs.

The accelerated implementations discussed above report only speed in fps. However, other parameters of the detector, such as the scaling factor, the minimum size, and the step of the search window, influence detection performance in terms of both speed and accuracy. Hence, both the parameter setup and the frame rate must be considered to judge the performance of any face detection system accurately.

III. VIOLA-JONES ALGORITHM

The Viola-Jones algorithm [5] is employed in the proposed system for robust face detection. It can be divided into two portions. The first portion trains a set of weak classifiers based on Haar-like features and uses AdaBoost to combine the most promising weak classifiers into a stage cascade classifier. The classifier is trained with a few thousand sample views of faces (positive examples) and arbitrary images (negative examples) that are scaled to the same size. The second portion is detection, where the algorithm searches every location of the input frame by applying the trained stage cascade classifier in a dynamic search window to look for the features of a human face, as shown in Figure 1.

A. Haar-like Features

Haar-like features are used for computing face feature values during training and detection. Examples of Haar-like features are shown in Figure 2.

Figure 2 : Example of Haar-like features.

Figure 3 : Example of a simple two-rectangle Haar-like feature; a person's eyes are typically darker than their forehead [12].

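The Viola-Jones algorithm depends on these feature values to judge whether a face is present. To compute the value of a feature such as the one shown in Figure 3, the pixel sum f(r) over its rectangles must be computed using Eq. 1, where W and B denote the sets of white and black rectangles of the feature:

$$f(r) = \sum_{w \in W} r(\text{white}) - \sum_{b \in B} r(\text{black}) \qquad (1)$$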

Once finished, each f(r) is multiplied by the corresponding feature's weight and the results are accumulated. If the accumulated value meets the pre-trained threshold, a face feature is considered found in the search window. The sizes and weights of each feature are obtained from the trained classifier.

[Figure 1 block labels: Load trained classifiers, Load input frame, Pre-processing, Integral Image, Detection, Results]
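The feature evaluation described in this subsection can be illustrated with a minimal Python sketch. It assumes a two-rectangle feature in the spirit of Figure 3 (a bright forehead region above a dark eye region); the helper names, the 4x4 patch, the weight, and the threshold are hypothetical values chosen for illustration and do not come from a trained classifier.

```python
def rect_sum(image, x, y, w, h):
    """Sum the pixels inside the w-by-h rectangle whose top-left corner is (x, y)."""
    return sum(image[row][col] for row in range(y, y + h) for col in range(x, x + w))


def haar_feature_value(image, white_rects, black_rects):
    """Eq. 1: pixel sum over the white rectangles minus pixel sum over the black ones."""
    white = sum(rect_sum(image, *r) for r in white_rects)
    black = sum(rect_sum(image, *r) for r in black_rects)
    return white - black


# Hypothetical 4x4 patch: bright "forehead" rows on top, dark "eye" rows below.
patch = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]

value = haar_feature_value(
    patch,
    white_rects=[(0, 0, 4, 2)],  # (x, y, width, height)
    black_rects=[(0, 2, 4, 2)],
)

# Hypothetical weight and threshold; real values come from the trained classifier.
feature_weight = 0.5
threshold = 500
feature_found = feature_weight * value >= threshold
```

Summing every pixel directly, as done here, costs time proportional to the rectangle area; the integral image of the next subsection removes that cost.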

B. Integral Image

To compute feature sums with fewer arithmetic operations, the integral image algorithm is used. Each value in the integral image is the sum of the pixel values of the input frame that lie above and to the left of pixel (x, y), including the pixel itself. Eq. 2 gives the definition, where "Image" represents the original image and "II" the integral image. Figure 4 illustrates Eq. 2.

$$II(x, y) = Image(x, y) + II(x-1, y) + II(x, y-1) - II(x-1, y-1) \qquad (2)$$
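As a sequential reference for Eq. 2, the recurrence can be written directly in Python. This is a minimal sketch, assuming the image is a list of rows so that pixel (x, y) is stored at `image[y][x]`, and treating out-of-range terms of the recurrence as zero; the function name is illustrative.

```python
def integral_image(image):
    """Build the integral image with the recurrence of Eq. 2."""
    height, width = len(image), len(image[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            left = ii[y][x - 1] if x > 0 else 0            # II(x-1, y)
            above = ii[y - 1][x] if y > 0 else 0           # II(x, y-1)
            diagonal = ii[y - 1][x - 1] if x > 0 and y > 0 else 0  # II(x-1, y-1)
            ii[y][x] = image[y][x] + left + above - diagonal
    return ii


ii = integral_image([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

The bottom-right entry of the result equals the sum of all pixels in the image, which is a convenient sanity check.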

In addition, the sum of a feature's rectangle can be computed in constant time. As shown in Figure 5, the sum of the pixels in rectangle D is obtained from the integral image values at its four corners, as in Eq. 3:

$$Sum_D = II(L_4) - II(L_3) - II(L_2) + II(L_1) \qquad (3)$$
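The four-corner lookup of Eq. 3 can be sketched in Python as follows. The sketch assumes that L1, L2, L3, and L4 are the top-left, top-right, bottom-left, and bottom-right corners of rectangle D, and it pads the integral image with one leading row and column of zeros (a common convenience, not something the text prescribes) so that no boundary checks are needed.

```python
def padded_integral_image(image):
    """Integral image with an extra zero row and column at the top and left."""
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            ii[y + 1][x + 1] = image[y][x] + ii[y + 1][x] + ii[y][x + 1] - ii[y][x]
    return ii


def rect_sum(ii, x, y, w, h):
    """Eq. 3: sum of the w-by-h rectangle at (x, y) using four lookups."""
    l1 = ii[y][x]            # top-left corner
    l2 = ii[y][x + w]        # top-right corner
    l3 = ii[y + h][x]        # bottom-left corner
    l4 = ii[y + h][x + w]    # bottom-right corner
    return l4 - l3 - l2 + l1


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ii = padded_integral_image(image)
total = rect_sum(ii, 1, 1, 2, 2)  # covers the pixels 5, 6, 8, 9
```

The cost of `rect_sum` is independent of the rectangle size, which is what makes evaluating thousands of features per search window affordable.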

Figure 4 : A 4x4 image and its corresponding integral image [12].

Figure 5 : Sum of pixels in Integral Image [12].

TABLE 1 : Cascade Classifier Organisation [18]

C. Hierarchical Stage Cascade Classifier

Although feature values can be computed in constant time, excessive work is still done when a search window does not contain a face: over 2000 Haar-like features would be applied to each search window, and evaluating all of them needlessly is inefficient. AdaBoost is therefore used to select the most promising features, those that cover facial characteristics, to avoid meaningless calculation.

The OpenCV cascade classifier contains a total of 22 stages, with early stages covering fewer, most promising features. The higher the stage, the more detailed the features it covers. The total number of Haar-like features across all stages is 2135, and Table 1 shows their organization. Hence, the workload of each stage varies significantly, as earlier stages are evaluated far more often than later stages. At each stage, the calculated feature values are accumulated and compared against the stage threshold. The search window is considered faceless when the accumulated value falls below the threshold. Figure 6 shows the flow of the stage cascade classifier.

Figure 6 : The flow of Stage Cascade Classifier [19].
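The early-rejection behaviour of the stage cascade can be sketched in a few lines of Python. This is a simplified model that follows the description in this subsection (weighted feature values accumulated per stage and compared with a stage threshold); the data layout, the function name, and the two-stage toy cascade are assumptions made for illustration, not the OpenCV classifier format.

```python
def evaluate_cascade(stages, feature_values):
    """Run a window through the cascade.

    stages: list of (weights, threshold), where weights maps a feature index
            to its weight.
    Returns (is_face, number_of_stages_evaluated).
    """
    for stage_number, (weights, threshold) in enumerate(stages, start=1):
        stage_sum = sum(w * feature_values[i] for i, w in weights.items())
        if stage_sum < threshold:
            # Reject immediately; later, more expensive stages are skipped.
            return False, stage_number
    return True, len(stages)


# Hypothetical two-stage cascade: a cheap first stage, a larger second stage.
stages = [
    ({0: 1.0}, 0.5),
    ({1: 1.0, 2: 1.0}, 1.0),
]

face_result = evaluate_cascade(stages, [0.9, 0.7, 0.6])
background_result = evaluate_cascade(stages, [0.1, 0.9, 0.9])
```

Because most search windows in a frame contain no face, most of them exit after the first stage, which is why the per-stage workload is so uneven.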

D. Scaling and Sliding of Search Window

Using the stage cascade classifier, the presence of a face in a search window can be determined rapidly. Because the location and size of faces are unknown in advance, the search window has to be scaled and slid across the image, as shown in Figure 7, to detect faces of different sizes by comparing the selected Haar-like features with the trained classifiers. The minimum size of the search window is determined by the size of the training images, which is 20x20 pixels for OpenCV.

Figure 7 : Face detection search window process [20].
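To make the cost of scaling and sliding concrete, the following Python sketch enumerates the search windows for a given frame. It assumes a square window that starts at 20x20 pixels and grows by a constant scale factor, with a fixed sliding step; the function name, the default step of 1 pixel, and the rounding rule are assumptions for illustration, and a real detector may choose them differently.

```python
def search_windows(frame_width, frame_height, min_size=20, scale_factor=1.1, step=1):
    """Yield (x, y, size) for every square search window that fits in the frame."""
    scale = 1.0
    while True:
        size = int(round(min_size * scale))
        if size > frame_width or size > frame_height:
            break
        for y in range(0, frame_height - size + 1, step):
            for x in range(0, frame_width - size + 1, step):
                yield x, y, size
        scale *= scale_factor


# Tiny 22x22 frame so the full enumeration is easy to inspect.
windows = list(search_windows(22, 22))
sizes = sorted({size for _, _, size in windows})
```

For a VGA frame the same generator produces a very large number of windows, each of which must be passed through the cascade; this is the workload the GPU implementation distributes over threads.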

IV. THE PROPOSED GPU IMPLEMENTATION OF FACE DETECTION

In this section, an implementation of the Viola-Jones face detection algorithm using CUDA is presented. First, the GPU architecture and the CUDA programming model are briefly reviewed. The approaches for parallelizing the algorithm and their design considerations are then discussed in detail.

A. Overview of GPU Architecture and CUDA Programming Model

CUDA is a C-based programming technology developed by NVidia that can be used for diverse demanding computations on the GPU. A CUDA program runs a series of kernels, each executed by scalar threads that are organized in 2D thread blocks; these thread blocks are in turn organized into a grid, as shown in Figure 9. Furthermore, the GPU offers a set of streaming multiprocessors (SMs), each with on-chip memory such as shared memory, texture cache, constant cache, and 32-bit registers. All SMs share an off-chip (global) memory that is not cached.
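The thread hierarchy can be emulated on the CPU to show how a 2D grid of 2D blocks covers an image. The Python sketch below mirrors the usual CUDA indexing rule (global index = block index × block dimension + thread index) in each dimension; the function names and the 16x16 block size are assumptions for illustration, and the loops only model the indexing, not parallel execution.

```python
def grid_for_image(width, height, block_dim):
    """Number of blocks per dimension needed to cover the image (ceiling division)."""
    return (
        (width + block_dim[0] - 1) // block_dim[0],
        (height + block_dim[1] - 1) // block_dim[1],
    )


def global_thread_coords(grid_dim, block_dim):
    """List the (x, y) coordinate handled by every thread of every block."""
    coords = []
    for block_y in range(grid_dim[1]):
        for block_x in range(grid_dim[0]):
            for thread_y in range(block_dim[1]):
                for thread_x in range(block_dim[0]):
                    x = block_x * block_dim[0] + thread_x
                    y = block_y * block_dim[1] + thread_y
                    coords.append((x, y))
    return coords


vga_grid = grid_for_image(640, 480, (16, 16))
small_coords = global_thread_coords(grid_dim=(2, 2), block_dim=(2, 2))
```

When the image size is not a multiple of the block size, the grid overshoots the image, so a kernel would typically guard against coordinates outside the frame.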

In general, a CUDA program starts with memory allocation on the device (GPU) while data on the host (CPU) are prepared. The data are then copied from the host to the device. Because these transfers are time-consuming, it is necessary to minimize the data transferred from host to device and from device to host. Once the data are ready on the device, kernels can be launched. When the computation ends, the results are returned to the host for display and the allocated memory is released. The typical CUDA program cycle is shown in Figure 8.

Figure 8: Typical CUDA program cycle.

Figure 9: CUDA architecture [21].

B. Approaches of Parallelizing Viola-Jones Face Detection Algorithm

This section describes a GPU-accelerated OpenCV implementation of the Viola-Jones face detector using the GPUCV framework. The implementation is optimized in several ways to enhance its performance. The CUDA version of the face detector comprises four main parts:

1) Loading of Cascade Classifier

The cascade classifier is the base of the face detection system. The OpenCV classifier is stored in a pictorial representation for ease of browsing. To load and parse the whole cascade classifier into the GPU, the data are saved in GPU global memory and bound to a texture reference. Because a texture reference can read data types of up to 128 bits (16 bytes), the best method is to pack the data into a few array structures with elements of 32, 64, or 128 bits.

2) Integral Image Computation

One of the best ways to reduce the spatial data dependency created by integral image computation is a 2D wave algorithm, as shown in Figure 10. This algorithm computes the first pixel of the input image and, at each following step, checks which result pixels can be computed next.

Figure 10: A 2D wave approach to integral image computation [22]. (a) Input image (b) Initialization stage (c) Stage after few iterations

However, this algorithm shows weaknesses when ported to the GPU, as it involves sparse, non-linear memory accesses. A more efficient method exploits the separability of the integral image: the rows are processed first, and only then the columns. The steps of the computation are as follows:

1. Row scanning of the input image
2. Transposing the result
3. Row scanning of the transposed result
4. Generating the output by transposing the last result
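The four steps above can be sketched sequentially in Python. The sketch uses an inclusive prefix sum (`itertools.accumulate`) for the row scan and plain list operations for the transpose; the function names are illustrative, and on the GPU each row scan and each transpose would be a separate parallel kernel rather than a Python loop.

```python
from itertools import accumulate


def scan_rows(matrix):
    """Replace every row by its inclusive prefix sum."""
    return [list(accumulate(row)) for row in matrix]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def integral_image_separable(image):
    row_scanned = scan_rows(image)          # step 1
    transposed = transpose(row_scanned)     # step 2
    column_scanned = scan_rows(transposed)  # step 3
    return transpose(column_scanned)        # step 4


result = integral_image_separable([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

The transposes exist so that both passes read memory row by row, which is the access pattern that suits the GPU; the arithmetic result is the same as that of the direct recurrence in Eq. 2.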

Scanning all rows or columns of an image differs from scanning a single array. Launching a separate scan for each of the N image rows is inefficient, because the data in a single row do not provide adequate workload for the GPU and each launch adds time overhead. This is solved by assigning one CUDA block to each image row, with each block comprising 128 threads that iterate along the row, as shown in Figure 11. In each iteration, the threads load 128 elements from the row into shared memory, perform an exclusive scan, and add a carry value to the prefix sums. The carry value is initialized to zero and, after each iteration, is updated to the sum of all elements scanned so far. Finally, the threads write the result to global memory.

Figure 11: Integral Image generation of scan algorithm with 4 thread CUDA blocks [22].
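The chunked row scan with a carry value can be modelled sequentially in Python. In this sketch each loop iteration stands for one iteration of a 128-thread block: the chunk is scanned exclusively, then the element itself and the running carry are added to obtain the row of the integral image. The function name is hypothetical, and the inner loop replaces what would be a parallel scan in shared memory.

```python
def scan_row_in_chunks(row, chunk_size=128):
    """Prefix-sum a row chunk by chunk, propagating a carry between chunks."""
    result = []
    carry = 0  # sum of all elements scanned in earlier chunks
    for start in range(0, len(row), chunk_size):
        chunk = row[start:start + chunk_size]

        # Exclusive scan of the chunk (done in parallel on the GPU).
        exclusive = []
        running = 0
        for value in chunk:
            exclusive.append(running)
            running += value

        # Add each element and the carry to turn the exclusive scan into row sums.
        result.extend(carry + e + v for e, v in zip(exclusive, chunk))
        carry += running
    return result


row = list(range(1, 301))          # 300 elements, so three chunks of up to 128
prefix = scan_row_in_chunks(row)
```

The carry is the only value that has to survive from one iteration to the next, which is why a single block can process a row of arbitrary length with a fixed amount of shared memory.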

3) Scaling of Search Window

There are two methods of searching for faces of different scales in OpenCV. The first method downscales the input image by the preset scaling factor, but it is also the most computationally intensive. The second method scales the classifiers and the M×M search window by the preset scaling factor. However, in this approach the features in the classifier no longer map onto the pixel grid when the M×M search window is upscaled by a real-valued factor. Hence, the weights of all features have to be updated at every scale, based on the changed area of the features' rectangles, to avoid the precision loss caused by coordinate rounding. For the GPU implementation, the data layout needs to be considered carefully, because global memory access latency depends on coalescing for direct accesses and on the spatial locality of adjacent accesses for cached accesses through the texture cache or the L1/L2 cache. Restricting the implementation to integer scales of the input image is another possible solution.

[Figure 8 steps: Allocate GPU Memory, Copy CPU Memory to GPU, Configure Threads, Launch Threads, CUDA Kernel Code, Synchronize Threads, Copy GPU Memory to CPU, Free GPU Memory]
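The weight update for scaled features can be illustrated with a small Python sketch. It assumes one plausible correction rule: after rounding the scaled rectangle to the pixel grid, the weight is multiplied by the ratio of the ideal (unrounded) area to the rounded area, so that weight times area is preserved. The function name and the rule itself are assumptions for illustration and are not claimed to be the exact rule used by OpenCV or by the implementation described here.

```python
def scale_rect(rect, scale):
    """Scale a feature rectangle and correct its weight for rounding.

    rect: (x, y, width, height, weight) at the training scale.
    Returns the rounded rectangle with an area-corrected weight.
    """
    x, y, w, h, weight = rect
    sx, sy = int(round(x * scale)), int(round(y * scale))
    sw, sh = int(round(w * scale)), int(round(h * scale))

    ideal_area = w * h * scale * scale  # area if sub-pixel rectangles were possible
    rounded_area = sw * sh              # area actually covered on the pixel grid
    corrected_weight = weight * ideal_area / rounded_area
    return sx, sy, sw, sh, corrected_weight


sx, sy, sw, sh, corrected = scale_rect((5, 5, 6, 3, 1.0), scale=1.2)
```

In this example the rounded rectangle is slightly larger than the ideal one, so the corrected weight is slightly below the original weight.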

4) Processing of Search Window

This is the last phase of the CUDA kernel implementation, which processes the search window at the original scale by applying the classifiers. The kernel uses the results of the previous modules and launches threads over the M×M search window to compute features until the classification decision is made. Finally, it outputs the coordinates of the detected face if a face is present. The GPUCV framework in OpenCV employs an approach called stage-parallel processing, which parallelizes the processing inside a single cascade stage. Thus, an entire linear CUDA block works on only one search window and processes one classifier stage at a time.

First, each CUDA thread picks up one weak classifier at the start of the stage and computes its value on the given search region, assuming that the CUDA block dimension is smaller than the number of weak classifiers in the stage. Upon completion, it accumulates the computed value in a register and moves on to the weak classifier located blockDim.x positions further. After all weak classifiers are evaluated, the threads perform a parallel reduction of their local accumulators, and the stage value is compared with the stage threshold to make the classification decision. The main benefit of this method is that it resolves the issues of sparse memory accesses and work starvation. However, it is only efficient in the second half of the cascade classifier, where the stages comprise a large number of weak classifiers.

Figure 12: Stage-parallel processing [22].
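Stage-parallel processing can be emulated sequentially in Python to show the strided work assignment and the reduction. In this sketch, `block_dim` plays the role of blockDim.x, each emulated thread accumulates the weak classifiers at positions t, t + block_dim, and so on, and the accumulators are then combined by a tree reduction. The function names are hypothetical, the weak classifier outputs are given as precomputed numbers, and the block size is assumed to be a power of two.

```python
def stage_parallel_sum(weak_values, block_dim):
    """Sum the weak classifier outputs the way a linear CUDA block would."""
    # Phase 1: strided accumulation, one local accumulator per thread.
    accumulators = [0.0] * block_dim
    for thread in range(block_dim):
        for index in range(thread, len(weak_values), block_dim):
            accumulators[thread] += weak_values[index]

    # Phase 2: tree reduction of the local accumulators.
    stride = block_dim // 2
    while stride > 0:
        for thread in range(stride):
            accumulators[thread] += accumulators[thread + stride]
        stride //= 2
    return accumulators[0]


def stage_passes(weak_values, stage_threshold, block_dim=8):
    """Classification decision for one stage."""
    return stage_parallel_sum(weak_values, block_dim) >= stage_threshold


weak_values = [float(v) for v in range(1, 21)]  # 20 hypothetical weak outputs
stage_value = stage_parallel_sum(weak_values, block_dim=8)
```

With only a handful of weak classifiers per stage, most emulated threads would have nothing to do, which matches the observation that the method pays off mainly in the later, larger stages.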

V. RESULTS AND DISCUSSION

The test environment of the proposed GPU acceleration of face detection is shown in Table 2.

TABLE 2 : Test Environment

| Item | Specification |
|---|---|
| OS | Windows 7 64 bits |
| CPU | Intel® Xeon (2.40GHz, 8 cores) |
| Memory | 4GB (DDR3) |
| GPU | NVidia C2075, NVidia K20, NVidia K40 |
| Software | VS2010, CUDA 5.5, OpenCV 2.4.9 |

The implementation uses the OpenCV cascade of classifiers for frontal faces, with approximately 2730 Haar-like classifiers. The detection parameters were a minimum search window size of 20x20 and a window scale factor of 1.1. Experiments were conducted on both static images and live VGA-resolution video feeds at 24 fps from an IP camera; some of the detection results are shown in Figures 13 and 14, respectively.

Furthermore, the proposed detection algorithm was tested on several NVIDIA GPU cards to compare their performance. Figure 15 gives a comparative analysis of the face detection application on the CPU and with the proposed GPU acceleration in terms of fps. The maximum performance of our proposed implementation is 37.91 fps on the NVIDIA K40, compared with 1.65 fps for the OpenCV CPU implementation. In addition, Table 3 shows the response time of the Viola-Jones algorithm running on the CPU and on the different GPUs.

All results were collected by running the same video source of 24 fps at VGA resolution (640x480). The face detection application takes much longer on the CPU than on the GPU because the CPU version runs serially and waits until all faces in a frame are detected. In summary, the CUDA-based GPU acceleration of Viola-Jones face detection achieves a maximum speedup of 22 times compared with the CPU version and is also faster than the fastest FPGA implementation at 16 fps. In addition, the proposed system is capable of dealing with minor pose variations of up to 15 degrees.

Figure 13: Face detection on static image.

Figure 14: Face detection via live video captured frames.

TABLE 3: Response time performance of the proposed GPU acceleration of face detection, in comparison with other work

| Hardware | CPU | Proposed GPU Acceleration (NVidia C2075) | Proposed GPU Acceleration (NVidia K20) | Proposed GPU Acceleration (NVidia K40) |
|---|---|---|---|---|
| Response Time (s) | 0.55237 | 0.0308 | 0.02718 | 0.02394 |
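As a quick cross-check of Table 3, the response times can be converted into derived frame rates and speedups with a few lines of Python. The response times are taken from the table; the derived values are simple reciprocals and ratios, so they need not coincide exactly with the measured frame rates reported in the text, which may include other overheads.

```python
# Response times in seconds, as listed in Table 3.
response_time_s = {
    "CPU": 0.55237,
    "C2075": 0.0308,
    "K20": 0.02718,
    "K40": 0.02394,
}

cpu_time = response_time_s["CPU"]

# Derived throughput (1 / response time) and speedup relative to the CPU.
derived = {
    name: {"fps": 1.0 / t, "speedup": cpu_time / t}
    for name, t in response_time_s.items()
}

for name, row in derived.items():
    print(f"{name:>6}: {row['fps']:6.2f} fps, {row['speedup']:5.2f}x vs CPU")
```

Read this way, the table spans a speedup of roughly 18 times on the C2075 to roughly 23 times on the K40.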



Figure 15: FPS performance of the proposed GPU acceleration of face detection, in comparison with other work

VI. CONCLUSION

In this paper, a GPU-accelerated implementation of the Viola-Jones face detector using the GPUCV framework was presented. A prototype system was built and a series of extensive evaluations was carried out to examine the acceleration on different GPU cards in comparison with the CPU implementation. Experimental results show that the proposed GPU acceleration achieves up to 22 times computational speedup on 640x480 resolution frames on the NVIDIA K40, compared with the CPU-based implementation.

REFERENCES

[1] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, vol. 1, pp. 511-528.

[2] Dice Holdings, Inc., (2015). Open Source Computer Vision Library, [Online]. Available: http://sourceforge.net/projects/opencvlibrary/.

[3] M. Hsuan Yang et al., "A SNoW-Based Face Detector," in Advances in Neural Information Processing Systems 12. MIT Press, 2000, pp. 855-861.

[4] H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to Faces and Cars," IEEE CVPR 2000, vol. 1, pp. 746-751.

[5] P. Viola and M. J. Jones, "Robust Real-Time Face Detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.

[6] J. P. Harvey, "GPU Acceleration of Object Classification Algorithms Using NVIDIA CUDA," Master's Thesis, Rochester Institute of Technology, Rochester, NY, United States, 2009.

[7] T. Theocharides et al., "A Parallel Architecture for Hardware Face Detection," IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, 2006, pp. 452-453.

[8] NVIDIA Corporation, NVIDIA CUDA Zone, 2015. [Online]. Available: http://developer.nvidia.com/category/zone/cuda-zone.

[9] J. Cho et al., "FPGA-Based Face Detection System Using Haar Classifiers," Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2009, pp. 103-112.

[10] J. Cho et al., "Parallelized Architecture of Multiple Classifiers for Face Detection," Proceedings of the 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors. Washington, DC, USA: IEEE Computer Society, 2009, pp. 75-82.

[11] A. Dutta et al., "Real Time Face Tracking and Recognition (RTFTR)," Tribhuvan University-Institute of Engineering, Lalitpur, Nepal, 2009.

[12] D. Hefenbrock et al., "Accelerating Viola-Jones Face Detection to FPGA Level Using GPUs," 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2010, pp. 11-18.

[13] R. Lienhart and J. Maydt, "An Extended Set of Haar-like Features for Rapid Object Detection," 2002 International Conference on Image Processing, 2002, pp. I-900.

[14] Ren Meng et al., "Acceleration Algorithm for CUDA-based Face Detection," 2013 International Conference on Signal Processing, Communication and Computing, 2013, pp. 1-5.

[15] Shivashankar J. Bhutekar et al., "Parallel Face Detection and Recognition on GPU," International Journal of Computer Science and Information Technologies, vol. 5 (2), pp. 2013-2018, 2014.

[16] Hadi Santoso et al., "A Parallel Architecture for Multiple-Face Detection Technique Using Adaboost Algorithm and Haar Cascade," Information Systems International Conference (ISICO), Dec 2013, pp. 592-597.

[17] Karl Berggren and Pär Gregersson, "Camera focus controlled by face detection on GPU," Master's Thesis, Lund University, Lund, Sweden, 2008.

[18] Laurentiu Acasandrei et al., "Accelerating Viola-Jones Face Detection for Embedded and SoC Environments," Fifth ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), Aug 2011, pp. 1-6.

[19] Tiba, (2014), Face and Eyes Detection Using Haar Cascades [Online]. Available: http://blog.tibarazmi.com/face-and-eyes-detection-using-haar-cascades/.

[20] Saehanseul Yi et al., "Real-time Integrated Face Detection and Recognition on Embedded GPGPUs," IEEE 12th Symposium on Embedded Systems for Real-time Multimedia, 2014, pp. 98-107.

[21] Whisky and Rum Bestellen, (2012) CUDA Programming [Online]. Available: http://cuda-programming.blogspot.com/2013/01/thread-and-block-heuristics-in-cuda.html.

[22] Wen-mei W. Hwu (ed.), GPU Computing Gems, Emerald Edition, Elsevier Inc., 2011, pp. 526-537.

[Figure 15 vertical axis: Frames per second (FPS), scale 0 to 40]
