Owusu et al. Visual Computing for Industry, Biomedicine, and Art (2022) 5:14. https://doi.org/10.1186/s42492-022-00109-0

ORIGINAL ARTICLE

Robust facial expression recognition system in higher poses

Ebenezer Owusu1, Justice Kwame Appati1* and Percy Okae2

Abstract
Facial expression recognition (FER) has numerous applications in computer security, neuroscience, psychology, and engineering. Owing to its non-intrusiveness, it is considered a useful technology for combating crime. However, FER is plagued by several challenges, the most serious of which is its poor prediction accuracy under severe head poses. The aim of this study, therefore, is to improve recognition accuracy in severe head poses by proposing a robust 3D head-tracking algorithm based on an ellipsoidal model and an advanced ensemble of AdaBoost and support vector machines (SVM). The FER features are tracked from one frame to the next using the ellipsoidal tracking model, and the visible expressive facial key points are extracted using Gabor filters. The ensemble algorithm (Ada-AdaSVM) is then used for feature selection and classification. The proposed technique is evaluated using the Bosphorus, BU-3DFE, MMI, CK+, and BP4D-Spontaneous facial expression databases. The overall performance is outstanding.

Keywords: Facial expressions, Three-dimensional head pose, Ellipsoidal model, Gabor filters, Ada-AdaSVM

Introduction
Applications
Facial expression recognition (FER) is the automatic detection of the emotional state of a human face using computer-based technology. The field is currently a research hotspot because of its growing applications in several domains, such as psychology, sociology, health science, transportation, gaming, communication, security, and business. According to Panksepp [1], facial expressions and emotions guide people's lives in a variety of ways, and emotions are key signals that inform how we should act, from elementary processes to the most intricate acts [2, 3].

Sporadic advances in the use of facial expressions for neuropsychiatric complications have shown increasingly positive results [4], and current studies focus on human behavior and the detection of mental illnesses [5, 6]. FER can also support data collection in specific research projects. For example, Shergill et al. [7] proposed an intelligent assistant FER framework that could be implemented in e-commerce to determine customers' product preferences. The system captures facial data as customers browse the e-shop for products to acquire.
Based on the facial expression, the system can automatically suggest more products of possible interest.

Certain physiological features of people have been found to be useful as intelligence data in the search for criminals [8, 9]. This theory rests on the tendency of a person intent on committing a high-profile crime, such as terrorism, to exhibit specific emotions such as anger and fear. Consequently, accurate recognition of these expressions could support further security measures in apprehending criminals.

FER can also be valuable during the testing phase of video games. Target groups are frequently invited to play a game for a set amount of time, and their behaviors and emotions are observed as they play. Using FER technology, game developers can draw additional insights and valuable deductions from the emotions recorded during gameplay and incorporate the feedback into production.

Technical issues on the use of two-dimensional facial data
Two-dimensional (2D) FER systems are extremely sensitive to head orientation. Therefore, to achieve good results, the subject must constantly remain in a fronto-parallel orientation. As a result, the throughput of most site-access systems is significantly reduced: subjects are frequently required to perform several verifications to attain an ideal facial orientation. Surveillance systems, in turn, operate on luck, hoping that the subject faces the camera.

Another problem arising from 2D technology is the illumination of the surrounding environment. If the subject is in a setting with varying lighting conditions, FER accuracy drops because the FER processes are sensitive to the direction of lighting and the ensuing shading pattern. Cast shadows may also obstruct recognition by concealing informative features. Three-dimensional (3D) FER systems achieve a higher detection rate than 2D systems because of their additional intensity modality and richer geometric description of the object [10, 11]. This demonstrates the importance of pushing FER toward higher face orientations to improve its realism and practicality.

Related work
The primary focus of this study is to improve FER accuracy at higher facial orientations. Yadav and Singha [12] adopted the Viola-Jones descriptor [13] to detect faces and used a combination of local binary patterns (LBP) and the histogram of oriented gradients (HOG) for feature extraction. A traditional SVM with the k-means method was then employed as the training algorithm. Feature-extraction techniques such as LBP and Gabor are orientation-selective and thus highly robust in tracking key facial features. However, the Viola-Jones descriptor is computationally demanding and has a low detection accuracy. Furthermore, the conventional SVM described in the study is slow to classify. Consequently, the overall architecture used in the study was computationally expensive. Yao et al.
[14] proposed a linear SVM method that used action units (AUs) to recognize seven facial expression prototypes in the CK database. The Viola-Jones descriptor was again used for face detection. Although the goal of the study was to minimize computational complexity and enhance recognition accuracy, the resulting average recognition accuracy of 94.07% for females and 90.77% for males was too low for a viable implementation. Ashir et al. [15] also proposed an SVM-based multiclass classification for detecting seven facial expressions across four prominent databases. The Nyquist-Shannon sampling method [16] was used to compress the extracted facial feature samples. Although this sampling method reduces data loss, it is prone to aliasing, particularly when the bandwidth is extremely large. The Nyquist-Shannon technique is also difficult to deploy because it assumes that the sampled signal is completely band-limited; in real-world applications this is a concern because no actual signal is genuinely band-limited. The compressive sampling [17] paradigm could have been a better option because it is less restrictive. Perez-Gomez et al. [18] recently proposed a 2D-3D FER system that used principal component analysis (PCA) and a genetic algorithm for feature selection, and a k-nearest neighbor (KNN)-multiclass SVM for learning. In that study, the synthetic minority oversampling technique (SMOTE) [19] was used to balance the instances. However, SMOTE creates an equal number of synthetic samples for each minority data sample and relies on the hypothesis performance to update the distribution function. The adaptive synthetic (ADASYN) [20] method, by contrast, generates more synthetic data for minority-class samples that are harder to learn, rather than for the easy-to-learn samples favored by SMOTE. In addition, PCA uses observations from all the extracted features in the projection to the subspace and considers only linear relationships, ignoring the multivariate structure of the input. Compared with other recent studies, the findings of that study were not encouraging.

Li et al. [21] proposed a robust 3D local coordinate technique for extracting pose-invariant facial features at key points. The descriptor in this method is a multi-task sparse representation fine-grained matching algorithm. The method was evaluated on the Bosphorus datasets, and an average recognition accuracy of 98.9% was obtained. The success of that study is largely owed to the accurate tracking of 3D key points, and it is a primary driving force behind the work proposed here.

The following are the significant contributions of this work: (1) a robust head-tracking algorithm that tracks facial features from one frame to the next, accounting for more features in the overall prediction process; (2) a unique ensemble approach that employs AdaBoost for feature selection and a combination of AdaBoost and SVM for classification. AdaBoost is extremely fast, whereas SVM is highly accurate; consequently, the proposed technique is very fast while also improving recognition accuracy.

The remainder of this paper is organized as follows. The Methods section delves into the proposed strategy. The Results and discussion section presents the findings, discussion, and analyses. Finally, the Conclusions section concludes the study.

Methods
We robustly tracked the facial features from one frame to the next using 3D facial data.
With 3D data, information such as the size and shape of an object can be correctly estimated in each frame without prior assumptions. The first priority is to detect the focal points in each frame. The next step is to search for matching features or objects across all frames. This method accounts for the changing behavior of a moving object and the preceding annotations of the scene. In this approach, the location of an object is predicted by iteratively updating the object position from previous frames [22, 23].

Architectural framework
Figure 1 presents the framework of this study. The procedure uploads images and robustly tracks the features across frames using the proposed ellipsoidal model. Subsequently, the Gabor feature-extraction approach is applied; the Feature points extraction section explains the reason for using Gabor features in this study. Feature selection and classification are executed using the Ada-AdaSVM.

Fig. 1 Architectural framework of this study

Ellipsoidal feature tracking method
Accurate tracking of a human face from the forehead, to the left cheek, to the chin, to the right cheek, and back to the same spot on the forehead where the tracking began demonstrates that the human face is best approximated by an ellipse. Thus, considering the 3D facial representation in Fig. 2 with N feature points tracked across frames, we denote

\alpha(t) = \{ f_j(t) \mid 1 \le j \le N \}    (1)

where N represents the most relevant feature points. In this study, we assumed N to be 24. In addition, let f_j(t) \in \alpha(t) denote a facial feature. As the features move from one frame to the next, the position of feature f_j(t) at time t + 1 becomes f_j(t + 1); therefore, f_j(t + 1) \in \alpha(t + 1). Assuming that Y_j is the position of \alpha_j on the 3D facial model and \alpha_{j,p}[\phi(t + 1)] represents its back projection onto the image plane, the 3D facial orientation at t + 1 is the vector \phi(t + 1) that minimizes \sum_{j=1}^{N} S_j^2, where

S_j[\phi(t + 1)] = \| \alpha_{j,p}[\phi(t + 1)] - \alpha_j(t + 1) \|    (2)

Fig. 2 Tracking of 3D feature points from one frame to another

This is a multi-view system based on the assumption that cameras are positioned around the subject to capture the various rotational movements. Consequently, the facial image can be captured with a high degree of precision in any orientation. We extracted the features in the same manner as for 2D images. The right and left eyes, the lips, and the muscles around the cheeks are important parts of the face to consider, because slight disruptions primarily and severely distort the muscles in these places. The Gabor technique is then used to extract the features of the captured face. The algorithm models a procedure that chooses a set of features and robustly tracks them from one frame to the next while discarding all other features that are no longer required for tracking. The ellipsoidal 3D face is modelled as shown in Fig. 3.

Adopting homogeneous coordinates for an ellipsoid with semi-axes a, b, and c, a point X_0 = (x_0, y_0, z_0, 1) belongs to the surface of the ellipsoid if X_0^T E_0 X_0 = 0, where

E_0 = \begin{pmatrix} b^2 c^2 & 0 & 0 & 0 \\ 0 & a^2 c^2 & 0 & 0 \\ 0 & 0 & a^2 b^2 & 0 \\ 0 & 0 & 0 & -a^2 b^2 c^2 \end{pmatrix}    (3)

Fig. 3 Ellipsoidal face model

Fig. 4 Model of feature extraction points in 3D
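For illustration, the short sketch below builds the quadric matrix E_0 of Eq. (3) and checks whether a homogeneous point X_0 satisfies X_0^T E_0 X_0 = 0. It is a minimal numerical illustration under our own assumptions (the function names, example semi-axes, and normalized tolerance are ours), not part of the authors' implementation.

```python
import numpy as np

def ellipsoid_quadric(a: float, b: float, c: float) -> np.ndarray:
    """Quadric matrix E0 of Eq. (3) for an ellipsoid with semi-axes a, b, c."""
    return np.diag([b**2 * c**2, a**2 * c**2, a**2 * b**2, -(a**2 * b**2 * c**2)])

def on_ellipsoid(point_xyz, a, b, c, tol=1e-9):
    """Test X0^T E0 X0 = 0 for the homogeneous point X0 = (x0, y0, z0, 1).

    The residual is divided by a^2 b^2 c^2 so that `tol` is comparable across
    ellipsoids of different sizes (our own normalization, not from the paper).
    """
    E0 = ellipsoid_quadric(a, b, c)
    X0 = np.append(np.asarray(point_xyz, dtype=float), 1.0)
    residual = X0 @ E0 @ X0
    return abs(residual) / (a**2 * b**2 * c**2) < tol

# Example: a point on the surface, (a, 0, 0), satisfies the quadric exactly,
# whereas the centre of the ellipsoid does not.
print(on_ellipsoid((9.0, 0.0, 0.0), a=9.0, b=12.0, c=11.0))  # True
print(on_ellipsoid((0.0, 0.0, 0.0), a=9.0, b=12.0, c=11.0))  # False
```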
The algorithm tracks the facial features that are made most noticeable by slight deformation from one frame to the next using the brightness change constraint [24]. These muscles usually lie near the eyes, mouth, cheeks, and edges, as shown in Fig. 4 and along contour \tau in Fig. 3.

Suppose that a pixel with luminance I(x, y) moves from position (x, y)^T at frame t to position (x + u, y + v)^T at frame t + 1 under a high frame rate. In this instance, we can deduce that

I(x + u, y + v, t + 1) = I(x, y, t)    (4)

Applying a Taylor series expansion, with I_x and I_y the spatial gradients and I_t the temporal derivative of the image, we infer that

\left[ I_x(x, y, t) \;\; I_y(x, y, t) \right] \begin{pmatrix} u \\ v \end{pmatrix} + I_t(x, y, t) = 0    (5)

If a whole window \omega_k is considered instead of a single pixel, we deduce that

\left[ \sum_{\omega_k} I_x(x, y, t) \;\; \sum_{\omega_k} I_y(x, y, t) \right] \begin{pmatrix} u_k \\ v_k \end{pmatrix} + \sum_{\omega_k} I_t(x, y, t) = 0    (6)

The solution of Eq. (6) is an optimization problem. Introducing the cost function, it follows that

J(u, v) = \left\{ \left[ \sum_{\omega_k} I_x(x, y, t) \;\; \sum_{\omega_k} I_y(x, y, t) \right] \begin{pmatrix} u_k \\ v_k \end{pmatrix} + \sum_{\omega_k} I_t(x, y, t) \right\}^2    (7)

The optimal displacement vector that determines the new position of the face window \omega_k is given by

\begin{pmatrix} u_k \\ v_k \end{pmatrix} = \arg\min_{(u, v) \in \mathbb{R}^2} J(u, v)    (8)

where (u_k, v_k) represents the window at its new position. Computing the derivatives of J with respect to u and v and equating them to zero yields

C_k \begin{pmatrix} u_k \\ v_k \end{pmatrix} + D_k = 0    (9)

where C_k = \begin{pmatrix} \sum_{\omega_k} I_x^2 & \sum_{\omega_k} I_x I_y \\ \sum_{\omega_k} I_x I_y & \sum_{\omega_k} I_y^2 \end{pmatrix} and D_k = \begin{pmatrix} \sum_{\omega_k} I_x I_t \\ \sum_{\omega_k} I_y I_t \end{pmatrix}.

Assuming that I : [1, m] \times [1, n] \subseteq \mathbb{N}^2 \to [0, 1] is the matrix of the 3D face, the jth level of the pyramid description of the face image is expressed by the recursion

I^j(x, y) = \begin{cases} I(x, y), & j = 0 \\ \frac{1}{4} I^{j-1}(2x, 2y) + \frac{1}{8}\left[ I^{j-1}(2x - 1, 2y) + I^{j-1}(2x + 1, 2y) + I^{j-1}(2x, 2y - 1) + I^{j-1}(2x, 2y + 1) \right] + \frac{1}{16}\left[ I^{j-1}(2x - 1, 2y - 1) + I^{j-1}(2x + 1, 2y + 1) + I^{j-1}(2x + 1, 2y - 1) + I^{j-1}(2x - 1, 2y + 1) \right], & j \neq 0 \end{cases}    (10)

The displacement vector in Eq. (9) can also be rewritten as

\begin{pmatrix} u_k \\ v_k \end{pmatrix} = -C_k^{-1} D_k    (11)

The displacement vector in Eq. (11) is computed at the deepest pyramid level j_{max} (in the Newton-Raphson fashion), and the result is propagated to the upper level j_{max} - 1 by

\begin{pmatrix} u_k^{j-1} \\ v_k^{j-1} \end{pmatrix} = 2 \begin{pmatrix} u_k^{j} \\ v_k^{j} \end{pmatrix}    (12)

Equation (12) is used as the initial estimate for evaluating the displacement vector of the 3D face at the next level. The final displacement vector is given by

\begin{pmatrix} u_k \\ v_k \end{pmatrix} = \sum_{j=0}^{j_{max}} 2^j \begin{pmatrix} u_k^{j} \\ v_k^{j} \end{pmatrix}    (13)

The visible features of the face can then be extracted from any location on the face, just as for a 2D face. The extracted features are candidates for predicting the overall expression of the face, and the Gabor extraction technique is critical for extracting the maximum amount of information required by the classifier.
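To make the tracking step concrete, the sketch below estimates the per-window displacement of Eqs. (6)-(11) and accumulates it over a coarse-to-fine pyramid as in Eqs. (12)-(13). It is a simplified NumPy illustration under our own assumptions (plain finite-difference gradients, a fixed square window, a 2 × 2 box average in place of the weighted kernel of Eq. (10), and no iterative refinement within a level), not the authors' released code.

```python
import numpy as np

def lk_displacement(prev, curr, x, y, half_win=7):
    """Solve Eq. (9), C_k (u_k, v_k)^T + D_k = 0, for one feature window.

    prev, curr: grayscale frames at t and t+1 as float arrays in [0, 1].
    (x, y): integer feature position; half_win: half of the window size.
    """
    ys = slice(y - half_win, y + half_win + 1)
    xs = slice(x - half_win, x + half_win + 1)
    Ix = np.gradient(prev, axis=1)[ys, xs].ravel()   # spatial gradients of frame t
    Iy = np.gradient(prev, axis=0)[ys, xs].ravel()
    It = (curr - prev)[ys, xs].ravel()               # temporal derivative
    C = np.array([[Ix @ Ix, Ix @ Iy],
                  [Ix @ Iy, Iy @ Iy]])
    D = np.array([Ix @ It, Iy @ It])
    # Eq. (11): (u_k, v_k)^T = -C_k^{-1} D_k, lightly regularized for stability.
    return -np.linalg.solve(C + 1e-8 * np.eye(2), D)

def pyramid(img, levels):
    """Coarse-to-fine pyramid; a 2x2 box average stands in for the Eq. (10)
    kernel. Assumes the frame sides are divisible by 2**levels."""
    pyr = [img]
    for _ in range(levels):
        p = pyr[-1]
        pyr.append(0.25 * (p[0::2, 0::2] + p[1::2, 0::2] +
                           p[0::2, 1::2] + p[1::2, 1::2]))
    return pyr

def track_feature(prev, curr, x, y, j_max=3):
    """Accumulate per-level displacements following Eqs. (12)-(13)."""
    prev_pyr, curr_pyr = pyramid(prev, j_max), pyramid(curr, j_max)
    u = v = 0.0
    for j in range(j_max, -1, -1):                   # deepest level first
        du, dv = lk_displacement(prev_pyr[j], curr_pyr[j],
                                 int(round(x / 2**j)), int(round(y / 2**j)))
        u += 2**j * du                               # Eq. (13): sum of 2^j d^j
        v += 2**j * dv
    return x + u, y + v                              # new feature position
```

In practice, a production system would typically rely on an optimized pyramidal Lucas-Kanade implementation such as OpenCV's calcOpticalFlowPyrLK rather than this illustrative version.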
Feature points extraction
The 2D Gabor filters are spatial sinusoids localized by a Gaussian window; because they are orientation-, localization-, and frequency-selective, they are useful in this study. Representing images using Gabor wavelets provides flexibility because the details of their spatial relations are preserved in the process. The general form of the Gabor function is given by

G(x, y, \theta, u, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{x^2 + y^2}{2\sigma^2} \right\} \exp\left[ 2\pi i (R_1 + R_2) \right]    (14)

where R_1 = u x \cos\theta and R_2 = u y \sin\theta, u is the spatial frequency of the band pass, \theta is the spatial orientation, \sigma is the standard deviation of the 2D Gaussian envelope, and (x, y) is the position of the light impulse in the visual field. To allow for more robustness to illumination, we set the filter to zero direct current. The Gabor wavelet is then given by

\tilde{G}(x, y, \theta, u, \sigma) \simeq G(i, j, \theta, u, \sigma) = \frac{1}{q} \left[ \sum_{i=-n}^{n} \sum_{j=-n}^{n} G(x, y, \theta, u, \sigma) \right]    (15)

where (x, y, \theta, u, \sigma) are parameters with (i, j) being the new position of the 2D input point, \theta is the scale, u is the orientation of the Gabor kernel, \sigma is the standard deviation of the Gaussian window in the kernel, n is the maximum size of the face peak, and q is the size of the filter, given by q = (2n + 1)^2. In this study, we used 8 orientations given by \{0, \pi/8, \pi/4, 3\pi/8, \pi/2, 5\pi/8, 3\pi/4, 7\pi/8\} and 5 scales given by \{4, 4\sqrt{2}, 8, 8\sqrt{2}, 16\}. The sample points of the filtered image are coded into two bits (x_1, x_2) such that

G_1 = \begin{cases} x_1 = 1, & \text{if } \Re[\tilde{G}(x, y, \theta, u, \sigma)] * I \ge 0 \\ x_1 = 0, & \text{if } \Re[\tilde{G}(x, y, \theta, u, \sigma)] * I < 0 \end{cases}    (16)

G_2 = \begin{cases} x_2 = 1, & \text{if } \Im[\tilde{G}(x, y, \theta, u, \sigma)] * I \ge 0 \\ x_2 = 0, & \text{if } \Im[\tilde{G}(x, y, \theta, u, \sigma)] * I < 0 \end{cases}    (17)

where I is a sub-image of the expressional face, \Re and \Im are the real and imaginary parts of each Gabor kernel, respectively, and the star (*) is the convolution operator. The final magnitude response, representing the feature vectors, is computed as the square root of the sum of the squares of G_1 and G_2. Figure 5 shows the magnitude response of a template image.

Fig. 5 Gabor magnitude response of the expressive face image: sample image (left), magnitude response image of the whole Gabor filter bank of 40 Gabor filters (right)

Classification using Ada-AdaSVM
For this optimization problem, an SVM with a radial basis function (RBF) kernel is used as the weak classifier. This weak SVM classifier is trained to produce the optimum Gaussian value for the scale parameter \delta and the regularization parameter \partial; typically, the best parameters are \partial = 1.0 and \delta = 0.1. The feature selection hypothesis is then computed from the expression \mathrm{sgn}\left[ \sum_{t=1}^{T} \omega_t h_t^1(\phi_t^1) \right], where T is the final iteration, h_t^1 is the hypothesis with the most discriminating information, and \omega_t are weights that weight h_t^1 according to its classification performance. The learning process, formulated in our recent study [25], is as follows:

Step 1: Input the training set [(y_1, x_1), (y_2, x_2), \ldots, (y_N, x_N)], N = a + b, where subsets a and b comprise the y_i = +1 and y_i = -1 samples, respectively. Initialize \delta = \delta_{ini} with lower limit \delta_{min} and step \delta_{step}. The scale parameter is \delta, and x and y are the feature vectors selected by the AdaBoost algorithm.

Step 2: Initialize the training set weights, w_i^{(1)} = 1/(2a) for all y_i = +1 and w_i^{(1)} = 1/(2b) for all y_i = -1. Do while \delta > \delta_{min}:

Step 3: Apply the RBF-SVM kernel to train on the weighted training data using the leave-one-subject-out cross-validation (LOSOCV) approach, and compute the training error of the weak classifier h_t as

\xi_t = \sum_{i=1}^{N} w_i^t, \quad y_i \neq h_t(x_i)    (18)

Step 4: If \xi_t \ge 1/2, reduce \delta by a factor of \delta_{step} and jump to Step 1.

Step 5: Set the weight of the constituent classifier h_t such that

\alpha_t = \frac{1}{2} \ln\left[ \frac{1}{\xi_t} - 1 \right]    (19)

Step 6: Update the weights by computing

w_i^{t+1} = \frac{w_i^t \exp\{ -\alpha_t y_i h_t(x_i) \}}{N_t}    (20)

where N_t is a normalization constant such that \sum_{i=1}^{n} w_i^{t+1} = 1.

Step 7: The final classifier is given by

H(x) = \mathrm{sgn}\left[ \sum_{t=1}^{T} \alpha_t h_t(x) \right]    (21)

The LOSOCV error is given by the expression \frac{1}{2n} \sum_{i=1}^{n} |f_i(x_i) - l_i|, where n represents the total number of trained data.
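The sketch below illustrates the boosting loop of Steps 1-7 (Eqs. 18-21) using RBF-SVM weak learners from scikit-learn. It is a simplified, single-split illustration under our own assumptions: the LOSOCV protocol and the AdaBoost-based Gabor feature selection are omitted, the mapping of the scale parameter \delta to scikit-learn's gamma is our choice, and parameter names such as delta_ini merely follow the text. It is not the authors' released implementation.

```python
import numpy as np
from sklearn.svm import SVC

def ada_svm_train(X, y, delta_ini=10.0, delta_min=0.1, delta_step=2.0,
                  reg=1.0, max_rounds=50):
    """AdaBoost with weighted RBF-SVM weak learners (sketch of Steps 1-7).

    y must take values in {-1, +1}. `delta` plays the role of the RBF scale
    parameter; mapping it to sklearn's gamma as 1 / (2 * delta**2) is our
    own assumption.
    """
    a, b = np.sum(y == 1), np.sum(y == -1)
    w = np.where(y == 1, 1.0 / (2 * a), 1.0 / (2 * b))        # Step 2
    delta, learners, alphas = delta_ini, [], []

    for _ in range(max_rounds):
        if delta <= delta_min:                                # do while delta > delta_min
            break
        clf = SVC(kernel="rbf", C=reg, gamma=1.0 / (2 * delta**2))
        clf.fit(X, y, sample_weight=w)                        # Step 3: weighted training
        pred = clf.predict(X)
        err = np.sum(w[pred != y])                            # Eq. (18)
        if err >= 0.5:                                        # Step 4: learner too weak
            delta /= delta_step
            continue
        alpha = 0.5 * np.log(1.0 / max(err, 1e-12) - 1.0)     # Eq. (19)
        w = w * np.exp(-alpha * y * pred)                     # Eq. (20)
        w /= w.sum()                                          # normalization constant N_t
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def ada_svm_predict(X, learners, alphas):
    """Eq. (21): sign of the alpha-weighted vote of the weak SVMs."""
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(scores)
```

For multiclass FER, this binary ensemble would typically be wrapped in a one-vs-rest scheme, as is standard for SVM-based classifiers.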
Facial expression datasets
The algorithm was trained and tested on five popular datasets, Bosphorus, BU-3DFE, MMI, CK+, and BP4D-Spontaneous, and executed on a 4-CPU processor of approximately 2.2 GHz with 8192 MB of RAM.

Results and discussion
Experiments on databases
Bosphorus contains 4666 images of 105 subjects [26], comprising 60 men and 45 women, the majority being Caucasian and 27 of whom were professional actors, in various poses, expressions, and occlusion conditions. In addition to the 6 basic emotional expressions, various systematic head poses (13 yaw and pitch rotations) are present. The texture images have a resolution of 1600 × 1200 pixels, whereas the 3D faces comprise approximately 35,000 vertices [27]. Figure 6 presents sample data from Bosphorus. Occlusion images were discarded because they were not the focus of this study; the data used comprised 6 poses and 7 expressions. The images were partitioned into training and testing sets using the conventional LOSOCV approach: one specimen from each of the 6 groups of expressions was used as the test set during each training run, whereas the remaining samples were used as the training set. Table 1 summarizes the FER results on Bosphorus.

Fig. 6 Sample Bosphorus datasets

Table 1 FER in the Bosphorus database
Average recognition accuracy = 92.7%

Pose | Expression | Average recognition (%) | Expression | Average recognition (%)
10° Yaw | Neutral | 100 | Happiness | 99.2
20° Yaw | Neutral | 99.8 | Sadness | 98.0
30° Yaw | Neutral | 99.2 | Disgust | 98.4
L45° Yaw | Neutral | 97.3 | Anger | 99.4
R45° Yaw | Neutral | 97.8 | Fear | 99.6
L90° Yaw | Neutral | 63.2 | Surprise | 99.0
R90° Yaw | Neutral | 78.2 | Overall average | 98.9
PR | Neutral | 99.7 | |
CR | Neutral | 98.9 | |

The BU-3DFE database was created at Binghamton University [28]. It contains 100 respondents ranging in age from 18 to 70 years; Whites, Blacks, East Asians, Middle East Asians, Indians, and Hispanics are among the ethnic groups represented. Each participant displayed 7 expressions, comprising neutral and the 6 archetypal facial expressions, the latter at 4 intensity levels. Figure 7 shows sample data from the database. The images were separated into training and testing sets using the same LOSOCV method as for the Bosphorus data, and the average recognition accuracy was 94.56%.

Fig. 7 Sample BU3DFE datasets

The MMI database comprises over 2900 high-resolution videos submitted by more than 20 students and research staff members, of whom 44% are female, ranging in age from 19 to 62 years. Seventy-five subjects were included in total, and Fig. 8 shows samples. The datasets were partitioned into training and testing sets using the LOSOCV technique: one sample from each of the 7 types of expressions was used as the test set during each training run, and the remaining samples were used as the training set; for each training cycle, the procedure was repeated with new test samples. The expressions included anger, disgust, fear, happiness, neutral, sadness, and surprise. The average recognition accuracy is 97.2%.

The CK+ database is an extended version of the CK database comprising 210 adults. Participants were 18 to 50 years old, with 69% female, 81% Euro-American, 13% Afro-American, and 6% from other ethnic groups. The expressions included anger, contempt, disgust, fear, happiness, sadness, and surprise. Figure 9 presents sample datasets.
A tenfold cross-validation procedure was used to partition the datasets into training and testing sets. The average recognition accuracy is 99.48%.

Fig. 8 Sample MMI datasets

Fig. 9 Sample images in CK+ database

Finally, the BP4D-Spontaneous dataset is a 3D video collection of spontaneous facial expressions from young individuals. The database comprises 41 subjects (23 women and 18 men) ranging in age from 18 to 29 years, including 11 Asians, 6 African-Americans, 4 Hispanics, and 20 Euro-Americans. Figure 10 shows sample images. We extracted expressions of anger, disgust, fear, pain, happiness, sadness, and surprise. The datasets were partitioned into training and testing sets using tenfold cross-validation. The average recognition accuracy is 97.2%.

Fig. 10 Sample BP4D-Spontaneous datasets

Figures 11 and 12 show the confusion matrices for facial expression and pose prediction, respectively, in the Bosphorus database. Figures 13, 14, 15, and 16 show the remaining confusion matrices for FER in the BU3DFE, MMI, CK+, and BP4D-Spontaneous databases, respectively.

Fig. 11 Confusion matrix of facial expressions in Bosphorus

Fig. 12 Confusion matrix of pose prediction in Bosphorus

Fig. 13 Confusion matrix of facial expressions in BU3DFE database

Fig. 14 Confusion matrix of facial expressions in MMI database

Fig. 15 Confusion matrix of facial expressions in CK+ database

Comparison of methods
In Table 2, the proposed method is compared with some recent techniques. The results clearly demonstrate that the proposed method is promising. Figures 17, 18, and 19 show the performance on each of the 7 facial expressions. For the BU3DFE database, many authors did not report the performance on the neutral expression; thus, the comparison was performed using the other 6. The performance shown in Fig. 17 is encouraging. Figure 18 shows the performance on the CK+ database. Although Fig. 18 depicts fierce rivalry with three current methods [29-31], the overall average recognition shows that the proposed technique is promising.
Fig. 16 Confusion matrix of facial expressions in BP4D-Spontaneous datasets

Table 2 Comparison of results of different methods

Method | Database | Recognition (%) | Ref
Twin support vector machines classifier | MMI | 92.56 ± 3.02 | [32]
DBM-DACNN with entropy loss | MMI | 79.25 | [33]
Deep learning neural network-regression | CK+ | 97.27 | [30]
Deep learning + random forest | CK+ | 99.00 | [31]
Twin support vector machines classifier | CK+ | 93.42 ± 3.25 | [32]
DBM-DACNN with entropy loss | CK+ | 96.46 | [33]
Geotopo+ | BP4D-Spontaneous | 88.56 | [34]
Two-phase weighted collaborative representation classification | BP4D-Spontaneous | 100 | [35]
Fine-grained matching of 3D keypoint descriptors | Bosphorus | 98.90 | [21]
Kernel methods on Riemannian manifold | Bosphorus | 86.70 | [36]
SVM with EPE | Bosphorus | 84.00 | [37]
Two-phase weighted collaborative representation classification | Bosphorus | 98.90 | [35]
Kernel methods on Riemannian manifold | BU-3DFE | 92.62 | [36]
SVM with EPE | BU-3DFE | 85.81 | [37]
Manifold CNN | BU-3DFE | 86.67 | [38]
CNN model | BU-3DFE | 92.57 | [39]
Proposed method | MMI | 97.20 | This study
Proposed method | CK+ | 98.20 | This study
Proposed method | BP4D-Spontaneous | 97.20 | This study
Proposed method | Bosphorus | 98.90 | This study
Proposed method | BU-3DFE | 93.50 | This study

In the Bosphorus database, the proposed method outperformed the most recent methods (Fig. 19). A comparison of the performance on the individual FER prototypes in the MMI and BP4D-Spontaneous databases could not be executed because there were no reported data for comparison at the time of compilation. Statistical analysis using ANOVA gave the following results. In the Bosphorus database, the analysis of variance demonstrated statistically significant differences between the proposed technique and the following: Hariri et al. [36] (p = 0.001), Azazi et al. [37] (p = 0.000), and Moeini and Moeini [40] (p = 0.013). The outcome is the same in BU3DFE: the variance analysis shows that a statistically significant difference (p < 0.05) exists between the proposed method and all other methods. In the CK+ FER database, however, the statistical analysis shows that, except for ref. [41], where a statistically significant difference (p < 0.05) exists, the remaining methods show no statistically significant differences (p > 0.05): the proposed method yields p = 0.847 against An and Liu [29], p = 0.909 against Ch [30], and p = 0.991 against Liao et al. [31]. Although the analysis appears to reveal a balanced performance between the proposed methodology and the last three techniques, the average recognition accuracy of the proposed method against any of them, as shown in Fig. 18, indicates that the proposed method is superior.

Fig. 17 Performance of 6 FER prototypes in BU3DFE database

Fig. 18 Performance of 6 FER prototypes in CK+ database

Fig. 19 Performance of 7 FER prototypes in Bosphorus database

Conclusions
This study improves FER performance in higher poses. 2D pose-conversion schemes have been established that handle pose-invariant FER problems successfully within small-scale pose variations. However, they often fail for large-scale, in-depth face variations because of the disjointedness of the image. Human face geometry is ellipsoidal; therefore, the feature points are robustly tracked from one frame to the next using an ellipsoidal model.
We use the Gabor feature-extraction technique for the salient visible features, mostly around the cheeks, eyes, mouth, and nose ridges. The Gabor feature-extraction algorithm is useful for this study because it is selective toward orientation, localization, and frequency. We then used an ensemble classification technique, which combines SVM and AdaBoost, for feature selection and classification. The proposed technique outperforms the most recent and popular methods. In the future, we intend to investigate this problem using other feature-extraction methods such as LBP and LBP + HOG.

Abbreviations
FER: Facial expression recognition; SVM: Support vector machine; LBP: Local binary patterns; HOG: Histogram of oriented gradients; PCA: Principal component analysis; KNN: K-nearest neighbor; SMOTE: Synthetic minority oversampling technique; 2D: Two-dimensional; 3D: Three-dimensional; LOSOCV: Leave-one-subject-out cross validation.

Acknowledgements
Not applicable.

Authors' contributions
All authors drafted this manuscript. Ideation was proposed by EO. EO and JKA developed the proposed solution. PO performed the experimentation. All authors discussed and analyzed the results of the experimentation. All authors read and approved the final manuscript.

Funding
Not applicable.

Availability of data and materials
All data used for this study are publicly available on request from the original authors.

Declarations

Competing interests
All authors declare that there is no known competing interest.

Author details
1 Department of Computer Science, University of Ghana, P. O. Box LG 163, Accra, Ghana. 2 Department of Computer Engineering, University of Ghana, P. O. Box LG 77, Accra, Ghana.

Received: 18 November 2021  Accepted: 19 April 2022

References
1. Panksepp J (2005) Affective consciousness: Core emotional feelings in animals and humans. Conscious Cogn 14(1):30-80. https://doi.org/10.1016/j.concog.2004.10.004
2. Plutchik R (2001) The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. Amer Scient 89(4):344-350. https://doi.org/10.1511/2001.4.344
3. Zautra AJ (2003) Emotions, stress, and health. Oxford University Press, Oxford.
4. Kohler CG, Martin EA, Stolar N, Barrett FS, Verma R, Brensinger C et al (2008) Static posed and evoked facial expressions of emotions in schizophrenia. Schizophr Res 105(1-3):49-60. https://doi.org/10.1016/j.schres.2008.05.010
5. Ambron E, Foroni F (2015) The attraction of emotions: irrelevant emotional information modulates motor actions. Psychon Bull Rev 22(4):1117-1123. https://doi.org/10.3758/s13423-014-0779-y
6. Kumari J, Rajesh R, Kumar A (2016) Fusion of features for the effective facial expression recognition. Paper presented at the international conference on communication and signal processing, IEEE, Melmaruvathur, 6-8 June 2016. https://doi.org/10.1109/ICCSP.2016.7754178
7. Shergill GS, Sarrafzadeh A, Diegel O, Shekar A (2008) Computerized sales assistants: the application of computer technology to measure consumer interest - a conceptual framework. J Electron Commer Res 9(2):176-191.
8. Tierney M (2017) Using behavioral analysis to prevent violent extremism: Assessing the cases of Michael Zehaf-Bibeau and Aaron Driver. J Threat Assessm Manag 4(2):98-110. https://doi.org/10.1037/tam0000082
9. Nonis F, Dagnes N, Marcolin F, Vezzetti E (2019) 3D approaches and challenges in facial expression recognition algorithms - A literature review. Appl Sci 9(18):3904. https://doi.org/10.3390/app9183904
10. Sandbach G, Zafeiriou S, Pantic M, Rueckert D (2011) A dynamic approach to the recognition of 3D facial expressions and their temporal models. Paper presented at the ninth IEEE international conference on automatic face and gesture recognition, IEEE, Santa Barbara, 21-25 March 2011. https://doi.org/10.1109/FG.2011.5771434
11. Vieriu RL, Tulyakov S, Semeniuta S, Sangineto E, Sebe N (2015) Facial expression recognition under a wide range of head poses. Paper presented at the 11th IEEE international conference and workshops on automatic face and gesture recognition, IEEE, Ljubljana, 4-8 May 2015. https://doi.org/10.1109/FG.2015.7163098
12. Yadav KS, Singha J (2020) Facial expression recognition using modified Viola-John's algorithm and KNN classifier. Multimed Tools Appl 79(19):13089-13107. https://doi.org/10.1007/s11042-019-08443-x
13. Jones M, Viola P (2003) Fast multi-view face detection. Mitsubishi Electric Research Laboratories, Cambridge.
14. Yao L, Wan Y, Ni HJ, Xu BG (2021) Action unit classification for facial expression recognition using active learning and SVM. Multimed Tools Appl 80(16):24287-24301. https://doi.org/10.1007/s11042-021-10836-w
15. Ashir AM, Eleyan A, Akdemir B (2020) Facial expression recognition with dynamic cascaded classifier. Neural Comput Appl 32(10):6295-6309. https://doi.org/10.1007/s00521-019-04138-4
16. Farrow CL, Shaw M, Kim H, Juhás P, Billinge SJL (2011) Nyquist-Shannon sampling theorem applied to refinements of the atomic pair distribution function. Phys Rev B 84(13):134105. https://doi.org/10.1103/PhysRevB.84.134105
17. Li F, Cornwell TJ, de Hoog F (2011) The application of compressive sampling to radio astronomy. I. Deconvolution. Astron Astrophys 528:A31. https://doi.org/10.1051/0004-6361/201015045
18. Perez-Gomez V, Rios-Figueroa HV, Rechy-Ramirez EJ, Mezura-Montes E, Marin-Hernandez A (2020) Feature selection on 2D and 3D geometric features to improve facial expression recognition. Sensors 20(17):4847. https://doi.org/10.3390/s20174847
19. Duan J (2019) Financial system modeling using deep neural networks (DNNs) for effective risk assessment and prediction. J Franklin Inst 356(8):4716-4731. https://doi.org/10.1016/j.jfranklin.2019.01.046
20. Kurniawati YE, Permanasari AE, Fauziati S (2018) Adaptive synthetic-nominal (ADASYN-N) and adaptive synthetic-KNN (ADASYN-KNN) for multiclass imbalance learning on laboratory test data. Paper presented at the 4th international conference on science and technology, IEEE, Yogyakarta, 7-8 August 2018. https://doi.org/10.1109/ICSTC.2018.8528679
21. Li HB, Huang D, Morvan JM, Wang YH, Chen LM (2015) Towards 3D face recognition in the real: a registration-free approach using fine-grained matching of 3D keypoint descriptors. Int J Comput Vis 113(2):128-142. https://doi.org/10.1007/s11263-014-0785-6
22. Comaniciu D, Ramesh V, Meer P (2003) Kernel-based object tracking. IEEE Trans Pattern Anal Mach Intell 25(5):564-577. https://doi.org/10.1109/TPAMI.2003.1195991
23. Hao GT, Du XP, Chen H, Song JJ, Gao TF (2015) Scale-unambiguous relative pose estimation of space uncooperative targets based on the fusion of three-dimensional time-of-flight camera and monocular camera. Opt Eng 54(5):053112. https://doi.org/10.1117/1.OE.54.5.053112
24. Dibeklioglu H, Salah AA, Akarun L (2008) 3D facial landmarking under expression, pose, and occlusion variations. Paper presented at the IEEE second international conference on biometrics: theory, applications and systems, IEEE, Washington, 29 September-1 October 2008. https://doi.org/10.1109/BTAS.2008.4699324
25. Owusu E, Wiafe I (2021) An advance ensemble classification for object recognition. Neural Comput Appl 33(18):11661-11672. https://doi.org/10.1007/s00521-021-05881-3
26. Dharavath K, Laskar RH, Talukdar FA (2013) Qualitative study on 3D face databases: A review. Paper presented at the annual IEEE India conference, IEEE, Mumbai, 13-15 December 2013. https://doi.org/10.1109/INDCON.2013.6726093
27. Sandbach G, Zafeiriou S, Pantic M, Yin LJ (2012) Static and dynamic 3D facial expression recognition: A comprehensive survey. Image Vision Comput 30(10):683-697. https://doi.org/10.1016/j.imavis.2012.06.005
28. Quan W, Matuszewski BJ, Shark LK, Ait-Boudaoud D (2009) Facial expression biometrics using statistical shape models. EURASIP J Adv Signal Process 2009:261542. https://doi.org/10.1155/2009/261542
29. An FP, Liu ZW (2020) Facial expression recognition algorithm based on parameter adaptive initialization of CNN and LSTM. Vis Comput 36:483-498. https://doi.org/10.1007/s00371-019-01635-4
30. Ch S (2021) An efficient facial emotion recognition system using novel deep learning neural network-regression activation classifier. Multimed Tools Appl 80(12):17543-17568. https://doi.org/10.1007/s11042-021-10547-2
31. Liao HB, Wang DH, Fan P, Ding L (2021) Deep learning enhanced attributes conditional random forest for robust facial expression recognition. Multimed Tools Appl 80(19):28627-28645. https://doi.org/10.1007/s11042-021-10951-8
32. Kumar MP, Rajagopal MK (2019) Detecting facial emotions using normalized minimal feature vectors and semi-supervised twin support vector machines classifier. Appl Intell 49(12):4150-4174. https://doi.org/10.1007/s10489-019-01500-w
33. Li S, Deng WH (2019) Blended emotion in-the-wild: Multi-label facial expression recognition using crowdsourced annotations and deep locality feature learning. Int J Comput Vis 127(6):884-906. https://doi.org/10.1007/s11263-018-1131-1
34. Danelakis A, Theoharis T, Pratikakis I, Perakis P (2016) An effective methodology for dynamic 3D facial expression retrieval. Pattern Recogn 52:174-185. https://doi.org/10.1016/j.patcog.2015.10.012
35. Lei YJ, Guo YL, Hayat M, Bennamoun M, Zhou XZ (2016) A two-phase weighted collaborative representation for 3D partial face recognition with single sample. Pattern Recogn 52:218-237. https://doi.org/10.1016/j.patcog.2015.09.035
36. Hariri W, Tabia H, Farah N, Benouareth A, Declercq D (2017) 3D facial expression recognition using kernel methods on Riemannian manifold. Eng Appl Artif Intell 64:25-32. https://doi.org/10.1016/j.engappai.2017.05.009
37. Azazi A, Lutfi SL, Venkat I, Fernández-Martínez F (2015) Towards a robust affect recognition: Automatic facial expression recognition in 3D faces. Expert Syst Appl 42(6):3056-3066. https://doi.org/10.1016/j.eswa.2014.10.042
38. Chen ZX, Huang D, Wang YH, Chen LM (2018) Fast and light manifold CNN based 3D facial expression recognition across pose variations. Paper presented at the 26th ACM international conference on multimedia, ACM, Seoul, 22-26 October 2018. https://doi.org/10.1145/3240508.3240568
39. Huynh XP, Tran TD, Kim YG (2016) Convolutional neural network models for facial expression recognition using BU-3DFE database. In: Kim K, Joukov N (eds) Information Science and Applications (ICISA) 2016. Lecture Notes in Electrical Engineering, vol 376. Springer, Singapore, pp 441-450. https://doi.org/10.1007/978-981-10-0557-2_44
40. Moeini A, Moeini H (2015) Real-world and rapid face recognition toward pose and expression variations via feature library matrix. IEEE Trans Inform Forensics Secur 10(5):969-984. https://doi.org/10.1109/TIFS.2015.2393553
41. Meena HK, Sharma KK, Joshi SD (2020) Effective curvelet-based facial expression recognition using graph signal processing. Signal Image Video Process 14(2):241-247. https://doi.org/10.1007/s11760-019-01547-9

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.