Owusu et al. Visual Computing for Industry, Biomedicine, and Art (2022) 5:14. https://doi.org/10.1186/s42492-022-00109-0

ORIGINAL ARTICLE

Robust facial expression recognition system in higher poses

Ebenezer Owusu1, Justice Kwame Appati1* and Percy Okae2

Abstract
Facial expression recognition (FER) has numerous applications in computer security, neuroscience, psychology, and engineering. Owing to its non-intrusiveness, it is considered a useful technology for combating crime. However, FER is plagued by several challenges, the most serious of which is its poor prediction accuracy under severe head poses. The aim of this study, therefore, is to improve recognition accuracy in severe head poses by proposing a robust 3D head-tracking algorithm based on an ellipsoidal model and an advanced ensemble of AdaBoost and support vector machines (SVM). The FER features are tracked from one frame to the next using the ellipsoidal tracking model, and the visible expressive facial key points are extracted using Gabor filters. The ensemble algorithm (Ada-AdaSVM) is then used for feature selection and classification. The proposed technique is evaluated using the Bosphorus, BU-3DFE, MMI, CK+, and BP4D-Spontaneous facial expression databases. The overall performance is outstanding.

Keywords: Facial expressions, Three-dimensional head pose, Ellipsoidal model, Gabor filters, Ada-AdaSVM

Introduction
Applications
Facial expression recognition (FER) is the automatic detection of the emotional state of a human face using computer-based technology. The field is currently a research hotspot because of its growing applications in several domains, such as psychology, sociology, health science, transportation, gaming, communication, security, and business. According to Panksepp [1], facial expressions and emotions guide people's lives in a variety of ways, and emotions are key signals that inform how we should act, from elementary processes to the most intricate acts [2, 3].

Sporadic advances in the use of facial expressions for neuropsychiatric complications have shown increasingly positive results [4], and current studies focus on human behavior and the detection of mental illnesses [5, 6]. FER can also support data collection in specific research projects. For example, Shergill et al. [7] proposed an intelligent assistant FER framework that could be implemented in e-commerce to determine customers' product preferences. The system captures facial data as customers browse the e-shop for products to acquire.
Based on the facial expression, the system can automatically suggest more products of possible interest.

Certain physiological features of people have been found to be useful as intelligence data in the search for criminals [8, 9]. This theory rests on the tendency of a person intent on committing a high-profile crime, such as terrorism, to exhibit specific emotions such as anger and fear. Consequently, accurate recognition of these expressions could support further security measures in apprehending criminals.

FER can also be valuable during the testing phase of video games. Target groups are frequently invited to play a game for a set amount of time, and their behaviors and emotions are observed as they play. Using FER technology, game developers can draw additional insights and valuable deductions from the emotions recorded during gameplay and incorporate the feedback into production.

Technical issues on the use of two-dimensional facial data
Two-dimensional (2D) FER systems are extremely sensitive to head orientation. Therefore, to achieve good results, the subject must constantly remain in a fronto-parallel orientation. As a result, the throughput of most site-access systems is significantly reduced: subjects are frequently required to perform several verifications to attain an ideal facial orientation. Surveillance systems, in turn, operate on luck, hoping that the subject faces the camera.

Another problem arising from 2D technology is the illumination of the surrounding environment. If the subject is in a setting with varying lighting conditions, FER accuracy drops because the FER processes are sensitive to the direction of lighting and the ensuing shading pattern. Cast shadows may also obstruct recognition by concealing informative features. Three-dimensional (3D) FER systems achieve a higher detection rate than 2D systems because of their additional intensity modality and richer geometric description of the object [10, 11]. This demonstrates the importance of pushing FER toward higher face orientations to improve its realism and practicality.

Related work
The primary focus of this study is to improve FER accuracy at higher facial orientations. Yadav and Singha [12] adopted the Viola-Jones descriptor [13] to detect faces and used a combination of local binary patterns (LBP) and the histogram of oriented gradients (HOG) for feature extraction. A traditional SVM with the k-means method was then employed as the training algorithm. Feature-extraction techniques such as LBP and Gabor are orientation-selective and thus highly robust in tracking key facial features. However, the Viola-Jones descriptor is computationally demanding and has a low detection accuracy. Furthermore, the conventional SVM described in the study is slow to classify. Consequently, the overall architecture used in the study was computationally expensive. Yao et al.
[14] proposed a linear SVM method that used action units (AUs) to recognize seven facial expression prototypes in the CK database. The Viola-Jones descriptor was again used for face detection. Although the goal of the study was to minimize computational complexity and enhance recognition accuracy, the resulting average recognition accuracy of 94.07% for females and 90.77% for males was too low for a viable implementation. Ashir et al. [15] also proposed an SVM-based multiclass classification for detecting seven facial expressions across four prominent databases. The Nyquist-Shannon sampling method [16] was used to compress the extracted facial feature samples. Although this sampling method reduces data loss, it is prone to aliasing, particularly when the bandwidth is extremely large. The Nyquist-Shannon technique is also difficult to deploy because it assumes that the sampled signal is completely band-limited; in real-world applications this is a concern because no actual signal is genuinely band-limited. The compressive sampling [17] paradigm could have been a better option because it is less restrictive. Perez-Gomez et al. [18] recently proposed a 2D-3D FER system that used principal component analysis (PCA) and a genetic algorithm for feature selection, and a k-nearest neighbor (KNN)-multiclass SVM for learning. In that study, the synthetic minority oversampling technique (SMOTE) [19] was used to balance the instances. However, SMOTE creates an equal number of synthetic samples for each minority data sample and relies on the hypothesis performance to update the distribution function. The adaptive synthetic (ADASYN) [20] method, by contrast, generates more synthetic data for minority-class samples that are harder to learn, rather than for the easy-to-learn samples favored by SMOTE. In addition, PCA uses observations from all the extracted features in the projection to the subspace and considers only linear relationships, ignoring the multivariate structure of the input. Compared with other recent studies, the findings of that study were not encouraging.

Li et al. [21] proposed a robust 3D local coordinate technique for extracting pose-invariant facial features at key points. The descriptor in this method is a multi-task sparse representation fine-grained matching algorithm. The method was evaluated on the Bosphorus datasets, and an average recognition accuracy of 98.9% was obtained. The success of that study is largely owed to the accurate tracking of 3D key points, and it is a primary driving force behind the work proposed here.

The following are the significant contributions of this work: (1) a robust head-tracking algorithm that tracks facial features from one frame to the next, accounting for more features in the overall prediction process; (2) a unique ensemble approach that employs AdaBoost for feature selection and a combination of AdaBoost and SVM for classification. AdaBoost is extremely fast, whereas SVM is highly accurate; consequently, the proposed technique is very fast while also improving recognition accuracy.

The remainder of this paper is organized as follows. The Methods section delves into the proposed strategy. The Results and discussion section presents the findings, discussion, and analyses. Finally, the Conclusions section concludes the study.

Methods
We robustly tracked the facial features from one frame to the next using 3D facial data.
With 3D data, information such as the size and shape of an object can be correctly estimated in each frame without prior assumptions. The first priority is to detect the focal points in each frame. The next step is to search for matching features or objects across all frames. This method accounts for the changing behavior of a moving object and the preceding annotations of the scene. In this approach, the location of an object is predicted by iteratively updating the object position from previous frames [22, 23].

Architectural framework
Figure 1 presents the framework of this study. The procedure uploads images and robustly tracks the features across frames using the proposed ellipsoidal model. Subsequently, the Gabor feature-extraction approach is applied; the Feature points extraction section explains the reason for using Gabor features in this study. Feature selection and classification are executed using the Ada-AdaSVM.

Fig. 1 Architectural framework of this study

Ellipsoidal feature tracking method
Accurate tracking of a human face from the forehead, to the left cheek, to the chin, to the right cheek, and back to the same spot on the forehead where the tracking began demonstrates that the human face is best approximated by an ellipse. Thus, considering the 3D facial representation in Fig. 2 with N feature points tracked across frames, we denote

\alpha(t) = \{ f_j(t) \mid 1 \le j \le N \}    (1)

where N represents the most relevant feature points. In this study, we assumed N to be 24. In addition, let f_j(t) \in \alpha(t) denote a facial feature. As the features move from one frame to the next, the position of feature f_j(t) at time t + 1 becomes f_j(t + 1); therefore, f_j(t + 1) \in \alpha(t + 1). Assuming that Y_j is the position of \alpha_j on the 3D facial model and \alpha_{j,p}[\phi(t + 1)] represents its back projection onto the image plane, the 3D facial orientation at t + 1 is the vector \phi(t + 1) that minimizes \sum_{j=1}^{N} S_j^2, where

S_j[\phi(t + 1)] = \| \alpha_{j,p}[\phi(t + 1)] - \alpha_j(t + 1) \|    (2)

Fig. 2 Tracking of 3D feature points from one frame to another

This is a multi-view system based on the assumption that cameras are positioned around the subject to capture the various rotational movements. Consequently, the facial image can be captured with a high degree of precision in any orientation. We extracted the features in the same manner as for 2D images. The right and left eyes, the lips, and the muscles around the cheeks are important parts of the face to consider, because slight disruptions primarily and severely distort the muscles in these places. The Gabor technique is then used to extract the features of the captured face. The algorithm models a procedure that chooses a set of features and robustly tracks them from one frame to the next while discarding all other features that are no longer required for tracking. The ellipsoidal 3D face is modelled as shown in Fig. 3.

Adopting homogeneous coordinates for an ellipsoid with semi-axes a, b, and c, a point X_0 = (x_0, y_0, z_0, 1) belongs to the surface of the ellipsoid if X_0^T E_0 X_0 = 0, where

E_0 = \begin{pmatrix} b^2 c^2 & 0 & 0 & 0 \\ 0 & a^2 c^2 & 0 & 0 \\ 0 & 0 & a^2 b^2 & 0 \\ 0 & 0 & 0 & -a^2 b^2 c^2 \end{pmatrix}    (3)

Fig. 3 Ellipsoidal face model

Fig. 4 Model of feature extraction points in 3D
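For illustration, the short sketch below builds the quadric matrix E_0 of Eq. (3) and checks whether a homogeneous point X_0 satisfies X_0^T E_0 X_0 = 0. It is a minimal numerical illustration under our own assumptions (the function names, example semi-axes, and normalized tolerance are ours), not part of the authors' implementation.

```python
import numpy as np

def ellipsoid_quadric(a: float, b: float, c: float) -> np.ndarray:
    """Quadric matrix E0 of Eq. (3) for an ellipsoid with semi-axes a, b, c."""
    return np.diag([b**2 * c**2, a**2 * c**2, a**2 * b**2, -(a**2 * b**2 * c**2)])

def on_ellipsoid(point_xyz, a, b, c, tol=1e-9):
    """Test X0^T E0 X0 = 0 for the homogeneous point X0 = (x0, y0, z0, 1).

    The residual is divided by a^2 b^2 c^2 so that `tol` is comparable across
    ellipsoids of different sizes (our own normalization, not from the paper).
    """
    E0 = ellipsoid_quadric(a, b, c)
    X0 = np.append(np.asarray(point_xyz, dtype=float), 1.0)
    residual = X0 @ E0 @ X0
    return abs(residual) / (a**2 * b**2 * c**2) < tol

# Example: a point on the surface, (a, 0, 0), satisfies the quadric exactly,
# whereas the centre of the ellipsoid does not.
print(on_ellipsoid((9.0, 0.0, 0.0), a=9.0, b=12.0, c=11.0))  # True
print(on_ellipsoid((0.0, 0.0, 0.0), a=9.0, b=12.0, c=11.0))  # False
```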
The algorithm tracks the facial features that are made most noticeable by slight deformation from one frame to the next using the brightness change constraint [24]. These muscles usually lie near the eyes, mouth, cheeks, and edges, as shown in Fig. 4 and along contour \tau in Fig. 3.

Suppose that a pixel with luminance I(x, y) moves from position (x, y)^T at frame t to position (x + u, y + v)^T at frame t + 1 under a high frame rate. In this instance, we can deduce that

I(x + u, y + v, t + 1) = I(x, y, t)    (4)

Applying a Taylor series expansion, with I_x and I_y the spatial gradients and I_t the temporal derivative of the image, we infer that

\left[ I_x(x, y, t) \;\; I_y(x, y, t) \right] \begin{pmatrix} u \\ v \end{pmatrix} + I_t(x, y, t) = 0    (5)

If a whole window \omega_k is considered instead of a single pixel, we deduce that

\left[ \sum_{\omega_k} I_x(x, y, t) \;\; \sum_{\omega_k} I_y(x, y, t) \right] \begin{pmatrix} u_k \\ v_k \end{pmatrix} + \sum_{\omega_k} I_t(x, y, t) = 0    (6)

The solution of Eq. (6) is an optimization problem. Introducing the cost function, it follows that

J(u, v) = \left\{ \left[ \sum_{\omega_k} I_x(x, y, t) \;\; \sum_{\omega_k} I_y(x, y, t) \right] \begin{pmatrix} u_k \\ v_k \end{pmatrix} + \sum_{\omega_k} I_t(x, y, t) \right\}^2    (7)

The optimal displacement vector that determines the new position of the face window \omega_k is given by

\begin{pmatrix} u_k \\ v_k \end{pmatrix} = \arg\min_{(u, v) \in \mathbb{R}^2} J(u, v)    (8)

where (u_k, v_k) represents the window at its new position. Computing the derivatives of J with respect to u and v and equating them to zero yields

C_k \begin{pmatrix} u_k \\ v_k \end{pmatrix} + D_k = 0    (9)

where C_k = \begin{pmatrix} \sum_{\omega_k} I_x^2 & \sum_{\omega_k} I_x I_y \\ \sum_{\omega_k} I_x I_y & \sum_{\omega_k} I_y^2 \end{pmatrix} and D_k = \begin{pmatrix} \sum_{\omega_k} I_x I_t \\ \sum_{\omega_k} I_y I_t \end{pmatrix}.

Assuming that I : [1, m] \times [1, n] \subseteq \mathbb{N}^2 \to [0, 1] is the matrix of the 3D face, the jth level of the pyramid description of the face image is expressed by the recursion

I^j(x, y) = \begin{cases} I(x, y), & j = 0 \\ \frac{1}{4} I^{j-1}(2x, 2y) + \frac{1}{8}\left[ I^{j-1}(2x - 1, 2y) + I^{j-1}(2x + 1, 2y) + I^{j-1}(2x, 2y - 1) + I^{j-1}(2x, 2y + 1) \right] + \frac{1}{16}\left[ I^{j-1}(2x - 1, 2y - 1) + I^{j-1}(2x + 1, 2y + 1) + I^{j-1}(2x + 1, 2y - 1) + I^{j-1}(2x - 1, 2y + 1) \right], & j \neq 0 \end{cases}    (10)

The displacement vector in Eq. (9) can also be rewritten as

\begin{pmatrix} u_k \\ v_k \end{pmatrix} = -C_k^{-1} D_k    (11)

The displacement vector in Eq. (11) is computed at the deepest pyramid level j_{max} (in the Newton-Raphson fashion), and the result is propagated to the upper level j_{max} - 1 by

\begin{pmatrix} u_k^{j-1} \\ v_k^{j-1} \end{pmatrix} = 2 \begin{pmatrix} u_k^{j} \\ v_k^{j} \end{pmatrix}    (12)

Equation (12) is used as the initial estimate for evaluating the displacement vector of the 3D face at the next level. The final displacement vector is given by

\begin{pmatrix} u_k \\ v_k \end{pmatrix} = \sum_{j=0}^{j_{max}} 2^j \begin{pmatrix} u_k^{j} \\ v_k^{j} \end{pmatrix}    (13)

The visible features of the face can then be extracted from any location on the face, just as for a 2D face. The extracted features are candidates for predicting the overall expression of the face, and the Gabor extraction technique is critical for extracting the maximum amount of information required by the classifier.
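To make the tracking step concrete, the sketch below estimates the per-window displacement of Eqs. (6)-(11) and accumulates it over a coarse-to-fine pyramid as in Eqs. (12)-(13). It is a simplified NumPy illustration under our own assumptions (plain finite-difference gradients, a fixed square window, a 2 × 2 box average in place of the weighted kernel of Eq. (10), and no iterative refinement within a level), not the authors' released code.

```python
import numpy as np

def lk_displacement(prev, curr, x, y, half_win=7):
    """Solve Eq. (9), C_k (u_k, v_k)^T + D_k = 0, for one feature window.

    prev, curr: grayscale frames at t and t+1 as float arrays in [0, 1].
    (x, y): integer feature position; half_win: half of the window size.
    """
    ys = slice(y - half_win, y + half_win + 1)
    xs = slice(x - half_win, x + half_win + 1)
    Ix = np.gradient(prev, axis=1)[ys, xs].ravel()   # spatial gradients of frame t
    Iy = np.gradient(prev, axis=0)[ys, xs].ravel()
    It = (curr - prev)[ys, xs].ravel()               # temporal derivative
    C = np.array([[Ix @ Ix, Ix @ Iy],
                  [Ix @ Iy, Iy @ Iy]])
    D = np.array([Ix @ It, Iy @ It])
    # Eq. (11): (u_k, v_k)^T = -C_k^{-1} D_k, lightly regularized for stability.
    return -np.linalg.solve(C + 1e-8 * np.eye(2), D)

def pyramid(img, levels):
    """Coarse-to-fine pyramid; a 2x2 box average stands in for the Eq. (10)
    kernel. Assumes the frame sides are divisible by 2**levels."""
    pyr = [img]
    for _ in range(levels):
        p = pyr[-1]
        pyr.append(0.25 * (p[0::2, 0::2] + p[1::2, 0::2] +
                           p[0::2, 1::2] + p[1::2, 1::2]))
    return pyr

def track_feature(prev, curr, x, y, j_max=3):
    """Accumulate per-level displacements following Eqs. (12)-(13)."""
    prev_pyr, curr_pyr = pyramid(prev, j_max), pyramid(curr, j_max)
    u = v = 0.0
    for j in range(j_max, -1, -1):                   # deepest level first
        du, dv = lk_displacement(prev_pyr[j], curr_pyr[j],
                                 int(round(x / 2**j)), int(round(y / 2**j)))
        u += 2**j * du                               # Eq. (13): sum of 2^j d^j
        v += 2**j * dv
    return x + u, y + v                              # new feature position
```

In practice, a production system would typically rely on an optimized pyramidal Lucas-Kanade implementation such as OpenCV's calcOpticalFlowPyrLK rather than this illustrative version.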
Feature points extraction
The 2D Gabor filters are spatial sinusoids localized by a Gaussian window; because they are orientation-, localization-, and frequency-selective, they are useful in this study. Representing images using Gabor wavelets provides flexibility because the details of their spatial relations are preserved in the process. The general form of the Gabor function is given by

G(x, y, \theta, u, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{x^2 + y^2}{2\sigma^2} \right\} \exp\left[ 2\pi i (R_1 + R_2) \right]    (14)

where R_1 = u x \cos\theta and R_2 = u y \sin\theta, u is the spatial frequency of the band pass, \theta is the spatial orientation, \sigma is the standard deviation of the 2D Gaussian envelope, and (x, y) is the position of the light impulse in the visual field. To allow for more robustness to illumination, we set the filter to zero direct current. The Gabor wavelet is then given by

\tilde{G}(x, y, \theta, u, \sigma) \simeq G(i, j, \theta, u, \sigma) = \frac{1}{q} \left[ \sum_{i=-n}^{n} \sum_{j=-n}^{n} G(x, y, \theta, u, \sigma) \right]    (15)

where (x, y, \theta, u, \sigma) are parameters with (i, j) being the new position of the 2D input point, \theta is the scale, u is the orientation of the Gabor kernel, \sigma is the standard deviation of the Gaussian window in the kernel, n is the maximum size of the face peak, and q is the size of the filter, given by q = (2n + 1)^2. In this study, we used 8 orientations given by \{0, \pi/8, \pi/4, 3\pi/8, \pi/2, 5\pi/8, 3\pi/4, 7\pi/8\} and 5 scales given by \{4, 4\sqrt{2}, 8, 8\sqrt{2}, 16\}. The sample points of the filtered image are coded into two bits (x_1, x_2) such that

G_1 = \begin{cases} x_1 = 1, & \text{if } \Re[\tilde{G}(x, y, \theta, u, \sigma)] * I \ge 0 \\ x_1 = 0, & \text{if } \Re[\tilde{G}(x, y, \theta, u, \sigma)] * I < 0 \end{cases}    (16)

G_2 = \begin{cases} x_2 = 1, & \text{if } \Im[\tilde{G}(x, y, \theta, u, \sigma)] * I \ge 0 \\ x_2 = 0, & \text{if } \Im[\tilde{G}(x, y, \theta, u, \sigma)] * I < 0 \end{cases}    (17)

where I is a sub-image of the expressional face, \Re and \Im are the real and imaginary parts of each Gabor kernel, respectively, and the star (*) is the convolution operator. The final magnitude response, representing the feature vectors, is computed as the square root of the sum of the squares of G_1 and G_2. Figure 5 shows the magnitude response of a template image.

Fig. 5 Gabor magnitude response of the expressive face image: sample image (left), magnitude response image of the whole Gabor filter bank of 40 Gabor filters (right)

Classification using Ada-AdaSVM
For this optimization problem, an SVM with a radial basis function (RBF) kernel is used as the weak classifier. This weak SVM classifier is trained to produce the optimum Gaussian value for the scale parameter \delta and the regularization parameter \partial; typically, the best parameters are \partial = 1.0 and \delta = 0.1. The feature selection hypothesis is then computed from the expression \mathrm{sgn}\left[ \sum_{t=1}^{T} \omega_t h_t^1(\phi_t^1) \right], where T is the final iteration, h_t^1 is the hypothesis with the most discriminating information, and \omega_t are weights that weight h_t^1 according to its classification performance. The learning process, formulated in our recent study [25], is as follows:

Step 1: Input the training set [(y_1, x_1), (y_2, x_2), \ldots, (y_N, x_N)], N = a + b, where subsets a and b comprise the y_i = +1 and y_i = -1 samples, respectively. Initialize \delta = \delta_{ini} with lower limit \delta_{min} and step \delta_{step}. The scale parameter is \delta, and x and y are the feature vectors selected by the AdaBoost algorithm.

Step 2: Initialize the training set weights, w_i^{(1)} = 1/(2a) for all y_i = +1 and w_i^{(1)} = 1/(2b) for all y_i = -1. Do while \delta > \delta_{min}:

Step 3: Apply the RBF-SVM kernel to train on the weighted training data using the leave-one-subject-out cross-validation (LOSOCV) approach, and compute the training error of the weak classifier h_t as

\xi_t = \sum_{i=1}^{N} w_i^t, \quad y_i \neq h_t(x_i)    (18)

Step 4: If \xi_t \ge 1/2, reduce \delta by a factor of \delta_{step} and jump to Step 1.

Step 5: Set the weight of the constituent classifier h_t such that

\alpha_t = \frac{1}{2} \ln\left[ \frac{1}{\xi_t} - 1 \right]    (19)

Step 6: Update the weights by computing

w_i^{t+1} = \frac{w_i^t \exp\{ -\alpha_t y_i h_t(x_i) \}}{N_t}    (20)

where N_t is a normalization constant such that \sum_{i=1}^{n} w_i^{t+1} = 1.

Step 7: The final classifier is given by

H(x) = \mathrm{sgn}\left[ \sum_{t=1}^{T} \alpha_t h_t(x) \right]    (21)

The LOSOCV error is given by the expression \frac{1}{2n} \sum_{i=1}^{n} |f_i(x_i) - l_i|, where n represents the total number of trained data.
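The sketch below illustrates the boosting loop of Steps 1-7 (Eqs. 18-21) using RBF-SVM weak learners from scikit-learn. It is a simplified, single-split illustration under our own assumptions: the LOSOCV protocol and the AdaBoost-based Gabor feature selection are omitted, the mapping of the scale parameter \delta to scikit-learn's gamma is our choice, and parameter names such as delta_ini merely follow the text. It is not the authors' released implementation.

```python
import numpy as np
from sklearn.svm import SVC

def ada_svm_train(X, y, delta_ini=10.0, delta_min=0.1, delta_step=2.0,
                  reg=1.0, max_rounds=50):
    """AdaBoost with weighted RBF-SVM weak learners (sketch of Steps 1-7).

    y must take values in {-1, +1}. `delta` plays the role of the RBF scale
    parameter; mapping it to sklearn's gamma as 1 / (2 * delta**2) is our
    own assumption.
    """
    a, b = np.sum(y == 1), np.sum(y == -1)
    w = np.where(y == 1, 1.0 / (2 * a), 1.0 / (2 * b))        # Step 2
    delta, learners, alphas = delta_ini, [], []

    for _ in range(max_rounds):
        if delta <= delta_min:                                # do while delta > delta_min
            break
        clf = SVC(kernel="rbf", C=reg, gamma=1.0 / (2 * delta**2))
        clf.fit(X, y, sample_weight=w)                        # Step 3: weighted training
        pred = clf.predict(X)
        err = np.sum(w[pred != y])                            # Eq. (18)
        if err >= 0.5:                                        # Step 4: learner too weak
            delta /= delta_step
            continue
        alpha = 0.5 * np.log(1.0 / max(err, 1e-12) - 1.0)     # Eq. (19)
        w = w * np.exp(-alpha * y * pred)                     # Eq. (20)
        w /= w.sum()                                          # normalization constant N_t
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def ada_svm_predict(X, learners, alphas):
    """Eq. (21): sign of the alpha-weighted vote of the weak SVMs."""
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(scores)
```

For multiclass FER, this binary ensemble would typically be wrapped in a one-vs-rest scheme, as is standard for SVM-based classifiers.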
Facial expression datasets
The algorithm was trained and tested on five popular datasets, Bosphorus, BU-3DFE, MMI, CK+, and BP4D-Spontaneous, and executed on a 4-CPU processor of approximately 2.2 GHz with 8192 MB of RAM.

Results and discussion
Experiments on databases
Bosphorus contains 4666 images of 105 subjects [26], comprising 60 men and 45 women, the majority being Caucasian and 27 of whom were professional actors, in various poses, expressions, and occlusion conditions. In addition to the 6 basic emotional expressions, various systematic head poses (13 yaw and pitch rotations) are present. The texture images have a resolution of 1600 × 1200 pixels, whereas the 3D faces comprise approximately 35,000 vertices [27]. Figure 6 presents sample data from Bosphorus. Occlusion images were discarded because they were not the focus of this study; the data used comprised 6 poses and 7 expressions. The images were partitioned into training and testing sets using the conventional LOSOCV approach: one specimen from each of the 6 groups of expressions was used as the test set during each training run, whereas the remaining samples were used as the training set. Table 1 summarizes the FER results on Bosphorus.

Fig. 6 Sample Bosphorus datasets

Table 1 FER in the Bosphorus database
Average recognition accuracy = 92.7%

Pose | Expression | Average recognition (%) | Expression | Average recognition (%)
10° Yaw | Neutral | 100 | Happiness | 99.2
20° Yaw | Neutral | 99.8 | Sadness | 98.0
30° Yaw | Neutral | 99.2 | Disgust | 98.4
L45° Yaw | Neutral | 97.3 | Anger | 99.4
R45° Yaw | Neutral | 97.8 | Fear | 99.6
L90° Yaw | Neutral | 63.2 | Surprise | 99.0
R90° Yaw | Neutral | 78.2 | Overall average | 98.9
PR | Neutral | 99.7 | |
CR | Neutral | 98.9 | |

The BU-3DFE database was created at Binghamton University [28]. It contains 100 respondents ranging in age from 18 to 70 years; Whites, Blacks, East Asians, Middle East Asians, Indians, and Hispanics are among the ethnic groups represented. Each participant displayed 7 expressions, comprising neutral and the 6 archetypal facial expressions, the latter at 4 intensity levels. Figure 7 shows sample data from the database. The images were separated into training and testing sets using the same LOSOCV method as for the Bosphorus data, and the average recognition accuracy was 94.56%.

Fig. 7 Sample BU3DFE datasets

The MMI database comprises over 2900 high-resolution videos submitted by more than 20 students and research staff members, of whom 44% are female, ranging in age from 19 to 62 years. Seventy-five subjects were included in total, and Fig. 8 shows samples. The datasets were partitioned into training and testing sets using the LOSOCV technique: one sample from each of the 7 types of expressions was used as the test set during each training run, and the remaining samples were used as the training set; for each training cycle, the procedure was repeated with new test samples. The expressions included anger, disgust, fear, happiness, neutral, sadness, and surprise. The average recognition accuracy is 97.2%.

The CK+ database is an extended version of the CK database comprising 210 adults. Participants were 18 to 50 years old, with 69% female, 81% Euro-American, 13% Afro-American, and 6% from other ethnic groups. The expressions included anger, contempt, disgust, fear, happiness, sadness, and surprise. Figure 9 presents sample datasets.
A tenfold cross-validation procedure was used to partition the datasets into training and testing sets. The average recognition accuracy is 99.48%.

Fig. 8 Sample MMI datasets

Fig. 9 Sample images in CK+ database

Finally, the BP4D-Spontaneous dataset is a 3D video collection of spontaneous facial expressions from young individuals. The database comprises 41 subjects (23 women and 18 men) ranging in age from 18 to 29 years, including 11 Asians, 6 African-Americans, 4 Hispanics, and 20 Euro-Americans. Figure 10 shows sample images. We extracted expressions of anger, disgust, fear, pain, happiness, sadness, and surprise. The datasets were partitioned into training and testing sets using tenfold cross-validation. The average recognition accuracy is 97.2%.

Fig. 10 Sample BP4D-Spontaneous datasets

Figures 11 and 12 show the confusion matrices for facial expression and pose prediction, respectively, in the Bosphorus database. Figures 13, 14, 15, and 16 show the remaining confusion matrices for FER in the BU3DFE, MMI, CK+, and BP4D-Spontaneous databases, respectively.

Fig. 11 Confusion matrix of facial expressions in Bosphorus

Fig. 12 Confusion matrix of pose prediction in Bosphorus

Fig. 13 Confusion matrix of facial expressions in BU3DFE database

Fig. 14 Confusion matrix of facial expressions in MMI database

Fig. 15 Confusion matrix of facial expressions in CK+ database

Comparison of methods
In Table 2, the proposed method is compared with some recent techniques. The results clearly demonstrate that the proposed method is promising. Figures 17, 18, and 19 show the performance on each of the 7 facial expressions. For the BU3DFE database, many authors did not report the performance on the neutral expression; thus, the comparison was performed using the other 6. The performance shown in Fig. 17 is encouraging. Figure 18 shows the performance on the CK+ database. Although Fig. 18 depicts fierce rivalry with three current methods [29-31], the overall average recognition shows that the proposed technique is promising.
Fig. 16 Confusion matrix of facial expressions in BP4D-Spontaneous datasets

Table 2 Comparison of results of different methods

Method | Database | Recognition (%) | Ref
Twin support vector machines classifier | MMI | 92.56 ± 3.02 | [32]
DBM-DACNN with entropy loss | MMI | 79.25 | [33]
Deep learning neural network-regression | CK+ | 97.27 | [30]
Deep learning + random forest | CK+ | 99.00 | [31]
Twin support vector machines classifier | CK+ | 93.42 ± 3.25 | [32]
DBM-DACNN with entropy loss | CK+ | 96.46 | [33]
Geotopo+ | BP4D-Spontaneous | 88.56 | [34]
Two-phase weighted collaborative representation classification | BP4D-Spontaneous | 100 | [35]
Fine-grained matching of 3D keypoint descriptors | Bosphorus | 98.90 | [21]
Kernel methods on Riemannian manifold | Bosphorus | 86.70 | [36]
SVM with EPE | Bosphorus | 84.00 | [37]
Two-phase weighted collaborative representation classification | Bosphorus | 98.90 | [35]
Kernel methods on Riemannian manifold | BU-3DFE | 92.62 | [36]
SVM with EPE | BU-3DFE | 85.81 | [37]
Manifold CNN | BU-3DFE | 86.67 | [38]
CNN model | BU-3DFE | 92.57 | [39]
Proposed method | MMI | 97.20 | This study
Proposed method | CK+ | 98.20 | This study
Proposed method | BP4D-Spontaneous | 97.20 | This study
Proposed method | Bosphorus | 98.90 | This study
Proposed method | BU-3DFE | 93.50 | This study

In the Bosphorus database, the proposed method outperformed the most recent methods (Fig. 19). A comparison of the performance on the individual FER prototypes in the MMI and BP4D-Spontaneous databases could not be executed because there were no reported data for comparison at the time of compilation. Statistical analysis using ANOVA gave the following results. In the Bosphorus database, the analysis of variance demonstrated statistically significant differences between the proposed technique and the following: Hariri et al. [36] (p = 0.001), Azazi et al. [37] (p = 0.000), and Moeini and Moeini [40] (p = 0.013). The outcome is the same in BU3DFE: the variance analysis shows that a statistically significant difference (p < 0.05) exists between the proposed method and all other methods. In the CK+ FER database, however, the statistical analysis shows that, except for ref. [41], where a statistically significant difference (p < 0.05) exists, the remaining methods show no statistically significant differences (p > 0.05): the proposed method yields p = 0.847 against An and Liu [29], p = 0.909 against Ch [30], and p = 0.991 against Liao et al. [31]. Although the analysis appears to reveal a balanced performance between the proposed methodology and the last three techniques, the average recognition accuracy of the proposed method against any of them, as shown in Fig. 18, indicates that the proposed method is superior.

Fig. 17 Performance of 6 FER prototypes in BU3DFE database

Fig. 18 Performance of 6 FER prototypes in CK+ database

Fig. 19 Performance of 7 FER prototypes in Bosphorus database

Conclusions
This study improves FER performance in higher poses. 2D pose-conversion schemes have been established that handle pose-invariant FER problems successfully within small-scale pose variations. However, they often fail for large-scale, in-depth face variations because of the disjointedness of the image. Human face geometry is ellipsoidal; therefore, the feature points are robustly tracked from one frame to the next using an ellipsoidal model.
We use the Gabor feature-extraction technique for the salient visible features, mostly around the cheeks, eyes, mouth, and nose ridges. The Gabor feature-extraction algorithm is useful for this study because it is selective toward orientation, localization, and frequency. We then used an ensemble classification technique, which combines SVM and AdaBoost, for feature selection and classification. The proposed technique outperforms the most recent and popular methods. In the future, we intend to investigate this problem using other feature-extraction methods such as LBP and LBP + HOG.

Abbreviations
FER: Facial expression recognition; SVM: Support vector machine; LBP: Local binary patterns; HOG: Histogram of oriented gradients; PCA: Principal component analysis; KNN: K-nearest neighbor; SMOTE: Synthetic minority oversampling technique; 2D: Two-dimensional; 3D: Three-dimensional; LOSOCV: Leave-one-subject-out cross validation.

Acknowledgements
Not applicable.

Authors' contributions
All authors drafted this manuscript. Ideation was proposed by EO. EO and JKA developed the proposed solution. PO performed the experimentation. All authors discussed and analyzed the results of the experimentation. All authors read and approved the final manuscript.

Funding
Not applicable.

Availability of data and materials
All data used for this study are publicly available on request from the original authors.

Declarations

Competing interests
All authors declare that there is no known competing interest.

Author details
1 Department of Computer Science, University of Ghana, P. O. Box LG 163, Accra, Ghana. 2 Department of Computer Engineering, University of Ghana, P. O. Box LG 77, Accra, Ghana.

Received: 18 November 2021  Accepted: 19 April 2022

References
1. Panksepp J (2005) Affective consciousness: Core emotional feelings in animals and humans. Conscious Cogn 14(1):30-80. https://doi.org/10.1016/j.concog.2004.10.004
2. Plutchik R (2001) The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. Amer Scient 89(4):344-350. https://doi.org/10.1511/2001.4.344
3. Zautra AJ (2003) Emotions, stress, and health. Oxford University Press, Oxford.
4. Kohler CG, Martin EA, Stolar N, Barrett FS, Verma R, Brensinger C et al (2008) Static posed and evoked facial expressions of emotions in schizophrenia. Schizophr Res 105(1-3):49-60. https://doi.org/10.1016/j.schres.2008.05.010
5. Ambron E, Foroni F (2015) The attraction of emotions: irrelevant emotional information modulates motor actions. Psychon Bull Rev 22(4):1117-1123. https://doi.org/10.3758/s13423-014-0779-y
6. Kumari J, Rajesh R, Kumar A (2016) Fusion of features for the effective facial expression recognition. Paper presented at the international conference on communication and signal processing, IEEE, Melmaruvathur, 6-8 June 2016. https://doi.org/10.1109/ICCSP.2016.7754178
7. Shergill GS, Sarrafzadeh A, Diegel O, Shekar A (2008) Computerized sales assistants: the application of computer technology to measure consumer interest - a conceptual framework. J Electron Commer Res 9(2):176-191.
8. Tierney M (2017) Using behavioral analysis to prevent violent extremism: Assessing the cases of Michael Zehaf-Bibeau and Aaron Driver. J Threat Assessm Manag 4(2):98-110. https://doi.org/10.1037/tam0000082
9. Nonis F, Dagnes N, Marcolin F, Vezzetti E (2019) 3D approaches and challenges in facial expression recognition algorithms - A literature review. Appl Sci 9(18):3904. https://doi.org/10.3390/app9183904
10. Sandbach G, Zafeiriou S, Pantic M, Rueckert D (2011) A dynamic approach to the recognition of 3D facial expressions and their temporal models. Paper presented at the ninth IEEE international conference on automatic face and gesture recognition, IEEE, Santa Barbara, 21-25 March 2011. https://doi.org/10.1109/FG.2011.5771434
11. Vieriu RL, Tulyakov S, Semeniuta S, Sangineto E, Sebe N (2015) Facial expression recognition under a wide range of head poses. Paper presented at the 11th IEEE international conference and workshops on automatic face and gesture recognition, IEEE, Ljubljana, 4-8 May 2015. https://doi.org/10.1109/FG.2015.7163098
12. Yadav KS, Singha J (2020) Facial expression recognition using modified Viola-John's algorithm and KNN classifier. Multimed Tools Appl 79(19):13089-13107. https://doi.org/10.1007/s11042-019-08443-x
13. Jones M, Viola P (2003) Fast multi-view face detection. Mitsubishi Electric Research Laboratories, Cambridge.
14. Yao L, Wan Y, Ni HJ, Xu BG (2021) Action unit classification for facial expression recognition using active learning and SVM. Multimed Tools Appl 80(16):24287-24301. https://doi.org/10.1007/s11042-021-10836-w
15. Ashir AM, Eleyan A, Akdemir B (2020) Facial expression recognition with dynamic cascaded classifier. Neural Comput Appl 32(10):6295-6309. https://doi.org/10.1007/s00521-019-04138-4
16. Farrow CL, Shaw M, Kim H, Juhás P, Billinge SJL (2011) Nyquist-Shannon sampling theorem applied to refinements of the atomic pair distribution function. Phys Rev B 84(13):134105. https://doi.org/10.1103/PhysRevB.84.134105
17. Li F, Cornwell TJ, de Hoog F (2011) The application of compressive sampling to radio astronomy. I. Deconvolution. Astron Astrophys 528:A31. https://doi.org/10.1051/0004-6361/201015045
18. Perez-Gomez V, Rios-Figueroa HV, Rechy-Ramirez EJ, Mezura-Montes E, Marin-Hernandez A (2020) Feature selection on 2D and 3D geometric features to improve facial expression recognition. Sensors 20(17):4847. https://doi.org/10.3390/s20174847
19. Duan J (2019) Financial system modeling using deep neural networks (DNNs) for effective risk assessment and prediction. J Franklin Inst 356(8):4716-4731. https://doi.org/10.1016/j.jfranklin.2019.01.046
20. Kurniawati YE, Permanasari AE, Fauziati S (2018) Adaptive synthetic-nominal (ADASYN-N) and adaptive synthetic-KNN (ADASYN-KNN) for multiclass imbalance learning on laboratory test data. Paper presented at the 4th international conference on science and technology, IEEE, Yogyakarta, 7-8 August 2018. https://doi.org/10.1109/ICSTC.2018.8528679
21. Li HB, Huang D, Morvan JM, Wang YH, Chen LM (2015) Towards 3D face recognition in the real: a registration-free approach using fine-grained matching of 3D keypoint descriptors. Int J Comput Vis 113(2):128-142. https://doi.org/10.1007/s11263-014-0785-6
22. Comaniciu D, Ramesh V, Meer P (2003) Kernel-based object tracking. IEEE Trans Pattern Anal Mach Intell 25(5):564-577. https://doi.org/10.1109/TPAMI.2003.1195991
23. Hao GT, Du XP, Chen H, Song JJ, Gao TF (2015) Scale-unambiguous relative pose estimation of space uncooperative targets based on the fusion of three-dimensional time-of-flight camera and monocular camera. Opt Eng 54(5):053112. https://doi.org/10.1117/1.OE.54.5.053112
24. Dibeklioglu H, Salah AA, Akarun L (2008) 3D facial landmarking under expression, pose, and occlusion variations. Paper presented at the IEEE second international conference on biometrics: theory, applications and systems, IEEE, Washington, 29 September-1 October 2008. https://doi.org/10.1109/BTAS.2008.4699324
25. Owusu E, Wiafe I (2021) An advance ensemble classification for object recognition. Neural Comput Appl 33(18):11661-11672. https://doi.org/10.1007/s00521-021-05881-3
26. Dharavath K, Laskar RH, Talukdar FA (2013) Qualitative study on 3D face databases: A review. Paper presented at the annual IEEE India conference, IEEE, Mumbai, 13-15 December 2013. https://doi.org/10.1109/INDCON.2013.6726093
27. Sandbach G, Zafeiriou S, Pantic M, Yin LJ (2012) Static and dynamic 3D facial expression recognition: A comprehensive survey. Image Vision Comput 30(10):683-697. https://doi.org/10.1016/j.imavis.2012.06.005
28. Quan W, Matuszewski BJ, Shark LK, Ait-Boudaoud D (2009) Facial expression biometrics using statistical shape models. EURASIP J Adv Signal Process 2009:261542. https://doi.org/10.1155/2009/261542
29. An FP, Liu ZW (2020) Facial expression recognition algorithm based on parameter adaptive initialization of CNN and LSTM. Vis Comput 36:483-498. https://doi.org/10.1007/s00371-019-01635-4
30. Ch S (2021) An efficient facial emotion recognition system using novel deep learning neural network-regression activation classifier. Multimed Tools Appl 80(12):17543-17568. https://doi.org/10.1007/s11042-021-10547-2
31. Liao HB, Wang DH, Fan P, Ding L (2021) Deep learning enhanced attributes conditional random forest for robust facial expression recognition. Multimed Tools Appl 80(19):28627-28645. https://doi.org/10.1007/s11042-021-10951-8
32. Kumar MP, Rajagopal MK (2019) Detecting facial emotions using normalized minimal feature vectors and semi-supervised twin support vector machines classifier. Appl Intell 49(12):4150-4174. https://doi.org/10.1007/s10489-019-01500-w
33. Li S, Deng WH (2019) Blended emotion in-the-wild: Multi-label facial expression recognition using crowdsourced annotations and deep locality feature learning. Int J Comput Vis 127(6):884-906. https://doi.org/10.1007/s11263-018-1131-1
34. Danelakis A, Theoharis T, Pratikakis I, Perakis P (2016) An effective methodology for dynamic 3D facial expression retrieval. Pattern Recogn 52:174-185. https://doi.org/10.1016/j.patcog.2015.10.012
35. Lei YJ, Guo YL, Hayat M, Bennamoun M, Zhou XZ (2016) A two-phase weighted collaborative representation for 3D partial face recognition with single sample. Pattern Recogn 52:218-237. https://doi.org/10.1016/j.patcog.2015.09.035
36. Hariri W, Tabia H, Farah N, Benouareth A, Declercq D (2017) 3D facial expression recognition using kernel methods on Riemannian manifold. Eng Appl Artif Intell 64:25-32. https://doi.org/10.1016/j.engappai.2017.05.009
37. Azazi A, Lutfi SL, Venkat I, Fernández-Martínez F (2015) Towards a robust affect recognition: Automatic facial expression recognition in 3D faces. Expert Syst Appl 42(6):3056-3066. https://doi.org/10.1016/j.eswa.2014.10.042
38. Chen ZX, Huang D, Wang YH, Chen LM (2018) Fast and light manifold CNN based 3D facial expression recognition across pose variations. Paper presented at the 26th ACM international conference on multimedia, ACM, Seoul, 22-26 October 2018. https://doi.org/10.1145/3240508.3240568
39. Huynh XP, Tran TD, Kim YG (2016) Convolutional neural network models for facial expression recognition using BU-3DFE database. In: Kim K, Joukov N (eds) Information Science and Applications (ICISA) 2016. Lecture Notes in Electrical Engineering, vol 376. Springer, Singapore, pp 441-450. https://doi.org/10.1007/978-981-10-0557-2_44
40. Moeini A, Moeini H (2015) Real-world and rapid face recognition toward pose and expression variations via feature library matrix. IEEE Trans Inform Forensics Secur 10(5):969-984. https://doi.org/10.1109/TIFS.2015.2393553
41. Meena HK, Sharma KK, Joshi SD (2020) Effective curvelet-based facial expression recognition using graph signal processing. Signal Image Video Process 14(2):241-247. https://doi.org/10.1007/s11760-019-01547-9

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.