University of Ghana http://ugspace.ug.edu.gh
UNIVERSITY OF GHANA 
COLLEGE OF BASIC AND APPLIED SCIENCE 
 
A DEEP LEARNING APPROACH FOR THE AUTOMATIC CLASSIFICATION OF ACOUSTIC 
EVENTS: A CASE OF NATURAL DISASTERS 
BY 
EKPEZU, AKON OBU 
(10704369) 
THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA, LEGON IN 
PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF MPHIL IN 
COMPUTER SCIENCE DEGREE 
DEPARTMENT OF COMPUTER SCIENCE 
October 2020 
  
 
DECLARATION 
I hereby declare that I am the sole author of this thesis and all materials used from other sources 
or in collaboration with other researchers have been properly and fully acknowledged. 
 
EKPEZU, AKON OBU 
(Candidate)     
 
  
 
 
                               
DR. FERDINAND KATSRIKU                     DR. WINFRED YAOKUMAH 
(Supervisor)                      (Co-supervisor) 
  
ABSTRACT 
Automatic classification of acoustic events is a signal processing activity that has recently 
gained research interest, especially in the machine learning community. This is due to its 
cost-effectiveness in the long-term monitoring of large areas and the collection of large amounts 
of data in real time. A plethora of techniques have been proposed and adopted for the 
classification of acoustic events such as respiratory sounds, animal calls/vocalizations, baby cries, 
speech disorders, and environmental sounds. This study aimed to develop a natural 
disaster sound classification model that enables automatic classification of natural disasters. 
Accordingly, deep learning techniques including a Convolutional Neural Network (CNN) and a 
Long Short-Term Memory-based Recurrent Neural Network (RNN-LSTM) were used to 
develop classification models. The algorithms and sound features used in this study 
were motivated by methodologies used in speech/voice recognition. To ensure 
relevant and rigorous research, this study adopted the design science research methodology, 
which consists of a five-phase cycle: awareness of the problem, suggestion, development, 
evaluation, and conclusion. Furthermore, to ensure the real-time classification of natural 
disaster sounds, the detection-by-classification approach was adopted instead of 
detection-and-classification. The dataset used for this study consisted of five classes of natural 
disaster sounds extracted from the Freesound database. The sound files were preprocessed at 
16,000 Hz to extract 13 Mel Frequency Cepstral Coefficients (MFCCs). An arbitrary time frame of 
0.1 s was adopted. The performance of both models was validated using classification 
metrics and cross-validation. Results indicated that although the CNN performed slightly better 
than the RNN-LSTM, both models were effective at automatically discerning one disaster sound 
from another in real time. The best results, a classification accuracy of 99.95% and an 
area under the curve (AUC) score of 0.999, were obtained from the CNN. 
  
ACKNOWLEDGEMENT 
But by the grace of God, I am what I am, and his grace toward me was not in vain. For seeing 
me through this phase of studies; all thanks and adoration to the Lord God almighty for his 
unfailing love, grace, and mercies. 
I particularly thank my supervisors, Dr. Ferdinand Katsriku and Dr. Winfred Yaokumah, for 
their wholehearted encouragement, for dutifully and meticulously guiding this thesis, and 
for the many research opportunities and experiences they have 
exposed me to throughout my MPhil program. May God bless you abundantly. 
My appreciation also extends to Dr. Jamal Abdullahi, Dr. Isaac Wiafe, Dr. Solomon Mensah, 
Dr. Justice Appatti, and all the faculty members of the Department of Computer Science, 
University of Ghana; you have given my postgraduate journey a new and unforgettable 
experience. Remain blessed. 
I also acknowledge with special thanks, Dr. Enoimah Umoh, and all the faculty members of 
the Department of Computer Science, Cross River University of Technology; their unwavering 
support has made this academic pursuit a success.  
I am profoundly grateful to my mother Mrs. Atim Mbah, my siblings Eke and Obu Ekpezu, 
and the families of Dr. Emil Inyang and Dr. Isaac Wiafe whose encouragement, prayers, love, 
and sacrifice kept me pushing forward. May God in his infinite mercies bless you all. 
To my colleagues William Apprey, Samuel Abedu, Fredrick Boafo, Abigail Wiafe, Jacqueline 
Kumi, and Melody Kakrabah; your immeasurable contributions and support have been 
amazing. 
May the grace and blessings of God rest and abide with you all. Amen. 
  
RELATED PUBLICATION 
Ekpezu, A. O., Katsriku, F., Yaokumah, W., and Wiafe, I. [under review]. Classification of 
Acoustic Signals Using Machine Learning: A Systematic Review. Submitted to Journal of 
Artificial Intelligence and Soft Computing Research. 
  
TABLE OF CONTENTS 
 
DECLARATION 
ABSTRACT 
ACKNOWLEDGEMENT 
RELATED PUBLICATION 
TABLE OF CONTENTS 
LIST OF FIGURES 
LIST OF TABLES 
LIST OF ABBREVIATIONS 
Chapter One INTRODUCTION 
1.1 BACKGROUND AND MOTIVATION 
1.2 RESEARCH PROBLEM 
1.3 RESEARCH AIM AND OBJECTIVES 
1.4 EXPECTED CONTRIBUTIONS 
1.4.1 THEORETICAL CONTRIBUTION 
1.4.2 PRACTICAL CONTRIBUTION 
1.5 THESIS OUTLINE 
Chapter Two RELATED STUDIES 
2.1 CHAPTER OVERVIEW 
2.2 NATURAL DISASTERS 
2.3 REVIEW SUMMARY ON RELATED WORK 
2.3.1 TASK CATEGORIES 
2.3.2 MODELLING TECHNIQUES USED 
2.4 REVIEW SUMMARY 
2.5 CHAPTER SUMMARY 
Chapter Three STATE OF THE ART IN SOUND CLASSIFICATION 
3.1 CHAPTER OVERVIEW 
3.2 RELATED REVIEWS 
3.3 REVIEW QUESTIONS 
3.4 REVIEW APPROACH 
3.4.1 LITERATURE SEARCH 
3.4.2 INCLUSION AND EXCLUSION CRITERIA 
3.4.3 STUDY SELECTION AND DATA EXTRACTION 
3.5 OVERVIEW OF PUBLICATION TRENDS 
3.5.1 PUBLICATION FREQUENCY 
3.5.2 DISTRIBUTION OF JOURNALS 
3.5.3 AUTHORS AND COUNTRY ORIGIN 
3.6 SUMMARY OF METHODOLOGIES FROM REVIEWED ARTICLES 
3.6.1 SOUND/ACOUSTIC SIGNALS CLASSIFIED AND DATA SOURCES 
3.6.2 DISTRIBUTION OF CLASSIFIED SOUNDS ACCORDING TO APPLICATION DOMAIN 
3.6.3 FEATURE EXTRACTION METHODS 
3.6.4 SOUND CLASSIFICATION ALGORITHMS AND PERFORMANCE METRICS 
3.7 REVIEW SUMMARY 
3.8 CHAPTER SUMMARY 
Chapter Four RESEARCH METHODOLOGY 
4.1 CHAPTER OVERVIEW 
4.2 DESIGN SCIENCE RESEARCH METHODOLOGY (DSRM) 
4.3 RESEARCH APPROACH FOR THIS STUDY 
4.3.1 AWARENESS OF THE PROBLEM 
4.3.2 SUGGESTION 
4.3.3 DEVELOPMENT 
4.3.4 EVALUATION 
4.3.5 CONCLUSION 
4.4 CHAPTER SUMMARY 
Chapter Five USING DEEP LEARNING FOR ACOUSTIC EVENT CLASSIFICATION 
5.1 CHAPTER OVERVIEW 
5.2 SOFTWARE USED FOR THE EXPERIMENT 
5.3 NATURAL DISASTER SOUND DATASET 
5.4 SOUND PREPROCESSING AND FEATURE EXTRACTION 
5.4.1 DE-NOISING THE SIGNAL 
5.4.2 ACOUSTIC DOWN-SAMPLING 
5.4.3 FILTER BANK-BASED FEATURE EXTRACTION METHOD 
5.5 CLASSIFICATION TECHNIQUES 
5.5.1 CONVOLUTIONAL NEURAL NETWORK 
5.5.2 RECURRENT NEURAL NETWORK 
5.6 CHAPTER SUMMARY 
Chapter Six EVALUATION OF DEEP LEARNING TECHNIQUES 
6.1 CHAPTER OVERVIEW 
6.2 MODEL VALIDATION 
6.2.1 CROSS-VALIDATION 
6.2.2 CLASSIFICATION METRICS 
6.3 TESTING THE VALIDITY OF THE MODELS IN REAL-TIME CLASSIFICATION OF DISASTER SOUNDS 
6.4 CHAPTER SUMMARY 
Chapter Seven CONCLUSION 
7.1 CHAPTER OVERVIEW 
7.2 THESIS SUMMARY 
7.3 DISCUSSIONS 
7.3.1 CLASSIFICATION CATEGORY 
7.3.2 INPUT ACOUSTIC FEATURES 
7.3.3 CLASSIFICATION PERFORMANCE 
7.4 LIMITATION OF THE STUDY 
7.4.1 DATASETS 
7.4.2 DENOISING THE SIGNAL 
7.4.3 FEATURE EXTRACTION 
7.5 RECOMMENDATION 
7.6 FUTURE WORK 
References 
APPENDIX A: PRIMARY STUDIES USED FOR THE SYSTEMATIC REVIEW 
APPENDIX B: PYTHON CODES FOR LOADING THE DATA 
APPENDIX C: PYTHON CODES FOR MODEL PREPARATION/PREDICTION 
APPENDIX D: PYTHON CODES FOR 10-FOLD MODEL VALIDATION 
APPENDIX E: PYTHON CODES FOR AUC-ROC 
 
LIST OF FIGURES 
FIGURE 2.1: NATURAL DISASTER EVENTS GLOBALLY FROM 2010 TO 2019 (DUGGAR ET AL. 2016) 
FIGURE 2.2: DISTRIBUTION OF MODELLING TECHNIQUES 
FIGURE 3.1: PUBLICATIONS BY YEAR 
FIGURE 3.2: DISTRIBUTION OF AUTHORS BY CONTINENT 
FIGURE 3.3: SOUND CLASSIFICATION PUBLICATION TREND BY STUDY COUNTRY 
FIGURE 3.4: PIE CHART SHOWING THE DISTRIBUTION OF APPLICATION DOMAINS 
FIGURE 3.5: DISTRIBUTION OF MODELLING TECHNIQUES 
FIGURE 4.1: LAYERS OF CNN (BORGNE & BONTEMPI, 2017) 
FIGURE 5.1: SOUND CLASSIFICATION ARCHITECTURE 
FIGURE 5.2: CLASS DISTRIBUTION OF DISASTER SOUND DATASET 
FIGURE 5.3: TIME SERIES REPRESENTATION OF FIVE RANDOM SAMPLES BELONGING TO THE FIVE DIFFERENT CLASSES OF THE DATASET 
FIGURE 5.4: MFCC REPRESENTATION OF FIVE RANDOM SAMPLES BELONGING TO THE FIVE DIFFERENT CLASSES OF THE DATASET 
FIGURE 5.5: FILTER BANK COEFFICIENT REPRESENTATION OF FIVE RANDOM SAMPLES BELONGING TO THE FIVE DIFFERENT CLASSES OF THE DATASET 
FIGURE 5.6: PIE-CHART SHOWING THE CLASS DISTRIBUTION OF DENOISED DISASTER SOUND DATASET 
FIGURE 5.7: CNN MODEL DIMENSIONS 
FIGURE 5.8: RNN-LSTM ARCHITECTURE (RAZA ET AL. 2019) 
FIGURE 5.9: RNN-LSTM MODEL DIMENSIONS 
FIGURE 6.1: CLASSIFICATION ACCURACY AND AVERAGE ACCURACY OF THE 10-FOLDS 
FIGURE 6.2: ACCURACY OBTAINED FROM 10-FOLD CROSS-VALIDATION FOR CNN AND RNN-LSTM 
FIGURE 6.3: CLASSIFICATION ACCURACY FOR CNN AND RNN-LSTM 
FIGURE 6.4: AUC-ROC FOR CNN MODEL 
FIGURE 6.5: AUC-ROC FOR RNN-LSTM MODEL 
FIGURE 6.6: CHART SHOWING ACCURACY SCORE COMPARISON FOR INITIAL AND INCREASED TIME FRAMES 
FIGURE 6.7: CHART SHOWING THE CLASSIFICATION ACCURACY FOR THE REAL TIME MODEL (2X) AND AUGMENTED DATASET (4X, 6X) 
LIST OF TABLES 
TABLE 2.1: DISASTER TYPE, MODELLING TECHNIQUES, AND TASK SUMMARY 
TABLE 2.2: CLASSIFICATION SCHEMES FOR NATURAL DISASTERS 
TABLE 3.1: RESEARCH QUESTIONS AND OBJECTIVES 
TABLE 3.2: SEARCH RESULTS PER EXCLUSION CRITERIA 
TABLE 3.3: FREQUENCY DISTRIBUTION OF PRIMARY SOURCES 
TABLE 3.4: LEADING AUTHORS 
TABLE 3.5: SUMMARY OF CLASSIFIED SOUNDS AND DATASETS 
TABLE 3.6: FEATURE EXTRACTION METHODS 
TABLE 3.7: CLASSIFICATION TECHNIQUES USED 
TABLE 3.8: CLASSIFICATION CATEGORIES 
TABLE 4.1: DESIGN SCIENCE RESEARCH (DSR) GUIDELINES 
TABLE 4.2: FEATURES OF THE 3-LAYERS IN A CNN 
TABLE 4.3: CALCULATING THE CURRENT STATE, ACTIVATION FUNCTIONS AND OUTPUT IN RNN 
TABLE 6.1: 10-FOLD CROSS-VALIDATION 
TABLE 6.2: CONFUSION MATRIX SHOWING CNN PREDICTIONS 
TABLE 6.3: CONFUSION MATRIX SHOWING RNN-LSTM PREDICTIONS 
TABLE 6.4: RESULT SUMMARY OF CLASSIFICATION METRICS 
TABLE 7.1: COMPARISON OF STUDY APPROACHES WITH OTHER STUDIES 
 
LIST OF ABBREVIATIONS 
ABBREVIATION FULL MEANING 
AI Artificial Intelligence 
AEC Acoustic event classification 
AED Acoustic event detection 
ANN Artificial Neural Network 
ARMAX Auto Regressive Moving Average with Exogenous Inputs 
ASA Acoustical Society of America 
ASR Automatic speech recognition 
CNN Convolutional Neural Network 
DCT Discrete cosine transform 
DL Deep learning 
DML Deep metric learning 
DSRM Design science research methodology 
DWT Discrete wavelet transform 
ELM Extreme learning machine 
FBC Filter bank coefficient 
FCN Fully connected network 
FFT Fast Fourier transform 
FT Fourier transform 
HMM Hidden Markov model 
IoT Internet of things 
JASA Journal of the Acoustical Society of America 
kHz Kilohertz 
KNN k-Nearest Neighbor 
LSTM Long short-term memory 
MFCC Mel Frequency Cepstral Coefficient 
ML Machine learning 
MLP Multilayer Perceptron 
NLP Natural language processing 
RBF Radial Basis Function 
RF Random forest 
RNN Recurrent Neural Network 
STFT Short time Fourier transform 
SVM Support Vector Machine 
.WAV Waveform Audio 
Chapter One  
INTRODUCTION 
1.1 BACKGROUND AND MOTIVATION 
Every object, whether animate or inanimate, produces sound in its vibrating state (Giordano, 2005; 
Rubinstein, 2008). Although it varies with season, time, geographic location, and 
propagation medium, sound is considered one of the most significant signals used to 
monitor and detect changes in the environment. However, because of these varying parameters, 
detecting a sound of interest in a particular environment is often challenging: the sound of 
interest is usually immersed in different forms of background noise such as anthrophony 
(man-made, e.g. traffic, shipping, and aircraft noise), geophony (environmental, e.g. windstorms, 
raindrops, thunderstorms), and biophony (animal-made, e.g. dog barking, vocalizations from 
marine mammals) (Muir & Bradley, 2016). Additionally, to distinguish one sound type 
from another, human operators are trained to use expert knowledge; this is an overwhelming 
and inadequate process, as false alarms are raised in most cases (Cao et al. 2017).  
Consequently, the need for automatic sound classification began to gain research 
interest. Generally, automatic sound classification (ASC) entails the automatic identification 
of ambient sound in an environment. It has so far been applied in domains such as disease 
diagnosis (Aykanat et al. 2017; Chen et al. 2019), voice classification (Fang et al. 2019), speech 
recognition (Turner & Joseph, 2015), bioacoustics (Kim et al. 2018; Luque et al. 2018), and 
action detection (Aziz et al. 2019; Maccagno et al. 2019; Salamon & Bello, 2017). 
This study is focused on the automatic acoustic classification of natural disasters as a 
complement to vision-based classification of natural disasters. Acoustic classification is chosen 
over vision-based classification because natural disasters are a result of seismic activities in the 
ocean; these seismic activities produce layers of wave-like pulses that are invisible and 
inaudible to human eyes and ears yet observable as sound waves (Mone, 2007; Perlman, 2013). 
Accordingly, these sound waves can be recorded, monitored, and used to provide early warning 
of an upcoming seismic event (Monroe-Kane, 2019). It is important to note that this study is 
concerned only with those natural disasters that are a result of seismic activities. 
Developing an automatic acoustic classification technique (model) requires the selection of 
appropriate acoustic features (Aziz et al. 2019), as well as a robust classification algorithm 
(Luque et al. 2018). A robust classification algorithm 
in this context is a classifier that can distinguish sounds that belong to distinct classes of the 
feature space. Sound classifiers are broadly categorized as discriminative or non-discriminative 
(Mitilineos et al. 2018). While the former entails modelling the decision boundary in the 
training data and matching a test input to a specific data class, the latter attempts to explicitly 
model the actual distribution of each class (Chu et al. 2009; Mitilineos et al. 2018). Examples 
of discriminative classifiers include logistic regression, nearest neighbor, k-means, support 
vector machines, and traditional neural networks such as the multilayer perceptron. 
Non-discriminative classifiers include Naïve Bayes, Markov random fields, Hidden Markov 
Models (HMM), and Bayesian networks. 
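The distinction between the two families of classifiers can be illustrated with a minimal sketch. The code below is not taken from this study: the toy one-dimensional feature values, the function names, and the learning-rate settings are all hypothetical and chosen only for illustration. A non-discriminative classifier is approximated by fitting one Gaussian per class and choosing the class whose distribution best explains the input, while a discriminative classifier is approximated by a logistic regression that learns the decision boundary directly.

```python
import math

# Hypothetical 1-D feature values (e.g. a single cepstral coefficient) for two classes.
class_a = [0.9, 1.1, 1.3, 0.8, 1.0]
class_b = [2.9, 3.2, 3.0, 3.4, 2.7]


def fit_gaussian(samples):
    """Non-discriminative step: estimate mean and variance of one class."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, var


def log_gaussian(x, mean, var):
    """Log-density of x under a 1-D Gaussian with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def generative_predict(x, params_a, params_b):
    """Choose the class whose fitted distribution explains x best (equal priors assumed)."""
    return "a" if log_gaussian(x, *params_a) >= log_gaussian(x, *params_b) else "b"


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Discriminative step: learn the boundary w * (x - center) + b = 0 by gradient descent."""
    center = sum(xs) / len(xs)  # centring the feature keeps the optimisation well behaved
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            z = w * (x - center) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class "b"
            grad_w += (p - y) * (x - center)
            grad_b += p - y
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b, center


def discriminative_predict(x, w, b, center):
    """Classify by which side of the learned boundary x falls on."""
    return "b" if w * (x - center) + b > 0 else "a"


params_a = fit_gaussian(class_a)
params_b = fit_gaussian(class_b)
xs = class_a + class_b
ys = [0] * len(class_a) + [1] * len(class_b)
w, b, center = fit_logistic(xs, ys)

print(generative_predict(1.0, params_a, params_b), discriminative_predict(1.0, w, b, center))
print(generative_predict(3.0, params_a, params_b), discriminative_predict(3.0, w, b, center))
```

On well-separated data such as this, both approaches assign the same labels; they differ in what is learned (per-class distributions versus a boundary), which is the property the classification above refers to.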
In this study, neural networks, which are discriminative classifiers, are adopted for the 
classification task. Neural networks such as convolutional neural networks (CNNs) and 
recurrent neural networks (RNNs) are well-known classifiers that have been efficient in a wide 
variety of practical applications (Binkhonain & Zhao, 2019). While the CNN is known for its high 
accuracy in image classification and recognition (Arel et al. 2010; Epelbaum, 
2017), the RNN is known for its efficiency in processing sequential and time-series information 
(Arel et al. 2010; Pouyanfar et al. 2018). Additionally, these two classifiers have been shown 
to perform excellently in signal and speech/sound classification (LeCun et al. 2015; Mitilineos 
et al. 2018). This study uses these two neural networks in the automatic acoustic 
classification of natural disasters. 
1.2 RESEARCH PROBLEM 
Natural disasters such as tsunamis, volcanoes, hurricanes, and earthquakes are powerful events 
with an infrasonic signature and low frequencies that are inaudible to the human ear 
(Gopalaswami, 2018; Perlman, 2013). Given the low frequency of sound from the movement of 
the earth floor, it is no surprise that there are challenges in the early detection of natural 
disasters (Mone, 2007; Perlman, 2013). Consequently, the yearly cost of natural disasters is 
high in terms of both financial loss and loss of human lives. It is estimated that annually ninety 
thousand (90,000) people lose their lives and nearly 160 million are affected as a result of 
natural disasters (Tobergte & Curtis, 2013). For instance, in 2018, the financial cost 
of natural disasters was estimated to be in the region of $91 billion in the US alone (Chappell, 
2019).  
While meteorological scientists and researchers are looking for better techniques to combat 
these disasters, studies have shown that natural disasters cannot be stopped (Goswami et al. 
2018; Khalaf et al. 2018; Wallemacq, 2015). However, being able to predict their occurrence 
well in advance and to differentiate between various kinds of disasters and their severity can 
mitigate their impact (Khalaf et al. 2018). According to Chen et al. (2013), the damage caused 
by natural disasters can be reduced significantly if information technology tools such as remote 
sensing and satellite data are employed. Okamoto et al. (2018) also add that images obtained 
from surveillance cameras can be used to automatically assign disaster names to the disaster 
prevention networks. Conversely, Panagiota et al. (2011) argue that the use of images in 
pre-disaster imagery often prevents accurate change detection. Remote 
sensing methods, for their part, are limited by a lack of cost-effective availability and by temporal 
delays of about 48 to 72 hours before information can be produced (Resch et al. 2018). According to 
Wisner & Adams (2002), these temporal delays in information dissemination increase the level of 
vulnerability regarding physical and environmental security. Studies have shown that 
natural disasters disrupt measures taken to protect buildings, systems, and business operations 
(Tierney, 2019). 
With the resurgence of interest in Artificial Intelligence (AI), researchers have found 
that it can be used to manage, predict, or detect disasters. However, since AI techniques 
such as machine learning algorithms are based on data from records, it is difficult for artificial 
intelligence to predict long-term trends of various natural disasters (Joshi, 2019). More 
particularly, predictions are often inaccurate, underestimated, or overestimated due to 
discrepancies in the data used for the prediction (Joshi, 2019); typical cases include Indonesia’s 
tsunami early warning system (Singhvi et al. 2018), a false earthquake warning in Japan (BBC 
News, 2018), and the California earthquake warning that was sent 92 years late 
(https://www.bbc.com/news/technology-40366816, 2017). Additionally, existing natural 
disaster detection systems are either unable to identify an event (natural disaster) in real time 
or unable to send early warning signals due to the unavailability of a reliable detection and alert 
system that runs 24 hours a day (UNDP, 2012). A typical case of an ineffective alert system is 
the 2010 earthquake in Maule, Chile: no warnings were issued and no evacuation plans were made 
even three hours after the tsunami hit the Chilean coast (Soule, 2014). 
Machine learning algorithms used in several studies to classify natural disasters have relied
on images, videos, text, and numerical data. Despite these efforts, the damaging effects of
natural disasters persist, and current approaches for managing these disasters are either
insufficient or inefficient (Boustan et al. 2017; Duggar et al. 2016; Evans, 2011).
Given that:
− the temperature of the ocean determines climate and wind patterns, which in turn affect
life on land and the ecosystem (Domingo, 2012),
− disasters like hurricanes, tsunamis, earthquakes, and tornadoes are a result of seismic
activities in the ocean,
− most of these disasters produce sound during formation, even though some of the
produced sounds are below the range of human hearing, with a sampling rate of 2.5 to 5 kHz
(Mone, 2007),
− sound travels faster in water (about 1500 m/s) than in air (about 353 m/s), which implies
that water is a favorable environment for sound propagation (Aziz et al. 2019; Stojanovic &
Beaujean, 2016), and
− the use of satellite or airborne images is limited because light cannot travel beyond
shallow water depths and therefore cannot capture underwater information (Domingo, 2012;
Hassiotis, 2018),
this study proposes the use of acoustic signals/sound instead of images, video, text, or
numerical data to differentiate one disaster type from another, for the following reasons. A
sound-based approach is less invasive and inexpensive, and it allows long-term monitoring of
large areas and the collection of large amounts of data in real-time (Chen et al. 2019;
Malfante et al. 2018). Sound classification is also more reliable than image and video
classification because it is not affected by variations in light intensity (Aziz et al. 2019).
Additionally, the wide-angle camera lenses used in computer vision are not as omnidirectional
as the microphones used for sound (Aziz et al. 2019; Mitilineos et al. 2018).
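To make the propagation argument concrete, the short sketch below compares how long a sound would take to reach a sensor through water and through air, using the two speeds quoted above. The 100 km distance and the function name are illustrative assumptions, and the calculation ignores attenuation, refraction, and changes in speed with temperature or depth.

```python
# Illustrative comparison of acoustic travel times in water and in air.
# The speeds are the figures quoted in the text; the distance is an assumption.
SPEED_WATER_M_S = 1500.0
SPEED_AIR_M_S = 353.0


def travel_time_s(distance_m: float, speed_m_s: float) -> float:
    """Return the time in seconds for sound to cover distance_m at a constant speed."""
    if speed_m_s <= 0:
        raise ValueError("speed must be positive")
    return distance_m / speed_m_s


distance_m = 100_000.0  # hypothetical source-to-sensor distance (100 km)
t_water = travel_time_s(distance_m, SPEED_WATER_M_S)
t_air = travel_time_s(distance_m, SPEED_AIR_M_S)
print(f"water: {t_water:.1f} s, air: {t_air:.1f} s, ratio: {t_air / t_water:.2f}")
```

Under these assumptions the signal arrives in roughly 67 s through water against roughly 283 s through air, which is the sense in which water is the more favorable medium for early detection.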
1.3 RESEARCH AIM AND OBJECTIVES 
Taking the factors discussed in Section 1.2 into consideration, this thesis aims to develop
an automatic sound classification model for natural disasters. The model will enable the
automatic classification of these disasters amid ambient noise as well as sounds below the
human hearing range. In particular, this study uses deep learning techniques to build the
sound classification model. The specific objectives of this study are:
i. To explore existing literature and identify the approaches and methods used for
managing and detecting natural disasters.
ii. To explore literature and identify the state of the art in the classification of acoustic
signals/sounds (events) in various application domains that use AI techniques.
iii. To develop a natural disaster sound classification model using deep learning
techniques.
iv. To analyze diverse sound features and evaluate the models using existing model
validation techniques.
1.4 EXPECTED CONTRIBUTIONS 
The outcome of this study is expected to make key contributions to the fields of Internet of 
Things (IoT), intelligent systems, environmental monitoring, natural disaster detection, and 
acoustic signal processing. Since approximately 71% of the earth’s surface is covered by the 
ocean, the expected contribution of this study to IoT also extends to Internet of Underwater 
Things (IoUT) and its applications such as underwater exploration, disaster prevention, and 
military surveillance. 
1.4.1 THEORETICAL CONTRIBUTION 
This thesis explores existing literature in natural disasters, sound/acoustic signals, and artificial 
intelligence. It provides a theoretical background on the relationship between acoustic signals 
and natural sounds in the environment. It also highlights how machine learning and deep 
learning techniques are used to distinguish one sound type from the other.  
It therefore proposes an acoustic model that will be useful in the early detection of natural
disasters. More particularly, the proposed model is expected to simplify natural disaster 
detection processes and the detection of natural sounds in general for both researchers and 
practitioners.  
1.4.2 PRACTICAL CONTRIBUTION 
The developed model is expected to facilitate the early and automatic classification of natural 
disasters. The automatic classification of natural disasters using sound/acoustic signals may
also be useful for environmental monitoring and disaster detection, and may consequently
mitigate the massive loss of life and property caused by natural disasters.
Furthermore, the developed model is expected to facilitate acoustic sensing in IoT devices.
This is because the Internet of Things (IoT), as an enabling technology for remote monitoring,
requires a clear sound detection and identification system that is capable of effectively
sensing and analyzing environmental and natural sounds.
1.5 THESIS OUTLINE 
The thesis is organized as follows. Chapter one presents the background of the research by 
briefly stating the research problem, aim, and objectives. The chapter also highlights the 
expected practical and theoretical contributions of the research to knowledge. An outline of the 
thesis structure is also presented in this chapter. 
Chapter two provides review summaries on studies related to artificial intelligence methods of 
mitigating the effects of natural disasters. It provides an extensive literature study on current 
techniques for disaster detection, prevention, and management concerning various types of 
disasters. 
Chapter three presents the state of the art in general sound classification and reports the
feature extraction and classification techniques that exist in the literature.
Chapter four presents the research methodology. More particularly, it presents the design
science research methodology (DSRM) and how this study fits into it.
Chapter five reports the steps used in developing the model and conducting the
classification tasks.
Chapter six evaluates the performance of the developed models and compares the deep
learning techniques used in classifying natural disaster sounds. Finally, Chapter seven
provides a summary of the thesis.
  
Chapter Two  
RELATED STUDIES 
2.1 CHAPTER OVERVIEW  
This chapter provides a review summary of studies on natural disaster mitigation schemes
using artificial intelligence (AI). More particularly, it identifies and summarizes the different 
measures proposed by researchers to either detect, prevent, or manage natural disasters. 
2.2 NATURAL DISASTERS 
Every year, natural disasters of various forms cause significant damage to property and
animal life around the world, as well as the loss of thousands of human lives. Natural
disasters can be classified into three main categories: those caused by movements of the
earth, otherwise known as geophysical events (earthquakes, tsunamis, volcanic eruptions);
weather-related disasters (hurricanes, tornadoes, extreme heat or cold); and others (floods,
landslides, famine) (Evans, 2011).
Although meteorologists, environmental scientists, computer scientists, and other researchers
have put a lot of work into predicting, detecting, and managing these disasters, their effects
persist. For instance, between 2000 and 2019 there were an estimated 7342 natural disaster
events globally (see Figure 2.1). The most disaster-prone areas have been coastal regions,
and the most affected people are low-income earners (Boustan et al. 2017).
[Figure: bar chart of the number of natural disaster events per year, 2000 to 2019. Axis:
Number of events (0 to 450). Data labels, listed from 2019 down to 2000: 409, 415, 399, 375,
373, 373, 362, 355, 352, 420, 379, 248, 389, 391, 403, 334, 340, 368, 316, 343.]
Figure 2.1: Natural disaster events globally from 2000 to 2019 (Duggar et al. 2016)
2.3 REVIEW SUMMARY ON RELATED WORK 
In this section, a summary of existing literature on the current methodologies for managing 
natural disasters using artificial intelligence will be provided. The findings are summarized in 
Table 2.1. 
In Table 2.1, the research trends and techniques for combating natural disasters are
summarized by task category, task objective, modelling technique, and type and source of
data. The ensuing sections elaborate on the findings summarized in Table 2.1.
2.3.1 TASK CATEGORIES 
From Table 2.1, three major categories of tasks for managing natural disasters can be identified. 
They include (i) prediction, (ii) detection, and (iii) disaster recovery and management 
strategies. 
PREDICTION: Predicting a natural disaster is an ideal way of mitigating its effects. It
involves forecasting the type, time, place, and magnitude of a disaster and is commonly based
on data gathered from past occurrences and disaster-prone areas, as well as the attributes of
the disaster. According to Khalaf et al. (2018), using a predictive classification
approach assists in tackling the severity of a natural disaster and depends on the features
identified from the available datasets. Chen et al. (2017) proposed a hybrid model of rotation
forest ensembles and naïve Bayes tree classifiers that can improve the accuracy of a disaster
prediction model. Kim et al. (2017), on the other hand, developed a smart-eye platform for
disaster recognition and response.
However, Goswami et al. (2018) argue that although disaster-prone areas can be predicted,
the problem of combating natural disasters cannot be solved with the available data and
techniques. Conversely, Kim et al. (2017) posit that prediction aims to reduce the damage
caused by disasters by using data from past disasters to recognize current disaster situations.
DETECTION: Detection involves promptly identifying a disaster as soon as it occurs.
According to Gupta & Doshi (2018), the primary objective of disaster detection is to reduce
the level of damage and destruction to lives and property; hence it should be fast and
accurate. Traditionally, natural disaster detection is done by meteorological observatories;
however, news of a detection, with exact spots and repositories, takes a long time to reach
the appropriate authorities (Goswami et al. 2018). Consequently, the demand for real-time
situation reports in disaster situations has created the need for disaster detection methods
and systems (Wieland et al. 2016). Some of the detection systems found in the literature
include:
i. an alert generating system for flood detection using sensor technology, particularly
Global System for Mobile Communications (GSM) modems (Khalaf et al. 2015),
ii. Disaster and Agriculture Sentinel Applications (DiAS) for remote sensing and a
processing chain for the analysis of Sentinel data towards flood detection (Doxani et al. 2019),
iii. a sensing device that can monitor and detect flash floods, pluviometry, water presence,
and water level (Mousa et al. 2016),
iv. a smart Automatic Warning System (AWS) that uses an automatic water level recorder
(AWLR) sensor whose working mechanism is based on cognitive artificial intelligence (CAI)
(Asnaning & Putra, 2018),
v. SENDI (System for detecting and forecasting Natural Disaster based on IoT), a
fault-tolerant system based on IoT, ML, and wireless sensor networks (WSN) for the detection
and forecasting of natural disasters (Furquim et al. 2018),
vi. an automatic image-based natural disaster naming system that uses AI (Okamoto et al.
2018).
DISASTER RECOVERY AND MANAGEMENT: Disaster recovery and management is
made up of three phases: awareness/early warning, response, and post-disaster assessment
(Tarasconi et al. 2017). It involves response and rescue activities in addition to
communication measures that ensure prompt identification of and support for casualties
(Goswami et al. 2018). According to Hwang et al. (2018), communication plays a key role in
the survival of victims during and after disasters. The summaries in Table 2.1 indicate that
more studies focus on response and post-disaster assessment (i.e. investigating the recovery
and monitoring mechanisms) than on awareness/early warning. Some of the post-disaster
systems developed include:
i. a web-based knowledge system for emergency preparedness and response (Sermet &
Demir, 2018),
ii. a flood monitoring system based on computer vision in which uploaded images are
analyzed using deep learning algorithms (Vallimeena et al. 2018),
iii. an alert-based system that uses a WSN to sense environmental changes in temperature
(Gupta & Doshi, 2018).
 
Table 2.1: Disaster type, modelling techniques, and task summary
| Task | Objective | Modelling Technique Used | Source of Data | Type of Data |
|---|---|---|---|---|
| Flood | | | | |
| Prediction | To build a model for the prediction of flood severity (Khalaf et al. 2018) | Random forest, Artificial Neural Network (ANN), Levenberg-Marquardt learning algorithms (LEVNN), Support Vector Machine (SVM) | Meteorological data | Numerical flood data |
| Detection | To develop a flood alert generating system (Khalaf et al. 2015) | Random forest, bagging, decision tree, and hyper pipes algorithms | Historical data from the environment agency website, UK | Numerical flood data |
| Disaster recovery & management strategies | To develop an intelligent system designed to improve societal preparedness for flood (Sermet & Demir, 2018) | Natural language processing (NLP) techniques | - | Ontology based data |
| Detection | To introduce a processing chain for the analysis of Sentinel data towards flood detection (Doxani et al. 2019) | Decision support system | Sentinel and Synthetic aperture radar (SAR) data | SAR and optical images |
| Detection | To develop a sensing device for water level detection (Mousa et al. 2016) | ANN, ARMAX | 8F48 and D3CB sensors | Raw water level measurement data |
| Detection | To develop a smart warning system (AWS) (Asnaning & Putra, 2018) | Cognitive artificial intelligence | Automatic water level recorder (AWLR) sensor recorded spreadsheets | Raw water level measurement data |
| Disaster recovery & management strategies | To develop a computer vision-based algorithm for flood depth estimation (Vallimeena et al. 2018) | CNN | Smartphones | Raw crowdsourced images |
| Detection and Prediction | To develop a fault-tolerant system for forecasting and issuing warnings of natural disasters-based on IoT (Furquim et al. 2018) | Multilayer Perceptron (MLP) | IP-based (Internet Protocol) sensor networks | Raw numerical data collected from rivers |
| Prediction | To produce landslide susceptibility maps (LSM) for the planning and management of areas vulnerable to landslides (Chen et al. 2017) | Rotation Forest ensembles (RFEs) and naïve Bayes tree (NBT) | Historical records | Satellite images |
| Detection | To develop an efficient image object detection method for use in a disaster prevention network (Okamoto et al. 2018) | Convolutional Neural Network (CNN), Natural language processing (NLP) techniques | Information Centric Networking (ICN) | Text and images |
| Disaster management | To identify spatial distributions of both risk and damage cost of the wildfire (Nasanbat et al. 2018) | Multi-criteria Evaluation Analysis (MCEA) | GIS vector, GIS Thematic maps, Climate data, Field data, and Satellite data | Satellite imagery, map of the land cover thematic, field and climatic data |
| Prediction | To implement a model that predicts wildfires using Remote Sensing (Sayad et al. 2019) | ANN and SVM | Data was collected from Moderate Resolution Imaging Spectroradiometer | Images |
| Detection | To develop an alert system using sensor network and LEACH algorithm (Gupta & Doshi, 2018) | Low Energy Adaptive Clustering Hierarchy (LEACH) algorithm | Wireless sensor network transmitting data from the cluster head to the base station and then to the radio receiver | Raw analog data from sensors converted to digital data |
| Prediction | To show that regression works better than classification for detection of forest fires (Kansal et al. 2016) | Root mean square error (RMSE), linear regression, SVM, decision trees, GRNN | Meteorological data from UCI repository | Forest fire data |
| Prediction | To reduce disaster damages by training a deep learning model for forest fire prediction (Kim et al. 2017) | CNN | Optical sensor | Aerial images |
| Detection and prevention | To develop an algorithm for early detection of forest fire (Li et al. 2018) | Multiple Linear regression (MLR) and Decision Tree (DT) | Forest fires dataset | Numerical and categorical data |
| Disaster recovery & management strategies | To develop a distributed data center that carries information from relief distribution centers to the affected areas for emergency needs (Dubey et al. 2018) | Distributed service broker policy algorithm (DSBP) | Logistic information systems (LIS) | Statistical data |
| Disaster recovery & management strategies | Using AI to identify risk areas and determine future needs (Ivić, 2019) | ANN and SVM | Pre-earthquake and post-earthquake remote sensing images Landsat and Sentinel images | Images |
| Disaster recovery & management strategies | To present an approach to analyzing social media posts to assess the footprint of the damage caused by natural disasters (Resch et al. 2018) | Latent Dirichlet Allocation (LDA) and Local spatial autocorrelation | Social media data - Twitter | Tweet text |
| Disaster recovery & management strategies | To develop a classification model to solve the trigger design challenge (Calvet et al. 2017) | SVM, Naïve Bayes, logistic regression, Neural network | Simulated CAT data models | Numerical and categorical data |
| Detection | To develop a model for early detection of disaster (Wieland et al. 2016) | SVM | SAR data | SAR images |
| Disaster recovery & management strategies | To process social media textual and imagery data to generate visual and descriptive summaries of hurricanes (Alam et al. 2019) | Stanford sentiment analysis, K-means algorithm, Random forest, and LDA | Twitter | Tweets text and images |
| General disasters | | | | |
| Disaster recovery & management strategies | The application of multihop ad hoc networks in disaster response scenarios (Hwang et al. 2018) | Simulation | - | - |
| Disaster recovery & management strategies | To develop a model for performing information extraction on generic, hazard-related social media data streams (Tarasconi et al. 2017) | Natural Language Processing (NLP) technique | Twitter | Tweet text |
2.3.2 MODELLING TECHNIQUES USED 
The impact of natural disasters can be reduced by developing predictive algorithms (Li et al.
2018). Thus, researchers have adopted machine learning (ML) techniques as tools for
detecting, predicting, or managing natural disasters. Table 2.1 and Figure 2.2 show the
various modelling techniques and their distribution, respectively. The modelling techniques
are divided into four categories: traditional machine learning techniques such as random
forest, SVM, and logistic regression; neural networks (deep learning) such as ANN and CNN;
natural language processing (NLP) techniques such as Stanford sentiment analysis and Latent
Dirichlet Allocation (LDA); and other statistical tools and algorithms such as root mean
square error (RMSE), multiple linear regression (MLR), local spatial autocorrelation, and
linear regression.
[Figure: bar chart of the number of uses per technique category. Traditional machine
learning: 10; Other statistical tools/algorithms: 9; Neural networks: 3; NLP: 3.]
Figure 2.2: Distribution of modelling techniques
As shown in Figure 2.2, the modelling techniques used were predominantly traditional machine
learning techniques, closely followed by other statistical tools/algorithms. The least used
techniques were neural networks and NLP techniques. It was also observed that, although
various modelling techniques were used, a plethora of techniques remain unexplored.
2.4 REVIEW SUMMARY 
The review summary provided in this chapter is not exhaustive, as the search was narrowed to
articles from the Scopus database only. Nevertheless, 24 articles published between 2014 and
2020 were reviewed and the following observations were made.
i. It was observed that the subject of natural disaster control is a multidisciplinary one as 
it involved researchers from computer & mathematical sciences, engineering, 
informatics, agriculture, and social sciences. 
ii. The investigated disasters were floods, earthquakes, tsunamis, forest fires, wildfires,
landslides, and hurricanes.
iii. Out of the 24 articles, research was predominantly conducted on disaster detection and
disaster recovery & management strategies, and the methodologies used for these tasks
varied.
iv. Some of the researchers who proposed detection, prediction, or disaster management 
strategies as a means of mitigating the effects of natural disaster adopted the machine 
learning classification approach using either images of disaster scenes, text content or 
numerical measurements. Table 2.2 summarizes the various classification schemes 
identified in the reviewed articles. 
  
Table 2.2: Classification schemes for natural disasters
| Task | Classes | Type of data |
|---|---|---|
| Classification of flood severity (Khalaf et al. 2015) | Normal or dangerous | Numerical measurements |
| Classification of flood severity (Khalaf et al. 2018) | Normal, abnormal, or high-risk floods | Numerical measurements |
| Classification of flood data (Doxani et al. 2019) | Flood or no flood | Numerical measurements |
| Classification of disaster reports (Imran et al. 2017) | Relevant or irrelevant messages and mild or severe disasters | Textual and image contents |
| Classification of fire scenes (Kim et al. 2017) | Fire or non-fire scenes | Images |
| Classification of disaster scenes (Ivić, 2019) | Affected, vegetation, sea and rivers, bare land, and clouds | Images |
| Classification of faces, gender and age group of flood victims (Vallimeena et al. 2018) | Deep or shallow water depth | Numerical measurements |
It was observed that classification schemes were used as post-disaster management or 
assessment strategies to determine either the magnitude of the disaster, or the number of 
casualties, or the affected population. 
Furthermore, the following classification metrics were used to evaluate the performance of the
models: true positive (TP), cases where a disaster is correctly predicted; false negative
(FN), cases where a disaster is incorrectly predicted as a non-disaster; false positive (FP),
cases where a non-disaster is incorrectly predicted as a disaster; and true negative (TN),
cases where a non-disaster is correctly predicted. Data sources included online tweets from
social media (Twitter), historical data, meteorological data, simulated data, and wireless
sensors.
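The four outcomes defined above can be counted directly from a list of true labels and a list of predicted labels. The sketch below is a minimal illustration, assuming a binary encoding of 1 for disaster and 0 for non-disaster; the function names and the sample labels are hypothetical and do not come from any of the reviewed studies. Accuracy, precision, and recall are shown only as commonly derived summaries of the four counts.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FN, FP, and TN for binary labels (1 = disaster, 0 = non-disaster)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth not in (0, 1) or pred not in (0, 1):
            raise ValueError("labels must be 0 or 1")
        if truth == 1:
            counts["TP" if pred == 1 else "FN"] += 1
        else:
            counts["FP" if pred == 1 else "TN"] += 1
    return counts


def summary_metrics(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive accuracy, precision, and recall; a ratio with a zero denominator is 0.0."""
    tp, fn, fp, tn = counts["TP"], counts["FN"], counts["FP"], counts["TN"]
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels for eight audio clips.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
print(summary_metrics(counts))
```

In this toy example one disaster clip is missed (a false negative) and one non-disaster clip raises a false alarm (a false positive), which are the two error types an early warning system has to trade off.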
Although data forms such as text, images, and statistical data were used to classify and
identify patterns of natural disasters, this list of data types is not exhaustive. For
instance, no study was identified on the use of sound/acoustic signals for detecting or
predicting natural disasters, even though this approach has its advantages. Monroe-Kane
(2019) states that acoustic data makes it possible to quantify the characteristics of a
volcano, including the duration, frequency, intensity, and progression over time of
eruptions. Also, distinguishing events using acoustics enables the measurement of wind
patterns and the determination of the destructive power of a natural disaster such as a
hurricane (Wilson & Makris, 2006). Similarly, natural disasters produce acoustic signals, and
detecting the infrasound pulses and T-waves from disasters such as earthquakes, crashing
waves, hurricanes, volcanoes, and tornadoes can supplement the information gathered from
satellites (satellite images) and airplanes (aerial views) (Hassiotis, 2018; Mone, 2007). An
effective disaster sound detection and monitoring system can increase early detection rates
(true positives) and reduce false alarms (Simmonds & MacLennan, 2005).
2.5 CHAPTER SUMMARY 
In this chapter, studies related to the use of AI techniques to mitigate the effects of
natural disasters on life and property, as well as on environmental and physical security,
were summarized. The modelling techniques, data sources, data types, and classification
schemes were also summarized.
The main focus of this study is on differentiating one natural disaster type from another in
real-time using sound/acoustic signals. However, no study on the subject was found in the
review, as most of the studies focused on post-disaster management strategies. In the next
chapter, this study will seek to investigate various sound classification tasks, their 
methodologies and algorithms. This will be achieved by a systematic review of the literature. 
  
Chapter Three  
STATE OF THE ART IN SOUND CLASSIFICATION 
3.1 CHAPTER OVERVIEW 
This chapter presents a systematic review of the literature that analyzes the use of machine
learning in various sound classification tasks. The goals of this chapter are to:
i. Identify publication patterns in the area of acoustic signal/sound classification. 
ii. Identify trends in the use of machine learning in acoustic signal/sound classification.  
iii. Identify open questions and challenges in the use of machine learning algorithms in 
sound/acoustic signal classification.  
iv. Identify research gaps in the subject area. 
In this chapter, the terms sound and acoustic signals may be used interchangeably. 
3.2 RELATED REVIEWS 
This section identifies studies that have attempted to systematically review the literature on the 
classification of sounds or acoustic signals using artificial intelligence (AI) techniques. 
Existing systematic reviews in the subject area were identified and then evaluated using the
evaluation criteria of Greenhalgh (1997). These criteria have been adopted in a number of
studies, including Tranfield et al. (2003) and Van Dulmen et al. (2007).
Findings from existing systematic reviews in the subject area indicated that the available
studies satisfied the predefined evaluation criteria and provided summaries and reproducible
review methodologies. However, researchers focused predominantly on the classification of
biomedical acoustic signals, particularly heart sounds (Dwivedi et al. 2019), lung sounds
(Palaniappan et al. 2013), respiratory sounds (Pramono et al. 2017), and speech sound
disorders in children (Wren et al. 2018). In other words, no secondary study (systematic
review) on the classification of sounds in general was found. Considering the various
applications of sound in
our day-to-day activities, the lack of sufficient summaries justifies the need for a systematic
review.  
3.3 REVIEW QUESTIONS 
Acoustic signals (sound), rather than imaging or computer vision, are gradually gaining
research popularity as a tool for environmental monitoring, disease diagnosis, and data
transmission. Recently, machine learning algorithms have been used for various classification
tasks (Aucouturier et al. 2011; Bishop et al. 2019). However, due to the plethora of machine
learning algorithms, selecting a suitable algorithm for a specific classification task is
difficult. Hence, there is a need to identify open questions and state-of-the-art trends and
tools that will help researchers appropriately position new research activity in this domain.
This review examines the following research questions to identify publication trends and to
inform researchers about current approaches in algorithm usage.
The research questions stated in Table 3.1 are divided into two categories. Category one is
made up of questions that seek to provide an overview of publication trends in the area of
sound classification and machine learning. Category two seeks to provide a good
methodological background for a broader work by identifying research gaps and up-to-date
methodologies.
Table 3.1: Research questions and objectives
| No. | Research questions | Objectives |
|---|---|---|
| Category one | | |
| 1 | What are the yearly publication trends? | To identify the frequency of primary studies per year. |
| | What Journal has the highest number of publications? | To identify the frequency of publications per journal. |
| | What is the frequency of authors? | To identify authors who are consistent in writing on the subject area. |
| | What is the country of origin of the authors' affiliated institutions? | To identify countries with the highest number of publications. |
| Category two | | |
| 2 | What kind of sound is classified? | To identify the different types of classified sounds. |
| | What is the format of the sound? | To identify predominantly used audio formats for classification. |
| | What are the sample rates of the audio recordings? | To determine the maximum audio frequency that can be reproduced. |
| | What datasets were used for the classification and (or) evaluation? | To identify datasets that are available for public use. |
| 3 | What are the various application domains? | To identify domains in which sound classification is predominantly performed. |
| 4 | What features were extracted or what feature extraction technique was used? What classifiers were used? | To identify predominantly used feature extraction techniques as well as classification techniques. |
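Question 2 in Table 3.1 links the sample rate of a recording to the maximum audio frequency that can be reproduced. Under the Nyquist criterion this limit is half the sample rate. The sketch below illustrates the arithmetic; the function name and the three sample rates are illustrative assumptions, not values reported in the reviewed studies.

```python
def nyquist_frequency_hz(sample_rate_hz: float) -> float:
    """Return the highest frequency (Hz) representable at the given sample rate."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return sample_rate_hz / 2.0


# Common audio sample rates, used here purely as examples.
for rate_hz in (8_000, 16_000, 44_100):
    limit_hz = nyquist_frequency_hz(rate_hz)
    print(f"sample rate {rate_hz} Hz -> maximum frequency {limit_hz:.0f} Hz")
```

A recording sampled at 44,100 Hz can therefore represent frequencies up to 22,050 Hz, which covers the nominal range of human hearing, whereas low sample rates are sufficient when the sounds of interest are low in frequency.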
3.4 REVIEW APPROACH  
3.4.1 LITERATURE SEARCH 
A systematic search of the literature was carried out in two databases: Scopus and the
Acoustical Society of America (ASA). We sought to review scientific articles from high-ranking journals,
 27  
University of Ghana http://ugspace.ug.edu.gh
hence our choice of SCOPUS. Additionally, ASA was included as the second database since 
the primary interest of this study is on sound. 
Publications were extracted from selected databases using key search terms and their possible 
combination using the logical ‘and’ operator. The key search terms included classification, 
sound, acoustic signals, machine learning, deep learning, and artificial intelligence. The 
combination of these search terms produced the following search phrases (SP):  
SP1 Classification of sound and machine learning 
SP2 Classification of sound and deep learning 
SP3 Classification of sound and artificial intelligence 
SP4 Classification of acoustic signals and machine learning 
SP5 Classification of acoustic signals and deep learning 
SP6 Classification of acoustic signals and artificial intelligence 
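The way the six search phrases arise from the key search terms can be sketched as follows. This is only an illustration: the split of the terms into two lists and the function name build_search_phrases are assumptions made for the example, not part of the review protocol.

```python
from itertools import product

# Key search terms from the text, split (as an assumption) into subject and method terms.
SUBJECTS = ["sound", "acoustic signals"]
METHODS = ["machine learning", "deep learning", "artificial intelligence"]


def build_search_phrases(subjects, methods):
    """Pair every subject term with every method term using the logical 'and'."""
    phrases = {}
    for index, (subject, method) in enumerate(product(subjects, methods), start=1):
        phrases[f"SP{index}"] = f"Classification of {subject} and {method}"
    return phrases


search_phrases = build_search_phrases(SUBJECTS, METHODS)
for label, phrase in search_phrases.items():
    print(label, phrase)
```

Two subject terms combined with three method terms give the six phrases SP1 to SP6 in the order listed above.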
3.4.2 INCLUSION AND EXCLUSION CRITERIA 
A set of specific eligibility criteria was defined and followed to limit our collection of articles
to only those that fit our research objectives. A suitability check of the returned articles was
performed after examining the titles and removing duplicate papers. Only primary studies in which
the methodologies and results were explicitly stated in the abstract and/or conclusion were
considered eligible for the review. The inclusion and exclusion criteria
are as follows:
C1 Include only open-access journal articles written in English and published between the 
years 2010 and 2019. 
C2 Include articles whose title contains keywords like classification and acoustic signals 
or sound and machine learning or deep learning or whose title suggests sound 
classification using artificial intelligence.  
C3 Exclude repeated papers from the search results. 
C4 Exclude papers in which the abstract and/or conclusion does not explicitly state the
classification techniques and/or results for sound classification.
C5 Exclude secondary studies. 
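The criteria above can be read as a sequence of filters applied to each retrieved record. The sketch below is a simplified illustration under stated assumptions: the Article fields are hypothetical, C2 is reduced to a keyword match on the title (in the review this step involved human judgement), and C4 and C5 are represented by flags that a reviewer would set after reading the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical record layout; the fields mirror what criteria C1 to C5 inspect.
    title: str
    year: int
    language: str
    open_access: bool
    states_methods_and_results: bool  # reviewer judgement for C4
    is_primary_study: bool            # reviewer judgement for C5


SUBJECT_TERMS = ("sound", "acoustic")
METHOD_TERMS = ("machine learning", "deep learning", "artificial intelligence")


def passes_c1(article):
    """C1: open-access, written in English, published between 2010 and 2019."""
    return (article.open_access
            and article.language == "English"
            and 2010 <= article.year <= 2019)


def passes_c2(article):
    """C2 (simplified): the title mentions classification, a subject term and a method term."""
    title = article.title.lower()
    return ("classif" in title
            and any(term in title for term in SUBJECT_TERMS)
            and any(term in title for term in METHOD_TERMS))


def screen(articles):
    """Apply C1 to C5 in order and return the articles that remain."""
    kept = []
    seen_titles = set()
    for article in articles:
        if not (passes_c1(article) and passes_c2(article)):
            continue
        key = article.title.strip().lower()
        if key in seen_titles:                      # C3: repeated paper
            continue
        seen_titles.add(key)
        if not article.states_methods_and_results:  # C4
            continue
        if not article.is_primary_study:            # C5
            continue
        kept.append(article)
    return kept
```

Matching duplicates on the normalized title is one simple choice; matching on a DOI, where available, would be more robust.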
3.4.3 STUDY SELECTION AND DATA EXTRACTION 
The six search phrases listed in Section 3.4.1 were used to search the Scopus and ASA databases.
The protocol for this systematic review had three main steps. In the first step, the retrieved
articles were screened against the initial criteria (C1 and C2). In the second step, eligible
articles were exported to a spreadsheet (MS Excel) for further exclusion by repetition,
by abstract and/or conclusion, and by type of study (C3, C4, and C5). The ASA database does not
have an export feature, hence this phase of exclusion was done directly in the browser and
documented manually. The third step entailed downloading and reading the eligible articles to
extract the data relevant to the review questions. The extracted data were collated
in a spreadsheet for ease of use. Table 3.2 shows the search results obtained from the selected
databases after each stage of exclusion or inclusion.
As shown in Table 3.2, the initial search output contained 1,028 research articles published
from 2010 to 2019. Of these, 150 articles were retained after an initial screening by title
and keywords, and 67 articles remained after the removal of duplicates.
A further 19 articles were excluded based on their abstracts or because they were secondary
studies, leaving a total of 48 selected journal articles. It is important to note that the search
covered publications up to 22 December 2019.
  
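Table 3.2: Search results per exclusion criteria
Database | Search phrase | C1 | C2 | C3 | C4 | C5
ASA | SP1 | 50 | 10 | 10 | 8 | 8
ASA | SP2 | 117 | 12 | 7 | 7 | 7
ASA | SP3 | 112 | 11 | 0 | 0 | 0
ASA | SP4 | 291 | 21 | 6 | 5 | 5
ASA | SP5 | 117 | 16 | 0 | 0 | 0
ASA | SP6 | 111 | 21 | 5 | 4 | 3
SCOPUS | SP1 | 104 | 19 | 17 | 13 | 12
SCOPUS | SP2 | 37 | 19 | 11 | 10 | 10
SCOPUS | SP3 | 27 | 6 | 3 | 2 | 1
SCOPUS | SP4 | 27 | 6 | 3 | 2 | 1
SCOPUS | SP5 | 23 | 7 | 3 | 0 | 0
SCOPUS | SP6 | 12 | 2 | 1 | 1 | 1
TOTAL | | 1,028 | 150 | 67 | 43 | 48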
3.5 OVERVIEW OF PUBLICATION TRENDS  
This section answers category one of the research questions stated in Table 3.1. It highlights
the frequency of publications, the distribution of journals, the leading authors in the subject
area, and their country of origin.
3.5.1 PUBLICATION FREQUENCY 
The publication trend covers articles published between the years 2010 and 2019. Figure 3.1 
shows the frequency distribution of the selected articles. 
During the search, it was observed that, within the selected year range, Scopus had no
open-access publications (criterion C1) in the subject area until 2013. As shown in
Figure 3.1, the publication trend was moderate between 2011 and 2015, with
a minimum of two publications per year. The number of publications started increasing in 2016,
with a big jump in 2018 and a slight drop in 2019, probably because the search
was completed before the end of that year. It can therefore be concluded that researchers are
beginning to develop considerable interest in this area of research.
Figure 3.1: Publications by year (bar chart; horizontal axis: year, 2010 to 2019; vertical axis: number of publications)
A further increase in subsequent years is envisaged, considering the popularity of artificial 
intelligence as well as the emergence of sound as an alternative form of data transmission. 
3.5.2 DISTRIBUTION OF JOURNALS 
The search results from Scopus included publications from several journals, including the Journal
of the Acoustical Society of America (JASA). JASA is a journal in the ASA database with
numerous publications in the area of sound/acoustic signals. An independent search of the ASA
database returned more JASA publications, as shown in Table 3.3.
It can be observed that a total of 20 Scopus journals published articles in the subject area
between 2010 and 2019. Of these, 35% of the publications were spread across 17 journals
with at most one publication each. Applied Sciences and Sensors each accounted for 6%
of the publications, followed by IEEE Access with 4%. As mentioned earlier,
JASA had the highest number of publications, with 49%.
Table 3.3: Frequency distribution of primary sources
 Journals Freq
1.  APSIPA Transactions on Signal and Information Processing 1
2.  Biomedical Journal 1
3.  Electronics 1
4.  Elektronika ir Elektrotechnika 1
5.  EURASIP Journal on Image and Video Processing 1
6.  Expert Systems with Applications 1
7.  Computers and Electronics in Agriculture 1
8.  Frontiers in Neuroscience 1
9.  IEEE Signal Processing Letters 1
10.  IEICE Transactions on Information and Systems 1
11.  International Journal of Fuzzy Logic and Intelligent Systems 1
12.  International Journal of Online and Biomedical Engineering 1
13.  International Journal of Online Engineering 1
14.  Noise Mapping 1
15.  PeerJ 1
16.  PLoS ONE 1
17.  IEEE Access 2
18.  Applied Sciences 3
19.  Sensors (Switzerland) 3
20.  JASA 24
3.5.3 AUTHORS AND COUNTRY ORIGIN 
To identify the author or group of authors who write consistently on the subject area, as
well as the countries with the leading number of publications, an analysis of the authors and
their country of origin (the country in which their affiliated institution is located) was done.
With the number of authors per article ranging from 2 to 9, a headcount of the various authors
showed that 209 authors wrote the 48 selected articles. Since one of the objectives of research
question one is to identify authors who write consistently on the subject area, Table 3.4
provides details of authors who wrote more than one article.
Table 3.4 highlights four groups of authors who wrote more than one article as either 
corresponding authors or co-authors. It was observed that the four groups of authors were all 
interested in bioacoustics. 
Table 3.4: Leading authors
Authors’ names | Year | Journal | Reference
Ali K. Ibrahim, Laurent M. Chérubin, Hanqi Zhuang, Michelle Umpierre, Fraser Dalgleish, Nurgun Erdol, B. Ouyang, and A. Dalgleish | 2018 | JASA | (Ibrahim et al. 2018)
 | 2019 | JASA | (Ibrahim et al. 2019)
Abeer Alwan and Charles E. Taylor | 2015 | JASA | (Tan et al. 2015)
 | 2016 | JASA | (Kaewtip et al. 2016)
Yagya Pandeya, Joonwhoan Lee | 2018 | Applied Sciences | (Pandeya et al. 2018)
 | 2018 | International Journal of Fuzzy Logic and Intelligent Systems | (Pandeya & Lee, 2018)
Amalia Luque, Javier Romero-Lemos, Alejandro Carrasco | 2018 | Expert Systems with Applications | (Luque et al. 2018)
 | 2018 | PeerJ | (Luque et al. 2018)
Furthermore, the authors’ country of origin (the country of each author’s institution) and the
frequency of publications per year were also identified. Figure 3.2 shows the distribution of the
authors’ countries of origin, grouped into continents.

Figure 3.2: Distribution of authors by continent
As shown in Figure 3.2, the authors were distributed across five continents. Africa had the lowest
number of publications (1), closely followed by Australia (2), while Europe had the highest
(19). South America recorded no publications; hence it is not represented in Figure 3.2. Figure
3.3 provides a distribution of the publication trend by study country.
As shown in Figure 3.3, studies have been conducted in 22 different countries, with the USA
leading the trend with 23% of all the studies. The USA is followed by China, Korea, and
France with 10% each. Spain is in third place with 8%, while India and the UK are in
fourth place, with 6% each of all the studies.
Figure 3.3: Sound classification publication trend by study country (bar chart; vertical axis: percentage of studies, 0% to 25%)
3.6 SUMMARY OF METHODOLOGIES FROM REVIEWED ARTICLES
This section seeks to address category two of the review questions stated in Table 3.1. As 
earlier mentioned, forty-eight (48) articles were selected based on the criteria used. The 
discussions below are results obtained in line with the research questions.  
For ease of reference, the selected articles have been numbered in the order in which they were
selected (A1 to A48), and these labels are used in the subsequent analysis (see Appendix A for a
list of the papers).
3.6.1 SOUND/ACOUSTIC SIGNALS CLASSIFIED AND DATA SOURCES 
Hearing is considered the second most important sense after sight, and sound is capable of
carrying information about anything in our environment (Perr, 2005). Although information from
sound is different from that obtained from radio frequency (RF), infrared (IR), and optics, it
can be used for detection, classification, and localization (Hartman & Candy, 2014; Lopatka et
al. 2016). Hence, the ability to differentiate sound or signal types becomes imperative, as it
enables the extraction of relevant information about the sound source and the environment
(Rascon & Meza, 2017). So far, sound has been used in areas such as medical acoustics for
medical diagnosis (Beach & Dunmire, 2007; Oweis et al. 2015), environmental monitoring for
security surveillance (Salamon & Bello, 2017; Wu et al. 2018), and bioacoustics for the prediction
of natural disasters (Pandeya & Lee, 2018).
However, its application varies across land, air, and water depending on the medium of
propagation, seasons, activities, and geographic location. Furthermore, sound can be generated by
various sources, grouped as anthrophony (sounds made or caused by humans, e.g. shipping
and drilling noise), geophony (sounds from the environment, e.g. sea surface noise such as the
breaking of waves, ice-breaking, and raindrops), and biophony (sounds from animals, e.g. the
vocalizations of mammals, anurans, and groupers).
This section discusses the different kinds of sounds that were classified, source of data, sample 
rates, duration of sound recordings per file, and availability of datasets as found in the selected 
articles. 
From Table 3.5, it can be observed that 31 of the studies focused specifically on biophony,
13 on anthrophony, and the remaining 4 on all sound categories (anthrophony, geophony, and
biophony).
The marine mammal group, which was the dominant one, was made up of different species of
dolphins and whales (odontocetes and mysticetes). Some of the researchers
were interested in automatically detecting, differentiating, and classifying call types from
different species (Guilment et al. 2018; Halkias et al. 2013; Roch et al. 2011; Shamir et al.
2014). Others were interested in classifying the vocalizations of humpback whales, the whistles
and pulses of dolphins, the song cycles of whales, and the echolocation clicks of beaked whales,
respectively (Allen et al. 2017; LeBien & Ioup, 2018; Ou et al. 2013; Peso Parada &
Cardenal-López, 2014).
Cvengros et al. (2017) classified blast sounds with the aim of monitoring environmental noise,
classifying signals, and differentiating between blast and non-blast sounds.
Human sounds that were classified included respiratory sounds, human voice disorder, and 
baby cry. Aucouturier et al. (2011) described baby cry as a reflexive signal that reflects the 
state of a baby by conveying a message of either a need, pain, discomfort or a medical 
condition.  
Table 3.5: Summary of classified sounds and datasets
REF | Sound type | Link to dataset / name of dataset | L.R | E.D | D.A | Sample rate (kHz) | Time (s)
A1 | Whale | N/M | ✓ | x | x | 96 - 192 | 1 - 8
A2 | Birds | http://www.animalsoundarchive.org/Refsys/Statistics.Php | x | ✓ | ✓ | 1 - 4 | 60
A3 | Fish | SEACOUSTIC2014 | x | ✓ | x | 256 | 10 - 30
A4 | Mysticetes calls | Mobysound.org | x | ✓ | ✓ | 0.1 - 8 | -
A5 | Birds | Birdcalls71, Flight calls datasets, Anuran dataset | x | ✓ | x | 22.1 - 44.1 | 0.5 - 320
A6 | Military blast sound | LRPE, East South Central, APG, SERDP-PITT, MCBC-PITT, New York (Fort Drum) | x | ✓ | x | 5 - 25.6 | 5
A7 | Birds | N/M | ✓ | x | x | 8 - 16 | 10
A8 | Primate calls | N/M | ✓ | x | x | 44 | 3
A9 | Grouper | N/M | ✓ | x | x | 10 | 20
A10 | Marmosets-monkey | http://home.ustc.edu.cn/~zyj008/background_noise.wav | ✓ | x | ✓ | 44.1 | 0.5 - 4
A11 | Marmosets-monkey | http://marmosetbehavior.mit.edu | ✓ | x | ✓ | 48 | 0.5
A12 | Red Hind Grouper | N/M | ✓ | x | x | 10 | 10
A13 | Mysticete calls | DEFLOHYDRO, OHAS-ISBIO, DCLDE 2015 datasets | x | ✓ | x | 0.25 | 6
A14 | Bird song phrases | http://bn.birds.cornell.edu/bna/species (CAVI database) | x | ✓ | ✓ | 20 | 3
A15 | Mysticete calls | Mobysound.org | x | ✓ | ✓ | 1 - 4 | 2
A16 | Odontocetes sound | N/M | ✓ | x | x | 192 | 10
A17 | Humpback whale song unit | N/M | ✓ | x | x | 22 - 44.1 | 3
A18 | Bird song phrase | http://taylor0.biology.ucla.edu/birdDBQuery/ (CAVI database) | x | ✓ | ✓ | 20 - 44.1 | 0.12 - 0.25
A19 | Humpback whales | Auau Channel 2002 and French Frigate Shoals (FFS) dataset | x | ✓ | x | 10 | 0.4 - 3.7
A20 | Beaked whales | https://data.gulfresearchinitiative.org | ✓ | x | ✓ | 92 | 0.0021
A21 | African gray parrot | N/M | ✓ | x | x | 22 | 2.5
A22 | Dolphin whistles and pulses | http://www.cemma.org (CEMMA database) | x | ✓ | ✓ | 96 | 6 - 25
A23 | Anuran calls | Recordings of frog vocalizations were obtained from commercially available compact discs (CD) | x | ✓ | N/M | 44.1 | 8
A24 | Baby cry | N/M | ✓ | x | x | 44.1 | 30
A25 | Livestock (sheep, cattle, & Maremma sheepdogs) | N/M | ✓ | x | N/M | 44.1 | 1
A26 | Aircap, Bells, Bottle, Buzzer, Case, Clap, Cup, Drum, Phone, Pump, Saw, Spray, Stapler, Tear, Toy, Whistle & Wood | http://dcase.community/challenge2018/index. and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.357&rep=rep1&type=pdf (Real-world computing partnership (RWCP) sound scene dataset and DCASE challenge dataset) | x | ✓ | ✓ | 48 | 4
A27 | Respiratory sound (wheezes, crackles & normal sound) | Int. Conf. on Biomedical Health Informatics (ICBHI) scientific challenge database | x | ✓ | N/M | N/M | 1.5
A28 | Heart sound | https://physionet.org/challenge/2016/ (the Physionet database) | x | ✓ | ✓ | 2 | 5 - 120
A29 | Heart sound | https://github.com/yaseen21khan/Classification-of-heart-sound-signal-using-multiple-features-/blob/master/README.md | x | ✓ | ✓ | 8 | -
A30 | Cat (mother call, paining, resting, warning, angry, defense, fighting, happy, hunting mind, mating) | Online video sources including YouTube, Kaggle challenge database and Flicker | x | ✓ | N/M | 16 | 2 - 6
A31 | Anuran (mating and release call) | http://www.fonozoo.com/ | x | ✓ | ✓ | 44.1 | 96
A32 | Pet dog (barking, growling, howling & whining) | https://github.com/kyb2629/pdse | ✓ | x | ✓ | 22 - 44.1 | 0.24 - 1.47
A33 | Anurans (mating, release, distress calls) | http://www.fonozoo.com/ | x | ✓ | ✓ | N/M | 5
A34 | Lung sounds | N/M | ✓ | x | x | N/M |
A35 | Birds, frogs, wind, rain, & thunder | N/M | N/M | N/M | N/M | 16 | N/M
A36 | People, animals, nature, vehicles, noisemakers, office, & musical instrument | http://www.findsounds.com/types.html (FindSounds database) | x | ✓ | ✓ | 16 | 10
A37 | Fish | http://www.fishbase.org/ and http://www.dosits.org/ | x | ✓ | ✓ | 44.1 | 14
A38 | Heartbeat sound (normal, murmur & extra-systole) | Dataset B - PASCAL classifying heart sounds challenge | x | ✓ | N/M | 4 | 12.5
A39 | Air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshots, jackhammer, siren, & street music | https://dl.acm.org/doi/10.1145/2647868.2655045 (Urban 8k dataset) | x | ✓ | ✓ | 22.1 | 4
A40 | Dog barking, firecrackers, rain, rooster, baby cries, sneezing, sea waves, chainsaw, helicopter, & clock sound | https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf (ESC-10 and ESC-50 datasets) | x | ✓ | ✓ | N/M | N/M
A41 | Bird | http://www.vision.caltech.edu/visipedia/CUB-200-2011.html (CUB-200-2011 standard dataset) | x | ✓ | ✓ | 10 | 10
A42 | Conversation, children shouting, walk-footsteps, crowd, hubbub, children-playing, bird, vocalization, truck-horn, motorcycle, traffic-noise, light-engine, medium-engine, engine starting, idling, silence | YouTube videos | ✓ | x | ✓ | 20 | 10
A43 | Cymbals, horn, phone, bells, kara, bottle, buzzer, metal, whistle, ring | RWCP and TIDIGITS datasets | x | ✓ | N/M | 16 - 20 | 3
A44 | Cat (warning, angry, defense, fighting, happy, hunting, mating, mother-call, paining, resting) | YouTube and Flicker | ✓ | x | N/M | N/M | N/M
A45 | EEG (electroencephalogram) signals | http://www.cs.colostate.edu/eeg | x | ✓ | ✓ | |
A46 | Air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, street music | Urban-sound 8k dataset | x | ✓ | ✓ | 44.1 | 4
A47 | Respiratory sound | N/M | ✓ | x | x | 44.1 |
A48 | Human voice disorders | N/M | ✓ | x | N/M | 44.1 | N/M
Hints: L.R = live recording, E.D = existing sound dataset, D.A = data availability (✓ = available, x = not available), N/M = not mentioned.
SAMPLE RATE, AUDIO FORMAT & SIGNAL REPRESENTATION  
The sample rate, which is the number of audio samples captured per second, ranged from
0.1 kHz to 192 kHz. The most commonly used sample rates lay between 22 and 44.1 kHz.
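The objective attached to this question in Table 3.1, determining the maximum audio frequency that can be reproduced, follows from the sampling theorem: a signal sampled at a given rate can represent frequencies only up to half that rate (the Nyquist frequency). A minimal sketch, in which the function name is an assumption made for illustration:

```python
def nyquist_frequency_khz(sample_rate_khz):
    """Return the highest representable frequency (kHz) for a given sample rate (kHz)."""
    if sample_rate_khz <= 0:
        raise ValueError("sample rate must be positive")
    return sample_rate_khz / 2.0


# Sample rates taken from the range reported in the reviewed articles.
for rate in (0.1, 22.0, 44.1, 192.0):
    print(f"{rate} kHz sampling represents frequencies up to {nyquist_frequency_khz(rate)} kHz")
```

For example, the commonly used 44.1 kHz rate covers roughly the range of human hearing, whereas the 192 kHz recordings used for some odontocete studies can capture ultrasonic content up to 96 kHz.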
Across the 48 classified sounds, the dominant audio format was .wav. Others
included mp3 (Peso Parada & Cardenal-López, 2014; Shamir et al. 2014), ARFF (Zhang et al.
2016), and HDF5 (Bold et al. 2019).
Furthermore, the signals and audio files were predominantly represented visually as
spectrograms. A spectrogram is a visual representation of sound with frequency
on the vertical axis, time on the horizontal axis, and color representing the
intensity of the sound at each time-frequency location. According to Halkias et al. (2013),
Malfante et al. (2018), Oikarinen et al. (2019), and Ou et al. (2013), treating spectrograms
as natural images allows them to be processed with available image processing tools. Additionally,
it helps in removing the effect of background disturbances on the classification process (Thakur
et al. 2019). Features extracted from spectrograms usually outperform hand-crafted features
since spectrograms do not discriminate phrase classes with similar dominant frequency
trajectories (Tan et al. 2015). However, unlike natural images, in which the axes carry the same
meaning irrespective of location (so that weights can be shared across the vertical and
horizontal dimensions), the axes of a spectrogram do not carry the same meaning: the vertical
dimension is frequency and the horizontal dimension is time.
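To make the time-frequency layout concrete, the following sketch computes a magnitude spectrogram with a short-time discrete Fourier transform using only the Python standard library. It is a didactic illustration, not the procedure of any reviewed article; the frame size, hop size, sample rate, and test tone are assumed values, and a practical system would use an FFT library.

```python
import cmath
import math


def spectrogram(signal, frame_size, hop):
    """Return one list of magnitudes (bins 0..frame_size//2) per analysis frame."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


sample_rate = 8000   # Hz, illustrative
tone_hz = 1000       # a pure test tone
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(1024)]

spec = spectrogram(signal, frame_size=64, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("frames (time axis):", len(spec))
print("bins (frequency axis):", len(spec[0]))
print("peak frequency in first frame (Hz):", peak_bin * sample_rate / 64)
```

Each inner list is one column of the spectrogram (one time step), and each position within it is a frequency bin, which is why moving a pattern along the frequency axis changes its meaning while moving it along the time axis does not.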
SOURCES OF DATA 
With the aim of identifying publicly available datasets, the datasets used in the reviewed
articles were divided into two categories: pre-existing sound datasets and live recordings.
i. PRE-EXISTING SOUND DATASETS: This category was made up of sound collected from
past experiments, past projects, or pre-existing sound databases. In all, 28 datasets were
obtained from this category (see Table 3.5). Of these, only 18 were stated to be publicly
available, while the others either did not mention availability or were not available due to
licensing or privacy issues.
ii. LIVE RECORDINGS: This category of datasets was generated by the researchers themselves. It
is made up of recordings of the subjects of interest either in their natural habitat (Allen
et al. 2017; Briggs et al. 2012; Ibrahim et al. 2019; LeBien & Ioup, 2018; Roch et al.
2011; Shamir et al. 2014), or in a controlled environment such as recording rooms
and laboratories (Giret et al. 2011; Oikarinen et al. 2019; Zhang et al. 2018). In some
cases, a recording device was attached to the animals (Oikarinen et al. 2019; Shamir et
al. 2014), while in other cases the data were collected with recording units such as
hydrophones, passive acoustic monitoring (PAM) systems, and shotgun
microphones, attached to divers, the seafloor, moving boats, or sinks. In all, 19 datasets
were privately generated, and 5 are available to the public.
Of the 47 data sources mentioned across both categories, only 23 are reported to be
publicly available. This confirms the challenge of limited datasets faced by
numerous researchers in the area of sound classification.
3.6.2 DISTRIBUTION OF CLASSIFIED SOUNDS ACCORDING TO APPLICATION DOMAIN
Considering the different types of classified sounds, the specific sound environment, and the
researchers’ objectives for classifying the chosen sound, three broad application domains
were identified: bioacoustics, medicine, and the environment
(see Figure 3.4).
The application domain of bioacoustics was the most explored, with a 69% occurrence rate. In
this domain, researchers were predominantly interested in the classification of sounds and
vocalizations of birds, mammals, and domestic animals. Some classified
animal calls or vocalizations with the intent of detecting and differentiating one animal species
from others (Guilment et al. 2018; Halkias et al. 2013; Roch et al. 2011). Others were
interested in identifying and differentiating the call types of specific animals (Kim et al. 2018;
Pandeya & Lee, 2018; Roch et al. 2011).
Researchers in the medical domain were predominantly interested in classifying sounds associated
with heart- and lung-related diseases. The major objective was to provide an automated
and efficient classification and recognition system that will assist medical doctors or physicians
in smart diagnosis.
Figure 3.4: Pie chart showing the distribution of application domains (bioacoustics 69%, environment 17%, medicine 14%)
Furthermore, researchers in the medical domain also sought to eliminate invasive traditional
computer vision methodologies such as the use of medical imaging (Chen et al. 2019; Oweis et al.
2015; Vrbancic & Podgorelec, 2018). The remaining 17% of the researchers explored sounds from the
environment to automatically recognize environmental acoustic scenes as well as to precisely
classify the detected sound. The environment domain consisted of sounds from sub-domains
such as human activities, urban environment, surveillance, machinery, weather, and musical
instruments.
3.6.3 FEATURE EXTRACTION METHODS 
The classification of sound/acoustic signals, as with other classification tasks, requires the
extraction of relevant features that make the classification process more efficient and
accurate. According to Wu et al. (2018), feature extraction reduces the size of the data and
represents the complex data as feature vectors. Additionally, the choice of features used to
represent any given set of data may have a strong impact on the classifiers as well as the
classification results (Binder & Paul, 2019; Malfante et al. 2018; Oweis et al. 2015).
To ensure high classification accuracy, some researchers explored feature selection
techniques such as the Jensen-Shannon divergence (Luque, Romero-Lemos, Carrasco, &
Gonzalez-Abril, 2018) and stepwise feature selection (LeBien & Ioup, 2018; Malfante et al.
2018).
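As an illustration of how a divergence measure can support feature selection, the sketch below computes the Jensen-Shannon divergence between two discrete distributions, for example the normalized histograms of a single feature in two sound classes. This is the generic textbook formulation with base-2 logarithms and is not necessarily the exact procedure of Luque et al. (2018); the function names and example histograms are assumptions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions, bounded in [0, 1] for base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical normalized histograms of one feature for two classes.
class_a = [0.7, 0.2, 0.1]
class_b = [0.1, 0.2, 0.7]
score = jensen_shannon_divergence(class_a, class_b)
print(f"Jensen-Shannon divergence: {score:.3f}")
```

Under this reading, a feature whose class-conditional distributions have a larger divergence separates the classes better and would be ranked higher during selection.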
Table 3.6 highlights the feature extraction methods used in the reviewed articles. The methods
have been categorized according to the feature extraction methods stated by Wang & Nanda
(2012).
Table 3.6: Feature extraction methods
Methods | | Reference
Time series transforms | Frequency domain (stationary signals) | (Binder & Paul, 2019; Bourouhou et al. 2019; Fang et al. 2019; X. C. Halkias et al. 2013; Han et al. 2016; Ibrahim et al. 2018; Kim et al. 2018b; Luque, Romero-Lemos, Carrasco, & Gonzalez-Abril, 2018; Malfante et al. 2018; Parada & Cardenal-Lopez, 2014; Roch, Newport, et al. 2011; Su et al. 2019; Yan Zhang et al. 2016)
 | Time-frequency (non-stationary signals) | (Aykanat et al. 2017; H. Chen et al. 2019; Guilment et al. 2018; Malfante et al. 2018; Noda et al. 2016; Ou et al. 2013; Wu et al. 2018)
 | Wavelets (non-stationary signals) | (Aucouturier et al. 2011; Bishop et al. 2019; Qian et al. 2017b; Raza et al. 2019; Yaseen et al. 2018)
Data descriptive statistics | For sensors | (Gingras & Fitch, 2013; Oikarinen et al. 2019; Oweis et al. 2015; Ya-jie Zhang et al. 2019)
 | For events | (Allen et al. 2017; Robakis et al. 2018)
Data descriptive models | Distribution models | (Briggs et al. 2012)
 | Information-based models | (Giret et al., 2011; LeBien & Ioup, 2018)
 | Regression models | -
 | Classification/clustering models | (Aziz et al. 2019; Ibrahim et al. 2019; Kaewtip et al. 2016; Khamparia et al. 2019; Shamir et al. 2014)
Time-independent | Explicit mathematical operations | -
 | Data dimension reduction | (Cvengros et al. 2012a; Tan et al. 2015; Thakur et al. 2019)
From the reviewed articles, the domain of bioacoustics was predominantly made up of marine
mammals. According to Ou et al. (2013), the classification of marine mammals based on sound
begins with analyzing their vocalizations. This includes sound detection from ambient noise,
signal extraction, and feature analysis. Accordingly, they integrated contour extraction and
spectrogram correlation for feature extraction. Specifically, frequency contours were extracted
from the spectrograms of humpback whales by applying image edge detection filters. Shamir et
al. (2014), on the other hand, analyzed spectrograms using the Wndchrm scheme, which is based on
numerical content descriptors such as 2D texture features (Haralick and Tamura textures),
statistical distribution (mean, standard deviation, skewness, and kurtosis) and multi-scale
histograms of the pixel intensities, polynomial distribution (Chebyshev coefficients and Zernike
polynomials), Gabor wavelets, and Radon features. Halkias et al. (2013)
attempted to learn the underlying structures of the calls directly from the spectrogram
using discriminative features. In contrast, Binder & Paul (2019) and Roch et al. (2011) adopted
perceptual features because they provide better discriminative signals for the inter-species
classification of marine mammals and, most importantly, because they take into consideration
how a human listener would differentiate sounds.
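The statistical distribution descriptors mentioned above (mean, standard deviation, skewness, and kurtosis) can be computed from a set of intensity values as in the following sketch. It uses population moments and non-excess kurtosis, which is one common convention and an assumption here; it is not taken from the Wndchrm implementation, and the function name is illustrative.

```python
import math


def distribution_descriptors(values):
    """Return mean, standard deviation, skewness and kurtosis of a sequence of numbers."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n   # population variance
    std = math.sqrt(variance)
    if std == 0:
        # A constant signal has no spread, so the shape descriptors are set to 0.
        skewness = kurtosis = 0.0
    else:
        skewness = sum(((v - mean) / std) ** 3 for v in values) / n
        kurtosis = sum(((v - mean) / std) ** 4 for v in values) / n
    return {"mean": mean, "std": std, "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical pixel intensities taken from one region of a spectrogram.
intensities = [1, 2, 3, 4, 5]
print(distribution_descriptors(intensities))
```

Applied to every region or scale of a spectrogram, such descriptors yield a fixed-length numerical vector regardless of the duration of the original recording.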
Identifying features of Mysticetes calls is challenging because the
selected features are expected not only to differentiate the call types but also to detect any
other signal within the same context (Guilment et al. 2018). Accordingly, Guilment et al.
(2018) proposed a feature extraction method in which the feature vectors were digitized
time series of Mysticetes calls and the extracted features were obtained from click waveforms.
With the primary objective of obtaining a low false-positive rate (FPR), LeBien & Ioup (2018)
adopted a stepwise feature selection procedure that iteratively added features that minimize a
loss function. Consequently, a false-positive rate of 0.001% was achieved.
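The general idea of stepwise (forward) feature selection can be sketched as a greedy loop that, at each step, adds the candidate feature giving the lowest loss and stops when no candidate improves it. The loss function, the feature names, and their contribution values below are toy assumptions used only to make the example runnable; they do not reproduce the loss used by LeBien & Ioup (2018).

```python
def forward_stepwise_selection(candidates, loss, max_features=None):
    """Greedily add the feature that most reduces loss(selected_features)."""
    selected = []
    remaining = list(candidates)
    best_loss = loss(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate the loss obtained by adding each remaining feature in turn.
        trial_losses = {f: loss(selected + [f]) for f in remaining}
        best_feature = min(trial_losses, key=trial_losses.get)
        if trial_losses[best_feature] >= best_loss:
            break  # no candidate improves the loss, so stop
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_loss = trial_losses[best_feature]
    return selected, best_loss


# Toy loss: each useful feature lowers the loss, each added feature costs a small penalty.
USEFULNESS = {"duration": 0.30, "peak_frequency": 0.25, "bandwidth": 0.05}


def toy_loss(features):
    gain = sum(USEFULNESS.get(f, 0.0) for f in features)
    return 1.0 - gain + 0.02 * len(features)


chosen, final_loss = forward_stepwise_selection(
    ["duration", "peak_frequency", "bandwidth", "noise_level"], toy_loss)
print(chosen, round(final_loss, 2))
```

In practice the loss would be estimated on held-out data, for instance a cross-validated error rate or a cost that penalizes false positives more heavily.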
In classifying sounds from groupers/fishes, Malfante et al. (2018) proposed an all-purpose
feature extraction approach that is not limited to fish sounds and can be applied to the
classification of other types of datasets. Accordingly, they used 84 general features (instead of domain-specific features), from
the time domain, frequency domain, and cepstral domain and the forward selection method to 
address the issue of feature selection.  
Giret et al. (2011) used the extractor discovery system (EDS) to generate 11,000 features from
10 Mel-frequency cepstral coefficient (MFCC) features. The acoustic features extracted from each
audio signal were used to classify the calls using the C4.5 machine learning algorithm. Kaewtip
et al. (2016), on the other hand, used MFCCs as front-end features for the Hidden
Markov Model Toolkit (HTK).
Sounds produced by birds are characterized by several components depending on the species.
Because the harmonic and percussive components of a bioacoustic sound have class-specific
characteristics, Thakur et al. (2019) combined these two components with the Mel-spectrogram to
produce a three-channel input for their proposed framework.
MFCC, as a sparse representation of the original sound, was the predominant feature extracted
from signals (see Table 3.8). However, Ibrahim et al. (2018) argue that MFCCs do not perform
well under noisy conditions; hence they proposed the use of an optimized version. Accordingly,
they used weighted MFCC (WMFCC) and weighted multiresolution features (WMRAF).
Although WMFCC gave better features at a lower computational cost, the
performance accuracies varied across species, so the optimized features
were concluded to be domain-specific (Ibrahim et al. 2018). In a further study by the
same authors (Ibrahim et al. 2019), sparse autoencoders (SAE) were used to learn features from
the sound spectrum rather than relying on a particular feature extraction method.
Similarly, instead of using traditional MFCC features, Briggs et al. (2012) used mask
descriptors and integrated the selected features into a single feature vector that described each
segment of an audio signal in the spectrogram.
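Since MFCC recurs throughout the reviewed articles, it may help to show the mel scale on which it is built. The sketch below converts between hertz and mel using one common variant of the formula, m = 2595 log10(1 + f/700), and derives filter center frequencies that are equally spaced in mel. Other variants of the mel formula exist, and the function names and the filter-bank parameters are assumptions; the full MFCC pipeline (framing, filter-bank energies, logarithm, and discrete cosine transform) is not shown.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mel (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(f_min_hz, f_max_hz, n_filters):
    """Center frequencies (Hz) of n_filters filters equally spaced on the mel scale."""
    low, high = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]


# Illustrative filter bank covering 0 Hz to 22.05 kHz (the Nyquist limit at 44.1 kHz).
centers = mel_center_frequencies(0.0, 22050.0, n_filters=10)
print([round(c) for c in centers])
```

The centers are densely packed at low frequencies and spread out at high frequencies, which reflects the perceptual motivation behind MFCC features.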
Zhang et al. (2019) used handpicked acoustic features in processing animal vocalizations, and
classification and detection were carried out as distinct processes. In addition to detection and
classification, Oikarinen et al. (2019) included attribution, making up a process that enabled
the network to learn useful features and reduced the likely bias resulting from handpicked
features. In contrast, instead of using handpicked features, Bold et al. (2019) and Pandeya et
al. (2018) used features from a pre-trained CNN.
In general, it was observed that the extracted features were either domain-specific or generic. 
It was also observed that no feature extraction/selection was performed in some cases where
neural networks were used for the classification. According to Vrbancic & Podgorelec (2018),
the advantage of using this approach is that no domain expert knowledge is required for 
classification. 
3.6.4 SOUND CLASSIFICATION ALGORITHMS AND PERFORMANCE 
METRICS 
The next step of sound analysis after feature extraction is to take the extracted features and feed 
them into an appropriate classifier. According to Binder & Paul (2019), an automatic classifier
does not only identify or differentiate one sound from another, but it also reduces false
detections of sound. Techniques for the various classification tasks are shown in Table 3.7. The
predominant classifiers included support vector machine (SVM), neural networks, k-nearest
neighbor (KNN), hidden Markov model (HMM), and k-means.
Table 3.7: Classification techniques used
Ref. no Classifiers 
A1 Euclidean distance 
A2 Kernel-based extreme learning machine (KELM), Sparse-Instance-based AL, least confidence-score-based AL (LCS-AL)
A3 Random Forest & Support Vector Machine (SVM) 
A4 Restricted Boltzmann machine (RBM) & sparse auto-encoder (SAE) 
A5  Convolutional Neural Network (CNN) & Multilayer perceptron (MLP) 
A6  Linear SVM and Radial Basis Function (RBF) SVM 
A7  MIML-SVM, MIML-KNN, MIML-RBF (MIML: multi-instance multi-label) 
A8  Artificial Neural Network (ANN) 
A9  K-nearest neighbors (KNN), Support Vector Machine (SVM) & Sparse 
classifiers 
A10  SVM, Deep Neural Network (DNN), Recurrent Neural Networks - Long Short-Term Memory (RNN-LSTM)
A11  Feed forward deep convolutional neural network  
A12  Random ensemble of stacked autoencoders (RESAE) 
A13  Sparse representation  
A14  Dynamic time warping (DTW) and Hidden Markov models (HMM) 
A15  Aural classifiers 
A16  K-means 
A17  Self-organizing map 
A18  DTW-SR-2pass 
A19  K-means 
A20  K-means 
A21  Decision tree 
A22  Gaussian Mixture Model (GMM) 
A23  Logistic regression (LR) 
A24  Hidden Markov model (HMM) 
A25  Support Vector Machine (SVM) 
A26  Support Vector Machine (SVM), K-nearest neighbors (KNN) 
A27  Deep residual networks (ResNets) 
A28  Support Vector Machine (SVM) 
A29  Support Vector Machine (SVM), Deep Neural Network (DNN) 
A30 Convolutional deep belief network (CDBN) 
A31 Hidden markov model (HMM) 
A32 SVM, KNN, Long short-term memory-fully convolutional network (LSTM-FCN)
A33 Non-temporally aware (NTA) 
A34 Convolutional Neural Network (CNN) 
A35 Multi-view simple disagreement sampling (MV-SDS) 
A36 SVM with linear kernels & pairwise multi-class discrimination sequential 
minimal optimization, logistic regression 
A37 Support Vector Machine (SVM) 
A38 Recurrent Neural Network (RNN) 
A39 TSCNN-DS 
A40 Convolutional Neural Network (CNN) 
A41 CaffeNet pretrained Convolutional Neural Network (CNN) 
A42 Artificial Neural Network (ANN), Recurrent Neural Networks - Long short-term 
memory (RNN-LSTM) 
A43 Self-organizing map-Spike Neural Network (SOM-SNN) 
A44 Pre-trained CNN 
A45 LeNet based Convolutional Neural Network (CNN) 
A46 Convolutional Neural Network (CNN) 
A47 Artificial Neural Network (ANN) 
A48 Deep Neural Network (DNN) 
Support vector machine (SVM) has been identified as a robust technique in both 
classification and regression tasks. It is a supervised machine learning algorithm and it seeks 
to find the hyperplane which optimally separates the labeled data into their various classes 
(Bourouhou et al. 2019; Cvengros et al. 2012a; Noda et al. 2016; Qian et al. 2017a; Yaseen et 
al. 2018). Most of the articles that used SVM were focused on improving the classification 
performance either by modifying existing approaches of SVM based classification or by adding 
new features to it. Modifications to existing approaches included Recursive feature elimination 
(SVM-RFE) and linear SVM (Cvengros et al. 2012a), and SVM with linear kernels (Han et al. 
2016), while added features included cost parameter CSVM (Malfante et al. 2018). Generally,
SVMs have been reported to be cumbersome for multi-class tasks but robust for binary
classification tasks, with good performance on various learning tasks (Zhang et al.
2018).
Neural Networks are algorithms that imitate the operations of a human brain to identify
patterns and trends in data. Although their effectiveness is limited by the unavailability of labeled
data, neural networks have self-organizing and adaptive learning properties with an outstanding
ability to detect trends based on the sample data (Dwivedi et al. 2019). They also have a
distinctive ability to build deep architectures as well as automatically learn feature
representations. Compared to conventional machine learning techniques with shallow
networks, which are made up of one input layer, one output layer and a single hidden layer
in between, neural networks consist of several layers and can be made deeper by increasing
the number of hidden layers. The difference between neural networks in general and deep
learning therefore depends on the depth of the model; deep learning is an application of
neural networks with several layers of nodes (4 or more) between
the input and output layers (Arel et al. 2010).
Recently, deep learning has enabled various applications in action detection, object 
recognition, speech recognition, image classification, and recognition. Findings from this 
systematic review indicate that deep neural networks have also been evident and effective in 
medical diagnosis, acoustic detection, and acoustic classification. With a widespread 
application in various domains, deep learning has been promoted in literature for the following 
reasons as stated by Wason (2018):
1. They can filter and extract information hidden in the presence of noise 
2. The algorithms train through input data to identify hidden patterns and then integrate the 
information obtained into visual analytics displays. 
3. The algorithms can apply discrimination to data to reveal patterns and extract valuable 
information. 
4. It can classify unstructured and structured data using methods like deep belief methods 
(DBM) or convolutional neural networks. 
5. It mimics the human brain through artificial neural networks (ANN) and learns how to 
solve problems in a human-like manner. 
From the reviewed articles, the neural networks used included CNN, ANN, DNN,
RNN, LSTM-RNN, feed-forward deep convolutional neural network (FFD-CNN),
convolutional deep belief network (CDBN), and Long short-term memory-fully convolutional
network (LSTM-FCN). In general, although neural networks often achieve high classification
accuracies, training a neural network requires huge datasets and high computational
power.
Hidden Markov model (HMM) is a generative model and the first segment-based approach
in classification procedures (Wu et al. 2018). In sound classification, it takes a sound segment
and tries to classify it as a whole without any form of framing (Luque et al. 2018). Although
it ensures realistic temporal statistics of the output (Aucouturier et al. 2011), its performance
is limited by its statistical inefficiency in modeling data that lies on a nonlinear manifold in
the feature space (Ibrahim et al. 2019). Additionally, it uses sub-word features that are not
suitable for non-speech sound identification, since non-speech sounds lack the phonetic
structure that speech possesses (Luque et al. 2018). Furthermore, it requires large datasets for
better performance and performs badly when there is a lot of noise in the data (Kaewtip et al.
2016). In the reviewed articles, HMM was generally used for segmentation.
K-nearest neighbor (KNN) is a supervised machine learning algorithm that finds the
class to which an unknown object belongs by majority voting among its k nearest neighbors, i.e. it
predicts the class held by the majority of the nearest neighbors (Noda et al. 2016; Pandeya & Lee, 2018). In
contrast to HMM, KNN is robust to noise and requires little training time, but it
requires large memory space (Dwivedi et al. 2019).
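The majority-voting idea behind KNN can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in any of the reviewed articles; the function name `knn_predict`, the toy feature vectors and the class labels are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # KNN keeps every training sample, which is why it needs a lot of memory.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors (hypothetical values, not data from this study).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
labels = ["wind", "wind", "wind", "waves", "waves", "waves"]
prediction = knn_predict(points, labels, query=(0.15, 0.1), k=3)
```

There is no training step beyond storing the samples, which reflects the low training time and the large memory requirement noted above.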
Predominantly, 94% of the techniques used in the 48 reviewed articles were supervised 
machine learning techniques, while the other 6% made up unsupervised machine learning with 
the k-means clustering technique. Apart from traditional supervised learning models and deep 
learning models, other supervised machine learning techniques were explored to overcome the 
challenges of limited data, overfitting and lack of labeled data. They included the use of pre-
trained models like VGG and CaffeNet for transfer learning, extreme learning machine (ELM), 
and deep metric learning (DML). ELM is a single hidden layer feedforward neural network
that was used to overcome the problems of slow training speed and over-fitting encountered
by neural networks (Ding et al. 2015; Qian et al. 2017). DML was used to overcome the
problem of unlabeled data (Thakur et al. 2019). Additionally, a semi-supervised learning 
technique called active learning was used to minimize the demand for human descriptions on 
sound classification training models (Han et al. 2016).  
Figure 3.5 shows the distribution of modelling techniques used in the 48 reviewed articles; 
neural networks were the most used technique out of the four categories of modeling techniques 
identified.  
[Bar chart. Categories: neural network, machine learning, statistical/time-series models, active learning. Bar values: 18, 14, 4, 2.]
Figure 3.5: Distribution of modelling techniques
Various metrics were used to evaluate the performance of the techniques; 92% of the
researchers were mostly concerned with classification accuracy. Others (8%) used the F1 score,
area under curve (AUC), sensitivity (true positive rate, TPR), specificity (the true negative
rate, i.e. 1 − FPR), unweighted average recall (UAR),
precision, recall and mean error rate. It was observed that the classification techniques used in
the reviewed articles predominantly had good performance accuracies.
3.7 REVIEW SUMMARY 
The primary objective of the systematic review was to identify methodological approaches and 
current algorithms used in the automatic classification of sounds. This review was restricted to 
journal articles from Scopus and ASA databases and was guided by two categories of review
questions which were answered accordingly.  
In the first phase of the review, we identified the frequency of publications between the years 
2010–2019, the distribution of journals, consistent researchers in the area of sound 
classification and the country of origin of the various researchers. It was observed that
research interest in sound classification was minimal until 2015. Also,
researchers who published more than one article were all interested in animal sound 
classification. Additionally, 90% of the researchers were from European and Asian countries. 
In the second phase, we identified the different types of classified sounds and their properties 
in terms of sample rate, audio format, datasets, and the various application domains. It was 
observed that in the domain of bioacoustics, researchers were mostly interested in classifying 
sounds from marine mammals, while the medical domain was concerned with diagnosing 
respiratory diseases using sound. Although different forms of environmental sound were 
classified, none of the articles classified natural disaster sound. This is a research gap, 
considering the alarming increase in natural disaster events yearly.  
Considering that a major limitation of most of the studies was limited datasets or a lack of
annotated data, only a few researchers explored techniques for overcoming such problems in
machine learning. Three articles explored transfer learning (Bold et al. 2019;
Pandeya et al. 2018; Pandeya & Lee, 2018), one used averaging methods to ensemble
six classifiers (Pandeya & Lee, 2018), and two used cross-validation (Binder & Paul, 2019;
Roch et al. 2011). Additionally, Luque et al. (2018) used instance selection as an alternative
to cross-validation.
Other research challenges identified included limited bandwidth (Binder & Paul, 2019; Luque
et al. 2018), the threshold problem (Malfante et al. 2018), and lack of general applicability of
classifiers (Guilment et al. 2018).
Furthermore, the feature extraction methods, classification techniques and the performance 
evaluation techniques used in the reviewed articles were identified. It was observed that,
although a variety of feature extraction and classification techniques were used, no unique
pattern tying these techniques to a particular application domain could be identified.
However, it was observed that MFCCs were predominantly used in feature extraction for their
ability to imitate the hearing properties of the human ear using a nonlinear frequency scale
(Mitilineos et al. 2018; Raza et al. 2019; Turner & Joseph, 2015).
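To make the nonlinear scale behind MFCCs concrete, the sketch below converts frequencies from Hz to the Mel scale using the commonly cited formula m = 2595 log10(1 + f/700). It is offered only as an illustration: the function names are assumptions, and the reviewed articles may have used other variants of the Mel formula.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to Mels (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 500 Hz step covers fewer Mels at high frequencies than at low ones.
low_step = hz_to_mel(1000.0) - hz_to_mel(500.0)
high_step = hz_to_mel(7500.0) - hz_to_mel(7000.0)
```

Because `low_step` is larger than `high_step`, a Mel-spaced filter bank gives finer resolution to low frequencies, which is the sense in which MFCCs imitate human hearing.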
We also identified two categories of sound classification: detection-and-classification,
otherwise known as acoustic event detection, and detection-by-classification,
otherwise known as acoustic event classification. While the former involves detection of the
sound and then its classification, the latter involves sound detection by classifying the audio
segments. In detection-and-classification, no classification decision is made during detection; rather,
segmentation is done when a segment boundary is detected based on a chosen threshold,
followed by localization (Lopatka et al. 2016; Temko & Nadeu, 2009).
Table 3.8: Classification Categories
Category: Detection-and-classification
Bioacoustics: (Malfante et al. 2018), (Thakur et al. 2019), (Briggs et al. 2012), (Briggs et al. 2012), (Robakis et al. 2018), (Ibrahim et al. 2018), (Ya-jie Zhang et al. 2019), (Oikarinen et al. 2019), (Ibrahim et al. 2019), (Guilment et al. 2018), (Ou et al. 2013), (LeBien & Ioup, 2018), (Parada & Cardenal-Lopez, 2014), (Bishop et al. 2019), (Noda et al. 2016)
Environment: (Verma et al. 2019)
Medicine: -
Category: Detection-by-classification
Bioacoustics: (Allen et al. 2017; Aucouturier et al. 2011; Binder & Paul, 2019; Bold et al. 2019; Cvengros et al. 2012b; Gingras & Fitch, 2013; Giret et al. 2011; X. C. Halkias et al. 2013; Kaewtip et al. 2016; Y. Kim et al. 2018; Luque, Romero-Lemos, Carrasco, & Gonzalez-Abril, 2018; Pandeya et al. 2018; Pandeya & Lee, 2018; Qian et al. 2018; Roch et al. 2011; Shamir et al. 2014; Tan et al. 2015)
Environment: (Aziz et al. 2019), (Yan Zhang et al. 2016), (Han et al. 2016), (Su et al. 2019), (Wu et al. 2018), (Salamon & Bello, 2017)
Medicine: (Chen et al. 2019), (Bourouhou et al. 2019), (Yaseen et al. 2018), (Aykanat et al. 2017), (Raza et al. 2019), (Khamparia et al. 2019), (Vrbancic & Podgorelec, 2018), (Oweis et al. 2015), (Fang et al. 2019)
Conversely, in detection-by-classification, the task of detection automatically translates to
classification, as its strategy is based on using classifiers (such as HMM or logistic regression)
with inbuilt segmentation algorithms (Ren et al. 2017; Temko & Nadeu, 2009). Table 3.8
shows the classification category adopted by the researchers according to the application
domains identified earlier. Detection-and-classification was performed in the domains of bioacoustics and
environment, while detection-by-classification cut across the three domains identified in this
review.
Furthermore, it was observed that neural networks were the most used techniques for sound
classification. Mitilineos et al. (2018) posit that this is due to the ability of neural networks to
identify specific patterns exhibited by sound sources in their distribution of energy over
frequency and time.
3.8 CHAPTER SUMMARY 
In this chapter, 48 articles in the area of sound classification that were selected based on 
predefined criteria were reviewed.  
From the reviewed articles, predominant sound application domains, feature extraction and
selection methods, as well as classification techniques were identified. Two broad categories of
sound classification schemes were also identified: acoustic event detection (AED) and acoustic
event classification (AEC). The review also highlighted sound classification trends and
limitations of existing studies.
Although this review provided methodologies and algorithms used in various domains of sound 
classification, we opine that the methodologies and research coverage are not exhaustive. Most 
importantly, we found no study on the detection of extreme events or the automatic sound 
classification of natural disasters. Consequently, this is one research gap amongst others 
mentioned in the discussions that provide us with a good justification for the relevance of this 
study. 
Subsequent chapters will seek to address this research gap by providing methodologies and 
techniques for the classification of an acoustic event such as natural disaster. 
  
Chapter Four  
RESEARCH METHODOLOGY 
4.1 CHAPTER OVERVIEW 
This chapter will discuss the steps adopted in conducting this study. More particularly, it will 
discuss the design science research methodology as a research paradigm deemed appropriate 
for this study. 
4.2 DESIGN SCIENCE RESEARCH METHODOLOGY (DSRM) 
A research work should address specific issues by developing and evaluating artefacts designed
to meet identified scientific or business needs (Carcary, 2011; Hevner et al. 2004; Winter,
2008). These artefacts may include, but are not limited to, models, frameworks, methods, constructs,
instantiations, and social innovations. Hevner et al. (2004) posit that the artefact must
adequately coincide with the real world, should solve a problem, and should
present the steps, findings, and results clearly and concisely.
In this study, the design science research methodology (DSRM) was adopted because it ensures
relevant and rigorous research based on a set of guidelines (see Table 4.1) as proposed by
Hevner et al. (2004).
Also, the DSRM was deemed appropriate based on the aim of this study. This study is aimed
at developing a model for the automatic classification of natural disaster sounds as a means of
providing real-time detection and warning signals to people. Accordingly, the model to be
developed is an artefact that seeks to address the limitations and cover research gaps of existing
natural disaster detection and warning methods. It is expected that the artefact (model) that is
developed at the end of this study will be novel as it is aimed at addressing the challenges in
existing natural disaster detection systems.
  
Table 4.1: Design Science Research (DSR) Guidelines
Guideline: Description
Design as an Artefact: DSR must produce a viable artefact.
Relevance of the Problem: The objective of a DSR is to develop technology-based solutions to important and relevant problems.
Evaluation of the Design: The utility, quality and efficacy of a design artefact must be rigorously demonstrated via well-executed evaluation methods.
Research Contributions: An effective DSR must specify clear and verifiable contributions in the area of the design artefact or design methodologies.
Research Rigor: Design science relies upon the application of rigorous methods in both construction and evaluation of the designed artefact.
Design as a Search Process: The search for an efficient artefact requires the utilization of available means to reach the desired end but at the same time satisfying the rules in the problem domain.
Communication of Research: DSR must be presented effectively to both technology-oriented and management-oriented audience.
4.3 RESEARCH APPROACH FOR THIS STUDY 
To ensure a rigorous and appropriate methodology as described by Hevner et al. (2004), this
study is divided into five main phases: awareness, suggestion, development, evaluation and
conclusion. This five-phase research cycle for a design science research model has been
adopted in several studies including Peffers et al. (2008) and Van der Merwe et al. (2020). The
research process begins with the awareness that a problem exists. Suggestions for solving the
identified problem are made based on existing knowledge. An attempt is then made at developing
an artefact (a solution to the problem), after which the artefact is evaluated and conclusions are drawn.
The ensuing sub-sections discuss how each phase of the process addressed the
objectives of this study.
4.3.1 AWARENESS OF THE PROBLEM 
The awareness of the research problem in this study was triggered by the alarming increase in 
the occurrence of natural disasters as well as the false alarm signals. As mentioned in chapter 
one, this study aims to develop a model (artefact) that can be used for the automatic 
classification of acoustic events. The model will seek to be sufficiently robust to the changing 
ambient noise as well as low frequency sounds produced by natural disasters especially during 
formation. Accordingly, the result of this study will be a purposeful artefact developed to 
address the current existing problem. 
4.3.2 SUGGESTION 
In this study, the suggestion phase involved an analysis of literature related to natural disasters 
and sound classification. Relevant studies were selected and reviewed through a literature 
review and a systematic review of literature. These reviews facilitated the identification of 
research gaps as well as existing methodologies that can be enhanced to serve the goal of this 
study. More particularly, it identified the effectiveness of deep learning techniques in the 
classification of acoustic events. It also identified the inadequacies in the use of images, text,
and numerical data to detect natural disasters and highlighted the potential of using sound.
This phase is reported in chapters two and three and it addresses the first and second objectives 
of this study.  
4.3.3 DEVELOPMENT 
The development phase involves building and training a model that will automatically classify 
an acoustic event. Models are trained and developed in this stage using a convolutional neural 
network (CNN) and a recurrent neural network (RNN). The development phase is reported in 
chapter five and it addresses the third objective of this study. 
i. CONVOLUTIONAL NEURAL NETWORK (CNN) 
Convolutional neural networks are one of the most popular types of neural network. They are inspired by
the primary visual system of the brain; hence they are particularly tailored for image recognition
and classification. Typically, a CNN works with the two-dimensional (2D) convolution operation
(Maccagno et al. 2019). A convolution is a mathematical procedure that defines a rule for how
two functions (the input data and a convolution kernel) are mixed to produce a transformed
feature map (the integral) (Cośkun et al. 2017). A CNN is made up of three main layers; each
layer performs a different set of tasks on the input data, and each has different
optimized parameters. The CNN layers are the convolutional layer, the pooling layer and the
fully connected layer, as shown in Figure 4.1.
 
Figure 4.1: Layers of CNN (Borgne & Bontempi, 2017)
CONVOLUTION LAYER 
The convolutional layer, otherwise called the filter, is the first layer in which features are extracted
from the input data. It is also where most of the user-specified parameters of the network are found.
The input to each layer of a CNN, say 𝑥, is arranged in three dimensions, 𝑝 × 𝑞 × 𝑟, where
𝑝 and 𝑞 are the height and width of the input and 𝑟 is the depth. Each convolution layer
contains several filters or kernels 𝑘 of size 𝑚 × 𝑚 × 𝑛, where 𝑚 is always smaller
than the size of the input data while 𝑛 can be either smaller than or equal to 𝑟.
Furthermore, the filters, which form the base of a connection, convolve with the input, share the
same parameters (in terms of weight 𝑊_𝑘 and bias 𝑏_𝑘) and then generate 𝑘 feature maps (ℎ_𝑘)
of size 𝑝 − 𝑚 + 1. The convolutional layer also calculates a dot product between the weights
and the inputs; thereafter, an activation function 𝑓 is applied to the output of the layer.
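A minimal sketch of a single-channel convolution is given here to illustrate the operation described above. It uses plain Python lists rather than the Keras layers used in this study; the function name `conv2d_valid`, the toy input and the kernel values are assumptions for the example, and ReLU is assumed as the activation function 𝑓. Sliding an 𝑚 × 𝑚 kernel over a 𝑝 × 𝑝 input without padding yields a feature map with 𝑝 − 𝑚 + 1 rows and columns.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Valid 2-D cross-correlation (the 'convolution' used in CNNs), followed by ReLU."""
    p, q = len(image), len(image[0])
    m = len(kernel)  # a square m x m kernel is assumed
    out_h, out_w = p - m + 1, q - m + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            # Dot product between the kernel weights and the input patch.
            for a in range(m):
                for b in range(m):
                    total += kernel[a][b] * image[i + a][j + b]
            row.append(max(0.0, total))  # ReLU plays the role of f
        feature_map.append(row)
    return feature_map

# Toy 4 x 4 input and 2 x 2 kernel (hypothetical values).
image = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, 1.0],
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 2.0],
]
kernel = [
    [1.0, 0.0],
    [0.0, -1.0],
]
feature_map = conv2d_valid(image, kernel)  # 3 x 3 map, since 4 - 2 + 1 = 3
```

The same kernel weights are reused at every position, which is the parameter sharing mentioned in the text.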
POOLING LAYER 
Similar to the convolution layer is the pooling layer; it is used to reduce the dimensionality of
the network by reducing the number of parameters when the input data is large. More particularly,
it decreases the number of parameters in the network, speeds up the training process and also
controls overfitting by downsampling each feature map in the network. Three basic operations
can be performed in the pooling layer: max pooling, which takes the largest value in
a defined filter region; average pooling, which takes the average value; and sum pooling, which
takes the sum of all the values in the defined filter region. Generally, the pooling operations
are performed over a specified contiguous region for all feature maps in the network.
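The three pooling operations can be sketched as below. This is an illustrative plain-Python version under assumed settings (a 2 × 2 window with a stride of 2); the function name `pool2d` and the toy feature map are not taken from this study.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2-D feature map with max, average or sum pooling."""
    reducers = {
        "max": max,
        "average": lambda values: sum(values) / len(values),
        "sum": sum,
    }
    reduce_fn = reducers[mode]
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    # Slide a size x size window over the map, moving `stride` cells each time.
    for i in range(0, rows - size + 1, stride):
        pooled_row = []
        for j in range(0, cols - size + 1, stride):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled

# Toy 4 x 4 feature map (hypothetical values).
fmap = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 1.0],
    [0.0, 1.0, 5.0, 2.0],
    [2.0, 2.0, 3.0, 4.0],
]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="average")
sum_pooled = pool2d(fmap, mode="sum")
```

Each call reduces the 4 × 4 map to 2 × 2, so later layers process a quarter of the values.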
FULLY CONNECTED LAYER 
The fully connected layer (partitioner) is the last layer of the network, placed before the 
classification output of a CNN. It uses previous low-level and mid-level features to generate 
high-level abstraction from the data. It outputs the probability that an input belongs to a certain 
class for a given instance. 
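How a fully connected layer turns features into class probabilities can be sketched as follows. The example assumes a SoftMax output; the function name `dense_softmax`, the weights and the feature values are hypothetical and serve only to show the computation.

```python
import math

def dense_softmax(features, weights, biases):
    """Fully connected layer followed by softmax: returns one probability per class."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, biases)
    ]
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

features = [0.5, 1.5, -0.5]   # flattened features (hypothetical values)
weights = [                   # one row of weights per class
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
biases = [0.0, 0.0, 0.0]
probabilities = dense_softmax(features, weights, biases)
predicted_class = probabilities.index(max(probabilities))
```

The probabilities sum to one, and the class with the highest probability is taken as the prediction.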
Furthermore, these three layers have three sets of features, categorized into action, parameters
and input/output. Table 4.2 summarizes these features with respect to the CNN layers.
  
Table 4.2: Features of the three layers in a CNN
Convolutional layers
Action:
− Filters are applied to extract features.
− Filters are composed of small learned kernels.
− Activation functions are applied on every value of the feature map.
Parameters:
− Number of kernels.
− Size of the kernels.
− Activation functions.
− Stride.
− Padding.
− Regularization type and value.
Input/Output:
− Input: 3D cube, previous set of feature maps.
− Output: 3D cube, one 2D map per filter.
Pooling layer
Action:
− Dimensionality reduction.
− Extraction of maximum, average or sum of a specified region.
− It uses the sliding window approach.
Parameters:
− Strides and window size.
Input/Output:
− Input: 3D cube, previous set of feature maps.
− Output: 3D cube, one 2D map per filter, reduced spatial dimensions.
Fully connected layers
Action:
− Combines information from final feature maps.
− Develops final classification.
Parameters:
− Number of nodes.
− Activation function: uses either ReLU for aggregating or SoftMax for producing a final classification.
Input/Output:
− Input: flattened 3D cube, and previous set of feature maps.
− Output: 3D cube, and one 2D map per filter.
ii. RECURRENT NEURAL NETWORK (RNN) 
A recurrent neural network (RNN) is another popular class of neural network that is
predominant in the fields of natural language processing (NLP) and speech processing.
A major feature that distinguishes an RNN from other neural networks is that the network
contains at least one feedback connection, which enables it to perform temporal
processing as well as learn sequential information (Pouyanfar et al. 2018). It uses the sequential
characteristics of data and its patterns to make predictions.
Also, in a traditional neural network, the inputs and outputs are independent of each other,
whereas a recurrent neural network uses the output from a previous step as input to the
current step.
Table 4.3: Calculating the current state, activation functions and output in RNN
1. Current state: $h_t = f(h_{t-1}, X_t)$, where $h_t$ is the current state, $h_{t-1}$ is the previous state, and $X_t$ is the input state.
2. Activation function: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} X_t)$, where $W_{xh}$ is the weight at the input neuron and $W_{hh}$ is the weight at the recurrent neuron.
3. Output: $y_t = W_{hy} h_t$, where $y_t$ is the output and $W_{hy}$ is the weight at the output layer.
TRAINING A RECURRENT NEURAL NETWORK 
The recurrent neural network consists of the input layer (𝑋0, 𝑋1, 𝑋2, 𝑋3, … … 𝑋𝑡), the hidden 
layers (ℎ0, ℎ1, ℎ2, ℎ3, … … ℎ𝑡) and the output layers (𝑦0, 𝑦1, 𝑦2, 𝑦3, … … 𝑦𝑡). The current state, 
activation function and output are calculated as shown in Table 4.3. 
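The state, activation and output calculations can be sketched as a forward pass over a short sequence. For readability the example assumes scalar inputs, states and weights (a real network uses vectors and weight matrices), and the function name `rnn_forward` and the weight values are hypothetical.

```python
import math

def rnn_forward(inputs, w_xh, w_hh, w_hy, h0=0.0):
    """Run a scalar RNN over `inputs`; returns (hidden states, outputs)."""
    h_prev = h0
    hidden_states, outputs = [], []
    for x_t in inputs:
        h_t = math.tanh(w_hh * h_prev + w_xh * x_t)  # current state via tanh
        y_t = w_hy * h_t                             # output at this time step
        hidden_states.append(h_t)
        outputs.append(y_t)
        h_prev = h_t  # the current state becomes the previous state
    return hidden_states, outputs

states, outputs = rnn_forward([1.0, 0.0, -1.0], w_xh=0.5, w_hh=0.8, w_hy=2.0)
```

The second input is zero, yet the second state is non-zero because it still depends on the first state through the recurrent weight; this is the feedback connection that lets the network use sequential information.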
The steps for training a Recurrent Neural Network are as follows: 
i. In the input layer, load or send the initial inputs with equal weights and activation
functions.
ii. Calculate the current state using the current input and the previous state output.
iii. The current state ℎ𝑡 becomes ℎ𝑡−1 for the next time step.
iv. Repeat this process for all the time steps required by the problem.
v. Calculate the final output from the current state of the final step together with the
previous steps.
vi. Generate the error by calculating the difference between the actual output and the
output generated by the RNN model.
vii. The process ends with backpropagation, in which the error is
backpropagated to update the weights.
4.3.4 EVALUATION 
The performance of the developed models is evaluated in this phase. In this study, cross-
validation and classification metrics are used to validate the performance of the models.
Cross-validation is a re-sampling method that is used to evaluate and validate the performance
of a classification algorithm. It reduces possible bias that might result from the training/testing
split on a specific dataset, and also increases model reliability as it checks that a model
is not overfitting (Jacoby, 2014; Raza et al. 2019; Sayad et al. 2019; Wang & Peng, 2018). The
cross-validation process involves a cross-over in successive iterations between the training and
testing sets in order to validate the model. There are different cross-validation methods,
including K-Fold cross-validation, Stratified K-Fold cross-validation, leave-one-out, shuffle split
and adversarial validation. K-Fold cross-validation is used in this study.
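The way K-Fold cross-validation rotates the training and testing sets can be sketched as below. This is a plain-Python illustration rather than the scikit-learn routine; the function name `k_fold_indices`, the choice of K = 5 and the random seed are assumptions, while 244 is the number of recordings in the dataset used in this study.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for K-Fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=244, k=5))
```

Every recording is used for testing exactly once and for training in the other K − 1 iterations, which is what reduces the bias of a single training/testing split.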
The classification metrics used include accuracy, precision, recall and AUC (area under curve) 
score.  
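As an illustration of how three of these metrics are computed, the sketch below treats one class as the positive class and counts true positives, false positives and false negatives. The function name `binary_metrics` and the toy labels are assumptions for the example and are not results from this study; AUC is omitted because it requires predicted scores rather than labels.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical true and predicted labels for six recordings.
y_true = ["earthquake", "volcano", "earthquake", "waves", "earthquake", "waves"]
y_pred = ["earthquake", "earthquake", "earthquake", "waves", "volcano", "waves"]
accuracy, precision, recall = binary_metrics(y_true, y_pred, positive="earthquake")
```

With imbalanced classes, precision and recall per class can reveal weaknesses that overall accuracy hides.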
The evaluation of the models is reported in chapter six and it addresses the fourth objective of
this study.
4.3.5 CONCLUSION 
This is the final stage of this research cycle and it is expected to serve as a knowledge base for 
future research. In this stage, summaries, limitations of the study, and discussions based on the
findings are provided. The concluding phase is reported in chapter seven.
4.4 CHAPTER SUMMARY 
This chapter discussed the design science research methodology as an appropriate paradigm 
for this study. It also discussed the five-phase research cycle, and how each phase of the cycle 
is related to the research objectives stated in chapter one. 
  
Chapter Five  
USING DEEP LEARNING FOR ACOUSTIC EVENT CLASSIFICATION 
5.1 CHAPTER OVERVIEW 
From the systematic review in chapter three, two broad categories of sound classification
schemes were identified: acoustic event detection and acoustic event classification. In this
chapter, the acoustic event classification (AEC) approach is explored, and the methodologies
and techniques adopted in conducting this study are discussed. It is important to note that
most of the methods mentioned and adopted in the literature for the classification of acoustic
events are borrowed from speech recognition systems.
The framework for this chapter is summarized in Figure 5.1. It begins with extracting the
sounds of interest, followed by pre-processing the sound, extracting relevant features, building
the classification model and, finally, validating the model.
 
Figure 5.1: Sound classification architecture
5.2 SOFTWARE USED FOR THE EXPERIMENT 
In this study, Anaconda Python and various libraries were used for feature extraction and to
build and train the neural networks. Apart from general Python libraries for data processing
and analysis such as NumPy, Matplotlib and Scikit-Learn, the specific libraries used for the
experiment were:
i. Keras, used on top of TensorFlow GPU to build the neural networks.
ii. LibROSA, an audio analysis library, used to read in the data and to resample the
audio files.
iii. Python_speech_features, used to compute speech features such as Mel frequency
cepstral coefficients (MFCC) and filter banks.
iv. Tqdm, used to display progress bars for nested loops.
5.3 NATURAL DISASTER SOUND DATASET 
Data used for this study were extracted from the Freesound database
(freesoundeffects.com/free-sounds/ambience-10005/). All the sound recordings are in WAV
format. The sound format is relevant because the feature extraction used in this study (MFCC)
supports the WAV sound format (Sasmaz & Tek, 2018).
The dataset is made up of five classes of unevenly distributed disaster sounds (class
imbalance), namely earthquake, windstorm, waves, forest fire and volcano. The five classes
comprise a total of 244 sound recordings with a total duration of 1560.52694 seconds. The pie
chart in Figure 5.2 shows the distribution of the disaster sound dataset, where each class is
named according to the type of disaster.
[Pie chart: Earthquake 26.1%, Waves 24.1%, Forestfire 18.9%, Windstorm 16.7%, Volcano 14.2%]
Figure 5.2: Class distribution of disaster sound dataset
5.4 SOUND PREPROCESSING AND FEATURE EXTRACTION 
Sound/acoustic signals can be visualized in two forms: as a wave plot (time-domain
representation) or as a spectrogram (frequency-domain representation). A wave plot is an
amplitude versus time plot that shows how the loudness of a sound wave changes over time (see
Figure 5.3). However, due to the varying amplitudes, classifiers tend to misjudge the
magnitudes of sound intensity during learning (Kim et al. 2018b; Oikarinen et al. 2019;
Pramono et al. 2017). Findings from the systematic review in chapter three showed that
spectrograms and spectral features such as Mel frequency cepstral coefficients (MFCC) are
widely used in sound classification and enable classifiers to learn discriminative features
(Guilment et al. 2018; Halkias et al. 2013; Salamon & Bello, 2017). Spectral features produce
more accurate results in noisy conditions. Spectrograms, on the other hand, reduce the number
of trainable parameters in contrast to direct sound classification (Khamparia et al. 2019).
Figure 5.3: Time series representation of five random samples belonging to the five
different classes of the dataset
Since this study aims to classify sounds amidst noisy conditions, spectral features (which are
usually obtained by converting the time-based signal into the frequency domain using Fourier
transforms) are adopted instead of spectrograms. Figures 5.4 and 5.5 show the visual
representation of a sound from each class of the dataset using MFCCs and filter bank
coefficients respectively.
 
Figure 5.4: MFCC representation of five random samples belonging to the five different
classes of the dataset

Figure 5.5: Filter bank coefficient representation of five random samples belonging to the
five different classes of the dataset
5.4.1 DE-NOISING THE SIGNAL 
It was observed that the audio files contained many dead spaces, that is, stretches of
irrelevant data. To get rid of these dead spaces, an envelope of each signal was calculated.
The envelope of a signal defines the boundary (in most cases the upper boundary) within which
the signal is contained when viewed in the time domain.
The envelope was computed from the signal, its sampling (collection) rate and a specified
threshold value of 0.005. To avoid discarding relevant data, a rolling window of a tenth of a
second (0.1 seconds) was passed over the data, and the mean of the signal values in each window
was used to identify portions of the signal that are dead or fading out. A mask was then
built over the signal that retains only the positions whose windowed mean is greater than the
specified threshold. Consequently, the dataset now consists of signals with relevant data;
this can be observed in the slight changes in the class distribution (see Figure 5.6).
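The envelope-based masking described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `envelope_mask`, the use of absolute signal values, the centring of the window and the handling of the signal edges are all assumptions; only the 0.005 threshold and the 0.1-second window come from the text.

```python
def envelope_mask(signal, rate, threshold=0.005, window_seconds=0.1):
    """Return a list of booleans marking the samples to keep.

    Assumptions (not stated in the text): the envelope is the rolling mean of
    the absolute signal values, the window is roughly centred on each sample,
    and windows are truncated at the edges of the signal.
    """
    window = max(1, round(rate * window_seconds))  # window length in samples
    magnitudes = [abs(value) for value in signal]

    # Prefix sums allow each windowed mean to be computed in constant time.
    prefix = [0.0]
    for magnitude in magnitudes:
        prefix.append(prefix[-1] + magnitude)

    half = window // 2
    n = len(magnitudes)
    mask = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        mask.append(mean > threshold)  # keep only positions above the threshold
    return mask


# Hypothetical signal: silence, a burst of sound, then silence again.
rate = 100
signal = [0.0] * 50 + [0.5] * 50 + [0.0] * 50
mask = envelope_mask(signal, rate)
cleaned = [value for value, keep in zip(signal, mask) if keep]
```

Because the mean is taken over a window rather than per sample, a few samples on either side of the burst are retained, which is the behaviour the text describes as avoiding the loss of relevant data.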
[Pie chart: Earthquake 26.0%, Waves 24.1%, Forestfire 18.9%, Windstorm 16.7%, Volcano 14.2%]
Figure 5.6: Pie-chart showing the class distribution of denoised disaster sound dataset
5.4.2 ACOUSTIC DOWN-SAMPLING 
Down-sampling is a signal reduction technique that reduces the sound features by extracting
more discriminative features, which are in turn used for modelling (Raza et al. 2019).
Considering that natural disasters generate low-frequency noise, particularly at the formation
stage, and that this study is focused on early detection of the signals generated by these
disasters, it was imperative to remove any form of redundancy from the data. Hence, with a
window size of 25 ms, the audio signals were downsampled from 44100 Hz to 16000 Hz. A clean
directory was then created to store the cleaned-up audio files; data from this directory are
used for the classification.
5.4.3 FILTER BANK-BASED FEATURE EXTRACTION METHOD 
A fundamental aspect of the design of an acoustic event classification system is the selection
and extraction of appropriate signal features that enable efficient differentiation between
different types of sound signals. The selection of appropriate features is essential because
recorded sounds are generally non-stationary signals with super-imposed background noise that
originates from natural ambient noise (Mitilineos et al. 2018). Commonly used sound features
are related either to the time-domain representation (time-frequency features), to the
frequency-domain representation (spectral features) or to statistical features.
Since seismic activities have a wide range of spectral content (Simmonds & MacLennan, 2005),
feature extraction in this study is performed on spectral features of the audio recordings
using the filter bank-based Mel Frequency Cepstral Coefficient approach. Although there are
various filter-bank feature extraction methods used in the area of speech/sound feature
extraction, Mel Frequency Cepstral Coefficients (MFCC) have been predominantly used in
audio-specific feature extraction. This is mainly because MFCCs are robust to noise and yield
high performance in speech signal processing (Aykanat et al. 2017; Aziz et al. 2019; Bishop et
al. 2019; Chen et al. 2019; Turner & Joseph, 2015; Yaseen et al. 2018). They are also commonly
used for their ability to leverage the robustness of CNN-based classifiers (Verma et al. 2019).
Accordingly, the MFCC method of feature extraction is adopted in this study.
Due to the downsampling performed earlier, the number of MFCC features was reduced from 26 to
13, while the FFT size was reduced from 1103 to 512 (Aziz et al. 2019; Chu et al. 2009).
Accordingly, the 13 MFCCs are calculated using a 512-point fast Fourier transform (FFT) with a
0.1 s window frame length. More specifically, the FFT is applied to the waveform of each window
frame to obtain its spectrum. The spectrum is then mapped onto the Mel-frequency scale by
passing it through a Mel-filter bank made up of triangular filters. The logarithms of the
filter outputs form the log Mel-filter bank spectrum of each frame, and the MFCCs are finally
obtained by applying a Discrete Cosine Transform (DCT) to the log filter bank outputs
(Yaseen et al. 2018).
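The reported FFT sizes can be related to the 25 ms window mentioned in section 5.4.2 with a short calculation. The sketch below is one reading that happens to reproduce the figures 1103 and 512; it is an assumption, not a description of the study's code. The helper names `frame_length_samples` and `next_power_of_two` are hypothetical, as is the idea that the 512-point FFT was chosen as the next power of two above the frame length.

```python
def frame_length_samples(sample_rate_hz, window_ms=25):
    """Number of samples in one analysis window, rounded up.

    Integer arithmetic is used so that the result does not depend on
    floating-point rounding.
    """
    return -((-sample_rate_hz * window_ms) // 1000)


def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    power = 1
    while power < n:
        power *= 2
    return power


# 44100 Hz: a 25 ms window spans 1102.5 samples, which rounds up to 1103.
frame_44k = frame_length_samples(44100)

# 16000 Hz: a 25 ms window spans 400 samples; the next power of two is 512.
frame_16k = frame_length_samples(16000)
nfft_16k = next_power_of_two(frame_16k)
```

Under this reading, the FFT size at 44100 Hz equals the frame length itself, while at 16000 Hz the 400-sample frame is zero-padded to 512 points.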
5.5 CLASSIFICATION TECHNIQUES 
Two deep learning algorithms, Convolutional Neural Networks (CNN) and Recurrent Neural
Networks (RNN), were adopted in this study for the classification tasks. Before developing
the classification models, it is important that the class imbalance situation is considered.
This is because imbalance may result in low performance on some classes and also make the
process of building a classification model difficult (Imran et al. 2017; Ya-jie Zhang et al.
2019). Also, this study is particularly focused on detection-by-classification. Sound
detection-by-classification is predominantly concerned with the choice of the window length,
which can be any arbitrary value from half a second to several minutes depending on the
application domain (Temko & Nadeu, 2009). Accordingly, a window length of a tenth of a second
(0.1 seconds) was chosen. Random sampling along the length of the audio files was then
performed to extract chunks of 0.1 seconds. To determine the total number of audio samples
(n_samples) generated after the extraction, the total length in seconds of all the data was
divided by 0.1 seconds and the result was multiplied by 2.
That is, n_samples = 2x, where x is defined in the Python code as
x = int(sounds['length'].sum() / 0.1).
Consequently, this process increased the number of samples from 244 to 31194; a process
generally termed data augmentation. Although the 0.1-second time frame may be described as
too short, it was chosen to ensure that the model can quickly discern the different classes in
real time. Finally, the dataset was split into two sets: 80% for training and 20% for testing.
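The sample-count formula and the random chunk extraction can be sketched as below. This is an illustration under stated assumptions: the total denoised duration is not reported in the text, so `ASSUMED_TOTAL_S` is a hypothetical value chosen only because it is consistent with the reported 31194 samples; `count_samples` and `random_chunk` are hypothetical helper names; and rounding the test-set size up is an assumption about how the 80/20 split was made.

```python
import math
import random


def count_samples(total_length_s, chunk_s=0.1, factor=2):
    """n_samples = factor * int(total_length / chunk_length), as in the text."""
    return factor * int(total_length_s / chunk_s)


def random_chunk(signal, rate, chunk_s=0.1, rng=random):
    """Cut one chunk of chunk_s seconds at a random position in the signal."""
    step = round(rate * chunk_s)  # chunk length in samples
    if len(signal) < step:
        raise ValueError("signal is shorter than one chunk")
    start = rng.randint(0, len(signal) - step)  # inclusive bounds
    return signal[start:start + step]


# Hypothetical total duration (in seconds) of the denoised dataset.
ASSUMED_TOTAL_S = 1559.75

n_samples = count_samples(ASSUMED_TOTAL_S)   # 2 * 15597
n_test = math.ceil(0.2 * n_samples)          # assumed: test size rounded up
n_train = n_samples - n_test

# One second of dummy audio at 16000 Hz yields chunks of 1600 samples.
chunk = random_chunk(list(range(16000)), rate=16000)
```

With these assumptions the sketch gives 31194 samples and a test set of 6239, matching the figures reported in this chapter and in chapter six.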
5.5.1 CONVOLUTIONAL NEURAL NETWORK 
Convolutional Neural Network (CNN) is one of the most competitive neural networks applied 
in computer vision for image classification and recognition. In this study, the CNN was adopted 
specifically for the following reasons: CNNs can capture patterns across time and frequency 
for given input spectrograms (Maccagno et al. 2019). Also, it can make distinctions even when 
sound is masked in time and frequency by other noise (Salamon & Bello, 2017). 
The CNN model used in this experiment consists of 4 fully connected layers. We used 16 filters
built with a 3×3 convolution; all the layers have a ReLU activation function with a 1×1 stride
(because of the small input space) and 'same' padding, a 2×2 kernel for the max-pooling, and a
dropout layer to reduce overfitting on the training data. The dimensions of the convolutional
model used in the study are shown in Figure 5.7.
 
Figure 5.7: CNN model dimensions
5.5.2 RECURRENT NEURAL NETWORK 
A recurrent neural network (RNN) is another type of deep neural network that learns the
significant features from an incoming data sequence, stores them in memory cells and then
predicts the next steps based on the stored features (Verma et al. 2019). The RNN is used in
this study because it performs well on time series data by producing a 1D array consisting of
frequency values (Raza et al. 2019; Verma et al. 2019). In other words, RNNs are used to
model features that change over time. Generally, RNNs are efficient in tasks that involve
sequential inputs such as speech and language, yet they are limited by their inability to
store the learned features for a long time (Lecun et al. 2015). Accordingly, the use of LSTM
to mitigate this challenge has been widely adopted (Verma et al. 2019).
This study adopts the RNN model architecture of Raza et al. (2019), presented in Figure 5.8.
 
Figure 5.8: RNN-LSTM Architecture (Raza et al. 2019)
The RNN model in this study is LSTM-based with a data shape of (n, time, feat), one recurrent
layer, a dropout of 0.5, a time-distributed fully connected layer with 64 neurons and a ReLU
activation, a flatten layer, and a SoftMax activation. Since sequences are returned, the time
dimension is carried down from layer to layer. Hence, more parameters can be created while
enabling deeper modelling. The dimensions of the RNN-LSTM model used in this study are shown
in Figure 5.9.
 
Figure 5.9: RNN-LSTM model dimensions
5.6 CHAPTER SUMMARY 
This chapter provided the steps involved in the classification of sound. It is worth noting
that all the methodologies described in this chapter were fully automated with little or no
human intervention.
After sound extraction from the Freesound database, the methodology was made up of three main
steps: pre-processing, feature extraction and classification. Data pre-processing entailed
denoising and downsampling the audio recordings as well as augmenting the data size. The
filter bank-based Mel Frequency Cepstral Coefficient approach was used for feature extraction,
after which the dataset was split into train and test sets. Two classification models, CNN and
RNN-LSTM, were then built and applied to the pre-processed data to enable predictions.
The next chapter evaluates the performance of the two models; numerical results are reported
in terms of accuracy, precision, recall and area under the curve (AUC) score.
  
Chapter Six   
EVALUATION OF DEEP LEARNING TECHNIQUES 
6.1 CHAPTER OVERVIEW 
This chapter presents a comparison of the results obtained from the experiment described in
chapter five. It discusses the model validation techniques and the metrics used in the
classification of an acoustic event. Results are displayed as bar charts, tables and confusion
matrices (5×5 contingency tables).
6.2 MODEL VALIDATION 
Model validation is performed after the model has been trained; it aims to find the model with
the best performance. More specifically, model validation is the process whereby a trained
model is evaluated with a separate portion of the same dataset, commonly referred to as the
testing data (Gingras & Fitch, 2013; Lebien & Ioup, 2018; Sayad et al. 2019). Amongst the
different validation techniques, this study uses cross-validation (Krishna et al. 2018;
Pandeya & Lee, 2018; Su et al. 2019) and classification metrics (Luque, Romero-Lemos,
Carrasco, & Gonzalez-Abril, 2018; Sayad et al. 2019).
6.2.1 CROSS-VALIDATION 
In K-fold cross-validation, the entire training dataset is divided into K subsets such that, in
each iteration, the model is trained on all the subsets except one, which is reserved and used
for testing. Each successive iteration outputs an accuracy, and the overall accuracy is
calculated as the average of the accuracies returned from the folds. Previous studies indicate
that the value of K is predominantly either 5 (5-fold cross-validation) or 10 (10-fold
cross-validation). However, 10-fold cross-validation has been adopted by several researchers
for its ability to produce better performance of model hyper-parameters (Chen et al. 2017;
Davis & Suresh, 2019; Han et al. 2016; Pandeya & Lee, 2018; Thakur et al. 2019). Thus, the
10-fold cross-validation technique is adopted in this study to evaluate the performance of the
models. Table 6.1 shows an illustration of the 10-fold cross-validation process.
Table 6.1: 10-Fold cross-validation
 
Maintaining the initial 80%/20% training/test split, the training dataset (80%) was split into
10 folds such that, in each iteration, 9 of the 10 folds of sound recordings are selected for
training the model, and the trained model is then tested on the remaining fold (holdout set).
This process is repeated 10 times with a different holdout set in each iteration.
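The fold rotation described above can be sketched in a few lines. This is a minimal illustration written with the standard library only; it is not the code used in the study, and the function names `k_fold_indices` and `mean_accuracy`, the shuffling seed and the way indices are dealt into folds are all assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, holdout_indices) for each of the k iterations."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        holdout = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, holdout


def mean_accuracy(fold_accuracies):
    """Overall accuracy is the average of the per-fold accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)


# 100 dummy items: every iteration trains on 90 and holds out 10.
splits = list(k_fold_indices(100, k=10))
```

Each item appears in exactly one holdout set, so every recording in the training portion is used for validation once and for training nine times.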
Figure 6.1 shows the classification accuracy and average accuracy obtained from the 10-fold
cross-validation performed on the training set. As shown in Figure 6.1, the minimum
classification accuracy obtained on the training set was 99.16% and 97.05% for CNN and
RNN-LSTM respectively. The highest accuracy of 100% was achieved at the 9th fold for both
classifiers. CNN and RNN-LSTM had an average accuracy of 99.85% and 99.23% respectively
on the training set.
[Bar chart: classification accuracy (%) of CNN and RNN for Fold 1 to Fold 10 and the average]
Figure 6.1: Classification accuracy and average accuracy of the 10 folds
Furthermore, the performance of the 10-fold cross-validation model was validated on the 
remaining 20% (unseen data) of the dataset. Figure 6.2 shows a bar chart of the classification 
accuracy obtained from the test set.  
[Bar chart: test-set accuracy; CNN 99.94%, RNN-LSTM 99.82%]
Figure 6.2: Accuracy obtained from 10-fold cross-validation for CNN and RNN-LSTM
Comparing the results in Figure 6.1 (train validation) with the test validation in Figure 6.2,
it can be observed that the models perform better on unseen data, as they achieved higher
accuracies. CNN performed better on the test set by 0.09 percentage points and RNN-LSTM by
0.59 percentage points.
6.2.2 CLASSIFICATION METRICS 
Classification metrics are a set of metrics generally used to evaluate the performance of a model 
using the test datasets. Twenty percent (20%) of the natural disaster sound dataset was assigned 
for testing the performance of the models. The metrics used in this study for the evaluation of 
the CNN and RNN-LSTM models include Confusion Matrix, Accuracy, Precision, Recall, and 
area under curve (AUC). 
i. CONFUSION MATRIX 
A confusion matrix is a table that provides a detailed breakdown of the correct (true
positives) and incorrect (errors) classifications for each class in a dataset. It allows the
performance of a model to be visualized by tabulating the values of the actual and predicted
classes as rows and columns respectively. Four terms are commonly associated with a confusion
matrix:
i. True positives (TP) are obtained when both the actual class and the predicted class are
true.
ii. True negatives (TN) are obtained when both the actual class and the predicted class are
false. The total number of true negatives for a certain class is the sum of all the columns
and rows excluding that class's column and row.
iii. False positives (FP) are obtained when the actual class is false and the predicted class
is true. The total number of false positives for a class is the sum of the values in the
corresponding column excluding the true positives (TP).
iv. False negatives (FN) are obtained when the actual class is true and the predicted class is
false. The total number of false negatives for a class is the sum of the values in the
corresponding row excluding the true positives (TP).
Furthermore, various performance measures or metrics can be calculated based on values from 
the confusion matrix. The performance of a model can be obtained using the following 
measures: 
i. Accuracy is the proportion of correctly predicted instances out of the total instances. It
is calculated as $\frac{TP + TN}{TP + FP + FN + TN}$.
ii. Error rate (ERR) is calculated as $\frac{FP + FN}{TP + TN + FN + FP}$.
iii. True Positive Rate (TPR), also known as sensitivity or recall, is a measure of the ability
of a prediction model to correctly select positive instances. TPR is calculated as
$\frac{TP}{TP + FN}$.
iv. True Negative Rate (TNR), also known as specificity, is a measure of the negative instances
correctly predicted. TNR is calculated as $\frac{TN}{FP + TN}$.
v. False Positive Rate (FPR) is the proportion of negative samples that are predicted as
positive. FPR is calculated as $1 - \text{specificity}$.
vi. False Negative Rate (FNR) is the proportion of positive samples that are predicted as
negative. FNR is calculated as $1 - \text{sensitivity}$.
vii. Precision, also known as positive predictive value (PPV), is the proportion of predicted
positive instances that are actually positive. PPV is calculated as $\frac{TP}{TP + FP}$.
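The definitions above can be checked with a small worked example. The sketch below assumes that rows hold the actual classes and columns hold the predicted classes, which is the orientation implied by the false positive (column) and false negative (row) definitions. The 3×3 matrix is hypothetical and the function names `per_class_counts` and `rates` are illustrative only.

```python
def per_class_counts(matrix, cls):
    """Return (TP, FP, FN, TN) for one class of a square confusion matrix.

    Assumes matrix[actual][predicted].
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp                  # rest of the class's row
    fp = sum(row[cls] for row in matrix) - tp   # rest of the class's column
    tn = total - tp - fn - fp                   # everything else
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Compute the measures listed above from the four counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": tp / (tp + fn),        # TPR / sensitivity
        "specificity": tn / (fp + tn),   # TNR
        "fpr": fp / (fp + tn),           # equals 1 - specificity
        "fnr": fn / (tp + fn),           # equals 1 - recall
        "precision": tp / (tp + fp),     # PPV
    }


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
example = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
counts = per_class_counts(example, cls=0)
measures = rates(*counts)
```

For class 0 of this hypothetical matrix the counts are TP = 8, FP = 2, FN = 2 and TN = 18, giving a recall and precision of 0.8 and a specificity of 0.9. A production version would also need to guard against empty rows or columns, which would cause a division by zero here.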
In order to compare the actual classes with the predicted classes, a confusion matrix was
generated for each classifier used in this study. Tables 6.2 and 6.3 show the confusion
matrices for CNN and RNN-LSTM. Recall from the pie chart in Figure 5.6 that the class
distribution was imbalanced; hence the varying numbers of test samples in the different
classes of the confusion matrices.
Table 6.2: Confusion matrix showing CNN predictions
             Earthquake  Forestfire  Volcano  Waves  Windstorm
Earthquake         1648           0        0      0          1
Forestfire            0        1219        0      0          0
Volcano               0           0      852      0          0
Waves                 0           1        0   1473          0
Windstorm             0           0        0      1       1044
The confusion matrix in Table 6.2 shows the predictions made on the test dataset of natural
disaster sound using CNN. The test dataset is made up of a total of 6239 sound recordings, out
of which 6236 instances were correctly predicted. Confusions were observed in the following
instances: earthquake as windstorm (1), waves as forestfire (1), and windstorm as waves (1).
Forestfire and volcano were correctly predicted in all instances.
Similarly, Table 6.3 shows the confusion matrix for predictions made using the RNN-LSTM
model. Out of the 6239 sound recordings, 8 errors were recorded across the five classes.
Earthquake and waves were correctly predicted in all instances. Regarding forest fire, 1171
recordings were correctly predicted, while forest fire was confused for volcano and waves in 5
and 1 instances respectively. Volcano and windstorm were each confused for earthquake in 1
instance.
Table 6.3: Confusion matrix showing RNN-LSTM predictions
             Earthquake  Forestfire  Volcano  Waves  Windstorm
Earthquake         1637           0        0      0          0
Forestfire            0        1171        5      1          0
Volcano               1           0      893      0          0
Waves                 0           0        0   1462          0
Windstorm             1           0        0      0       1068
ii. ACCURACY, PRECISION AND RECALL 
Based on the confusion matrix, classification metrics such as accuracy, precision, and recall 
for both CNN and RNN-LSTM were computed. It was observed that for each of the classifiers, 
accuracy, precision and recall produced the same results (see Figure 6.3). As shown in Figure 
6.3, although CNN outperformed RNN-LSTM, both classifiers had good accuracy results 
(99.95% and 99.87%).  
[Bar chart: accuracy, precision and recall; CNN 99.95% for all three, RNN-LSTM 99.87% for all
three]
Figure 6.3: Classification accuracy for CNN and RNN-LSTM
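One explanation consistent with accuracy, precision and recall being identical is micro-averaging: in a single-label multi-class problem every misclassified sample counts as exactly one false positive and one false negative, so the three micro-averaged measures coincide. The sketch below recomputes them from Tables 6.2 and 6.3. That the study used micro-averaging is an assumption; the text does not state the averaging method, and the function name `micro_metrics` is illustrative.

```python
# Confusion matrices copied from Tables 6.2 and 6.3.
CNN = [
    [1648, 0, 0, 0, 1],
    [0, 1219, 0, 0, 0],
    [0, 0, 852, 0, 0],
    [0, 1, 0, 1473, 0],
    [0, 0, 0, 1, 1044],
]
RNN_LSTM = [
    [1637, 0, 0, 0, 0],
    [0, 1171, 5, 1, 0],
    [1, 0, 893, 0, 0],
    [0, 0, 0, 1462, 0],
    [1, 0, 0, 0, 1068],
]


def micro_metrics(matrix):
    """Return (accuracy, micro-precision, micro-recall) for a square matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = sum(matrix[i][i] for i in range(n))
    # Off-diagonal entries summed by column (FP) and by row (FN).
    fp = sum(sum(matrix[r][c] for r in range(n)) - matrix[c][c] for c in range(n))
    fn = sum(sum(matrix[r]) - matrix[r][r] for r in range(n))
    return tp / total, tp / (tp + fp), tp / (tp + fn)


cnn_scores = micro_metrics(CNN)
rnn_scores = micro_metrics(RNN_LSTM)
cnn_report = [f"{score:.2%}" for score in cnn_scores]
rnn_report = [f"{score:.2%}" for score in rnn_scores]
```

The CNN matrix has 6236 correct predictions out of 6239 and the RNN-LSTM matrix has 6231 out of 6239, which gives 99.95% and 99.87% respectively for all three measures.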
iii. AUC-ROC (AREA UNDER CURVE – RECEIVER OPERATING
CHARACTERISTICS) CURVE
Due to the dominating effect of the majority class in an imbalanced dataset, classification
metrics such as accuracy, precision and recall are not sufficient for evaluating the
performance of a model (Luque, Romero-Lemos, Carrasco, & Gonzalez-Abril, 2018; Weng & Poon,
2008; Yang et al. 2015). Thus, the AUC-ROC, a visualization tool for comparing classification
models, has been used to mitigate the dominating effect of dataset imbalance (Ling et al. 2003;
Weng & Poon, 2008; Yang et al. 2015). As shown in Figures 5.2 and 5.6, the dataset used in
this study is imbalanced; therefore, this study also adopts the AUC-ROC as a classification
metric. The ROC curves in Figures 6.4 and 6.5 plot the false positive rate (FPR) on the x-axis
and the true positive rate (TPR) on the y-axis for CNN and RNN-LSTM respectively. The curve
shows the trade-off between the proportion of correctly classified positive instances and the
proportion of negative instances incorrectly classified as positive, i.e. how well the model is
able to distinguish between classes. It can be observed from the plots in Figures 6.4 and 6.5
that the AUC for both models is approximately one, implying that both have good measures of
separability of the disaster sounds (Weng & Poon, 2008; Yang et al. 2015). Specifically, both
models had an AUC-ROC score of 0.999.
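For a single class treated one-vs-rest, the area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative instance. The sketch below computes the AUC directly from that definition. The labels and scores are hypothetical, the function name `roc_auc` is illustrative, and how the study aggregated the AUC across the five classes is not stated in the text.

```python
def roc_auc(labels, scores):
    """AUC for binary labels (1 = positive, 0 = negative).

    Counts the fraction of positive/negative pairs in which the positive
    instance is scored higher; ties count as half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical one-vs-rest scores for a single class.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1, 0.4, 0.6]
auc = roc_auc(labels, scores)

# A classifier that ranks every positive above every negative scores 1.0.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

In the hypothetical example 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. An AUC close to one, as reported for both models, means that almost every such pair is ranked correctly regardless of how many samples each class contains.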
 
Figure 6.4: AUC-ROC for CNN model

Figure 6.5: AUC-ROC for RNN-LSTM model
6.3 TESTING THE VALIDITY OF THE MODELS IN REAL-TIME 
CLASSIFICATION OF DISASTER SOUNDS 
Recall that a time frame of a tenth of a second (0.1 seconds) was selected primarily with the
aim of achieving real-time classification of disaster sound (see section 5.5). To test the
real-time validity of the models, the time frame was increased from 0.1 seconds to 0.2 seconds
and 0.4 seconds. Results are shown in Figure 6.6.
[Bar chart: accuracy (%) of CNN and RNN-LSTM at time frames of 0.1 second, 0.2 seconds and
0.4 seconds]
Figure 6.6: Chart showing accuracy score comparison for initial and increased time
frames.
From Figure 6.6, it can be observed that an increase in the time frame resulted in a decrease
in the classification accuracies of both CNN and RNN-LSTM. This indicates that the models
perform best at the 0.1-second time frame when automatically classifying disaster sound.
However, it can be argued that the lower classification accuracy at the 0.2-second and
0.4-second time frames is attributable to the fact that the increase in time frame reduced the
total number of sound samples, and consequently reduced the size of the test set from 6239 to
3120 for the 0.2-second time frame and from 6239 to 1560 for the 0.4-second time frame.
Accordingly, the 0.2- and 0.4-second time frames were maintained while the number of sound
samples (n_samples) was augmented by:
DA1. Multiplying it by 4: n_samples = 4x,
where x = int(sounds['length'].sum() / 0.2),
instead of n_samples = 2x.
DA2. Multiplying it by 6: n_samples = 6x,
where x = int(sounds['length'].sum() / 0.2),
instead of n_samples = 2x.
[Note: the formulas are as written in the Python code.]
Consequently, at DA1 the size of the test set increased to 6239 (equal to the initial test set
size), and at DA2 it increased to 9358 (higher than the initial test set size).
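The test-set sizes quoted above follow from the sample-count formula and the 20% test fraction. The sketch below reproduces them under two assumptions that are not stated in the text: a total denoised duration of 1559.75 seconds (`ASSUMED_TOTAL_S`, a hypothetical value chosen to be consistent with the reported counts) and a test-set size that is rounded up. The function name `holdout_size` is illustrative.

```python
import math

# Hypothetical total duration (in seconds) of the denoised dataset.
ASSUMED_TOTAL_S = 1559.75


def holdout_size(total_length_s, chunk_s, factor, test_fraction=0.2):
    """Test-set size for n_samples = factor * int(total_length / chunk_length)."""
    n_samples = factor * int(total_length_s / chunk_s)
    return math.ceil(test_fraction * n_samples)  # assumed: rounded up


sizes = {
    "0.1 s, 2x": holdout_size(ASSUMED_TOTAL_S, 0.1, 2),
    "0.2 s, 2x": holdout_size(ASSUMED_TOTAL_S, 0.2, 2),
    "0.4 s, 2x": holdout_size(ASSUMED_TOTAL_S, 0.4, 2),
    "0.2 s, 4x (DA1)": holdout_size(ASSUMED_TOTAL_S, 0.2, 4),
    "0.2 s, 6x (DA2)": holdout_size(ASSUMED_TOTAL_S, 0.2, 6),
}
```

Doubling the time frame halves x, so multiplying by 4 instead of 2 at 0.2 seconds restores the original number of samples, and multiplying by 6 exceeds it.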
The classification accuracy of the 4𝑥 and 6𝑥 augmentation with 0.2seconds time frame and the 
initial 2𝑥 with 0.1second time frame is shown in Figure 6.7. 
[Bar chart: accuracy (%) of CNN and RNN-LSTM for the 2x, 4x and 6x sample counts]
Figure 6.7: Chart showing the classification accuracy for the real time model (2x) and
augmented dataset (4x, 6x).
As shown in Figure 6.7, although CNN had higher accuracies than RNN-LSTM in all cases, both
classifiers again performed better in real-time classification (0.1-second time frame) than at
the higher time frame.
Therefore, it can be concluded that both models can effectively and automatically detect, by
classification, a natural disaster sound in less time.
6.4 CHAPTER SUMMARY 
In this chapter, the performance of the proposed deep learning models, the convolutional neural
network (CNN) and the recurrent neural network with long short-term memory (RNN-LSTM), was
validated on the 20% test dataset using the classification metrics and 10-fold
cross-validation. It was observed that in all instances of the model validation process, CNN
outperformed RNN-LSTM, with the highest accuracy of 99.95%. Also, classification metrics such
as precision and recall gave the same results as accuracy for both models. Contingency tables
(confusion matrices) were used to show the level of accurate/inaccurate predictions between
the five different classes of natural disaster sound. It was observed that, while CNN had 3
incorrect instances in its confusion matrix, RNN-LSTM had 8 incorrect instances. Since the
dataset used in this study was imbalanced, the AUC-ROC, a more robust classification metric
(especially for extreme events like natural disasters (Chan, 2020)), was also explored.
Furthermore, the robustness of the models was tested in terms of automatic classification of
natural disaster sound by increasing the time frame from 0.1 seconds to 0.2 seconds and
0.4 seconds with augmented datasets. Results showed that CNN and RNN-LSTM consistently
maintained a high classification accuracy, even when compared with other studies. Results of
the classification metrics for the best performing models are summarized in Table 6.4.
Table 6.4: Result summary of classification metrics
Model       Accuracy   Precision   Recall    AUC score
CNN         99.95%     99.95%      99.95%    0.999
RNN-LSTM    99.87%     99.87%      99.87%    0.999
 
  
Chapter Seven  
CONCLUSION 
7.1 CHAPTER OVERVIEW 
This thesis is a work in the area of acoustic event classification. As stated in chapter one, this 
study aimed to develop an automatic natural disaster sound classification model using deep 
learning techniques. Due to complexities in natural disaster sound, its varying amplitudes and 
frequencies, the detection-by-classification (Acoustic Event Classification (AEC)) approach 
was adopted instead of detection-and-classification (Acoustic Event Detection (AED)). In this 
method of classification, the task of detecting a natural disaster sound automatically translates 
to the task of classification.  
This chapter summarizes and concludes this thesis. 
7.2 THESIS SUMMARY  
Natural disasters are without doubt phenomena that can neither be prevented nor stopped.
However, studies have shown that their effects on life and property can be mitigated by
predicting a disaster before it occurs or by detecting a disaster as soon as it occurs (Alam et
al. 2019; Gupta & Doshi, 2018; Wisner & Adams, 2002).
To this end, this study sought to achieve a set of four research objectives.
Chapter one highlighted a set of research problems, as well as the research aim and objectives.
To achieve the set aim and objectives, we started by conducting a literature review of studies
related to natural disasters. Through the literature review in chapter two, research trends and
methods proposed by researchers to mitigate the effects of natural disasters were identified.
Trends in predicting, detecting, and managing natural disasters were also identified. It was
observed that researchers were predominantly interested in post-disaster management
strategies. Tools such as machine learning and data mining techniques were used to analyze
historical and meteorological data of natural disasters. The datasets used for the analysis
were in the form of text, images, or numerical data. Several research gaps were identified,
including the lack of studies on the use of sound to differentiate one natural disaster type
from another (using AI techniques). The lack of literature in this domain justified the need
for a systematic review.
Accordingly, a systematic review of literature on sound classification was conducted in chapter 
three. This was the first main contribution of this thesis. In the review, 48 articles from the 
Scopus and ASA databases were selected based on predefined criteria. This review aimed to 
identify research trends and methodologies in the area of sound classification. Most 
importantly, this review sought to investigate existing literature on the use of sound for various 
classification tasks as well as the application domains. Although substantive evidence of the 
use of sound to classify acoustic events in the domains of bioacoustics, medicine, and the 
environment was found, no study on the use of sound to classify acoustic events such as 
natural disasters was identified. However, algorithms and machine learning techniques that 
could be adopted in this study were identified. Furthermore, two broad approaches to 
differentiating sound events were identified: acoustic event detection (AED), also known as 
detection-and-classification, and acoustic event classification (AEC), also known as 
detection-by-classification. Findings indicated that this study falls in the second category, 
acoustic event classification (AEC). It was also observed that neural networks (deep learning) 
achieved higher classification accuracies than the other classification techniques used. 
Hence, this study discussed convolutional neural networks (CNN) and recurrent neural 
networks (RNN) as the deep learning techniques to be used for the classification of natural 
disaster sound. With these reviews, research objectives one and two were achieved. 
In chapter four, the design science research methodology (DSRM) was explored as an 
appropriate research paradigm for this study. The DSRM primarily seeks to create an artefact 
that solves a real-world problem through a set of rigorous steps that must be reported clearly 
and concisely. Since this study aimed to develop a model for the automatic sound 
classification of natural disasters, the developed model is deemed an appropriate artefact.  
The sequence of steps taken and the experiments conducted in classifying the natural disaster 
sounds were presented in a framework and described accordingly in chapter five. The dataset 
was made up of five classes of natural disaster sound, namely earthquake, forest fire, volcano, 
waves and windstorm, all downloaded from the Freesound database. Denoising was done to 
remove irrelevant sound signals using a signal envelope and a threshold value of 0.005, while 
downsampling reduced the sampling rate from 41000 Hz to 16000 Hz. Furthermore, based on 
the filter bank-based Mel frequency cepstral coefficient (MFCC) approach, discriminative 
spectral features were extracted from the sound recordings.  
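The envelope-based denoising step summarised above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in this study: the envelope is assumed to be a rolling maximum of the absolute amplitude, and the function names and the 0.1-second rolling window are assumptions introduced here. Only the threshold value of 0.005 comes from the text.

```python
def envelope_mask(signal, rate, threshold=0.005, window_seconds=0.1):
    """Mark the samples whose local envelope exceeds the threshold.

    The envelope is taken here as the rolling maximum of the absolute
    amplitude over a centred window (an assumption for illustration).
    """
    window = max(1, int(rate * window_seconds))
    half = window // 2
    magnitudes = [abs(sample) for sample in signal]
    n = len(magnitudes)
    mask = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        # Simple O(n * window) scan; adequate for a short demonstration.
        mask.append(max(magnitudes[lo:hi]) > threshold)
    return mask


def denoise(signal, rate, threshold=0.005, window_seconds=0.1):
    """Keep only the samples that the envelope mask marks as relevant."""
    mask = envelope_mask(signal, rate, threshold, window_seconds)
    return [sample for sample, keep in zip(signal, mask) if keep]
```

A larger threshold discards more of the quiet portions of a recording, which is why the choice of 0.005 affects how much of each class survives this step.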
Preparing and developing an automatic disaster sound classification model was the third 
objective of this study. Hence, models were prepared to divide the sound samples in the five 
classes into 0.1-second window frames as well as to augment the data size. This process 
increased the number of sound recordings from 244 to 15596. Finally, the CNN and RNN-LSTM 
models were trained on 80% of the preprocessed sound. 
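The framing and the 80/20 division described above can be sketched as follows. This is an illustrative sketch under stated assumptions: non-overlapping frames, a shuffled split with a fixed seed, and the function names are all introduced here and are not taken from the study. At a sampling rate of 16000 Hz, a 0.1-second frame holds 1600 samples.

```python
import random


def split_into_frames(signal, rate, frame_seconds=0.1):
    """Cut a signal into consecutive, non-overlapping fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(rate * frame_seconds))
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into training and holdout parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```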
Model validation is the next, compulsory step after training a model. It entailed testing 
the performance of the models on the remaining 20% (holdout) data using two validation 
techniques: the classification metrics (accuracy, precision, recall and AUC scores) and 10-fold 
cross-validation.  
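The two validation techniques named above can be illustrated with a short sketch. The function names are hypothetical and the folds are assumed to be contiguous and unshuffled; this shows how the metrics and fold indices are defined, not the tooling used in this study.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop
```

Each sample appears in exactly one validation fold, so every sample contributes once to the cross-validated score.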
Using the accuracy, precision, recall, and AUC scores, a comparison of results obtained from 
the two selected methods of validating both models showed that convolutional neural networks 
(CNN) consistently performed better than recurrent neural networks (RNN). Hence, the fourth 
objective of this study was achieved. 
To the best of our knowledge, this study is the first to classify natural disasters based on 
acoustic signals/sound. Accordingly, it advances both modeling techniques and disaster 
detection-by-classification. 
7.3 DISCUSSIONS 
In this section, the performance of the CNN and RNN-LSTM acoustic event classification 
models is compared with that of other reported studies that used CNN and/or RNN. The best 
performing models from the various validation processes are used for the comparison. 
Table 7.1 shows the comparison of the results of previous studies with this study. It highlights 
the classification technique, classification category, type of sound, input acoustic features (or 
acoustic feature representation) and classification metrics. Generally, across the previous 
studies, it was observed that TensorFlow GPU (graphics processing unit) was a commonly 
used open-source library for the classification experiments.  
7.3.1 CLASSIFICATION CATEGORY 
The automatic classification of an acoustic event is primarily focused on classifying and/or 
differentiating environmental sounds into one of a set of identified classes (Pooja & Usha, 
2015; Ren et al. 2017). The goal of this study was to develop a model that will automatically 
classify natural disaster sounds as early as possible. Hence, the acoustic event classification 
(AEC) approach was adopted. With acoustic event classification, the task of detecting a sound 
automatically translates to classifying the sound (Aykanat et al. 2017; Raza et al. 2019; Temko 
& Nadeu, 2009). In contrast to acoustic event detection (AED), acoustic event classification 
(AEC) avoids the three-phase rigor of AED (Lopatka et al. 2016), which involves detection, 
segmentation and localization as performed in studies such as Oikarinen et al. (2019), 
Thakur et al. (2019) and Zhang et al. (2018). The AEC approach also does not face the 
problem of overlapping segments, which predominantly affects AED (Temko & Nadeu, 2009). 
7.3.2 INPUT ACOUSTIC FEATURES 
Most of the studies found adopted the image-based approach, using the spectrogram of the 
sound for the classification of the sounds of interest. Although the use of spectrograms as a 
time-frequency representation of sound has been reported to reduce the number of trainable 
parameters compared to direct sound classification (Huzaifah, 2017; Khamparia et al. 2019; 
Zhang et al. 2019), Mitilineos et al. (2018) argue that the image-based approach results in 
huge feature spaces. Furthermore, the low power quantization areas of spectrograms are 
affected by noisy conditions (Pooja & Usha, 2015). Instead of using spectrograms, MFCCs are 
used in this study for their classification effectiveness at reduced data rates (Wyse, 2017).  
Furthermore, CNN, which is predominantly used for image classification, works with 
two-dimensional image filters with shared weights across both axes (Wyse, 2017). However, 
this assumption does not hold when spectrograms of sounds are used as images, because the 
axes of a spectrogram do not carry the same information as those of a typical image (Ren et al. 
2017; Rothmann, 2019; Wyse, 2017). Ren et al. (2017) argue that using spectrograms for sound 
classification is currently not sufficient, as existing approaches do not capture the texture 
information appropriately.  
Mel frequency cepstral coefficients (MFCCs), on the other hand, are commonly used acoustic 
features in speech/sound recognition and classification for their ability to represent signal 
information accurately (Y. Kim et al. 2018; Luque, Romero-Lemos, Carrasco, & Barbancho, 
2018; Sengupta et al. 2016; Yaseen et al. 2018). However, due to the non-stationary nature of 
acoustic signals, Su et al. (2019) posit that adopting MFCC as a single feature for classifying 
environmental sounds may be insufficient for capturing relevant information about an acoustic 
event. Thus, this study leveraged the strength of CNN as an image-based classifier in 
addition to the MFCC as a spectral (frequency-domain) feature for the classification (Sasmaz 
& Tek, 2018; Verma et al. 2019). 
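For reference, the filter bank-based MFCC computation is commonly described with the two expressions below. They are the textbook form rather than the exact configuration used in this study; the constants 2595 and 700, the number of Mel filters $K$, the number of retained coefficients $N$, and the symbols $E_k$ and $c_n$ are assumptions introduced here for illustration.

```latex
% Mel scale: maps a frequency f (in Hz) to a perceptual pitch m (in mels)
m(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)

% Cepstral coefficients: discrete cosine transform of the log filter bank energies,
% where E_k is the energy in the k-th Mel filter for one frame
c_n = \sum_{k=1}^{K} \log(E_k) \, \cos\!\left[\frac{\pi n}{K}\left(k - \tfrac{1}{2}\right)\right],
\qquad n = 0, 1, \dots, N-1
```

Because the Mel scale compresses high frequencies, a small number of coefficients can summarise the spectral envelope of a frame, which is consistent with the reduced data rates noted above.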
7.3.3 CLASSIFICATION PERFORMANCE 
Compared to other studies shown in Table 7.1, the CNN and RNN-LSTM models used in this 
study had the highest classification accuracies with the shortest sound duration of 0.1 seconds.  
In the model validation stage, it was observed that CNN performed slightly better than 
RNN-LSTM. Convolutional neural networks are predominantly known to achieve high 
accuracies in image classification and recognition tasks. Findings from this study indicate 
that CNNs can also be successfully trained to classify natural sounds. This affirms the 
argument by Maccagno et al. (2019) and Salamon & Bello (2017) that CNNs are also efficient 
in sound classification, as they can identify patterns across inputs in a time-frequency 
spectrogram as well as differentiate one sound event from another even when the sound of 
interest is masked in noise. RNN-LSTM, on the other hand, achieved good performance in 
terms of accuracy, precision, and recall. Even though its results were lower than those of CNN, 
they further affirm that RNNs perform well on sound and time-series data (Raza et al., 2019; 
Verma et al., 2019). However, it was observed from the confusion matrix that the RNN-LSTM 
model was more prone to confusing one disaster type for another, which produces false 
positives for the wrongly predicted class. Although this may be a result of acoustic similarities 
between one disaster type and another, this situation must be further investigated. To 
overcome bias in the classification due to an imbalanced dataset, this study further adopted 
the AUC-ROC. With the AUC scores for all classes being equal to one, it can be concluded that 
the developed models are well suited to classifying natural disaster sound. 
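The per-class AUC referred to above can be illustrated with a one-vs-rest sketch. The AUC is computed here directly from its rank interpretation, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function names and the one-vs-rest scheme are assumptions for illustration, not the procedure used in this study.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    if not positives or not negatives:
        raise ValueError("both positive and negative samples are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(probabilities, y_true, classes):
    """Per-class AUC, treating each class in turn as the positive class.

    probabilities[i][j] is the predicted score of sample i for classes[j].
    """
    return {
        label: binary_auc(
            [row[j] for row in probabilities],
            [t == label for t in y_true],
        )
        for j, label in enumerate(classes)
    }
```

An AUC of one for a class means that every sample of that class was scored above every sample of the other classes, independently of how many samples each class contains.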
 
Table 7.1: Comparison of study approaches with other studies. 
| Reference | Technique | Classification category | Type of sound | Input acoustic features | No. of sound recordings | Duration (seconds) | Accuracy (%) | F1-score |
|---|---|---|---|---|---|---|---|---|
| (Aykanat et al. 2017) | CNN | AEC | Respiratory sound | Spectrogram images | 17,930 | N/M | 86 | - |
| (Khamparia et al. 2019) | CNN | AEC | Environmental sound | Spectrogram images | 2,400 | N/M | 77 | - |
| (Salamon & Bello, 2017) | CNN | AEC | Environmental sound | Spectrogram images | 8,732 | 4 | 85 | - |
| (Sasmaz & Tek, 2018) | CNN | AEC | Animal sound | MFCC | 875 | N/M | 75 | - |
| (Thakur et al. 2019) | CNN | AED | Bird sound | Spectrogram images | 10,208 | 0.5 to 320 | - | 0.94 |
| (Oikarinen et al. 2019) | CNN | AED | Marmoset sound | Spectrogram images | 15,970 | N/M | 99 | 0.81 |
| (Verma et al. 2019) | RNN-LSTM | AED | Environmental sound | MFCC | 52,845 | 10 |  | 0.83 |
| (Raza et al. 2019) | RNN-LSTM | AEC | Heartbeat sound | Spectrogram images | 322 | 12.5 & 27.8 | 80.80 | - |
| (Zhang et al. 2018) | RNN-LSTM | AED | Marmoset sound | Mel-filter bank spectrum | 20000 | 5 | 92 | - |
| This study | CNN | AEC | Natural disaster sound | MFCC | 31194 | 0.1 | 99.95 | 99.95 |
| This study | RNN-LSTM | AEC | Natural disaster sound | MFCC | 31194 | 0.1 | 99.87 | 99.87 |
7.4 LIMITATION OF THE STUDY 
Although this study contributes to knowledge, there are some limitations and drawbacks that 
need to be highlighted. 
It is worth noting that all the processes in developing the disaster sound classification model 
were fully automated and carried out without expert knowledge. This may have posed certain 
limitations to this study, which are discussed below. 
7.4.1 DATASETS 
It would have been appropriate to test and compare the model’s performance on various 
historical natural disaster datasets as well as datasets from different scenes of a natural 
disaster. However, due to the unavailability of free disaster sound datasets, this study could not 
explore this option. 
7.4.2 DENOISING THE SIGNAL 
To denoise the signals, a signal envelope with a threshold value of 0.005 was created. This 
comes with its disadvantages, considering that the threshold value was arbitrarily chosen. 
Although only the upper boundary was considered, a threshold that is too high implies that 
only well-defined sounds will be detected while the rest go undetected, and a threshold that is 
too low will result in sound recordings being jumbled into a class (confusing one sound type 
for another). An ideal situation would entail using a different threshold for each class of the 
disaster sound dataset; this could not be achieved because it requires expert knowledge 
(Malfante et al. 2018). 
7.4.3 FEATURE EXTRACTION 
Sounds change through time in unique patterns, and extracting these dynamic features 
enables one sound type to be distinguished from another (Karbasi et al. 2011). Although the 
spectral features used in this study were suitable for the sound classification task, Huzaifah 
(2017) and Mitilineos et al. (2018) argue that using spectral features alone is not sufficient 
because they are unable to provide time-based progression information of acoustic signals due 
to their non-stationary nature. While a more robust classification system could be obtained 
by combining the spectral features with time-frequency features (since the latter can capture 
non-stationary behaviour), this approach is computationally expensive (Mitilineos et al. 2018; 
Ou et al. 2013). 
7.5 RECOMMENDATION 
In the context of this study, the terms detection and classification are inseparable. However, 
the choice between detection-and-classification (AED) and detection-by-classification (AEC) 
is the determining factor for the automatic detection of an acoustic event such as a natural 
disaster.  
In this study, the latter was adopted. While the use of sound for classifying a natural disaster 
fills the gap left by satellite images or numerical data, as posited by researchers such as Aziz 
et al. (2019) and Panagiota et al. (2011), having a robust model for real-time classification is 
the starting point. We therefore recommend that a model such as the CNN or RNN model be 
integrated into an application that can be installed on mobile and IoT devices.  
7.6 FUTURE WORK 
Future work will involve testing and comparing the model’s performance with different 
disaster sound datasets from different locations. 
Secondly, it was observed that both models misclassified one disaster sound as another in 
various instances of predicting the five classes. While this may be due to the acoustical 
similarities of the natural disasters, it is worth investigating. 
Based on the identified limitation of this study concerning the feature extraction technique, a 
comparative study of time-frequency features and spectral features using the CNN and 
RNN-LSTM classifiers will be considered for future study. A hybrid CNN-RNN approach may 
also be considered.  
References: 
Alam, F., Ofli, F., & Imran, M. (2019). Descriptive and visual summaries of disaster events 
using artificial intelligence techniques: case studies of Hurricanes Harvey, Irma, and 
Maria. Behaviour and Information Technology, 3001(May 2019). 
https://doi.org/10.1080/0144929X.2019.1610908 
Allen, J. A., Murray, A., Noad, M. J., Dunlop, R. A., & Garland, E. C. (2017). Using self-
organizing maps to classify humpback whale song units and quantify their similarity. The 
Journal of the Acoustical Society of America, 142(4), 1943–1952. 
https://doi.org/10.1121/1.4982040 
Arel, I., Rose, D. C., & Karnowski, T. P. (2010). Deep Machine Learning—A New Frontier. 
Ieee, November, 13–18. 
Asnaning, A. R., & Putra, S. D. (2018). Flood Early Warning System Using Cognitive 
Artificial Intelligence: The Design of AWLR Sensor. 2018 International Conference on 
Information Technology Systems and Innovation, ICITSI 2018 - Proceedings, 165–170. 
https://doi.org/10.1109/ICITSI.2018.8695948 
Aucouturier, J.-J., Nonaka, Y., Katahira, K., & Okanoya, K. (2011). Segmentation of 
expiratory and inspiratory sounds in baby cry audio recordings using hidden Markov 
models. The Journal of the Acoustical Society of America, 130(5), 2969–2977. 
https://doi.org/10.1121/1.3641377 
Aykanat, M., Kılıç, Ö., Kurt, B., & Saryal, S. (2017). Classification of lung sounds using 
convolutional neural networks. Eurasip Journal on Image and Video Processing, 2017(1). 
https://doi.org/10.1186/s13640-017-0213-2 
Aziz, S., Awais, M., Akram, T., Khan, U., Alhussein, M., & Aurangzeb, K. (2019). Automatic 
scene recognition through acoustic classification for behavioral robotics. Electronics 
(Switzerland), 8(5). https://doi.org/10.3390/electronics8050483 
BBC News. (2018). False earthquake warning panics Japan. BBC. 
https://www.bbc.com/news/world-asia-42582113 
Beach, K., & Dunmire, B. (2007). Medical Acoustics. In Rossing T. (eds) Springer Handbook 
of Acoustics. Springer Handbooks. Springer, New York, NY. 
https://doi.org/10.1007/978-0-387-30425-0_21 
Binder, C., & Paul, H. (2019). Range-dependent impacts of ocean acoustic propagation on 
automated classification of transmitted bowhead and humpback whale vocalizations. The 
Journal of the Acoustical Society of America, 2480. https://doi.org/10.1121/1.5097593 
Binkhonain, M., & Zhao, L. (2019). A review of machine learning algorithms for identification 
and classification of non-functional requirements. Expert Systems with Applications: X, 
1. https://doi.org/10.1016/j.eswax.2019.100001 
Bishop, J. C., Falzon, G., Trotter, M., Kwan, P., & Meek, P. D. (2019). Livestock vocalisation 
classification in farm soundscapes. Computers and Electronics in Agriculture, 162(April), 
531–542. https://doi.org/10.1016/j.compag.2019.04.020 
Bold, N., Zhang, C., & Akashi, T. (2019). Cross-domain deep feature combination for bird 
species classification with audio-visual data. IEICE Transactions on Information and 
Systems, E102D(10), 2033–2042. https://doi.org/10.1587/transinf.2018EDP7383 
Borgne, Y.-A. Le, & Bontempi, G. (2017). Deep learning techniques-Overview. May. 
https://doi.org/10.13140/RG.2.2.33519.84643 
Bourouhou, A., Jilbab, A., Nacir, C., & Hammouch, A. (2019). Heart sounds classification for 
a medical diagnostic assistance. International Journal of Online and Biomedical 
Engineering, 15(11), 88–103. https://doi.org/10.3991/ijoe.v15i11.10804 
Boustan, P. L., Kahn, M. E., Rhode, P. W., & Yanguas, M. L. (2017). The effect of natural 
disasters on economic activity in US counties: A century of data. In National Bureau of 
Economic Research. 
http://www.nber.org/papers/w23410 
Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X. Z., Raich, R., Hadley, S. J. K., Hadley, 
A. S., & Betts, M. G. (2012). Acoustic classification of multiple simultaneous bird 
species: A multi-instance multi-label approach. The Journal of the Acoustical Society of 
America, 131(6), 4640–4650. https://doi.org/10.1121/1.4707424 
Calvet, L., Lopeman, M., De Armas, J., Franco, G., & Juan, A. A. (2017). Statistical and 
machine learning approaches for the minimization of trigger errors in parametric 
earthquake catastrophe bonds. Sort, 41(2), 373–391. 
https://doi.org/10.2436/20.8080.02.64 
Cao, X., Zhang, X., Yu, Y., & Niu, L. (2017). Deep learning-based recognition of underwater 
target. International Conference on Digital Signal Processing, DSP, 89–93. 
https://doi.org/10.1109/ICDSP.2016.7868522 
Carcary, M. (2011). Design science research: The case of the IT capability maturity framework 
(IT CMF). Electronic Journal of Business Research Methods, 9(2), 109–118. 
Chan, C. (2020). What is a ROC Curve and How to Interpret It | Displayr. 
https://www.displayr.com/what-is-a-roc-curve-how-to-interpret-it/ 
Chappell, C. (2019). Natural disasters cost $91 billion in 2018, according to federal report. 
Cnbc. https://www.cnbc.com/2019/02/06/natural-disasters-cost-91-billion-in-2018-
federal-report.html 
Chen, H., Yuan, X., Pei, Z., Li, M., & Li, J. (2019). Triple-Classification of Respiratory Sounds 
Using Optimized S-Transform and Deep Residual Networks. IEEE Access, 7(April), 
32845–32852. https://doi.org/10.1109/ACCESS.2019.2903859 
Chen, W., Shirzadi, A., Shahabi, H., Ahmad, B. Bin, Zhang, S., Hong, H., & Zhang, N. (2017). 
A novel hybrid artificial intelligence approach based on the rotation forest ensemble and 
naïve Bayes tree classifiers for a landslide susceptibility assessment in Langao County, 
China. Geomatics, Natural Hazards and Risk, 8(2), 1955–1977. 
https://doi.org/10.1080/19475705.2017.1401560 
Chu, S., Narayanan, S., & Kuo, C. C. J. (2009). Environmental sound recognition with 
time-frequency audio features. IEEE Transactions on Audio, Speech and Language 
Processing, 17(6), 1142–1158. https://doi.org/10.1109/TASL.2009.2017438 
Coskun, M., YILDIRIM, Ö., UÇAR, A., & DEMIR, Y. (2017). An Overview of Popular Deep 
Learning Methods. European Journal of Technic, 7(2), 165–176. 
https://doi.org/10.23884/ejt.2017.7.2.11 
Cvengros, R. M., Valente, D., Nykaza, E. T., & Vipperman, J. S. (2012a). Blast noise 
classification with common sound level meter metrics. The Journal of the Acoustical 
Society of America, 132(2), 822–831. https://doi.org/10.1121/1.4730921 
Cvengros, R. M., Valente, D., Nykaza, E. T., & Vipperman, J. S. (2012b). Blast noise 
classification with common sound level meter metrics. The Journal of the Acoustical Society 
of America, 822. https://doi.org/10.1121/1.4730921 
Davis, N., & Suresh, K. (2019). Environmental sound classification using deep convolutional 
neural networks and data augmentation. 2018 IEEE Recent Advances in Intelligent 
Computational Systems, RAICS 2018, 41–45. 
https://doi.org/10.1109/RAICS.2018.8635051 
Ding, S., Zhao, H., Zhang, Y., Xu, X., & Nie, R. (2015). Extreme learning machine: algorithm, 
theory and applications. Artificial Intelligence Review, 44(1), 103–115. 
https://doi.org/10.1007/s10462-013-9405-z 
Domingo, C. (2012). An overview of the internet of underwater things. Journal of Network 
and Computer Applications, 35(6), 1879–1890. https://doi.org/10.1016/j.jnca.2012.07.012 
Doxani, G., Siachalou, S., Mitraka, Z., & Patias, P. (2019). Decision making on disaster 
management in agriculture with sentinel applications. International Archives of the 
Photogrammetry, Remote Sensing and Spatial Information Sciences - ISPRS Archives, 
42(3/W8), 121–126. https://doi.org/10.5194/isprs-archives-XLII-3-W8-121-2019 
Dubey, S., Dahiya, M., & Jain, S. (2018). Application of Distributed Data Center in Logistics 
as Cloud Collaboration for handling Disaster Relief. Proceedings - 2018 3rd International 
Conference On Internet of Things: Smart Innovation and Usages, IoT-SIU 2018, 1–11. 
https://doi.org/10.1109/IoT-SIU.2018.8519865 
Duggar, E., Li, Q., & Praagh, A. Van. (2016). Understanding the Impact of Natural Disasters : 
Exposure to Direct Damages Across Countries (Issue November). 
Dwivedi, A. K., Imtiaz, S. A., & Rodriguez-Villegas, E. (2019). Algorithms for automatic 
analysis and classification of heart sounds-A systematic review. IEEE Access, 7(c), 8316–
8345. https://doi.org/10.1109/ACCESS.2018.2889437 
Epelbaum, T. (2017). Deep learning: Technical introduction. http://arxiv.org/abs/1709.01412 
Evans, M. (2011). Natural disasters. In Virginia Quarterly Review (Vol. 93, Issue 1). 
Fang, S. H., Wang, C. Te, Chen, J. Y., Tsao, Y., & Lin, F. C. (2019). Combining acoustic 
signals and medical records to improve pathological voice classification. APSIPA 
Transactions on Signal and Information Processing, 8(2019), 1–11. 
https://doi.org/10.1017/ATSIP.2019.7 
Furquim, G., Filho, G. P. R., Jalali, R., Pessin, G., Pazzi, R. W., & Ueyama, J. (2018). How to 
improve fault tolerance in disaster predictions: A case study about flash floods using IoT, 
ML and real data. Sensors (Switzerland), 18(3), 1–20. https://doi.org/10.3390/s18030907 
Gingras, B., & Fitch, W. T. (2013). A three-parameter model for classifying anurans into four 
genera based on advertisement calls. The Journal of the Acoustical Society of America, 
133(October 2012), 547–559. 
Giordano, B. L. (2005). Everyday listening: an annotated bibliography. In D. Rocchesso & F. 
Fontana (Eds.), The Sounding Object: Vol. 6 PART B (pp. 1–12). 
https://doi.org/10.1115/GT2005-69036 
Giret, N., Roy, P., Albert, A., Pachet, F., Kreutzer, M., & Bovet, D. (2011). Finding good 
acoustic features for parrot vocalizations: The feature generation approach. The Journal 
of the Acoustical Society of America, 129(2), 1089–1099. 
https://doi.org/10.1121/1.3531953 
Gopalaswami, R. (2018). A Study on the Correlation of Physiological and Psychological 
Health Hazards in Human Habitats with Seismicity, Mountain Air Turbulence and 
Environmental Infrasound. Open Journal of Earthquake Research, 07(02), 69–87. 
https://doi.org/10.4236/ojer.2018.72005 
Goswami, S., Chakraborty, S., Ghosh, S., Chakrabarti, A., & Chakraborty, B. (2018). A review 
on application of data mining techniques to combat natural disasters. Ain Shams 
Engineering Journal, 9(3), 365–378. https://doi.org/10.1016/j.asej.2016.01.012 
Greenhalgh, T. (1997). How to read a paper: Papers that summarise other papers (systematic 
reviews and meta-analyses). BMJ, 315(7109), 672–675. 
https://doi.org/10.1136/bmj.315.7109.672 
Guilment, T., Socheleau, F.-X., Pastor, D., & Vallez, S. (2018). Sparse representation-based 
classification of mysticete calls. The Journal of the Acoustical Society of America, 144(3), 
1550–1563. https://doi.org/10.1121/1.5055209 
Gupta, S., & Doshi, L. (2018). An Acknowledgement Based System for Forest Fire Detection 
via Leach Algorithm. Proceedings - 2017 International Conference on Computational 
Intelligence and Networks, CINE 2017, 17–21. https://doi.org/10.1109/CINE.2017.16 
Halkias, X. C., Paris, S., & Glotin, H. (2013). Classification of mysticete sounds using machine 
learning techniques. The Journal of the Acoustical Society of America, 134(5), 3496–
3505. https://doi.org/10.1121/1.4821203 
Halkias, X., Paris, S., & Glotin, H. (2013). Classification of mysticete sounds using machine 
learning techniques. 3496(2013). https://doi.org/10.1121/1.4821203 
Han, W., Coutinho, E., Ruan, H., Li, H., Schuller, B., Yu, X., & Zhu, X. (2016). Semi-
supervised active learning for sound classification in hybrid learning environments. PLoS 
ONE, 11(9), 1–19. https://doi.org/10.1371/journal.pone.0162075 
Hartman, W. M., & Candy, J. V. (2014). Acoustic Signal Processing. Springer Handbook of 
Acoustics, December. https://doi.org/10.1007/978-1-4939-0755-7 
Hassiotis, C. (2018). Infrasound Can Detect Tornadoes an Hour Before They Form. 
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design Science in Information Systems 
Research (Vol. 28, Issue 1, pp. 75–105). 
Https://www.bbc.com/news/technology-40366816. (2017). California earthquake alarm 
sounded - 92 years late. BBC. 
Huzaifah, M. (2017). Comparison of Time-Frequency Representations for Environmental 
Sound Classification using Convolutional Neural Networks. 1–5. 
http://arxiv.org/abs/1706.07156 
Hwang, C. J., Kush, A., & Kumar, A. (2018). Multihop ad hoc networks for disaster response 
scenarios. Proceedings - 2018 International Conference on Computational Science and 
Computational Intelligence, CSCI 2018, 810–814. 
https://doi.org/10.1109/CSCI46756.2018.00162 
Ibrahim, A. K., Chérubin, L. M., Zhuang, H., Schärer Umpierre, M. T., Dalgleish, F., Erdol, 
N., Ouyang, B., & Dalgleish, A. (2018). An approach for automatic classification of 
grouper vocalizations with passive acoustic monitoring. The Journal of the Acoustical 
Society of America, 143(2), 666–676. https://doi.org/10.1121/1.5022281 
Ibrahim, A. K., Zhuang, H., Chérubin, L. M., Umpierre, M. T. S., Ali, A. M., Richard, S., Sch, 
M. T., Ali, A. M., Nemeth, R. S., & Erdol, N. (2019). Classification of red hind grouper 
call types using random ensemble of stacked autoencoders. The Journal of the Acoustical 
Society of America, 2155. https://doi.org/10.1121/1.5126861 
Imran, M., Alam, F., Ofli, F., & Aupetit, M. (2017). Enabling Rapid Disaster Response Using 
Artificial Intelligence and Social Media. 1–12. 
Ivić, M. (2019). Artificial intelligence and geospatial analysis in disaster management. 
International Archives of the Photogrammetry, Remote Sensing and Spatial Information 
Sciences - ISPRS Archives, 42(3/W8), 161–166. https://doi.org/10.5194/isprs-archives-
XLII-3-W8-161-2019 
Jacoby, C. B. (2014). Automatic Urban Sound Classification Using Feature Learning 
Techniques. https://steinhardt.nyu.edu/scmsAdmin/media/users/ec109/MTT-14-01-
013.pdf 
Joshi, N. (2019). How AI Can And Will Predict Disasters. Forbes. 
Kaewtip, K., Alwan, A., O’Reilly, C., & Taylor, C. E. (2016). A robust automatic birdsong 
phrase classification: A template-based approach. The Journal of the Acoustical Society 
of America, 140(5), 3691–3701. https://doi.org/10.1121/1.4966592 
Kansal, A., Singh, Y., Kumar, N., & Mohindru, V. (2016). Detection of forest fires using 
machine learning technique: A perspective. Proceedings of 2015 3rd International 
Conference on Image Information Processing, ICIIP 2015, 241–245. 
https://doi.org/10.1109/ICIIP.2015.7414773 
Karbasi, M., Ahadi, S. M., & Bahmanian, M. (2011). Environmental Sound Classification 
using Spectral Dynamic Features. IEEE ICICS, 2–7. 
https://doi.org/10.1109/ICICS.2011.6173513 
Khalaf, M., Hussain, A. J., Al-Jumeily, D., Baker, T., Keight, R., Lisboa, P., Fergus, P., & Al 
Kafri, A. S. (2018). A Data Science Methodology Based on Machine Learning Algorithms 
for Flood Severity Prediction. 2018 IEEE Congress on Evolutionary Computation, CEC 
2018 - Proceedings, 1–8. https://doi.org/10.1109/CEC.2018.8477904 
Khalaf, M., Hussain, A. J., Al-Jumeily, D., Fergus, P., & Idowu, I. O. (2015). Advance flood 
detection and notification system based on sensor technology and machine learning 
algorithm. 2015 22nd International Conference on Systems, Signals and Image 
Processing - Proceedings of IWSSIP 2015, 105–108. 
https://doi.org/10.1109/IWSSIP.2015.7314188 
Khamparia, A., Gupta, D., Nguyen, N. G., Khanna, A., Pandey, B., & Tiwari, P. (2019). Sound 
classification using convolutional neural network and tensor deep stacking network. IEEE 
Access, 7(January), 7717–7727. https://doi.org/10.1109/ACCESS.2018.2888882 
Kim, S., Lee, W., Park, Y. S., Lee, H. W., & Lee, Y. T. (2017). Forest fire monitoring system 
based on aerial image. Proceedings of the 2016 3rd International Conference on 
Information and Communication Technologies for Disaster Management, ICT-DM 2016, 
5–10. https://doi.org/10.1109/ICT-DM.2016.7857214 
Kim, Y., Sa, J., Chung, Y., Park, D., & Lee, S. (2018). Resource-efficient pet dog sound events 
classification using LSTM-FCN based on time-series data. Sensors (Switzerland), 18(11). 
https://doi.org/10.3390/s18114019 
Krishna, D., Marcelino, P., Doxani, G., Siachalou, S., Mitraka, Z., Patias, P., Gingras, B., Fitch, 
W. T., Union, I. T., Lebien, J., Ioup, J., Gomes, L., Vale, Z., Han, W., Coutinho, E., Ruan, 
H., Li, H. H. H., Schuller, B., Yu, X., … Erdol, N. (2018). Deep Convolutional Neural 
Networks and Data Augmentation for Environmental Sound Classification. The Journal 
of the Acoustical Society of America, 8(5), 1–11. 
https://doi.org/10.1109/LSP.2017.2657381 
LeBien, J. G., & Ioup, J. W. (2018). Species-level classification of beaked whale echolocation 
signals detected in the northern Gulf of Mexico. The Journal of the Acoustical Society of 
 114  
University of Ghana http://ugspace.ug.edu.gh
America, 144(1), 387–396. https://doi.org/10.1121/1.5047435 
Lebien, J., & Ioup, J. (2018). Species-level classification of beaked whale echolocation signals 
detected in the northern Gulf of Mexico. The Journal of the Acoustical Society of America, 
387, 3278–3282. https://doi.org/10.1121/1.5047435 
Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. 
https://doi.org/10.1038/nature14539 
Li, H., Fei, X., & He, C. (2018). Study on Most Important Factor and Most Vulnerable Location 
for a Forest Fire Case Using Various Machine Learning Techniques. Proceedings - 2018 
6th International Conference on Advanced Cloud and Big Data, CBD 2018, 298–303. 
https://doi.org/10.1109/CBD.2018.00060 
Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A statistically consistent and more 
discriminating measure than accuracy. IJCAI International Joint Conference on Artificial 
Intelligence, 519–524. 
Lopatka, K., Kotus, J., & Czyzewski, A. (2016). Detection, classification and localization of 
acoustic events in the presence of background noise for acoustic surveillance of hazardous 
situations. Multimedia Tools and Applications, 75(17), 10407–10439. 
https://doi.org/10.1007/s11042-015-3105-4 
Luque, A., Romero-Lemos, J., Carrasco, A., & Barbancho, J. (2018). Non-sequential automatic 
classification of anuran sounds for the estimation of climate-change indicators. Expert 
Systems with Applications, 95, 248–260. https://doi.org/10.1016/j.eswa.2017.11.016 
Luque, A., Romero-Lemos, J., Carrasco, A., & Gonzalez-Abril, L. (2018). Temporally-aware 
algorithms for the classification of anuran sounds. PeerJ, 2018(5), 1–40. 
https://doi.org/10.7717/peerj.4732 
Maccagno, A., Mastropietro, A., Mazziotta, U., Lee, Y., & Uncini, A. (2019). A CNN 
Approach for Audio Classification in Construction Sites. WIRNAt: Vietri Sul Mare (SA), 
 115  
University of Ghana http://ugspace.ug.edu.gh
Italy, June. 
Malfante, M., Mars, J. I., Dalla Mura, M., & Gervaise, C. (2018). Automatic fish sounds 
classification. The Journal of the Acoustical Society of America, 143(5), 2834–2846. 
https://doi.org/10.1121/1.5036628 
Mitilineos, S. A., Potirakis, S. M., Tatlas, N. A., & Rangoussi, M. (2018). A two-level sound 
classification platform for environmental monitoring. Journal of Sensors, 2018. 
https://doi.org/10.1155/2018/5828074 
Mone, G. (2007). Earth Speaks in an Inaudible Voice. Discover Magazine, August. 
Monroe-Kane, C. (2019). 20 Seconds Makes All The Difference : How Sound Waves Help Us 
Understand Earthquakes Geophysicist Ben Holtzman On Using Sound Recordings To 
Study Earthquakes ’ Past ,. WISCONSIN Public Radio (Npr), a Service of the Wisconsin 
Educational Communications Board and the University of Wisconsin-Madison. 
https://www.wpr.org/20-seconds-makes-all-difference-how-sound-waves-help-us-
understand-earthquakes 
Mousa, M., Zhang, X., & Claudel, C. (2016). Flash Flood Detection in Urban Cities Using 
Ultrasonic and Infrared Sensors. IEEE Sensors Journal, 16(19), 7204–7216. 
https://doi.org/10.1109/JSEN.2016.2592359 
Muir, T. G., & Bradley, D. L. (2016). Underwater Acoustics: A Brief Historical Overview 
Through World War II. Acoustics Today, 12(3), 40–48. http://acousticstoday.org/wp-
content/uploads/2016/09/Underwater-Acoustics.pdf 
Nasanbat, E., Lkhamjav, O., Balkhai, A., Tsevee-Oirov, C., Purev, A., & Dorjsuren, M. (2018). 
A spatial distributionmap of the wildfire risk in Mongolia using decision support system. 
International Archives of the Photogrammetry, Remote Sensing and Spatial Information 
Sciences - ISPRS Archives, 42(3W4), 357–362. https://doi.org/10.5194/isprs-archives-
XLII-3-W4-357-2018 
 116  
University of Ghana http://ugspace.ug.edu.gh
Noda, J. J., Travieso, C. M., & Sánchez-Rodríguez, D. (2016). Automatic taxonomic 
classification of fish based on their acoustic signals. Applied Sciences (Switzerland), 
6(12). https://doi.org/10.3390/app6120443 
Oikarinen, T., Srinivasan, K., Meisner, O., Hyman, J. B., Parmar, S., Fanucci-Kiss, A., 
Desimone, R., Landman, R., & Feng, G. (2019). Deep convolutional network for animal 
sound classification and source attribution using dual audio recordings. The Journal of the 
Acoustical Society of America, 145(2), 654–662. https://doi.org/10.1121/1.5087827 
Okamoto, K., Mochida, T., Nozaki, D., Wen, Z., Qi, X., & Sato, T. (2018). Content-Oriented 
Surveillance System Based on ICN in Disaster Scenarios. International Symposium on 
Wireless Personal Multimedia Communications, WPMC, 2018-Novem, 484–489. 
https://doi.org/10.1109/WPMC.2018.8712852 
Ou, H., Au, W., Zurk, L., & Lammers, M. (2013). Automated extraction and classification of 
time-frequency contours in humpback vocalizations. The Journal of the Acoustical 
Society of America, 133(January). 
Oweis, R. J., Abdulhay, E. W., Khayal, A., & Awad, A. (2015). An alternative respiratory 
sounds classification system utilizing artificial neural networks. Biomedical Journal, 
38(2), 153–161. https://doi.org/10.4103/2319-4170.137773 
Palaniappan, R., Sundaraj, K., & Ahamed, N. U. (2013). Machine learning in lung sound 
analysis: A systematic review. Biocybernetics and Biomedical Engineering, 33(3), 129–
135. https://doi.org/10.1016/j.bbe.2013.07.001 
Panagiota, M., Jocelyn, C., & Erwan, P. (2011). State of the art on Remote Sensing for 
vulnerability and damage assessment on urban context. Grenoble, France: URBASIS 
Consortium, March. https://doi.org/10.1097/MD.0000000000008031 
Pandeya, Y. R., Kim, D., & Lee, J. (2018). Domestic cat sound classification using learned 
features from deep neural nets. Applied Sciences (Switzerland), 8(10), 1–17. 
 117  
University of Ghana http://ugspace.ug.edu.gh
https://doi.org/10.3390/app8101949 
Pandeya, Y. R., & Lee, J. (2018). Domestic cat sound classification using transfer learning. 
International Journal of Fuzzy Logic and Intelligent Systems, 18(2), 154–160. 
https://doi.org/10.5391/IJFIS.2018.18.2.154 
Parada, P. P., & Cardenal-Lopez, A. (2014). Using Gaussian mixture models to detect and 
classify dolphin whistles and pulses. The Journal of the Acoustical Society of America, 
135(June), 3371–3381. http://dx.doi.org/10.1121/1.4876439 
Peffers, K., Tuunanen, T., & Rothenberger, M. A. (2008). A Design Science Research 
Methodology for Information Systems Research. Journal of Management Information 
Systems., August 2014. https://doi.org/10.2753/MIS0742-1222240302 
Perlman, D. (2013). THE RUMBLE OF DESTRUCTION / Infrasonic sound. Hearst 
Communication, Inc. https://www.sfgate.com/news/article/THE-RUMBLE-OF-
DESTRUCTION-Infrasonic-sound-too-2632570.php 
Perr, J. (2005). Basic acoustics and Signal Processing. LinuxFocus.Org, 271. 
http://linuxfocus.org 
Peso Parada, P., & Cardenal-López, A. (2014). Using Gaussian mixture models to detect and 
classify dolphin whistles and pulses. The Journal of the Acoustical Society of America, 
135(6), 3371–3380. https://doi.org/10.1121/1.4876439 
Pooja, K. J., & Usha, L. (2015). Robust Sound Event Recognition using Subband Power 
Distribution Image Feature. International Journal of Engineering Research and 
Technology, V4(05), 1116–1121. https://doi.org/10.17577/ijertv4is051087 
Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M. P., Shyu, M.-L., Chen, S.-C., 
& Iyengar, S. S. (2018). A Survey on Deep Learning. ACM Computing Surveys, 51(5), 1–
36. https://doi.org/10.1145/3234150 
Pramono, R. X. A., Bowyer, S., & Rodriguez-Villegas, E. (2017). Automatic adventitious 
 118  
University of Ghana http://ugspace.ug.edu.gh
respiratory sound analysis: A systematic review. In PLoS ONE (Vol. 12, Issue 5). 
https://doi.org/10.1371/journal.pone.0177926 
Qian, K., Zhang, Z., Baird, A., & Schuller, B. (2017a). Active learning for bird sound 
classification via a kernel-based extreme learning machine. The Journal of the Acoustical 
Society of America, 142(4), 1796–1804. https://doi.org/10.1121/1.5004570 
Qian, K., Zhang, Z., Baird, A., & Schuller, B. (2017b). Active learning for bird sounds 
classification. Acta Acustica United with Acustica, 103(3), 361–364. 
https://doi.org/10.3813/AAA.919064 
Qian, K., Zhang, Z., Baird, A., & Schuller, B. (2018). Active learning for bird sound 
classification via a kernel-based extreme learning machine. The Journal of the Acoustical 
Society of America, 1796(2017). https://doi.org/10.1121/1.5004570 
Rascon, C., & Meza, I. (2017). Localization of sound sources in robotics: A review. Robotics 
and Autonomous Systems, 96, 184–210. https://doi.org/10.1016/j.robot.2017.07.011 
Raza, A., Mehmood, A., Ullah, S., Ahmad, M., Choi, G. S., & On, B. W. (2019). Heartbeat 
sound signal classification using deep learning. Sensors (Switzerland), 19(21), 1–15. 
https://doi.org/10.3390/s19214819 
Ren, J., Jiang, X., Yuan, J., & Magnenat-Thalmann, N. (2017). Sound-Event Classification 
Using Robust Texture Features for Robot Hearing. IEEE Transactions on Multimedia, 
19(3), 447–458. https://doi.org/10.1109/TMM.2016.2618218 
Resch, B., Usländer, F., & Havas, C. (2018). Combining machine-learning topic models and 
spatiotemporal analysis of social media data for disaster footprint and damage assessment. 
Cartography and Geographic Information Science, 45(4), 362–376. 
https://doi.org/10.1080/15230406.2017.1356242 
Robakis, E., Watsa, M., & Erkenswick, G. (2018). Classification of producer characteristics in 
primate long calls using neural networks. The Journal of the Acoustical Society of 
 119  
University of Ghana http://ugspace.ug.edu.gh
America, 144(1), 344–353. https://doi.org/10.1121/1.5046526 
Roch, M. A., Newport, D., Baumann-pickering, S., Mellinger, D. K., Qui, S., Soldevilla, M. 
S., & Hildebrand, J. A. (2011). Classification of echolocation clicks from odontocetes in 
the Southern California Bight. The Journal of the Acoustical Society of America, 
129(January), 467–476. https://doi.org/10.1121/1.3514383 
Rothmann, D. (2019). What ’s wrong with CNNs and spectrograms for audio processing ? 
Sounds are “ transparent .” 1–9. 
Rubinstein, G. (2008). ON SOUNDS EMITTED BY INANIMATE OBJECTS IN RUSSIAN. 
The Slavic and East European Journal, 52(4), 561–588. 
https://www.jstor.org/stable/40651272 
Salamon, J., & Bello, J. P. (2017). Deep Convolutional Neural Networks and Data 
Augmentation for Environmental Sound Classification. IEEE Signal Processing Letters, 
24(3), 279–283. https://doi.org/10.1109/LSP.2017.2657381 
Sasmaz, E., & Tek, F. B. (2018). Animal Sound Classification Using A Convolutional Neural 
Network. UBMK 2018 - 3rd International Conference on Computer Science and 
Engineering, 625–629. https://doi.org/10.1109/UBMK.2018.8566449 
Sayad, Y. O., Mousannif, H., & Al Moatassime, H. (2019). Predictive modeling of wildfires: 
A new dataset and machine learning approach. Fire Safety Journal, 104(September 2018), 
130–146. https://doi.org/10.1016/j.firesaf.2019.01.006 
Sengupta, N., Sahidullah, M., & Saha, G. (2016). Lung sound classification using cepstral-
based statistical features. Computers in Biology and Medicine, 75, 118–129. 
https://doi.org/10.1016/j.compbiomed.2016.05.013 
Sermet, Y., & Demir, I. (2018). An intelligent system on knowledge generation and 
communication about flooding. Environmental Modelling and Software, 108(August 
2017), 51–60. https://doi.org/10.1016/j.envsoft.2018.06.003 
 120  
University of Ghana http://ugspace.ug.edu.gh
Shamir, L., Yerby, C., Simpson, R., von Benda-Beckmann, A. M., Tyack, P., Samarra, F., 
Miller, P., & Wallin, J. (2014). Classification of large acoustic datasets using machine 
learning and crowdsourcing: Application to whale calls. The Journal of the Acoustical 
Society of America, 135(2), 953–962. https://doi.org/10.1121/1.4861348 
Simmonds, J., & MacLennan, D. (2005). Underwater Sound. Fisheries Acoustics: Theory and 
Practice, 1945, 20–69. 
Singhvi, A., Saget, B., & Lee, J. (2018). What Went Wrong With Indonesia’s Tsunami Early 
Warning System. The New York Times. 
Soule, B. (2014). Post-crisis analysis of an ineffective tsunami alert: The 2010 earthquake in 
Maule, Chile. Disasters, 38. https://doi.org/10.1111/disa.12045 
Stojanovic, M., & Beaujean, P. P. J. (2016). Acoustic communication. Springer Handbook of 
Ocean Engineering, 359–386. https://doi.org/10.1007/978-3-319-16649-0_15 
Su, Y., Zhang, K., Wang, J., & Madani, K. (2019). Environment sound classification using a 
two-stream CNN based on decision-level fusion. Sensors (Switzerland), 19(7), 1–15. 
https://doi.org/10.3390/s19071733 
Tan, L. N., Alwan, A., Kossan, G., Cody, M. L., & Taylor, C. E. (2015). Dynamic time warping 
and sparse representation classification for birdsong phrase classification using limited 
training data a ). The Journal of the Acoustical Society of America, 137(3). 
https://doi.org/10.1121/1.4906168 
Tarasconi, F., Farina, M., Mazzei, A., & Bosca, A. (2017). The role of unstructured data in 
real-time disaster-related social media monitoring. Proceedings - 2017 IEEE 
International Conference on Big Data, Big Data 2017, 2018-Janua, 3769–3778. 
https://doi.org/10.1109/BigData.2017.8258377 
Temko, A., & Nadeu, C. (2009). Acoustic Event Detection and Classification. In Computers 
in the Human Interaction Loop (Issue December). https://doi.org/10.1007/978-1-84882-
 121  
University of Ghana http://ugspace.ug.edu.gh
054-8_7 
Thakur, A., Thapar, D., Rajan, P., & Nigam, A. (2019). Deep metric learning for bioacoustic 
classification: Overcoming training data scarcity using dynamic triplet loss. The Journal 
of the Acoustical Society of America, 146(1), 534–547. https://doi.org/10.1121/1.5118245 
Tierney, K. J. (2019). Businesses and Disasters: Vulnerability, Impact, and Recovery. In H. 
Rodríguez, E. L. Quarantelli, & R. R. Dynes (Eds.), Handbook of Disaster Research (pp. 
275–296). Springer. https://doi.org/10.1093/oxfordhb/9780190274481.013.35 
Tobergte, D. R., & Curtis, S. (2013). Environmental Health in Emergencies. Journal of 
Chemical Information and Modeling, 53(9), 1689–1699. 
https://doi.org/10.1017/CBO9781107415324.004 
Tranfield, D., Denyer, D., & Smart, P. (2003). Towards a Methodology for Developing 
Evidence-Informed Management Knowledge by Means of Systematic Review. British 
Journal of Management, 14(3), 207–222. https://doi.org/10.1111/1467-8551.00375 
Turner, C., & Joseph, A. (2015). A Wavelet Packet and Mel-Frequency Cepstral Coefficients-
Based Feature Extraction Method for Speaker Identification. Procedia Computer Science, 
61, 416–421. https://doi.org/10.1016/j.procs.2015.09.177 
UNDP. (2012). Disaster Risk Reduction and Recovery. United Nations Development 
Programme (UNDP) FAST FACTS. www.undp.org/cpr 
Vallimeena, P., Nair, B. B., & Rao, S. N. (2018). Machine Vision Based Flood Depth 
Estimation Using Crowdsourced Images of Humans. 2018 IEEE International 
Conference on Computational Intelligence and Computing Research, ICCIC 2018, 1–4. 
https://doi.org/10.1109/ICCIC.2018.8782363 
Van der Merwe, A., Gerber, A., & Smuts, H. (2020). Guidelines for Conducting Design 
Science Research in Information Systems (pp. 163–178). https://doi.org/10.1007/978-3-
030-35629-3_11 
 122  
University of Ghana http://ugspace.ug.edu.gh
Van Dulmen, S., Sluijs, E., Van Dijk, L., De Ridder, D., Heerdink, R., & Bensing, J. (2007). 
Patient adherence to medical treatment: A review of reviews. BMC Health Services 
Research, 7, 1–13. https://doi.org/10.1186/1472-6963-7-55 
Verma, D., Jana, A., & Ramamritham, K. (2019). Classification and mapping of sound sources 
in local urban streets through AudioSet data and Bayesian optimized Neural Networks. 
Noise Mapping, 6(1), 52–71. https://doi.org/10.1515/noise-2019-0005 
Vrbancic, G., & Podgorelec, V. (2018). Automatic classification of motor impairment neural 
disorders from EEG signals using deep convolutional neural networks. Elektronika Ir 
Elektrotechnika, 24(4), 1–7. https://doi.org/10.5755/j01.eie.24.4.21469 
Wallemacq, P. (2015). The Human Cost of Natural disasters. 
Wang, T., & Nanda, S. (2012). Feature Extraction Methods & Application. GE Global 
Research and GE Power & Water, 41. 
http://www.gis.usu.edu/~doug/RS5750/PastProj/FA2002/KelliTaylor.pdf 
Wang, Y., & Peng, H. (2018). Underwater acoustic source localization using generalized 
regression neural network. The Journal of the Acoustical Society of America, 143(4), 
2321–2331. https://doi.org/10.1121/1.5032311 
Wason, R. (2018). Deep learning: Evolution and expansion. Cognitive Systems Research, 52, 
701–708. https://doi.org/10.1016/j.cogsys.2018.08.023 
Weng, C. G., & Poon, J. (2008). A new evaluation measure for imbalanced datasets. 
Conferences in Research and Practice in Information Technology Series, 87, 27–32. 
Wieland, M., Liu, W., & Yamazaki, F. (2016). Learning change from Synthetic Aperture Radar 
images: Performance evaluation of a Support Vector Machine to detect earthquake and 
tsunami-induced changes. Remote Sensing, 8(10). https://doi.org/10.3390/rs8100792 
Wilson, J. D., & Makris, N. C. (2006). Ocean acoustic hurricane classification. The Journal of 
the Acoustical Society of America, 119(1), 168–181. https://doi.org/10.1121/1.2130961 
 123  
University of Ghana http://ugspace.ug.edu.gh
Winter, R. (2008). Design science research in Europe. European Journal of Information 
Systems, 17(5), 470–475. https://doi.org/10.1057/ejis.2008.44 
Wisner, B., & Adams, J. (2002). Environmental health in emergencies and disasters. In World 
Health Organization (Vol. 62, Issue 5). https://doi.org/10.1007/s00393-003-0515-x 
Wren, Y., Harding, S., Goldbart, J., & Roulstone, S. (2018). A systematic review and 
classification of interventions for speech-sound disorder in preschool children. 
International Journal of Language and Communication Disorders, 53(3), 446–467. 
https://doi.org/10.1111/1460-6984.12371 
Wu, J., Chua, Y., Zhang, M., Li, H., & Tan, K. C. (2018). A spiking neural network framework 
for robust sound classification. Frontiers in Neuroscience, 12(NOV), 1–17. 
https://doi.org/10.3389/fnins.2018.00836 
Wyse, L. (2017). Audio Spectrogram Representations for Processing with Convolutional 
Neural Networks. Proceedings of the First International Workshop on Deep Learning and 
Music Joint with IJCNN, 1(1), 37–41. http://arxiv.org/abs/1706.09559 
Yang, J., Wang, Y. X., Qiao, Y. Y., Zhao, X. X., Liu, F., & Cheng, G. (2015). On Evaluating 
Multi-class Network Traffic Classifiers Based on AUC. Wireless Personal 
Communications, 83(3), 1731–1750. https://doi.org/10.1007/s11277-015-2473-4 
Yaseen, Son, G. Y., & Kwon, S. (2018). Classification of heart sound signal using multiple 
features. Applied Sciences (Switzerland), 8(12). https://doi.org/10.3390/app8122344 
Zhang, L., Wang, D., Bao, C., Wang, Y., & Xu, K. (2019). Large-scale whale-call classification 
by transfer learning on multi-scale waveforms and time-frequency features. Applied 
Sciences (Switzerland), 9(5), 1–11. https://doi.org/10.3390/app9051020 
Zhang, Y.-J., Huang, J.-F., Gong, N., Ling, Z.-H., & Hu, Y. (2018). Automatic detection and 
classification of marmoset vocalizations using deep and recurrent neural networks. The 
Journal of the Acoustical Society of America, 144(1), 478–487. 
 124  
University of Ghana http://ugspace.ug.edu.gh
https://doi.org/10.1121/1.5047743 
Zhang, Ya-jie, Huang, J., Gong, N., Ling, Z., & Hu, Y. (2019). Automatic detection and 
classification of marmoset vocalizations using deep and recurrent neural networks. The 
Journal of the Acoustical Society of America, 478(2018). 
https://doi.org/10.1121/1.5047743 
Zhang, Yan, Lv, D., & Zhao, Y. (2016). Multiple-view active learning for environmental sound 
classification. International Journal of Online Engineering, 12(12), 49–54. 
https://doi.org/10.3991/ijoe.v12i12.6458 
 
  
 125  
University of Ghana http://ugspace.ug.edu.gh
APPENDIX A: PRIMARY STUDIES USED FOR THE SYSTEMATIC 
REVIEW 
REF NO  BIBLIOGRAPHY
A1.  Shamir, L. Yerby, C. Simpson, R. von Benda-Beckmann, A. M. Tyack, P. Samarra, 
F. Miller, P. & Wallin, J. (2014). Classification of large acoustic datasets using 
machine learning and crowdsourcing: Application to whale calls. The Journal 
of the Acoustical Society of America, 135(2), 953–962. 
https://doi.org/10.1121/1.4861348 
A2.  Qian, K. Zhang, Z. Baird, A. & Schuller, B. (2017). Active learning for bird sound 
classification via a kernel-based extreme learning machine. The Journal of the 
Acoustical Society of America, 142(4), 1796–1804. 
https://doi.org/10.1121/1.5004570 
A3.  Malfante, M. Mars, J. I. Dalla Mura, M. & Gervaise, C. (2018). Automatic fish 
sounds classification. The Journal of the Acoustical Society of America, 143(5), 
2834–2846. https://doi.org/10.1121/1.5036628 
A4.  Halkias, X. C. Paris, S. & Glotin, H. (2013). Classification of mysticete sounds using 
machine learning techniques. The Journal of the Acoustical Society of America, 
134(5), 3496–3505. https://doi.org/10.1121/1.4821203 
A5.  Thakur, A. Thapar, D. Rajan, P. & Nigam, A. (2019). Deep metric learning for 
bioacoustic classification: Overcoming training data scarcity using dynamic 
triplet loss. The Journal of the Acoustical Society of America, 146(1), 534–547. 
https://doi.org/10.1121/1.5118245 
A6.  Cvengros, R. M. Valente, D. Nykaza, E. T. & Vipperman, J. S. (2017). Blast noise
classification with common sound level meter metrics. 822(2012).
https://doi.org/10.1121/1.4730921
A7.  Briggs, F. Lakshminarayanan, B. Neal, L. Fern, X. Z. Raich, R. Hadley, S. J. K. 
Hadley, A. S. & Betts, M. G. (2012). Acoustic classification of multiple 
simultaneous bird species: A multi-instance multi-label approach. The Journal 
of the Acoustical Society of America, 131(6), 4640–4650. 
https://doi.org/10.1121/1.4707424 
A8.  Robakis, E. Watsa, M. & Erkenswick, G. (2018). Classification of producer 
characteristics in primate long calls using neural networks. The Journal of the 
Acoustical Society of America, 144(1), 344–353. 
https://doi.org/10.1121/1.5046526 
A9.  Ibrahim, A. K. Chérubin, L. M. Zhuang, H. Schärer Umpierre, M. T. Dalgleish, F. 
Erdol, N. Ouyang, B. & Dalgleish, A. (2018). An approach for automatic 
classification of grouper vocalizations with passive acoustic monitoring. The 
Journal of the Acoustical Society of America, 143(2), 666–676. 
https://doi.org/10.1121/1.5022281 
A10. Zhang, Y.-J. Huang, J.-F. Gong, N. Ling, Z.-H. & Hu, Y. (2018). Automatic
detection and classification of marmoset vocalizations using deep and recurrent 
neural networks. The Journal of the Acoustical Society of America, 144(1), 
478–487. https://doi.org/10.1121/1.5047743 
A11. Oikarinen, T. Srinivasan, K. Meisner, O. Hyman, J. B. Parmar, S. Fanucci-Kiss, A.
Desimone, R. Landman, R. & Feng, G. (2019). Deep convolutional network 
for animal sound classification and source attribution using dual audio 
recordings. The Journal of the Acoustical Society of America, 145(2), 654–662. 
https://doi.org/10.1121/1.5087827 
A12. Ibrahim, A. K. Zhuang, H. Chérubin, L. M. Umpierre, M. T. S. Ali, A. M. Richard,
S. Sch, M. T. Ali, A. M. Nemeth, R. S. & Erdol, N. (2019). Classification of 
red hind grouper call types using random ensemble of stacked autoencoders. 
2155. https://doi.org/10.1121/1.5126861 
A13. Guilment, T. Socheleau, F.-X. Pastor, D. & Vallez, S. (2018). Sparse
representation-based classification of mysticete calls. The Journal of the Acoustical Society of
America, 144(3), 1550–1563. https://doi.org/10.1121/1.5055209 
A14. Kaewtip, K. Alwan, A. O’Reilly, C. & Taylor, C. E. (2016). A robust automatic
birdsong phrase classification: A template-based approach. The Journal of the 
Acoustical Society of America, 140(5), 3691–3701. 
https://doi.org/10.1121/1.4966592 
A15. Binder, C. & Paul, H. (2019). Range-dependent impacts of ocean acoustic
propagation on automated classification of transmitted bowhead and 
humpback whale vocalizations. 2480. https://doi.org/10.1121/1.5097593 
A16. Roch, M. A. Newport, D. Baumann-Pickering, S. Mellinger, D. K. Qui, S. Soldevilla,
M. S. & Hildebrand, J. A. (2011). Classification of echolocation clicks from 
odontocetes in the Southern California Bight. The Journal of the Acoustical 
Society of America, 129(January), 467–476. https://doi.org/10.1121/1.3514383 
A17. Allen, J. A. Murray, A. Noad, M. J. Dunlop, R. A. & Garland, E. C. (2017). Using
self-organizing maps to classify humpback whale song units and quantify their 
similarity. The Journal of the Acoustical Society of America, 142(4), 1943–
1952. https://doi.org/10.1121/1.4982040 
A18. Tan, L. N. Alwan, A. Kossan, G. Cody, M. L. & Taylor, C. E. (2015). Dynamic time
warping and sparse representation classification for birdsong phrase
classification using limited training data. 137(3).
https://doi.org/10.1121/1.4906168 
A19. Ou, H. Au, W. Zurk, L. & Lammers, M. (2013). Automated extraction and
classification of time-frequency contours in humpback vocalizations. 
133(January). 
A20. LeBien, J. G. & Ioup, J. W. (2018). Species-level classification of beaked whale
echolocation signals detected in the northern Gulf of Mexico. The Journal of the 
Acoustical Society of America, 144(1), 387–396. https://doi.org/10.1121/1.5047435 
A21. Giret, N. Roy, P. Albert, A. Pachet, F. Kreutzer, M. & Bovet, D. (2011). Finding
good acoustic features for parrot vocalizations: The feature generation approach. The 
Journal of the Acoustical Society of America, 129(2), 1089–1099. 
A22. Peso Parada, P. & Cardenal-López, A. (2014). Using Gaussian mixture models to
detect and classify dolphin whistles and pulses. The Journal of the Acoustical 
Society of America, 135(6), 3371–3380. https://doi.org/10.1121/1.4876439 
A23. Gingras, B. & Fitch, W. T. (2013). A three-parameter model for classifying anurans
into four genera based on advertisement calls. 133(October 2012), 547–559. 
A24. Aucouturier, J.-J. Nonaka, Y. Katahira, K. & Okanoya, K. (2011). Segmentation of
expiratory and inspiratory sounds in baby cry audio recordings using hidden 
Markov models. The Journal of the Acoustical Society of America, 130(5), 
2969–2977. https://doi.org/10.1121/1.3641377 
A25. Bishop, J. C. Falzon, G. Trotter, M. Kwan, P. & Meek, P. D. (2019). Livestock
vocalisation classification in farm soundscapes. Computers and Electronics in 
Agriculture, 162(April), 531–542. 
https://doi.org/10.1016/j.compag.2019.04.020 
A26. Aziz, S. Awais, M. Akram, T. Khan, U. Alhussein, M. & Aurangzeb, K. (2019).
Automatic scene recognition through acoustic classification for behavioral 
robotics. Electronics (Switzerland), 8(5). 
https://doi.org/10.3390/electronics8050483 
A27. Chen, H. Yuan, X. Pei, Z. Li, M. & Li, J. (2019). Triple-Classification of Respiratory
Sounds Using Optimized S-Transform and Deep Residual Networks. IEEE 
Access, 7(April), 32845–32852. 
https://doi.org/10.1109/ACCESS.2019.2903859 
A28. Bourouhou, A. Jilbab, A. Nacir, C. & Hammouch, A. (2019). Heart sounds
classification for a medical diagnostic assistance. International Journal of 
Online and Biomedical Engineering, 15(11), 88–103. 
https://doi.org/10.3991/ijoe.v15i11.10804 
A29. Yaseen, Son, G. Y. & Kwon, S. (2018). Classification of heart sound signal using
multiple features. Applied Sciences (Switzerland), 8(12). 
https://doi.org/10.3390/app8122344 
A30. Pandeya, Y. R. Kim, D. & Lee, J. (2018). Domestic cat sound classification using
learned features from deep neural nets. Applied Sciences (Switzerland), 8(10), 1–17. 
https://doi.org/10.3390/app8101949 
A31. Luque, A. Romero-Lemos, J. Carrasco, A. & Barbancho, J. (2018). Non-sequential
automatic classification of anuran sounds for the estimation of climate-change 
indicators. Expert Systems with Applications, 95, 248–260. 
https://doi.org/10.1016/j.eswa.2017.11.016 
A32. Kim, Y. Sa, J. Chung, Y. Park, D. & Lee, S. (2018). Resource-efficient pet dog sound
events classification using LSTM-FCN based on time-series data. Sensors 
(Switzerland), 18(11). https://doi.org/10.3390/s18114019 
A33. Luque, A. Romero-Lemos, J. Carrasco, A. & Gonzalez-Abril, L. (2018).
Temporally-aware algorithms for the classification of anuran sounds. PeerJ, 2018(5), 1–40.
https://doi.org/10.7717/peerj.4732 
A34. Aykanat, M. Kılıç, Ö. Kurt, B. & Saryal, S. (2017). Classification of lung sounds
using convolutional neural networks. Eurasip Journal on Image and Video 
Processing, 2017(1). https://doi.org/10.1186/s13640-017-0213-2 
A35. Zhang, Yan, Lv, D. & Zhao, Y. (2016). Multiple-view active learning for
environmental sound classification. International Journal of Online 
Engineering, 12(12), 49–54. https://doi.org/10.3991/ijoe.v12i12.6458 
A36. Han, W. Coutinho, E. Ruan, H. Li, H. Schuller, B. Yu, X. & Zhu, X. (2016).
Semi-supervised active learning for sound classification in hybrid learning environments.
PLoS ONE, 11(9), 1–19. https://doi.org/10.1371/journal.pone.0162075 
A37. Noda, J. J. Travieso, C. M. & Sánchez-Rodríguez, D. (2016). Automatic taxonomic
classification of fish based on their acoustic signals. Applied Sciences 
(Switzerland), 6(12). https://doi.org/10.3390/app6120443 
A38. Raza, A. Mehmood, A. Ullah, S. Ahmad, M. Choi, G. S. & On, B. W. (2019).
Heartbeat sound signal classification using deep learning. Sensors 
(Switzerland), 19(21), 1–15. https://doi.org/10.3390/s19214819 
A39. Su, Y. Zhang, K. Wang, J. & Madani, K. (2019). Environment sound classification
using a two-stream CNN based on decision-level fusion. Sensors 
(Switzerland), 19(7), 1–15. https://doi.org/10.3390/s19071733 
A40. Khamparia, A. Gupta, D. Nguyen, N. G. Khanna, A. Pandey, B. & Tiwari, P. (2019).
Sound classification using convolutional neural network and tensor deep 
stacking network. IEEE Access, 7(January), 7717–7727. 
https://doi.org/10.1109/ACCESS.2018.2888882 
A41. Bold, N. Zhang, C. & Akashi, T. (2019). Cross-domain deep feature combination for
bird species classification with audio-visual data. IEICE Transactions on 
Information and Systems, E102D(10), 2033–2042. 
https://doi.org/10.1587/transinf.2018EDP7383 
A42. Verma, D. Jana, A. & Ramamritham, K. (2019). Classification and mapping of sound
sources in local urban streets through AudioSet data and Bayesian optimized 
Neural Networks. Noise Mapping, 6(1), 52–71. https://doi.org/10.1515/noise-
2019-0005 
A43. Wu, J. Chua, Y. Zhang, M. Li, H. & Tan, K. C. (2018). A spiking neural network
framework for robust sound classification. Frontiers in Neuroscience, 
12(NOV), 1–17. https://doi.org/10.3389/fnins.2018.00836 
A44. Pandeya, Y. R. & Lee, J. (2018). Domestic cat sound classification using transfer
learning. International Journal of Fuzzy Logic and Intelligent Systems, 18(2), 
154–160. https://doi.org/10.5391/IJFIS.2018.18.2.154 
A45. Vrbancic, G. & Podgorelec, V. (2018). Automatic classification of motor impairment
neural disorders from EEG signals using deep convolutional neural networks. 
Elektronika Ir Elektrotechnika, 24(4), 1–7. 
https://doi.org/10.5755/j01.eie.24.4.21469 
A46. Salamon, J. & Bello, J. P. (2017). Deep Convolutional Neural Networks and Data
Augmentation for Environmental Sound Classification. IEEE Signal 
Processing Letters, 24(3), 279–283. 
https://doi.org/10.1109/LSP.2017.2657381 
A47. Oweis, R. J. Abdulhay, E. W. Khayal, A. & Awad, A. (2015). An alternative
respiratory sounds classification system utilizing artificial neural networks. 
Biomedical Journal, 38(2), 153–161. https://doi.org/10.4103/2319-
4170.137773 
A48. Fang, S. H. Wang, C. Te, Chen, J. Y. Tsao, Y. & Lin, F. C. (2019). Combining
acoustic signals and medical records to improve pathological voice 
classification. APSIPA Transactions on Signal and Information Processing, 
8(2019), 1–11. https://doi.org/10.1017/ATSIP.2019.7 
 
  
APPENDIX B: PYTHON CODES FOR LOADING THE DATA 
import os 
from tqdm import tqdm 
import pandas as pd 
import numpy as np 
from scipy.io import wavfile 
from python_speech_features import mfcc, logfbank  
from matplotlib import pyplot as plt  
from path import Path  
import librosa 
import librosa.display 
#waveforms 
def plot_signals(signals):
    fig, axes = plt.subplots(nrows=2, ncols=5, sharex=False, sharey=True, figsize=(20,5))
    fig.suptitle('Time Series', size=16)
    i = 0
    try:
        for x in range(2):
            for y in range(5):
                axes[x,y].set_title(list(signals.keys())[i])
                axes[x,y].plot(list(signals.values())[i])
                axes[x,y].get_xaxis().set_visible(False)
                axes[x,y].get_yaxis().set_visible(False)
                i += 1
    except IndexError:
        pass
#fft         
def plot_fft(fft):
    fig, axes = plt.subplots(nrows=2, ncols=5, sharex=False, sharey=True, figsize=(20,5))
    fig.suptitle('Fourier Transforms', size=16)
    i = 0
    try:
        for x in range(2):
            for y in range(5):
                data = list(fft.values())[i]
                Y, freq = data[0], data[1]
                axes[x,y].set_title(list(fft.keys())[i])
                axes[x,y].plot(freq, Y)
                axes[x,y].get_xaxis().set_visible(False)
                axes[x,y].get_yaxis().set_visible(False)
                i += 1
    except IndexError:
        pass
#fbc       
def plot_fbank(fbank):
    fig, axes = plt.subplots(nrows=2, ncols=5, sharex=False, sharey=True, figsize=(20,5))
    fig.suptitle('Filter Bank Coefficients', size=16)
    i = 0
    try:
        for x in range(2):
            for y in range(5):
                axes[x,y].set_title(list(fbank.keys())[i])
                axes[x,y].imshow(list(fbank.values())[i], cmap='hot', interpolation='nearest')
                axes[x,y].get_xaxis().set_visible(False)
                axes[x,y].get_yaxis().set_visible(False)
                i += 1
    except IndexError:
        pass
#mfcc         
def plot_mfccs(mfcc):
    fig, axes = plt.subplots(nrows=2, ncols=5, sharex=False, sharey=True, figsize=(20,5))
    fig.suptitle('Mel Frequency Cepstrum Coefficients', size=16)
    i = 0
    try:
        for x in range(2):
            for y in range(5):
                axes[x,y].set_title(list(mfcc.keys())[i])
                axes[x,y].imshow(list(mfcc.values())[i], cmap='hot', interpolation='nearest')
                axes[x,y].get_xaxis().set_visible(False)
                axes[x,y].get_yaxis().set_visible(False)
                i += 1
    except IndexError:
        pass
def calc_fft(y, rate):  
    n = len(y)  
    freq = np.fft.rfftfreq(n, d=1/rate)  
    Y = abs(np.fft.rfft(y)/n)  
    return (Y, freq)  
#define envelope 
def envelope(y, rate, threshold):
    #keep only the samples whose rolling mean absolute amplitude (0.1 s window) exceeds the threshold
    mask = []
    y = pd.Series(y).apply(np.abs)
    y_mean = y.rolling(window=int(rate/10), min_periods=1, center=True).mean()
    for mean in y_mean:
        if mean > threshold:
            mask.append(True)
        else:
            mask.append(False)
    return mask
#load data  
sounds = pd.read_csv('sounds.csv')      
sounds.set_index('filename', inplace=True)  
for f in sounds.index:
    rate, signal = wavfile.read('dataset/'+f)
    sounds.at[f, 'length'] = signal.shape[0]/rate
classes = list(np.unique(sounds.label))
class_dist = sounds.groupby(['label'])['length'].mean()
fig, ax = plt.subplots()
ax.set_title('Class Distribution', y=1.10)
ax.pie(class_dist, labels=class_dist.index, autopct='%1.1f%%', shadow=False,
       startangle=90)
ax.axis('equal')  
plt.show()  
sounds.reset_index(inplace=True) 
signals ={} 
fft = {} 
fbank = {} 
mfccs = {} 
for c in classes: 
    wav_file = sounds[sounds.label == c].iloc[0,0]  
    signal, rate = librosa.load('dataset/'+wav_file, sr=44100)  
    mask = envelope(signal, rate, 0.0005) 
    signal = signal[mask] 
    signals[c] = signal 
    fft[c] =  calc_fft(signal, rate)  
    bank = logfbank(signal[:rate], rate, nfilt=26, nfft=1103).T  
    fbank[c] = bank 
    mel = mfcc(signal[:rate], rate, numcep=13, nfilt=26, nfft=1103).T  
    mfccs[c] = mel 
plot_signals(signals)  
plt.show()  
plot_fft(fft) 
plt.show()  
plot_fbank(fbank)  
plt.show()  
plot_mfccs(mfccs) 
plt.show()  
#audio downsampling  
if len(os.listdir('clean')) == 0:
    for f in tqdm(sounds.filename):
        signal, rate = librosa.load('dataset/'+f, sr=16000)
        mask = envelope(signal, rate, 0.0005)
        wavfile.write(filename='clean/'+f, rate=rate, data=signal[mask])
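The envelope function in this appendix removes low-energy stretches by thresholding a rolling mean of the absolute amplitude. The following is a minimal sketch of that idea on a short synthetic signal, using only the standard library. The function name rolling_envelope_mask, the window size, the threshold, and the sample values are hypothetical and chosen for illustration; the edge handling only approximates that of the pandas rolling window.

```python
def rolling_envelope_mask(samples, window, threshold):
    """Return True for samples whose centred rolling mean of |x| exceeds threshold."""
    magnitudes = [abs(s) for s in samples]
    half = window // 2
    mask = []
    for i in range(len(magnitudes)):
        # Clip the window at the signal edges, similar in spirit to min_periods=1
        lo = max(0, i - half)
        hi = min(len(magnitudes), i + half + 1)
        mean = sum(magnitudes[lo:hi]) / (hi - lo)
        mask.append(mean > threshold)
    return mask

# Hypothetical signal: near-silence, a short burst, then near-silence again
signal = [0.0001, -0.0002, 0.0001, 0.5, -0.6, 0.4, 0.0001, -0.0001, 0.0002]
mask = rolling_envelope_mask(signal, window=3, threshold=0.05)
cleaned = [s for s, keep in zip(signal, mask) if keep]
print(mask)
print(cleaned)
```

Samples adjacent to the burst are retained because the averaging window overlaps it, which is why the cleaned signal keeps a little context around each loud event.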
APPENDIX C: PYTHON CODES FOR MODEL PREPARATION/PREDICTION 
from tqdm import tqdm 
import pandas as pd 
import numpy as np 
from scipy.io import wavfile 
from python_speech_features import mfcc 
from matplotlib import pyplot as plt 
from keras.models import load_model 
from keras.utils import to_categorical 
from keras.utils.vis_utils import plot_model 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import (accuracy_score, recall_score, precision_score, f1_score,
                             confusion_matrix, roc_auc_score, roc_curve)
#defining a function to plot the ROC curves 
def plot_roc_curve(fpr, tpr): 
    plt.plot(fpr, tpr, color='blue', label='ROC') 
    plt.plot([0,1], [0,1], color='orange', linestyle='--') 
    plt.xlabel('False Positive Rate') 
    plt.ylabel('True Positive Rate') 
    plt.title('Receiver Operating Characteristics (ROC) Curve') 
    plt.legend() 
    plt.show() 
#build the feature matrix by sampling random 0.1 s windows from the cleaned audio and extracting their MFCCs
def build_rand_feat(): 
    X = [] 
    y = [] 
    _min, _max = float('inf'), -float('inf') 
    for _ in tqdm(range(n_samples)): 
        rand_class = np.random.choice(class_dist.index, p=prob_dist) 
        file = np.random.choice(sounds[sounds.label==rand_class].index) 
        rate, wav =  wavfile.read('clean/'+file) 
        label = sounds.at[file, 'label'] 
        rand_index = np.random.randint(0, wav.shape[0]-config.step) 
        sample = wav[rand_index:rand_index+config.step] 
        X_sample = mfcc(sample, rate, 
                        numcep=config.nfeat, nfilt=config.nfilt, nfft=config.nfft).T 
        _min = min(np.amin(X_sample), _min) 
        _max = max(np.amax(X_sample), _max) 
        X.append(X_sample if config.mode == 'conv' else X_sample.T) 
        y.append(classes.index(label)) 
    X, y = np.array(X), np.array(y) 
    X = (X - _min) / (_max - _min) 
    if config.mode == 'conv': 
        X = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1) 
    elif config.mode == 'time': 
        X = X.reshape(X.shape[0], X.shape[1], X.shape[2]) 
    y = to_categorical(y, num_classes=5) 
    return X, y 
class Config: 
    def __init__(self, mode='conv', nfilt=26, nfeat=13, nfft=512, rate=16000): 
        self.mode = mode 
        self.nfilt = nfilt 
        self.nfeat = nfeat 
        self.nfft = nfft 
        self.rate = rate 
        self.step = int(rate/10) 
#load data 
sounds = pd.read_csv('sounds.csv')      
sounds.set_index('filename', inplace=True) 
for f in sounds.index: 
    rate, signal = wavfile.read('clean/'+f) 
    sounds.at[f, 'length'] = signal.shape[0]/rate 
classes = list(np.unique(sounds.label)) 
class_dist = sounds.groupby(['label'])['length'].mean() 
#creating a class balance by extracting 100ms(0.1s) from each audio recording 
n_samples = 2 * int(sounds['length'].sum()/0.1) 
prob_dist = class_dist / class_dist.sum() 
choices = np.random.choice(class_dist.index, p=prob_dist) 
fig, ax = plt.subplots() 
ax.set_title('Class Distribution', y=1.10) 
ax.pie(class_dist, labels=class_dist.index, autopct='%1.1f%%', shadow=False, startangle=90) 
ax.axis('equal') 
plt.show() 
config = Config(mode='conv') 
if config.mode == 'conv': 
    X, y = build_rand_feat() 
    y_flat = np.argmax(y, axis=1) 
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 0) 
    input_shape = (X.shape[1], X.shape[2], 1) 
elif config.mode == 'time': 
    X, y = build_rand_feat() 
    y_flat = np.argmax(y, axis=1) 
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 0) 
    input_shape = (X.shape[1], X.shape[2]) 
if config.mode == 'conv': 
    model = load_model('conv_model.h5') 
    model.summary() 
    plot_model(model, to_file='conv_model.png', show_shapes=True, show_layer_names=True) 
elif config.mode == 'time': 
    model = load_model('rnn_model.h5') 
    model.summary() 
    plot_model(model, to_file='RNN_model.png', show_shapes=True, show_layer_names=True) 
#plot the micro-averaged ROC curve for the model
y_pred = model.predict(X_val)
auc = roc_auc_score(y_val, y_pred, average='micro')
print(auc)
#roc_curve expects binary labels, so flatten the one-hot labels and the predicted scores
fpr, tpr, thresholds = roc_curve(y_val.ravel(), y_pred.ravel())
plot_roc_curve(fpr, tpr)
y_pred = y_pred.argmax(axis=1)
y_val = y_val.argmax(axis=1)
#Evaluation Measures
print(accuracy_score(y_val, y_pred))
print(recall_score(y_val, y_pred, average='micro'))
print(precision_score(y_val, y_pred, average='micro'))
#store results under new names so the imported metric functions are not overwritten
f1 = f1_score(y_val, y_pred, average='micro')
print(f1)
conf_matrix = confusion_matrix(y_val, y_pred)
print(conf_matrix)
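The build_rand_feat function in this appendix draws a class according to prob_dist, picks a recording of that class, and cuts a random window of config.step samples from it. The following is a minimal sketch of that sampling step on toy data, using only the standard library. The dictionary recordings, the class names, the probabilities, and the helper sample_window are hypothetical stand-ins for the cleaned WAV files and are not part of the thesis code.

```python
import random

def sample_window(recordings, class_probs, step, rng):
    """Pick a class by probability, then a random fixed-length window from one of its recordings."""
    labels = list(class_probs)
    label = rng.choices(labels, weights=[class_probs[c] for c in labels], k=1)[0]
    wav = rng.choice(recordings[label])
    # The last valid start index keeps the whole window inside the recording
    start = rng.randint(0, len(wav) - step)
    return label, wav[start:start + step]

# Hypothetical toy "recordings": lists of samples keyed by class label
recordings = {
    'class_a': [list(range(50)), list(range(100, 180))],
    'class_b': [list(range(200, 260))],
}
class_probs = {'class_a': 0.5, 'class_b': 0.5}
rng = random.Random(0)
windows = [sample_window(recordings, class_probs, step=16, rng=rng) for _ in range(20)]
print(windows[0])
```

Because every window has the same length, the resulting features can be stacked into a single array regardless of how long the source recordings are.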
APPENDIX D: PYTHON CODES FOR 10-FOLD MODEL VALIDATION 
# KFold Cross Validation approach
from sklearn.model_selection import KFold
kf = KFold(n_splits=10,shuffle=True)
# Initialize the accuracy of the models to blank list. The accuracy of each model will be appended to
# this list
accuracy_model = [] 
# Iterate over each train-test split 
for train_index, test_index in kf.split(X): 
    # Split train-test 
    X_train, X_test = X[train_index], X[test_index] 
    y_train, y_test = y[train_index], y[test_index] 
    # Train the model (class_weight is assumed to be defined in the training script, which is not shown here)
    model.fit(X_train, y_train, epochs=10, batch_size=32,
              shuffle=True,
              class_weight=class_weight)
    # Append to accuracy_model the accuracy of the model
    accuracy_model.append(accuracy_score(y_test.argmax(axis=1),
                                         model.predict(X_test).argmax(axis=1), normalize=True)*100)
if config.mode == 'conv': 
    model.save('conv_model.h5')     
elif config.mode == 'time': 
    model.save('rnn_model.h5') 
y_pred = model.predict(X_val) 
y_pred = y_pred.argmax(axis=1) 
y_val = y_val.argmax(axis=1) 
#Evaluation Measures 
print(accuracy_score(y_val, y_pred)) 
print(recall_score(y_val, y_pred, average='micro')) 
print(precision_score(y_val, y_pred, average='micro')) 
#f1_score = f1_score(y_test, y_pred, average='micro') 
confusion_matrixs = confusion_matrix(y_val, y_pred) 
#print(f1_score) 
print(confusion_matrixs) 
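The 10-fold procedure in this appendix relies on each sample appearing in exactly one test fold. The following is a minimal sketch of such an index partition using only the standard library. The helper kfold_indices and its parameters are hypothetical and written for illustration; it is not scikit-learn's implementation and it demonstrates only the splitting, not the model training.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_indices, test_indices) pairs that partition range(n_samples) into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

folds = list(kfold_indices(n_samples=25, n_splits=10))
for train, test in folds[:3]:
    print(len(train), len(test))
```

Each pair of index lists can then be used to slice the feature and label arrays, as the loop in this appendix does with X[train_index] and X[test_index].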
APPENDIX E: PYTHON CODES FOR AUC-ROC 
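from itertools import cycle
from sklearn.metrics import roc_curve, auc, roc_auc_score
# Compute ROC curve and ROC area for each class
# (y_val holds the one-hot labels; y_prob is assumed to hold the predicted class
# probabilities, e.g. the output of model.predict(X_val))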
fpr = dict() 
tpr = dict() 
roc_auc = dict() 
n_classes = len(classes) 
for i in range(n_classes): 
    fpr[i], tpr[i], _ = roc_curve(y_val[:, i], y_prob[:, i]) 
    roc_auc[i] = auc(fpr[i], tpr[i]) 
# Compute micro-average ROC curve and ROC area 
fpr["micro"], tpr["micro"], _ = roc_curve(y_val.ravel(), y_prob.ravel()) 
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"]) 
# First aggregate all false positive rates 
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)])) 
# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# Finally average it and compute AUC 
mean_tpr /= n_classes 
fpr["macro"] = all_fpr 
tpr["macro"] = mean_tpr 
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"]) 
# Plot all ROC curves 
plt.figure() 
lw=2 
plt.plot(fpr["micro"], tpr["micro"], 
         label='micro-average ROC curve (area = {0:0.2f})' 
               ''.format(roc_auc["micro"]), 
         color='deeppink', linestyle=':', linewidth=4) 
plt.plot(fpr["macro"], tpr["macro"], 
         label='macro-average ROC curve (area = {0:0.2f})' 
               ''.format(roc_auc["macro"]), 
         color='navy', linestyle=':', linewidth=4) 
colors = cycle(['aqua', 'darkorange', 'cornflowerblue']) 
for i, color in zip(range(n_classes), colors): 
    plt.plot(fpr[i], tpr[i], color=color, lw=lw, 
             label='ROC curve of class {0} (area = {1:0.2f})' 
             ''.format(i, roc_auc[i])) 
plt.plot([0, 1], [0, 1], 'k--', lw=lw) 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.05]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate') 
plt.title('AUC-ROC') 
plt.legend(loc="lower right") 
plt.show() 
print(roc_auc_score(y_val, y_prob, average='micro'))
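The listing in this appendix reports per-class, micro-averaged, and macro-averaged AUC values. The following is a minimal sketch of how these three quantities relate, using only the standard library and the rank interpretation of AUC (the probability that a positive example receives a higher score than a negative one). The helper binary_auc and the labels and scores are hypothetical and do not come from the thesis data; the macro value here is the mean of the per-class AUCs, which can differ slightly from the area under an interpolated macro-average curve.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical one-hot labels and predicted probabilities for three classes
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
y_score = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6], [0.4, 0.5, 0.1], [0.1, 0.8, 0.1]]

n_classes = 3
# One-vs-rest AUC for each class, then their unweighted mean (macro-average)
per_class = [binary_auc([row[c] for row in y_true], [row[c] for row in y_score])
             for c in range(n_classes)]
macro_auc = sum(per_class) / n_classes
# Micro-average: pool every (label, score) pair across classes before computing the AUC
micro_auc = binary_auc([v for row in y_true for v in row],
                       [v for row in y_score for v in row])
print(per_class, macro_auc, micro_auc)
```

The macro-average weights every class equally, while the micro-average weights every individual decision equally, so the two can diverge when classes are imbalanced.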
 
 