Expert Systems With Applications 258 (2024) 125133

Contents lists available at ScienceDirect: Expert Systems With Applications. Journal homepage: www.elsevier.com/locate/eswa

Enhancing corporate bankruptcy prediction via a hybrid genetic algorithm and domain adaptation learning architecture

T. Ansah-Narh a,b,c,∗, E.N.N. Nortey b, E. Proven-Adzri a, R. Opoku-Sarkodie d

a Ghana Space Science and Technology Institute, Ghana Atomic Energy Commission, P. O. Box LG 80, Legon-Accra, Ghana
b Department of Statistics and Actuarial Science, University of Ghana, P. O. Box LG 115, Legon-Accra, Ghana
c School of Technology, Ghana Institute of Management and Public Administration, P. O. Box AH 50, Achimota-Accra, Ghana
d Department of Information Technology and Mathematical Sciences, Methodist University Ghana, P. O. Box DC 940, Dansoman-Accra, Ghana

A R T I C L E  I N F O

Dataset link: Taiwanese Bankruptcy Prediction dataset, Polish Companies Bankruptcy data

Keywords: Bankruptcy prediction; Financial ratios; Genetic algorithm; Domain adaptation learning; Data distribution shifts; Bayesian optimisation

A B S T R A C T

In the contemporary business landscape, accurately evaluating a company's financial health is essential for stakeholders to mitigate risks and avert bankruptcy. This study presents an innovative approach to improving business bankruptcy prediction through the hybrid integration of Domain Adaptation Learning (DAL) and Genetic Algorithm (GA) techniques. The hybrid model harnesses DAL to address distributional changes in real-world scenarios and utilises GA's proficiency in feature selection. Six machine learning models are rigorously evaluated against the proposed hybrid model: Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Gradient Boosting (GB), k-Nearest Neighbours (k-NN), and Stacking Ensemble (SE).
Our hybrid model performs well on imbalanced target datasets under the Area Under the Precision–Recall Curve metric: 0.93 (RF), 0.93 (SVM), 0.89 (LR), 0.91 (GB), 0.88 (k-NN), and 0.92 (SE). These findings highlight the model's ability to overcome the limitations of traditional approaches, offering a more reliable predictive framework for stakeholders to make informed decisions and proactively manage financial stability. Future research directions may explore the applicability of this hybrid model across different industries and the integration of additional techniques to further enhance its performance.

1. Introduction

Examining a company's financial performance is an important task, as it plays a pivotal role in determining its strengths and weaknesses. A company's daily transaction records serve as a valuable source of information for decision-making, especially when focusing on scenarios that lead to bankruptcy. When a company experiences financial distress, it undergoes a gradual evolution, initially with limited liquidity and eventually leading to bankruptcy (Fahlevi & Marlinah, 2018). In today's business environment, there has been a marked increase in the number of companies facing financial failure and subsequent liquidation. A relevant example is the financial sector reforms that began in Ghana in 2017, which resulted in the central bank revoking the licenses of 23 universal banks and 388 microfinance and microcredit companies.1 Also, because privately held enterprises frequently lack the trustworthy and open financial statements of publicly audited organisations, recent research by da Silva Mattos and Shasha (2024) has shown how difficult it is to predict insolvency for these types of businesses. Managing these less trustworthy reports presents special difficulties for stakeholders as

∗ Corresponding author at: Ghana Space Science and Technology Institute, Ghana Atomic Energy Commission, P. O. Box LG 80, Legon-Accra, Ghana.
E-mail addresses: theophilus.ansah-narh@gaec.gov.gh (T. Ansah-Narh), ennnortey@ug.edu.gh (E.N.N. Nortey), emmanuel.proven-adzri@gaec.gov.gh (E. Proven-Adzri), rsarkodie@mucg.edu.gh (R. Opoku-Sarkodie).
1 https://www.bog.gov.gh/wp-content/uploads/2019/08/Revocation-of-Licenses-of-SDIs-16.8.19.pdf
Dataset links: https://archive.ics.uci.edu/ml/datasets/Taiwanese+Bankruptcy+Prediction; https://archive.ics.uci.edu/dataset/365/polish+companies+bankruptcy+data
https://doi.org/10.1016/j.eswa.2024.125133
Received 6 March 2024; Received in revised form 31 July 2024; Accepted 15 August 2024; Available online 20 August 2024
0957-4174/© 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

a result. These case studies emphasise how vital it is to carry out in-depth research in order to offer information that will help pertinent stakeholders prevent business defaults. In fact, the Ghana case serves as a compelling illustration of the challenges faced by companies in dynamic economic environments, highlighting the critical need for predictive models that can adapt to evolving industry landscapes. While the data used in this study originates from the Taiwan bankruptcy prediction dataset, it also incorporates the Polish Companies Bankruptcy data and considers the Ghanaian context, highlighting the global issue of corporate bankruptcy and the need for adaptable predictive models across diverse financial environments. To address this urgent research need, it is important to advance our understanding of the complex dynamics that lead to corporate bankruptcy. This study aims to focus on the following primary research question: How can the accuracy and adaptability of bankruptcy prediction models be enhanced to effectively handle distributional changes in real-world scenarios? The recent financial crisis highlights the importance of continually re-evaluating and refining methods to improve predictive power. Given the complexity
of the global business environment, a comprehensive investigation of the factors that influence a company's financial position is necessary. Additionally, the landscape of financial analysis is constantly changing due to technological advances, so it is essential to utilise innovative methodologies. Previous research has mostly concentrated on using structural and statistical approaches to predict insolvency.
The latter, which is the subject of this study, uses traditional machine learning models such as k-nearest neighbours (Chen et al., 2011; Li & Wang, 2017), discriminant analysis (Altman, 1968; Kliestik, Vrbka, & Rowland, 2018), logit models (Chi & Tang, 2006; Li, Lee, Zhou, & Sun, 2011), artificial neural networks (ANN) (Odom & Sharda, 1990; Zhang, Hu, Patuwo, & Indro, 1999), and decision trees (Olson, Delen, & Meng, 2012; Syed Nor, Ismail, & Yap, 2019). For instance, the study by Min, Lee, and Han (2006) discusses how bankruptcy prediction affects bank lending decisions and profitability. In contrast to logistic regression (LR) and neural networks, it emphasises the recent application of support vector machines (SVM) in this field and shows its promising outcomes. The article highlights the growing application of genetic algorithms (GA) in conjunction with other AI methods such as neural networks and case-based reasoning. It does, however, note the paucity of research on combining GA and SVM, in spite of its promise for practical applications. To improve bankruptcy prediction, that study uses GA to optimise two factors simultaneously, feature subset selection and SVM parameter settings, in order to increase SVM performance. These models seek to pinpoint the relevant financial factors that directly impact bankruptcy prediction. The former strategy, on the other hand, entails complex accounting ratio forecasts and a thorough comprehension of the economic subtleties of the organisation being studied.

While the concept of ANN dates back approximately eight decades to McCulloch and Pitts (1943)'s threshold logic-based model designed to emulate the human brain, the modern environment is characterised by the incorporation of high-performance computing systems that are driving AI into the mainstream.
The computational design of ANN is based on interconnected neurons, where each connection facilitates the transmission of signals from one neuron to another. The receiving neuron processes the signal and subsequently transmits the processed information to other interconnected neurons. The organisation of these neurons typically involves layers, with the first layer serving as the input and the last layer as the output. Sandwiched between these are hidden layers, which can be shallow or deep in design. The versatility of ANN architecture allows it to handle data ranging from single to multiple dimensions, making it applicable across a broad spectrum of cases.

The efficiency of AI applications, particularly in bankruptcy prediction, has substantially increased due to the growth of large and diverse datasets, both structured and unstructured. Investigations by Awoyemi, Adetunmbi, and Oluwadare (2017), Kristóf and Virág (2020), Sharma, Banerjee, Tiwari, and Patni (2021), and Tripathi, Edla, Cheruku, and Kuppili (2019) into bankruptcy prediction demonstrate this efficiency. In this domain, the term "prediction" is often used interchangeably with "classification" because the ultimate goal is to determine whether a company will likely face financial distress or bankruptcy. Table 1 provides a comparative analysis of various studies in bankruptcy prediction, highlighting the diversity in methods, datasets, and results.

However, the AI approach to bankruptcy prediction described above relies on assumptions inherent in classical machine learning. One key assumption is that the training and test sets come from the same distribution, meaning a model trained on labelled data is expected to perform effectively on test data. This assumption may not always hold in real-world applications where training and test data can come from different distributions.
Discrepancies can arise from various factors, such as differences in the origins of the training and test sets or an outdated training set due to changes in data patterns over time. In instances where there is a disparity across domain distributions, blindly applying the trained model to a new dataset can lead to a decline in performance. Addressing this challenge falls within the realm of domain adaptation, a subfield within machine learning (Farahani, Voghoei, Rasheed, & Arabnia, 2021; Guan & Liu, 2021; Jiang & Zhai, 2007). The primary objective of domain adaptation is to mitigate issues arising from differing distributions by aligning them, thus enabling the trained model to generalise effectively within the domain of interest. This alignment process is crucial for ensuring the robustness and applicability of the predictive model in real-world scenarios.

In light of the complex issues mentioned, this study attempts to investigate a novel approach to reduce the effects of distributional changes. The principal aim is to facilitate the creation of bankruptcy prediction models with increased flexibility and robustness under dynamic and changing conditions. As a result, this work presents a hybrid model that combines the strengths of Genetic Algorithm (GA) and Domain Adaptation Learning (DAL). This strategy aims to create a more reliable and adaptable predictive model for bankruptcy analysis by combining the best features of both approaches. The GA employed utilises a heuristic search-based scheme to extract relevant financial features from the original dataset and feed them into the proposed DAL pipeline for bankruptcy prediction. Keep in mind that the computational method used by the optimisation solver in GA is based on biological evolution and follows the guidelines of the natural evolution process (Ghanea-Hercock, 2003). This methodology finds applications across diverse fields, showcasing its versatility.
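The evolutionary loop just described, selection, crossover, and mutation over candidate solutions, can be made concrete with a small sketch. Everything below is an illustrative assumption rather than this paper's implementation: synthetic data stand in for financial ratios, a binary mask encodes the selected feature subset, one extra gene indexes a hypothetical `C_GRID` for an SVM-based fitness (echoing the GA-plus-SVM scheme of Min, Lee, and Han (2006) discussed earlier), and truncation selection with one-point crossover keeps the loop short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for a table of financial ratios (10 candidate features).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)
C_GRID = [0.1, 1.0, 10.0]  # hypothetical SVM regularisation grid

def fitness(chrom):
    """Chromosome = 10-bit feature mask + 1 gene indexing C_GRID.
    Fitness = cross-validated accuracy of an SVM on the masked features."""
    mask = chrom[:10].astype(bool)
    if not mask.any():
        return 0.0
    clf = SVC(C=C_GRID[chrom[10]], kernel="rbf")
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def evolve(pop_size=12, gens=5, p_mut=0.1):
    pop = [np.append(rng.integers(0, 2, 10), rng.integers(0, 3))
           for _ in range(pop_size)]
    for _ in range(gens):
        order = np.argsort([fitness(c) for c in pop])[::-1]
        parents = [pop[i] for i in order[:pop_size // 2]]  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.choice(len(parents), size=2, replace=False)
            cut = rng.integers(1, 10)                      # one-point crossover
            child = np.append(parents[a][:cut], parents[b][cut:])
            flip = rng.random(10) < p_mut                  # bit-flip mutation on the mask
            child[:10][flip] = 1 - child[:10][flip]
            children.append(child)
        pop = parents + children
    scores = [fitness(c) for c in pop]
    best = pop[int(np.argmax(scores))]
    return best, max(scores)

best, score = evolve()
selected = np.flatnonzero(best[:10])  # indices of the chosen financial features
```

In a full pipeline, `selected` would then restrict the feature set passed to the downstream predictor; population size, generations, and operators would of course be tuned rather than fixed as here.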
Notably, it has been extensively used in pattern recognition (Alsultanny & Aqel, 2003; Kim, Park, Yang, & Sim, 2006; Maulik & Bandyopadhyay, 2000; Pal & Wang, 1996), route optimisation (Inagaki, Haseyama, & Kitajima, 1999), network intrusion detection (Li, 2004), and image processing (Bhanu, Lee, & Ming, 1995; Saitoh, 1999). On the other hand, DAL is a machine learning paradigm that aims to address the challenges that arise when a model trained in one domain (the source domain²) is deployed to another related domain (the target domain³). The core principle of domain adaptation pipelines is to leverage knowledge gained from the source domain to improve the generalisation and performance of the model in the target domain. This becomes very important when there are distributional changes between training data and deployment data. Therefore, for the purpose of this work, we seek to perform the following tasks:

i. Create a spatial distribution model to visually represent correlations between financial variables and gain a comprehensive understanding of hidden patterns within the original dataset.
ii. Mitigate biases in the source domain dataset through the use of simulation techniques, ensuring the robustness and reliability of the analysis.
iii. Systematically identify and select financial features considered crucial for predicting corporate bankruptcy within the source domain, thereby improving the precision of the subsequent modelling process.
iv. Utilise the financial features selected in Task (iii) within the source domain to apply domain adaptation techniques, ensuring the model's robustness in the face of variations in data distributions. This crucial step enhances the model's applicability to real-world scenarios beyond the training data by addressing potential disparities in data distribution.
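One standard way to realise the source-to-target alignment idea described above is correlation alignment (CORAL), which transforms source features so their second-order statistics match the target domain. The sketch below is a generic illustration on synthetic data, not the pipeline used in this paper.

```python
import numpy as np

def coral(Xs, Xt, eps=1e-6):
    """CORAL: whiten the source features, then re-colour them with the
    target covariance so second-order statistics of both domains match."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def mat_pow(C, p):
        # Matrix power via eigendecomposition (C is symmetric positive definite).
        w, V = np.linalg.eigh(C)
        return (V * w ** p) @ V.T

    aligned = (Xs - Xs.mean(0)) @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)
    return aligned + Xt.mean(0)   # also move the source onto the target mean

rng = np.random.default_rng(2)
Xs = rng.normal(size=(300, 5))                                  # source domain
Xt = rng.normal(size=(300, 5)) @ np.diag([3., 1., 1., 1., 1.])  # shifted target
Xs_aligned = coral(Xs, Xt)
# After alignment, cov(Xs_aligned) matches cov(Xt), so a classifier trained on
# the aligned source data transfers better to the target domain.
```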
We organise the remaining sections of the paper as follows. Section 2 discusses the data acquisition process, emphasising the dataset's relevance, coverage of financial information, and the rigorous data-gathering procedures followed to ensure accuracy and applicability. Section 3 focuses on the classification measures used in the study to evaluate the performance of the proposed hybrid model for bankruptcy prediction. It outlines the metrics and evaluation criteria employed to assess the model's predictive accuracy, adaptability, and generalisation capabilities, and provides insights into how the model's performance is measured and analysed in the context of imbalanced target datasets. Section 4 delves into the implications of the results, providing a detailed analysis of how the hybrid model addresses the challenges of traditional approaches in bankruptcy prediction. The final section (Section 5) summarises the key findings and contributions of the study, emphasising the significance of the hybrid model in enhancing corporate bankruptcy prediction. It discusses the implications of the research for financial risk management and decision-making, highlighting the potential impact of the hybrid model on stakeholders in the business landscape. The conclusion also outlines future research directions and areas for further exploration to enhance the performance and usefulness of the hybrid model in different industries and scenarios.

2 The source domain refers to the domain from which the training data is obtained to build a machine learning model.
3 The target domain is the domain where the model's performance is evaluated. The target data is mostly imbalanced.

Table 1
Summary of various bankruptcy prediction models and their respective performance metrics.

| Author | Dataset | Method a | ACC (%) | Obs. period | Attributes | Pred. type | Limitations |
|---|---|---|---|---|---|---|---|
| Almaskati, Bird, Yeung, and Lu (2021) | S&P firms | ALT b, OHL c, ZMJ d, SHW e, BSH f, CHS g, PRM h | 0.82, 0.73, 0.82, 0.79, 0.73, 0.80, 0.81 | 2005–2015 | 19 | Bankruptcy | Impact of specific governance variables; comparison of different non-parametric methods; temporal changes in governance impact |
| Liang, Tsai, Dai, and Eberle (2018) | Taiwanese, Chinese, Australian, German | SVM i, KNN j, MLP k, CART l, Bayes m | 77.60–91.27, 69.40–90.39, 71.40–89.38, 73.20–93.00, 70.70–88.82 | 2005–2015 | 95, 45, 14, 24 | Bankruptcy, Bankruptcy, Credit, Credit | Exploration of new classifier ensembles; Type I error reduction; dataset diversity |
| Barboza, Kimura, and Altman (2017) | North American firms | MDA n, LR o, ANN p; Bagging q, Boosting r, RF s, SVM i | 52–77; 71–87 | 1985–2013 | 11 | Bankruptcy | Limited analysis of new financial indicators; longitudinal changes in model performance; study focused on North American firms |
| Heo and Yang (2014) | Korean construction companies | AdaBoost t | 78.5 | 2008–2012 | 12 | Bankruptcy | Limited analysis of new financial indicators; exploration of other algorithms; study focused on Korean construction firms |

a Method: the approach or algorithm used for bankruptcy prediction. b ALT: Altman Z-score Model. c OHL: Ohlson O-score Model. d ZMJ: Zmijewski Model. e SHW: Shumway Model. f BSH: Bharath–Shumway Model. g CHS: Campbell, Hilscher, and Szilagyi Model. h PRM: Premachandra Model. i SVM: Support Vector Machines. j KNN: K-Nearest Neighbours. k MLP: Multi-Layer Perceptron. l CART: Classification and Regression Trees. m Bayes: Bayesian Classifiers. n MDA: Multiple Discriminant Analysis. o LR: Logistic Regression. p ANN: Artificial Neural Networks. q Bagging: Bootstrap Aggregating. r Boosting: ensemble technique for improving weak models. s RF: Random Forest. t AdaBoost: Adaptive Boosting.

2. Data and methods
2.1. Data acquisition

In the present study, we utilised the Taiwan bankruptcy prediction dataset from the University of California, Irvine machine learning repository, originally compiled from the Taiwan Economic Journal and covering the period from 1999 to 2009. Additionally, the study incorporated the Polish Companies Bankruptcy data to evaluate the effectiveness of the proposed hybrid model. The selection of these datasets was based on their comprehensive coverage of financial information, which provided a solid foundation for training and evaluating the hybrid model. For instance, in the case of the Taiwan data, a rigorous procedure was followed during the data-gathering phase to guarantee the accuracy and applicability of the dataset (Liang, Lu, Tsai, & Shih, 2016). Two fundamental standards were utilised in the process of collecting the data:

i. The selected companies were required to disclose their financial information for at least three years before the start of the financial crisis. This criterion ensured that the dataset contained sufficient temporal context so that the model could capture pre-crisis trends and patterns.
ii. Another important criterion was to consider similar companies within the same industry. This step was important for a nuanced analysis of the financial picture, allowing the model to identify industry-specific dynamics. The goal was to improve the model's ability to generalise insights across companies with comparable economic conditions.

Table 2
Descriptive statistics of some financial ratios in the Taiwan bankruptcy data.

| Variables | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| Roa(C) Before Interest And Depreciation Before ... | 6819.00 | 0.51 | 0.06 | 0.00 | 0.48 | 0.50 | 0.54 | 1.00 |
| Roa(A) Before Interest And % After Tax | 6819.00 | 0.56 | 0.07 | 0.00 | 0.54 | 0.56 | 0.59 | 1.00 |
| Roa(B) Before Interest And Depreciation After Tax | 6819.00 | 0.55 | 0.06 | 0.00 | 0.53 | 0.55 | 0.58 | 1.00 |
| Operating Gross Margin | 6819.00 | 0.61 | 0.02 | 0.00 | 0.60 | 0.61 | 0.61 | 1.00 |
| Realised Sales Gross Margin | 6819.00 | 0.61 | 0.02 | 0.00 | 0.60 | 0.61 | 0.61 | 1.00 |
| Operating Profit Rate | 6819.00 | 1.00 | 0.01 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Pre-Tax Net Interest Rate | 6819.00 | 0.80 | 0.01 | 0.00 | 0.80 | 0.80 | 0.80 | 1.00 |
| After-Tax Net Interest Rate | 6819.00 | 0.81 | 0.01 | 0.00 | 0.81 | 0.81 | 0.81 | 1.00 |
| Non-Industry Income And Expenditure/Revenue | 6819.00 | 0.30 | 0.01 | 0.00 | 0.30 | 0.30 | 0.30 | 1.00 |
| Continuous Interest Rate (After Tax) | 6819.00 | 0.78 | 0.01 | 0.00 | 0.78 | 0.78 | 0.78 | 1.00 |
| Operating Expense Rate | 6819.00 | 1,995,347,312.80 | 3,237,683,890.52 | 0.00 | 0.00 | 0.00 | 4,145,000,000.00 | 9,990,000,000.00 |
| Research And Development Expense Rate | 6819.00 | 1,950,427,306.06 | 2,598,291,554.00 | 0.00 | 0.00 | 509,000,000.00 | 3,450,000,000.00 | 9,980,000,000.00 |
| Cash Flow Rate | 6819.00 | 0.47 | 0.02 | 0.00 | 0.46 | 0.47 | 0.47 | 1.00 |
| Interest-Bearing Debt Interest Rate | 6819.00 | 16,448,012.91 | 108,275,033.53 | 0.00 | 0.00 | 0.00 | 0.00 | 990,000,000.00 |
| Tax Rate (A) | 6819.00 | 0.12 | 0.14 | 0.00 | 0.00 | 0.07 | 0.21 | 1.00 |
| Net Value Per Share (B) | 6819.00 | 0.19 | 0.03 | 0.00 | 0.17 | 0.18 | 0.20 | 1.00 |
| Net Value Per Share (A) | 6819.00 | 0.19 | 0.03 | 0.00 | 0.17 | 0.18 | 0.20 | 1.00 |
| Net Value Per Share (C) | 6819.00 | 0.19 | 0.03 | 0.00 | 0.17 | 0.18 | 0.20 | 1.00 |
| Persistent Eps In The Last Four Seasons | 6819.00 | 0.23 | 0.03 | 0.00 | 0.21 | 0.22 | 0.24 | 1.00 |
| Cash Flow Per Share | 6819.00 | 0.32 | 0.02 | 0.00 | 0.32 | 0.32 | 0.33 | 1.00 |
| Revenue Per Share (Yuan ¥) | 6819.00 | 1,328,640.60 | 51,707,089.77 | 0.00 | 0.02 | 0.03 | 0.05 | 3,020,000,000.00 |
| Operating Profit Per Share (Yuan ¥) | 6819.00 | 0.11 | 0.03 | 0.00 | 0.10 | 0.10 | 0.12 | 1.00 |
| Per Share Net Profit Before Tax (Yuan ¥) | 6819.00 | 0.18 | 0.03 | 0.00 | 0.17 | 0.18 | 0.19 | 1.00 |
| Realised Sales Gross Profit Growth Rate | 6819.00 | 0.02 | 0.01 | 0.00 | 0.02 | 0.02 | 0.02 | 1.00 |
| Operating Profit Growth Rate | 6819.00 | 0.85 | 0.01 | 0.00 | 0.85 | 0.85 | 0.85 | 1.00 |
| After-Tax Net Profit Growth Rate | 6819.00 | 0.69 | 0.01 | 0.00 | 0.69 | 0.69 | 0.69 | 1.00 |
| Regular Net Profit Growth Rate | 6819.00 | 0.69 | 0.01 | 0.00 | 0.69 | 0.69 | 0.69 | 1.00 |
| Continuous Net Profit Growth Rate | 6819.00 | 0.22 | 0.01 | 0.00 | 0.22 | 0.22 | 0.22 | 1.00 |
| Total Asset Growth Rate | 6819.00 | 5,508,096,595.25 | 2,897,717,771.17 | 0.00 | 4,860,000,000.00 | 6,400,000,000.00 | 7,390,000,000.00 | 9,990,000,000.00 |
| Net Value Growth Rate | 6819.00 | 1,566,212.06 | 114,159,389.52 | 0.00 | 0.00 | 0.00 | 0.00 | 9,330,000,000.00 |
| Total Asset Return Growth Rate Ratio | 6819.00 | 0.26 | 0.01 | 0.00 | 0.26 | 0.26 | 0.26 | 1.00 |
| Cash Reinvestment % | 6819.00 | 0.38 | 0.02 | 0.00 | 0.37 | 0.38 | 0.39 | 1.00 |
| Current Ratio | 6819.00 | 403,284.95 | 33,302,155.83 | 0.00 | 0.01 | 0.01 | 0.02 | 2,750,000,000.00 |
| Quick Ratio | 6819.00 | 8,376,594.82 | 244,684,748.45 | 0.00 | 0.00 | 0.01 | 0.01 | 9,230,000,000.00 |
| Interest Expense Ratio | 6819.00 | 0.63 | 0.01 | 0.00 | 0.63 | 0.63 | 0.63 | 1.00 |
| Total Debt/Total Net Worth | 6819.00 | 4,416,336.71 | 168,406,905.28 | 0.00 | 0.00 | 0.01 | 0.01 | 9,940,000,000.00 |
| Debt Ratio % | 6819.00 | 0.11 | 0.05 | 0.00 | 0.07 | 0.11 | 0.15 | 1.00 |
| Net Worth/Assets | 6819.00 | 0.89 | 0.05 | 0.00 | 0.85 | 0.89 | 0.93 | 1.00 |
| Long-Term Fund Suitability Ratio (A) | 6819.00 | 0.01 | 0.03 | 0.00 | 0.01 | 0.01 | 0.01 | 1.00 |
| Borrowing Dependency | 6819.00 | 0.37 | 0.02 | 0.00 | 0.37 | 0.37 | 0.38 | 1.00 |
| Contingent Liabilities/Net Worth | 6819.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 1.00 |
| Operating Profit/Paid-In Capital | 6819.00 | 0.11 | 0.03 | 0.00 | 0.10 | 0.10 | 0.12 | 1.00 |
| Net Profit Before Tax/Paid-In Capital | 6819.00 | 0.18 | 0.03 | 0.00 | 0.17 | 0.18 | 0.19 | 1.00 |
| Inventory And Accounts Receivable/Net Value | 6819.00 | 0.40 | 0.01 | 0.00 | 0.40 | 0.40 | 0.40 | 1.00 |
| Total Asset Turnover | 6819.00 | 0.14 | 0.10 | 0.00 | 0.08 | 0.12 | 0.18 | 1.00 |
| Accounts Receivable Turnover | 6819.00 | 12,789,705.24 | 278,259,836.98 | 0.00 | 0.00 | 0.00 | 0.00 | 9,740,000,000.00 |
| Average Collection Days | 6819.00 | 9,826,220.86 | 256,358,895.71 | 0.00 | 0.00 | 0.01 | 0.01 | 9,730,000,000.00 |
| Inventory Turnover Rate (Times) | 6819.00 | 2,149,106,056.61 | 3,247,967,014.05 | 0.00 | 0.00 | 0.00 | 4,620,000,000.00 | 9,990,000,000.00 |
| Fixed Assets Turnover Frequency | 6819.00 | 1,008,595,981.82 | 2,477,557,316.92 | 0.00 | 0.00 | 0.00 | 0.00 | 9,990,000,000.00 |
| Net Worth Turnover Rate (Times) | 6819.00 | 0.04 | 0.04 | 0.00 | 0.02 | 0.03 | 0.04 | 1.00 |
| Revenue Per Person | 6819.00 | 2,325,854.27 | 136,632,654.39 | 0.00 | 0.01 | 0.02 | 0.04 | 8,810,000,000.00 |
| Operating Profit Per Person | 6819.00 | 0.40 | 0.03 | 0.00 | 0.39 | 0.40 | 0.40 | 1.00 |
| Allocation Rate Per Person | 6819.00 | 11,255,785.32 | 294,506,294.12 | 0.00 | 0.00 | 0.01 | 0.02 | 9,570,000,000.00 |
| Working Capital To Total Assets | 6819.00 | 0.81 | 0.06 | 0.00 | 0.77 | 0.81 | 0.85 | 1.00 |
| Quick Assets/Total Assets | 6819.00 | 0.40 | 0.20 | 0.00 | 0.24 | 0.39 | 0.54 | 1.00 |
| Current Assets/Total Assets | 6819.00 | 0.52 | 0.22 | 0.00 | 0.35 | 0.51 | 0.69 | 1.00 |
| Cash/Total Assets | 6819.00 | 0.12 | 0.14 | 0.00 | 0.03 | 0.07 | 0.16 | 1.00 |
| Quick Assets/Current Liability | 6819.00 | 3,592,902.20 | 171,620,908.61 | 0.00 | 0.01 | 0.01 | 0.01 | 8,820,000,000.00 |
| Cash/Current Liability | 6819.00 | 37,159,994.15 | 510,350,903.16 | 0.00 | 0.00 | 0.00 | 0.01 | 9,650,000,000.00 |
| Current Liability To Assets | 6819.00 | 0.09 | 0.05 | 0.00 | 0.05 | 0.08 | 0.12 | 1.00 |
| Operating Funds To Liability | 6819.00 | 0.35 | 0.04 | 0.00 | 0.34 | 0.35 | 0.36 | 1.00 |
| Inventory/Working Capital | 6819.00 | 0.28 | 0.01 | 0.00 | 0.28 | 0.28 | 0.28 | 1.00 |
| Inventory/Current Liability | 6819.00 | 55,806,804.53 | 582,051,554.62 | 0.00 | 0.00 | 0.01 | 0.01 | 9,910,000,000.00 |
| Current Liabilities/Liability | 6819.00 | 0.76 | 0.21 | 0.00 | 0.63 | 0.81 | 0.94 | 1.00 |
| Working Capital/Equity | 6819.00 | 0.74 | 0.01 | 0.00 | 0.73 | 0.74 | 0.74 | 1.00 |
| Current Liabilities/Equity | 6819.00 | 0.33 | 0.01 | 0.00 | 0.33 | 0.33 | 0.33 | 1.00 |
| Long-Term Liability To Current Assets | 6819.00 | 54,160,038.14 | 570,270,621.96 | 0.00 | 0.00 | 0.00 | 0.01 | 9,540,000,000.00 |
| Retained Earnings To Total Assets | 6819.00 | 0.93 | 0.03 | 0.00 | 0.93 | 0.94 | 0.94 | 1.00 |
| Total Income/Total Expense | 6819.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Total Expense/Assets | 6819.00 | 0.03 | 0.03 | 0.00 | 0.01 | 0.02 | 0.04 | 1.00 |
| Current Asset Turnover Rate | 6819.00 | 1,195,855,763.31 | 2,821,161,238.26 | 0.00 | 0.00 | 0.00 | 0.00 | 10,000,000,000.00 |
| Quick Asset Turnover Rate | 6819.00 | 2,163,735,272.03 | 3,374,944,402.17 | 0.00 | 0.00 | 0.00 | 4,900,000,000.00 | 10,000,000,000.00 |
| Working Capital Turnover Rate | 6819.00 | 0.59 | 0.01 | 0.00 | 0.59 | 0.59 | 0.59 | 1.00 |
| Cash Turnover Rate | 6819.00 | 2,471,976,967.44 | 2,938,623,226.68 | 0.00 | 0.00 | 1,080,000,000.00 | 4,510,000,000.00 | 10,000,000,000.00 |
| Cash Flow To Sales | 6819.00 | 0.67 | 0.01 | 0.00 | 0.67 | 0.67 | 0.67 | 1.00 |
| Fixed Assets To Assets | 6819.00 | 1,220,120.50 | 100,754,158.71 | 0.00 | 0.09 | 0.20 | 0.37 | 8,320,000,000.00 |
| Current Liability To Liability | 6819.00 | 0.76 | 0.21 | 0.00 | 0.63 | 0.81 | 0.94 | 1.00 |
| Current Liability To Equity | 6819.00 | 0.33 | 0.01 | 0.00 | 0.33 | 0.33 | 0.33 | 1.00 |
| Equity To Long-Term Liability | 6819.00 | 0.12 | 0.02 | 0.00 | 0.11 | 0.11 | 0.12 | 1.00 |

The dataset covered a wide range of industries, such as the manufacturing sector (which includes industrial and electronics enterprises), the service sector (which includes shipping, tourism, and retail companies), and other non-financial industry entities. The dataset comprises a substantial sample of 6819 observations, each characterised by 96 attributes. Within this dataset, 220 observations have been identified as instances of bankruptcy. In Table 2, we present comprehensive descriptive statistics for the chosen financial ratios in the dataset. The statistical metrics employed encompass fundamental measures such as mean, standard deviation, minimum, maximum, and percentiles at 25%, 50%, and 75%. These metrics offer a general overview of the distribution and central tendencies of the selected financial ratios, providing valuable insights into their variability and the overall profile of the Taiwan bankruptcy data under examination.

Three important factors in the current investigation necessitate domain adaptation. First, the dataset spans a sizable amount of time, from 1999 to 2009, and may include notable changes in industry dynamics, financial reporting standards, or economic situations. The training data (before the financial crisis) and possible test data (perhaps obtained after the financial crisis or in another economic environment) can become disconnected as a result of these temporal shifts, causing the model to perform poorly when applied to new data. In fact, the
In fact, the challenge of dataset shifts affecting the performance of supervised learning predictors has necessitated the development of a framework like DetectShift to quantify and address these shifts (Maia Polo, Izbicki, Lacerda, Ibieta-Jimenez, & Vicente, 2023). There are three main types of data shifts that can affect model performance: Covariate Shift, where the input features' distribution changes between the training and testing datasets while the output variable's distribution remains the same, potentially leading to biased predictions; Concept Shift, where the relationship between input features and the output variable changes due to factors like economic conditions or industry trends; and Prior Probability Shift, where the distribution of the target variable changes, affecting the model's predictive accuracy, particularly in cases of imbalanced data (Quiñonero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2022).

Second, the diversity of the dataset across different industries highlights the necessity of understanding industry-specific dynamics. Each industry can experience unique shifts due to various factors such as technological advancements, regulatory changes, and evolving market conditions. These shifts might not be captured in the training set, leading to a model that performs well on historical data but poorly on new data reflecting current industry conditions. For example, a model trained on pre-2008 financial data might not account for post-crisis regulatory changes that significantly impact financial reporting and risk assessment. Similarly, a model trained on manufacturing data from an era of manual processes may struggle to predict outcomes in a modern, highly automated industry. The possibility of unanticipated shifts in industry conditions or trends underscores the importance of domain adaptability.
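Covariate shift of the kind described above can be screened for with a simple per-feature two-sample test. The sketch below is an illustration only, not part of the authors' pipeline: it computes the Kolmogorov-Smirnov statistic between a training-period feature sample and a test-period sample whose mean has drifted (the sample sizes and shift magnitude are arbitrary choices).

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
pre_crisis = rng.normal(0.0, 1.0, 1000)    # training-period feature sample
post_crisis = rng.normal(0.8, 1.0, 1000)   # test-period sample with drifted mean
stable = rng.normal(0.0, 1.0, 1000)        # control sample with no drift

shifted_ks = ks_statistic(pre_crisis, post_crisis)  # large: distributions differ
stable_ks = ks_statistic(pre_crisis, stable)        # small: same distribution
```

In practice each financial ratio would be screened this way (a library routine such as `scipy.stats.ks_2samp` also provides a p-value), and features with large statistics flagged as shift candidates.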
Domain adaptation techniques allow models to adjust to new industry environments by learning from both historical and current data. This involves identifying industry-specific features that remain relevant over time, weighting instances to prioritise more recent and relevant data, and developing representations that are robust to changes in industry dynamics. By incorporating these techniques, models can generalise well across a range of industry situations, maintaining their accuracy and reliability even as industry conditions evolve.

Finally, a common criticism in machine learning concerns biased training data, whose influence can degrade model performance, especially with respect to minority target labels.

The DAL approach proves invaluable in tackling temporal data shifts by strategically aligning features, weighting instances, and crafting domain-invariant representations. These techniques ensure that machine learning models adapt effectively to changes in the temporal distribution of data. By selecting and transforming features that remain stable across different time periods, assigning higher weights to instances that reflect the target domain's temporal characteristics, and learning representations insensitive to temporal variations, models become more resilient to temporal shifts. Additionally, harnessing transfer learning strategies such as pre-training on diverse temporal data and fine-tuning using domain adaptation methods enhances the model's ability to generalise and perform well across varying temporal contexts.

2.2. Handling outliers

Removing outliers from financial ratios before bankruptcy detection is essential to ensure the accuracy and reliability of the analysis. Outliers can skew statistical measures, distort trends, and mislead the interpretation of financial data, which can have significant implications, especially in critical decisions like bankruptcy prediction.
A recent study conducted by Nyitrai and Virág (2019) highlighted the necessity of financial indicators in predicting bankruptcy and the challenges posed by outliers in these indicators. The authors explored different approaches to handling outliers, specifically focusing on winsorisation and the use of CHAID-based (Chi-squared Automatic Interaction Detector) categorisation of financial ratios.

In this work, we adopted a hybrid Bayesian change point and Hampel identifier (BCP-HI) method (Pehlivan, 2024). This amalgamation scheme can potentially identify outliers more precisely than winsorisation. Winsorisation replaces extreme values with values from the tails of the distribution, which can mask subtle outliers, especially when dealing with multiple change points. The first component, BCP analysis, helps pinpoint these potential change points, allowing the HI portion to better target outliers within those segments. Additionally, the hybrid method utilises the data itself to identify outliers. By modelling the data with a normal distribution before and after potential change points, it can compute unique probabilities for each data point, leading to a more data-driven approach to outlier detection compared to winsorisation's fixed threshold approach.

2.2.1. The integration of the BCP-HI method

The BCP-HI outlier detection depicted in Algorithm 1 begins by preprocessing the financial data and determining initial parameters such as the window size (𝑤) and the number of change points (𝑐𝑝). The notation 𝑤 represents the size of the window used for outlier detection, while 𝑐𝑝 indicates the expected number of shifts in the data distribution. These parameters are crucial for the effectiveness of the algorithm in identifying outliers accurately.
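As a concrete illustration, the segment-wise detection and correction steps of Algorithm 1 can be sketched in Python. This is a minimal, hypothetical implementation rather than the authors' code: the Bayesian change-point step is abstracted away as a list of precomputed change-point indices, and the Hampel multiplier `k` and window size `w` are free parameters.

```python
import numpy as np

def bcp_hi_correct(data, change_points, k=3.0, w=5):
    """Hampel-identifier detection and median-filter correction applied per
    change-point segment (the loop of Algorithm 1); the Bayesian change-point
    step is assumed to have already produced `change_points`."""
    x = np.asarray(data, dtype=float).copy()
    bounds = [0] + sorted(change_points) + [len(x)]
    outliers = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = x[lo:hi]
        med = np.median(seg)
        mad = np.median(np.abs(seg - med))
        # Hampel rule: flag x_j when |x_j - median(S_i)| > k * MAD(S_i)
        flags = np.abs(seg - med) > k * max(mad, 1e-12)
        outliers.extend(int(j) for j in lo + np.flatnonzero(flags))
    x[outliers] = np.nan                # replace identified outliers with NaN
    for j in outliers:                  # median filter of size w, ignoring NaNs
        x[j] = np.nanmedian(x[max(0, j - w // 2): j + w // 2 + 1])
    return x, outliers

series = [1, 1, 1, 100, 1, 1, 10, 10, 10, 10, -50, 10]
cleaned, flagged = bcp_hi_correct(series, change_points=[6])
```

On this toy series with a level shift at index 6, the two spikes (100 and -50) are flagged within their own segments and replaced by their local medians.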
Algorithm 1 BCP-HI Outlier Detection Algorithm
1: Input: Financial ratio data
2: Output: Refined financial ratio with outliers corrected
3: procedure BCP-HI(𝑑𝑎𝑡𝑎)
4:   Preprocess the input data
5:   Determine initial parameters: window size 𝑤, number of change points 𝑐𝑝
6:   Perform Bayesian Change Point (BCP) analysis to identify change points
7:   Divide data into subsegments based on change points
8:   for each subsegment do
9:     Calculate median and MAD for the subsegment
10:    Apply Hampel Identifier (HI) for outlier detection
11:    Replace identified outliers with NaN values
12:    Apply median filter of size 𝑤 to correct outliers
13:  end for
14:  return Refined time series data with outliers corrected
15: end procedure

Let the financial metric dataset over a given period be denoted as 𝑫 = {𝑥1, 𝑥2, … , 𝑥𝑛} with 𝑛 data points. The hybrid method first employs BCP analysis to detect significant shifts (𝑐𝑝𝑠) in the financial metric data. These change points represent instances where there is a notable change in the underlying distribution or behaviour of the financial metric. Consider the set of change points as 𝑐𝑝𝑠 = {𝑐𝑝1, 𝑐𝑝2, … , 𝑐𝑝𝑘}, where 𝑘 signifies the number of detected change points. The BCP analysis segments the financial data into subsegments with distinct distribution properties. Let 𝑆𝑖 represent the 𝑖th subsegment, where 𝑖 = 1, 2, … , 𝑘 + 1. Each subsegment 𝑆𝑖 is delineated by two adjacent change points 𝑐𝑝𝑖 and 𝑐𝑝𝑖+1, such that 𝑆𝑖 = {𝑥𝑐𝑝𝑖 , 𝑥𝑐𝑝𝑖+1, … , 𝑥𝑐𝑝𝑖+1−1}. This first approach incorporates Bayesian probabilistic modelling to estimate the likelihood of each data point belonging to a specific subsegment given the observed financial metric data, as defined in Eq.
(1):

𝑃(𝑆𝑖|𝑥𝑗) = [𝑃(𝑥𝑗|𝑆𝑖) ⋅ 𝑃(𝑆𝑖)] / [∑_{𝑚=1}^{𝑘+1} 𝑃(𝑥𝑗|𝑆𝑚) ⋅ 𝑃(𝑆𝑚)],  (1)

where 𝑃(𝑆𝑖|𝑥𝑗) represents the posterior probability of data point 𝑥𝑗 belonging to subsegment 𝑆𝑖, 𝑃(𝑥𝑗|𝑆𝑖) signifies the likelihood of observing data point 𝑥𝑗 under the distribution parameters of subsegment 𝑆𝑖, 𝑃(𝑆𝑖) denotes the prior probability of subsegment 𝑆𝑖, and ∑_{𝑚=1}^{𝑘+1} 𝑃(𝑥𝑗|𝑆𝑚) ⋅ 𝑃(𝑆𝑚) calculates the weighted sum of likelihoods across all subsegments, ensuring that probabilities sum to 1. This Bayesian methodology facilitates the probabilistic detection of significant changes (change points) in financial metric data and enables the segmentation of the data into subsegments with distinct distributional characteristics, thereby enhancing the analysis and processing of each subsegment independently.

Fig. 1. Outlier detection and correction in financial metrics using a hybrid BCP-HI scheme. The first column shows the original time series data for each financial metric before outlier detection. The second column highlights the outliers (in red circles) detected in the original data using the HI method. The third column shows the time series data after outliers have been removed.

Within each subsegment 𝑆𝑖, the algorithm calculates the median, denoted as median(𝑆𝑖). The Median Absolute Deviation (MAD) is then computed, which measures the dispersion of the data points within the subsegment. The MAD is defined as the median of the absolute deviations from the median of the subsegment, mathematically expressed as MAD(𝑆𝑖) = median(|𝑥𝑗 − median(𝑆𝑖)|) for all 𝑥𝑗 in 𝑆𝑖. After computing the median and MAD for each subsegment, the algorithm applies the HI for outlier detection. The HI flags a data point 𝑥𝑗 in subsegment 𝑆𝑖 as an outlier if its absolute deviation from the median exceeds a specified threshold. This threshold is typically set as a multiple of the MAD.
Specifically, a data point 𝑥𝑗 is considered an outlier if |𝑥𝑗 − median(𝑆𝑖)| > 𝑘 × MAD(𝑆𝑖), where 𝑘 is a constant multiplier that determines the sensitivity of the outlier detection. By applying this criterion, the algorithm effectively identifies outliers within each subsegment based on the robust statistical properties of the median and MAD.

Identified outliers in the financial ratios are replaced with NaN values. Let {𝑜1, 𝑜2, … , 𝑜𝑚} denote the indices of the identified outliers in the data 𝑫. For each outlier index 𝑜𝑖, we set 𝑥𝑜𝑖 = NaN.

To correct for these outliers, a median filter of size 𝑤 is applied to the time series. The median filter processes the time series by sliding a window of size 𝑤 across the data points. For each window position, the median value of the data points within the window is computed. If the window is centred at index 𝑗, the window includes data points {𝑥𝑗−⌊𝑤∕2⌋, 𝑥𝑗−⌊𝑤∕2⌋+1, … , 𝑥𝑗+⌊𝑤∕2⌋}. The median of these data points, excluding NaN values, is used to replace the NaN value at index 𝑗. Mathematically, for each outlier index 𝑜𝑖, the corrected value is given by:

𝑥𝑜𝑖 = median({𝑥𝑜𝑖−⌊𝑤∕2⌋, 𝑥𝑜𝑖−⌊𝑤∕2⌋+1, … , 𝑥𝑜𝑖+⌊𝑤∕2⌋} ⧵ {NaN})  (2)

This process ensures that outliers are replaced with more representative values based on the local neighbourhood of data points, effectively smoothing the financial ratios while preserving important trends and patterns.

The plots in Fig. 1 show the results of the hybrid outlier detection method applied to several financial metrics over the observed period. The 𝑦-axis represents the values of the financial metric, and the 𝑥-axis represents the metric indices. Note how the severe outliers displayed in the figure are identified and calibrated by leveraging the BCP-HI statistical technique.

Fig. 2. A graph of the correlation matrix describing the linear association between financial variables.
Each element in the triangular matrix shows the correlation coefficient between two variables.

2.3. Bivariate analysis

We propose a correlation matrix (in Fig. 2) to thoroughly examine the links between financial ratios. A quantitative indicator of the relationships between these financial ratios is the Pearson correlation coefficient (𝑟), which can be calculated using the following formula in Eq. (3) (Cohen et al., 2009):

𝑟 = ∑_{𝑖=1}^{𝑛}(𝑋𝑖 − 𝑋̄)(𝑌𝑖 − 𝑌̄) / √(∑_{𝑖=1}^{𝑛}(𝑋𝑖 − 𝑋̄)² ⋅ ∑_{𝑖=1}^{𝑛}(𝑌𝑖 − 𝑌̄)²).  (3)

The individual data points of the two financial ratios are represented by 𝑋𝑖 and 𝑌𝑖, their respective means are shown by 𝑋̄ and 𝑌̄, and the total number of data points is indicated by 𝑛. We observe from the matrix plot that most of the financial variables have values approaching 0. This closeness suggests that there is little correlation between any two of the chosen variables, confirming that there is no discernible multicollinearity between the independent financial ratios. These observed characteristics underscore the necessity of utilising all financial ratios to ascertain their collective importance in determining relevant features.

2.4. Genetic algorithm for feature selection

Genetic algorithms (GAs) have proven time and time again to be remarkably efficient at resolving a wide range of optimisation issues. Nolfi, Floreano, Miglino, Mondada, et al. (1994) documented their achievements in a variety of applications, including sophisticated robot motion optimisation, control system parameter fine-tuning, and robotic system path planning. In addition to their conventional uses, GAs are flexible in machine learning, especially when it comes to feature selection. This versatility makes the GA a valuable tool for systematic navigation in situations with complex combinations of features. The heuristic search algorithm uses principles inspired by natural selection and evolution to iteratively refine a subset of traits and gradually converge to an optimal set.
Adopting this approach effectively balances model complexity and prediction metrics such as F1-score, precision, recall, and AUC-ROC, while also improving classifier performance. In the present work, we investigate the GA used to determine the ideal subset of features in the financial dataset that maximises the model classifier's accuracy.

Let 𝐗 denote the feature matrix of a dataset with 𝑁 instances and 𝐿 features, and 𝐲 represent the corresponding target label vector. The following characteristics describe the GA:

(i) Population size (𝑁): Representing the number of individuals in each generation of the GA. The notation 𝑁 ∈ Z+ is a key factor in determining population diversity and the trade-off between exploration and exploitation.
(ii) Offspring production (𝜆): The number of offspring produced in each generation; 𝜆 determines the rate at which new genetic material is introduced into the population. Like 𝑁, 𝜆 is also a positive integer: 𝜆 ∈ Z+.
(iii) Crossover probability (𝑃𝑐): Reflecting the likelihood of mating occurring between two individuals. The symbol 𝑃𝑐 influences the exploration–exploitation balance by controlling the exchange of genetic material between parents. Mathematically, 𝑃𝑐 is a probability value between 0 and 1: 𝑃𝑐 ∈ [0, 1].
(iv) Mutation probability (𝑃𝑚): This parameter represents the likelihood that a bit in an individual's binary string will be flipped. Mutation introduces genetic diversity and prevents premature convergence. Mathematically, 𝑃𝑚 is a probability value between 0 and 1: 𝑃𝑚 ∈ [0, 1].
(v) Number of generations (𝐺): 𝐺 signifies the total iterations or epochs for which the genetic algorithm will run, determining how many times the evolutionary process (selection, crossover, and mutation) is applied. Mathematically, 𝐺 is a positive integer: 𝐺 ∈ Z+.

Each individual in the population is represented by a binary string of length 𝐿.
The binary string encodes the presence (1) or absence (0) of each feature in the subset. Mathematically, an individual 𝐼 is represented as

𝐼 = [𝑔1, 𝑔2, … , 𝑔𝐿].  (4)

Here, 𝑔𝑖 is the 𝑖th gene in the binary string, indicating the presence or absence of the 𝑖th feature. The binary encoding provides a concise and flexible representation of feature subsets. Each gene 𝑔𝑖 is a binary variable defined as 𝑔𝑖 ∈ {0, 1}. The initial population is then formed by randomly generating binary strings of length 𝐿, ensuring diversity in the initial set of individuals. The initialisation of an individual 𝐼 can be expressed as

𝐼𝑖 ∼ Bernoulli(0.5) for 𝑖 = 1, 2, … , 𝐿.  (5)

Algorithm 2 Genetic Algorithm for Feature Selection
1: Input: Dataset features 𝐗, target labels 𝐲, population size 𝑁, lambda 𝜆, crossover probability 𝑃𝑐, mutation probability 𝑃𝑚, number of generations 𝐺
2: Output: Selected features selected_features_ga
3: Initialise Population:
4:   Each individual 𝐼 is represented as a binary string of length 𝐿, where 𝐿 is the number of features.
5: Initialise Genetic Algorithm Set:
6:   Define fitness function fitness_function, crossover function, mutation function, and selection function
7: Initialise Population:
8:   Create a population of 𝑁 individuals
9: Evaluate Initial Population:
10:  Evaluate the fitness of each individual using the fitness function
11: for 𝑡 = 1 to 𝐺 do
12:   Apply Crossover and Mutation:
13:     Generate offspring using crossover and mutation operations with probabilities 𝑃𝑐 and 𝑃𝑚
14:   Evaluate Offspring:
15:     Evaluate the fitness of each offspring using the fitness function
16:   Select Individuals for the Next Generation:
17:     Use the selection function to choose individuals for the next generation based on their fitness
18: end for
19: Select Best Individual:
20:   Choose the individual with the highest fitness as the best individual
21: Extract Selected Features:
22:   Extract the indices of selected features from the best individual: selected_features_ga

The expression Bernoulli(𝑝) represents a Bernoulli distribution with probability 𝑝, and 𝐼𝑖 is the 𝑖th gene in the binary string. Keep in mind that genetic operations (namely, crossover and mutation) are fundamental processes in genetic algorithms that shape the evolution of the population over generations. In the context of feature selection, these operations manipulate the binary string representations of individuals to explore new solutions. Crossover involves combining genetic material from two parent individuals to create offspring. The crossover point is randomly selected along the binary string, and genetic material beyond that point is swapped between parents. For instance, let 𝐼1 and 𝐼2 be two parent individuals with binary string representations:

𝐼1 = [𝑔¹1, 𝑔¹2, … , 𝑔¹𝑖, … , 𝑔¹𝐿], 𝐼2 = [𝑔²1, 𝑔²2, … , 𝑔²𝑖, … , 𝑔²𝐿].  (6)

The crossover point 𝐶 is randomly selected, and offspring 𝑂1 and 𝑂2 are created:

𝑂1 = [𝑔¹1, 𝑔¹2, … , 𝑔¹𝐶, 𝑔²𝐶+1, 𝑔²𝐶+2, … , 𝑔²𝐿], 𝑂2 = [𝑔²1, 𝑔²2, … , 𝑔²𝐶, 𝑔¹𝐶+1, 𝑔¹𝐶+2, … , 𝑔¹𝐿].
(7)

The offspring thus exchange the parents' genetic material beyond the crossover point. After the crossover operation, the next step is the application of mutation. Mutation involves randomly changing the value of one or more genes in an individual. This introduces genetic diversity into the population and prevents premature convergence to suboptimal solutions. In this work, mutation is applied to each gene independently with a mutation probability 𝑃𝑚:

MutatedGene𝑖 = 1 − 𝑔𝑖 with probability 𝑃𝑚, and 𝑔𝑖 with probability 1 − 𝑃𝑚.

This mutated gene replaces the original gene in the binary string. If, for example, a mutation with probability 𝑃𝑚 occurs at position 𝑗 = 5 in 𝑂1, the binary string is updated as follows:

Original 𝑂1 = [𝑔¹1, 𝑔¹2, 𝑔¹3, 𝑔²4, 𝑔²5, … , 𝑔²𝐿], Mutated 𝑂1 = [𝑔¹1, 𝑔¹2, 𝑔¹3, 𝑔²4, 1 − 𝑔²5, … , 𝑔²𝐿]  (8)

Note that after the crossover and mutation operations, the population is updated with the newly created individuals. These individuals have genetic material inherited from their parents, with potential variations introduced through mutation. Now, to provide a quantitative measure of an individual's performance, we introduce a fitness function. The fitness function evaluates how well an individual performs the task at hand. In the context of feature selection, the fitness function measures the effectiveness of a subset of features. The GA aims to find the subset of features 𝑆 that maximises the accuracy of a machine learning model. The fitness function is typically defined as the accuracy achieved by the model trained on the dataset with the selected features:

Fitness(𝐼) = Accuracy(Model(𝑋train,𝑆, 𝑦train), 𝑋test,𝑆, 𝑦test)  (9)

Here, 𝑋train and 𝑦train are the training features and labels, and 𝑋test and 𝑦test are the test features and labels. Model is the machine learning model used (in this case, the Extra Trees classifier), and 𝑆 represents the features selected according to the binary string 𝐼.
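Putting initialisation (Eq. (5)), one-point crossover, bit-flip mutation, and fitness-driven selection together, a compact GA loop can be sketched as below. To keep the sketch self-contained and runnable, the model-accuracy fitness of Eq. (9) is replaced by a hypothetical toy score that rewards four designated "informative" features and penalises subset size; in the actual pipeline the score would instead come from training the Extra Trees classifier on the selected columns. All parameter values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
L, N, G, Pc, Pm = 12, 30, 40, 0.8, 0.05   # illustrative GA parameters

# Toy stand-in for Eq. (9): features 0-3 are "informative"; a small
# penalty on subset size discourages selecting everything.
informative = np.zeros(L)
informative[:4] = 1.0

def fitness(ind):
    return informative @ ind - 0.1 * ind.sum()

pop = rng.integers(0, 2, size=(N, L))      # Eq. (5): Bernoulli(0.5) init
for _ in range(G):
    fit = np.array([fitness(i) for i in pop])
    # tournament selection: keep the fitter of two random individuals
    a, b = rng.integers(0, N, N), rng.integers(0, N, N)
    parents = pop[np.where(fit[a] > fit[b], a, b)]
    children = parents.copy()
    for i in range(0, N - 1, 2):           # one-point crossover, Eq. (7)
        if rng.random() < Pc:
            c = rng.integers(1, L)
            children[i, c:], children[i + 1, c:] = (
                parents[i + 1, c:].copy(), parents[i, c:].copy())
    mask = rng.random((N, L)) < Pm         # bit-flip mutation, Eq. (8)
    pop = np.where(mask, 1 - children, children)

best = pop[np.argmax([fitness(i) for i in pop])]
selected = np.flatnonzero(best)            # indices of selected features
```

After a few dozen generations the best individual concentrates on the informative features while pruning most of the uninformative ones.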
The ultimate goal is to maximise the fitness function, thereby identifying the subset of features that results in the highest accuracy. The practical development and execution of the GA are depicted in Algorithm 2.

The visual representation in Fig. 3 provides a comprehensive view of the key metrics crucial for assessing financial stability and forecasting potential financial distress. Additionally, the lollipop chart not only showcases the significance of each ratio but also emphasises their hierarchical importance in the context of bankruptcy prediction. The relative importance of each feature is determined by averaging its importance across multiple GA runs and normalising these averages. Next, we discuss the various machine learning models used in the study.

Fig. 3. A lollipop chart illustrating the distribution of the top 50 financial ratios, highlighting their relative importance in predicting the likelihood of bankruptcy.

2.5. Machine learning models

2.5.1. Random Forest (RF)

The RF algorithm is an effective ensemble learning method that avoids over-fitting by combining random feature selection with bagging to manipulate complex data patterns. The resulting mathematical framework illustrates the essential ideas that govern how successful random forests are as machine learning techniques.

Let 𝒟 denote the original dataset housing 𝑁 samples. For every decision tree within the ensemble (comprising a total of 𝑇 trees), a bootstrap sample 𝒟𝑏 of size 𝑁 is created by iteratively selecting samples with replacement from 𝒟 such that 𝒟𝑏 = {(𝐱1, 𝑦1), (𝐱2, 𝑦2), … , (𝐱𝑁, 𝑦𝑁)}. Each decision tree undergoes training on its respective bootstrap sample, yielding 𝑇 independently trained trees. This algorithm introduces an additional level of randomness by including only a subset of features at each node when building the decision tree.
This deliberate selection of random features aims to decorrelate the trees, prevent excessive similarity, and allow us to capture different aspects of the data. Mathematically, at each node 𝑗 of a decision tree, a random subset of 𝑚 features is chosen from the complete feature set of size 𝑀, such that 𝑚 ≤ 𝑀. The purpose of this stochastic selection is to increase the tree's diversity, fortify the algorithm's resilience, and enhance its capacity for generalisation. Taking into account the ensemble of decision trees 𝑓1(𝐱), 𝑓2(𝐱), … , 𝑓𝑇(𝐱), the final prediction 𝐹(𝐱) for a new input 𝐱 is determined through a combination mechanism, typically involving voting for classification, as given in Eq. (10).

𝐹(𝐱) = mode(𝑓1(𝐱), 𝑓2(𝐱), … , 𝑓𝑇(𝐱)) (for classification)  (10)

2.5.2. Support Vector Machine (SVM)

The basic idea behind SVM is finding the best hyperplane in the feature space to separate different classes in the data effectively. This becomes especially important when dealing with binary classification issues, where the classes are usually designated as 0 and 1. Since its introduction by Vapnik (1982), SVM has been a major force in machine learning systems, outperforming many of its competitors in a short amount of time. Its ascendancy is attributed to the dual factors of simplicity and superior performance, as evidenced by studies such as Peng and Xu (2013). The widespread adoption of SVM is underscored by its successful application across diverse research domains.
Noteworthy fields where SVM has demonstrated its efficacy include finance, as exemplified by Luo, Yan, and Tian (2020) and Tay and Cao (2001); chemistry, as explored by Li, Liang, and Xu (2009); renewable energy prediction, with contributions from Zendehboudi, Baseer, and Saidur (2018); medicine, as demonstrated by Wang, Zheng, Yoon, and Ko (2018); text classification, a domain addressed by Tong and Koller (2001); and face recognition, with seminal work by Osuna, Freund, and Girosit (1997).

Given a set of training data points (𝐱1, 𝑦1), (𝐱2, 𝑦2), … , (𝐱𝑛, 𝑦𝑛), where 𝐱𝑖 is the feature vector for the 𝑖th data point and 𝑦𝑖 is the corresponding class label such that 𝑦𝑖 ∈ {−1, +1} (the 0/1 labels are conventionally remapped to −1/+1 for the SVM formulation), the decision function of an SVM is given by Eq. (11):

𝑓(𝐱) = 𝐰 ⋅ 𝐱 + 𝑏  (11)

Here, 𝐰 is the weight vector, 𝐱 is the input feature vector, and 𝑏 is the bias term. The goal of SVM is to find the optimal hyperplane that maximises the margin between the two classes. The margin is defined as the distance between the hyperplane and the nearest data point from each class. Mathematically, the margin (𝑀) is given by the formula in Eq. (12):

𝑀 = 2 / ‖𝐰‖,  (12)

where ‖𝐰‖ is the Euclidean norm of the weight vector. To ensure that the SVM correctly classifies the training data and maximises the margin, it must satisfy the following constraints:

i. For each positive training example (𝐱𝑖 with 𝑦𝑖 = +1): 𝐰 ⋅ 𝐱𝑖 + 𝑏 ≥ 1
ii. For each negative training example (𝐱𝑖 with 𝑦𝑖 = −1): 𝐰 ⋅ 𝐱𝑖 + 𝑏 ≤ −1

The above constraints can be combined into a single expression to get Eq. (13):

𝑦𝑖(𝐰 ⋅ 𝐱𝑖 + 𝑏) ≥ 1  (13)

This is the standard formulation of the linear SVM optimisation problem. Nonetheless, when confronted with intricate systems, such as the one under consideration, we enhance the foundational principles established in linear scenarios by incorporating a kernel function. Employing SVMs with kernel methods involves working in high-dimensional feature spaces, allowing for the construction of non-linear decision boundaries.
The decision function for an SVM with a kernel 𝐾 can be expressed as Eq. (14):

𝑓(𝐱) = ∑_{𝑖=1}^{𝑛} 𝛼𝑖𝑦𝑖𝐾(𝐱𝑖, 𝐱) + 𝑏  (14)

Here, the 𝛼𝑖 are the Lagrange multipliers obtained during the optimisation process, and 𝐾(𝐱𝑖, 𝐱) is the kernel function. The optimisation problem for an SVM with kernel methods is given by:

Minimise (1/2) ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} 𝛼𝑖𝛼𝑗𝑦𝑖𝑦𝑗𝐾(𝐱𝑖, 𝐱𝑗) − ∑_{𝑖=1}^{𝑛} 𝛼𝑖

subject to the constraints:

∑_{𝑖=1}^{𝑛} 𝛼𝑖𝑦𝑖 = 0, 0 ≤ 𝛼𝑖 ≤ 𝐶 for 𝑖 = 1, 2, … , 𝑛,

where 𝐶 is the regularisation parameter that controls the trade-off between achieving a low training error and a large margin. The decision boundary is determined by the support vectors, which are the data points 𝐱𝑖 corresponding to non-zero Lagrange multipliers 𝛼𝑖. The kernel function 𝐾(𝐱𝑖, 𝐱𝑗) implicitly computes the dot product of the data points in a higher-dimensional space, allowing SVMs to capture complex, non-linear relationships in the data. Commonly used kernel functions include:

i. Linear kernel (𝐾(𝐱𝑖, 𝐱𝑗) = 𝐱𝑖 ⋅ 𝐱𝑗): Corresponds to the standard linear SVM described in Eq. (11).
ii. Polynomial kernel (𝐾(𝐱𝑖, 𝐱𝑗) = (𝐱𝑖 ⋅ 𝐱𝑗 + 𝑐)^𝑑): Introduces non-linearity through polynomial terms.
iii. Radial Basis Function or Gaussian kernel (𝐾(𝐱𝑖, 𝐱𝑗) = exp(−‖𝐱𝑖 − 𝐱𝑗‖² / 2𝜎²)) (Ding, Liu, Yang, & Cao, 2021): Provides a smooth, non-linear decision boundary.
iv. Sigmoid kernel (𝐾(𝐱𝑖, 𝐱𝑗) = tanh(𝛽𝐱𝑖 ⋅ 𝐱𝑗 + 𝜃)): Represents a hyperbolic tangent function, introducing non-linearities.

These expressions capture the essence of SVM with kernel methods, which leverage the mathematical concept of the kernel to implicitly operate in a high-dimensional space, enabling the modelling of non-linear relationships in the data.

2.5.3. k-Nearest Neighbours (k-NN)

The basic idea underlying the nearest neighbour algorithm is rather simple: instances are grouped based on the class of their nearest neighbours. It is frequently advantageous to take into account not just one but several neighbours in order to increase the robustness of this method.
Therefore, the commonly known approach is the k-Nearest Neighbours (k-NN) algorithm, where the class is determined based on the consensus of the k nearest neighbours. The algorithm requires the availability of training examples during runtime, meaning they must be stored in memory at the time of execution. Consequently, it can also be referred to as a memory-based algorithm (Cunningham & Delany, 2020). As the real learning or model construction is deferred until runtime when predictions are needed, this technique is considered a form of lazy learning, making it flexible and adaptive to varying data patterns encountered during runtime.

Analytically, we can express the k-NN algorithm by considering a given dataset 𝐷 with 𝑛 data points in a feature space, where each data point 𝑖 is represented by a feature vector 𝐱𝑖 and is associated with a class label 𝑦𝑖 (for classification). For a new data point 𝐱 that we want to classify or predict, the k-NN algorithm operates as follows:

i. Distance metric: Let 𝑑(𝐱𝑖, 𝐱) be a distance metric that measures the distance between data point 𝐱𝑖 and the query point 𝐱. Common distance metrics include the Euclidean distance, the Manhattan distance, or other suitable measures based on the problem at hand.
ii. Nearest neighbours: Identify the k nearest neighbours of the query point 𝐱 from the dataset 𝐷 based on the chosen distance metric. Let 𝑁(𝐱) represent the set of indices of these k nearest neighbours.
iii. k-NN classification: For classification tasks, assign the class label 𝑦 to the query point based on majority voting among the class labels of its k nearest neighbours, as given in Eq. (15):

𝑦 = argmax_𝑐 ∑_{𝑖∈𝑁(𝐱)} 𝛿(𝑦𝑖, 𝑐)  (15)

where 𝛿(𝑦𝑖, 𝑐) is the Kronecker delta function that equals 1 if 𝑦𝑖 = 𝑐 and 0 otherwise.

The k-NN algorithm essentially relies on the assumption that points nearby in the feature space are likely to have similar labels or target values.
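The classification rule of Eq. (15) can be sketched in a few lines (a minimal illustration using the Euclidean distance; the toy data are arbitrary):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Eq. (15): majority vote among the k nearest training points,
    using the Euclidean distance as d(x_i, x)."""
    d = np.linalg.norm(X_train - x, axis=1)        # distances to the query
    neighbours = np.argsort(d)[:k]                 # index set N(x)
    labels, counts = np.unique(y_train[neighbours], return_counts=True)
    return labels[np.argmax(counts)]               # argmax over classes c

# Two tiny clusters with labels 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
pred_low = knn_predict(X, y, np.array([0.05, 0.1]))   # near the class-0 cluster
pred_high = knn_predict(X, y, np.array([0.95, 1.0]))  # near the class-1 cluster
```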
The choice of the distance metric and the value of k are critical parameters that influence the algorithm's performance.

2.5.4. Logistic Regression (LR)

LR is a widely used statistical method in machine learning for binary classification tasks. This model predicts the probability that an instance belongs to a particular class, and it is particularly useful in scenarios where the dependent variable is categorical and binary, such as predicting whether an event will occur or not. Statistically, the LR model is expressed in Eq. (16):

𝑃(𝑌 = 1) = 1 / (1 + 𝑒^−(𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑛𝑋𝑛))  (16)

Here, 𝑃(𝑌 = 1) represents the probability of the positive class, 𝛽0 is the intercept, 𝛽1, 𝛽2, … , 𝛽𝑛 are the coefficients, and 𝑋1, 𝑋2, … , 𝑋𝑛 are the feature values. Keep in mind that each coefficient represents the change in the log-odds of the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant. Also, the intercept represents the log-odds of the event when all independent variables are zero.

This method is advantageous in machine learning for several reasons. Firstly, LR inherently produces probabilities, making the model interpretable and allowing decision-makers to understand the confidence level of predictions. Additionally, LR handles linear and non-linear relationships between features and the log-odds of the outcome, providing flexibility in capturing complex patterns in the data. Altman's Z-score, a popular bankruptcy prediction model, similarly combines multiple financial ratios into a single measure of financial health (Altman, 1968).

2.5.5. Gradient Boosting (GB)

The initial GB approach, often referred to as the GB Machine, was introduced by Friedman (1999, 2002).
This method acts as the cornerstone algorithm that lays the groundwork for subsequent advancements in boosting techniques such as XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova, Gusev, Vorobev, Dorogush, & Gulin, 2018). GB is a powerful machine learning ensemble technique that combines the predictions of multiple weak learners, usually decision trees, to create a robust and accurate model.

Given a training dataset (𝑥𝑖, 𝑦𝑖), where 𝑥𝑖 represents the input features and 𝑦𝑖 the corresponding target labels, the objective is to construct an ensemble of weak learners ℎ(𝑥). The final prediction is obtained by combining these weak learners in an additive manner, as presented in Eq. (17):

𝐹(𝑥) = ∑_{𝑚=1}^{𝑀} 𝛽𝑚ℎ𝑚(𝑥)  (17)

where 𝑀 denotes the number of weak learners, 𝛽𝑚 represents the weight assigned to each learner, and ℎ𝑚(𝑥) is an individual weak learner. The fundamental idea behind GB is to iteratively fit new models to the errors of the existing ensemble, thereby reducing the residual errors in predictions. In each iteration 𝑚, the model 𝐹(𝑥) is updated to get Eq. (18):

𝐹𝑚(𝑥) = 𝐹𝑚−1(𝑥) + 𝜆𝑚 ⋅ ℎ𝑚(𝑥)  (18)

where 𝐹𝑚(𝑥) is the composite model at iteration 𝑚, 𝐹𝑚−1(𝑥) is the model from the previous iteration, and 𝜆𝑚 is the learning rate that controls the contribution of each weak learner. At each iteration, the negative gradient of the loss function with respect to the current model 𝐹𝑚−1(𝑥) is calculated, denoted as −∂𝐿(𝐹𝑚−1(𝑥))/∂𝐹𝑚−1(𝑥). The weak learner ℎ𝑚(𝑥) is then trained to fit the negative gradient, minimising the local approximation of the loss. This iterative process continues until a predefined number of weak learners has been incorporated into the ensemble.

The effectiveness of GB in bankruptcy prediction has been demonstrated in various studies.
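The additive update of Eq. (18) can be illustrated with a minimal sketch (an illustration, not the authors' implementation): under squared loss the negative gradient is simply the current residual, so each new one-split stump is fitted to the residuals of the ensemble so far. The toy target and parameter values below are arbitrary.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump for residuals r under squared loss."""
    best = (np.inf, None, None, None)
    for t in np.unique(x)[:-1]:                 # last value leaves the right side empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return lambda q: np.where(q <= t, left_val, right_val)

def gradient_boost(x, y, M=50, lr=0.1):
    """Eq. (18): F_m = F_{m-1} + lr * h_m, each h_m fitted to the negative
    gradient of squared loss, i.e. the current residuals y - F."""
    F = np.full_like(y, y.mean())
    learners = []
    for _ in range(M):
        h = fit_stump(x, y - F)                 # fit a stump to the residuals
        F = F + lr * h(x)
        learners.append(h)
    return lambda q: y.mean() + lr * sum(h(q) for h in learners)

x = np.linspace(0.0, 1.0, 80)
y = (x > 0.5).astype(float)                     # a step function to learn
model = gradient_boost(x, y)
p_low = float(model(np.array([0.1]))[0])        # should approach 0
p_high = float(model(np.array([0.9]))[0])       # should approach 1
```

Because each stump captures the remaining step in the residuals, the residual shrinks geometrically with the learning rate, mirroring the iterative error-fitting described above.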
For instance, in a comprehensive analysis by Carmona, Climent, and Momparler (2019), GB algorithms were employed to develop predictive models for bankruptcy, showcasing superior performance compared to other machine learning techniques. The study emphasised the importance of ensemble methods, such as GB, in achieving high predictive accuracy and robustness in the context of bankruptcy prediction.

2.6. Bayesian hyperparameter tuning

The basis of Bayesian hyperparameter tuning is Bayesian optimisation, a probabilistic model-based scheme for maximising costly black-box functions. The goal of this work is to determine which combination of hyperparameters maximises an objective function (in this case, the performance metric for the machine learning models discussed in Section 2.5). The first step of this method is to define the objective function f(x) with hyperparameters denoted by x. For instance, in machine learning classification, the model's hyperparameters are represented by x, whilst the accuracy, F1 score, or any other assessment metric could be represented by f(x). We adopted the well-known Gaussian process (GP) to model the objective function. This is because a GP creates a probabilistic estimate with associated uncertainty by finding the distribution of likely values of the objective function at a particular location. It consists of a covariance function κ(x, x′) and a mean function m(x), as depicted in Eq. (19):

f(x) ∼ GP(m(x), κ(x, x′))    (19)

The notation κ(x, x′) models the correlation between various points in the input space, whereas m(x) records the objective function's average behaviour.

The next step is to iteratively select the next point for evaluation in the parameter space based on an acquisition function. This strategy aims to maintain a balance between exploration, which involves sampling in areas of high uncertainty, and exploitation, which involves sampling in areas where optimal solutions are expected to exist.
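One iteration of this select-evaluate loop can be sketched with scikit-learn's Gaussian process regressor and a simple upper-confidence-bound acquisition. The paper does not specify its acquisition function or implementation, so the toy objective, kernel, and UCB trade-off constant below are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):                # stand-in for f(x), e.g. cross-validated accuracy
    return -(x - 0.3) ** 2

# Hyperparameter configurations evaluated so far (values scaled to [0, 1]).
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = objective(X_obs[:, 0])

# GP surrogate for f(x), as in Eq. (19).
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_obs, y_obs)

# Acquisition: mean + kappa * std balances exploitation (high mean)
# against exploration (high predictive uncertainty).
grid = np.linspace(0, 1, 101).reshape(-1, 1)
mu, sigma = gp.predict(grid, return_std=True)
ucb = mu + 1.5 * sigma
x_next = grid[np.argmax(ucb)]    # next configuration to evaluate
print(float(x_next[0]))
```

Evaluating the objective at `x_next` and refitting the GP on the enlarged set of observations is exactly the posterior update described next.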
The goal is to systematically navigate the search space through an intelligent selection of evaluation points and effectively combine the search for new information with the use of existing knowledge to control the optimisation process. This careful balance allows the algorithm to efficiently explore unknown regions of the parameter space whilst exploiting regions where the objective function is likely to reach its optimal value.

After specifying the new hyperparameter configuration and evaluating the objective function, the GP model is updated using Bayesian inference. The updated GP posterior distribution is given by Eq. (20):

P(f(x) | D) = [P(D | f(x)) · P(f(x))] / P(D)    (20)

Here, D represents the data collected so far, P(D | f(x)) is the likelihood of the data under the GP model, P(f(x)) is the prior distribution of the GP, and P(D) is the marginal likelihood.

Table 3 provides a comprehensive summary of the optimal hyperparameters for the selected machine learning algorithms. For each model, details are provided on the specific hyperparameters tuned, the range or choices considered during the tuning process, and the best parameter values found. For example, the best parameters of the RF model include 64 estimators and a maximum depth of 6, while the SVM model achieved optimal performance with a C value of 1e6 and a gamma value of 0.00143 using an RBF kernel.

2.7. Domain adaptation learning

In our developed model, we addressed the challenge of imbalanced financial data by partitioning the information generated by the Genetic Algorithm (GA) into source and target domains. The source domain aimed to maintain a balanced representation of the majority class, while the target domain deliberately mirrored the imbalances present in the initial dataset. To establish a balanced subset for the source domain, we applied a clustering approach to the imbalanced training dataset, specifically focusing on the majority class.
The cluster centroids, representing characteristic samples of the dominant class, were retained. This clustering process utilised the imblearn Python package.5 The resulting balanced subset served as the foundation for building a model with improved representativeness, capturing various patterns and features from the initially imbalanced data. This crucial step of constructing a balanced source domain supported a more thorough understanding of the underlying patterns and eased subsequent domain adaptation, allowing the model to learn robust and widely applicable properties. It also reduced the possibility of bias towards particular cases.

Table 3
Best hyperparameters for different models.

Algorithm/Model name       Hyperparameter                              Best parameter
Random forest              n_estimators: (10, 100)                     n_estimators: 64
                           max_depth: (1, 20)                          max_depth: 6
                           min_samples_split: (2, 10)                  min_samples_split: 2
                           min_samples_leaf: (1, 10)                   min_samples_leaf: 1
Support vector machine     C: (1e−6, 1e+6, log-uniform)                C: 1e+6
                           gamma: (1e−6, 1e+1, log-uniform)            gamma: 0.00143
                           kernel: [linear, poly, rbf, sigmoid]        kernel: rbf
Logistic regression        C: (1e−6, 1e+6, log-uniform)                C: 1.7086
                           penalty: [l1, l2]                           penalty: l2
                           solver: [lbfgs, newton-cg, sag, saga]       solver: sag
                           max_iter: (100, 1000, 10000)                max_iter: 1000
                           tol: (1e−6, 1e−3, log-uniform)              tol: 4.4260e−5
                           fit_intercept: [True, False]                fit_intercept: False
                           class_weight: [None, balanced]              class_weight: None
Gradient boosting          n_estimators: (10, 100)                     n_estimators: 42
                           learning_rate: (1e−6, 1e+1, log-uniform)    learning_rate: 0.5144
                           max_depth: (1, 20)                          max_depth: 7
                           min_samples_split: (2, 10)                  min_samples_split: 4
                           min_samples_leaf: (1, 10)                   min_samples_leaf: 10
k-nearest neighbours       n_neighbours: (1, 20)                       n_neighbours: 1
                           weights: [uniform, distance]                weights: uniform
                           p: [1, 2]                                   p: 2
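The balanced source-domain construction described above can be sketched as follows. The paper uses the imblearn package's cluster-centroid undersampling; to keep the sketch self-contained, the same idea is reproduced here with scikit-learn's KMeans on hypothetical data (class sizes, dimensions, and the random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Imbalanced training set: 900 majority-class (0) vs 100 minority-class (1) samples.
X_maj = rng.normal(0.0, 1.0, size=(900, 4))
X_min = rng.normal(2.0, 1.0, size=(100, 4))

# Undersample the majority class to its cluster centroids, as imblearn's
# ClusterCentroids does internally, so both classes end up with 100 samples.
centroids = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_maj).cluster_centers_

X_source = np.vstack([centroids, X_min])
y_source = np.array([0] * 100 + [1] * 100)
print(X_source.shape, np.bincount(y_source))   # (200, 4) [100 100]
```

The retained centroids act as characteristic samples of the dominant class, so the source domain is balanced without discarding the majority class's overall structure.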
Conversely, the target domain was created from the unbalanced training dataset by deliberately manipulating the distribution to reflect imbalances, with a focus on the minority class. This deliberate imbalance aimed to simulate challenging circumstances for the model during adaptation. The method employed a random sampling procedure on minority-class instances, ensuring that the target domain retained the intrinsic complexity and biases of the original data.

Following the creation of the source and target domains, the challenge of class imbalance was addressed by introducing sample weights during model training. This proactive approach aimed to prevent the dominating majority class from overshadowing the learning process, promoting a more effective and balanced model. For the source domain, sample weights were calculated based on class frequencies, with less frequent classes receiving higher weights. In the target domain, sample weights were designed to address imbalances, assigning higher weights to the minority class. Both source and target domain sample weights underwent normalisation to maintain relative importance, ensuring their sum equalled 1. By integrating sample weights, the model prioritised minority-class instances during training, enhancing its adaptability and performance in scenarios with prevalent class imbalances in the target domain.

5 https://pypi.org/project/imblearn/

To ensure that the selected features are evenly distributed in both the source and target domains, we implement a quantile transformation as a means of standardising these features. According to studies such as Gallón, Loubes, and Maza (2013), Liu et al. (2019), Pan and Zhang (2018), and Peterson and Cavanaugh (2019), this approach is preferred over alternative standardisation methods like mean normalisation or z-score
scaling because of its robustness in handling outliers and non-Gaussian distributions.

Algorithm 3 Quantile Transformation Normalisation
Require: X: input dataset with n samples and m features
Ensure: normalised dataset X_norm
1: Initialise an empty array X_norm to store the normalised values
2: for i ← 1 to m do                    ⊳ Iterate over each feature
3:   Sort the values of feature i in ascending order
4:   Calculate the quantiles for each value based on its rank and the total number of samples
5:   Initialise an empty array X_norm_feature to store the normalised values of feature i
6:   for j ← 1 to n do                  ⊳ Iterate over each sample
7:     Calculate the rank of the jth sample in feature i
8:     Calculate the percentile of the jth sample based on its rank and the total number of samples
9:     Map the percentile to a standard normal distribution using the inverse cumulative distribution function (CDF) of the normal distribution
10:    Store the mapped value in X_norm_feature
11:  end for
12:  Append X_norm_feature to X_norm
13: end for
14: return X_norm

The pseudocode in Algorithm 3 outlines the process of quantile transformation normalisation, a method used to transform data so that it follows a standard normal distribution. It starts by iterating over each feature in the dataset and sorts the values of each feature in ascending order. It then calculates the quantiles for each value based on its rank and the total number of samples. For each sample, it calculates its percentile from its rank and the total number of samples, and maps this percentile to a standard normal distribution using the inverse cumulative distribution function (CDF) of the normal distribution. These mapped values are stored in new arrays for each feature, and the normalised feature arrays are appended to form the normalised dataset, which is returned as the
output. By using quantile transformation, feature distributions between domains are more robustly and efficiently aligned.

Algorithm 4 Domain Adaptation Pipeline
1: Load Datasets:
2:   Load the source and target datasets, X_source, y_source and X_target, y_target, respectively.
3:   Balance the source dataset and create an imbalanced target dataset.
4: Calculate Sample Weights:
5:   Calculate sample weights for the source and target domains:
6:   sample_weights_source ← 0.5 Σ_i (y_source == i)
7:   sample_weights_target_train ← 0.7 Σ_i (y_target_train == i)
8:   sample_weights_target_test ← 0.3 Σ_i (y_target_test == i)
9: Standardise Data:
10:  Standardise the features using quantile transformation:
11:  X_source_standardised ← QuantileTransform(X_source)
12:  X_target_train_standardised ← QuantileTransform(X_target_train)
13:  X_target_test_standardised ← QuantileTransform(X_target_test)
14: Bayesian Optimisation:
15:  Iterate over classifiers and perform Bayesian optimisation.
16:  Find optimal hyperparameters.
17: Evaluate Optimised Classifiers:
18:  For each optimised classifier:
19:  Fit the classifier on the source domain.
20:  Transform source features.
21:  Train a transfer model on the target domain.
22:  Make predictions on the target domain testing set.

With regard to DAL, Bayesian optimisation is used to optimise the hyperparameters described in Section 2.6 by defining a parameter space tailored to the model classifiers in Section 2.5. In order to optimise based on cross-validated accuracy, each classifier's hyperparameter space is methodically explored and exploited using Bayesian optimisation. This makes it possible to determine the ideal hyperparameters that improve each classifier's performance in the DAL setting. Once the classifiers are optimised, a model transfer is executed by fitting these fine-tuned classifiers to the source domain data.
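The standardisation, weighting, and transfer steps of Algorithm 4 can be condensed into a short sketch. The synthetic data, the choice of classifier, and the re-fitting of a fresh model on the target domain are simplifying assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# Hypothetical balanced source domain and imbalanced target domain.
X_source = rng.normal(size=(200, 5)); y_source = np.repeat([0, 1], 100)
X_target = rng.normal(size=(150, 5)); y_target = np.r_[np.zeros(135), np.ones(15)].astype(int)

# Quantile transformation maps both domains towards a standard normal distribution.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
Xs = qt.fit_transform(X_source)
Xt = qt.transform(X_target)

# Inverse-frequency sample weights, normalised to sum to 1 per domain,
# so less frequent classes receive higher weights.
def weights(y):
    w = 1.0 / np.bincount(y)[y]
    return w / w.sum()

source_model = RandomForestClassifier(random_state=0).fit(
    Xs, y_source, sample_weight=weights(y_source))
# Transfer step: carry the tuned configuration over and re-fit on the
# weighted target data (a simplification of the paper's transfer model).
transfer_model = RandomForestClassifier(random_state=0).fit(
    Xt, y_target, sample_weight=weights(y_target))
print(weights(y_target).sum())
```

The printed weight sum confirms the normalisation constraint (weights summing to 1) stated for both domains.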
After that, the pertinent characteristics are extracted from the source domain and used to train a new model on the unbalanced target domain. The main focus here is on using the data from the source domain to adapt the model to the specifics of the target domain, which in turn improves performance on the target domain testing set. Algorithm 4 depicts a comprehensive pipeline for DAL. This proposed process involves several key steps aimed at leveraging knowledge from relevant source domains to adapt machine learning models to work effectively in the target domain.

3. Performance metrics

3.1. Confusion matrix

In assessing the performance of a classification model, such as our DAL classifier for bankruptcy prediction, the evaluation is based on the test data, and the results are presented using a confusion matrix. The confusion matrix, denoted M(φ) for a model classifier φ, is defined in Eq. (21):

M(φ) = Σ_{i,j=1}^{t} m_{ij}(φ),    (21)

where M(φ) is a t × t square matrix for a target domain test set with t target labels. Each entry m_{ij}(φ) of M(φ) represents the number of values belonging to a target label i but assigned to a different target label j by φ. Specific computations derived from Eq. (21) include:

a. Σ_{j=1}^{t} m_{ij}(φ), the sum of all values of target label i ∈ t.
b. Σ_{i=1}^{t} m_{ij}(φ), the sum of all values of target label j ∈ t.
c. Σ_{i=1}^{t} m_{ii}(φ), the diagonal entries, which count the validly classified target labels.

The introduction of Eq. (21) provides a fundamental framework for comprehending the performance metrics of our classification model. This formula is essential because it captures the essence of a confusion matrix, a vital instrument for evaluating the precision and effectiveness of the model. Each entry of this matrix signifies the instances where the model correctly or incorrectly assigned a label, forming the basis for various performance measures.
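The measures derived from the confusion matrix can be sketched in a few lines with scikit-learn; the labels and predictions below are hypothetical, with 1 as the positive class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test labels and model predictions.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# For binary labels [0, 1], ravel() yields the 2x2 entries in order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)      # identical to recall
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)   # 4 4 1 1 0.8
```

These are exactly the row, column, and diagonal sums of Eq. (21) specialised to the 2 × 2 case.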
Given our binary classification focus, the confusion matrix becomes a 2 × 2 matrix, encompassing true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs). These components denote the counts of correctly identified positive (no bankruptcy) and negative (bankruptcy) instances and their misclassifications, respectively. For clarity, in our binary classification scenario, "Positive" signifies the absence of bankruptcy (no bankruptcy), whilst "Negative" corresponds to the presence of bankruptcy. Table 4 summarises various classification measures derived from the confusion matrix, such as sensitivity, specificity, precision, recall, and accuracy, along with their respective computations (Hossin & Sulaiman, 2015).

3.2. Probability calibration

Probability calibration is a concept in machine learning that refers to the alignment of predicted probabilities with the true likelihood of the corresponding events. In classification tasks, machine learning models often output predicted probabilities that represent the model's confidence in its predictions. Ideally, these predicted probabilities should accurately reflect the actual probabilities of the events being predicted. One common method for probability calibration is Platt scaling (Böken, 2021; Niculescu-Mizil & Caruana, 2005), which involves fitting a logistic regression model to the predicted probabilities generated by the original model. This additional calibration step can help refine the predicted probabilities to be more accurate. Mathematically, let Λ(x) be the output of the machine learning model (before calibration) for a given instance x; Λ(x) is often a raw score or logit. The calibrated probability p_c is then obtained using the sigmoid (logistic) function defined in Eq.
(22):

p_c(x) = 1 / (1 + e^(α·Λ(x) + β))    (22)

where p_c(x) is the calibrated probability, Λ(x) is the output of the original model, and α and β are the parameters of the logistic regression model, which are learned during the calibration process. During calibration, we typically use a set of labelled data to train the function in Eq. (22). The labels are the true class labels (0 or 1), and the input to the logistic function is Λ(x). The logistic regression model is trained to minimise the negative log-likelihood of the true labels given the calibrated probabilities. Once the model is trained, the learned parameters α and β are used to calibrate new predicted probabilities.

The Brier score is a metric used to assess the calibration of probabilistic predictions. For binary classification, the Brier score is calculated as the mean squared difference between the predicted probabilities and the actual outcomes:

Brier score = (1/N) Σ_{i=1}^{N} (y_i − p_i)²    (23)

Table 4
Classification measures for DAL classifier performance evaluation.

Name         Description                                                                    Computation
Sensitivity  The ability of the classifier to correctly identify all positive scenarios.    TP/(TP + FN)
Specificity  The ability of the classifier to correctly reject all negative scenarios.      TN/(TN + FP)
Precision    The ratio of relevant scenarios correctly identified by the classifier.        TP/(TP + FP)
Recall       The ratio of all relevant scenarios correctly identified by the classifier.    TP/(TP + FN)
Accuracy     The overall ability of the classifier to make correct decisions, considering   (TP + TN)/(TP + TN + FP + FN)
             both positive and negative scenarios.

Fig. 4.
Comparative confusion matrices of bankruptcy prediction models in DAL: This figure depicts the performance differences between the classifiers (RF, SVM, LR, GB, k-NN, and SE) in recognising cases that are bankrupt and those that are not. Notable patterns include SVM and LR exhibiting good results, and RF and SE demonstrating improved sensitivity with larger true positives and reduced false positives. On the other hand, k-NN and GB show comparatively more false positives.

where N is the number of instances in the dataset, y_i is the true binary outcome (0 or 1) for instance i, and p_i is the predicted probability of the positive class for instance i. In the case of a multi-class classification problem, we can generalise Eq. (23) to get Eq. (24):

Brier score = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} (y_{ik} − p_{ik})²    (24)

where K is the number of classes, y_{ik} is an indicator variable equal to 1 if the true class for instance i is k and 0 otherwise, and p_{ik} is the predicted probability of class k for instance i. This score ranges from 0 to 1, where a lower score indicates better calibration (better alignment of predicted probabilities with actual outcomes).

4. Results and discussion

The approach adopted in this work uses a 5-fold cross-validation strategy within the Bayesian optimisation process to ensure reliable and generalisable hyperparameter tuning and model evaluation. This method divides the dataset into five subsets, iteratively using four subsets for training and one for validation. This reduces the risk of overfitting and provides a comprehensive assessment of model performance across different data splits. Additionally, by employing balanced source and target datasets and applying sample weighting and resampling techniques such as ClusterCentroids, RandomOverSampler, and RandomUnderSampler, the validation process is further refined.
These steps effectively address class imbalance issues, ensuring that evaluation metrics such as accuracy, ROC AUC, and classification reports accurately reflect the models' capabilities in handling imbalanced data scenarios within the domain adaptation framework. The discussion that follows provides insights into the model's performance and its implications for stakeholders in the financial domain. For ease of comprehension, in our binary classification scenario, "0" represents the negative class, or the absence of the event being measured (non-bankrupt), whilst "1" denotes the positive class, or the presence of the event being measured (bankrupt).

In Fig. 4, multiple classifiers, including RF, SVM, LR, GB, k-NN, and the stacking ensemble (SE), exhibited varying confusion matrices. Take note that the SE algorithm amalgamates predictions from multiple base classifiers, in our case SVM, LR, GB, and k-NN. These predictions are then processed by a meta-learner, with RF serving as the meta-learner in this case. The meta-learner is trained on the predictions made by the base classifiers to produce the final prediction. The consistent trend across these matrices reveals that the models generally perform well in identifying true negatives (non-bankrupt companies) but show differences in their ability to correctly classify bankrupt instances. Notably, the RF and SE models produced higher true positives (355 and 351, respectively) and lower false positives (118 and 111, respectively), indicating a better ability to identify companies at financial risk. SVM and LR also demonstrated favourable results, while k-NN and GB showed relatively higher false positives, potentially leading to unnecessary concerns for non-bankrupt entities. Stakeholders in the financial sector should consider these nuances when selecting a model, with a focus on minimising false negatives to avoid overlooking actual bankruptcies.
The absence of evident class imbalance in the confusion matrices suggests that the classification models perform well in recognising both bankrupt and non-bankrupt cases, without a significant bias towards one class over the other. This balanced performance is crucial for accurate predictions and indicates the effectiveness of the models in handling the imbalanced target data.

Fig. 5. Comparative analysis of classification results for bankruptcy prediction models. This figure presents performance insights for RF, SVM, LR, GB, k-NN, and SE based on precision, recall, and F1-score metrics for both non-bankrupt (0) and bankrupt (1) classes.

Fig. 5 reveals that the RF and SE models achieved well-rounded performance across all metrics (precision, recall, and F1-score) for both bankrupt and non-bankrupt classifications. This suggests their effectiveness in accurately identifying companies in financial distress and those that are financially stable. While SVM and LR also performed well, they exhibited a trade-off between precision and recall. SVM prioritised identifying most bankrupt companies (high recall), even if it meant misclassifying some healthy ones (lower precision). Conversely, LR focused on avoiding false positives (high precision for non-bankrupt companies) at the expense of potentially missing some bankrupt firms (lower recall). GB followed a similar trend to LR, demonstrating high precision for non-bankrupt companies but lower recall for bankrupt ones. k-NN achieved a reasonable balance but fell slightly behind RF and SE in both precision and recall. For financial institutions, these results suggest that RF and SE might be preferable choices due to their balanced performance in identifying both bankrupt and non-bankrupt companies.
However, the optimal model selection should be tailored to specific priorities and operational needs. Financial stakeholders should carefully consider the potential consequences of misclassifications. A high tolerance for false positives (mistakenly classifying a healthy company as bankrupt) might favour models with high precision, such as LR. Conversely, situations where missing a truly bankrupt company (a false negative) is more concerning might call for models with high recall, such as SVM. Ultimately, the trade-off between precision and recall should be weighed based on risk tolerance and the potential impact of misclassifications.

The ROC curve results showcased in Fig. 6 for the various models indicate their ability to discriminate between bankrupt and non-bankrupt instances. The mean Area Under the Curve (AUC) values displayed on the ROC curve provide a summary of each model's overall performance across multiple evaluations. Higher AUC values generally suggest better model performance in terms of correctly ranking instances by their predicted probabilities. RF and SVM exhibit the highest mean AUC, indicating robust discriminative power. LR, GB, and the SE also perform well, with AUC values in the high 80s to low 90s. k-NN, while still respectable, shows a slightly lower mean AUC. These AUC results suggest that RF and SVM are particularly strong performers in distinguishing between bankrupt and non-bankrupt cases based on their predicted probabilities.

Fig. 6. ROC curve analysis of DAL models for bankruptcy prediction: The ROC curve results showcase the discrimination abilities of RF, SVM, LR, GB, k-NN, and SE. The AUC values quantify overall model performance, with RF and SVM demonstrating the highest AUC (93%).

The choice between these models may depend on other considerations, such as interpretability and computational efficiency. LR, GB, and the SE also demonstrate reliable discriminatory capabilities, making them suitable alternatives.
k-NN, while showing decent performance, has a slightly lower mean AUC and may require additional scrutiny.

In the graphical representation (Fig. 7), the decision boundary is a crucial aspect of understanding how machine learning models make predictions by separating different classes in the feature space. The decision boundary of RF (first graph from the left) is nonlinear and can adapt well to intricate patterns. In this graph, we can observe that RF creates a piecewise-constant decision boundary, as it combines multiple decision trees to form a consensus prediction. The SVM plot (second graph from the left) aims to find a hyperplane that maximally separates classes in the feature space. In the context of the high AUC results,

Fig. 7. Decision boundaries of DAL models for bankruptcy prediction: This figure illustrates the decision boundaries of RF (first from the left), SVM (second from the left), LR (third from the left), GB (fourth from the left), and k-NN (last from the left) based on their performance characteristics. RF and SVM are expected to have flexible, potentially nonlinear boundaries, while LR exhibits linear structures. GB combines simple decision boundaries, contributing to a complex overall boundary. k-NN's locally adaptive boundary is influenced by the data point distribution. Feature 1 refers to the most significant financial ratio, 'Return on Assets (ROA) Before Interest and % After Tax', while Feature 2 represents the next most important financial ratio for predicting bankruptcy, 'Net Income to Stockholder's Equity'.

Fig. 8. Illustration of calibration curves and Brier scores for binary classifiers. The Brier scores (RF: 0.12, SVM: 0.13, LR: 0.14, GB: 0.15, k-NN: 0.14) indicate well-calibrated probabilistic predictions. The calibration curves showcase the alignment of predicted probabilities with true positive-class probabilities.
SVM establishes a discriminative hyperplane that effectively separates bankrupt and non-bankrupt instances. Here, the decision boundary is nonlinear because the best-performing kernel function was the RBF kernel. LR models the log-odds of the probability of belonging to a particular class. The decision boundary for LR is a linear function of the input features; since the relationship between the features and the log-odds is assumed to be linear, the decision boundary is a hyperplane. GB builds an ensemble of weak learners, typically decision trees, sequentially, with each tree aiming to correct the errors of the previous ones. Its decision boundary is a combination of simpler decision boundaries, leading to a complex and adaptive overall boundary. k-NN classifies instances based on the majority class among their k nearest neighbours. Its decision boundary is flexible and nonlinear, adapting to the local density of instances in the feature space.

A well-calibrated model should produce predicted probabilities that are indicative of the true probability of belonging to the positive class. One way to assess calibration is by examining a calibration curve and calculating the Brier score, as illustrated in Fig. 8. The Brier scores for each model are as follows: RF with a Brier score of 0.12, SVM with 0.13, LR with 0.14, GB with 0.15, and k-NN with 0.14. These scores are relatively low, suggesting that the models provide well-calibrated probabilistic predictions. To gain further insights into the behaviour of each classifier, we analyse the distribution of samples across predicted probability bins. For example, dividing the predicted probabilities into bins (e.g., 0−0.1, 0.1−0.2, …, 0.9−1.0) and counting the number of samples in each bin reveals how concentrated or dispersed the predicted probabilities are.
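This binning and calibration check can be sketched with scikit-learn. The predictions below are synthetic and deliberately well calibrated (each outcome is drawn with exactly its stated probability), so the curve should track the diagonal and the Brier score should sit near its calibrated optimum; thresholds in the final print are illustrative:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(4)
# Hypothetical, perfectly calibrated predictions: each label y_i is drawn
# to be 1 with probability exactly p_i.
p = rng.uniform(size=2000)
y = (rng.uniform(size=2000) < p).astype(int)

# Brier score (Eq. (23)) and a 10-bin calibration curve.
brier = brier_score_loss(y, p)
frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)

# Counts per predicted-probability bin (0-0.1, 0.1-0.2, ..., 0.9-1.0).
counts, _ = np.histogram(p, bins=np.linspace(0, 1, 11))
print(brier < 0.25, np.max(np.abs(frac_pos - mean_pred)) < 0.15)
```

For calibrated uniform probabilities the expected Brier score is E[p(1−p)] ≈ 0.167, and the per-bin fraction of positives stays close to the bin's mean predicted probability, which is what the diagonal calibration curve expresses.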
In general, a well-calibrated model exhibits a calibration curve that closely follows the diagonal line (y = x), and this is exactly what the calibration curves exhibit.

The histograms shown in Fig. 9 visually illustrate the central tendencies and spatial distribution of predicted probabilities for each model (RF, SVM, LR, GB, and k-NN) across all instances. In general, a well-calibrated model's histogram of mean predicted probabilities would ideally exhibit a balanced distribution. Specifically, a peak or concentration around 0.5 indicates that the model is uncertain about the class assignment for many instances, while values closer to 0 or 1 signify higher confidence in the predicted class. Models that are poorly calibrated may exhibit skewed histograms, indicating a mismatch between predicted probabilities and the true likelihood of positive outcomes. This aligns with the concept of calibration discussed by Kull, Filho, and Flach (2017). RF (first from the left) and SVM (second from the left) are both known for their discriminative power; we therefore expect to observe histograms with a clear peak or concentration towards the extremes (0 or 1). This suggests that these models are confident in their predictions and have effectively separated instances into distinct classes. In the case of LR, the model exhibits a smoother and more spread-out histogram, reflecting its inherent probabilistic nature. The mean predicted probability histogram for GB demonstrates a more refined distribution compared to models like RF. Hence, we expect to observe a histogram with peaks or concentrations that reflect the model's ability to iteratively improve its predictions. Peaks at 0 or 1 in the histogram are indicative of instances where boosting rounds consistently strengthen a specific class assignment, signifying a higher level of confidence in the predictions.
k-NN, which is based on local patterns in the data, produces a histogram with a more varied distribution. Instances where the nearest neighbours are consistently of the same class result in peaks at 0 or 1, indicating higher confidence, while instances with mixed neighbours might lead to a peak around 0.5.

Tables 5 and 6 present a comprehensive comparison of traditional machine learning approaches and DAL techniques for predicting bankruptcy for Taiwanese and Polish companies, respectively. The evaluation metrics used provide information about the performance of the models under different conditions. In Table 5, accuracy is generally higher in traditional learning for most classifiers. For example, the RF classifier achieves an accuracy of 0.95 in traditional learning, compared to 0.82 in DAL. Similar trends are observed for SVM, LR, and GB. Precision for Class 0 remains consistently high in both learning paradigms, with traditional learning showing a slight edge. For example, SVM has a precision of 0.96 in traditional learning versus 0.95 in DAL. However, DAL occasionally shows higher precision for Class 1, such as with LR (0 vs. 0.77). Recall for Class 1 is notably poor in traditional learning, as seen with SVM (0) compared to DAL (0.96). This indicates that traditional models struggle to identify Class 1 instances. Domain adaptation techniques significantly improve recall for both classes across all classifiers. The F1 score for Class 1 in DAL is considerably higher than in traditional learning. To illustrate, SVM's F1 score is 0.36 in traditional learning compared to 0.82 in DAL, highlighting the improved balance between precision and recall. The DAL method demonstrates competitive AUC-ROC values and improved calibration (Brier scores), suggesting better reliability and discrimination in predictions under shifting data distributions. The accuracy remains higher in traditional learning for the Polish bankruptcy data as well, with SVM
Fig. 9. Histograms depicting the central tendencies and spatial distribution of predicted probabilities for RF, SVM, LR, GB, and k-NN models across all instances. Well-calibrated models exhibit a balanced distribution around 0.5, indicating uncertainty in class assignment, while values closer to 0 or 1 signify higher