Materials Science in Semiconductor Processing 161 (2023) 107427

Explainable machine learning for predicting the band gaps of ABX3 perovskites

David O. Obada a,b,c,d,*, Emmanuel Okafor e,**, Simeon A. Abolade a, Aniekan M. Ukpong b,f, David Dodoo-Arhin g, Akinlolu Akande a,***

a Mathematical Modelling and Intelligent Systems for Health and Environment Research Group, School of Science, Atlantic Technological University, Ash Lane, Ballytivnan, Sligo, F91 YW50, Ireland
b Theoretical and Computational Condensed Matter and Materials Physics Group (TCCMMP), School of Chemistry and Physics, University of KwaZulu-Natal, Pietermaritzburg, 3201, South Africa
c Multifunctional Materials Laboratory, Shell Office Complex, Department of Mechanical Engineering, Ahmadu Bello University, Zaria, 810222, Nigeria
d Africa Centre of Excellence on New Pedagogies in Engineering Education, Ahmadu Bello University, Zaria, 810222, Nigeria
e SDAIA-KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum and Minerals, 31261, Saudi Arabia
f National Institute for Theoretical and Computational Sciences (NITheCS), Pietermaritzburg, 3201, South Africa
g Department of Materials Science and Engineering, University of Ghana, Legon, LG 25, Ghana

Keywords: Ensemble learning; Neural networks; Band gaps; Explainable artificial intelligence

ABSTRACT

In this study, we trained and compared explainable machine learning algorithms for predicting the band gaps of perovskite materials with the formula ABX3, covering both zero and non-zero band gaps. Six supervised learning models, five ensemble learning methods and one neural network (CompoundNet), were employed to study the non-linear relationship that exists between the band gap and the characteristics of the constituent elements, such as electronegativity, covalent radius, first ionization energy, and row in the periodic table. The machine learning (ML) models were trained on datasets obtained from density functional theory (DFT) calculations. The results show that the CatBoost and XGBoost models yielded the lowest predictive errors and the highest coefficients of determination (R2 ≥ 88%) among all approaches in the testing phase. Furthermore, the Shapley Additive Explanation (SHAP) method was used to explain the models in terms of the elemental composition of each perovskite compound from the physics standpoint, and a novel holistic feature ranking of the explained models was proposed. One key insight gained from the SHAP analysis is that the Pauling electronegativity of the B-site cation in the cubic perovskites, which characteristically plays an important role in the electronic properties of this class of materials, is the feature that contributes most to the prediction of the band gaps. These results reveal the potential of ML to predict materials properties quickly and accurately, with datasets useful in the engineering of efficient solar cell devices.

* Corresponding author. Mathematical Modelling and Intelligent Systems for Health and Environment Research Group, School of Science, Atlantic Technological University, Ash Lane, Ballytivnan, Sligo, F91 YW50, Ireland.
** Corresponding author. SDAIA-KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum and Minerals, 31261, Saudi Arabia.
*** Corresponding author. Mathematical Modelling and Intelligent Systems for Health and Environment Research Group, School of Science, Atlantic Technological University, Ash Lane, Ballytivnan, Sligo, F91 YW50, Ireland.
E-mail addresses: david.obada@atu.ie (D.O. Obada), emmanuel.okafor@kfupm.edu.sa (E. Okafor), akinlolu.akande@atu.ie (A. Akande).
https://doi.org/10.1016/j.mssp.2023.107427
Received 8 November 2022; Received in revised form 1 February 2023; Accepted 27 February 2023; Available online 24 March 2023
1369-8001/© 2023 Elsevier Ltd. All rights reserved.

Fig. 1. Block diagram illustrating the learning processes for the supervised learning methods.
Fig. 2. Block diagram illustration of the system learning pipeline.
Fig. 3. Training phase correlation plot for each of the supervised learning models' predictions of either direct or indirect band gaps compared with their respective actual values, for 95% of the entire data.

1. Introduction

It is well known that perovskite compounds show a remarkable variety of mechanical, electrical, optical, magnetic, and transport properties [1–3]. Typically, the structure of an ideal perovskite with the general formula ABX3 consists of A cations which occupy 12-fold coordination sites, B cations in the center, and corner-sharing octahedra of X anions. Over the years, perovskite solar cells (PSC) have attracted attention in the field of photovoltaics because of their performance potential and simple fabrication processes, amongst others [4,5], and the power conversion efficiency (PCE) of PSCs has reached 25%–25.7% [6–8]. The performance of PSCs is determined by factors such as the band gap, electrical conductivity, high carrier mobility, remarkable energy level alignment, and a low density of defects at the interfaces. Amongst all these factors, the design of the band gap of the perovskite layer is crucial, as it directly determines the response to the solar spectrum [9,10]. Perovskite semiconductors create a unique opportunity in the engineering of materials because of the large design space: the variety of cation–anion combinations gives band gaps that span the visible spectrum. Given the complex design space of ABX3 compounds, it is difficult to explore all the possible combinations both theoretically and experimentally. From a theoretical standpoint, the band gaps of semiconductors obtained from traditional DFT simulations are underestimated within the generalized gradient approximation (GGA) [11]; this can be overcome by using hybrid functionals or many-body perturbation theory (GW) [12,13]. Nonetheless, these more accurate theoretical approaches are more computationally expensive and can be difficult to apply to a vast array of materials. A possible approach to overcome this limitation is the application of ML models to enhance the predictions. Typically, in the ML approach, the properties of the materials under investigation are calculated by DFT or obtained from laboratory experiments for a small sample size, which is then used to train statistical ML models. The model learns the trends in the data distribution during the training phase. Thereafter, the model is used to predict the properties of new datasets based on the trends it has learned. Contextually, when accurate, high-performing ab-initio calculations are performed on a set of ABX3 compounds, the results can be used to train ML models to obtain predictions for the remaining materials in the large design space.

Several studies have focused on using ML models to predict the band gaps of ABX3-type compounds [14–18]. To highlight a few, Lee et al. used the support-vector regression ML technique to predict the band gaps of 156 binary compounds using descriptors which include the band gaps obtained from DFT calculations using GGA, and obtained a root mean squared error (RMSE) of 180 meV [19]. Gladkikh et al. [20] studied the non-linear mappings that exist between the band gap and the properties of the elements using the Alternating Conditional Expectations ML technique (a method useful for small datasets) and compared the results with other ML methods; they concluded that the ML methods that most successfully captured these mappings were Kernel Ridge Regression and Extremely Randomized Trees. Liu et al. [7] used ML to predict the experimental band gaps of 227 perovskites obtained from 1254 recent publications, with the aim of identifying the four best models out of 24 kinds of ML models. The models achieved high accuracy, with an RMSE as low as 0.55. In addition, explainable ML was used to explain the effect of each chemical composition for their proposed models, and this further established the potential of ML to accurately predict the band gaps of perovskite materials. In a study conducted by Huang et al. [21], the band gaps of 300 wurtzite nitride semiconductors were calculated using DFT; these datasets were then used to train many ML models for predicting the band gaps, and of all the ML models tested, the best performance was achieved using support-vector regression. Pilania et al. [22] used the kernel-ridge regression technique and 16 element-specific descriptors to predict the band gaps of 1306 double perovskites and obtained an RMSE of 80 meV. Rath et al. [23] classified ABX3-type perovskites into direct and indirect band gap materials using the XGBoost classifier with datasets of 1528 ABX3 compounds and obtained an average accuracy of about 72.8%. In a recent article published by Lyu et al. [24], it was shown that machine-learning models could be useful for low-dimensional organic–inorganic halide perovskites. To allow for a critical investigation of the models, interpretable machine learning using Shapley additive explanations (SHAP) was adopted, and the SHAP analysis was performed to determine which of the descriptors used in the prediction was most important. One major finding from their SHAP analysis is that the absence of transition metals increased the probability of the perovskite having a direct band gap.

Table 1. Performance evaluation metric comparison of the different machine learning techniques' predictions of direct band gaps of ABX3 perovskites on 95%–5% data splits.

Techniques | Train MAE | Train RMSE | Train R2 | Test MAE | Test RMSE | Test R2
CATBOOST | 0.149 ± 0.005 | 0.205 ± 0.006 | 0.993 ± 4.658 × 10⁻⁴ | 0.633 ± 0.112 | 0.867 ± 0.119 | 0.887 ± 0.004
XGBOOST | 0.005 ± 0.001 | 0.008 ± 0.001 | 0.999 ± 2.910 × 10⁻⁶ | 0.661 ± 0.213 | 0.871 ± 0.246 | 0.880 ± 0.056
RANDOM FOREST | 0.324 ± 0.004 | 0.508 ± 0.004 | 0.959 ± 0.001 | 0.798 ± 0.235 | 0.966 ± 0.224 | 0.860 ± 0.034
COMPOUNDNET | 0.074 ± 0.029 | 0.129 ± 0.051 | 0.997 ± 0.002 | 0.749 ± 0.234 | 1.090 ± 0.347 | 0.818 ± 0.074
DECISION TREE | 0.000 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 | 0.765 ± 0.194 | 1.070 ± 0.308 | 0.813 ± 0.094
LIGHTGBM | 0.678 ± 0.012 | 0.941 ± 0.027 | 0.860 ± 0.009 | 0.920 ± 0.342 | 1.140 ± 0.349 | 0.801 ± 0.078

Table 2. Performance evaluation metric comparison of the different machine learning techniques' predictions of indirect band gaps of ABX3 perovskites on 95%–5% data splits.

Techniques | Train MAE | Train RMSE | Train R2 | Test MAE | Test RMSE | Test R2
CATBOOST | 0.137 ± 0.003 | 0.185 ± 0.004 | 0.994 ± 2.329 × 10⁻⁴ | 0.572 ± 0.150 | 0.776 ± 0.227 | 0.906 ± 0.028
XGBOOST | 0.005 ± 0.001 | 0.007 ± 0.002 | 0.999 ± 4.942 × 10⁻⁶ | 0.588 ± 0.225 | 0.773 ± 0.304 | 0.904 ± 0.050
RANDOM FOREST | 0.313 ± 0.006 | 0.497 ± 0.006 | 0.956 ± 0.001 | 0.697 ± 0.264 | 0.857 ± 0.324 | 0.884 ± 0.053
COMPOUNDNET | 0.068 ± 0.041 | 0.126 ± 0.072 | 0.996 ± 0.003 | 0.595 ± 0.194 | 0.827 ± 0.279 | 0.869 ± 0.089
LIGHTGBM | 0.640 ± 0.013 | 0.902 ± 0.026 | 0.854 ± 0.007 | 0.808 ± 0.357 | 1.002 ± 0.441 | 0.837 ± 0.092
DECISION TREE | 0.000 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 | 0.634 ± 0.110 | 1.068 ± 0.235 | 0.803 ± 0.097

SHAP, which originated from cooperative game theory, has been used to interpret many complex ML models, the so-called black-box models. In 2017, Lundberg and Lee [25] proposed the SHAP value to explain various models for better interpretation. Before the adoption of SHAP, feature importance was used to explain ML models. Although feature importance reflects the impact of each feature on the final model directly, it has proven deficient in judging the relationship between the features and the predicted results. Moreover, in the context of using several ML algorithms, a global ranking of the input features across all the models may not be feasible. Therefore, there is a need to propose computationally efficient methods for the accurate global ranking of input features when several ML algorithms are used.

To the best of our knowledge, this is the first work that holistically employs explainable ML models for the regressive prediction of direct and indirect band gaps of inorganic perovskite compounds. Another contribution is the use of the explainable model (SHAP) to assess the feature importance, or rationale, behind how these ML models can accurately predict the band gaps of the inorganic perovskite materials. To assess the importance of these features, we propose a novel holistic ranking method for identifying the most prominent feature from the explained models.

In this study, the band gaps of 199 perovskites with the formula ABX3 were modeled in terms of the element-specific descriptors of the individual elements, viz. electronegativity, covalent radius, first ionization energy, and row in the periodic table. We compared the new ANN model, called CompoundNet, with the ensemble machine learning techniques Decision Trees, Random Forest, CatBoost, XGBoost, and LightGBM. Finally, the results are explained from the physics standpoint using Shapley Additive Explanations (SHAP) to evaluate the effect of the specific descriptors on the band gap predictions.
In what follows, we present the data description, distribution and pre-processing in Section 2, the methodology in Section 3, the results and discussion in Section 4, and we conclude in Section 5.

Fig. 4. Testing phase correlation plot for each of the supervised learning models' predictions of either direct or indirect band gaps compared with their respective actual values, for 5% of the entire data.

Table 3. Performance evaluation metric comparison of the different machine learning techniques' predictions of direct band gaps of ABX3 perovskites on 80%–20% data splits.

Techniques | Train MAE | Train RMSE | Train R2 | Test MAE | Test RMSE | Test R2
CATBOOST | 0.124 ± 0.003 | 0.166 ± 0.004 | 0.996 ± 3.656 × 10⁻⁴ | 0.845 ± 0.094 | 1.390 ± 0.186 | 0.697 ± 0.099
XGBOOST | 0.003 ± 0.001 | 0.004 ± 0.001 | 0.999 ± 1.264 × 10⁻⁶ | 0.810 ± 0.126 | 1.460 ± 0.177 | 0.664 ± 0.116
RANDOM FOREST | 0.342 ± 0.011 | 0.507 ± 0.020 | 0.959 ± 0.004 | 0.963 ± 0.175 | 1.450 ± 0.264 | 0.665 ± 0.150
LIGHTGBM | 0.720 ± 0.025 | 0.991 ± 0.066 | 0.843 ± 0.021 | 1.130 ± 0.106 | 1.610 ± 0.170 | 0.599 ± 0.101
COMPOUNDNET | 0.065 ± 0.027 | 0.129 ± 0.050 | 0.997 ± 0.002 | 0.991 ± 0.124 | 1.680 ± 0.252 | 0.557 ± 0.141
DECISION TREE | 0.000 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 | 0.993 ± 0.142 | 1.790 ± 0.245 | 0.505 ± 0.124

Table 4. Performance evaluation metric comparison of the different machine learning techniques' predictions of indirect band gaps of ABX3 perovskites on 80%–20% data splits.

Techniques | Train MAE | Train RMSE | Train R2 | Test MAE | Test RMSE | Test R2
CATBOOST | 0.116 ± 0.006 | 0.155 ± 0.008 | 0.996 ± 4.873 × 10⁻⁴ | 0.795 ± 0.103 | 1.301 ± 0.226 | 0.699 ± 0.125
RANDOM FOREST | 0.329 ± 0.012 | 0.495 ± 0.025 | 0.957 ± 0.005 | 0.909 ± 0.182 | 1.386 ± 0.310 | 0.650 ± 0.189
XGBOOST | 0.003 ± 0.006 | 0.004 ± 0.001 | 0.999 ± 2.006 × 10⁻⁶ | 0.791 ± 0.127 | 1.397 ± 0.276 | 0.647 ± 0.178
LIGHTGBM | 0.687 ± 0.029 | 0.965 ± 0.073 | 0.835 ± 0.025 | 1.073 ± 0.109 | 1.529 ± 0.192 | 0.591 ± 0.131
COMPOUNDNET | 0.053 ± 0.018 | 0.087 ± 0.037 | 0.998 ± 0.001 | 0.879 ± 0.146 | 1.521 ± 0.366 | 0.576 ± 0.237
DECISION TREE | 0.000 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 | 0.812 ± 0.170 | 1.574 ± 0.365 | 0.557 ± 0.195

2. Data description, distribution and preprocessing

In this study, we have used datasets obtained from the work of Körbel et al. [26]. The authors used the more accurate hybrid HSE06 exchange-correlation functional to calculate the band gaps of 199 compounds. The numerical values for the adopted 199 compounds are outlined in Table ESI-1 in the Supplementary Information of Ref. [26]. Hence, in this study, we used the obtained HSE06 band gaps to train our models. We calculated the tolerance and octahedral factors to further establish the formation of perovskites and the stability of the compounds. The Goldschmidt tolerance factor was evaluated using the ionic radii compiled by Shannon [27]. We also performed first-principles calculations with the Vienna Ab-initio Simulation Package (VASP) at the PBE GGA functional level (results not shown) on randomly selected cubic perovskites from the 199 compounds, to ensure that the datasets we have adopted are reproducible [28,29].
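This screening step can be sketched as follows. The formulas used are the standard definitions of the Goldschmidt tolerance factor and the octahedral factor, and the Shannon radii used for CsPbI3 are approximate, illustrative values rather than numbers taken from this work.

```python
# A minimal sketch (not the authors' code) of the perovskite screening step described
# above: the Goldschmidt tolerance factor t and the octahedral factor mu from ionic radii.
# The radii below are approximate Shannon values for CsPbI3 and are illustrative only.
from math import sqrt

def tolerance_factor(r_a: float, r_b: float, r_x: float) -> float:
    """Goldschmidt tolerance factor t = (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (sqrt(2) * (r_b + r_x))

def octahedral_factor(r_b: float, r_x: float) -> float:
    """Octahedral factor mu = r_B / r_X."""
    return r_b / r_x

r_cs, r_pb, r_i = 1.88, 1.19, 2.20   # ionic radii in Angstrom (assumed, illustrative values)
print(f"t  = {tolerance_factor(r_cs, r_pb, r_i):.2f}")   # ~0.85, within the usual perovskite range
print(f"mu = {octahedral_factor(r_pb, r_i):.2f}")        # ~0.54
```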
Different features have been proposed as descriptors for the properties of materials [30,31]. In this study, a simple set of element-specific descriptors has been used. For each of the elements in the ABX3 compound, the electronegativity, covalent radius, first ionization energy, and row in the periodic table were used because, from the physics perspective, the selected descriptors have an influence on the band gaps of the compounds, and these descriptors can improve the training of machine learning models, leading to better performance and greater accuracy. This gave 12 features in total per compound. The 12-dimensional feature space has proven to be effective in the prediction of the magnitude of band gaps when regression techniques are used [32].

The calculated band gap examples from the ABX3 perovskites data, as obtained from Ref. [24], were partitioned in the ratio 95%:5% into the training set and testing set, respectively. Furthermore, the input features of the dataset were normalized to the scale [0, 1] using the expression in equation (1):

Xn = (Xj − min(Xj)) / (max(Xj) − min(Xj))    (1)

where Xj denotes the raw input features and Xn represents the normalized input features. The effective normalized input features, with the corresponding continuous output labels for each of the aforementioned datasets, were passed to the supervised learning algorithms.

Fig. 5. Explainability of the supervised learning model prediction of the direct band gap, revealing the feature importance influencing the model prediction.

3. Methods

In this section, we briefly describe the supervised learning algorithms and the corresponding explainable artificial intelligence (XAI) tools used in this study for a better understanding of the theoretical background.

3.1. Supervised learning algorithms

3.1.1. CompoundNet
CompoundNet is a feedforward artificial neural network that consists of three main units: an input unit, a hidden unit, and an output unit. CompoundNet is a multilayer perceptron (MLP), an example of a supervised learning algorithm that can be used for performing classification or regression tasks. In our experiment, the hidden unit consists of five neural network layers, whereby each layer contains 64 network nodes. The rationale behind the choice of 64 network nodes is based on an intuitive design philosophy whereby we attempt to explore a uniform nodal distribution based on the formula 2^6 = 64. Each network node within the hidden and output layers generates feature maps as defined in the expression:

Y^l_mlp = b^l_j + Σ_k (W^(l−1)_kj × X^(l−1)_k)    (2)

Here, the hypothetical model output, denoted by Y^l_mlp, computes the sum of the weighted inputs Σ_k (W^(l−1)_kj × X^(l−1)_k) and the corresponding bias b^l_j in the R^(1×1) dimensional space. Note that the variables W_kj and X^(l−1)_k account for the input weights and input features, respectively, based on the dimension R^(1×12) per example of the perovskite material composition. The predictive error (cost function) was computed using the information from the actual output and the predicted outcome of the hypothetical model. We used the Adam optimizer [33] to optimize the predictive error via backpropagation and yield the optimal weights needed for the CompoundNet model to predict direct or indirect band gaps from the perovskite examples in the testing phase.

Fig. 6. Explainability of the supervised learning model prediction of the direct band gap, revealing the feature importance influencing the model prediction (mean SHAP value).
Fig. 7. Explainability of the supervised learning model prediction of the indirect band gap, revealing the feature importance influencing the model prediction.
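The pre-processing of Eq. (1) and a CompoundNet-like regressor can be sketched with scikit-learn as follows. Only the five 64-node hidden layers and the Adam optimizer come from the description above; the remaining hyperparameters are assumptions, and the arrays are random stand-ins for the 12 descriptors and the HSE06 band gaps.

```python
# A minimal sketch (assumptions noted above) of the Eq. (1) normalization and a
# CompoundNet-like MLP: five hidden layers of 64 nodes trained with the Adam optimizer.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((199, 12))                 # stand-in for the 12 element-specific descriptors
y = 6.0 * rng.random(199)                 # stand-in for the HSE06 band gaps (eV)

# 95%:5% split, as described in Section 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=0)

# Eq. (1): min-max normalization of each feature to [0, 1]
scaler = MinMaxScaler().fit(X_train)
X_train_n, X_test_n = scaler.transform(X_train), scaler.transform(X_test)

# CompoundNet-like MLP; learning rate, iteration budget, etc. are assumed values
compound_net = MLPRegressor(hidden_layer_sizes=(64, 64, 64, 64, 64),
                            solver="adam", max_iter=2000, random_state=0)
compound_net.fit(X_train_n, y_train)
print(f"Test R2: {compound_net.score(X_test_n, y_test):.3f}")
```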
3.1.2. Random forest
Random Forest (RF) [34] is one of the traditional ensemble learning techniques and was originally derived from the bagging aggregation principle. The method is created by integrating several instances of de-correlated estimator trees [35], and it computes the average over the aggregation of several base learners before determining the most likely continuous output (estimating an average score from the base learners). For a training example given as Dj = (Xj, Yj) ∀ j ∈ N, the variable Xj represents the actual input features and Yj denotes the actual output from the original dataset. For a given normalized input feature Xn ∈ [0, 1]^k, our goal is to calculate a regression function Yrf(x) = E[Y | X = Xj] within the dataset Dj:

Yrf(Dj) = E_W[Yj(Wk, X ∈ Dj)]    (3)

The variable E_W denotes the estimated output from the random forest. An RF is a predictive estimator that collects a set of randomized base regression trees {Yrf(Wk, X ∈ Dj), k ≥ 2}, where the weight variable W = {W1, W2, …, Wk} is a randomly distributed variable. The random output decision trees are integrated to generate an aggregation of several regression trees. We used a random forest (Yrf) containing 10,000 base estimators when conducting our experiments.

Fig. 8. Explainability of the supervised learning model prediction of the indirect band gap, revealing the feature importance influencing the model prediction (mean SHAP value).

Table 5. Explainability model feature importance ranking in the testing phase when considering direct band gap prediction; the power index represents the feature's positional ranking.

Methods | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 | F12
CATBOOST | 1^1 | 4^2 | 7^3 | 8^4 | 2^5 | 10^6 | 5^7 | 11^8 | 6^9 | 0^10 | 9^11 | 3^12
XGBOOST | 1^1 | 2^2 | 10^3 | 7^4 | 4^5 | 0^6 | 6^7 | 3^8 | 9^9 | 5^10 | 8^11 | 11^12
RANDOM FOREST | 1^1 | 8^2 | 5^3 | 7^4 | 2^5 | 4^6 | 10^7 | 3^8 | 0^9 | 11^10 | 6^11 | 9^12
COMPOUNDNET | 1^1 | 2^2 | 10^3 | 8^4 | 9^5 | 7^6 | 5^7 | 11^8 | 4^9 | 3^10 | 6^11 | 0^12
LIGHTGBM | 1^1 | 2^2 | 7^3 | 4^4 | 0^5 | 10^6 | 9^7 | 6^8 | 3^9 | 8^10 | 11^11 | 5^12
DECISION TREE | 1^1 | 8^2 | 7^3 | 4^4 | 2^5 | 10^6 | 3^7 | 5^8 | 6^9 | 9^10 | 11^11 | 0^12

3.1.3. Decision tree
A decision tree is an example of a supervised learning technique mainly used for solving classification or regression tasks [36,37]. Given an input feature space, decision trees operate based on the principles of entropy and information gain in the formation of a supervised learning model.

3.1.4. XGBoost
The eXtreme Gradient Boosting (XGBoost) method is a scalable tree boosting technique [38]; it relies on a sparsity-aware learning paradigm that allows multiple base-tree learners to predict sparse and clustered data. The main design philosophy of XGBoost is that it factors in data compression, cache accessibility, and sharding to create a more scalable decision tree predictive system.

3.1.5. CatBoost
CatBoost [39] is an example of an ensemble learning algorithm; the name is derived from the compound words "categorical boosting". A typical CatBoost relies on ordering its base learners and employs an innovative learning algorithm for handling categorical features. The main merit of CatBoost is its ability to address the prediction shift arising from output target leakage. This method is one of the most competitive state-of-the-art ensemble learning methods.
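The tree-based regressors of Sections 3.1.2–3.1.5 are all available through scikit-learn-compatible interfaces, and a possible instantiation is sketched below. This is an assumed configuration rather than the authors' exact code: only the 10,000-tree random forest follows a setting stated in the text, the remaining hyperparameters are defaults, and the arrays are random stand-ins. LightGBM (Section 3.1.6) would follow the same pattern via lightgbm.LGBMRegressor.

```python
# A minimal sketch (assumed hyperparameters) of instantiating and fitting the tree-based
# regressors described in Sections 3.1.2-3.1.5 on stand-in data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X_train = rng.random((189, 12))      # stand-in for the normalized 12-dimensional descriptors
y_train = 6.0 * rng.random(189)      # stand-in for the HSE06 band gaps (eV)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    # 10,000 base estimators, as stated in Section 3.1.2
    "Random Forest": RandomForestRegressor(n_estimators=10_000, random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
    "CatBoost": CatBoostRegressor(verbose=0, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "trained")
```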
3.1.6. LightGBM
The light gradient boosting method (LightGBM) [40] is another competitive ensemble learning method. It depends on decision trees and employs two main algorithmic paradigms: gradient-based one-side sampling and exclusive feature bundling. This method is often used for solving classification and regression tasks.

The block diagram describing the learning process for each of the described methods is shown in Fig. 1. The learning process starts with data preprocessing, followed by the actual learning using the ML models; model evaluation and prediction on the new data complete the process.

Table 6. Holistic feature ranking using all the outcomes from the explained machine learning models after predicting the direct band gap.

Rank | Sum of ranking Rs | Feature code | Feature
F1 | 6 | 1 | Pauling Electronegativity of Element 2
F2 | 21 | 2 | Pauling Electronegativity of Element 3
F3 | 23 | 7 | First Ionization Energy of Element 2
F4 | 30 | 4 | Covalent Radius of Element 2
F5 | 31 | 10 | Row (Element 2)
F6 | 33 | 8 | First Ionization Energy of Element 3
F7 | 47 | 5 | Covalent Radius of Element 3
F8 | 54 | 0 | Pauling Electronegativity of Element 1
F9 | 54 | 3 | Covalent Radius of Element 1
F10 | 54 | 9 | Row (Element 1)
F11 | 55 | 6 | First Ionization Energy of Element 1
F12 | 60 | 11 | Row (Element 3)

3.2. Explainable artificial intelligence

Many classical machine learning and deep learning techniques are often considered black boxes as a result of the limited internal information about the rationale behind their model interpretability [41]. Based on recent advances in AI, it has become pertinent to explore explainable artificial intelligence (XAI) and its relevance in understanding the feature importance that influences a given machine learning model prediction. An example of an XAI algorithm is SHapley Additive exPlanations (SHAP). SHAP is an explainability tool that relies on a unification of frameworks that allows researchers or experts to gain an insightful interpretation of complex predictive models. The core of the SHAP algorithm involves identifying a novel class of additive feature attributions and finding the unique solution in this class that satisfies a collection of desirable attributes. Overall, the SHAP estimation approach aligns effectively with human intuition. We considered two forms of SHAP explainers: a tree-based explainer was used for interpreting the ensemble learning models, while a sampling-based explainer was used for interpreting the CompoundNet model. A block diagram illustration of the developed system pipeline is shown in Fig. 2.
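A minimal, self-contained sketch of this explanation step is shown below. It applies shap.TreeExplainer to an XGBoost regressor fitted on random stand-in data and produces the beeswarm and mean-|SHAP| summaries analogous to Figs. 5–8; for the CompoundNet model a model-agnostic sampling explainer (e.g. shap.SamplingExplainer wrapping the network's predict function) would be used instead.

```python
# A minimal sketch (stand-in data, assumed settings) of the SHAP explanation step for a
# tree-based model, plus the two summary plots used in Figs. 5-8.
import numpy as np
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((199, 12))                                        # stand-in normalized descriptors
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0.0, 0.1, 199)    # stand-in band gaps (eV)

model = XGBRegressor(n_estimators=300, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)      # a sampling-based explainer is used for CompoundNet
shap_values = explainer.shap_values(X)

feature_names = [f"F{i}" for i in range(1, 13)]
shap.summary_plot(shap_values, X, feature_names=feature_names)                   # beeswarm, cf. Figs. 5 and 7
shap.summary_plot(shap_values, X, feature_names=feature_names, plot_type="bar")  # mean |SHAP|, cf. Figs. 6 and 8
```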
Table 7. Explainability model feature importance ranking in the testing phase when considering indirect band gap prediction; the power index represents the feature's positional ranking.

Methods | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 | F12
CATBOOST | 1^1 | 4^2 | 7^3 | 2^4 | 8^5 | 5^6 | 10^7 | 0^8 | 11^9 | 6^10 | 3^11 | 9^12
XGBOOST | 1^1 | 2^2 | 7^3 | 4^4 | 0^5 | 10^6 | 3^7 | 6^8 | 9^9 | 5^10 | 8^11 | 11^12
RANDOM FOREST | 1^1 | 2^2 | 7^3 | 5^4 | 4^5 | 8^6 | 10^7 | 9^8 | 3^9 | 0^10 | 11^11 | 6^12
COMPOUNDNET | 1^1 | 10^2 | 8^3 | 6^4 | 4^5 | 2^6 | 5^7 | 9^8 | 11^9 | 0^10 | 7^11 | 3^12
LIGHTGBM | 1^1 | 2^2 | 7^3 | 4^4 | 10^5 | 0^6 | 9^7 | 3^8 | 8^9 | 6^10 | 11^11 | 5^12
DECISION TREE | 1^1 | 2^2 | 7^3 | 4^4 | 10^5 | 6^6 | 3^7 | 9^8 | 8^9 | 0^10 | 11^11 | 5^12

3.3. Performance metrics

The generalization capacity of the trained models can be measured using the following performance metrics:

1. Coefficient of Determination: The coefficient of determination, commonly known as R2, is a metric employed for determining the degree of correlation existing between two or more sets of variables. R2 can also be described as the goodness of fit. R2 is defined within the range {0, 1}. If R2 = 1, the model is said to have a perfect fit and is highly reliable. However, if a model yields R2 = 0, then the hypothetical model can be described as yielding a poor correlation with weak generalization potential. The mathematical formula for R2 can be described as:

R2 = 1 − [Σ_j (Yj − Y^p_j)²] / [Σ_j (Yj − Ȳ)²]    (4)

where Yj accounts for the target values and Y^p_j denotes the predicted outputs from the described supervised learning algorithms. The variable Ȳ denotes the mean of Yj.

2. Root Mean Square Error (RMSE): The RMSE measures the effective difference between the actual experimental target output and the predicted model output. Furthermore, the RMSE measures the goodness of fit of the generated generalized regression model. The RMSE can be defined as:

RMSE = [ (1/n) Σ_j (Yj − Y^p_j)² ]^(1/2)    (5)

3. Mean Absolute Error (MAE): The MAE is another evaluation metric for calculating the absolute difference between the target output and the predicted model output (continuous variables):

MAE = (1/n) Σ_j |Yj − Y^p_j|    (6)
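As a quick, self-contained illustration (not the authors' code), the three metrics of Eqs. (4)–(6) can be computed with scikit-learn as follows; the band gap values below are hypothetical.

```python
# A short check of the metrics in Eqs. (4)-(6), assuming y_true are reference band gaps
# and y_pred the corresponding model predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.10, 0.00, 2.35, 3.20, 0.75])   # hypothetical band gaps (eV)
y_pred = np.array([1.02, 0.15, 2.50, 3.05, 0.80])   # hypothetical predictions (eV)

mae  = mean_absolute_error(y_true, y_pred)            # Eq. (6)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # Eq. (5)
r2   = r2_score(y_true, y_pred)                       # Eq. (4)
print(f"MAE = {mae:.3f} eV, RMSE = {rmse:.3f} eV, R2 = {r2:.3f}")
```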
4. Results and discussion

In this section, we provide the computational results obtained and discuss the research findings for the investigated supervised learning models. We start with the supervised learning models and then discuss our findings on the explainable ML.

4.1. Supervised learning performance evaluation

In Fig. 3, we show correlation plots of all the methods' predictions for both direct (Fig. 3a) and indirect band gaps (Fig. 3b) for one-fifth (15) of the total experimental runs within the training phase. The larger set of training examples, corresponding to 95% of the effective data, was used for generating each of the supervised learning models. The summarized evaluation performance results obtained from the supervised learning model predictions of the direct and indirect band gaps in the training and testing phases are reported in Table 1 and Table 2, respectively.

From Tables 1 and 2, we report that most of the ensemble learning techniques (CatBoost, XGBoost, and Decision Tree) and the neural network method (CompoundNet) yielded a coefficient of determination R2 > 0.99 and lower predictive errors when compared with LightGBM and Random Forest in the training phase. The R2 of the superior ensemble learning models indicates that there is a strong degree of correlation between the elemental composition of each ABX3-type perovskite and the band gaps. However, we observe that LightGBM and Random Forest experienced underfitting relative to the other techniques in the training phase. To further inspect the generalization potential of each of the methods, it is pertinent to explore the overall performance metrics for these methods in the testing phase. The testing phase correlation plots of all the methods predicting direct or indirect band gaps, observed for one-fifth of the total experimental runs, are shown in Fig. 4. Overall, the best methods (CatBoost and XGBoost) yielded the lowest predictive errors (MAE ≤ 0.66, RMSE ≤ 0.87) and the highest R2 ≥ 0.88 for the prediction of both direct and indirect band gaps. We generally observe that the Decision Tree technique suffers from an overfitting problem.

For the smaller number of training examples (80% of the entire data), the summary of the supervised learning performance indices for both the training and testing sets is presented in Table 3 and Table 4, respectively. From the described tables, we report an excellent R2 and minimal predictive errors for CatBoost, XGBoost, and Decision Trees in the training phase, but these methods were unable to generalize very well in the testing phase due to overfitting problems; this often emanates from the decision tree depth and specificity, which cannot capture the data distribution in the testing phase. However, in the testing phase, we report that CatBoost, when compared with the other methods, yielded the best R2 ≥ 0.697 and the least predictive errors (MAE ≥ 0.795 and RMSE ≥ 1.30). Hence we can infer from our evaluations that CatBoost has the best generalization potential and was capable of learning the multivariate input feature space. The remaining ensemble learning techniques and CompoundNet outperform the Decision Tree across all the evaluated metrics. These observations are the same for models generated using the larger and smaller numbers of training examples.

4.2. Explainability analysis and the proposed holistic feature ranking

To explain the model rationale, with respect to feature relevance, behind the goodness of one method relative to the other approaches, we examine the SHAP algorithm for assessing the impact of the feature importance of the ABX3 perovskites when the supervised learning algorithms are used to predict the indirect and direct band gaps. The SHAP graphs are shown in Figs. 5–8. In the feature importance graphs, the most important feature is found at the top, and the importance of the other features is ranked in descending order. As shown in Figs. 5–8, the Y-axis depicts the feature nomenclature, while the X-axis shows the corresponding mean of the magnitude of the SHAP values. The mean of the magnitude reveals the average impact the feature has on the model output: when the mean of the magnitude is high, the impact on the predicted value is high. The color of the points represents the value of the corresponding feature, with red representing high values and blue representing low values. It should be noted that, in Figs. 5 and 7, points of a feature on the right-hand side indicate that the feature contributes positively to the model prediction, while points on the left-hand side contribute negatively to the model predictions.

Based on the observation of the feature rankings from Figs. 5–8, we employed our novel holistic feature ranking method to determine the global feature ranking of the ABX3 perovskites across all the explained supervised learning models. A summary of our findings on the feature importance and the holistic feature ranking for all the methods is reported in Tables 5–8. From the latter, we report that the most important feature is the "Pauling Electronegativity of Element 2"; this feature appears as the first ranked across all the examined methods and has the least sum of ranks, as shown in Table 6 and Table 8. From Figs. 5–8, we report each method's feature importance ranking in Tables 5 and 7.
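As an illustration of how such a per-model ranking can be read off an explained model, the short sketch below (with a random stand-in for the explainer output) orders the twelve feature codes by decreasing mean |SHAP| value, which is the quantity plotted in Figs. 6 and 8.

```python
# A small sketch (stand-in data, assumed variable names) of deriving a per-model feature
# ranking, like those in Tables 5 and 7, from the mean |SHAP| values of one explained model.
import numpy as np

rng = np.random.default_rng(0)
shap_values = rng.normal(size=(10, 12))   # stand-in for an explainer's (n_samples, 12) output

mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Feature codes 0-11 ordered from most to least important for this model
ranking = np.argsort(mean_abs_shap)[::-1]
print(ranking)
```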
We then developed a hypothesis formulation to assess each of the methods by counting the frequency of the feature ranking across all the methods, so as to determine which of the features contributes the most to the trained supervised learning model predictions:

Rs(F^p_i) = Σ_i count(F^p_i) × p    (7)

where Rs is the sum of the effective ranking per feature over all the learning models, F^p_i represents the input feature having a code in the range {0–11}, the index variable p denotes the feature's positional (ranking) value, and i is the number of entries per feature. Suppose an input is given as F^p_i = 2 with positional values p = {2, 5}; the frequency of occurrence of the input feature F^p_i = 2 is given as count(F^p_i) = {3, 3}, and performing the calculation in equation (7) yields Rs = 3 × 2 + 3 × 5 = 21. By extending the same principle to the remaining features, the summarized best feature ranking is reported in Tables 6 and 8 for the prediction of the direct and indirect band gaps, respectively.
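A minimal sketch of the holistic ranking of Eq. (7) is given below. It aggregates the per-model rankings of Table 5 (direct band gap) and reproduces the sums of ranking Rs reported in Table 6, for example Rs = 6 for feature code 1 and Rs = 21 for feature code 2; the accumulation over positions is equivalent to the count × position formulation of Eq. (7).

```python
# A minimal sketch of the holistic ranking of Eq. (7): for every feature code, sum
# (occurrence count at a position) x (position) over all explained models.
# Smaller R_s means a more important feature. Rankings follow Table 5 (direct band gap).
from collections import defaultdict

# Per-model rankings: position 1 (most important) to 12, holding feature codes 0-11.
model_rankings = {
    "CatBoost":      [1, 4, 7, 8, 2, 10, 5, 11, 6, 0, 9, 3],
    "XGBoost":       [1, 2, 10, 7, 4, 0, 6, 3, 9, 5, 8, 11],
    "Random Forest": [1, 8, 5, 7, 2, 4, 10, 3, 0, 11, 6, 9],
    "CompoundNet":   [1, 2, 10, 8, 9, 7, 5, 11, 4, 3, 6, 0],
    "LightGBM":      [1, 2, 7, 4, 0, 10, 9, 6, 3, 8, 11, 5],
    "Decision Tree": [1, 8, 7, 4, 2, 10, 3, 5, 6, 9, 11, 0],
}

r_s = defaultdict(int)
for ranking in model_rankings.values():
    for position, feature_code in enumerate(ranking, start=1):
        r_s[feature_code] += position      # accumulating positions = count x position of Eq. (7)

# Global order: feature codes sorted by ascending R_s (Table 6 lists R_s = 6 for code 1).
for code, score in sorted(r_s.items(), key=lambda kv: kv[1]):
    print(f"feature code {code}: R_s = {score}")
```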
The SHAP analysis reveal that the range of electronegativity for the B cation of all the compounds is the most important feature which determines the band gaps and this is 11 Table 8 Holistic feature ranking using all the outcomes from the explained machine learning models after predicting indirect bandgap. Features F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 Sum of 6 18 24 26 32 43 49 50 51 52 54 63 ranking Rs Best feature 1 2 4 7 10 8 0 6 5 9 3 11 ranking order Feature Pauling Pauling Covalent First Row First Pauling First Covalent Row Covalent Row importance Electronegativity of Electronegativity of Radius of Ionization (Element Ionization Electronegativity of Ionization Radius of (Element Radius of (Element Element 2 Element 3 Element 2 energy of 2) Energy Element 1 Energy Element 3 1) Element 1 3) Element 2 Element 3 Element 1 D.O. Obada et al. M a t e r i a l s S c ie n c e i n S e m i c o n d u c t o r P r o c e s s i n g 161 (2023) 107427 in agreement with work carried out by Gladkikh et al., [20], and Rath Data availability et al., [23]. Data will be made available on request. 5. Conclusion References In this study, we have demonstrated that the CatBoost model has a more predictive power for the determination of direct and indirect band [1] A. Kojima, K. Teshima, Y. Shirai, T. Miyasaka, Organometal halide perovskites as gaps of ABX perovskites with a correlation score of R2 0.88 in the visible-light sensitizers for photovoltaic cells, J. Am. Chem. Soc. 131 (17) (2009) 3 ≥ 6050–6051. testing phase when models are generated from large data samples, [2] J. Wang, J. Neaton, H. Zheng, V. Nagarajan, S. Ogale, B. Liu, D. Viehland, which is fascinating for the practical exploitation of the algorithms. The V. Vaithyanathan, D. Schlom, U. Waghmare, et al., Epitaxial bifeo3 multiferroic SHAP analysis yielded the features and their impact on the predictions. thin film heterostructures, Science 299 (5613) (2003) 1719–1722. [3] S. Aharon, A. Dymshits, A. Rotem, L. Etgar, Temperature dependence of hole The electronegativities for the B cation in the cubic perovskite com- conductor free formamidinium lead iodide perovskite based solar cells, J. Mater. pounds were showcased as the most important feature in the prediction Chem. A 3 (17) (2015) 9171–9178. of the band gaps. The insights are crucial when designing materials in a [4] L. Meng, J. You, Y. Yang, Addressing the stability issue of perovskite solar cells for commercial applications, Nat. Commun. 9 (1) (2018) 1–4. large materials discovery space and when synthesizing light-harvesting [5] E.H. Jung, N.J. Jeon, E.Y. Park, C.S. Moon, T.J. Shin, T.-Y. Yang, J.H. Noh, J. Seo, perovskites. Efficient, stable and scalable perovskite solar cells using poly (3-hexylthiophene), The robust implementation of the ML algorithms can aid a deliberate Nature 567 (7749) (2019) 511–515. [6] Y. Zhao, F. Ma, Z. Qu, S. Yu, T. Shen, H.-X. Deng, X. Chu, X. Peng, Y. Yuan, discovery of new ABX3 perovskites with suitable band gaps for opto- X. Zhang, et al., Inactive (pbi2) 2rbcl stabilizes perovskite films for efficient solar electronic applications. This can invariably reduce trial and error ex- cells, Science 377 (6605) (2022) 531–534. periments in the laboratories and also reduce the number of ab initio [7] Y. Liu, W. Yan, H. Zhu, Y. Tu, L. Guan, X. Tan, Study on bandgap predications of abx3-type perovskites by machine learning, Org. Electron. 101 (2022), 106426. DFT calculations needed. This is therefore in resonance with the overall [8] L. Chu, S. Zhai, W. Ahmad, J. 
Zhang, Y. Zang, W. Yan, Y. Li, High-performance objective of materials informatics which accelerates the design and se- large-area perovskite photovoltaic modules, Nano Res. Energy 1 (2) (2022), lection of materials. Furthermore, this study demonstrated that the e9120024. newly proposed holistic ranking feature provides a simple and efficient [9] X.-X. Gao, W. Luo, Y. Zhang, R. Hu, B. Zhang, A. Züttel, Y. Feng, M.K. Nazeeruddin, Stable and high-efficiency methylammonium-free perovskite solar cells, Adv. global ranking across all the investigated methods. Mater. 32 (9) (2020), 1905502. Additionally, some of the ensemble learning methods outperformed [10] K.-G. Lim, S. Ahn, Y.-H. Kim, Y. Qi, T.-W. Lee, Universal energy level tailoring of the CompoundNet; however, there were instances the CompoundNet self-organized hole extraction layers in organic solar cells and organic–inorganic hybrid perovskite solar cells, Energy Environ. Sci. 9 (3) (2016) 932–939. was better than some ensemble learning techniques during the predic- [11] P. Mori-Sánchez, A.J. Cohen, W. Yang, Localization and delocalization errors in tion of the direct or indirect band gaps. Future work can explore using 1 density functional theory and implications for band-gap prediction, Phys. Rev. Lett. dimensional deep learning architecture involving a 1 k moving 100 (14) (2008), 146401. − × [12] J. Heyd, J.E. Peralta, G.E. Scuseria, R.L. Martin, Energy band gaps and lattice kernel convolving with the input feature of the ABX3 perovskites to parameters evaluated with the heyd-scuseria-ernzerhof screened hybrid functional, generate informative feature-maps that may help in yielding a possible J. Chem. Phys. 123 (17) (2005), 174101. improvement in the prediction of the band gaps. [13] M. Shishkin, G. Kresse, Self-consistent g w calculations for semiconductors and insulators, Phys. Rev. B 75 (23) (2007), 235102. [14] X. Cai, F. Liu, A. Yu, J. Qin, M. Hatamvand, I. Ahmed, J. Luo, Y. Zhang, H. Zhang, Funding Y. Zhan, Data-driven design of high-performance masnxpb1-xi3 perovskite materials by machine learning and experimental realization, Light Sci. Appl. 11 (1) (2022) 1–12. The authors wish to thank the Irish Research Council for funding [15] G.S. Thoppil, A. Alankar, Predicting the formation and stability of oxide granted to David O. Obada with Project ID GOIPD/2021/28. Most of the perovskites by extracting underlying mechanisms using machine learning, Comput. calculations were performed on the Kelvin cluster maintained by the Mater. Sci. 211 (2022), 111506. Trinity Centre for High Performance Computing. This cluster was fun- [16] M. Del Cueto, C. Rawski-Furman, J. Arago, E. Orti, A. Troisi, Data-driven analysis of hole-transporting materials for perovskite solar cells performance, J. Phys. Chem ded through grants from the Higher Education Authority, through its C 126 (31) (2022) 13053–13061. PRTLI program. The authors also wish to acknowledge the Irish Centre [17] V. Venkatraman, The utility of composition-based machine learning models for for High-End Computing (ICHEC) for the provision of computational band gap prediction, Comput. Mater. Sci. 197 (2021), 110637. [18] O. Allam, C. Holmes, Z. Greenberg, K.C. Kim, S.S. Jang, Density functional facilities and support. theory–machine learning approach to analyze the bandgap of elemental halide perovskites and ruddlesden-popper phases, ChemPhysChem 19 (19) (2018) CRediT authorship contribution statement 2559–2565. [19] J. Lee, A. Seko, K. Shitara, K. Nakayama, I. 
Tanaka, Prediction model of band gap for inorganic compounds by combination of density functional theory calculations David O. Obada: Writing – review & editing, Writing – original and machine learning techniques, Phys. Rev. B 93 (11) (2016), 115104. draft, Validation, Methodology, Investigation, Funding acquisition, [20] V. Gladkikh, D.Y. Kim, A. Hajibabaei, A. Jana, C.W. Myung, K.S. Kim, Machine learning for predicting the band gaps of abx3 perovskites from elemental Formal analysis, Data curation, Conceptualization. Emmanuel Okafor: properties, J. Phys. Chem. C 124 (16) (2020) 8905–8918. Writing – review & editing, Writing – original draft, Validation, Re- [21] Y. Huang, C. Yu, W. Chen, Y. Liu, C. Li, C. Niu, F. Wang, Y. Jia, Band gap and band sources, Methodology, Investigation, Formal analysis, Data curation, alignment prediction of nitride-based semiconductors using machine learning, J. Mater. Chem. C 7 (11) (2019) 3238–3245. Conceptualization. Simeon A. Abolade: Methodology, Formal analysis, [22] G. Pilania, A. Mannodi-Kanakkithodi, B. Uberuaga, R. Ramprasad, J. Gubernatis, Data curation. Aniekan M. Ukpong: Supervision. David Dodoo-Arhin: T. Lookman, Machine learning bandgaps of double perovskites, Sci. Rep. 6 (1) Formal analysis, Data curation. Akinlolu Akande: Writing – review & (2016) 1–10. editing, Validation, Supervision, Software, Resources, Project adminis- [23] S. Rath, G.S. Priyanga, N. Nagappan, T. Thomas, Discovery of direct band gap perovskites for light harvesting by using machine learning, Comput. Mater. Sci. tration, Methodology, Funding acquisition, Conceptualization. 210 (2022), 111476. [24] R. Lyu, C.E. Moore, T. Liu, Y. Yu, Y. Wu, Predictive design model for low- Declaration of competing interest dimensional organic–inorganic halide perovskites assisted by machine learning, J. Am. Chem. Soc. 143 (32) (2021) 12766–12776. [25] S.M. Lundberg, B. Nair, M.S. Vavilala, M. Horibe, M.J. Eisses, T. Adams, D. The authors declare that they have no known competing financial E. Liston, D.K.-W. Low, S.-F. Newman, J. Kim, et al., Explainable machine-learning interests or personal relationships that could have appeared to influence predictions for the prevention of hypoxaemia during surgery, Nat. Biomed. Eng. 2 (10) (2018) 749–760. the work reported in this paper [26] S. Körbel, M.A. Marques, S. Botti, Stability and electronic properties of new inorganic perovskites from high-throughput ab initio calculations, J. Mater. Chem. C 4 (15) (2016) 3157–3167. 12 D.O. Obada et al. M a t e r i a l s S c ie n c e i n S e m i c o n d u c t o r P r o c e s s i n g 161 (2023) 107427 [27] R.D. Shannon, Revised effective ionic radii and systematic studies of interatomic [36] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, distances in halides and chalcogenides, Acta Crystallogr. Sect. A Cryst. Phys. Diffr. A. Ng, B. Liu, P.S. Yu, et al., Top 10 algorithms in data mining, Knowl. Inf. Syst. 14 Theor. Gen. Crystallogr. 32 (5) (1976) 751–767. (1) (2008) 1–37. [28] G. Kresse, J. Furthmüller, Efficient iterative schemes for ab initio total-energy [37] O.Z. Maimon, L. Rokach, Data Mining with Decision Trees: Theory and calculations using a plane-wave basis set, Phys. Rev. B 54 (16) (1996), 11169. Applications, vol. 81, World scientific, 2014. [29] J.P. Perdew, K. Burke, M. Ernzerhof, Generalized gradient approximation made [38] T. Chen, C. Guestrin, Xgboost: a scalable tree boosting system, in: Proceedings of simple, Phys. Rev. Lett. 77 (18) (1996) 3865. 
the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data [30] L. Ward, A. Agrawal, A. Choudhary, C. Wolverton, A general-purpose machine Mining, ACM, 2016, pp. 785–794. learning framework for predicting properties of inorganic materials, npj Comput. [39] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, A. Gulin, Catboost: Mater. 2 (1) (2016) 1–7. unbiased boosting with categorical features, Adv. Neural Inf. Process. Syst. 31. [31] F.A. Faber, A. Lindmaa, O.A. Von Lilienfeld, R. Armiento, Machine learning [40] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, Lightgbm: a energies of 2 million elpasolite (a b c 2 d 6) crystals, Phys. Rev. Lett. 117 (13) highly efficient gradient boosting decision tree, Adv. Neural Inf. Process. Syst. 30. (2016), 135502. [41] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, [32] L. Weston, C. Stampfl, Machine learning the band gap properties of kesterite i 2- ii- Adv. Neural Inf. Process. Syst. 30. iv- v 4 quaternary compounds for photovoltaics applications, Phys. Rev. Mater. 2 [42] J. Duffy, Trends in energy gaps of binary compounds: an approach based upon (8) (2018), 085407. electron transfer parameters from optical spectroscopy, J. Phys. C Solid State Phys. [33] D. P. Kingma, J. Ba, Adam: A method for Stochastic Optimization, arXiv preprint 13 (16) (1980) 2979. arXiv:1412.6980. [43] K. Dagenais, M. Chamberlin, C. Constantin, Modeling energy band gap as a [34] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32. function of optical electronegativity for binary oxides, J. Young Invest. 25 (2013) [35] G. Biau, Analysis of a random forests model, J. Mach. Learn. Res. 13 (1) (2012) 1–6. 1063–1095. [44] R. Ruh, V.A. Patel, Proposed phase relations in the hfo 2-rich portion of the system hf–hfo 2, J. Am. Ceram. Soc. 56 (11) (1973) 606–607. 13