The impact of cluster resolution feature selection on pattern recognition and classification for detecting Sudan dye adulteration in palm oil Joanna K. Kwao a,c , Cheetham Mingle b , John N. Addotey d, Kwabena F.M. Opuni a , Lawrence A. Adutwum a,* a Department of Pharmaceutical Chemistry, School of Pharmacy, University of Ghana, P. O. Box LG 43, Legon, Accra, Ghana b Food Physicochemical Laboratory, Food and Drugs Authority, P. O. Box CT 2783, Cantonments, Accra, Ghana c Department of Chemistry and Biochemistry, University of South Carolina, Columbia, SC 29208, United States of America d Department of Pharmaceutical Chemistry, Faculty of Pharmacy and Pharmaceutical Sciences, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana A R T I C L E I N F O Keywords: Chemometrics Machine Learning Adulteration Palm Oil Sudan Dyes Feature Selection FTIR A B S T R A C T This study evaluates the performance of some commonly used chemometric and machine learning techniques such as principal component analysis (PCA), artificial neural network (ANN), k-nearest neighbors (KNN), logistic regression discriminant analysis (LRDA), partial least squares discriminant analysis (PLSDA), support vector machine (SVM), and gradient boosted decision tree (GBDT) on HATR − FTIR data for detecting Sudan dye adulteration in palm oil. We employed the Icoshift for data alignment and Savitzky-Golay smoothing to enhance the data quality. Cluster resolution feature selection (CRFS) selected 2.39 % of 3351 features. Using only the 80 selected features PCA models showed a clear separation between adulterated and pure palm oil samples and an improvement in explained variance which hitherto was not observed. LRDA, PLSDA and SVM showed improved training TPR, ACC and MCC after feature selection. KNN showed improvement all model quality parameters after feature selection. 1. Introduction Adulteration of food with harmful chemical additives constitutes a major public health challenge [1–3]. This practice is often intentional and seeks to enhance the appearance and perceived quality of food products since the appearance of food is one of the easiest interpretable markers of quality for consumers [2,4] An example highlighted in literature is the adulteration of palm oil, ketchup, and chili products with Sudan dyes [5–7]. Sudan I, II, III, and IV are synthetic lipophilic dyes used in various industries and research laboratories as a coloring and staining agent. Structurally, they contain chromophoric azo groups (–N = N–) and extended π-conjugated systems, characterized by intense red color. This property makes these dyes attractive for use as food additives. However, evidence suggests that they are unsuitable for human consumption [1,2,4]. In vitro and in vivo studies have demonstrated the potential genotoxicity of Sudan dyes and their aromatic amine metabolites. Mechanisms of toxicity include formation of covalent DNA adducts, generation of reactive metabolites, and contact allergenicity [4,8]. Various international and independent regulatory agencies have recognized the potential toxicity of Sudan I, II, III and IV dyes and have banned or strictly limited their use as food additives [9–11]. Despite these findings, adulteration of palm oil with Sudan dyes re mains persistent in Ghana. The visual appearance of palm oil is an easily interpretable quality marker for consumers, and this accounts for the frequent adulteration of market samples with Sudan dyes. Due to the limited information on Sudan dye carcinogenicity, no level of con sumption can be considered safe. Therefore, routine evaluation of palm oil for the presence of Sudan dyes is quintessential for public safety. Historically, the detection of Sudan dyes in palm oil has involved some form of chromatography [12–15]. These techniques are generally complex, time-consuming, and expensive. Qualitative analysis with spectroscopy is an effective and rapid alternative for detection and regulation of adulterated food products [16–18]. Spectroscopy, coupled with machine learning, data mining and chemometrics, is a rapid and non-destructive approach for detecting food adulteration [19–23]. These techniques have been applied for the authentication of honey [24–27], milk, meat [19,28,29] and other food * Corresponding author. E-mail address: adutwum@ug.edu.gh (L.A. Adutwum). Contents lists available at ScienceDirect Microchemical Journal journal homepage: www.elsevier.com/locate/microc https://doi.org/10.1016/j.microc.2024.112433 Received 7 July 2024; Received in revised form 17 October 2024; Accepted 9 December 2024 Microchemical Journal 208 (2025) 112433 Available online 12 December 2024 0026-265X/© 2024 Published by Elsevier B.V. https://orcid.org/0000-0003-0538-160X https://orcid.org/0000-0003-0538-160X https://orcid.org/0000-0002-1615-8449 https://orcid.org/0000-0002-1615-8449 https://orcid.org/0000-0003-1653-1458 https://orcid.org/0000-0003-1653-1458 https://orcid.org/0000-0001-6912-1001 https://orcid.org/0000-0001-6912-1001 mailto:adutwum@ug.edu.gh www.sciencedirect.com/science/journal/0026265X https://www.elsevier.com/locate/microc https://doi.org/10.1016/j.microc.2024.112433 https://doi.org/10.1016/j.microc.2024.112433 http://crossmark.crossref.org/dialog/?doi=10.1016/j.microc.2024.112433&domain=pdf products. Chemometric methods (e.g. supervised and unsupervised) have been used to detect Sudan dyes in palm oil [22,23,30–32]. Data from spectroscopic analyses, including surface-enhanced Raman spec troscopy (SERS), visible-near infrared (VIS-NIR), have been analysed using unsupervised methods such as principal component analysis (PCA) and supervised methods such as linear discriminant analysis (LDA) and support vector machines (SVM) for the identification of samples adul terated with Sudan [22,23,30–32]. However, the models from these studies have limited capability for quality monitoring because they lack clear distinctions between pure and adulterated samples. Another study based on VIS-NIR data and PCA failed to provide external model vali dation, raising questions about their potential utility. To address this shortcoming, a subsequent study used a model classification rate which could be misleading when dealing with imbalanced data [33]. Despite the significant contributions of spectroscopy and chemo metrics to the detection of Sudan dye adulterations in palm oil, only a few chemometric techniques have been studied. These are PCA, LDA and SVM. A more comprehensive evaluation of the most frequently used classification algorithms will facilitate the selection of an optimum tool for this application. This is important to eliminate false negatives in the detection of dangerous adulterants in food. Additionally, it is also known that chemometric/machine learning models benefit from prior feature selection routines since it eliminates spurious sections of the spectra, which may impact the learning models adversely [34–36]. Unfortunately, the implementation of a feature se lection routine before analysis in the case of Sudan dyes is conspicuously missing in the literature. In this study, we compare some commonly used machine learning algorithms to determine the optimum algorithm for the detection of Sudan dyes in palm oil. We further explored the impact of a feature selection routine on pattern recognition and machine learning model performances. The following machine learning algorithms: artificial neural network discriminant analysis (ANN), gradient boosted decision tree (GBDT), k-nearest neighbor (KNN), logistic regression discriminant analysis (LRDA), partial least squares discriminant analysis (PLSDA) and support vector machine discrimination analysis (SVM) were used. The model performance was evaluated using the sensitivity, specificity, ac curacy, and Matthew correlation coefficient (MCC)[37–40]. 2. Experimental 2.1. Data collection A total of ninety-eight (98) palm oil samples were obtained from the storage units of Ghana Food and Drugs Authority (FDA). These samples had been analyzed and the presence of Sudan dye or otherwise confirmed using high-performance liquid chromatography (HPLC) by the FDA before they were stored. The samples included 44 pure and 54 adulterated ones. The IR spectra of the samples were obtained using the Perkin Elmer Spectrum 100 FT-IR spectrometer (Perkin Elmer, Serial Number: 67322) fitted with a horizontal attenuated total reflectance (HATR) accessory (Perkin Elmer, Serial Number: 67322). Spectra were collected within a wavenumber range of 650 – 4000 cm− 1 at a resolution of 4 cm− 1 with 4 replicate measurements using Perkin Elmer Spectrum ES (version 10.3). Each spectrum was saved individually as a file with a .sp extension. Prior to sample measurement, the HATR plate was carefully and thoroughly cleaned in situ with acetone and soft lint-free tissue paper before filling in with the sample to be measured. The cleanliness of the plate was verified by checking the background after cleaning the plate. The spectra for the pure samples were collected before those for the adulterated samples to prevent cross-contamination. A total of 392 spectra representing 176 and 216 for pure and confirmed adulterated samples, respectively were obtained. 2.2. Data processing and feature selection Data importation, processing and analysis were performed using Matlab® 2022b (Mathworks, Natick, MA, USA) and PLS Toolbox® ver. 9.1 (Eigenvector Research, Manson, WA, USA). The data was imported into the average of the replicate measure ments for each sample was determined. The final data was organized into a matrix of samples in rows and wavenumbers in columns. The dataset matrix consisted of 98 × 3351 (samples/objects × wave numbers/cm− 1). Minor shifts in the spectra which occur from run to run were cor rected using the interval correlation optimized shift algorithm (Icoshift) [41,42]. A Savitsky Golay smoothing algorithm was also applied with a smoothing window of 11 and a second-degree polynomial [43]. The aligned and smoothed data was organized and randomly assigned into two groups containing two-thirds (65 samples) and one-third (33 sam ples) for training and validation sets, respectively. Feature selection was performed using the cluster resolution feature selection algorithm (CR-FS) [44–47]. The Fisher ratio ranking metric was used while the start and stop numbers for the CR-FS were auto matically detected using the null probability analysis [47]. The feature selection was performed using a permutation number of 10. For each feature selection permutation, a random subset of the training set data was selected. 2.3. Machine learning model and model quality Estimation Two sets of data were used for the downstream analysis. These were the original data incorporating all the variables and a second set which contains the variables retained during the feature selection process. Both data sets were auto scaled and normalized prior to model generation. The PCA and machine learning models were built using the training set data and evaluated with the external validation set for each of the datasets. Subsequently, the model quality parameters namely, sensi tivity/true positive rate (TPR), specificity/true negative rate (TNR), accuracy (ACC), and MCC were computed for all the ML models [33,37]. Sensitivity/TPR is the probability of a positive prediction being a truly positive class. Specificity/TNR on the other hand is the measure of the probability that a negative prediction is truly negative. TPR, TNR, ACC and MCC are defined by the Eqs. (1)–(4) below where, TP- number of positive samples predicted positive, TN – number of negative samples predicted negative, FP – number of negative samples predicted positive, and FN – number of positive samples predicted negative. Sensivity/TPR = TP TP + FN (1) Specificty/TNR = TN TN + FP (2) Accuracy/ACC = TP + TN TP + TN + FP + FN (3) MCC = TP × TN − FP × FN ̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅ (TP + FP)(TP + FN)(TN + FP)(TN + FN) √ (4) 3. Results and discussions Adulteration of palm oil with Sudan dyes presents a major health risk to consumers and hence the need for a fast, reliable, and cheap method for quality monitoring cannot be overemphasized. Due to the non- destructive and cost-effective nature of spectroscopic methods, their use with machine learning techniques for food adulteration detection has become popular. Earlier studies achieved a lot of success, but few methods compared the performance of commonly used machine learning algorithms and the impact of feature selection on their performance. J.K. Kwao et al. Microchemical Journal 208 (2025) 112433 2 Herein, we aimed to evaluate commonly used machine learning al gorithms on spectroscopic data to compare their performance when they are applied to the detection of Sudan dye adulteration in palm oil. The workflow describing the data handling and analysis are shown in Fig. SI in the supporting information. Shifts in spectroscopic data do occur though they are not as marked as in chromatographic data, baseline drifts and distortions in Fourier transform spectra also occur [48]. These distortions become more pro nounced over the long-term use of the spectrometers. Even though equipment standardization can reduce the misalignment in spectra data it is still important to align the data before analysis. Hence, spectro scopic data must be aligned especially in a case like this where the entire spectra are employed. Icoshift is a useful algorithm for both spectro scopic and chromatographic data alignment [41,42]. Icoshift was used for the alignment of the data. Fig. 1a and 1b show the raw and aligned data, respectively. Employing second derivative to spectroscopic data have been demonstrated to improve the performance of machine learning algorithms [49,50]. Hence, a second derivative Savitsky Golay smoothing was applied to the aligned data, results of which are shown in Fig. 2. The application of derivatives to spectroscopic data improves the data by reducing noise, enhancing features, and reducing the impact of baseline drifts that may be present [16,51,52]. Cluster resolution feature selection (CRFS), a wrapper type of feature selection combines sequential backward elimination and forward se lection using the cluster resolution metric to evaluate the contribution of features to the separation of samples in different classes in principal components space [44–47]. It has been widely used on both spectro scopic, chromatographic, and other kinds of data [47,53]. We imple mented CRFS here without user-defined start and stop numbers. A permutation number of ten (10) was used and features surviving at least 70 % of the time were retained. Of the 3351 variables (wavenumbers), 80 representing 2.39 % were retained. Fig. 3a shows the feature survival rate after the 10 permutations. The location of the features that survived at least 70 % are shown in Fig. 3b. The frequency of features selected is shown in Fig. S2 and Table S1 in the SI. Thus, the dataset obtained after feature selection had only 80 wavenumbers. PCA is arguably one of the most used multivariate analysis tech niques. We compared the separation performance of PCA on the data before and after feature selection. The result of this comparison is shown in Fig. 4a and 4b. From Fig. 4a, it can be observed that there is no clear separation between the adulterated and pure palm oil samples. Fig. 4b shows the PCA score plot after feature selection using only 80 variables shows a better separation. It can also be observed that the sum of explained variance in the first two principal components (PC 1 and PC 2) were 31.73 % and 22.01 %, for before and after feature selection, respectively. As this represents the underlying information in the data, it implies that the application of the feature selection does not lead to a loss of information. Even though PCA is not a classification algorithm, knowledge of the sample class can aid in its use for classification. The lack of separation between the pure and adulterated samples in Fig. 4a makes it nearly impossible to use the PCA model incorporating all the variable for classification. In Fig. 4b where only the selected features were used it can be observed that only four (4) pure samples fell outside the 95 % confidence ellipse of the pure class. It can also be observed one samples of the validation set of the adulterated class was also mis classified. This suggests that CRFS improves the PCA model and more significantly using only 80 of the features. We further evaluated the performance of artificial neural network (ANN), gradient boosted decision tree (GBDT), k-nearest neighbour (KNN), logistic regression discriminant analysis (LRDA), partial least squares discriminant analysis (PLSDA) and support vector machine (SVM) on the identification of adulterated palm oil samples before and after feature selection. The model prediction sensitivity, specificity, accuracy, and MCC were used. Sensitivity and specificity measure the true positive rate and true negative rate, respectively while the overall model performance is evaluated by the accuracy. MCC is more infor mative than the accuracy for evaluating binary classification problems, as it considers the balance between all components of the confusion matrix (true positives, true negatives, false positives, and false nega tives) [38,39]. Fig. 1. Fig. 2. J.K. Kwao et al. Microchemical Journal 208 (2025) 112433 3 The model prediction plots for ML ANN, GBDT, KNN, LRDA, PLSDA and SVM for models generated using all the wavelengths (3551 vari ables) and CRFS selected wavelengths (80 variables) are shown in Figs. S3 – S8 in the supporting information. In general, the application of CRFS led to an improvement in the model quality parameter for the training set for KNN, PLSDA and SVM as shown in Table 1. From Table 1, TNR, ACC and MCC increased for training set increased after feature selection. In the case of PLSDA and SVM, excellent values were obtained for all quality parameters after feature selection. While applying the models to the external validation sets, all the algorithms except ANN and GBDT showed improved model results. It is known that high number of variables in machine learning applications risks over fitting and can also confuse learning algorithms [54–56]. The application of feature selec tion routines such as CRFS reduces redundancy, noise and computa tional requirements. An efficient variables selection contributes significantly to a parsimonious machine learning models which are desirable. 4. Conclusion Our findings show that feature selection, specifically using the CRFS method, significantly enhances the performance of principal component Fig. 3. Fig. 4. Table 1 Before Feature Selection (3351 variables) After Feature Selection (80) Training Model Quality Algorithm TNR TPR ACC MCC TNR TPR ACC MCC ANN 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 GBDT 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 KNN 0.8621 0.9444 0.9077 0.8136 1.0000 0.9444 0.9692 0.9400 LRDA 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 PLSDA 1.0000 0.9722 0.9846 0.9694 1.0000 1.0000 1.0000 1.0000 SVM 0.9655 0.9444 0.9538 0.9389 1.0000 1.0000 1.0000 1.0000 Validation Model Quality ANN 1.0000 0.8889 0.9394 0.8856 0.8667 1.0000 0.9394 0.8832 GBDT 0.9333 1.0000 0.9697 0.9403 0.9333 0.9444 0.9394 0.8778 KNN 0.8667 0.8333 0.8485 0.6974 1.0000 0.9444 0.9697 0.9410 LRDA 1.0000 0.8889 0.9394 0.8856 1.0000 0.9444 0.9697 0.9410 PLSDA 0.9333 0.8889 0.9091 0.8192 1.0000 0.9444 0.9697 0.9410 SVM 0.9333 0.8889 0.9091 0.8192 1.0000 0.9444 0.9697 0.9410 J.K. Kwao et al. Microchemical Journal 208 (2025) 112433 4 analysis (PCA). KNN, LRDA, PLSDA and SVM showed improved model generalization as better results were obtained during model validation wit external validation sets. These findings underscore the importance of appropriate feature selection and the choice of machine learning algo rithms to achieve reliable and accurate detection of Sudan dyes in palm oil. CRediT authorship contribution statement Joanna K. Kwao: . Cheetham Mingle: Writing – review & editing, Resources, Methodology, Data curation. John N. Addotey: . Kwabena F.M. Opuni: . Lawrence A. Adutwum: . Funding This research did not receive any specific grant from funding agencies. Declaration of competing interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Appendix A. Supplementary data Supplementary data to this article can be found online at https://doi. org/10.1016/j.microc.2024.112433. Data availability Data will be made available on request. References [1] J. Spink, Safety of food and beverages: risks of food adulteration, Encyclopedia of Food Safety 3 (2014) 413–416, https://doi.org/10.1016/B978-0-12-378612- 8.00300-0. [2] Banerjee D, Chowdhary S, Chakraborty S, Bhattacharyya R. Recent advances in detection of food adulteration. Food Safety in the 21st Century: Public Health Perspective 2017:145–60. https://doi.org/10.1016/B978-0-12-801773-9.00011-X. [3] E. Hong, S.Y. Lee, J.Y. Jeong, J.M. Park, B.H. Kim, K. Kwon, et al., Modern analytical methods for the detection of food fraud and adulteration by food category, J Sci Food Agric 97 (2017), https://doi.org/10.1002/jsfa.8364. [4] T.M. Fonovich, Sudan dyes: Are they dangerous for human health, Drug Chem Toxicol 36 (2013) 343–352, https://doi.org/10.3109/01480545.2012.710626. [5] C.G. Reile, M.S. Rodríguez, D.D. de Fernandes, S, Gomes A de A, Diniz PHGD, Di Anibal CV., Qualitative and quantitative analysis based on digital images to determine the adulteration of ketchup samples with Sudan I dye, Food Chem 328 (2020), https://doi.org/10.1016/j.foodchem.2020.127101. [6] Hussain Khan M, Saleem Z, Ahmad M, Sohaib A, Ayaz H, Mazzara M. Hyperspectral imaging for color adulteration detection in red chili. Applied Sciences (Switzerland) 2020;10. https://doi.org/10.3390/app10175955. [7] S.Y.S.S. Adade, H. Lin, S.A. Haruna, A.O. Barimah, H. Jiang, A.A. Agyekum, et al., SERS-based sensor coupled with multivariate models for rapid detection of palm oil adulteration with Sudan II and IV dyes, J. Food Compos. Anal. 114 (2022), https:// doi.org/10.1016/j.jfca.2022.104834. [8] Opinion of the Scientific Panel on Food Additives, Flavourings, Processing Aids and Materials in Contact with Food on a request from the commission related to Lutein for use in particular nutritional uses. EFSA Journal 2007;5. https://doi.org/ 10.2903/j.efsa.2007.315. [9] C.H. Xing, Y. Wang, J.C. Liu, Z.N. Pan, H.L. Zhang, S.C. Sun, et al., Melatonin reverses mitochondria dysfunction and oxidative stress-induced apoptosis of Sudan I-exposed mouse oocytes, Ecotoxicol Environ Saf 225 (2021), https://doi.org/ 10.1016/j.ecoenv.2021.112783. [10] T.B. Zanoni, E. Silva de Paula, G.J. Zocolo, M.V.B. Zanoni, D. Palma de Oliveira, Identification of Sudan III-(deoxy)-guanosine adducts formed in situ in a reaction with no catalyst, Toxicol Environ Chem 95 (2013), https://doi.org/10.1080/ 02772248.2014.892748. [11] R.J. Bienstock, L. Perera, M.A. Pasquinelli, Molecular Modeling Study of the Genotoxicity of the Sudan I and Sudan II Azo Dyes and Their Metabolites, Front Chem 10 (2022), https://doi.org/10.3389/fchem.2022.880782. [12] T.C. Pham, X.T. Dang, B.N. Nguyen, T.T. Vu, Determination of Sudan i and II in Food by High-Performance Liquid Chromatography after Simultaneous Adsorption on Nanosilica, J Anal Methods Chem 2021 (2021), https://doi.org/10.1155/2021/ 6664463. [13] J.K. Adjei, V. Ahormegah, A.K. Boateng, H.K. Megbenu, O.S. Fast, easy, cheap, robust and safe method of analysis of Sudan dyes in chilli pepper powder, Heliyon 6 (2020), https://doi.org/10.1016/j.heliyon.2020.e05243. [14] E. Ertaş, H. Özer, C. Alasalvar, A rapid HPLC method for determination of Sudan dyes and Para Red in red chilli pepper, Food Chem 105 (2007) 756–760, https:// doi.org/10.1016/j.foodchem.2007.01.010. [15] M.M. Dar, Detection of Sudan Dyes in Red Chilli Powder by Thin Layer Chromatography, J Allergy Ther (2012;S1.), https://doi.org/10.4172/ scientificreports.586. [16] Blanco M, Villarroya I. NIR spectroscopy: A rapid-response analytical tool. TrAC - Trends in Analytical Chemistry 2002;21. https://doi.org/10.1016/S0165-9936(02) 00404-1. [17] Y. Xu, P. Zhong, A. Jiang, X. Shen, X. Li, Z. Xu, et al., Raman spectroscopy coupled with chemometrics for food authentication: A review. TrAC - T, Trends Anal. Chem. 131 (2020), https://doi.org/10.1016/j.trac.2020.116017. [18] L. Mandrile, L. Barbosa-Pereira, K.M. Sorensen, A.M. Giovannozzi, G. Zeppa, S. B. Engelsen, et al., Authentication of cocoa bean shells by near- and mid-infrared spectroscopy and inductively coupled plasma-optical emission spectroscopy, Food Chem 292 (2019) 47–57, https://doi.org/10.1016/j.foodchem.2019.04.008. [19] A.I. Ropodi, E.Z. Panagou, G.J.E. Nychas, Data mining derived from food analyses using non-invasive/non-destructive analytical techniques; determination of food authenticity, quality & safety in tandem with computer science disciplines, Trends Food Sci Technol 50 (2016) 11–25, https://doi.org/10.1016/J.TIFS.2016.01.011. [20] H.J. He, D.W. Sun, Microbial evaluation of raw and processed food products by Visible/Infrared, Raman and Fluorescence spectroscopy, Trends Food Sci Technol 46 (2015) 199–210, https://doi.org/10.1016/J.TIFS.2015.10.004. [21] Monago-Maraña O, Durán-Merás I, Muñoz de la Peña A, Galeano-Díaz T. Analytical techniques and chemometrics approaches in authenticating and identifying adulteration of paprika powder using fingerprints: A review. Microchemical Journal 2022;178. https://doi.org/10.1016/j.microc.2022.107382. [22] S. Yao-Say Solomon Adade, H. Lin, H. Jiang, S.A. Haruna, A. Osei Barimah, M. Zareef, et al., Fraud detection in crude palm oil using SERS combined with chemometrics, Food Chem 388 (2022), https://doi.org/10.1016/j. foodchem.2022.132973. [23] C.V. Di Anibal, L.F. Marsal, M.P. Callao, I. Ruisánchez, Surface Enhanced Raman Spectroscopy (SERS) and multivariate analysis as a screening tool for detecting Sudan I dye in culinary spices, Spectrochim Acta A Mol Biomol Spectrosc 87 (2012) 135–141, https://doi.org/10.1016/J.SAA.2011.11.027. [24] M. Shiddiq, Z. Zulkarnain, V. Asyana, H. Aliyah, Identification of Pure and Adulterated Honey Using Two Spectroscopic Methods, J Phys Conf Ser 1351 (2019), https://doi.org/10.1088/1742-6596/1351/1/012022. [25] D. Valinger, L. Longin, F. Grbeš, M. Benković, T. Jurina, J. Gajdoš Kljusurić, et al., Detection of honey adulteration – The potential of UV-VIS and NIR spectroscopy coupled with multivariate analysis, Lwt 145 (2021), https://doi.org/10.1016/j. lwt.2021.111316. [26] D.A. Magdas, F. Guyon, C. Berghian-Grosan, M.C. Muller, Challenges and a step forward in honey classification based on Raman spectroscopy, Food Control 123 (2021) 107769, https://doi.org/10.1016/j.foodcont.2020.107769. [27] A.A. Boateng, S. Sumaila, M. Lartey, M.B. Oppong, K.F.M. Opuni, L.A. Adutwum, Evaluation of chemometric classification and regression models for the detection of syrup adulteration in honey, LWT 163 (2022), https://doi.org/10.1016/j. lwt.2022.113498. [28] Fengou LC, Spyrelli E, Lianou A, Tsakanikas P, Panagou EZ, Nychas GJE. Estimation of Minced Pork Microbiological Spoilage through Fourier Transform Infrared and Visible Spectroscopy and Multispectral Vision Technology. Foods 2019, Vol 8, Page 238 2019;8:238. https://doi.org/10.3390/FOODS8070238. [29] A.I. Ropodi, E.Z. Panagou, G.J.E. Nychas, Multispectral imaging (MSI): A promising method for the detection of minced beef adulteration with horsemeat, Food Control 73 (2017) 57–63, https://doi.org/10.1016/J. FOODCONT.2016.05.048. [30] S.S. Andoh, K. Nyave, B. Asamoah, B. Kanyathare, T. Nuutinen, C. Mingle, et al., Optical screening for presence of banned Sudan III and Sudan IV dyes in edible palm oils, Food Addit Contam Part A Chem Anal Control Expo Risk Assess 37 (2020) 1049–1060, https://doi.org/10.1080/19440049.2020.1726500. [31] S.S. Andoh, T. Nuutinen, C. Mingle, M. Roussey, Qualitative analysis of Sudan IV in edible palm oil, Journal of the European Optical Society 15 (2019), https://doi. org/10.1186/s41476-019-0117-0. [32] E. Teye, C. Elliott, L.K. Sam-Amoah, C. Mingle, Rapid and nondestructive fraud detection of palm oil adulteration with Sudan dyes using portable NIR spectroscopic techniques, Food Addit Contam Part A Chem Anal Control Expo Risk Assess 36 (2019) 1589–1596, https://doi.org/10.1080/19440049.2019.1658905. [33] S. Boughorbel, F. Jarray, M. El-Anbari, Optimal classifier for imbalanced data using matthews correlation coefficient metric, PLoS One 12 (2017) e0177678, https:// doi.org/10.1371/journal.pone.0177678. [34] I. Guyon, An Introduction to Variable and Feature Selection 1 Introduction, J. Mach. Learn. Res. 3 (2003) 1157–1182. [35] I. Guyon, A. Elisseeff, Feature Extraction, Foundations and Applications: An introduction to feature extraction, Stud. Fuzziness Soft Comput. 207 (2006) 1–25, https://doi.org/10.1007/978-3-540-35488-8_1. [36] B.E. Boser, I.M. Guyon, V.N. Vapnik, Training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144–152. [37] T. Loong, Clinical review Understanding sensitivity and specificity with the right, BMJ 327 (2003) 716–719, https://doi.org/10.1136/bmj.327.7417.716. J.K. Kwao et al. Microchemical Journal 208 (2025) 112433 5 https://doi.org/10.1016/j.microc.2024.112433 https://doi.org/10.1016/j.microc.2024.112433 https://doi.org/10.1016/B978-0-12-378612-8.00300-0 https://doi.org/10.1016/B978-0-12-378612-8.00300-0 https://doi.org/10.1002/jsfa.8364 https://doi.org/10.3109/01480545.2012.710626 https://doi.org/10.1016/j.foodchem.2020.127101 https://doi.org/10.1016/j.jfca.2022.104834 https://doi.org/10.1016/j.jfca.2022.104834 https://doi.org/10.1016/j.ecoenv.2021.112783 https://doi.org/10.1016/j.ecoenv.2021.112783 https://doi.org/10.1080/02772248.2014.892748 https://doi.org/10.1080/02772248.2014.892748 https://doi.org/10.3389/fchem.2022.880782 https://doi.org/10.1155/2021/6664463 https://doi.org/10.1155/2021/6664463 https://doi.org/10.1016/j.heliyon.2020.e05243 https://doi.org/10.1016/j.foodchem.2007.01.010 https://doi.org/10.1016/j.foodchem.2007.01.010 https://doi.org/10.4172/scientificreports.586 https://doi.org/10.4172/scientificreports.586 https://doi.org/10.1016/j.trac.2020.116017 https://doi.org/10.1016/j.foodchem.2019.04.008 https://doi.org/10.1016/J.TIFS.2016.01.011 https://doi.org/10.1016/J.TIFS.2015.10.004 https://doi.org/10.1016/j.foodchem.2022.132973 https://doi.org/10.1016/j.foodchem.2022.132973 https://doi.org/10.1016/J.SAA.2011.11.027 https://doi.org/10.1088/1742-6596/1351/1/012022 https://doi.org/10.1016/j.lwt.2021.111316 https://doi.org/10.1016/j.lwt.2021.111316 https://doi.org/10.1016/j.foodcont.2020.107769 https://doi.org/10.1016/j.lwt.2022.113498 https://doi.org/10.1016/j.lwt.2022.113498 https://doi.org/10.1016/J.FOODCONT.2016.05.048 https://doi.org/10.1016/J.FOODCONT.2016.05.048 https://doi.org/10.1080/19440049.2020.1726500 https://doi.org/10.1186/s41476-019-0117-0 https://doi.org/10.1186/s41476-019-0117-0 https://doi.org/10.1080/19440049.2019.1658905 https://doi.org/10.1371/journal.pone.0177678 https://doi.org/10.1371/journal.pone.0177678 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0170 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0170 https://doi.org/10.1007/978-3-540-35488-8_1 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0180 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0180 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0180 https://doi.org/10.1136/bmj.327.7417.716 [38] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genom 21 (2020) 1–13, https://doi.org/10.1186/s12864-019-6413-7. [39] D. Chicco, N. Tötsch, G. Jurman, The matthews correlation coefficient (Mcc) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation, BioData Min 14 (2021) 1–22, https://doi. org/10.1186/S13040-021-00244-Z/TABLES/5. [40] B.W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim. Biophys. Acta (BBA) Protein Struct. 405 (1975) 442–451, https://doi.org/10.1016/0005-2795(75)90109-9. [41] F. Savorani, G. Tomasi, S.B. Engelsen, icoshift: A versatile tool for the rapid alignment of 1D NMR spectra, J. Magn. Reson. 202 (2010) 190–202, https://doi. org/10.1016/j.jmr.2009.11.012. [42] G. Tomasi, F. Savorani, E.SB. Icoshift, An effective tool for the alignment of chromatographic data, J Chromatogr A 1218 (2011), https://doi.org/10.1016/j. chroma.2011.08.086. [43] A. Savitzky, M.J.E. Golay, Smoothing and differentiation of data by simplified least squares procedures, Anal Chem 36 (1964) 1627–1639. [44] N.A. Sinkov, J.J. Harynuk, Cluster resolution: a metric for automated, objective and optimized feature selection in chemometric modeling, Talanta 83 (2011) 1079–1087, https://doi.org/10.1016/j.talanta.2010.10.025. [45] N.A. Sinkov, J.J. Harynuk, Three-dimensional cluster resolution for guiding automatic chemometric model optimization, Talanta 103 (2013) 252–259. [46] M.S. Armstrong, A.P. de la Mata, J.J. Harynuk, An efficient and accurate numerical determination of the cluster resolution metric in two dimensions, J Chemom 35 (2021), https://doi.org/10.1002/cem.3346. [47] L.A. Adutwum, A.P. de la Mata, H.D. Bean, J.E. Hill, J.J. Harynuk, Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios, Anal Bioanal Chem 409 (2017) 6699–6708, https://doi.org/10.1007/s00216-017-0628-8. [48] F. Zhang, X. Tang, L. Li, Origins of Baseline Drift and Distortion in Fourier Transform Spectra, Molecules 27 (2022), https://doi.org/10.3390/ molecules27134287. [49] T. Shi, H. Liu, Y. Chen, T. Fei, J. Wang, G. Wu, Spectroscopic diagnosis of arsenic contamination in agricultural soils, Sensors (switzerland) 17 (2017), https://doi. org/10.3390/s17051036. [50] X. Zhang, Z. Gao, Y. Yang, S. Pan, J. Yin, X. Yu, Rapid identification of the storage age of dried tangerine peel using a hand-held near infrared spectrometer and machine learning, J near Infrared Spectrosc 30 (2022), https://doi.org/10.1177/ 09670335211057232. [51] R. Gill, T.S. Bal, A.C. Moffat, The Application of Derivative UV-Visible Spectroscopy in Forensic Toxicology, J. Forensic Sci. Soc. 22 (1982), https://doi. org/10.1016/S0015-7368(82)71466-5. [52] Å. Rinnan, F. van den Berg, S.B. Engelsen, Review of the most common pre- processing techniques for near-infrared spectra. TrAC - T, Trends Anal. Chem. 28 (2009), https://doi.org/10.1016/j.trac.2009.07.007. [53] A.O.O. Oliynyk, L.A.A. Adutwum, J.J.J. Harynuk, A. Mar, Classifying crystal structures of binary compounds AB through cluster resolution feature selection and support vector machine analysis, Chem. Mater. 28 (2016) 6672–6681, https://doi. org/10.1021/acs.chemmater.6b02905. [54] S.M. Vieira, J.M.C. Sousa, U. Kaymak, Fuzzy criteria for feature selection, Fuzzy Sets Syst 189 (2012) 1–18, https://doi.org/10.1016/j.fss.2011.09.009. [55] J.M. Cadenas, M.C. Garrido, R. Martínez, Feature subset selection Filter - Wrapper based on low quality data, Expert Syst Appl 40 (2013) 6241–6252, https://doi.org/ 10.1016/j.eswa.2013.05.051. [56] B. Duval, J.-K. Hao, Advances in metaheuristics for gene selection and classification of microarray data, Brief Bioinform 11 (2010) 127–141, https://doi. org/10.1093/bib/bbp035. J.K. Kwao et al. Microchemical Journal 208 (2025) 112433 6 https://doi.org/10.1186/s12864-019-6413-7 https://doi.org/10.1186/S13040-021-00244-Z/TABLES/5 https://doi.org/10.1186/S13040-021-00244-Z/TABLES/5 https://doi.org/10.1016/0005-2795(75)90109-9 https://doi.org/10.1016/j.jmr.2009.11.012 https://doi.org/10.1016/j.jmr.2009.11.012 https://doi.org/10.1016/j.chroma.2011.08.086 https://doi.org/10.1016/j.chroma.2011.08.086 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0215 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0215 https://doi.org/10.1016/j.talanta.2010.10.025 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0225 http://refhub.elsevier.com/S0026-265X(24)02546-3/h0225 https://doi.org/10.1002/cem.3346 https://doi.org/10.1007/s00216-017-0628-8 https://doi.org/10.3390/molecules27134287 https://doi.org/10.3390/molecules27134287 https://doi.org/10.3390/s17051036 https://doi.org/10.3390/s17051036 https://doi.org/10.1177/09670335211057232 https://doi.org/10.1177/09670335211057232 https://doi.org/10.1016/S0015-7368(82)71466-5 https://doi.org/10.1016/S0015-7368(82)71466-5 https://doi.org/10.1016/j.trac.2009.07.007 https://doi.org/10.1021/acs.chemmater.6b02905 https://doi.org/10.1021/acs.chemmater.6b02905 https://doi.org/10.1016/j.fss.2011.09.009 https://doi.org/10.1016/j.eswa.2013.05.051 https://doi.org/10.1016/j.eswa.2013.05.051 https://doi.org/10.1093/bib/bbp035 https://doi.org/10.1093/bib/bbp035 The impact of cluster resolution feature selection on pattern recognition and classification for detecting Sudan dye adulte ... 1 Introduction 2 Experimental 2.1 Data collection 2.2 Data processing and feature selection 2.3 Machine learning model and model quality Estimation 3 Results and discussions 4 Conclusion CRediT authorship contribution statement Funding Declaration of competing interest Appendix A Supplementary data Data availability References