Environ Eng Res > Volume 28(5); 2023 > Article
Ohanuba, Ismail, and Ali: Application of topological data analysis to flood disaster management in Nigeria

Abstract

The importance of this research to the literature lies in the ability to develop a hybrid method of Topological Data Analysis and Unsupervised Machine Learning (TDA-uML) for flood detection. The working method in TDA-uML entails collection and loading of datasets, feature representation, data preprocessing (i.e., training, testing, computation, and validation), and data classification. Three properties make TDA distinct from traditional methods: coordinate-invariance, deformative-invariance, and compressed-representative. Formerly utilized hydrologic, hydraulic, and statistical models for flood control were frequently erroneous in their forecasts, lacked the use of hybrid models, and were not validated. The main research objective is to develop a hybrid method for predicting floods. The motivation is to fill research gaps by using TDA-uML methods for persistent homology (PH) and synthetic k-means clustering. The results will be used to compare and categorize the features that are produced. Seven states were selected based on Nigeria’s flood history and affected population. The 7 states are located within the 8 hydrological areas of Nigeria. The efficiency of the resultant validity was 91%. The findings contributed to the development of a model for flood prediction and management; topological features were extracted from the data to predict and categorize the risk.

1. Introduction

Topological data analysis (TDA) is a recent research area in statistics that uses algebraic topology tools to capture a dataset’s shape and structure. In their 2019 study, Christophe Biscio and Jesper Møller reported that TDA, as a method in statistics, uses ideas from algebraic topology to summarize and visualize complex datasets [1]. Their research was carried out at Aalborg University, Denmark. TDA emerged gradually over time; the work of Patrizio Frosini in 1990, Letscher and Zomorodian in 2002, Robins in 1999, and Zomorodian and Carlsson in 2005 provides a chronicle of TDA [2-4]. The development started with taking measurements of the distance between similar groups of submanifolds of a Euclidean space. Richeson’s work in 2019 illustrated the impact of Euler’s formula, which led to the historical development of TDA. Euler’s formula was named after Leonhard Euler, its founder. The topological invariant, which is one of the characteristics of TDA, was first established in the Euler characteristic. It is a quantity that can be calculated and gives back the same value on many different representations of the same topological shape [5]. The development of persistent homology (PH), k-means clustering, and other tools in TDA has been driven by algorithms. The area flourished in the 1980s and 1990s by addressing different practical problems, enriching the area of discrete geometry.
The work of Muszynski et al. in 2019 applied TDA and ML techniques in California to forecast climate conditions associated with an atmospheric river (AR) [6]. Their models produce a consistently fast and accurate separation of ARs, which shows that TDA and ML produce highly robust and accurate results on climate change. In this recent era, all types of flood modelling (mathematical, statistical, and machine learning) have rapidly grown in number, value, and implementation procedure. This rapid progress has simultaneously reduced flood impact and increased confidence in flood prediction, and it has resulted in proper risk management in flood control. The recent growth in technology and industry brings about a corresponding increase in data points, such that automated and hybrid methods are needed. Most recent studies in time series are beginning to drift toward hybrid systems that combine two methods in flood control [5,6]. Flooding has destroyed lives and properties, displacing families and ruining farmland. Because of these problems, the researchers are motivated to fill the gaps and help minimize flood hazard, vulnerability, risk, and disaster by applying a hybrid TDA-uML method.
Nigeria has its main sources of water flow from the Niger-river and Benue-river. The drainage areas in Nigeria have eight divisions known as hydrological areas (HAs). The drainage areas include i. Sokoto, ii. Kaduna, iii. Yola, iv. Makurdi, v. Port Harcourt, vi. Ibadan, vii. Enugu, and viii. Maiduguri. The government, through their agency, makes a lot of effort to mitigate flood disasters by providing education to Nigerians on flood management and mitigation strategies. The models used by the Nigeria Hydrological Services Agency (NIHSA) to predict floods are the Geo-Spatial Stream Flow Model (GeoSFM) and/or the Soil-Water-Assessment-Tool, SWAT. Both models use the Digital-Elevation-Model in conjunction with datasets from hydrology, hydrogeology, precipitation, topography, and water-balance indices, as published in Li et al. 2020 in the domain of hydrology and water resources [7]. Despite their efforts, the agency still encounters some challenges: the performance of their models has not been validated for accurate predictions; insufficient labor development; and lack of comparison of methods.
Historical trends (records) for the past 50 years revealed inevitable and annual flood disasters and risks to lives, animals, and properties, and they are expected to rise due to an increase in population, economic activity, etc. The annual flood disaster, according to the study by Tanoue, Hirabayashi, and Ikeuchi in 2016, could be monitored to reduce the havoc [8]. The impact of floods on the economies of most countries is colossal; some of the effects are displacement of individuals, high mortality rates, destruction of farmland and structures, etc. The total casualties were estimated at seventeen billion, with an affected population of 7.7 million in Nigeria in 2012 and recorded by Whitworth et al. in 2015 [9]. The forecast of the impending dangers brought by heavy rain to control related flood risks in flood zones has led to the recent development of research in which researchers were resolute to find more accurate, robust, and applicable flood models [9, 10].
The three traditional flood modeling types include: 1) the one-dimensional (1D) modeling approach, which solves the 1D equation of river flow; 2) the two-dimensional (2D) modeling approach, which solves the 2D equation of river flow; and 3) the linked approach (1D channel and 2D floodplain), which combines 1D and 2D water flow models. The models were developed to permit vertical feature representation, describing the motion of fluid substances [11]. Many problems were encountered in the common flood models. For instance, they are non-predictive and have no direct linkage to hydrology. They have engineering, environmental, and processing limitations. They are computationally intensive, and the errors associated with the input can grow with time.
The advent of a hybrid method has recently received extra attention in other areas like biology, physics, and engineering, as recorded in the studies of Carlsson in 2014 and cited by Peter Bubenik in 2015 [12]. The significance of TDA is that the classifier performs well in all kinds of data sets from other areas of study. The algorithm and software are beneficial to a high number of supervised machine learning (ML) techniques; they are flexible and threshold-free [13]. The PH provided a concise representation of standard features throughout their life span. The tools in TDA integrate ML during the implementation procedure; the outcome produces robustness in flood management [14].
In extracting information from data sets, there is a need for preprocessing (pre-training of datasets), relation prediction (i.e., similarity in the metric distances), datasets, and evaluation metrics [15, 16]. The result is more robust when machine learning procedures are combined and integrated with a support vector machine (SVM) classifier [17]. In networking, the whole framework can be trained and retrained in such a manner that a deeply supervised framework structured in datasets could result in accurate information extraction [18]. The TDA-uML method has additional flexibility for analysis in statistics, object-shape detection, and metric insensitivity. These properties equip the method with the power to detect hidden features in complex datasets.
The TDA-uML focuses on k-means clustering and persistent homology to develop tools for flood prediction, management, and control in Nigeria. An acceptable clustering algorithm produces cluster groups that have satisfactory inter- and intracluster homogeneity, good intercluster separation, and adequate connectedness among close data points under internal validity measures [19]. Clustering of big datasets can locate better partitions that increase intracluster closeness and reduce intercluster resemblance. Some advantages of using a hybrid TDA-uML method are that it uses a classifier in the binary classification; it performs best in large or noisy datasets; and it can extract intrinsic information from complex data sets in the form of topological features or shapes to make accurate predictions. Recently, researchers have started integrating machine learning (ML) models and other statistical tools into information extraction from datasets for managing flood disasters. The two composite models (Dagging and Random Subspace) that consist of Random Forest, the Support Vector Machine, and the Artificial Neural Network were referred to as futuristic ML models for flood modeling [20]. One of the main problems with a random forest, as mentioned in the work of Islam et al. (2021), is its poor performance in the presence of undetermined sets of data [20]. Some literature strongly emphasized the power of relation in metrics, which is a strong assumption that generates long sentences; for more reading, see the works of Zeng 2015 and Huang 2021 [21, 22]. The method as revealed can be used to detect the feature formation existing in the fabricating scheme datasets and provides a similarly high prediction accuracy in the system parameter [23]. The only disadvantage is that it takes a long time during the training process.
The basic idea in PH is to study datasets through their low-dimension topological features, which translate into connected components (dimension 0) and loops (dimension 1) [24]. The PH is more efficient in the clustering of multivariate datasets by building the clustering similarity graph visualization via the Mirkin metric or Manhattan metric [25]. The PH aspect of this study used the 1970 dataset for this analysis and considered January and September as months of no flooding and flooding, respectively. Results were compared from both tools, and the uniqueness associated with each tool was discovered. In this study, a validity test was conducted to appropriately standardize the graph under a null reference distribution of the data [26]. The SVM is an encoded algorithm that understands, determines, and classifies the binary points during the training of a dataset [27]. According to Feng and Porter 2020, analyzing closely similar features in various groups of events (such as flooding) increases the likelihood of studying the patterns and mechanisms that help minimize their impacts through real-time shape extraction [28].

1.1. Problem Statement

The literature shows that no concrete research has found the pattern of flooding in different flood zones or estimated what it would take to bring a lasting solution to the menace. Besides, the models used by some agencies (e.g., the NIHSA) have shown shortcomings and inadequacies. The work of Munzimi et al. in 2019 reported that the reliability of the geospatial streamflow model (GeoSFM) in flood prediction is not assured. It sometimes produces inaccurate predictions, especially in the daily magnitude of river outflows [29]. Islambekov and Gel in 2019 applied Topological Data Analysis (TDA) in measuring water quality, but they did not extend the application to other environmental scenarios [30]. Their application requires a hybrid with other methods for better performance. The study by Musa et al. in 2021 applied the combined approach of PH and the Critical Slowing Down model (CSD) to evaluate river levels at Kelantan. They predicted signals or warnings that could help the dwellers act appropriately, but their study did not conduct a validity test to measure the technique’s performance [31]. Besides, there is no classification in their study to reveal the degree of flood risk in the signal for flood detection and prediction. Carlsson (2009), in a critical study of a dataset, pointed out that clustering should be employed as the statistical counterpart to the geometric construction of the connected components of a space. Carlsson stated that clustering is the vital building block of algebraic topology, but the problem lies in the extension of other topological methods to point clouds or flood events [32]. The work of Zieliński et al. in 2022 introduced a new TDA method called the persistence codebook.
The persistence codebook was defined as a vectorized representation of persistence diagrams (PDs) that is conveniently used to compute structures in statistics, data analytics, and machine learning, but the method still lacks real-time applications to a large dataset or extreme event [33]. When compared with the common method on a heterogeneous dataset, their approach yielded higher performance in less time than the alternative approaches tested.
Statistics showed that Nigerian states were devastated by floods in 2018. The fatality rate and affected population explain how this research specifically affects Nigeria. Besides, Nigeria tops the list of the 10 most devastating floods in Africa since 1985, as recorded by Whitworth Malcolm in his study in 2015 [9]. The 7 states were selected based on the size of the population affected annually. The impact occurs annually and brings much discomfort to the victims in their environment. Statistics of Nigerian states devastated by floods in 2018 revealed that the affected population was as follows: 118,199 in Kogi state with no fatality rate; 94,991 in Anambra state with no fatality rate; 31,113 in Edo with no fatality rate; 41,680 with no fatality rate; 51,719 in Niger state with a fatality rate of 15; 4,345 in Jigawa with a fatality rate of 7. There was no record of the affected population in Kano and Nasarawa states, but both had 3 people that were fatally affected. The exact number of those affected is difficult to establish in most cases due to inefficient data collection techniques, lack of professionalism, and errors in measurement. Teng et al. stated in their study in 2017 that 3D models were recently developed to permit vertical feature representation, but the models lack flow dynamics representation and are non-predictive [11].

2.1. Tools in the Hybrid TDA-uML Methodology

TDA was recognized as a researchable area in statistics many years after its origin in mathematical topology. Since then, several tools have been identified and used in TDA computation; among them are clustering, persistent homology, ridge estimation, and manifold estimation. The hybrid TDA-uML in this study used two TDA tools: clustering and persistent homology, PH (the PH combined the Vietoris-Rips and persistence diagram functions).
Seven states were selected based on the flood history in Nigeria. Those regions were recorded as flood zones by the “European Commission’s Directorate-General for European Civil Protection and Humanitarian Aid Operations (DG ECHO); Emergency Response Coordination Centre (ERCC)” on September 24, 2018, retrieved from https://reliefweb.int/map/nigeria/nigeria-floods-situation-emergency-response-coordination-centre-ercc-dg-echo-daily-map. Data sets were collected for four weather parameters from the database of the Nigeria Hydrological Services Agency from 1970 to 2021, and the total number of data points was 904,488 for the 7 states. This study was conducted within the context of a quantitative research technique that detects the feature structure of datasets using a hybrid TDA-uML method. The hybrid TDA-uML method in this study applied both clustering and persistent homology as tools. The data sets comprise four variable parameters (precipitation, wind speed, maximum temperature, and minimum temperature and humidity), which are independent of each other.
According to Adler et al. (2017), topological data analysis (TDA) has not used statistical procedures as part of its approach, so there is an issue of statistical reliability despite TDA being a recent research area in statistics [34]. The statistical behaviour of the features follows three steps in calculating topological features for dimension 0. First, create a separate matrix for dimension 0. Second, calculate the lifetime of all features (i.e., the difference between the death and birth points). Third, calculate the topological features; this entails the use of the relevant function in a coded format. The same steps are repeated to obtain topological features for dimension 1.

2.1.1. Three key properties of topological data analysis

The three main properties that give topological data analysis (TDA) the power to analyze and understand the shape or structure of objects (data points/point clouds) are:
1. Coordinate invariance (freeness). The principal idea is that it should not matter how we represent the dataset in terms of coordinates, provided we keep track of the internal similarities (distances). The measure of topological shape does not change if you rotate the shape.

2. Deformation invariance. The topological properties of an object remain unchanged even if the object is stretched. For example, the letter ‘A’ remains a loop with two legs and a closed triangle if stretching is applied to it.

3. Compressed representation. On closer observation, the important attributes of the letter ‘A’ can be summarized as three bounded angles with two stands (legs), irrespective of the millions of data points with similarity relationships that make up the object. For more detail on the properties of TDA, see the work of M Offroy and L Duponchel [35].
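The coordinate-invariance property can be checked numerically. The following is a minimal sketch, not code from the study: the point cloud is a hypothetical toy example, and the helper names `pairwise_distances` and `rotate` are introduced here for illustration only. It shows that rotating a point cloud leaves all pairwise distances, and hence any distance-based topological summary, unchanged.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of 2-D points."""
    return [[math.dist(p, q) for q in points] for p in points]

def rotate(points, angle):
    """Rotate every 2-D point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Hypothetical toy point cloud (not taken from the study's data).
cloud = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0), (-1.0, 1.5)]
before = pairwise_distances(cloud)
after = pairwise_distances(rotate(cloud, math.pi / 3))

# Largest entry-wise difference between the two distance matrices;
# it is zero up to floating-point rounding.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(before, after)
               for a, b in zip(row_a, row_b))
print(max_diff < 1e-9)
```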

2.1.2. Design of the methods

The k-means was first applied to the 7 states. Then, we proceeded to the stage that involves the algorithm procedures in the 5 listed steps (see Sections 2.2.1 and 2.3 for both tools). Then, the procedure was repeated by applying the PH. Both methods produced better results in a big dataset. The validity measure was conducted using the Silhouette test, which produced good performance in both the clustering and persistent homology, before a conclusion was drawn. The cluster validity testing stage is where the algorithm was applied using code in the R programming language. The design of this study is illustrated in Fig. 1. The TDA-uML methodology starts with the collection and loading of the dataset into an application. Beforehand, the datasets were arranged in an Excel spreadsheet, and interpolation was done to fix the missing values observed in the collected dataset. Next is the feature representation and preprocessing of the dataset. At this stage, k-means clustering is run using its own algorithm, and persistent homology is run separately on its own algorithm and steps. The k-means is then passed through testing and retesting to obtain the optimal values. For the persistent homology (PH), the filtration process is applied to produce optimal values. The optimal values were obtained as the topological summaries in the subsequent tables (Tables 3 and 4); the results were obtained at dimensions 0 and 1. On the left-hand side of Fig. 1, the computation of k-means clustering and the silhouette analysis were carried out, and optimal values were obtained. Finally, the optimal values from both cases are classified into two classes, and that ends the design of the TDA-uML method as shown in Fig. 1.

2.1.3. Interpolation implementation

The process started with the plotting of the dataset in Microsoft Excel. Some values were observed to be missing from the data. The general interpolation is expressed as Eq. (1).
(1)
$Y = Y_1 + \frac{(Y_2 - Y_1)}{(X_2 - X_1)} \times (X - X_1)$
In the above formula, there are unknown figures that will be fixed based on the other known values.
The interpolation function used in this work is obtained from the Excel function and it is given as Eq. (2).
(2)
$f_x = B_2 + \frac{B_2 - B_1}{(n+1)},\ \forall n \ge 2 \quad \text{or} \quad f_x = B_2 + \frac{B_2 - B_1}{(n+1)},\ \forall n = 1$
where B1 = the value of the cell preceding the empty cell(s), B2 = the value of the cell following the empty cell(s), n = the number of empty cell(s) between (B2 and B1), (n + 1) = the gap between (B2 and B1). The symbol, \$ is used to fix the empty columns where n ≥ 2 by clicking and dragging along the row of the missing value where the function would be first applied. The snapshots show how interpolation was obtained in the Anambra dataset and saved in the Microsoft Excel file as Anambra11 [https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx]. On January 1, 2008, three missing values were fixed, while on February 24, 2013, one missing value was fixed. The fixed values obtained are inside the orange square box. The values obtained in the Excel file are (a)305.304, (b) 292.001, (c) 1.345, and (d) 2.515.
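The general linear interpolation of Eq. (1) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the Excel procedure used in the study: missing values are represented as `None`, the cell position plays the role of X, and the function name `fill_gaps_linear` is hypothetical. Each run of n empty cells is filled in equal steps of (right value − left value)/(n + 1).

```python
def fill_gaps_linear(values):
    """Fill runs of None by linear interpolation between known neighbours.

    Leading or trailing gaps have only one known neighbour and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left                      # equals n + 1 for n empty cells
        step = (filled[right] - filled[left]) / gap
        for k in range(1, gap):
            # Eq. (1) with X measured as the cell position.
            filled[left + k] = filled[left] + k * step
    return filled

# Hypothetical series with a run of three gaps and a single gap.
series = [10.0, None, None, None, 18.0, 20.0, None, 24.0]
result = fill_gaps_linear(series)
print(result)
```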

2.2. Methodology and Computation of Hybrid TDA-uML

2.2.1. Computation of k-means clustering

The implementation of k-means clustering started after fixing missing values via interpolation. We first selected four numbers of clusters (k = 2, 3, 4, and 5) for conciseness, and then imported some relevant libraries that aided the computation. Then, loading of datasets, training and testing, data preprocessing, and computation of cluster centroids follow. We use 80% of the data for training and the remaining 20% for testing the model's accuracy. Next is testing of the algorithm using the steps in k-means and silhouette analysis. The five listed steps were completed, and optimum values were obtained. Finally, the support vector machine (SVM) classified the resultant features into two classes or partitions. The hybrid TDA-uML technique permits systematic and automatic classification of the inputs (data), and so the inconvenience of manual assignment is completely avoided [36]. By extracting intrinsic information from data, Zulkepli 2020 stated in his study that the discovery of systematic and automatic classification has led to a revolutionary method of modelling environmental time series [37].
The five basic steps in computing the k-means aspect of TDA-uML are:
1. Four numbers of clusters (k = 2, 3, 4, and 5) were chosen, and relevant libraries were imported from sklearn cluster.

2. Data importation and selection of k random points from the dataset: here we use code to import the dataset of each state from the Excel file.

3. Computation of cluster centroids: the code (Kmean.cluster_centers) was used to allocate each point to the nearest centroid of the cluster.

4. Recomputation of the produced cluster centroids: testing was completed using codes in k-means and silhouette analysis.

5. Re-compute the clusters until an optimal value is obtained; otherwise, steps 3 and 4 must be repeated.
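The five steps can be sketched with a plain Lloyd-style k-means written against the Python standard library. This is a minimal sketch under stated assumptions, not the study's code: the toy data, the fixed random seed, and the helper names `kmeans` and `within_ss` are hypothetical, and the within-cluster sum of squares is used here only as a simple quantity to compare across k.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd iteration: assign points to the nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # Step 2: k random data points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 3: allocate each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centre
        # Step 5: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def within_ss(points, labels, centroids):
    """Total squared distance of points to their assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical toy data: two well-separated groups of three points.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
for k in (2, 3, 4, 5):                          # Step 1: candidate numbers of clusters
    labels, centroids = kmeans(data, k)
    print(k, round(within_ss(data, labels, centroids), 3))
```

In practice the candidate k would be chosen with a validity measure such as the silhouette index rather than the raw sum of squares, which always decreases as k grows.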
The metric is the function that measures distances or similarity between any two data points, and the standard metrics considered in this study are the Manhattan and Euclidean distances. This study coded the k-dimensional metric space in TDA using the joint principles of mathematical algebra and topology. The study of Adams et al. in 2017 explained that “In selecting a constant choice for k dimension, every feature of the homology stands for a point; the birth and death values are represented as parameters of the homological feature that first appears and disappears” [38]. The persistence diagram (PD) is achieved through vectorization, which opens up the integration of machine learning tools during dataset classification. The work of Ji et al. in 2022 explained the network development behind the methodology of topological data analysis [18]. Finally, the metrics of the datasets are subjected to a filtration process to provide their dimensions' components and holes during transformation. This aspect uses persistent homology (PH), usually with a function known as the Vietoris–Rips filtration, combined with machine learning as a classifier; this final computation resulted in the formation of PDs (Fig. 4).

2.2.2. Computation of the cluster validity test

The main distance measure in the cluster validity computation is the classical metric equation that measures the distance between two points xi and yj in the datasets. The Euclidean and Manhattan functions were used in this study to compute the metrics.
Given the points xi = (xi1, xi2, …, xip ) and yj = (yj1, yj2, …, yjp ) in a dataset, the general function is derived as expressed in Eq. (3).
(3)
$d(x_i, y_j) = \left(|x_{i1} - y_{j1}|^2 + |x_{i2} - y_{j2}|^2 + \dots + |x_{ip} - y_{jp}|^2\right)^{1/2}, \quad d(x_i, y_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} |x_{i1} - y_{j1}|^2$
The variable n (i.e., the nth term) represents the total number of variable points used, and the variables x and y represent the variables in the horizontal and vertical domains, respectively.
Let G be the cumulative distribution function (CDF) of the uniform error, and assume c̄1−α = G −1 (1 − α) is the 1 − α quantile. Then the set C̄(x) (the confidence band) is obtained in Eq. (4).
(4)
$\bar{C}(x) = \left[\hat{W}_n(x) - \bar{c}_{1-\alpha},\ \hat{W}_n(x) + \bar{c}_{1-\alpha}\right]$
So that the approximated conversion result is Eq. (5)
(5)
$W\left(p(x) \in \bar{C}(x)\ \forall x \in (k)\right) = 1 - \alpha$
Provided there is a satisfactory approximation of the distribution, the approximation can be converted into a topological feature. Eq. (6) contributed to the metric framework of this study. The function is represented by G(x), which is a density function of the dataset. In the expression, x represents a set of points in a dataset, xi is a row vector defining the sample for which a lens value is calculated, and xj is all the other samples in the dataset. The expression d2(xi, yj) represents the distance measure between the sample points. The number of variables is n, and d is the distances within the dataset.
(6)
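$G(x) = \sum_{x_j \in Z} e^{-d^2(x_i, y_j)}, \quad d(x_i, y_j) = \left(\sum_{i=1}^{n} (x_i - y_i)^2\right)^{1/2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \quad d^2(x_i, y_j) = \sum_{i=1}^{n} (x_i - y_i)^2 \ \ \forall x_i, y_j \in S$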
where x = set of points in a dataset, xi = row vector of sample i for which a lens value is calculated, yj = samples in the dataset, and d2(xi, yj) = the square of the distance measure between the sample points. The value of the radius (r) determines the kind of metric obtained or used in a study, as expressed in Eq. (7), Eq. (8), and Eq. (9). Eqs. (7) and (8) were used as the metric functions in this study.
(7)
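$\text{Euclidean distance, } d^2(x_i, y_j) = \sum_{i=1}^{n} (x_i - y_i)^2, \ (r = 2)$
(8)
$\text{Manhattan distance, } d_1(x_i, y_j) = \sum_{i=1}^{n} |x_i - y_i|, \ (r = 1)$
(9)
$\text{Maximum distance, } d_\infty(x_i, y_j) = \max_{i=1,\dots,n} |x_i - y_i|, \ (r = \infty)$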
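The three metrics in Eqs. (7) to (9) can be written directly as short functions. This is an illustrative sketch with hypothetical function names and a made-up pair of four-variable observations; the Euclidean distance is shown both in its squared form and with the square root taken.

```python
def squared_euclidean(x, y):
    """Sum of squared coordinate differences (r = 2, before the square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: square root of the squared form."""
    return squared_euclidean(x, y) ** 0.5

def manhattan(x, y):
    """Sum of absolute coordinate differences (r = 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    """Largest absolute coordinate difference (r = infinity)."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two hypothetical observations with four variables each.
x = (1.0, 2.0, 3.0, 4.0)
y = (4.0, 6.0, 3.0, 4.0)
print(squared_euclidean(x, y), euclidean(x, y), manhattan(x, y), maximum(x, y))
```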
Let X = {x1, …, xn} be a set of n objects from a space χ, and let d be a dissimilarity or distance over χ. The Silhouette index (si) for an observation xi ∈ X is defined as Eq. (10). In this study, n = 4 (i.e., the total number of variables used in this study).
(10)
$s_i(C, d) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
where $a(i) = \frac{1}{n_{l(i)} - 1} \sum_{l(i) = l(j)} d(x_i, x_j)$ and $b(i) = \min_{r \neq l(i)} \frac{1}{n_r} \sum_{l(j) = r} d(x_i, x_j)$
• a(i) = mean distance between variables i and j within the same cluster of group C.

• b(i) = mean distance between variable i and the variables j of the nearest other cluster in group C.
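The silhouette index of Eq. (10) can be computed directly from its definition. The sketch below is illustrative only: the function name `silhouette_values`, the toy points, and the convention of assigning 0 to an observation that is alone in its cluster are assumptions made here, and at least two clusters are required.

```python
import math

def silhouette_values(points, labels):
    """Per-observation silhouette s_i = (b(i) - a(i)) / max(a(i), b(i))."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                      # singleton cluster: s_i taken as 0 here
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical toy data: two tight pairs far apart.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
print(sum(scores) / len(scores))
```

Values close to 1 indicate that an observation sits well inside its own cluster; values near 0 or below indicate overlap with a neighbouring cluster.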

The mathematical description of Dunn index is given as DIn and expressed as Eq. (11)
(11)
$DI_n = \frac{\min_{1 \le i < j \le n} \mathrm{dist}(c_i, c_j)}{\max_{1 \le l \le n} \mathrm{diam}(c_l)}$
where dist(ci, cj) = the inter-cluster distance metric between clusters ci and cj, and diam(cl) = the mean distance between all the pairs within cluster cl. An intra-cluster distance metric is obtained within each group, but the mean distances are obtained between all the pairs.
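A small sketch of the Dunn index follows. It is illustrative only and rests on two stated assumptions: the inter-cluster distance dist(ci, cj) is taken as the smallest distance between a point of one cluster and a point of the other, and the diameter is taken as the mean pairwise distance inside a cluster, following the wording above. The function name and the toy clusters are hypothetical.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Ratio of the smallest inter-cluster distance to the largest cluster diameter.

    `clusters` is a list of clusters, each a list of at least two points.
    """
    def separation(ci, cj):
        # Assumed inter-cluster metric: closest pair across the two clusters.
        return min(math.dist(p, q) for p in ci for q in cj)

    def diameter(c):
        # Mean distance over all pairs of points inside one cluster.
        pairs = list(combinations(c, 2))
        return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

    min_sep = min(separation(ci, cj) for ci, cj in combinations(clusters, 2))
    max_diam = max(diameter(c) for c in clusters)
    return min_sep / max_diam

# Hypothetical toy clusters.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 2.0)]]
di = dunn_index(clusters)
print(di)
```

Larger values indicate compact clusters that are far apart relative to their size.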

2.2.3. Algorithm’s procedure for validity evaluation

The R programming language was used to implement the k-means algorithm. The means of the datasets were obtained and randomly selected, and then the algorithm testing was iterated to ensure that the best (optimum) values were produced. The k-means applied the computing mechanism of the hybrid TDA-uML approach, involving testing and retesting to obtain the optimal cluster values. The evaluation stage is where the cluster validity algorithm is applied using code in R. The algorithm starts with a random selection of input from the mean of the dataset. The iteration is carried out by selecting the input from the set of mean values μ⃗1 …, μ⃗n and fixing the mean of the selected observation μ⃗I …, μ⃗K. Firstly, a set of input i is selected while the alternate set of input j is fixed. Then, the iterations are performed on γij. Secondly, the alternate input j is selected and the input i is fixed. Iterations are then carried out. In both situations, the process is continued until convergence is reached and the desired clusters are generated; at that point, the procedure is complete. If optimality is not attained, the process is repeated. The use of relevant code and adherence to the procedure help to achieve the developed model. The steps are itemized in the pseudocode and shown in Fig. 2.
The support vector machine (SVM) is a linear classifier that functions as an algorithm for supervised learning of binary classifiers. A linear classifier is defined as h(x) = sign(wT x + b), where b = the offset (bias) of the plane and wT x = the matrix projection of x on h(x). The SVM is a separating hyperplane with the mathematical formula given as Eq. (12).
(12)
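$w^T x + b = \begin{bmatrix} +1 \\ 0 \\ -1 \end{bmatrix}, \quad \begin{cases} (w^T x_i) + b > 0 & \text{if } y_i = 1 \\ (w^T x_i) + b < 0 & \text{if } y_i = -1 \end{cases}$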
The best separating hyperplane among the various hyperplanes within the points is the one with the biggest margin. The fundamental goal of SVM in a dataset is to find the best separating hyperplane out of all the possible ones among the points. To accomplish this, codes included in the TDA-uML approach were employed.
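The idea of choosing the separating hyperplane with the biggest margin can be illustrated without training a full SVM. The sketch below is not an SVM solver and is not the study's code: it only evaluates the decision rule sign(wT x + b) and compares the geometric margins of two hypothetical candidate hyperplanes on made-up labelled points.

```python
import math

def decision(w, b, x):
    """Signed score w.x + b of a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Class label +1 or -1 from the sign of the score."""
    return 1 if decision(w, b, x) > 0 else -1

def geometric_margin(w, b, samples):
    """Smallest value of y_i (w.x_i + b) / ||w|| over the labelled samples.

    A positive result means every sample lies on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in samples)

# Hypothetical labelled points: (features, label).
samples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two hypothetical separating hyperplanes, each given as (w, b).
candidates = {"A": ((1.0, 1.0), -2.0), "B": ((1.0, 0.0), -1.0)}
margins = {name: geometric_margin(w, b, samples)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates separate the two classes, but candidate A leaves more room between the hyperplane and the closest points, which is the criterion an SVM optimizes over all possible hyperplanes.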

2.3. Computation of Persistent Homology (PH) in the TDA-uML Approach

Persistent homology (PH) tracks the elements of the homology group as the scale parameter increases. Dimension 0 is the existence stage (birth) of the point cloud. At dimension 1, the abstract structures have changed (died) with the increase in the radius of the points (balls). The increase in the radius causes the points to connect to each other's edges or triangles; if their balls overlap, a hole (loop) forms and the point disappears (died). The appearance and disappearance processes in the formation caused its components to decrease, disappear, and transform into higher simplices. The same transformation from dimension 0 to 1 caused the sum of all lifetimes and the average of all holes to reduce.
Summary statistics of topological features are the relevant information obtained about the dataset. They were obtained for dimensions 0 and 1, respectively (i.e., for low-dimensional datasets). The structure at dimension 0 consists of all the points that are circular (i.e., the abstract structure of the dataset). In the continuous transformation due to the increase in the radius of the points in the dataset, various topological features are formed. During the transformation process, the points (balls) at dimension 0 break up and embed with other points via nodes to form different structures at dimension 1. In both dimensions, four topological features are obtained. The first is the maximum hole lifetime; the second is the sum of all lifetimes; the third is the average lifetime of all holes; and the fourth is “the maximum death point of holes.” The formation of these four features was described in Section 2.3. The sum for each of the 7 states was obtained. For instance, for Adamawa, the sum, $\sum_{i=1}^{n}(d_i - b_i)$, was used to obtain the sum of all lifetimes, and the average, $avgD = \frac{\sum_{i=1}^{n}(d_i - b_i)}{n}$, was used for the average lifetime of all holes, at dimension 0. The sum of all lifetimes is the total of all values (points) under the column of differences between birth and death. The stages in the formation process are illustrated in the supplementary material (https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx).
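The four summary statistics can be computed from a list of birth and death values once the features are separated by dimension. The following is a minimal sketch with a made-up persistence diagram; the function name `topological_summaries` and the dictionary keys are hypothetical and do not come from the study's code.

```python
def topological_summaries(pairs):
    """Four summaries for one homology dimension from (birth, death) pairs."""
    lifetimes = [death - birth for birth, death in pairs]
    return {
        "max_lifetime": max(lifetimes),                    # maximum hole lifetime
        "sum_lifetimes": sum(lifetimes),                   # sum of all lifetimes
        "avg_lifetime": sum(lifetimes) / len(lifetimes),   # average lifetime
        "max_death": max(death for _, death in pairs),     # maximum death point
    }

# Hypothetical diagram rows: (dimension, birth, death).
diagram = [(0, 0.0, 1.5), (0, 0.0, 0.5), (1, 1.0, 1.25), (1, 2.0, 4.0)]

# Build a separate list of (birth, death) pairs for each dimension.
by_dim = {}
for dim, birth, death in diagram:
    by_dim.setdefault(dim, []).append((birth, death))

summaries = {dim: topological_summaries(pairs) for dim, pairs in by_dim.items()}
print(summaries)
```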
Persistent homology (PH) detects topological features that appear (birth) and disappear (death) as the filtration threshold increases. The theory of PH is summarized in this study through the persistence diagram and the clustering technique. The PH in this study investigated the dataset through its low-dimensional features at dimensions 0 and 1, called connected components and holes, respectively. Persistent homology combines two functions to achieve its purpose: a function to construct the Vietoris-Rips simplicial complex, and a function to plot the barcode and diagram visualization.
The main theorem states that the Vietoris-Rips simplicial complex $\mathcal{R}_{\varepsilon}(S_r^1)$ of a circle of radius r (equipped with the Euclidean metric) is homotopy equivalent to $S^{2l+1}$ if the condition in Eq. (13) holds:
(13)
$2r\sin\left(\pi\frac{l}{2l+1}\right) < \varepsilon \le 2r\sin\left(\pi\frac{l+1}{2l+3}\right)$
The categorical product of the Vietoris-Rips complexes of the N points (balls) in the dataset is expressed as $\mathcal{R}(S_{r_1}^1) \times \mathcal{R}(S_{r_2}^1) \times \cdots \times \mathcal{R}(S_{r_N}^1)$
This product is equivalent to $\mathcal{R}(T)$, where $T$ is expressed in Eq. (14) as
(14)
$T = S_{r_1}^1 \times S_{r_2}^1 \times \cdots \times S_{r_N}^1$
Further derivation produces its barcode ($bcd_p^{\varepsilon}$), obtained as Eq. (15).
(15)
$Bcd_p^{\varepsilon}(\mathcal{R}(T)) = \left\{ I_1 \cap \cdots \cap I_N \;\middle|\; I_n \in bcd_{m_n}^{1}(\mathcal{R}(S_{r_n}^1)),\ m_n \in \{0,1\},\ \sum_{n=1}^{N} m_n = p \right\}$
The p and ε are the minimum and maximum dimensions, respectively, in the PH. The dimensions are 0 and 1, respectively, in this study. The Vietoris-Rips complex $\mathcal{R}_{\varepsilon}(S_r^1)$ of a circle of radius r is built on the points $T = S_{r_1}^1 \times S_{r_2}^1 \times \cdots \times S_{r_N}^1$. The relevant scale parameter, $\alpha$, is built on the function $S_{r_N}^1 = \{s_0, s_1, s_2, s_3, s_4\}$, N = 4.
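As a worked illustration of Eq. (13), the bounds can be evaluated for the two smallest values of l. The choice of l = 0 and l = 1 is an assumption made here for illustration only; the numerical value is rounded.

```latex
% l = 0: the lower bound vanishes and the upper bound is 2r sin(pi/3)
0 < \varepsilon \le 2r\sin\left(\frac{\pi}{3}\right) = \sqrt{3}\,r
\quad\Longrightarrow\quad
\mathcal{R}_{\varepsilon}(S_r^1) \simeq S^{1}

% l = 1: the lower bound is the previous upper bound
\sqrt{3}\,r < \varepsilon \le 2r\sin\left(\frac{2\pi}{5}\right) \approx 1.902\,r
\quad\Longrightarrow\quad
\mathcal{R}_{\varepsilon}(S_r^1) \simeq S^{3}
```

In words, for small thresholds the complex keeps the shape of the circle itself, and the homotopy type jumps to a higher odd-dimensional sphere each time the threshold crosses the next bound.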
For further reading on the PH, the mathematical derivation and application of the theory for the Vietoris-Rips simplicial complex of a circle and its barcode, see these works [39, 40].
The five steps in the computation of PH in this study are:
1. Installing the TDA package in R and loading its library

2. Importing the dataset, which was saved as a .csv file

3. Fixing the maximum filtration value and the maximum dimension for features (dimensions 0 and 1)

4. Applying persistent homology using a Rips complex

5. Plotting the barcode and persistence diagram (PD) for each case.
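The core of steps 3 and 4 can be sketched in a few lines. The study itself used the TDA package in R; the sketch below is a Python illustration, not the authors' code, and it covers dimension 0 only. The function name `h0_persistence`, the sample points, and the convention that an edge enters the filtration at a scale equal to the distance between its endpoints are all assumptions made for this example.

```python
import math
from itertools import combinations


def h0_persistence(points, max_scale):
    """Dimension-0 (birth, death) pairs of a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies when an edge first
    merges it into another component. Components still alive at
    max_scale are reported with death = max_scale (an assumed convention).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        if length > max_scale:
            break  # maximum filtration value reached (step 3)
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            pairs.append((0.0, length))  # one component dies here

    # Components that never merged survive up to max_scale.
    pairs.extend((0.0, max_scale) for _ in range(n - len(pairs)))
    return pairs


# Hypothetical point cloud, standing in for the imported .csv data.
sample_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (10.0, 10.0)]
sample_pairs = h0_persistence(sample_points, max_scale=5.0)
print(sample_pairs)
```

Each returned pair corresponds to one bar in the dimension-0 barcode. Dimension-1 features (holes) require a full simplicial-complex computation, which is what a dedicated TDA library provides.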

The 1970 datasets were used for computing the persistent homology (PH) in this study. In Nigeria, January and September are assumed to be months of no flooding and flooding, respectively. Finally, four topological summaries were obtained as the features. They are the sum, average, maximum hole lifetime, and maximum death point of a hole for dimension = 0 (connected component) and dimension = 1 (holes), respectively.
A set of features in the PD is represented in this study by $q = \{q_0, q_1, \ldots, q_i\}$, where each $q_i = (U_i, V_i)$ pairs a birth value $U_i$ with a death value $V_i$, in the dimension D (i.e., dimensions 0 and 1). Eqs. (16) to (19) express the four topological summaries obtained in PH.
• a) the first feature is the sum of all lifetimes

(16)
$\mathrm{sumD} = \sum_{i=1}^{n}(V_i - U_i)$
• b) the second feature is the mean lifetime of all holes

(17)
$\mathrm{avgD} = \frac{\sum_{i=1}^{n}(V_i - U_i)}{n}$
• c) the third feature is the maximum hole lifetime

(18)
$\mathrm{maxD} = \max_{q_i \in q}(V_i - U_i)$
• d) the fourth feature is the maximum death point of holes

(19)
$\mathrm{maxd} = \max_{V_i \in q} V_i$
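Eqs. (16) to (19) translate directly into code. The following is a minimal Python sketch, assuming the diagram for one dimension is given as a list of (birth, death) pairs; the function name `topological_summaries` and the sample values are hypothetical and are not taken from the study's data.

```python
def topological_summaries(pairs):
    """Four summaries of a persistence diagram for one dimension.

    pairs: list of (U_i, V_i) tuples, i.e. (birth, death) values.
    Returns a dict with sumD, avgD, maxD, and maxd as in Eqs. (16)-(19).
    """
    if not pairs:
        raise ValueError("at least one (birth, death) pair is required")

    lifetimes = [death - birth for birth, death in pairs]
    total = sum(lifetimes)
    return {
        "sumD": total,                               # Eq. (16)
        "avgD": total / len(lifetimes),              # Eq. (17)
        "maxD": max(lifetimes),                      # Eq. (18)
        "maxd": max(death for _, death in pairs),    # Eq. (19)
    }


# Hypothetical diagram with three features.
example = topological_summaries([(0.0, 1.0), (0.0, 3.0), (0.5, 2.5)])
print(example)
```

Note that maxD and maxd can differ: a feature born late may have the largest death value without having the longest lifetime.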

3.1. Discussion of K-means Clustering in the Hybrid Method

The optimum values were obtained at k = 2 among the cluster groups tested (k = 2, 3, 4, 5). The Silhouette coefficient (SI) obtained in this study lies within the range 0.3 – 0.9, as shown in Table 1. At k = 2, the results fell between 0.7 and 0.9 (70% – 90%), with the value for Bayelsa topping the list. The efficiency of the result is 0.9117 (i.e., approximately 91%), which is an improvement on the previous study. In their paper published in 2020, Riihimäki et al. stated that the more variables or data points, the better the result and efficiency [41]. This study used four rainfall-related parameters (maximum temperature, minimum temperature, precipitation, and wind speed), which is in line with this context. A combined method, TDA and ML, was applied to a climate event in Arizona [4]. Their models produced a consistently fast and accurate separation of atmospheric rivers (ARs), which shows that integrating TDA and ML produces a highly robust and accurate result in flood modelling. The validity test results presented in Table 1 are close to the optimum values (see the range values). According to the benchmark, Silhouette values lie between −1 (poor) and +1 (best). The results shown in Table 1 revealed that all the values obtained for the group k = 2 are 0.5 or above.
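To make the −1 to +1 benchmark concrete, the Silhouette coefficient can be computed by hand for a small labelled dataset. The sketch below is an illustrative Python implementation of the standard definition, not the code used in this study; the function name `silhouette_score`, the one-dimensional sample points, and the labels are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean Silhouette coefficient for a labelled set of points.

    For each point, a is the mean distance to the other members of its
    own cluster and b is the smallest mean distance to any other
    cluster; the point's score is (b - a) / max(a, b).
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 1-D data with two well-separated groups.
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(data, [0, 0, 1, 1])   # natural grouping
bad = silhouette_score(data, [0, 1, 0, 1])    # deliberately mixed grouping
print(round(good, 3), round(bad, 3))
```

A labelling that matches the natural groups scores close to +1, while a mixed labelling scores below zero, which is the behaviour the benchmark range describes.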
In the study by Musa et al. in 2021, it was discovered that the use of persistent homology (PH) as a preprocessing step is more efficient and offers an advantage over the flood early warning system (FLEWS) by producing fewer false alarms [31]. In their study in 2022, Alias et al. presented a new approach for streamflow data analysis using PH. They detected more persistent topological features, in the form of connected components and cycles, in the wet periods than in the dry periods at Sungai Kelantan, Malaysia [42]. However, their study was not extended to other environments and did not classify the possible hazard during wet periods or floods. In their study in 2022, Ighile et al. predicted the flood-susceptible areas in Nigeria based on historical flood records from 1985–2020 and some factors [43]. They used artificial neural network (ANN) and logistic regression (LR) models to develop a flood susceptibility map and evaluated the link between flood events and fifteen explanatory variables. However, these models are complex to estimate and require extensive datasets collected and processed over a period.
The optimum values were used to perform the validation test. The support-vector machines (SVMs) automatically classified the resultant features into two classes. The pattern detection revealed the potential flooding and no-flooding partitions in each of the 7 zones; the resultant features are shown in Fig. 3. The green and black dots in the groups represent different feature patterns in the 7 states and hence predict the flood feature pattern for each state. The results contained outliers, but the validity test procedure eliminated them. Patterns belonging to the 0-group are no-flood zones (green), and those belonging to the 1-group indicate possible flood areas (black) in the states. The unusual pattern in Kogi is due to the large proportion of land mass that was not affected by flooding, as revealed in the resultant features (see Fig. 3).
Some details were extracted using the Google Earth Map software application. The extracted details are the locations of the seven selected flood states, their elevations, and the distances from the data stations to the nearest river. The information about the natural factors was used to establish a relationship between those factors and the findings in the clustering of the dataset. The information on the flood factors in the seven selected states is shown in Table 2. The flooded states predicted are located close to the rivers Niger and Benue.
The map shows the location of the predicted flooding states in Nigeria. Their elevations, extracted from the Google Earth map application, are 104, 10, 116, 114, 431, 345, and 14 metres, respectively (in the order listed in Table 2). The elevations of the flood states and the distances to the closest rivers/waterbodies were obtained from Google Earth, and Kogi has a high elevation across its land mass. A large proportion of the land mass in Kogi was not affected by flooding annually compared to other states, despite being located close to the major rivers. The flooded areas are observed to be close to the rivers Niger and Benue or to their tributaries. The figure showing the predicted flood states on the map of Nigeria is provided in the supplementary material [https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx]. The first small map at the top right side shows the flood areas in Nigeria. The second small map below it is the topographic map of Nigeria (TMN); it relates our findings to the elevation around the two major rivers (Niger and Benue) in Nigeria. The deep blue, sky blue, light blue, and light green areas in the TMN are vulnerable to flooding. They represent the areas where the states predicted as flooding zones in this study are located.
The irregularity in the features' mean might have produced the outliers in the plot, but they were reduced after the implementation of the cluster validity test procedure. Two intra-cluster validity tests were used, and both attained the benchmark validity for our datasets. The Silhouette index attained slightly higher scores than the Dunn index. Therefore, we selected the Silhouette index as the best fit for evaluating our analysis [see https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx]. Close observation of their plots shows that the measure of topological shape remains unchanged if we spin the shape, as stated in the first property of TDA (coordinate invariance).
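For comparison with the Silhouette index, the Dunn index can be sketched as the ratio of the smallest between-cluster distance to the largest within-cluster diameter. The Python code below uses that common definition as an assumption; the study does not state which variant of the Dunn index was applied, and the function name `dunn_index` and the sample data are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter.

    Larger values indicate compact, well-separated clusters.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (
            math.dist(p, q)
            for members in clusters.values()
            for p, q in combinations(members, 2)
        ),
        default=0.0,
    )

    # Smallest distance between points of two different clusters.
    min_separation = min(
        math.dist(p, q)
        for first, second in combinations(list(clusters.values()), 2)
        for p in first
        for q in second
    )

    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point
    return min_separation / max_diameter


# Hypothetical 1-D data with two well-separated groups.
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(dunn_index(data, [0, 0, 1, 1]))  # natural grouping
print(dunn_index(data, [0, 1, 0, 1]))  # deliberately mixed grouping
```

Unlike the Silhouette coefficient, the Dunn index is not bounded above by 1, so the two scores are compared by the patterns they produce rather than by their raw values.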
The outcome produced better connectedness in the clusters, indicating that the performance is accurate. The SI (Silhouette index) patterns are labelled A, and the patterns for the DI (Dunn index) are labelled B, for each of the seven states. The plots for the two tests were placed side by side for comparison. Close observation shows that plots A and plots B follow the same feature patterns in each of the seven states. A table and plots are provided in the supplementary material for illustration and better explanation. This finding revealed that the two cluster validity tests share some features, thereby confirming that the analysis done using the Silhouette test is authentic and performed better.

3.2. Discussion of Persistent Homology (PH)

The results of the 4 topological features obtained from our datasets are shown in Tables 3 and 4. In January, the average lifetime of all connected components (dimension 0, Table 3) is longer than that of the holes at dimension 1 (Table 4). The same happens in September for dimensions 0 and 1. At dimension 1, values are reported for all four topological features: “Max. holes lifetime”, “Max. death point of holes”, “Sum of all lifetimes”, and “Avg. of lifetime of all holes”. The transformation occurs due to changes in the radius of the data points, which connect and overlap with other points to form a new shape; this is the process that happens during the formation of higher simplices. The results at dimension 1 were less than those at dimension 0 in both January and September due to transformations during the creation process. The summary statistics of topological features for dimensions 0 and 1 for the month of September are provided in the supplementary material (https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx). This study utilizes the TDA-uML methodology to extract the intrinsic shape from the data set to identify flood structures. In comparison, the study of Musa et al. in 2021 used the PH to detect dates on which the water level could overflow the usual level [31]. However, their study is restricted in application to the Kelantan River alone.
Persistent homology was applied in the 7 states studied to produce the values of the four topological features of our datasets, which are represented graphically as barcodes and persistence diagrams (PDs). The barcode patterns within the same state are similar in shape even though they come from different months (see Fig. 4). Judging by the length of the bars produced in the two months, the barcodes detected more rainfall in September than in January. Finally, the relationship between death and birth during the transformation from dimension 0 to dimension 1 is positive, as shown in the persistence diagram plots. The PH algorithm does not become unstable in the presence of noise regarding the shape of a cluster, so it is more stable and unbiased. In comparison, the results in this study do not differ greatly in values, feature patterns, and plots.
The findings showed that September produced longer bars than January in the respective barcode plots in all 7 states (see Fig. 4(a) to Fig. 4(g)). The interpretation is that September (the rainy season in Nigeria) carries more significant topological information about the underlying structure of the datasets than January (the dry season in Nigeria). The barcode and PD plots from some zones showed that flooding occurs in September. However, the presence of bars in January indicates some traces of rainfall that could not result in floods (Fig. 4). The PD shows the coordinate points corresponding to the birth and death values of distinct topological features. The black lines and dots in the PD signify the components (●), whereas the red triangles represent the loops (▲).

3.3. General Discussion of Results and Comparison of Tools

The hybrid TDA-uML methodology has been successfully applied, using TDA tools (k-means clustering and persistent homology), to the study of flooding in Nigeria. The studies of Muszynski et al. in 2019 and Musa et al. in 2021 applied the hybrid method of TDA in flood control. Their models are robust, efficient, and accurate in flood prediction and in the detection of shape in data, but they did not compare the two methods used in this study [6, 31, 35]. They also did not identify any parameter in their studies. The TDA, as revealed by Lei Yumiao in 2020, has attributes that depend on the dataset and persist over various scales, producing a firm description of the dependable structure of the input dataset. It possesses high efficiency, especially under perturbation of the input dataset. Lei added, in the same study, that TDA has joint topological and geometric attributes [44]. That study used both methods to obtain a high-dimensional dataset but did not compare tools in TDA. The study by Guo and Banerjee in 2017 selected resolution parameters: a number of intervals l and an overlap percentage p, where p lies between 0 and 100 [23]. They encountered challenges in varying the resolution parameters appropriately so that the important sub-groups are correctly distinguished from artifacts in the generated topological networks. In this study, the resolution parameter is automatically selected in the procedure.
In this study, the main result in Fig. 4 shows that six out of the seven selected states have large areas that are flooded annually. Kogi state is the only state with less flooded land mass according to the prediction. The results were predicted according to the pattern of the features' outcomes and therefore achieved the main aim of developing a hybrid method for predicting flooding. The result was used in classifying the hydrological areas of Nigeria to achieve our objective.
In comparison, persistent homology (PH) has various algorithms that extract topological features from datasets according to their dimensions. In this study, the clustering algorithm in PH was used to study the datasets in low dimensions (i.e., dimensions 0 and 1). Four topological summaries were extracted from the dataset. The parameters were found to increase the performance and fix the possible outliers in the outcome. The k-means clustering procedure detected similar feature patterns in 6 out of the 7 states selected for this study. The algorithm classified the optimal clusters into flooding and non-flooding in the 7 states, from which the topological features were extracted. The plots, when closely observed, showed that the PH could extract inputs that share more similar features than the k-means clustering could. This is clearly seen in the bars of the barcode plots for January and September. The PH does that by calculating the birth point, death point, maximum lifetime of holes, averages, and sums as the threshold values change.
The filtration and the corresponding barcode make the PH procedure unique in interpreting the resultant features; the results of the barcode plots can also be displayed in persistence diagram (PD) plots, which k-means clustering cannot provide. Both tools identified months with no flooding and months with probable flooding in the 7 chosen zones. However, the PH was used to demonstrate that rainfall occurs during the dry season in Nigeria, contrary to the presumption that there would be no rainfall in the dry season. That is a contribution of the study, even though the rainfall is not heavy enough to cause flooding in the dry season. The advantage of adopting the TDA-uML method to predict flood-susceptible areas is that it may be used for much more extensive coverage areas than hydrological models, which are typically limited in their coverage. Similarly, ML models can be applied and replicated in multiple locations with limited data availability.

3.4. Classification of the Hydrological Areas

This study classified the hydrological map of Nigeria according to the possible degree of flood risk. The information obtained in the clustering plots for the seven states showed that the Kogi area experienced the lowest flood risk among the 7 states. This study found that the high elevation surrounding most areas in Kogi state contributed to this, while those areas that are flooded have low elevation. The flooded states were located at low elevations close to rivers and watercourses. The elevations of the eight hydrological areas on the Nigerian map classify the states according to flood risk and vulnerability. The flood-hazard assessment and risk-based zones of a tropical flood plain are classified into five levels (violent, high, mild, moderate, and low flood risk) in this study. This classification will help inform better flood management and risk aversion. It will help in logistics estimation, particularly in hydrological areas and fiscal planning. Proper flood management helps to minimize cost and waste. The classifications are given as follows:
• I Sokoto = Mild

• III Yola = Moderate

• IV Makurdi = High

• V Port Harcourt = Violent

• VII Enugu = High

• VIII Maiduguri = High.

The hydrological areas map is provided in the supplementary material (https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx).

4. Conclusions

A hybrid Topological Data Analysis and unsupervised Machine Learning (TDA-uML) method was used to develop a model that successfully predicted flooding in Nigeria. The two TDA tools were integrated with machine learning (ML) and SVM procedures. The TDA-uML extracted information in terms of topological summaries (values) and interpreted the results in barcodes and persistence diagrams to detect rain in dry periods. The clustering aspect could detect unusual feature patterns that were relevant in the detection and classification of flood hazards. The implementation of our model is unique, with higher performance when compared to other flood models in the literature. The hydrological areas of Nigeria were successfully classified by flood risk and vulnerability using the vital information obtained from the findings. The unusual pattern discovered in this study was observed to have a positive relationship with the elevation of the zone(s) it occurred in. Every flood section produces a feature that is useful in predicting flood and no-flood zones. The four parameters involved increased the data points and contributed to the high performance (91%) of the model. The findings could detect areas and/or months of high flood concentration. In flood management and mitigation, the outcome of this research would be relevant in reducing exposure to flood risk, vulnerability, and hazard associated with flooding. The findings provided relevant information about the degree of flood risk and allowed us to identify vulnerability and quantify risk in a new study.
Some limitations are peculiar to the hybrid TDA-uML methodology. The first is that its output is complex to interpret. Secondly, the use of the relevant software is limited. Furthermore, due to the large number of clustering algorithms that have recently become available, there are still questions about which algorithm should be used. Therefore, the authors recommend that future research should focus on determining the clustering algorithm that best fits the procedure in the modeling. The method should also be extended to modeling other environmental hazards. The code for the implementation of this study is made available at GitHub (https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx).

Acknowledgement

I would like to extend my profound appreciation to a career advisor (my mentor) at the School of Mathematics, Universiti Sains Malaysia. Thanks also go to the University of Nigeria, Enugu State, Nigeria, for granting study leave.

Nomenclature

| Symbols | Definitions | Dimensions |
|---|---|---|
| D | Space | Squared metre |
| Dim(s) | Dimension(s) | One and two dimensions |
| dist | Distance | Metre |
| E | Threshold | $E = hf = \Phi + \frac{1}{2}mv_{max}^{2}$ |
| Max | Maximum | Metre |
| Min | Minimum | Metre |
| ℝ | Real number | dim 1 – 18 simplices – 7 vertices |
| d | Dimension of real number | One dimension |
| fx | Interpolation function | |
| h(x) | Linear classifier | |

Notes

Conflict of Interest

The author(s) declared that there is no conflict of interest.

Author contributions

F.O.O (PhD student) conducted all the analysis and wrote the manuscript. M.T.I (Associate Professor) revised the manuscript. M.K.M.A (PhD Senior Lecturer) provided the codes and revised the analysis.

Funding statement

There is no funding support for this research.

References

1. Biscio CAN, Møller J. The accumulated persistence function, a new useful functional summary statistic for topological data analysis, with a view to brain artery trees and spatial point process applications. J. Comput. Graph. Stat. 2019;28:671–681. https://doi.org/10.1080/10618600.2019.1573686

2. Frosini P. A distance for similarity classes of submanifolds of a euclidean space. Bull. Aust. Math. Soc. 1990;42:407–415. https://doi.org/10.1017/S0004972700028574

3. Letscher HED, Zomorodian A. Topological Persistence and Simplification. Discrete Comput. Geom. 2002;28:511–533. https://doi.org/10.1109/SFCS.2000.892133

4. Zomorodian A, Carlsson G. Computing persistent homology. Discrete Comput. Geom. 2005;33:249–274. https://doi.org/10.1007/s00454-004-1146-y

5. Richeson DS. Euler's gem: The polyhedron formula and the birth of topology. Princeton University Press; 2019. p. 120–121.

6. Muszynski G, Kashinath K, Kurlin V, Wehner M, Prabhat. Topological data analysis and machine learning for recognizing atmospheric river patterns in large climate datasets. Geosci. Model Dev. 2019;12:613–628. https://doi.org/10.5194/gmd-12-613-2019

7. Li D, Qu S, Shi P, et al. Development and integration of sub-daily flood modelling capability within the SWAT model and a comparison with XAJ model. Water Res. 2018;10:1263 https://doi:10.3389/fpls.2018.00553

8. Tanoue M, Hirabayashi Y, Ikeuchi H, et al. Global-scale river flood vulnerability in the last 50 years. Sci. Rep. 2016;6:1–9. https://doi.org/10.1038/srep36021

9. Whitworth NUC. Flooding and flood risk reduction in Nigeria: Cardinal gaps. J. Geogr. Nat. Disast. 2015;05:1–12. https://doi.org/10.4172/2167-0587.1000136

10. Dottori F, Salamon P, Bianchi A, Alfieri L, Hirpa FA, Feyen L. Development and evaluation of a framework for global flood hazard mapping. Adv. Water Resour. 2016;94:87–102. http://dx.doi.org/10.1016/j.advwatres.2016.05.002

11. Teng J, Jakeman AJ, Vaze J, Croke BFW, Dutta D, Kim S. Flood inundation modelling: A review of methods, recent advances and uncertainty analysis. Environ. Model. Softw. 2017;90:201–216. https://dx.doi.org/10.1016/j.envsoft.2017.01.006

12. Bubenik P. Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 2015;16:77–102.

13. Ohanuba FO, Ismail MT, Ali MKM. Topological data analysis via unsupervised machine learning for recognizing atmospheric river patterns on flood detection. Sci. Afr. 2021;13:1–8. https://doi.org/10.1016/j.sciaf.2021.e00968

14. Dettinger MD, Ingram BL. The coming megafloods. Sci. Am. 2013;308:64–71. https://www.jstor.org/stable/10.2307/26017895

15. Shang Y, Huang H, Sun X, Wei W, Mao X. A pattern-aware self-attention network for distant supervised relation extraction. Inf. Sc. 2022;584:269–279. https://doi.org/10.1016/j.ins.2021.10.047

16. Pereira CMM, de-Mello RF. Persistent homology for time series and spatial data clustering. Expert Syst. Appl. 2015;42:6026–6038. https://dx.doi.org/10.1016/j.eswa.2015.04.010

17. Ambrosio J, Lazzaretti AE, Pipa DR, da-Silva MJ. Two-phase flow pattern classification based on void fraction time series and machine learning. Flow Meas. Instrum. 2021;83:1–10. https://doi.org/10.1016/j.flowmeasinst.2021.102084

18. Ji Y, Zhang H, Gao F, Sun H, Wei H, Wang N. LGCNet: A local-to-global context-aware feature augmentation network for salient object detection. Inf. Sc. 2022;584:399–416. https://doi.org/10.1016/j.ins.2021.10.055

19. Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. J. Bioinform. 2005;21:3201–3212. https://doi.org/10.1093/bioinformatics/bti517

20. Islam ARMT, Talukdar S, Mahato S, Kundu S. Flood susceptibility modelling using advanced ensemble machine learning models. Geosci. Front. 2021;12:101075 https://doi.org/10.1016/j.gsf.2020.09.006

21. Zeng D, Liu K, Chen Y, Zhao J. Distant supervision for relation extraction via piecewise convolutional neural networks. In : Proceedings of the 2015 conference on empirical methods in natural language processing; 17–21 September 2015; Portugal. 1753–1762. https://www.freebase.com/

22. Huang W, Mao Y, Yang L, Yang Z, Long J. Local-to-global gcn with knowledge-aware representation for distantly supervised relation extraction. Knowl. Based Syst. 2021;234:107565 https://doi.org/10.1016/j.knosys.2021.107565

23. Guo W, Banerjee AG. Identification of key features using topological data analysis for accurate prediction of manufacturing system outputs. J. Manuf. Syst. 2017;43:225–234. https://dx.doi.org/10.1016/j.jmsy.2017.02.015

24. Lazar N, Ryu H. The shape of things: Topological data analysis. Chance. 2021;34:59–64. https://doi.org/10.1080/09332480.2021.1915036

25. Rieck B, Leitte H. Exploring and comparing clusterings of multivariate data sets using persistent homology. Computer Graphics Forum: Wiley Online Library. 2016. 81–90. http://onlinelibrary.wiley.com/doi/10.1111/cgf.12884/abstract

26. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc., B: Stat. Methodol. 2001;63(2):411–423. https://doi.org/10.1111/1467-9868.00293

27. Moughal T. Hyperspectral image classification using support vector machine. J. Phys. Conf. Ser. 2012;24:439–440. https://doi.org/10.1088/1742-6596/439/1/012042

28. Feng M, Porter MA. Spatial applications of topological data analysis: Cities, snowflakes, random structures, and spiders spinning under the influence. Phys. Rev. Res. 2020;2:426–428. https://doi.org/10.1103/PhysRevResearch.2.033426

29. Munzimi YA, Hansen MC, Asante KO. Estimating daily stream-flow in the Congo Basin using satellite-derived data and a semi-distributed hydrological model. Hydrol. Sc. J. 2019;64:1472–1487. https://doi.org/10.1080/02626667.2019.1647342

30. Islambekov U, Gel YR. Unsupervised space–time clustering using persistent homology. Environmetrics. 2019;30:1–6. https://doi.org/10.1002/env.2539

31. Musa SMSS, Noorani MS, Razak FA, Ismail M, Alias MA, Hussain SI. Using persistent homology as preprocessing of early warning signals for critical transition in flood. Sci. Rep. 2021;11:1–14. https://doi.org/10.1038/s41598-021-86739-5

32. Carlsson G. Topology and data. Bull. Am. Math. Soc. 2009;46:255–308. https://www.ams.org/journal-terms-of-use

33. Zieliński B, Lipiński M, Juda M, Zeppelzauer M, Dłotko P. Persistence codebooks for topological data analysis. Artif. Intell. Rev. 2022;54:1969–2009. https://doi.org/10.1007/s10462-020-09897-4

34. Adler RJS, Agami P, Pranav P. Modeling and replicating statistical topology and evidence for CMB nonhomogeneity. Proc. Natl. Acad. Sci. 2017;114:11878–11883. https://doi.org/10.2307/26485685

35. Offroy M, Duponchel L. Topological data analysis: A promising big data exploration tool in biology, analytical chemistry and physical chemistry. Anal. Chim. Acta. 2016;910:1–11. https://dx.doi.org/10.1016/j.aca.2015.12.037

36. Wang Y, Zhang B, Gao SH, Li CH. Investigation on the effect of freeze-thaw on fracture mode classification in marble subjected to multi-level cyclic loads. Theor. Appl. Fract. Mech. 2021;111:1–19. https://doi.org/10.1016/j.tafmec.2020.102847

37. Zulkepli NFS, Noorani MSM, Razak FA, Ismail M, Alias MA. Cluster analysis of haze episodes based on topological features. J. Sustain. Dev. 2020;12:1–17. https://doi.org/10.3390/su12103985

38. Adams H, Emerson T, Kirby M, et al. Persistence images: A stable vector representation of persistent homology. J Mach. Learn. Res. 2017;18:1–35. http://jmlr.org/papers/v18/16-337.html

39. Adamaszek M, Adams H. The Vietoris–Rips complexes of a circle. Pac. J. Math. 2017;290:1–40. https://doi.org/10.2140/pjm.2017.290.1

40. Arnold M, Baryshnikov Y, Mileyko Y. Typical representatives of free homotopy classes in multi-punctured plane. J. Topol. Anal. 2019;11:623–659. https://doi.org/10.1142/S1793525319500262

41. Riihimäki H, Chachólski W, Theorell J, Hillert , Ramanujam R. A topological data analysis based classification method for multiple measurements. BMC bioinform. 2020;21:1–18. https://doi.org/10.1186/s12859-020-03659-3

42. Alias MA, Musa SMSS, Noorani MSM, Razak FA, Ismail M. Streamflow data analysis for flood detection using persistent homology. Sains. Malays. 2022;51:2211–2222. http://doi.org/10.17576/jsm-2022-5107-22

43. Ighile EH, Shirakawa H, Tanikawa H. A study on the application of gis and machine learning to predict flood areas in Nigeria. J. Sustain. Dev. 2022;14:50–59. https://doi.org/10.3390/su14095039

44. Lei Y. Topological methods for the analysis of applications. In: International Conference on Modern Educational Technology, Innovation and Entrepreneurship (ICMETIE 2020); Atlantis Press; 15–16 July 2020; Shanxi. 6–9. https://creativecommons.org/licenses/by-nc/4.0/

Study Design
Fig. 2
The Algorithm Pseudocode
Fig. 3
Summary of Clustered Feature Pattern on the Seven Selected States.
Fig. 4
(a – g) Plots of Barcodes and PD for Months of January and September for the Seven States.
Table 1
Silhouette Output Scores with their Validity

| State | k = 2 | k = 3 | k = 4 | k = 5 | Cluster validity |
|---|---|---|---|---|---|
| Benue | 0.698 | 0.612 | 0.612 | 0.540 | 0.8619 |
| Kogi | 0.880 | 0.710 | 0.511 | 0.553 | 0.8758 |
| Kwara | 0.790 | 0.658 | 0.401 | 0.395 | 0.3794 |
| Anambra | 0.762 | 0.670 | 0.602 | 0.510 | 0.6810 |
| River | 0.719 | 0.694 | 0.687 | 0.397 | 0.6118 |
| Edo | 0.698 | 0.682 | 0.398 | 0.388 | 0.3804 |
| Bayelsa | 0.889 | 0.721 | 0.677 | 0.612 | 0.9117 |
| Range | (0.7–0.9) | (0.6–0.8) | (0.3–0.7) | (0.3–0.7) | (0.3–0.9) |

k = number of clusters.
Table 2
Summary of Flood Factors in the Seven States

| s/n | State | Location | Distance to closest river (km) | Elevation (m) |
|---|---|---|---|---|
| 1 | Anambra | 5°58′22.42″N 7°14′29.41″E | 51.40 | 104 |
| 2 | Bayelsa | 6°27′45.936″N °24′44.604″E | 11.59 | 10 |
| 3 | Benue | 7°10′12.252″N 9°17′4.2″E | 2.18 | 116 |
| 4 | Edo | 6°23′59.28″N 5°36′35.6″E | 46.43 | 114 |
| 5 | Kogi | 7°16′13.62″N 5°13′12.648″E | 74.19 | 431 |
| 6 | Kwara | 8°28′42.78″N 4°31′53.688″E | 14.85 | 345 |
| 7 | River | 9°4′56.352″N 6°00.108′E | 51.48 | 14 |
Table 3
Summary Statistics of Topological Features for Dimension 0 for January

| State | Sum of all lifetimes (sumD) | Avg. lifetime of all holes (avgD) |
|---|---|---|
| Anambra | 564.1333 | 21.6974 |
| Bayelsa | 555.3351 | 19.8334 |
| Benue | 560.0228 | 28.0011 |
| Edo | 562.7852 | 23.0211 |
| Kogi | 565.3397 | 22.6135 |
| Kwara | 562.4016 | 26.7810 |
| River | 578.6685 | 21.4321 |
Table 4
Summary Statistics of Topological Features for Dimension 1 for January

| State | Max. holes lifetime | Max. death point of holes (maxd) | Sum of all lifetimes (sumD) | Avg. of lifetime of all holes (avgD) |
|---|---|---|---|---|
| Anambra | 0.8800 | 4.1502 | 0.9966 | 0.4983 |
| Bayelsa | 0.9825 | 3.5642 | 0.9825 | 0.4912 |
| Benue | 0.3377 | 19.4161 | 0.3377 | 0.3377 |
| Edo | 0.4142 | 1.4142 | 1.6568 | 0.4144 |
| Kogi | 0.3377 | 13.4142 | 1.2726 | 0.4146 |
| Kwara | 0.6563 | 29.0172 | 1.0431 | 0.3477 |
| River | 0.4432 | 3.6055 | 0.6055 | 0.3027 |