### 1. Introduction

### 1.1. Problem Statement

### 2. Materials and Methods

### 2.1. Tools in the Hybrid TDA-uML Methodology

#### 2.1.1. Three key properties of topological data analysis

Coordinate invariance (coordinate freeness): the principal idea is that it should not matter how the dataset is represented in terms of coordinates, provided the internal similarities (distances) are preserved. For example, the measured topological shape does not change if the shape is rotated.

Deformation invariance: the topological properties of an object remain unchanged when the object is stretched. For example, the letter ‘A’ remains a loop (a closed triangle) with two legs when it is stretched.

Compressed representation: on closer observation, the important attributes of the letter ‘A’ can be summarized as three bounded angles with two stands (legs), irrespective of the millions of data points with similarity relationships in the object that are compressed into this summary. For more detail on the properties of TDA, see the work of M. Offroy and L. Duponchel [35].

#### 2.1.2. Design of the methods

#### 2.1.3. Interpolation implementation

##### (1)

$$Y = Y_1 + \frac{Y_2 - Y_1}{X_2 - X_1}\,(X - X_1)$$

##### (2)

$$f_x = B2 + \frac{\$B2 - \$B1}{n+1},\ \forall n \ge 2 \quad \text{or} \quad f_x = B2 + \frac{B2 - B1}{n+1},\ \forall n = 1$$

where *B*1 = the value of the cell preceding the empty cell(s), *B*2 = the value of the cell following the empty cell(s), *n* = the number of empty cell(s) between *B*1 and *B*2, and (*n* + 1) = the gap between *B*1 and *B*2. The symbol $ is used to fix the cell references for the empty columns where *n* ≥ 2, so that the formula can be filled by clicking and dragging along the row of the missing values from the cell where the function is first applied. The snapshots show how interpolation was obtained in the Anambra dataset and saved in the Microsoft Excel file as Anambra11 [https://github.com/Felix-Obi/web100/blob/main/supplmat/a.docx]. For January 1, 2008, three missing values were fixed, while for February 24, 2013, one missing value was fixed. The fixed values obtained are inside the orange square box. The values obtained in the Excel file are (a) 305.304, (b) 292.001, (c) 1.345, and (d) 2.515.
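The gap-filling step can also be sketched outside a spreadsheet. The following Python function is a minimal illustration of linear interpolation in the sense of Eq. (1), assuming equally spaced records; it starts from the known value preceding the gap and adds equal steps of (*B*2 − *B*1)/(*n* + 1). The function name `fill_gaps_linear` and the use of `None` for an empty cell are assumptions made for this example, not part of the study's Excel workflow.

```python
from typing import List, Optional


def fill_gaps_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior runs of None by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    for left, right in zip(known, known[1:]):
        n = right - left - 1  # number of empty cells between the two known cells
        if n == 0:
            continue
        # Equal step across the gap of (n + 1) intervals.
        step = (filled[right] - filled[left]) / (n + 1)
        for k in range(1, n + 1):
            filled[left + k] = filled[left] + k * step
    # Gaps before the first or after the last known value are left as None.
    return filled


if __name__ == "__main__":
    # Three missing values between two observations (illustrative numbers only).
    print(fill_gaps_linear([10.0, None, None, None, 18.0]))
```

Leading and trailing gaps are deliberately left unfilled, because interpolation needs a known value on both sides.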

### 2.2. Methodology and Computation of Hybrid TDA-uML

#### 2.2.1. Computation of k-means clustering

#### 2.2.2. Computation of the cluster validity test

For two points $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $y_j = (y_{j1}, y_{j2}, \dots, y_{jp})$ in a dataset, the general function is derived as expressed in Eq. (3).

##### (3)

$$\begin{array}{c}d\left(x_i, y_j\right)=\sqrt{\left|x_{i1}-y_{j1}\right|^{2}+\left|x_{i2}-y_{j2}\right|^{2}+\dots+\left|x_{ip}-y_{jp}\right|^{2}}\\ d\left(x_i, y_j\right)=\sqrt{\sum_{k=1}^{p}\left|x_{ik}-y_{jk}\right|^{2}}\end{array}$$

… th term) represents the total number of the variable points used, and the variables $x$ and $y$ represent the variables in the horizontal and vertical domains, respectively.
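As a small worked illustration of the Euclidean distance in Eq. (3), the sketch below computes the distance between two observations and the full pairwise distance matrix of a dataset. The function names are hypothetical and the input is assumed to be a list of equal-length numeric rows.

```python
import math
from typing import List, Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of variables")
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distances(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix whose (i, j) entry is the distance between rows i and j."""
    return [[euclidean_distance(a, b) for b in rows] for a in rows]


if __name__ == "__main__":
    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
    print(pairwise_distances([[0.0, 0.0], [3.0, 4.0]]))
```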

… $\overline{c}_{1-\alpha} = G^{-1}(1-\alpha)$ is the $(1-\alpha)$ quantile. Then the set (the confidence band) is obtained in Eq. (4).

##### (4)

$$\overline{C}(x)=\left[\widehat{\mathcal{W}}_{n}(x)-\overline{c}_{1-\alpha},\ \widehat{\mathcal{W}}_{n}(x)+\overline{c}_{1-\alpha}\right]$$

… *G*(*x*), which is a density function of the dataset. In the expression, **x** represents the set of points in the dataset, $x_i$ is the row vector of the sample for which a lens value is calculated, and $y_j$ denotes all the other samples in the dataset. The expression $d^{2}(x_i, y_j)$ represents the distance measure between the sample points. The number of variables is *n*, and *d* is the distance within the dataset.
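The density lens *G*(*x*) described here (and written out in Eq. (6)) can be illustrated with a short sketch. It assumes that the lens value of a sample is the sum of exp(−*d*²) over all the *other* samples, with *d* the Euclidean distance and no bandwidth or normalizing constant; these choices, and the function name `gaussian_density_lens`, are assumptions for illustration rather than the exact implementation used in the study.

```python
import math
from typing import List, Sequence


def gaussian_density_lens(rows: Sequence[Sequence[float]]) -> List[float]:
    """Return one lens value per sample: the sum of exp(-d^2) over the other samples."""
    lens = []
    for i, x in enumerate(rows):
        total = 0.0
        for j, y in enumerate(rows):
            if i == j:
                continue  # assumption: only "all the other samples" contribute
            squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-squared_distance)
        lens.append(total)
    return lens


if __name__ == "__main__":
    # Two coincident samples and one far away: the isolated sample gets a small value.
    print(gaussian_density_lens([[0.0], [0.0], [10.0]]))
```

Samples in dense regions receive large lens values, while isolated samples receive values close to zero.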

##### (6)

$$\begin{array}{l}G(x)=\sum_{y_j\in Z}e^{-d^{2}(x_i,y_j)}\\ d\left(x_i,y_j\right)=\left(\sum_{i=1}^{n}(x_i-y_i)^{2}\right)^{\frac{1}{2}}\\ d^{2}\left(x_i,y_j\right)=\sum_{i=1}^{n}(x_i-y_i)^{2},\ \forall x_i,\, y_j\in S\end{array}$$

where **x** = set of points in the dataset, $x_i$ = row vector of sample *i* for which a lens value is calculated, $y_j$ = samples in the dataset, and $d^{2}(x_i, y_j)$ = the square of the distance measure between the sample points. The value of the radius (*r*) determines the kind of metric obtained or used in a study, as expressed in Eq. (7), Eq. (8), and Eq. (9). Eqs. (7) and (8) were used as the metric functions in this study.

##### (7)

$$\text{Euclidean distance},\ d_{2}\left(x_i,y_j\right)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^{2}},\quad (r=2)$$

##### (8)

$$\text{Manhattan distance},\ d_{1}\left(x_i,y_j\right)=\sum_{i=1}^{n}\left|x_i-y_i\right|,\quad (r=1)$$

##### (9)

$$\text{Maximum distance},\ d_{\infty}\left(x_i,y_j\right)=\underset{1\le i\le n}{\text{max}}\left|x_i-y_i\right|,\quad (r=\infty)$$

Let $X = \{x_1, \dots, x_n\}$ be a set of *n* objects from a space $\chi$, and let *d* be a dissimilarity or distance over $\chi$. The Silhouette index ($s_i$) for an observation $x_i \in X$ is defined as Eq. (10). In this study, *n* = 4 (i.e., the total number of variables used in this study).

Here, *a*(*i*) = mean length between variables *i* and *j* of a close cluster within group *C*, and *b*(*i*) = mean length between variables *i* and *j* of others between different groups *C*.
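To make the roles of *a*(*i*) and *b*(*i*) concrete, the sketch below computes per-observation Silhouette values with either the Euclidean metric of Eq. (7) or the Manhattan metric of Eq. (8). It assumes the standard definition of the index, $s_i = (b(i) - a(i)) / \max\{a(i), b(i)\}$, where *a*(*i*) is the mean distance to the other members of the observation's own cluster and *b*(*i*) is the smallest mean distance to any other cluster. All function names are hypothetical, and this is not the code used to produce Table 1.

```python
import math
from typing import Callable, Dict, List, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def euclidean(x: Point, y: Point) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x: Point, y: Point) -> float:
    return sum(abs(a - b) for a, b in zip(x, y))


def mean_distance(i: int, members: List[int], points: Sequence[Point], metric: Metric) -> float:
    return sum(metric(points[i], points[j]) for j in members) / len(members)


def silhouette_values(points: Sequence[Point], labels: Sequence[int],
                      metric: Metric = euclidean) -> List[float]:
    """Per-observation Silhouette values, each in the range [-1, +1]."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for a single-member cluster
            continue
        a = mean_distance(i, own, points, metric)
        b = min(mean_distance(i, members, points, metric)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


if __name__ == "__main__":
    pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
    values = silhouette_values(pts, [0, 0, 1, 1], metric=manhattan)
    print(sum(values) / len(values))  # mean Silhouette value of the clustering
```

Averaging the per-observation values gives a single score for a clustering, which is how values for different *k* can be compared.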

##### (11)

$$DI_{n}=\frac{\underset{1\le i<j\le n}{\text{min}}\ \text{dist}(c_i,c_j)}{\underset{1\le l\le n}{\text{max}}\ \text{diam}(c_l)}$$

where dist($c_i$, $c_j$) = inter-cluster distance metric between clusters $c_i$ and $c_j$, and diam($c_l$) = mean distance between all the pairs in cluster $c_l$. An intra-cluster distance metric is obtained within each group, but the mean distances are obtained between all the pairs.
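A minimal sketch of the Dunn index follows. It uses the standard form of the index, in which the smallest inter-cluster distance is divided by the largest cluster diameter. Two choices here are assumptions for illustration: the inter-cluster distance is taken as the smallest Euclidean distance between any two points of different clusters, and the diameter is the mean pairwise distance inside a cluster, following the description of diam above. Function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

Point = Sequence[float]
Cluster = Sequence[Point]


def euclidean(x: Point, y: Point) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_pairwise_diameter(cluster: Cluster) -> float:
    """Mean distance over all pairs of points in one cluster (0 for a single point)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def dunn_index(clusters: Sequence[Cluster]) -> float:
    """Smallest separation between clusters divided by the largest diameter."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    min_separation = min(
        euclidean(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    max_diameter = max(mean_pairwise_diameter(c) for c in clusters)
    if max_diameter == 0:
        raise ValueError("every cluster has zero diameter; the index is undefined")
    return min_separation / max_diameter


if __name__ == "__main__":
    left = [[0.0, 0.0], [0.0, 1.0]]
    right = [[10.0, 0.0], [10.0, 1.0]]
    print(dunn_index([left, right]))
```

Larger values indicate compact clusters that are well separated from each other.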

#### 2.2.3. Algorithm’s procedure for validity evaluation

… $\vec{\mu}_{1}, \dots, \vec{\mu}_{n}$ and fixing the mean of the selected observation $\vec{\mu}_{1}, \dots, \vec{\mu}_{K}$. Firstly, a set of input *i* is selected while the alternate set of input *j* is fixed, and iterations are performed on $\gamma_{ij}$. Secondly, the alternate input *j* is selected and the input *i* is fixed, and iterations are carried out again. In both situations, the process continues until convergence is reached and the desired clusters are generated; at that point, the procedure is complete. If optimality is not attained, the process is repeated. Using the relevant code and following this procedure yields the developed model. The steps are itemized in the pseudocode and shown in Fig. 2.
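The alternating procedure described above can be sketched as a plain k-means loop: with the means fixed, each observation is assigned to its nearest mean; with the assignments fixed, each mean is recomputed; the two steps repeat until the assignments stop changing. This is a minimal illustration, not the pseudocode of Fig. 2. The function name, the explicit `initial_means` argument, and the `max_iter` limit are assumptions introduced for the example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(x: Point, y: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means(points: Sequence[Point], initial_means: Sequence[Point],
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Alternate between updating assignments and updating means until stable."""
    means = [list(m) for m in initial_means]
    assignment = [-1] * len(points)
    for _ in range(max_iter):
        # Step 1: means fixed, assign every observation to its nearest mean.
        new_assignment = [
            min(range(len(means)), key=lambda k: squared_distance(p, means[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # convergence: no observation changed cluster
        assignment = new_assignment
        # Step 2: assignments fixed, recompute each mean from its members.
        for k in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous mean
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
    print(k_means(data, initial_means=[[0.0, 0.0], [10.0, 0.0]]))
```

The result depends on the initial means, so in practice the loop is usually run from several starting points and the best clustering is kept.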

… $h(x) = \text{sign}(w^{T}x + b)$, where *b* = the offset (bias) that positions the separating plane, and $w^{T}x$ = matrix projection of *x* on *h*(*x*). The SVM is a separating hyperplane with the mathematical formula given as Eq. (12).

##### (12)

$$w^{T}x+b=\left[\begin{array}{c}+1\\ 0\\ -1\end{array}\right],\quad \left\{\begin{array}{ll}(w^{T}x_i)+b>0 & \text{if } y_i=1\\ (w^{T}x_i)+b<0 & \text{if } y_i=-1\end{array}\right.$$

### 2.3. Computation of Persistence Homology (PH) in the TDA-ML Approach

… *r* (equipped with the Euclidean metric) is homotopy equivalent to $S^{2l+1}$ (i.e., the product of Vietoris-Rips complexes). The theorem holds if the condition in Eq. (13) is satisfied.

##### (13)

$$2r\sin\left(\pi\frac{l}{2l+1}\right)<\varepsilon\le 2r\sin\left(\pi\frac{l+1}{2l+3}\right)$$

##### (15)

$$\text{Bcd}_{p}^{\varepsilon}\left(\mathcal{R}(T)\right)=\left\{I_{1}\cap\cdots\cap I_{N}\ \big|\ I_{n}\in\text{bcd}_{m_{n}}^{1}\left(R\left(S_{r_{n}}^{1}\right)\right),\ m_{n}\in\{0,1\},\ \sum_{n=1}^{N}m_{n}=p\right\}$$

where *p* and *ε* are the minimum and maximum dimensions, respectively, in the PH. The dimensions are 0 and 1, respectively, in this study. $\mathcal{R}_{\varepsilon}(S_{r}^{1})$ is the Vietoris-Rips complex of a circle of radius *r*, with points $T=S_{r_1}^{1}\times S_{r_2}^{1}\times\cdots\times S_{r_N}^{1}$. The relevant scale parameter, ∝, is built on the function $S_{r_N}^{1}=\{s_{0},s_{1},s_{2},s_{3},s_{4}\}$, *N* = 4.

1. Install the R-TDA package in R and load the TDA library using code.
2. Import the dataset, which was saved from Excel as a .csv file.
3. Fix the maximum filtration value and the maximum dimension for features (dimensions 0 and 1) using code.
4. Apply persistent homology using the Rips complex code.
5. Plot the barcode and the persistence diagram (PD) for each case using code.
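The study carried out these steps in R with the TDA package; that code is not reproduced here. As a language-neutral illustration of what the Rips filtration computes, the following Python sketch derives only the dimension-0 barcode (connected components) with a union-find structure. It assumes Euclidean distances and the convention that an edge enters the filtration when the scale reaches the distance between its endpoints. Bars that survive past `max_scale` are reported with an infinite death value. Dimension-1 features (loops) are not computed, and the function name is hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def rips_h0_barcode(points: Sequence[Sequence[float]],
                    max_scale: float) -> List[Tuple[float, float]]:
    """Dimension-0 persistence bars (birth, death) of a Vietoris-Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every edge appears at a scale equal to the distance between its endpoints.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        if length > max_scale:
            break  # maximum filtration value reached
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            bars.append((0.0, length))  # two components merge, so one bar ends
    # Components still alive at max_scale never die within the filtration.
    bars.extend([(0.0, math.inf)] * (n - len(bars)))
    return bars


if __name__ == "__main__":
    print(rips_h0_barcode([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]], max_scale=10.0))
```

Every point is born at scale 0, so long bars in this barcode correspond to groups of points that stay separated over a wide range of scales.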

… $q = \{q_{0}, q_{1}, \dots, q_{i}\}$, and …, in the dimension *D* (i.e., dimensions 0 and 1). Eqs. (16) to (19) are the expressions of the four topological summaries obtained in PH.

### 3. Results and Discussion

### 3.1. Discussion of K-means Clustering of the Hybrid Method

… *k* = 2 among the clusters (*k* = 2, 3, 4, 5). The Silhouette coefficient (SI) values obtained in this study lie within the range 0.3 to 0.9, as shown in Table 1. At *k* = 2, the results fell between 0.7 and 0.9 (70% to 90%), with the value for Bayelsa topping the list. The efficiency of the result is 0.9117 (approximately 91%); this is efficient and an improvement on the previous study. In their 2020 paper, Riihimäki et al. stated that the more variables or data points are used, the better the result and efficiency [41]. This study used four rainfall parameters (maximum and minimum temperature, precipitation, and wind speed), which is in line with this context. A combined TDA and ML method was applied to a climate event in Arizona [4]. The models in that work separated atmospheric rivers (ARs) consistently quickly and accurately, which shows that integrating TDA and ML produces a highly robust and accurate result in flood modelling. The validity test results presented in Table 1 show a close similarity between the outcomes and the optimum values; see their range values. According to the benchmark, results lie between −1 (poor value) and +1 (best value). The results shown in Table 1 reveal that all the values obtained in the clusters are 0.5 or above (i.e., in the group *k* = 2).

### 3.2. Discussion of Persistent Homology (PH)

### 3.3. General Discussion of Results and Comparison of Tools

### 3.4. Classification of the Hydrological Areas

I Sokoto = Mild

II Kaduna = High

III Yola = Moderate

IV Makurdi = High

V Port Harcourt = Violent

VI Ibadan = Violent

VII Enugu = High

VIII Maiduguri = High.