### 1. Introduction

_{3}, bromoform, and dichloroacetic acid, as Group B2, categorizing them as potentially carcinogenic substances [16]. Therefore, it is crucial to predict and suppress the formation of disinfection byproducts in the water treatment process. To this end, the US EPA proposed disinfection byproduct formation models based on chlorine injection points in the water treatment process. However, those models reflect the operational characteristics of water purification plants in the US, which differ from conditions in Korea, and their application scopes do not match the operating features of Korean plants; hence, they cannot be applied directly in Korea.

### 2. Research Methods

### 2.1. Raw Data & Preprocessing

^{3}/d of purified water. This plant uses the standard treatment process of mixing, coagulation, sedimentation, and filtration, linked with the advanced treatment process of post-ozonation and activated carbon filtration; purified water is finally produced through chlorine disinfection.

### 2.2. AI Algorithm Model

#### 2.2.1. AI algorithms

##### 2.2.1.1. DT

##### 2.2.1.2. RF

##### 2.2.1.3. SVM

##### 2.2.1.4. KNN

##### 2.2.1.5. GBR

##### 2.2.1.6. MLP

#### 2.2.2. Data separation

#### 2.2.3. Machine learning and testing

### 2.3. Empirical Model

#### 2.3.1. Empirical DOC removal model
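For orientation, Eqs. (1) to (4) of this subsection can be sketched as follows. This assumes the standard Edwards formulation, which matches the symbol definitions and units used here; it is a reconstruction under that assumption rather than a verbatim copy of the original equations, and the name *DOC*_{remaining} is introduced only for illustration.

$$f_{NDOC}=K_{1}\cdot SUVA_{raw}+K_{2}$$

$$DOC_{sorb}=(1-f_{NDOC})\cdot DOC_{raw}$$

$$\frac{DOC_{sorb}-DOC_{eq}}{C_{d}}=\frac{a\cdot b\cdot DOC_{eq}}{1+b\cdot DOC_{eq}}$$

$$DOC_{remaining}=DOC_{eq}+f_{NDOC}\cdot DOC_{raw}$$

Under the same assumption, the Langmuir balance is a quadratic in *DOC*_{eq}, so the remaining DOC can be computed in closed form. The function names and all numeric inputs below are hypothetical.

```python
import math


def nonsorbable_fraction(suva_raw, k1, k2):
    """Assumed Eq. (1): f_NDOC = K1 * SUVA_raw + K2 (dimensionless)."""
    return k1 * suva_raw + k2


def doc_after_coagulation(doc_raw, suva_raw, dose_mm, a, b, k1, k2):
    """Return (DOC_eq, DOC remaining) in mg/L under the assumed Eqs. (1)-(4)."""
    f_ndoc = nonsorbable_fraction(suva_raw, k1, k2)
    doc_sorb = (1.0 - f_ndoc) * doc_raw  # assumed Eq. (2)
    # Assumed Eq. (3): (DOC_sorb - x) / C_d = a*b*x / (1 + b*x) rearranges to
    # b*x^2 + (1 - b*DOC_sorb + a*b*C_d)*x - DOC_sorb = 0; keep the positive root.
    lin = 1.0 - b * doc_sorb + a * b * dose_mm
    doc_eq = (-lin + math.sqrt(lin * lin + 4.0 * b * doc_sorb)) / (2.0 * b)
    return doc_eq, doc_eq + f_ndoc * doc_raw  # assumed Eq. (4)


# Hypothetical inputs chosen only for illustration (not values from the study).
doc_eq, doc_left = doc_after_coagulation(
    doc_raw=3.0, suva_raw=2.0, dose_mm=0.1, a=50.0, b=0.5, k1=-0.05, k2=0.5)
print(f"DOC_eq = {doc_eq:.3f} mg/L, DOC remaining = {doc_left:.3f} mg/L")
```

With a coagulant dose of zero the positive root reduces to *DOC*_{sorb}, so the function returns the raw water DOC unchanged, which is a quick sanity check on the algebra.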

*f*_{NDOC} is the fraction of nonsorbable DOC, *K*_{1} is the empirical fitting constant 1 [mg·cm/L], *K*_{2} is the empirical fitting constant 2, and *SUVA*_{raw} is the specific UV absorbance of raw water [L/mg·cm]. Following determination of *K*_{1} and *K*_{2}, sorbable DOC can be calculated using Eq. (2).

*DOC*_{sorb} is the sorbable DOC concentration [mg/L], and *DOC*_{raw} is the raw water DOC concentration. When coagulants come in contact with DOC, the ultimate DOC distribution between adsorbents and solutes occurs when the system is in equilibrium. The DOC removed from water is the value after deducting the remaining sorbable DOC in equilibrium from the sorbable DOC. Based on the Langmuir model, the adsorption equilibrium between coagulant dosage and DOC can be expressed as Eq. (3).

*DOC*_{eq} is the sorbable DOC in solution at equilibrium [mg/L], *C*_{d} is the coagulant dosage [mM Al^{3+} or Fe^{3+}], *a* is the maximum DOC sorption per mM coagulant [mg/mM Al^{3+} or Fe^{3+}], and *b* is the sorption constant for sorbable DOC [L/mg]. The amount of DOC remaining after coagulation is the sum of the remaining *DOC*_{eq} in equilibrium and *DOC*_{raw} in raw water multiplied by *f*_{NDOC} (i.e., the ratio of non-sorbable DOC), which can be determined by Eq. (4).

*a*, as a function of pH and defined the empirical relationship between them as in Eq. (5).

*x*_{n} represents the fitting constants, and *pH* represents the final process pH value. However, DOC concentrations in water are influenced by diverse operational and environmental conditions, such as raw water features, seasonal influences, and operational characteristics of water purification plants. An accurate prediction of DOC removal should consider diverse empirical coefficients that can affect DOC variations [13].

*a*_{proposed}) as in Eq. (6).

#### 2.3.2. Empirical disinfection byproduct formation model

##### (7)

$$THM=A\cdot{(DOC\cdot UVA)}^{a}\cdot{({Cl}_{2})}^{b}\cdot{(B{r}^{-})}^{c}\cdot{(d)}^{(pH-7.5)}\cdot{(e)}^{(Temp-20)}\cdot{(t)}^{f}$$

*THM* is the trihalomethane concentration [μg/L], *DOC* is the dissolved organic carbon concentration, *UVA* is the ultraviolet absorbance at 254 nm [1/cm], *Cl*_{2} is the chlorine dose [mg/L], *Br*^{−} is the bromide concentration [μg/L], *t* is the reaction time [h], and (*A*, *a*, *b*, *c*, *d*, *e*, *f*) are the fitting constants.
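As a minimal sketch of how Eq. (7) is evaluated, the function below multiplies the power-law terms directly. The function name, argument names, and every numeric value are hypothetical and chosen only for illustration; they are not the fitted constants of this study.

```python
def thm_formation(doc, uva, cl2, br, ph, temp, t, params):
    """Evaluate the power-function model of Eq. (7).

    params is the tuple of fitting constants (A, a, b, c, d, e, f).
    """
    A, a, b, c, d, e, f = params
    return (A * (doc * uva) ** a * cl2 ** b * br ** c
            * d ** (ph - 7.5) * e ** (temp - 20.0) * t ** f)


# Hypothetical fitting constants and water quality values (illustration only).
params = (10.0, 0.4, 0.3, 0.05, 1.2, 1.02, 0.25)
thm_2h = thm_formation(doc=2.0, uva=0.03, cl2=1.5, br=30.0,
                       ph=7.2, temp=18.0, t=2.0, params=params)
thm_8h = thm_formation(doc=2.0, uva=0.03, cl2=1.5, br=30.0,
                       ph=7.2, temp=18.0, t=8.0, params=params)
print(thm_2h, thm_8h)
```

Because time enters only through the factor *t* raised to the power *f*, quadrupling the reaction time scales the prediction by 4 to the power *f*, independent of the other inputs. At pH 7.5 and 20 °C the pH and temperature factors both equal 1.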

#### 2.3.3. Model estimation
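The estimation in this subsection is a least-squares fit: the parameters are chosen so that the sum of squared differences between observed and predicted values is minimized, and the resulting stationarity condition is solved by Newton-Raphson iteration. The sketch below illustrates the idea under strong simplifying assumptions: a hypothetical one-parameter model *Ŷ* = *ξ*^{χ} and synthetic data. The study's models have several parameters, for which the scalar derivatives become a gradient vector and a Hessian matrix.

```python
import math


def fit_exponent(xi, y, chi0=0.5, tol=1e-12, max_iter=50):
    """Newton-Raphson on dA/dchi = 0 for the toy model Y_hat = xi**chi,
    where A(chi) = sum_i (Y_i - xi_i**chi)**2 is the sum of squared errors."""
    chi = chi0
    for _ in range(max_iter):
        grad = 0.0  # first derivative dA/dchi
        hess = 0.0  # second derivative d2A/dchi2
        for x, obs in zip(xi, y):
            pred = x ** chi
            dpred = pred * math.log(x)    # d(pred)/dchi
            d2pred = dpred * math.log(x)  # d2(pred)/dchi2
            resid = obs - pred
            grad += -2.0 * resid * dpred
            hess += 2.0 * (dpred * dpred - resid * d2pred)
        step = grad / hess
        chi -= step  # Newton-Raphson update
        if abs(step) < tol:
            break
    return chi


# Synthetic, noise-free data generated with a known exponent of 0.3.
xi = [2.0, 4.0, 8.0, 24.0]
y = [x ** 0.3 for x in xi]
chi_hat = fit_exponent(xi, y)
print(chi_hat)
```

Newton-Raphson converges quickly near the solution but depends on the starting value; for real plant data a damped step or several starting points may be needed.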

*Y*_{i} represents the i-th observed value measured in the actual water treatment process, and *Ŷ*_{i}(*ξ*_{i}: *χ*) represents the i-th predicted value calculated by the model. *χ* = (*χ*^{0}, *χ*^{1}, …, *χ*^{K}) is the vector of unknown parameters that we want to estimate, and *ξ*_{i} = (*ξ*_{i}^{1}, *ξ*_{i}^{2}, …, *ξ*_{i}^{K}) represents the explanatory variable vector representing the operational characteristics that affect the variability of the target water quality. Here, the parameter *χ* minimizing the sum of the error squares is calculated according to the following Eq. (9).

##### (9)

$$\frac{\partial A(\overline{\chi})}{\partial {\chi}^{k}}=0,\quad (k=1,\cdots ,K)$$

*χ̄* that satisfies the above two equations at the same time can be called (*χ̄*^{0}, *χ̄*^{1}, …, *χ̄*^{K}), and in this study, it was derived using the Newton-Raphson method.

### 2.4. Model Performance Evaluation

^{2}), Mean Absolute Percentage Error (MAPE), and Root Mean Square Error (RMSE) for the linear regression of the measured values and predicted values. In Eqs. (10)–(12), *Y*_{i} and *Ŷ*_{i} represent the actual values and predicted values of the data set, respectively, *Ȳ* represents the mean of the actual values, and *n* represents the number of data points. R^{2} indicates the goodness of fit of the prediction, with values closer to +1 indicating better performance. MAPE and RMSE indicate better prediction performance as they approach 0, and larger positive values indicate lower prediction performance.
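The three metrics can be sketched as below, assuming the standard definitions: R^{2} as one minus the ratio of the residual sum of squares to the total sum of squares, MAPE as the mean absolute relative error in percent, and RMSE as the root of the mean squared error. The function names and sample values are hypothetical. Under this definition R^{2} is not bounded below by zero, which is consistent with the negative values reported in Section 3.

```python
import math


def r2_score(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error [%]; actual values must be non-zero."""
    return 100.0 / len(actual) * sum(
        abs((y - p) / y) for y, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean square error, in the units of the measured quantity."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


# Hypothetical observations and a model that overestimates every value.
observed = [1.0, 2.0, 3.0]
overestimated = [3.0, 3.0, 3.0]
print(r2_score(observed, overestimated),
      mape(observed, overestimated),
      rmse(observed, overestimated))
```

A model that always predicts the mean of the observations scores R^{2} = 0, so any systematic over- or underestimation larger than that pushes R^{2} below zero.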

### 3. Results and Discussion

### 3.1. DOC Removal

#### 3.1.1. Empirical model

^{2} was −0.270. As shown in Fig. 2, the predicted values generally overestimated the observed values. Therefore, we used the pH and temperature of raw water, together with the DOC, UV254, and coagulant dosage measured at the receiving well and settling basin of the M water purification plant, to calibrate the Edwards model for this plant (the Edwards–Calib model). Furthermore, we suggested a proposed model for more precise DOC removal prediction in actual water plants by considering, in addition to pH as suggested by Edwards [11], empirical coefficients such as temperature and coagulant dosage when quantifying the maximum sorption of flocs. As the M water purification plant employs the aluminum-based coagulants PSO-M, PAC, and Al_{2}(SO_{4})_{3}, we developed the model for aluminum only. Table 1 shows the details of parameter values and error estimates derived for each model.

#### 3.1.2. AI algorithm model

^{2} > 0.94, RMSE < 0.066 mg/L, and MAPE < 2.49%). In particular, RF showed the best prediction performance among the models (R^{2} = 0.9972, RMSE = 0.0142 mg/L, and MAPE = 0.572%). When the trained DOC removal models were tested on data that were not used for training, all AI algorithm techniques again performed well (R^{2} > 0.901, RMSE < 0.0783 mg/L, and MAPE < 3.36%). Although RF performed best in training, MLP showed the best prediction performance in testing (R^{2} = 0.9795, RMSE = 0.0365 mg/L, and MAPE = 1.513%). Comparing the results of predicting DOC removal using the AI algorithm techniques to those of the empirical models (Fig. 2), the AI algorithm techniques generally outperformed the empirical models.

### 3.2. Disinfection Byproduct Formation Model

#### 3.2.1. Empirical model

_{3}, CHCl_{2}Br, CHBr_{2}Cl, C_{2}H_{3}Cl_{3}O_{2}, dibromoacetonitrile (C_{2}HBr_{2}N), dichloroacetonitrile (C_{2}HCl_{2}N), and HAA. Among them, we estimated formation models for THM, CHCl_{3}, and CHCl_{2}Br, for which sufficient water quality data for analysis had accumulated over the period from January 5, 2015 to December 30, 2019.

_{3}, and CHCl_{2}Br resulted in R^{2} values of −109.5, −108.7, and −7.506, respectively. These negative values, which mean that the models fit worse than simply predicting the mean of the observations, reflect a tendency to overestimate the observed values. Furthermore, the MAPE values were 311.24%, 247.62%, and 88.26% for the respective disinfection byproduct models, indicating very low prediction accuracy. In other words, the parameters suggested in the literature cannot be readily applied to the M water purification plant.

_{3}, and CHCl_{2}Br resulted in R^{2} values of 0.692, 0.628, and 0.596, respectively. Considering that water quality measurements from the actual water purification plant were used rather than experimental data, the prediction reliability can be regarded as high. Regarding the signs of the parameters in each model, the value of *f* (i.e., the exponent of the contact time between treated water and chlorine) was positive, which explains why disinfection byproduct formation increased with longer contact time and is consistent with the experimental results of the US EPA [21]. Similarly, the models explained the tendency for disinfection byproduct formation to increase when the DOC concentration (i.e., the precursor of disinfection byproducts in water), residual chlorine concentration, temperature, and pH were higher.
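The link between parameter values and these tendencies can be made explicit by taking the logarithm of Eq. (7). This is a standard manipulation shown here as a worked step, not an equation from the original text:

$$\ln THM=\ln A+a\ln (DOC\cdot UVA)+b\ln {Cl}_{2}+c\ln B{r}^{-}+(pH-7.5)\ln d+(Temp-20)\ln e+f\ln t$$

$$\frac{\partial \ln THM}{\partial \ln t}=f,\quad \frac{\partial \ln THM}{\partial pH}=\ln d,\quad \frac{\partial \ln THM}{\partial Temp}=\ln e$$

Predicted formation therefore increases with contact time when *f* > 0, with the DOC and chlorine terms when *a* > 0 and *b* > 0, and with pH and temperature when *d* > 1 and *e* > 1, respectively.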

#### 3.2.2. AI algorithm model

^{2} > 0.88, RMSE < 1.9043 μg/L, and MAPE < 22.91%. In particular, GBR showed the best prediction performance among the models (R^{2} = 0.9999, RMSE = 0.0407 μg/L, and MAPE = 0.244%). When the trained disinfection byproduct formation models were tested on data that were not used for training, all techniques except DT performed well (R^{2} > 0.92, RMSE < 1.5707 μg/L, and MAPE < 9.50%). Although GBR performed best in training, MLP showed the best prediction performance in testing (R^{2} = 0.9781, RMSE = 0.8486 μg/L, and MAPE = 5.05%). Comparing the results of predicting disinfection byproduct formation using the AI algorithm techniques to those of the empirical models (Fig. 5(d, e, f)), the AI algorithm techniques generally outperformed the empirical models.

^{2} > 0.938); the error between predicted and measured values fluctuated (between −1.4% and 5.1%) as the number of data samples increased. Even with a dataset of 90 samples, it was possible to create a model with performance similar to that of the empirical models, and when datasets with > 270 samples were used, the model improved on the empirical models by a maximum of 41.3%, a minimum of 28.6%, and an average of 35.2% in terms of R^{2}. Regarding the AI algorithm model for DOC prediction, as the number of data points increased, more information about the non-linear relationships between variables became available, so each model could identify the underlying patterns in the data, leading to more accurate predictions. In conclusion, if at least 9 months of water quality data can be secured, it is possible to create a model with performance similar to that of the empirical models, and a disinfection byproduct formation model with higher performance can be developed if data covering more than 9 months are secured.

### 4. Conclusion

_{3}, and CHCl_{2}Br based on water quality data from the clear well of the M water purification plant. All models showed high prediction reliability considering that they predict water quality in an actual water purification plant. They also explained the tendency for disinfection byproduct formation to increase when DOC concentrations, residual chlorine concentrations, temperature, and pH were higher and the reaction time was longer. On the other hand, when disinfection byproduct formation was predicted using AI algorithm techniques, GBR presented the best training performance and MLP showed the best test result. All AI algorithm techniques were confirmed to have superior prediction performance to the empirical models.