Multisite algal bloom predictions in a lake using graph attention networks
Abstract
Algal blooms caused by eutrophication are a major concern in water and aquatic resource management, and various attempts have been made to predict them accurately. Algal blooms and the factors influencing them exhibit high spatial variability depending on the characteristics of the water body and water flow. However, traditional machine learning and deep learning methods are limited in their ability to account for the spatial interactions of various influencing factors across multiple monitoring sites. In addition, few attempts have been made to predict multiple sites simultaneously using a single model. In this study, we propose a model that considers spatial interactions and performs multisite predictions based on a graph attention network (GAT). The GAT–DNN, which places a deep neural network (DNN) after the GAT layer, was applied to forecast chlorophyll-a levels at multiple sites. The proposed model accurately captured the high variability and peak chlorophyll-a levels. Moreover, the GAT–DNN consistently outperformed both baseline DNNs. Additionally, we examined the optimal forecast horizon by comparing the performance of the model across various forecast horizons. The proposed approach can therefore be applied to a wide range of prediction models to capture spatial interactions and improve performance at each site.
1. Introduction
Eutrophication caused by rapid urbanization and industrialization has been a long-standing concern in water quality and resource management [1]. The occurrence of algal blooms due to eutrophication threatens the health of aquatic organisms and humans by causing oxygen depletion, odor, and toxins, and it also undermines the recreational value of water bodies [2]. Therefore, eutrophication has been considered a major concern in water and aquatic resource management due to its ecological, economic, and social losses [3]. Chlorophyll-a, which is a representative photosynthetic pigment, has been widely used as an indicator of phytoplankton biomass and trophic state of water bodies. Accurate prediction of chlorophyll-a concentration provides a basis for decision-making on proactive measures to prevent algal blooms [4]. Therefore, a predictive model that can reflect the complex interactions among various influencing factors plays a crucial role in conserving aquatic ecosystems and ensuring sustainable utilization of water resources.
2. Methods
2.1. Study Area and Data Description
2.1.1. Study area
Daecheong Reservoir (36.35°–36.52°N, 127.48°–127.60°E) in South Korea is a multipurpose dam reservoir constructed for water supply and flood control (Fig. 1). The watershed area of Daecheong Reservoir is 72.8 km², the length of the reservoir is 86 km, and its total storage capacity is 1.49 × 10⁹ m³. Due to the inflow of nutrients and long residence time, the reservoir frequently experiences harmful algal blooms caused by eutrophication [42]. Six water quality monitoring sites operate in Daecheong Reservoir (Fig. 1). An algae alert system is in operation at Sites 1, 3, and 5.
2.1.2. Data description
For modeling, monitoring data from January 2016 to August 2022 were obtained. The dependent variable, chlorophyll-a concentration (mg/m³), was collected for each water quality monitoring site. In addition, data from Site 7 and Site 8, which are the inflow sites to Daecheong Reservoir, were utilized to reflect the influence of inflow water quality factors. Note that the data obtained from Site 7 and Site 8 were used only as input data. Among the water quality monitoring sites in the study area, Site 1, Site 3, and Site 5 are monitored weekly as part of the algae alert system, while Site 2, Site 4, Site 6, Site 7, and Site 8 are monitored monthly. The input variables for modeling included environmental, hydrological, and meteorological factors (Table 1). Environmental factors, measured at each monitoring site, included water temperature (Wtemp; °C), dissolved oxygen (DO; mg/L), total organic carbon (TOC; mg/L), total nitrogen (TN; mg/L), total phosphorus (TP; mg/L), and suspended solids (SS; mg/L); they were obtained from the Water Environmental Information System of the National Institute of Environmental Research. The meteorological factor was precipitation (mm), measured at the nearest Automated Surface Observing System station of the Korea Meteorological Administration. Hydrological factors were the total discharge (m³) and water level (EL.m) of Daecheong Dam, both obtained from K-water. Thus, the model's input data followed the weekly monitoring schedule of Site 1, Site 3, and Site 5, and missing or unmeasured values were imputed using Kalman filtering (Table S1). Site 2, Site 4, Site 6, Site 7, and Site 8 were monitored on a monthly basis and their data were only used as input variables.
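To illustrate how Kalman filtering can place monthly measurements on a weekly grid, the following is a minimal sketch rather than the procedure used in the study. It assumes a local-level (random walk plus noise) state-space model with fixed, hand-picked variances; the function name `kalman_impute` and the parameters `process_var` and `obs_var` are hypothetical.

```python
from typing import List, Optional


def kalman_impute(series: List[Optional[float]],
                  process_var: float = 1.0,
                  obs_var: float = 1.0) -> List[float]:
    """Fill None entries with the prediction of a local-level Kalman filter.

    Assumption: the latent level follows a random walk and the noise
    variances are fixed by hand instead of being estimated from data.
    """
    # Initialise the state at the first measured value (assumption).
    level = next(v for v in series if v is not None)
    var = obs_var
    filled = []
    for y in series:
        # Predict: a random-walk level keeps its mean; only its variance grows.
        var += process_var
        if y is None:
            filled.append(level)      # missing week: use the predicted level
            continue
        # Update: blend prediction and measurement through the Kalman gain.
        gain = var / (var + obs_var)
        level += gain * (y - level)
        var *= 1.0 - gain
        filled.append(y)              # measured values are kept unchanged
    return filled


# Hypothetical series with two unmeasured weeks.
weekly_tp = kalman_impute([1.0, None, 3.0, None])
```

A forward filter like this only uses past measurements; a smoother that also uses later measurements would usually give better fills for gaps between two monthly samples.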
2.2. Model Development
In this study, we applied GAT–DNN for forecasting chlorophyll-a concentration at six monitoring sites in Daecheong Reservoir (Fig. 2). GAT–DNN comprises a GAT that learns spatial interactions among monitoring sites and a fully-connected layer that performs multisite prediction with the learned features.
2.2.1. Graph attention network
GAT assumes that the mutual interactions between the nodes that make up the graph are not predetermined by the graph structure and that the magnitude of their interactions differs. Within GAT, the self-attention mechanism learns the degree to which a particular node influences and is influenced by other nodes for the model’s output [43]. The input for the attention layer that makes up GAT includes the feature matrix for each node and an adjacency matrix that represents the connections between nodes. The adjacency matrix is based on graph theory and represents the connectivity between nodes. For graph data with N nodes, the adjacency matrix is an N × N square matrix, where a value of 1 indicates that node i influences node j and a value of 0 indicates that node i does not influence node j.
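As a small illustration of the adjacency matrix described above, the sketch below builds an N × N matrix from a list of directed links. The helper `build_adjacency` and the example links are hypothetical and do not reproduce the actual connectivity of the reservoir; adding self-loops is an assumption commonly made so that each node also attends to itself.

```python
from typing import List, Tuple


def build_adjacency(n_nodes: int,
                    edges: List[Tuple[int, int]],
                    self_loops: bool = True) -> List[List[int]]:
    """Return an N x N matrix with A[i][j] = 1 when node i influences node j."""
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for src, dst in edges:
        A[src][dst] = 1
    if self_loops:
        # Assumption: every node is connected to itself.
        for i in range(n_nodes):
            A[i][i] = 1
    return A


# Hypothetical upstream-to-downstream links between four sites (0-indexed);
# these are NOT the reservoir's actual connections.
example_edges = [(3, 2), (2, 1), (1, 0)]
A = build_adjacency(4, example_edges)
```

Because the links are directed, the matrix is generally not symmetric: here `A[3][2]` is 1 while `A[2][3]` is 0.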
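The feature matrix for N nodes with F features is represented as

$$\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}, \quad \vec{h}_i \in \mathbb{R}^{F}.$$

The attention coefficient between node i and node j is computed as

$$e_{ij} = \mathrm{LeakyReLU}\left(\vec{a}^{\top}\left[\mathbf{W}\vec{h}_i \oplus \mathbf{W}\vec{h}_j\right]\right).$$

Here, $\mathbf{W}$ indicates the weight matrix of the linear transformation, $\vec{a}$ represents the learnable weight vector of the attention mechanism, and $\mathbf{h}$ denotes the feature matrix. $e_{ij}$ indicates the importance of node j’s features to node i, ⊕ represents concatenation of two node representations, and LeakyReLU is a nonlinear activation function. Then, using the Softmax function, we normalize the coefficients over the vertices k connected to node i (its neighborhood $\mathcal{N}_i$) to obtain the attention score ($\alpha_{ij}$) as follows:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}.$$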
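Then, masking is applied to reflect the connection status between nodes based on the adjacency matrix. Furthermore, using the attention score $\alpha_{ij}$ and the parameterized feature matrix of node j ($\mathbf{W}\vec{h}_j$), the feature matrix of node i is updated as follows:

$$\vec{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, \mathbf{W}\vec{h}_j\right).$$

Here, $\sigma$ denotes a nonlinear activation function and $\mathcal{N}_i$ is the set of nodes connected to node i according to the adjacency matrix.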
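GAT uses multihead attention to stabilize the calculation of attention scores and improve performance [44]. Multihead attention repeats the attention calculation for a specified number of heads (K) and aggregates the resulting outputs by averaging or concatenation. The feature matrix of the i-th node obtained through K-head attention (2 heads in this study) can be expressed as follows:

$$\vec{h}_i' = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, \mathbf{W}^{k}\vec{h}_j\right) \ \text{(concatenation)}, \qquad \vec{h}_i' = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, \mathbf{W}^{k}\vec{h}_j\right) \ \text{(averaging)},$$

where $\alpha_{ij}^{k}$ and $\mathbf{W}^{k}$ are the attention score and weight matrix of the k-th head, and $\Vert$ denotes concatenation.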
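In this study, the node-specific feature matrices obtained through the GAT layer are concatenated and fed into a fully-connected layer, which performs multisite prediction as follows:

$$\hat{y} = \mathrm{FC}(h_{\mathrm{concat}}), \qquad h_{\mathrm{concat}} = \vec{h}_1' \oplus \vec{h}_2' \oplus \cdots \oplus \vec{h}_N',$$

where $\hat{y}$ is the vector of forecasted chlorophyll-a concentrations at the monitoring sites, FC denotes the fully-connected layer, and $h_{\mathrm{concat}}$ is the aggregated feature matrix with N × F features. The GAT–DNN was developed and implemented using the PyTorch [46] and scikit-learn [47] libraries in Python 3.8.10 [48].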
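The attention mechanism described in this section can be sketched in a few lines of plain Python. This is an illustrative sketch, not the PyTorch implementation used in the study. It makes several assumptions: the output activation is omitted, masking is applied before the softmax, the adjacency matrix contains self-loops, the heads are aggregated by concatenation, and all weights are random. The names `attention_head` and `gat_concat` are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def attention_head(H: Matrix, A: List[List[int]], W: Matrix,
                   a: Vector) -> Tuple[Matrix, Matrix]:
    """One attention head: returns updated node features and attention scores."""
    Wh = [matvec(W, h) for h in H]          # linear transform of every node
    n = len(H)
    alpha = [[0.0] * n for _ in range(n)]
    H_new = []
    for i in range(n):
        # Masking: node i attends only to nodes j that influence it
        # (A[j][i] == 1). Self-loops are assumed, so this list is never empty.
        nbrs = [j for j in range(n) if A[j][i] == 1]
        e = [leaky_relu(sum(p * q for p, q in zip(a, Wh[i] + Wh[j])))
             for j in nbrs]
        m = max(e)                          # subtract the max for a stable softmax
        w = [math.exp(v - m) for v in e]
        total = sum(w)
        for j, wj in zip(nbrs, w):
            alpha[i][j] = wj / total
        # Weighted sum of neighbor features (output activation omitted).
        H_new.append([sum(alpha[i][j] * Wh[j][f] for j in nbrs)
                      for f in range(len(W))])
    return H_new, alpha


def gat_concat(H: Matrix, A: List[List[int]],
               heads: List[Tuple[Matrix, Vector]]) -> Vector:
    """Concatenate K heads per node, then flatten all nodes into one vector."""
    outputs = [attention_head(H, A, W, a)[0] for W, a in heads]
    per_node = [sum((out[i] for out in outputs), []) for i in range(len(H))]
    return [x for node in per_node for x in node]


random.seed(0)
N, F, F_out, K = 3, 4, 2, 2
H = [[random.random() for _ in range(F)] for _ in range(N)]
# Hypothetical chain 0 -> 1 -> 2 with self-loops; A[i][j] = 1 means i influences j.
A = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
heads = [([[random.gauss(0, 1) for _ in range(F)] for _ in range(F_out)],
          [random.gauss(0, 1) for _ in range(2 * F_out)]) for _ in range(K)]
h_concat = gat_concat(H, A, heads)          # input vector for the FC layers
```

The flattened vector `h_concat` has N × K × F_out entries and would be passed to the fully-connected layers; the per-head `alpha` matrices are the attention scores that can later be inspected to interpret the interactions between sites.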
2.2.2. Deep neural network for comparison
To compare the performance with GAT–DNN, two types of input data were used to develop DNNs. Each case was designed to verify the usefulness of GAT for multisite prediction. In the first case, DNNconcat was developed by using all data from Site 1 to Site 6 as input variables. In the second case, DNNseparate was developed by separately developing DNNs for each monitoring site except for the upstream sites. Each DNN includes the same number of hidden layers (i.e., two hidden layers) as the FC in GAT–DNN. The baseline DNNs were developed and implemented using the PyTorch [46] and scikit-learn [47] libraries in Python 3.8.10 [48].
2.3. Model Implementation
2.3.1. Data preparation and pre-processing
GAT–DNN and baseline DNNs were trained and tested for one-week chlorophyll-a concentration forecasting using the data obtained from six monitoring sites, excluding the inflow sites (Sites 7 and 8). The adjacency matrix for GAT–DNN development was constructed using Site 1 to Site 8 as nodes. The connectivity between monitoring sites (i.e., nodes) was determined based on previous studies [49] and the Korea Reach File of the Water Environment Information System (Fig. S1). The input data for GAT–DNN were processed weekly to match the monitoring frequency of the sites operated under the algae alert system. The baseline DNNs used different input data formats. For DNNseparate, the input data for each monitoring site (from Site 1 to Site 6) were used to forecast chlorophyll-a concentrations at the corresponding site. In contrast, for DNNconcat, the input data for all sites (Site 1–Site 6) were used to forecast chlorophyll-a concentrations at all sites. Missing values in the model input data were imputed using a Kalman filter [50]. The input features for GAT–DNN and the baseline models were the previous time step values of precipitation, total discharge, DO, water level, TOC, water temperature, TN, TP, SS, and chlorophyll-a. The input features, except for water temperature, were log-transformed. Each feature was scaled to a range of 0 to 1 using min–max normalization based on the training data to minimize the influence of scale.
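The log transformation and min–max normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: log(1 + x) is used so that zero values (for example, weeks without precipitation) remain finite, which the article does not specify, and the function names and sample values are hypothetical. The key point is that the minimum and maximum are taken from the training data only and then reused for the test data.

```python
import math
from typing import List, Tuple


def log_transform(values: List[float]) -> List[float]:
    # Assumption: log(1 + x) keeps zero values finite.
    return [math.log1p(v) for v in values]


def fit_min_max(train_values: List[float]) -> Tuple[float, float]:
    """Learn the scaling range from the training data only."""
    return min(train_values), max(train_values)


def apply_min_max(values: List[float], lo: float, hi: float) -> List[float]:
    span = hi - lo
    # A constant feature carries no information; map it to 0.
    return [(v - lo) / span if span > 0 else 0.0 for v in values]


# Hypothetical total phosphorus values (mg/L).
train_tp = [0.01, 0.03, 0.05]
test_tp = [0.02, 0.08]

lo, hi = fit_min_max(log_transform(train_tp))
scaled_train = apply_min_max(log_transform(train_tp), lo, hi)
scaled_test = apply_min_max(log_transform(test_tp), lo, hi)
```

Because the range comes from the training data, test values outside that range fall outside [0, 1], as the second test value does here; this is expected and avoids leaking test information into the scaler.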
2.3.2. Model training, validation, and test
The input data for each model was randomly split into training (70%) and test (30%) sets. The root mean squared error (RMSE) was used as the loss function during the model training process, and the number of epochs was set to 500. Imputed chlorophyll-a concentrations during the training process were excluded for GAT–DNN, and only measured values were used.
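One way to exclude imputed chlorophyll-a values from the loss is to carry a boolean mask alongside the targets. The sketch below is an assumption about how such a masked RMSE could be written, not the study's code; the name `masked_rmse` and the `measured` flag list are hypothetical.

```python
import math
from typing import List


def masked_rmse(y_true: List[float], y_pred: List[float],
                measured: List[bool]) -> float:
    """RMSE computed only over targets flagged as actually measured."""
    sq_err = [(t - p) ** 2
              for t, p, m in zip(y_true, y_pred, measured) if m]
    if not sq_err:
        raise ValueError("no measured targets in this batch")
    return math.sqrt(sum(sq_err) / len(sq_err))


# The third target is imputed, so its large error does not affect the loss.
loss = masked_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 10.0], [True, True, False])
```

In a PyTorch training loop the same idea is usually expressed by indexing the prediction and target tensors with a boolean mask before computing the loss.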
In this study, we adopted a tree-structured Parzen estimator (TPE), a representative Bayesian optimization method, for hyperparameter optimization. TPE sequentially searches for the hyperparameter set with the largest expected improvement (EI) based on previous results:

$$\mathrm{EI}_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy,$$

where x is the selected set of hyperparameters and y is the value of the loss function. Here, y* is the threshold corresponding to a preset quantile γ of the observed loss values (0.15 in this study); that is, p(y < y*) = γ. The TPE expresses the surrogate model p(x | y) through two density functions estimated from the hyperparameter sets whose loss is smaller or greater than the threshold:

$$p(x \mid y) = \begin{cases} l(x) & \text{if } y < y^*, \\ g(x) & \text{if } y \ge y^*. \end{cases}$$

The EI is then re-expressed as follows:

$$\mathrm{EI}_{y^*}(x) \propto \left(\gamma + \frac{g(x)}{l(x)}\,(1 - \gamma)\right)^{-1}.$$

Therefore, TPE focuses on the regions of the search space that yielded the greatest improvement in past trials and converges toward the optimal set of hyperparameters.
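Since the EI grows as the ratio l(x)/g(x) grows, one TPE step amounts to splitting past trials at the γ quantile, estimating the two densities, and choosing the candidate with the largest ratio. The sketch below illustrates this for a single continuous hyperparameter. It is a simplification: fixed-bandwidth Gaussian kernels and a fixed candidate list are assumptions, whereas Hyperopt uses a more elaborate adaptive estimator and samples candidates from l(x). The names `gaussian_kde` and `tpe_suggest` are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def gaussian_kde(points: List[float],
                 bandwidth: float) -> Callable[[float], float]:
    """Fixed-bandwidth Gaussian kernel density estimate (assumption)."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x: float) -> float:
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                   for p in points) / norm
    return density


def tpe_suggest(trials: List[Tuple[float, float]], candidates: List[float],
                gamma: float = 0.15, bandwidth: float = 0.1) -> float:
    """Return the candidate with the largest l(x) / g(x), i.e. the largest EI.

    trials holds (hyperparameter, loss) pairs; at least two are required so
    that both the good and the bad group are non-empty.
    """
    ranked = sorted(trials, key=lambda t: t[1])       # lowest loss first
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]            # defines l(x)
    bad = [x for x, _ in ranked[n_good:]]             # defines g(x)
    l, g = gaussian_kde(good, bandwidth), gaussian_kde(bad, bandwidth)
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))


# Hypothetical history: the loss is smallest near x = 0.3.
trials = [(i / 10, (i / 10 - 0.3) ** 2) for i in range(11)]
next_x = tpe_suggest(trials, candidates=[0.05, 0.25, 0.5, 0.9])
```

With this history, the two best trials lie at 0.3 and 0.2, so the candidate at 0.25 is suggested; the search concentrates where past losses were low while the g(x) term discourages revisiting poor regions.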
Eight hyperparameters of GAT–DNN were optimized (learning rate, weight decay, batch size, epsilon, GAT dropout rate, dropout rate for FC, hidden dimension for the 1st FC layer, and hidden dimension for the 2nd FC layer) (Tables S2 and S3). The objective function for hyperparameter optimization was the average RMSE obtained through 5-fold cross-validation with the training data. The number of hyperparameter search iterations using TPE was set to 50, and the Hyperopt [51] library of Python 3.8.10 [48] was used for implementation.
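The cross-validated objective can be sketched independently of any optimization library. The following is an illustrative outline in which `fit_predict` is a hypothetical callable that trains a model on the training folds and returns predictions for the validation fold; the fold assignment by shuffled striding is likewise an assumption.

```python
import math
import random
from typing import Callable, List, Sequence


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_objective(X: Sequence, y: Sequence[float],
                 fit_predict: Callable, k: int = 5) -> float:
    """Average validation RMSE over k folds; the value a tuner would minimise."""
    scores = []
    for val_idx in k_fold_indices(len(X), k):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in val_idx])
        sq_err = [(y[i] - p) ** 2 for i, p in zip(val_idx, preds)]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(scores) / k


def mean_model(X_train, y_train, X_val):
    """Placeholder model (hypothetical): always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return [mean] * len(X_val)


X = [[float(i)] for i in range(10)]
y = [float(i % 3) for i in range(10)]
score = cv_objective(X, y, mean_model)
```

In a hyperparameter search, `fit_predict` would be rebuilt for each candidate hyperparameter set, and the returned average RMSE would be reported to the tuner as the loss for that set.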
2.4. Performance Metrics
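The performance of GAT–DNN and the baseline models was evaluated using only the measured chlorophyll-a concentrations in the test data. The evaluation metrics used were RMSE and R²:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{\mathrm{obs},i} - y_{\mathrm{pred},i}\right)^{2}},$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{\mathrm{obs},i} - y_{\mathrm{pred},i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{\mathrm{obs},i} - \bar{y}_{\mathrm{obs}}\right)^{2}},$$

where n indicates the number of measured data in the test set, and $y_{\mathrm{obs}}$ and $y_{\mathrm{pred}}$ denote the measured and forecasted output values corresponding to Site 1 to Site 6, respectively. Also, $\bar{y}_{\mathrm{obs}}$ indicates the mean of the measured output values.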
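For concreteness, the two metrics can be computed as in the short sketch below, which assumes that R² is the coefficient of determination (one minus the ratio of residual to total sum of squares). The sample values are hypothetical.

```python
import math
from typing import List


def rmse(y_obs: List[float], y_pred: List[float]) -> float:
    """Root mean squared error between measured and forecasted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
                     / len(y_obs))


def r2_score(y_obs: List[float], y_pred: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and forecasted values.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(round(rmse(y_obs, y_pred), 3), round(r2_score(y_obs, y_pred), 3))
```

Because RMSE is reported here on normalized values, it is only comparable between models that share the same scaling, whereas R² is scale-free.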
3. Results and Discussion
3.1. Spatial and Temporal Distribution Characteristics of Water Quality at the Monitoring Sites
All monitoring sites of the Daecheong Reservoir were analyzed for their chlorophyll-a concentrations during the summer months (June, July, and August) and during other periods. The chlorophyll-a concentration was higher during the summer months than during other periods (Fig. 4). More specifically, the average chlorophyll-a concentration was 1.55 mg/m³ during the summer months (Fig. 4a) and 1.46 mg/m³ during other periods (Fig. 4b). However, outliers were observed across the monitoring sites during both the summer months and other periods (Fig. 4b). Additionally, during both periods, the chlorophyll-a concentrations were higher at the algae alert sites (Sites 1, 3, and 5) than at the other sites (Sites 2, 4, 6, 7, and 8) (Fig. 4). Due to its long residence time (mean 145 days) and high nutrient input, frequent algal blooms occur in the Daecheong Reservoir [42]. Site 1 and Site 3, which are monitored under the algae alert system, have longer residence times and shallower depths than the other monitoring sites [52]. Also, Site 5, which is located in the upper part of the Daecheong Reservoir, is known to be relatively vulnerable to harmful algal blooms because stormwater runoff increases nonpoint source nutrient loads [52].
3.2. Performance Evaluation
The combined GAT–DNN model predicted the chlorophyll-a values at six monitoring sites. In general, the one-week forecasts were accurate at all sites (Fig. 5 and Fig. 6). The GAT–DNN fitted the training data well at all sites (R² = 0.60–0.79, RMSE = 0.07–0.09) but exhibited a slight tendency toward underestimation (Fig. 5 and Fig. 6a–d). Furthermore, the test results generally matched the measured values with no signs of overfitting or underfitting (Fig. 5 and Fig. 6). Specifically, GAT–DNN showed strong test performance at most sites (R² = 0.67–0.77, RMSE = 0.06–0.08), with two exceptions: the forecasts deviated noticeably from the measured chlorophyll-a concentrations at Site 2 (R² = 0.63, RMSE = 0.06) and Site 5 (R² = 0.63, RMSE = 0.08) (see Fig. 5e and Fig. 6e). The one-week forecasts from GAT–DNN captured the significant temporal changes not only in the training sample but also in the testing sample (Fig. 5b, d, f and Fig. 6b, d, f). Although the timings and magnitudes of the maximum and minimum forecasts occasionally deviated from the measured values, the deviations decreased over time, and the GAT–DNN predictions generally captured the rapid, short-lived decreases in the measured data (Fig. 5b, d, f and Fig. 6b, d, f).
3.3. Temporal Performance Changes in GAT–DNN
Weekly forecasting of chlorophyll-a concentration using GAT–DNN was conducted over different forecast horizons (1 week to 6 weeks). Among these horizons, the one-week-ahead forecast delivered the highest performance at all sites (average R² = 0.69, average RMSE = 0.07) (Fig. 7, Table S5). The performance dropped abruptly after the one-week forecast but remained consistent from the three-week forecast onward, apart from an abrupt drop in the five-week forecast (Fig. 7). After excluding the one-week forecasts at Sites 2 and 4, the one-week forecast from the remaining sites again showed the strongest test performance (Fig. 7, Table S5). The 1–6 week forecasts from GAT–DNN also captured the significant temporal changes. The test performances at Sites 1, 3, and 5 showed a decreasing trend after the one-week forecast, whereas those at Sites 2, 4, and 6 slightly decreased and then abruptly increased after the one-week forecast. Additionally, the test performance at Site 2 sharply decreased in the five-week forecast. Forecasting aims to understand the past and predict the future based on historical measurement data, thus supporting proactive responses to future events [53]. However, the most rational forecasting period must balance the trade-off between the length of the forecasting period and model performance [54]. To this end, we aimed to optimize the forecasting period at the monitoring sites within the Daecheong Reservoir.
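Changing the forecast horizon only changes how inputs and targets are paired in time. The sketch below illustrates this pairing for a single weekly series; the function name `make_horizon_pairs` and the sample values are hypothetical, and the actual models use several input features per site rather than one value.

```python
from typing import List, Tuple


def make_horizon_pairs(series: List[float],
                       horizon: int) -> List[Tuple[float, float]]:
    """Pair the value at week t (input) with the value at week t + horizon (target)."""
    if horizon < 1:
        raise ValueError("horizon must be at least one week")
    return [(series[t], series[t + horizon])
            for t in range(len(series) - horizon)]


# Hypothetical weekly chlorophyll-a values.
weekly_chla = [1.0, 2.0, 3.0, 4.0, 5.0]
one_week = make_horizon_pairs(weekly_chla, horizon=1)
two_week = make_horizon_pairs(weekly_chla, horizon=2)
```

A longer horizon leaves fewer training pairs and a weaker link between input and target, which is one reason performance tends to fall as the horizon increases.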
4. Conclusions
This study introduces GAT–DNN for predicting weekly algal blooms at each monitoring site within the Daecheong Reservoir. Acting as a single model, GAT–DNN captures the spatial interactions among the monitoring sites and predicts the chlorophyll-a concentrations at each of the six sites one week ahead. To improve the prediction performance, we exploited the synergistic effect of GAT, which captures the spatial interactions, and DNN, which performs additional training based on the common information of each site. GAT–DNN achieved a high prediction accuracy at all sites and significantly outperformed the baseline DNN models. Moreover, by accurately predicting the temporal variations in chlorophyll-a concentration, GAT–DNN represented the trend of algal blooms well. Therefore, the prediction results of GAT–DNN provide useful quantitative evidence for decision-making in algal management. A comparison of GAT–DNN models with different forecast horizons showed that the one-week horizon generally yielded the highest accuracy at most sites, as it best balanced the trade-off between the forecast horizon and model performance. In future studies, we plan to develop GAT–LSTM, which will replace DNN with LSTM to capture not only the spatial but also the temporal correlations and hence improve the prediction performance. We also expect that the attention scores derived from GAT, which represent the degree of spatial interaction among the sites, can clarify the spatial interactions and distribution characteristics of algal blooms and thereby support decision-making for proactive management of algal blooms. Note that missing values of the input variables were imputed using Kalman filtering in this study. The missing values comprised a substantial proportion (40%) of the water quality variables, which are major input variables in predicting chlorophyll-a concentrations. Therefore, more advanced data imputation methods than the Kalman filter, such as a decay mechanism [57] that can account for inter-variable correlations and temporal patterns of missingness, need to be adopted in future studies.
Supplementary Information
Acknowledgment
This work was supported by the 2021 sabbatical year research grant of the University of Seoul, and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1009961).
Notes
Conflict-of-Interest
The authors declare that they have no conflict of interest.
Author Contributions
N.K. (Master student) conducted data collection and modeling and wrote the manuscript. J.S. (Ph.D. candidate) assisted with manuscript writing. Y.K.C. (Professor) revised the manuscript.