### 1. Introduction

Because real-world data sequences are discontinuous in the time dimension, missing data frequently occur in practice. The proposed model uses cubic spline interpolation, which makes the fitted curve transition smoothly at the junction of each sub-interval and yields an ideal data set.

To improve prediction accuracy, the proposed model adopts an encoder-decoder network structure and uses a gated recurrent unit (GRU) network, which is well suited to processing long time series, to encode and decode features. A temporal attention mechanism is integrated into the model to improve its generalization ability.

### 2. Related Work

### 3. Data Analysis and Preprocessing

### 3.1. Data Analysis

### 3.2. Data Preprocessing

##### (1)

$$\begin{array}{c}S\left(d\right)={a}_{i}\frac{{\left({d}_{i+1}-d\right)}^{3}}{6{h}_{i}}+{a}_{i+1}\frac{{\left(d-{d}_{i}\right)}^{3}}{6{h}_{i}}+\\ \left(\frac{{y}_{i}}{{h}_{i}}-\frac{{a}_{i}}{6}{h}_{i}\right)\left({d}_{i+1}-d\right)+\left(\frac{{y}_{i+1}}{{h}_{i}}-\frac{{a}_{i+1}}{6}{h}_{i}\right)\left(d-{d}_{i}\right)\\ {y}_{i}=S\left({d}_{i}\right),{h}_{i}={d}_{i+1}-{d}_{i}\\ d\in \left[{d}_{i},{d}_{i+1}\right],i=0,1,\cdots ,N-1\end{array}$$

where *d* represents the point to be interpolated and *N* + 1 represents the number of critical points. The second derivative ${S}^{\prime\prime}\left(d\right)={a}_{i}\frac{{d}_{i+1}-d}{{h}_{i}}+{a}_{i+1}\frac{d-{d}_{i}}{{h}_{i}}$ is a piecewise linear function. Using the boundary conditions at the *N* + 1 critical points, all values of the undetermined coefficients *a*_{i} can be found, which realizes the interpolation.

*S*(*d*) needs to have a continuous second derivative over all segment intervals and at their connections; that is, the first derivative of *S*(*d*) is itself differentiable. Therefore, the fitting curve of the cubic spline function transitions smoothly at the junction of each sub-interval, which simulates the gradual evolution of the physical phenomenon to the greatest extent and yields an ideal interpolation effect. For the missing-data problem in the multi-site, multi-modal air quality data, a software function based on cubic spline interpolation is therefore used to fill the missing values.
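To make Eq. (1) concrete, the filling step can be sketched in Python. The helper below computes a natural cubic spline (the boundary condition *a*_{0} = *a*_{N} = 0 is an assumption, since the text does not say which boundary conditions the software function uses) and evaluates Eq. (1) directly:

```python
def natural_cubic_spline(d, y):
    """Natural cubic spline through the points (d[i], y[i]), evaluated
    via Eq. (1).  The second derivatives a[i] = S''(d[i]) come from the
    standard tridiagonal system with natural boundaries a[0] = a[N] = 0."""
    n = len(d) - 1                       # number of sub-intervals
    h = [d[i + 1] - d[i] for i in range(n)]
    a = [0.0] * (n + 1)                  # a[0] = a[n] = 0 (natural spline)
    c = [0.0] * n                        # Thomas-algorithm scratch space
    r = [0.0] * n
    for i in range(1, n):                # forward elimination
        rhs = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
        diag = 2.0 * (h[i - 1] + h[i]) - h[i - 1] * c[i - 1]
        c[i] = h[i] / diag
        r[i] = (rhs - h[i - 1] * r[i - 1]) / diag
    for i in range(n - 1, 0, -1):        # back substitution
        a[i] = r[i] - c[i] * a[i + 1]

    def S(x):
        i = 0                            # locate sub-interval [d[i], d[i+1]]
        while i < n - 1 and x > d[i + 1]:
            i += 1
        return (a[i] * (d[i + 1] - x) ** 3 / (6 * h[i])
                + a[i + 1] * (x - d[i]) ** 3 / (6 * h[i])
                + (y[i] / h[i] - a[i] * h[i] / 6) * (d[i + 1] - x)
                + (y[i + 1] / h[i] - a[i + 1] * h[i] / 6) * (x - d[i]))

    return S
```

In practice a library routine such as `scipy.interpolate.CubicSpline` would be used; the hand-rolled version only illustrates the formula.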

### 4. Research Method

### 4.1. Proposed Prediction Model

*x′* is the normalized data, *x* is the original input data, and *x*_{min} and *x*_{max} are the minimum and maximum values in *x*, respectively. *y′* is the predicted value of the model, with a value in [0, 1]. *y* is the actual predicted value of the model after denormalization.
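The numbered normalization formulas did not survive extraction; the description above corresponds to standard min-max scaling, sketched below (function names are illustrative):

```python
import numpy as np

def minmax_normalize(x):
    """x' = (x - x_min) / (x_max - x_min): maps x into [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalize(y_prime, x_min, x_max):
    """Map a prediction y' in [0, 1] back to the original scale."""
    return y_prime * (x_max - x_min) + x_min
```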

### 4.2. TA-RNN Model Construction

The historical data *X̃* of the target sequence and the multivariate data *X* need to be input, and the output is the predicted variable value *Y*, where *X* = (*x*_{1}, *x*_{2}, ···, *x*_{T})^{T} = (*x*^{1}, *x*^{2}, ···, *x*^{N}), *X̃* = (*x̃*_{1}, *x̃*_{2}, ···, *x̃*_{T}), *x*_{t} ∈ *R*^{N}, *x*^{n} ∈ *R*^{T}, and *Y*, *y*_{t} ∈ *R*. *N* is the number of factors that affect the target sequence, and *T* is the length of the time window of the model input. Therefore, the model TA-RNN can be expressed as a nonlinear function *F* that needs to be trained to obtain specific parameters:

##### (4)

$$\begin{array}{c}Y=F\hspace{0.17em}\left(\tilde{X},X\right)\\ =F\hspace{0.17em}\left({\tilde{x}}_{1},{\tilde{x}}_{2},\cdots ,{\tilde{x}}_{T},{x}^{1},{x}^{2},\cdots ,{x}^{N}\right)\end{array}$$

The GRU merges the cell state and output state of the LSTM into a single hidden state *h*_{t}, so that the GRU has a very concise structure. The calculation process is as follows:

##### (5)

$$\begin{array}{c}{r}_{t}=\sigma \hspace{0.17em}\left(\sum {\omega}_{xr}{x}_{t}+\sum {\omega}_{hr}{h}_{t-1}+{b}_{r}\right)\\ {g}_{t}=\sigma \hspace{0.17em}\left(\sum {\omega}_{xg}{x}_{t}+\sum {\omega}_{hg}{h}_{t-1}+{b}_{g}\right)\\ {\tilde{h}}_{t}=\text{tanh}\hspace{0.17em}\left(\sum {\omega}_{xh}{x}_{t}+\sum {\omega}_{rh}\hspace{0.17em}\left({r}_{t}\otimes {h}_{t-1}\right)+{b}_{c}\right)\\ {h}_{t}=\left(1-{g}_{t}\right)\otimes {h}_{t-1}+{g}_{t}\otimes {\tilde{h}}_{t}\end{array}$$

*ω* and *b* represent the weights and biases in the neurons, respectively; the subscripts indicate the position of each weight in the network. For example, *ω*_{xr} represents the weight matrix between the input and the reset gate, and *b*_{g} represents the bias vector of the update gate. *h*_{t} and *h̃*_{t} represent the output state and candidate output state of the hidden-layer neuron at time *t*, respectively. *σ* is the activation function, with the sigmoid function selected by default; it compresses the output of the two gate structures to the range [0, 1], which can be interpreted as an activation probability. *g*_{t} and *r*_{t} represent the update gate and the reset gate, respectively.
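A minimal NumPy sketch of one step of Eq. (5); the `params` layout and all weight shapes are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, params):
    """One GRU step following Eq. (5).  Keys follow the subscripts in the
    text (w_xr: input-to-reset-gate weights, b_g: update-gate bias, ...)."""
    r_t = sigmoid(params["w_xr"] @ x_t + params["w_hr"] @ h_prev + params["b_r"])  # reset gate
    g_t = sigmoid(params["w_xg"] @ x_t + params["w_hg"] @ h_prev + params["b_g"])  # update gate
    h_cand = np.tanh(params["w_xh"] @ x_t + params["w_rh"] @ (r_t * h_prev) + params["b_c"])
    h_t = (1.0 - g_t) * h_prev + g_t * h_cand      # blend old state and candidate
    return h_t
```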

### 4.3. Encoder-Decoder

With the time series attention weights, the input data *X* = (*x*_{1}, *x*_{2}, ···, *x*_{T}) can be updated, where ${x}_{t}={\left({u}_{t}^{1}{x}_{t}^{1},{u}_{t}^{2}{x}_{t}^{2},\cdots ,{u}_{t}^{n}{x}_{t}^{n}\right)}^{T}$. Through time series attention, the network can pay more attention to the important information in the input multivariate data and reduce the adverse effect of interference information, thereby improving the feature extraction ability of the network.

The hidden state *h*_{t} of the GRU unit at each time *t* is obtained and used as the encoded feature matrix:

##### (7)

$${\widehat{y}}_{t-1}={\omega}^{T}\hspace{0.17em}\left[{\widehat{y}}_{t-1};{h}_{t-1}\right]+\widehat{b}$$

where *ω* ∈ *R*^{1+m} and *b̂* ∈ *R* are parameters that the network needs to obtain through training, and [*ŷ*_{t−1}; *h*_{t−1}] ∈ *R*^{1+m} is the layer input after data fusion. The value *ŷ*_{t−1} obtained through the initial decoding can be used to update the hidden state *H*_{t} of the feature decoder at time *t*:

##### (8)

$${H}_{t}={f}_{2}\hspace{0.17em}\left({H}_{t-1},{\widehat{y}}_{t-1}\right)$$

where the nonlinear function *f*_{2} is realized by a GRU unit. *H*_{t} is updated as follows:

##### (9)

$$\{\begin{array}{l}{r}_{t}^{\prime}=\sigma \hspace{0.17em}\left({\omega}_{r}^{\prime}\hspace{0.17em}\left[{H}_{t-1};{\widehat{y}}_{t-1}\right]+{b}_{r}^{\prime}\right)\hfill \\ {g}_{t}^{\prime}=\sigma \hspace{0.17em}\left({\omega}_{g}^{\prime}\hspace{0.17em}\left[{H}_{t-1};{\widehat{y}}_{t-1}\right]+{b}_{g}^{\prime}\right)\hfill \\ {\tilde{H}}_{t}=\text{tanh}\hspace{0.17em}\left({\omega}_{h}^{\prime}\hspace{0.17em}\left[{r}_{t}^{\prime}\otimes {H}_{t-1};{\widehat{y}}_{t-1}\right]+{b}_{c}^{\prime}\right)\hfill \\ {H}_{t}=\hspace{0.17em}\left(1-{g}_{t}^{\prime}\right)\otimes {H}_{t-1}+{g}_{t}^{\prime}\otimes {\tilde{H}}_{t}\hfill \end{array}$$

where *ω′* and *b′* are the weights and biases of the network.

Based on the updated hidden state *H*_{t}, the final decoding is performed to obtain the predicted value *Y* of the model:

##### (10)

$$Y={\omega}_{0}^{T}{H}_{t}+{b}_{0}$$

where *ω*_{0} ∈ *R*^{m} and *b*_{0} ∈ *R* are parameters that the network needs to obtain through training.
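The two decoding steps described here can be sketched in NumPy: the Eq. (7) fusion of the previous prediction with the hidden state, and a final linear readout, which is an assumed form chosen only to be consistent with the stated shapes *ω*_{0} ∈ *R*^{m} and *b*_{0} ∈ *R*:

```python
import numpy as np

def initial_decode(y_prev, h_prev, omega, b_hat):
    """Eq. (7): fuse [y_prev; h_prev] in R^{1+m} and map it to a scalar."""
    z = np.concatenate(([y_prev], h_prev))      # data fusion
    return float(omega @ z + b_hat)

def final_decode(H_t, omega_0, b_0):
    """Final readout omega_0^T H_t + b_0 (assumed linear form)."""
    return float(omega_0 @ H_t + b_0)
```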

### 4.4. Timing Attention

Based on the values *x*_{t} of the different influencing factors of the input data *X* at time *t* and the hidden state *h*^{k−1} of the GRU unit, the different attention values ${e}_{t}^{k}$ of the input data *X* at different time steps can be constructed:

##### (11)

$${e}_{t}^{k}={\gamma}_{e}^{T}\hspace{0.17em}\text{tanh}\hspace{0.17em}\left({\omega}_{e}\hspace{0.17em}\left[{h}^{k-1}\right]+{\zeta}_{e}{x}_{t}\right)$$

where *γ*_{e} ∈ *R*^{n}, *ω*_{e} ∈ *R*^{n×2T}, and *ζ*_{e} ∈ *R*^{n×n} are the parameters that the network needs to obtain through training. In order to make the attention weights sum to 1, the Softmax function is used to obtain the final time series attention value ${u}_{t}^{k}$:

##### (12)

$${u}_{t}^{k}=\frac{\text{exp}\left({e}_{t}^{k}\right)}{{\sum}_{j=1}^{n}\text{exp}\left({e}_{t}^{j}\right)}$$
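A NumPy sketch of one possible reading of this attention computation: each factor's score is taken against a shared previous hidden state and then normalized with Softmax so the weights sum to 1. The per-factor sharing of `h_prev` and all parameter shapes are assumptions:

```python
import numpy as np

def attention_step(X_factors, h_prev, gamma_e, omega_e, zeta_e):
    """Score each of the n influencing factors against the hidden state,
    e_k = gamma_e . tanh(omega_e @ h_prev + zeta_e @ x_k), then apply a
    numerically stable softmax so the weights u sum to 1."""
    e = np.array([gamma_e @ np.tanh(omega_e @ h_prev + zeta_e @ x_k)
                  for x_k in X_factors])
    e = e - e.max()                      # numerical stability
    u = np.exp(e)
    return u / u.sum()
```

The weighted input ${x}_{t}={\left({u}_{t}^{1}{x}_{t}^{1},\cdots ,{u}_{t}^{n}{x}_{t}^{n}\right)}^{T}$ is then an elementwise product of the returned weights with the factor values.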

### 5. Experiment and Analysis

### 5.1. Experimental Data

The monitored pollutants include PM2.5, SO_{2}, NO_{2}, O_{3}, CO, and PM10. The file format is csv, and the data format is as follows: the first line contains the column names (the date, the hour, the data type, and each monitoring site name), and each remaining line is the measured data of a certain pollutant at a certain time, in μg/m^{3}. The 35 air quality monitoring sites in Beijing consist of 12 urban environmental assessment points, 11 suburban environmental assessment points, 7 control points and regional points, and 5 traffic pollution monitoring points, as shown in Table 1.

The data missing rates of PM2.5, SO_{2}, NO_{2}, O_{3}, CO, and PM10 were 3.12%, 6.48%, 4.13%, 3.24%, 3.08%, and 32.10%, respectively. After removing the pollutants whose missing rate exceeds the threshold, monitoring sites are removed based on the missing rates of the remaining pollutants at each site: if at least one pollutant at a monitoring site has a data missing rate greater than 5%, that site is removed. In the end, 22 air quality monitoring stations were retained in this article. The spatial distribution of the 22 monitoring stations and their spatial topology are shown in Fig. 6. After eliminating the pollutants and monitoring stations with serious data loss, 282246 original air quality records remained. The 282246 records are arranged with time as the row index and station ID as the column index and are regularized into 12830 rows; the 22 stations in each line are separated by “#”, and the pollutant concentration attributes within each station are separated by “,”. The remaining missing values are completed by linear interpolation in time. Finally, 17410 data are divided into a training set, validation set, and test set at a ratio of 8:1:1.
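A row in the layout described above (stations separated by “#”, pollutant attributes within a station separated by “,”) can be parsed with a few lines of Python; the example values in the test are fabricated for illustration:

```python
def parse_row(line):
    """Split one regularized record row into per-station attribute tuples."""
    return [tuple(float(v) for v in station.split(","))
            for station in line.strip().split("#")]
```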

### 5.2. Experimental Settings and Evaluation Index

##### (13)

$$O\hspace{0.17em}\left({Y}^{t\sim t+k},{\widehat{Y}}^{t\sim t+k}\right)=\frac{1}{N}\sum _{i=1}^{N}{\left({Y}_{i}^{t\sim t+k}-{\widehat{Y}}_{i}^{t\sim t+k}\right)}^{2}$$

where *N* is the number of training samples.

The evaluation indices are calculated from the actual value *Y̆*_{t} of PM2.5 and the predicted value *Y*_{t}; the mean value of PM2.5 is expressed as *Ȳ*. *R*^{2} measures the ability of the forecast result to represent the actual data: the larger the value, the better the forecast effect.
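The formulas for the three indices did not survive extraction; the definitions below are the standard ones consistent with the symbols in the text (an illustrative sketch, not the paper's code):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """RMSE, MAE, and R^2 between actual and predicted PM2.5 series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, mae, r2
```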

### 5.3. Performance Comparison of Different Time Window Sizes

The RMSE, MAE, and *R*^{2} changes of the proposed model under different time window sizes are shown in Fig. 7.

On the one hand, as the window size grows toward 36, the RMSE and MAE evaluation values decrease and the *R*^{2} evaluation value increases. When the window is too small, the historical feature information input to the model is very limited, so the prediction performance is low; as the window gradually increases, the model obtains more and more historical features from the input, which it can use to learn more of the nonlinearities and dependencies in the sequence and improve its predictive ability. On the other hand, when the window size is greater than 36, the RMSE and MAE evaluation values increase and the *R*^{2} evaluation value decreases, then gradually stabilizes. This is because, once the window is large enough, it brings in unnecessary information and generates more noise, which interferes with the performance of the model. Therefore, the historical time window size is set to 36 in the experiments.

### 5.4. Anti-Interference Ability Detection

The *R*^{2}, RMSE, and MAE of TA-RNN are 0.946, 18.01, and 10.38, respectively, which are better than those of the comparison models. The proposed model uses a GRU network to transmit long-term dependency information, preprocesses the data (including filling missing values), and incorporates the temporal attention mechanism, which together improve its anti-interference ability.

### 5.5. Comparison of Prediction Errors of PM2.5 Concentration Hourly Predicted by Different Models

### 5.6. Comparison of PM2.5 Concentration Prediction Errors of Different Models at Nine Stations

μg/m^{3} and MAE no less than 14 μg/m^{3}.

μg/m^{3} and 11.52 μg/m^{3}, respectively, which are better than those of the other three models. The proposed model has an encoder-decoder structure and uses the better-performing GRU module together with the timing attention mechanism, so its overall prediction accuracy is higher. Yang et al. [22] combined an LSTM with a convolutional neural network to predict the PM2.5 concentration at a specific location in the next 24 h, which imposes certain restrictions on time and place, so its generalization ability is poor. Ayturan et al. [19] used a deep learning method to predict the pollutant concentration, but the model is simple, so its RMSE and MAE are 23.48% and 26.20% higher, respectively, than those of the proposed model. Compared with the three deep learning algorithms, the traditional dynamic Markov algorithm of Zhang et al. [14] has poor overall performance. Taken together, the prediction effect of the proposed model is the best.

### 6. Conclusions

μg/m^{3} and 11.52 μg/m^{3}, respectively, which were better than those of the other models, and the stability and generalization ability of the PM2.5 concentration prediction model were greatly improved.