Deep-Learning Model for Influenza Prediction From Multisource Heterogeneous Data in a Megacity: Model Development and Evaluation

Background: In megacities, there is an urgent need to establish more sensitive forecasting and early warning methods for acute respiratory infectious diseases. Existing prediction and early warning models for influenza and other acute respiratory infectious diseases have limitations and therefore there is room for improvement.

Objective: The aim of this study was to explore a new and better-performing deep-learning model to predict influenza trends from multisource heterogeneous data in a megacity.

Methods: We collected multisource heterogeneous data from the 26th week of 2012 to the 25th week of 2019, including influenza-like illness (ILI) cases and virological surveillance, data of climate and demography, and search engines data. To avoid collinearity, we selected the best predictor according to the weight and correlation of each factor. We established a new multiattention-long short-term memory (LSTM) deep-learning model (MAL model), which was used to predict the percentage of ILI (ILI%) cases and the product of ILI% and the influenza-positive rate (ILI%×positive%), respectively. We also combined the data in different forms and added several machine-learning and deep-learning models commonly used in the past to predict influenza trends for comparison. The R2 value, explained variance scores, mean absolute error, and mean square error were used to evaluate the quality of the models.

Results: The highest correlation coefficients were found for the Baidu search data for ILI% and for air quality for ILI%×positive%. We first used the MAL model to calculate the ILI%, and then combined ILI% with climate, demographic, and Baidu data in different forms. The ILI%+climate+demography+Baidu model had the best prediction effect, with the explained variance score reaching 0.78, R2 reaching 0.76, mean absolute error of 0.08, and mean squared error of 0.01. Similarly, we used the MAL model to calculate the ILI%×positive% and combined this prediction with different data forms. The ILI%×positive%+climate+demography+Baidu model had the best prediction effect, with an explained variance score reaching 0.74, R2 reaching 0.70, mean absolute error of 0.02, and mean squared error of 0.02. Comparisons with random forest, extreme gradient boosting, LSTM, and gated current unit models showed that the MAL model had the best prediction effect.

Conclusions: The newly established MAL model outperformed existing models. Natural factors and search engine query data were more helpful in forecasting ILI patterns in megacities. With more timely and effective prediction of influenza and other respiratory infectious diseases and the epidemic intensity, early and better preparedness can be achieved to reduce the health damage to the population.