Use of daily Internet search query data improves real-time projections of influenza epidemics

Seasonal influenza causes millions of illnesses and tens of thousands of deaths per year in the USA alone. While the morbidity and mortality associated with influenza is substantial each year, the timing and magnitude of epidemics are highly variable which complicates efforts to anticipate demands on the healthcare system. Better methods to forecast influenza activity would help policymakers anticipate such stressors. The US Centers for Disease Control and Prevention (CDC) has recognized the importance of improving influenza forecasting and hosts an annual challenge for predicting influenza-like illness (ILI) activity in the USA. The CDC data serve as the reference for ILI in the USA, but this information is aggregated by epidemiological week and reported after a one-week delay (and may be subject to correction even after this reporting lag). Therefore, there has been substantial interest in whether real-time Internet search data, such as Google, Twitter or Wikipedia could be used to improve influenza forecasting. In this study, we combine a previously developed calibration and prediction framework with an established humidity-based transmission dynamic model to forecast influenza. We then compare predictions based on only CDC ILI data with predictions that leverage the earlier availability and finer temporal resolution of Wikipedia search data. We find that both the earlier availability and the finer temporal resolution are important for increasing forecasting performance. Using daily Wikipedia search data leads to a marked improvement in prediction performance compared to weekly data especially for a three- to four-week forecasting horizon.