The surging data generation capabilities of modern sensors and networked systems, together with the vastly increased processing power of computers and capacity of storage media, have led to the accumulation of enormous volumes of disparate data. The nascent field of data science focuses on developing scalable and robust algorithms for extracting knowledge from these stores of information. The growing need for powerful and novel methods to extract information from data, in a form that is useful to individuals, society, researchers, and industry, has led to a groundswell in machine learning. Recent progress has been remarkable, driven in large part by the increasing availability of large-scale training data sets, more powerful computers, and sophisticated algorithms for analyzing extremely large data sets. There is now intense interest in leveraging machine learning in many fields: automatic recognition of image content, identification of best practices in health care, improvement of agricultural yields, understanding how the human brain encodes information, and more.
However, many modern machine-learning algorithms lack interpretability and can be surprisingly fragile. Furthermore, training data can be skewed, yielding algorithms whose outputs are unexpectedly biased or “unfair.” Although the developments to date, driven primarily by phenomenological considerations, have been remarkably successful, substantial work remains to reach a fundamental understanding of why these methodologies actually succeed.
Progress can only come from the development of new and sophisticated mathematics and statistics. The study of data and information cuts across a myriad of disciplines, including computer science, statistics, optimization, and signal processing, and reaches into classical areas of mathematics. Furthermore, application-specific models and constraints in fields such as astrophysics, particle physics, biology, economics, and sociology present additional exciting opportunities for the mathematical and statistical analysis of data.