Diverse data generation for machine learning potentials
Danny Perez, Los Alamos National Laboratory (LANL)
Machine-learning interatomic potentials (MLIAPs) make it feasible to aim for both accuracy and transferability, something that earlier generations of potentials struggled to achieve. However, given their flexibility, these potentials often fail to extrapolate to properties beyond their training data, making the quality of the training set one of the determining factors in their performance. Training datasets typically consist of density-functional-theory (DFT) energies and forces of relatively small systems, traditionally selected manually or randomly from subsets of the configuration space of interest. The need for human intervention in the curation of training sets makes their generation labor-intensive and time-consuming. Here, we present a generalization of a previously developed method based on the automated maximization of the information entropy of the descriptor distribution. The diversity of the entropy-optimized training dataset is compared to that of several other datasets from the literature. The approach is demonstrated for several elements, producing potentials that are robust over a broad range of conditions and highlighting the desirable characteristics of an optimal training dataset, irrespective of the material chemistry.
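To make the selection criterion concrete, the following is a minimal illustrative sketch, not the authors' implementation: it greedily grows a training set from a pool of candidate descriptor vectors, at each step adding the candidate that most increases a nonparametric (Kozachenko-Leonenko, nearest-neighbor) estimate of the information entropy of the descriptor distribution. The greedy strategy, the specific estimator, and all function names here are assumptions made for illustration; the paper's actual optimization may differ.

```python
# Hypothetical sketch of entropy-maximizing training-set selection
# (not the authors' code): greedily pick descriptors that maximize a
# Kozachenko-Leonenko nearest-neighbor entropy estimate.
import numpy as np
from scipy.spatial import cKDTree

def knn_entropy(X, k=1):
    """Kozachenko-Leonenko entropy estimate (up to additive constants)
    for descriptor vectors X of shape (n, d)."""
    n, d = X.shape
    tree = cKDTree(X)
    # Distance to the k-th nearest neighbor, excluding the point itself.
    dists, _ = tree.query(X, k=k + 1)
    r = np.maximum(dists[:, -1], 1e-12)  # guard against duplicate points
    return d * np.mean(np.log(r)) + np.log(n - 1)

def greedy_entropy_selection(candidates, n_select, seed=0):
    """Grow a training set from a candidate descriptor pool, adding at
    each step the candidate that most increases the estimated entropy."""
    rng = np.random.default_rng(seed)
    pool = list(range(len(candidates)))
    chosen = [pool.pop(rng.integers(len(pool)))]  # random starting point
    while len(chosen) < n_select:
        best_gain, best_j = -np.inf, None
        for j in pool:
            gain = knn_entropy(candidates[chosen + [j]])
            if gain > best_gain:
                best_gain, best_j = gain, j
        chosen.append(best_j)
        pool.remove(best_j)
    return chosen

# Toy usage: from 500 random 8-dimensional descriptors, keep the 20
# that jointly span the descriptor space most broadly.
X = np.random.default_rng(1).normal(size=(500, 8))
selected = greedy_entropy_selection(X, n_select=20)
```

In this toy form the selection rewards configurations that fill sparsely sampled regions of descriptor space, which is the intuition behind favoring diverse over redundant training data.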