Extracting Complex Concentrated Alloy Properties from Scientific Literature with LLMs
Name
Prof. K.A. Christofidou
Affiliation
Department of Materials Science & Engineering, University of Sheffield
Abstract
The development of new alloys is critical for the realisation of novel technologies and instrumental in enabling improved performance. Complex concentrated alloys (CCAs), also known as multi-principal element alloys or high entropy alloys, offer a promising avenue for alloy development: their departure from a single base element opens up a vast compositional space that offers the possibility of significant improvements in material performance. However, this vast compositional space also challenges traditional approaches to alloy design, as efficient screening and downselection of promising compositions necessitates materials informatics tools, which in turn rely on the availability and robustness of large experimental databases that are currently scarce.
To address this need, interrogation of the literature using natural language processing (NLP) and large language models (LLMs) offers a unique opportunity to create, collate and curate databases. Since 2014, publications on CCAs have increased exponentially, from a few dozen to over 3000 in 2023. Despite this wealth of papers, the diverse contexts and complex data dissemination within these papers pose significant challenges to the application of NLP and LLMs. To understand the challenges, opportunities and limitations of such techniques for developing relevant databases for alloy design, this work presents an LLM-based construction of a large database consisting of compositional information and material properties.
The methodology involves several stages. First, relevant publications are gathered and preprocessed for analysis. Text classification is then used to identify content related to CCAs and filter out irrelevant content. Subsequently, LLMs are used to extract pertinent data, including elemental compositions, phase information and hardness data. This process begins with prompt engineering to evaluate existing LLM capabilities [1], followed by full fine-tuning of LLMs [2] on curated, CCA-related data to improve accuracy in domain-specific contexts. To enable robust comparisons, the extraction was applied to the literature used to construct the databases presented by Machaka et al. [3] and Gorsse et al. [4], and the results were compared against these manually collated databases.
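The final comparison step, in which LLM-extracted records are checked against a manually collated database, could be sketched roughly as follows. This is a minimal illustration and not the scoring procedure used in this work: the record fields (`composition`, `phase`, `hardness_hv`), the function names, the hardness tolerance and the example values are all assumptions made for the sketch, and the values are placeholders rather than measurements.

```python
import re

# Element symbol followed by an optional numeric amount, e.g. "Al0.5" or "Co".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_composition(formula: str) -> dict[str, float]:
    """Convert a formula such as 'Al0.5CoCrFeNi' into normalised atomic fractions."""
    amounts: dict[str, float] = {}
    for symbol, amount in _TOKEN.findall(formula):
        # A missing amount means one molar part of that element.
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    if not amounts:
        raise ValueError(f"no elements found in {formula!r}")
    total = sum(amounts.values())
    return {el: round(x / total, 3) for el, x in amounts.items()}


def record_key(record: dict) -> tuple:
    """Order-independent key: normalised composition plus lower-cased phase label."""
    comp = parse_composition(record["composition"])
    return (tuple(sorted(comp.items())), record["phase"].strip().lower())


def score(extracted: list[dict], reference: list[dict], hardness_tol: float = 5.0):
    """Return (precision, recall) of extracted records against the reference set.

    A record counts as correct when its composition and phase match a reference
    entry and its hardness lies within `hardness_tol` (assumed units: HV).
    """
    remaining = {record_key(r): r["hardness_hv"] for r in reference}
    true_pos = 0
    for rec in extracted:
        key = record_key(rec)
        expected = remaining.get(key)
        if expected is not None and abs(rec["hardness_hv"] - expected) <= hardness_tol:
            true_pos += 1
            del remaining[key]  # each reference entry can be matched only once
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Placeholder records for illustration only (not real measurements).
reference = [
    {"composition": "Al0.5CoCrFeNi", "phase": "fcc", "hardness_hv": 200.0},
    {"composition": "AlCoCrFeNi", "phase": "BCC", "hardness_hv": 500.0},
]
extracted = [
    {"composition": "CoCrFeNiAl0.5", "phase": "FCC", "hardness_hv": 202.0},
    {"composition": "AlCoCrFeNi", "phase": "BCC", "hardness_hv": 300.0},
]
precision, recall = score(extracted, reference)
```

Normalising compositions before matching means that the same alloy written with its elements in a different order is still recognised as one entry. A real comparison would also need to handle unit conversions, multi-phase alloys and multiple reported conditions per composition, which this sketch ignores.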
Through iterative refinement and optimisation, this methodology can improve the efficiency of data extraction from the growing volume of CCA literature. This work will enable the creation, collation and curation of comprehensive datasets essential for advancing research in alloy development and materials discovery, thereby facilitating the accelerated exploration of novel alloy compositions and the realisation of high-performance applications.
This work was supported by Oerlikon AM Europe GmbH (website: https://www.oerlikon.com/am/en/), Engineering and Physical Sciences Research Council UK (EP/S022635/1) (website: https://www.ukri.org/councils/epsrc/), and Science Foundation Ireland (18/EPSRC-CDT/3584) (website: https://www.sfi.ie/).
References
[1] M. P. Polak, D. Morgan, Nature Communications 2024, 15, 1569.
[2] J. Dagdelen, et al., Nature Communications 2024, 15, 1418.
[3] R. Machaka, G. T. Motsi, L. M. Raganya, P. M. Radingoana, S. Chikosha, Data Brief 2021, 38, 107346.
[4] S. Gorsse, M. H. Nguyen, O. N. Senkov, D. B. Miracle, Data Brief 2018, 21, 2664.