August 16, 2023 to December 1, 2023
Online
US/Central timezone

Extracting and utilizing multimodal datasets of images and text with large language models

Nov 15, 2023, 10:30 AM
1h 30m
Full day hands-on workshop (Online)


Zoom will be used for this virtual hands-on training session.
Presentation

Speaker

Aikaterini Vriza (ANL)

Description

Abstract:
With the recent exponential growth in publication rates, it has become impossible for a scientist to keep up with all publications related to a specific topic. Although there are notable efforts to automate text parsing from the literature, important information is often communicated through images or tables in papers (1). In this talk, I will present the latest developments in two software tools developed at the Center for Nanoscale Materials (CNM): i) EXSCLAIM!, for data mining from scientific literature (2), and ii) Plot2Spectra, for segmenting spectral images in order to create metadata (3). EXSCLAIM! has been enhanced with Large Language Models (LLMs), specifically ChatGPT with appropriate prompt engineering, to extract image-text pairs from scientific journals; these pairs can serve as a foundation for creating multimodal models and advancing semantic search. I will demonstrate various applications of the extracted multimodal datasets in building knowledge graphs, conducting semantic searches, and performing topic modelling. Additionally, I will illustrate how to use the image segmentation workflow in Plot2Spectra to extract additional metadata and create datasets suitable for machine learning (ML) and high-throughput experimentation.

(1) Olivetti, E. A.; Cole, J. M.; Kim, E.; Kononova, O.; Ceder, G.; Han, T. Y.-J.; Hiszpanski, A. M. Data-Driven Materials Research Enabled by Natural Language Processing and Information Extraction. Appl. Phys. Rev. 2020, 7 (4), 041317. https://doi.org/10.1063/5.0021106.
(2) Schwenker, E.; Jiang, W.; Spreadbury, T.; Ferrier, N.; Cossairt, O.; Chan, M. K. Y. EXSCLAIM!-An Automated Pipeline for the Construction of Labeled Materials Imaging Datasets from Literature. Patterns 2023. https://arxiv.org/abs/2103.10631.
(3) Jiang, W.; Li, K.; Spreadbury, T.; Schwenker, E.; Cossairt, O.; Chan, M. K. Y. Plot2Spectra: An Automatic Spectra Extraction Tool. Digital Discovery 2022, 1 (5), 719–731. https://doi.org/10.1039/d1dd00036e.

Bio:
Aikaterini Vriza is a postdoctoral appointee at the Center for Nanoscale Materials at Argonne National Laboratory. She obtained her PhD from the Materials Innovation Factory at the University of Liverpool in 2022 and a Master's in Green Chemistry and Sustainable Industrial Technology from the University of York. Prior to that, she was an aviation engineer in the Hellenic Air Force. Her research expertise lies at the intersection of AI/ML, 'green' chemistry, and laboratory automation, and she has worked on several related projects in both industrial and academic settings.
