DEVELOPMENT OF A METHODOLOGY FOR EXTRACTING
CHEMICAL REACTION CONDITIONS FROM TEXT WITHIN IMAGE
Abstract: This paper discusses a methodology for extracting key information regarding the conditions of a chemical reaction
from unstructured textual data present in illustrations within scientific articles. The proposed methodology aims
to expedite the process of acquiring and organizing data on the synthesis of compounds presented in scientific
literature. To solve this task, a module was developed that performs text recognition in an image, as well as
identification and classification of reaction parameters in the recognized text using neural networks. In order to
reduce the amount of labeled data necessary for training a robust text recognition model, an application for
generating synthetic images and corresponding labels was created. For the same purpose, the named entity
recognition model was pre-trained on a large publicly available dataset of
chemical patents. During training of the text recognition model, input image augmentations were used to simulate
various features of the target data, enlarge and diversify the training set, and improve the
generalizability of the model. A modified algorithm for extracting BERT embeddings was proposed to incorporate
word-level information when using character-level tokenization. After the models were trained, the module was deployed
and tested, and its performance and resource consumption were measured.
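As a rough illustration of combining word-level information with character-level tokenization, the sketch below assigns each character the vector of the word that contains it. This is only one plausible reading of the idea, not the authors' algorithm: the function name `char_level_features` is hypothetical, plain lists of floats stand in for BERT word embeddings, and whitespace-based word splitting is an assumption.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def char_level_features(
    text: str,
    word_vectors: Sequence[Vector],
    pad_vector: Vector,
) -> List[Tuple[str, Vector]]:
    """Pair every character of `text` with the vector of its enclosing word.

    Assumption: words are whitespace-separated and `word_vectors` holds one
    vector per word, in order. Whitespace characters receive `pad_vector`.
    """
    if len(text.split()) != len(word_vectors):
        raise ValueError("need exactly one vector per whitespace-separated word")

    features: List[Tuple[str, Vector]] = []
    word_index = -1       # index of the word currently being read
    inside_word = False   # True while consecutive non-space characters continue

    for ch in text:
        if ch.isspace():
            inside_word = False
            features.append((ch, pad_vector))
        else:
            if not inside_word:
                # First character of a new word: advance to its vector.
                word_index += 1
                inside_word = True
            features.append((ch, word_vectors[word_index]))
    return features
```

For a recognized string such as "THF, 60 °C", the call `char_level_features("THF, 60 °C", vectors, pad)` with three word vectors yields ten character-level entries, so a character-level tagger can see which word each character belongs to.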
Keywords: machine learning, deep learning, key information extraction, optical character recognition, named entity
recognition, pre-training, synthetic data
Page numbers: 32-41.
For citation: Skredenas D.A., Novikova O.A. Development of a methodology for extracting
chemical reaction conditions from text within image // Electronic Scientific Journal IT-Standard. – 2024. – No. 2. – pp. 32-41.