Abstract

DEVELOPMENT OF A METHODOLOGY FOR EXTRACTING CHEMICAL REACTION CONDITIONS FROM TEXT WITHIN IMAGE
Abstract: This paper presents a methodology for extracting key information about chemical reaction conditions from unstructured text embedded in illustrations in scientific articles. The methodology aims to speed up the acquisition and organization of data on compound synthesis reported in the scientific literature. To this end, a module was developed that recognizes text in an image and then identifies and classifies reaction parameters in the recognized text using neural networks. To reduce the amount of labeled data needed to train a robust text recognition model, an application was created that generates synthetic images with corresponding labels. For the same purpose, the named entity recognition model was pre-trained on a large publicly available dataset of chemical patents. During training of the text recognition model, input image augmentations were used to simulate features of the target data, enlarge and diversify the training set, and improve the model's generalization. A modified algorithm for extracting BERT embeddings was proposed to incorporate word-level information when character-level tokenization is used. After training, the module was deployed and tested, and its performance and resource consumption were measured.
Page numbers: 32-41.
For citation: Skredenas D.A., Novikova O.A. Development of a methodology for extracting chemical reaction conditions from text within image // Electronic Scientific Journal IT-Standard. – 2024. – No. 2. – pp. 32-41.