Building materials information retrieval systems using text mining methods
Every year, a large number of scientific articles in the materials domain are published, but relevant information and informative entities are often hard to extract from them because that information is typically poorly structured and scattered throughout the document. Recently, more and more researchers have started to address this issue, putting effort into developing tools that can perform this information extraction over a large corpus and provide useful structured summaries.
Recent works introduce automated database-generation tools in the materials domain: systems capable of extracting materials science information and storing it in a structured manner. The Material Science Information Extractor (MatScIE) [1] is one example, where the extracted data includes the materials, codes, parameters, methods and structures reported in a published research article, together with a summary of its main research findings. The authors also created a web application where users can upload published articles and view or download the information of interest produced by the tool.
The purpose of the work is twofold. On one hand, the authors build an information extraction system that applies Named Entity Recognition (NER) to extract entities belonging to five categories: material, code, parameter, method and structure. On the other hand, they develop and present a sentence classification model that provides short, summary-level information from each article.
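To make the annotation task concrete, the snippet below shows an illustrative, made-up sentence labelled with the five entity types using a BIO tagging scheme; the exact tokens, labels and tagging scheme are assumptions for illustration, not taken from the authors' annotated corpus.

```python
# Illustrative only: a made-up sentence labelled in the BIO scheme with the
# five entity types used in [1] (material, code, parameter, method, structure).
tokens = ["We", "relaxed", "LiFePO4", "in", "VASP", "using", "PBE",
          "with", "a", "500", "eV", "cutoff", "."]
labels = ["O", "O", "B-material", "O", "B-code", "O", "B-method",
          "O", "O", "B-parameter", "I-parameter", "I-parameter", "O"]

# A sequence-labelling NER model is trained to predict one label per token.
for tok, lab in zip(tokens, labels):
    print(f"{tok:10s} {lab}")
```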
To enable the extraction of information, a corpus was created by collecting materials science articles published between 2010 and 2019 from arXiv, keeping only articles that contain at least one of the following keywords: “ab initio simulation”, “density functional study”, “density functional theory” and “first principles”. To identify and extract the entities of interest, the researchers used a neural sequence-labelling approach with word embeddings pre-trained on the materials science domain. To train this network, they randomly selected 214 arXiv articles, which were annotated by materials experts with the aforementioned labels (material, code, parameter, method and structure). Once the model was trained, they applied NER to a corpus of more than 10k articles.
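As a rough illustration of this keyword-based corpus selection, the sketch below filters article records by the four phrases; the record structure and field names are assumptions rather than the authors' actual pipeline.

```python
# Sketch of keyword-based corpus filtering (field names are assumptions).
KEYWORDS = [
    "ab initio simulation",
    "density functional study",
    "density functional theory",
    "first principles",
]

def is_relevant(article: dict) -> bool:
    """Keep an article if any keyword appears in its title or abstract."""
    text = (article.get("title", "") + " " + article.get("abstract", "")).lower()
    return any(kw in text for kw in KEYWORDS)

articles = [
    {"title": "A density functional theory study of NiO", "abstract": "..."},
    {"title": "A review of perovskite solar cells", "abstract": "..."},
]
corpus = [a for a in articles if is_relevant(a)]
print(len(corpus))  # 1
```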
With respect to the sentence classification model, the researchers selected 90 abstracts from the 214 annotated arXiv articles as the training dataset and fine-tuned the model using both bert-base-uncased and scibert-base-uncased pretrained embeddings. After applying the appropriate text preprocessing and sentence selection steps, they annotated each sentence with a positive or negative label depending on whether it matched a result from the previous entity extraction step.
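A minimal sketch of fine-tuning a pretrained encoder for this kind of binary sentence classification with the Hugging Face Transformers library is shown below; the toy data, hyperparameters and checkpoint names (SciBERT is commonly distributed as allenai/scibert_scivocab_uncased) are assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact setup) of fine-tuning a pretrained
# encoder for binary sentence classification with Hugging Face Transformers.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # or e.g. "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy data: label 1 = positive (summary-worthy), 0 = negative.
sentences = ["The band gap of MoS2 was computed to be 1.8 eV.",
             "The rest of the paper is organised as follows."]
labels = [1, 0]
enc = tokenizer(sentences, truncation=True, padding=True, return_tensors="pt")

class SentenceDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=SentenceDataset(enc, labels),
)
trainer.train()
```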
Finally, they built a useful tool with a corresponding online interface, from which one can obtain predicted entities from an uploaded published article, index materials science documents according to the material or parameter used for specific methods, and generate a short summary of the published article.
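One simple way to support the indexing described above is an inverted index from extracted entities to the documents that mention them; the sketch below is an assumption about how such look-ups could be implemented, not the tool's actual code.

```python
from collections import defaultdict

# Toy extraction output: document id -> list of (entity_text, entity_type).
extractions = {
    "doc1": [("VASP", "code"), ("PBE", "method")],
    "doc2": [("VASP", "code"), ("LiFePO4", "material")],
}

# Inverted index: (entity_text, entity_type) -> set of documents containing it.
index = defaultdict(set)
for doc_id, entities in extractions.items():
    for entity in entities:
        index[entity].add(doc_id)

print(sorted(index[("VASP", "code")]))  # ['doc1', 'doc2']
```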
Similarly, the work of Mullick et al. [2] provides another example of efficient and useful extraction of relevant information in the materials domain. Their aim is to identify the parts of an article that contain informative entities using sentence-level classifiers, in order to provide useful summaries and extract entities of interest based on certain rules. Here, sentences are considered “informative” if they contain at least one of the following types of informative entities: material names, method names, code or simulation software names, parameters of the simulation software, and material structure type.
The dataset is similar to that of the previous work: approximately 10k articles were collected, from which 214 were randomly selected and annotated. On this subset, a sentence classifier was trained with two labels, “informative” if the sentence contains an entity and “uninformative” otherwise. This yielded roughly 35% informative sentences out of a total of 49,610 sentences.
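Because a sentence is labelled “informative” exactly when it contains at least one annotated entity, the sentence-level labels can be derived mechanically from the entity annotations. A small sketch, assuming entities are given as character spans per sentence (the span format is an assumption, not the authors' annotation scheme):

```python
# Derive sentence-level labels from entity annotations (sketch).
def label_sentence(entity_spans: list) -> str:
    """A sentence is 'informative' if it has at least one annotated entity."""
    return "informative" if entity_spans else "uninformative"

annotated = [
    {"text": "All calculations were performed with Quantum ESPRESSO.",
     "entities": [(37, 53, "code")]},
    {"text": "This result agrees with earlier reports.",
     "entities": []},
]
for sent in annotated:
    print(label_sentence(sent["entities"]), "-", sent["text"])
```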
In terms of methodology and experiments, they propose deep neural network-based binary sentence classification to separate sentences into two classes, informative and uninformative, where informative refers to sentences containing any of the five entity categories defined as in [1] (material, method, code, parameter, and structure). By applying sentence classification and removing the uninformative sentences, they observe a significant improvement in the subsequent entity extraction step. For the classification task, they start with traditional machine learning methods such as Support Vector Machine (SVM), Logistic Regression and Random Forest, using Bag-of-Words (BoW) representations and features from four categories: part-of-speech tag-based, TF-IDF-based, dependency-parse-based and others. They then explore deep neural approaches, covering standard network architectures (BiLSTM, CNN, Transformers) as well as fine-tuned versions of the BERT, SciBERT and DistilBERT models, and obtain the best results in terms of precision, recall, F1 score and accuracy with BERT embeddings.
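For the classical baselines, a typical setup pairs Bag-of-Words or TF-IDF features with a linear classifier. The scikit-learn sketch below shows one such baseline; the toy sentences and the reduced feature set are assumptions and only approximate the feature categories described in the paper.

```python
# Sketch of a classical baseline: TF-IDF features + linear SVM (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

sentences = [
    "The lattice constant was computed with VASP using the PBE functional.",
    "Section 2 describes related work.",
    "A plane-wave cutoff of 520 eV was used for all calculations.",
    "We thank the reviewers for their comments.",
]
labels = [1, 0, 1, 0]  # 1 = informative, 0 = uninformative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(["The band gap was obtained with Quantum ESPRESSO."]))
```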
After identifying the informative sentences in the classification step, they train NER models using only those sentences and observe a notable increase in performance on the entity extraction task. Here they compare different models: SciBERT, BERT, DistilBERT and a BiLSTM-CRF ELMo model, with the best performance obtained by the BiLSTM-CRF ELMo model.
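The central design choice in [2] is thus to run entity extraction only on sentences the classifier marks as informative. A schematic of that two-stage pipeline, with placeholder stand-ins for the trained classifier and tagger, might look like this:

```python
# Two-stage pipeline sketch: sentence classifier -> NER on informative sentences.
# `sentence_clf` and `ner_model` are placeholders for trained models
# (e.g. a fine-tuned BERT classifier and a BiLSTM-CRF ELMo tagger).
def extract_entities(document_sentences, sentence_clf, ner_model):
    entities = []
    for sent in document_sentences:
        # Stage 1: skip sentences predicted as uninformative.
        if sentence_clf(sent) != "informative":
            continue
        # Stage 2: run the entity tagger only on the remaining sentences.
        entities.extend(ner_model(sent))
    return entities

# Trivial stand-ins so the sketch runs end to end.
demo_clf = lambda s: "informative" if "VASP" in s else "uninformative"
demo_ner = lambda s: [("VASP", "code")] if "VASP" in s else []
print(extract_entities(["We used VASP.", "See Section 2."], demo_clf, demo_ner))
```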
Authors: Miguel Rodríguez, Jan Rodríguez
References
[1] Souradip Guha, Ankan Mullick, Jatin Agrawal. MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature. https://doi.org/10.1016/j.commatsci.2021.110325. Available from: https://www.sciencedirect.com/science/article/abs/pii/S0927025621000501
[2] Ankan Mullick, Shubhraneel Pal, Tapas Nayak. Using Sentence-level Classification Helps Entity Extraction from Material Science Literature. Available from: https://paperswithcode.com/paper/using-sentence-level-classification-helps
Keywords
Information Extraction, Materials Science Articles, Text Preprocessing, Materials Entity Extraction.