Building materials information retrieval systems using text mining methods

Every year, a large number of scientific articles in the materials domain are published, but relevant information and informative entities are hard to extract from them because this information is typically poorly structured and scattered throughout the document. A growing number of researchers are addressing this issue by developing tools that extract information from large corpora and provide useful structured summaries.

Recent works introduce automated database generation tools for the materials domain: systems that extract materials science information and store it in a structured manner. Material Science Information Extractor (MatScIE) [1] is one example. The extracted data covers the material details, code, parameter, method and structure reported in a published research article, along with a summary of the main research findings. The authors also created a web application where users can upload published articles and view or download the information that the tool extracts.

The work has two purposes. First, the authors build an information extraction system that applies Named Entity Recognition (NER) to extract entities from the articles in five categories: material, code, parameter, method and structure. Second, they develop a sentence classification model that provides summary-level information about the article.

To enable the extraction of information, a dataset was created by collecting materials science articles published between 2010 and 2019 from arXiv, considering only articles that contain at least one of the following keywords: “ab initio simulation”, “density functional study”, “density functional theory” and “first principles”. To identify and extract the entities of interest, the researchers used a sequence labelling approach based on a neural network with word embeddings pre-trained on the materials science domain. To train this network, they randomly selected 214 arXiv articles, which materials experts annotated with the five labels above (material, code, parameter, method and structure). They then applied the trained NER model to a corpus of more than 10k articles.
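
The keyword-based selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout (`id`, `year`, `text`), the case-insensitive substring matching and the sample records are all assumptions made for the example.

```python
# Keywords quoted in the text; case-insensitive matching is an assumption.
KEYWORDS = (
    "ab initio simulation",
    "density functional study",
    "density functional theory",
    "first principles",
)


def mentions_keyword(text: str) -> bool:
    """Return True if the text contains at least one of the keywords."""
    lowered = " ".join(text.lower().split())  # normalise case and whitespace
    return any(keyword in lowered for keyword in KEYWORDS)


def select_articles(articles: list[dict]) -> list[dict]:
    """Keep articles from 2010 to 2019 whose text mentions a keyword."""
    return [
        article
        for article in articles
        if 2010 <= article["year"] <= 2019 and mentions_keyword(article["text"])
    ]


# Hypothetical records, made up for illustration only.
articles = [
    {"id": "a1", "year": 2015, "text": "A Density Functional Theory study of TiO2."},
    {"id": "a2", "year": 2008, "text": "First principles study of graphene."},
    {"id": "a3", "year": 2018, "text": "Experimental growth of thin films."},
]
selected_ids = [article["id"] for article in select_articles(articles)]
```

Plain substring matching misses spelling variants such as "first-principles", so a real filter would probably need some normalisation of hyphens and punctuation.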

For the sentence classification model, the researchers selected 90 abstracts from the 214 annotated arXiv articles as the training dataset and fine-tuned the model using both bert-base-uncased and scibert-base-uncased pretrained embeddings. After applying text preprocessing and sentence selection, they annotated each sentence as positive or negative depending on whether the sentence matched a result from the previous entity extraction step.
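
One possible reading of this labelling rule is sketched below: a sentence is positive when it mentions at least one entity returned by the extraction step. The naive regular-expression sentence splitter, the substring matching and the sample abstract are assumptions for illustration and are not taken from [1].

```python
import re


def split_sentences(abstract: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [part for part in parts if part]


def label_sentences(abstract: str, entities: set[str]) -> list[tuple[str, int]]:
    """Label a sentence 1 if it mentions any extracted entity, else 0."""
    lowered_entities = {entity.lower() for entity in entities}
    labelled = []
    for sentence in split_sentences(abstract):
        text = sentence.lower()
        label = int(any(entity in text for entity in lowered_entities))
        labelled.append((sentence, label))
    return labelled


# Hypothetical abstract and entity set, for illustration only.
abstract = (
    "We study bulk silicon using VASP. "
    "The results are discussed in detail. "
    "A cutoff energy of 500 eV was used."
)
entities = {"silicon", "VASP", "cutoff energy"}
labels = [label for _, label in label_sentences(abstract, entities)]
```

The resulting (sentence, label) pairs are the kind of input a BERT-style classifier would be fine-tuned on.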

Finally, they created a useful tool with its corresponding online interface, from which one can obtain predicted entities from an uploaded published article, index the material science documents according to the material used or parameter used for the specific methods, and generate a short summary of the published article.
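
Indexing documents by extracted entity can be pictured as an inverted index from (category, entity) pairs to document identifiers. The sketch below assumes a hypothetical extraction output format and lower-cases entity text for matching; it does not describe how the MatScIE tool stores its data.

```python
from collections import defaultdict


def build_index(doc_entities: dict[str, list[tuple[str, str]]]) -> dict:
    """Map each (category, entity) pair to the sorted documents containing it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for category, text in entities:
            index[(category, text.lower())].add(doc_id)
    return {key: sorted(docs) for key, docs in index.items()}


# Hypothetical extraction output, for illustration only.
doc_entities = {
    "doc1": [("material", "TiO2"), ("code", "VASP")],
    "doc2": [("material", "tio2"), ("method", "DFT")],
    "doc3": [("code", "VASP"), ("parameter", "cutoff energy")],
}
index = build_index(doc_entities)
tio2_docs = index[("material", "tio2")]
vasp_docs = index[("code", "vasp")]
```

Keeping the category in the key lets a query distinguish, for example, a string used as a method name from the same string used as a code name.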

Similarly, the work of Mullick et al. [2] provides another example of efficient extraction of relevant information in the materials domain. Their aim is to identify the parts of the article text that contain informative entities using sentence-level classifiers, in order to provide useful summaries and extract entities of interest based on certain rules. In this case, a sentence is considered “informative” if it contains at least one of the following types of informative entities: material names, method names, code or simulation software names, parameters of the simulation software, and material structure type.

The dataset is similar to that of the previously discussed work: approximately 10k articles were collected, from which 214 were randomly selected and annotated. On this annotated subset, a sentence classifier was trained with two labels: “informative” if the sentence contains an entity, and “uninformative” if it does not. Of a total of 49,610 sentences, 35% were informative.

In terms of methodology and experiments, they propose a deep neural network-based binary sentence classifier that assigns sentences to two classes, informative and uninformative, where informative refers to sentences that contain any of the five entity categories defined as in [1] (material, method, code, parameter, and structure). By applying sentence classification and removing the uninformative sentences, they observe a significant improvement in the subsequent entity extraction step. For this classification task, they start with traditional machine learning methods such as Support Vector Machine (SVM), Logistic Regression, Random Forest or Bag of Words (BoW), using features from four categories: Part-of-Speech tag-based, TF-IDF-based, dependency parse-based and others. They then explore deep neural approaches, including common network models (BiLSTM, CNN, Transformers) and fine-tuned versions of the BERT, SciBERT and DistilBERT models. The best results in terms of recall, F1 score, accuracy and precision are obtained with BERT embeddings.
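
For reference, the four metrics mentioned above can be computed from binary predictions as in the sketch below, with "informative" encoded as 1. The labels and predictions are made up for illustration and are not results from [2].

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute precision, recall, F1 and accuracy for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up gold labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

With an imbalanced dataset such as the one described here, accuracy alone can look good even for a weak classifier, which is one reason to report precision, recall and F1 alongside it.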

After identifying the sentence types in the classification step, they train NER models using only the informative sentences and observe a notable increase in performance on the entity extraction task. Here they compare several models: SciBERT, BERT, DistilBERT and a BiLSTM-CRF ELMo model, of which the BiLSTM-CRF ELMo model performs best.
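
The two-stage idea (keep only informative sentences, then read entities off the sequence labels) can be sketched as follows. The BIO tagging scheme, the sentence record layout and the sample tags are assumptions for illustration; the source does not state which tagging scheme the NER models use.

```python
def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Collect (category, text) spans from BIO tags such as B-material, I-material, O."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any span that is still open.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical classifier and tagger output, for illustration only.
sentences = [
    {
        "informative": True,
        "tokens": ["We", "relax", "bulk", "silicon", "with", "VASP", "."],
        "tags": ["O", "O", "B-material", "I-material", "O", "B-code", "O"],
    },
    {
        "informative": False,
        "tokens": ["See", "the", "appendix", "."],
        "tags": ["O", "O", "O", "O"],
    },
]

# Only sentences flagged as informative reach the entity decoding step.
entities = [
    span
    for sentence in sentences
    if sentence["informative"]
    for span in decode_bio(sentence["tokens"], sentence["tags"])
]
```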


Authors: Miguel Rodríguez, Jan Rodríguez


[1]. Souradip Guha, Ankan Mullick, Jatin Agrawal. MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature. Available from:

[2]. Ankan Mullick, Shubhraneel Pal, Tapas Nayak. Using Sentence-level Classification Helps Entity Extraction from Material Science Literature. Available from:


Keywords: Information Extraction, Material Scientific Articles, Text Preprocessing, Materials Entities Extraction.