Data Mining and Information Extraction Methods
for Large-Scale High-Quality Representations of Scientific Publications
LaTeX project documenting the writing of my PhD thesis.
- `$ make`: full build (latexmk)
- `$ make quick`: quick build (pdflatex)
- `$ make clean`: remove all intermediate generated files
- `$ make cleanall`: remove all generated files (including the PDF)
- LaTeX
- used SDQ Dissertation Template
- uni access
- public access (see “Dissertationen” → Overleaf)
- added support for Japanese and Russian content
- Tools
- written with vim ♡
- illustrations created in Inkscape
- bibliography managed in JabRef
- proofreading done with TeXtidote
Title: Data Mining and Information Extraction Methods for Large-Scale High-Quality Representations of Scientific Publications
Abstract
This dissertation addresses the challenge of generating high-quality, machine-readable representations of scientific publications at a large scale. Structured data representing scientific publications is the basis for vital infrastructure in academia, such as academic search and bibliometric performance indicators. Generating such data involves information extraction from publications’ natural language content, which makes it a challenging and error-prone process. Existing extraction methods and the data they produce are limited in several ways. This is problematic, because it means that applications and research based on currently available data are of limited scope and validity.
Among the limitations of currently available methods and data, three areas are of particular importance due to their relevance in the academic context. (1) Citation networks are a key characteristic of scientific literature, and are vital for common use cases such as trend analyses and recommender systems. Despite this importance, citation networks of widely used data sets are highly incomplete. (2) Language coverage: science is a global and therefore inherently multi-lingual endeavor. Despite a growing awareness of this, important platforms, approaches, and data sets in the scholarly domain are still limited to English publications only. (3) Research artifacts, such as methods and data sets, become more and more important, as science is increasingly driven by curated data and algorithmic processing. Fine-grained representations of research artifacts bear large potential for applications like faceted academic search and automated reproduction. However, existing extraction methods only yield shallow representations of research artifacts, not sufficient for these use cases.
To address these issues, we develop data mining and information extraction approaches that enable the creation of machine-readable publication corpora. We furthermore quantify the improvements we achieve in terms of data quality in each area of limitation. In particular, we make the following contributions. As the foundation of our research, we develop a method for creating a large-scale corpus of interlinked, full-text documents from publications’ LaTeX sources. Applying our method to all of arXiv.org, we create the first corpus of interlinked publications with extensive coverage in physics, mathematics, and computer science. Utilizing our corpus, we further present approaches yielding advances in all of the three aforementioned areas of limitation. (1) We develop a methodology for linking bibliographic references, which achieves state-of-the-art citation network completeness. Based on this, we perform novel types of citation analyses. (2) We present a method for identifying cross-lingual citations and, utilizing it, perform the largest analysis of this type of citation to date. Through our analysis, we are able to identify challenges for integrating non-English publications. (3) We develop information extraction approaches for fine-grained representations of research artifacts and their parameters. Our methods achieve an improvement over strong baselines, and their utilization enables novel types of analyses and applications.
Overall, our approaches address key shortcomings of existing methods for the creation of structured data representing publications. Through their use, we achieve significant improvements in terms of data quality. For each of our approaches, we demonstrate its viability and benefits through evaluations and practical large-scale applications. Our methods have already been adopted in several parts of the research community, which further confirms their utility.
- Research period: 2019 – 2024
- Research group: Web Science group at the Institute AIFB, KIT, Germany
- Writing schedule
Topic | Venue | Paper | Author Copy* | Code & Data |
---|---|---|---|---|
Publication Corpus Creation | Scientometrics 2020 | Springer* | KIT | GitHub, Zenodo |
Cross-lingual Citations | ICADL 2020 | Springer | KIT | GitHub |
Cross-lingual Citations (ext.) | IJDL 2021 | Springer* | arXiv | GitHub |
Inter-Reference Matching | ULITE@JCDL 2022 | CEUR* | KIT | GitHub |
Corpus Creation (improved) | JCDL 2023 | IEEE | arXiv | GitHub, Zenodo |
Hyperparameter IE | ECIR 2024 | Springer | arXiv | GitHub |
*open access
- share openly & be transparent
- make author copies of publications freely accessible
- share code and data
- put author contributions sections in publications
- publish dissertation under a CC license
- write locally (e.g. no web-hosted Overleaf instance)
- create illustrations as vector graphics
- André Greiner-Petter
- Highlight boxes with icons
- Structure with research objective and research tasks (as opposed to research hypotheses and/or research questions)
- Overview tables of primary and secondary publications
- Dedicated reference section for own publications
- Tobias Weller
- Chapter marks
- Someone on the internet
- Thanking the reader at the end of the acknowledgements section
@phdthesis{Saier2024phdthesis,
author = {Saier, Tarek},
year = {2024},
title = {Data Mining and Information Extraction Methods for Large-Scale High-Quality Representations of Scientific Publications},
doi = {10.5445/IR/1000170262},
publisher = {{Karlsruher Institut für Technologie (KIT)}},
pagetotal = {151},
school = {Karlsruher Institut für Technologie (KIT)},
language = {english}
}