PhD Thesis

Data Mining and Information Extraction Methods
for Large-Scale High-Quality Representations of Scientific Publications

LaTeX project documenting the writing of my PhD thesis.

Tech

Usage

$ make: full build (latexmk)
$ make quick: quick build (pdflatex)
$ make clean: remove all intermediate generated files
$ make cleanall: remove all generated files (including the PDF)
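
The four targets above could be wired up roughly as follows. This is only a sketch under stated assumptions, not the repository's actual Makefile: the main file name `dis.tex` (variable `MAIN`) and the exact latexmk/pdflatex invocations are guesses based on the target descriptions.

```makefile
# Sketch only. MAIN and the commands below are assumptions,
# not copied from the project's real Makefile.
MAIN = dis

.PHONY: all quick clean cleanall

# make: full build; latexmk reruns LaTeX and the bibliography tool as needed
all:
	latexmk -pdf $(MAIN).tex

# make quick: a single pdflatex pass (citations and cross-references may be stale)
quick:
	pdflatex $(MAIN).tex

# make clean: remove intermediate files, keep the PDF
clean:
	latexmk -c $(MAIN).tex

# make cleanall: remove all generated files, including the PDF
cleanall:
	latexmk -C $(MAIN).tex
```

Recipe lines must be indented with a tab character. Delegating cleanup to `latexmk -c` and `latexmk -C` avoids maintaining a hand-written list of auxiliary file extensions.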

Background

LaTeX
- used SDQ Dissertation Template
  - uni access
  - public access (see “Dissertationen” → Overleaf)
- added support for Japanese and Russian content
Tools
- written with vim ♡
- illustrations created in Inkscape
- bibliography managed in JabRef
- proofreading done with TeXtidote

Content

Title: Data Mining and Information Extraction Methods for Large-Scale High-Quality Representations of Scientific Publications

Abstract

This dissertation addresses the challenge of generating high-quality, machine-readable representations of scientific publications at a large scale. Structured data representing scientific publications is the basis for vital infrastructure in academia, such as academic search and bibliometric performance indicators. Generating such data involves information extraction from publications’ natural language content, which makes it a challenging and error-prone process. Existing extraction methods and the data they produce are limited in several ways. This is problematic, because it means that applications and research based on currently available data are of limited scope and validity.

Among the limitations of currently available methods and data, three areas are of particular importance due to their relevance in the academic context. (1) Citation networks are a key characteristic of scientific literature, and are vital for common use cases such as trend analyses and recommender systems. Despite this importance, citation networks of widely used data sets are highly incomplete. (2) Language coverage: science is a global and therefore inherently multi-lingual endeavor. Despite a growing awareness of this, important platforms, approaches, and data sets in the scholarly domain are still limited to English publications only. (3) Research artifacts, such as methods and data sets, become more and more important, as science is increasingly driven by curated data and algorithmic processing. Fine-grained representations of research artifacts bear large potential for applications like faceted academic search and automated reproduction. However, existing extraction methods only yield shallow representations of research artifacts, not sufficient for these use cases.

To address these issues, we develop data mining and information extraction approaches that enable the creation of machine-readable publication corpora. We furthermore quantify the improvements we achieve in terms of data quality in each area of limitation. In particular, we make the following contributions. As the foundation of our research, we develop a method for creating a large-scale corpus of interlinked, full-text documents from publications’ LaTeX sources. Applying our method to all of arXiv.org, we create the first corpus of interlinked publications with extensive coverage in physics, mathematics, and computer science. Utilizing our corpus, we further present approaches yielding advances in all three of the aforementioned areas of limitation. (1) We develop a methodology for linking bibliographic references, which achieves state-of-the-art citation network completeness. Based on this, we perform novel types of citation analyses. (2) We present a method for identifying cross-lingual citations and, utilizing it, perform the largest analysis of this type of citation to date. Through our analysis, we are able to identify challenges for integrating non-English publications. (3) We develop information extraction approaches for fine-grained representations of research artifacts and their parameters. Our methods achieve an improvement over strong baselines, and their utilization enables novel types of analyses and applications.

Overall, our approaches address key shortcomings of existing methods for the creation of structured data representing publications. Through their use, we achieve significant improvements in terms of data quality. For each of our approaches, we demonstrate its viability and benefits through evaluations and practical large-scale applications. Our methods have already been adopted in several parts of the research community, which further confirms their utility.

General Info

Research period: 2019 – 2024
Research group: Web Science group at institute AIFB, KIT, Germany
Writing schedule
- 2023/06: set up LaTeX project (8f5bac9)
- 2023/10: start writing (d4a92c1)
- 2023/12: first complete version, review by supervisor (3fb520d)
- 2024/02: print, submit to examination committee (4b84dec)
- 2024/04: make last tweaks, hand in for publication (2bb7dda)

Publications Used

Topic	Venue	Paper	Author Copy*	Code & Data
Publication Corpus Creation	Scientometrics 2020	Springer*	KIT	GitHub, Zenodo
Cross-lingual Citations	ICADL 2020	Springer	KIT	GitHub
Cross-lingual Citations (ext.)	IJDL 2021	Springer*	arXiv	GitHub
Inter-Reference Matching	ULITE@JCDL 2022	CEUR*	KIT	GitHub
Corpus Creation (improved)	JCDL 2023	IEEE	arXiv	GitHub, Zenodo
Hyperparameter IE	ECIR 2024	Springer	arXiv	GitHub

*open access

Principles & Preferences

share openly & be transparent
- make author copies of publications freely accessible
- share code and data
- put author contributions sections in publications
- publish dissertation under a CC license
write locally (e.g. no web-hosted Overleaf instance)
create illustrations as vector graphics

Sources of Inspiration

André Greiner-Petter
- Highlight boxes with icons
- Structure with research objective and research tasks (as opposed to research hypotheses and/or research questions)
- Overview tables of primary and secondary publications
- Dedicated reference section for own publications
Tobias Weller
- Chapter marks
Someone on the internet
- Thanking the reader at the end of the acknowledgements section

Cite As

@phdthesis{Saier2024phdthesis,
    author       = {Saier, Tarek},
    year         = {2024},
    title        = {Data Mining and Information Extraction Methods for Large-Scale High-Quality Representations of Scientific Publications},
    doi          = {10.5445/IR/1000170262},
    publisher    = {{Karlsruher Institut für Technologie (KIT)}},
    pagetotal    = {151},
    school       = {Karlsruher Institut für Technologie (KIT)},
    language     = {english}
}
