IllDepence/phd-thesis

Data Mining and Information Extraction Methods
for Large-Scale High-Quality Representations of Scientific Publications


PhD Thesis

LaTeX project documenting the writing of my PhD thesis.

Tech

Usage

  • $ make: full build (latexmk)
  • $ make quick: quick build (pdflatex)
  • $ make clean: remove all intermediate generated files
  • $ make cleanall: remove all generated files (including the PDF)
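
For readers unfamiliar with the tools, the four targets above could be implemented roughly as follows. This is a minimal sketch and not the repository's actual Makefile: the main file name `thesis` is an assumption, and the real rules may pass additional options or use a different engine.

```make
# Hypothetical main file name (assumption); the repository's actual name may differ.
MAIN := thesis

.PHONY: all quick clean cleanall

# Full build: latexmk reruns LaTeX and the bibliography tool until references settle.
all: ; latexmk -pdf $(MAIN).tex

# Quick build: a single pdflatex pass (citations and cross-references may be stale).
quick: ; pdflatex $(MAIN).tex

# Remove intermediate generated files but keep the PDF.
clean: ; latexmk -c $(MAIN).tex

# Remove all generated files, including the PDF.
cleanall: ; latexmk -C $(MAIN).tex
```

The `target: ; command` form keeps each rule on one line and avoids tab-indented recipes; `latexmk -c` and `latexmk -C` are the standard cleanup switches that match the difference between `clean` and `cleanall`.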

Background

  • LaTeX
    • used SDQ Dissertation Template
    • added support for Japanese and Russian content
  • Tools
    • written with vim
    • illustrations created in Inkscape
    • bibliography managed in JabRef
    • proofreading done with TeXtidote

Content

Title: Data Mining and Information Extraction Methods for Large-Scale
     High-Quality Representations of Scientific Publications

Abstract

This dissertation addresses the challenge of generating high-quality, machine-readable representations of scientific publications at a large scale. Structured data representing scientific publications is the basis for vital infrastructure in academia, such as academic search and bibliometric performance indicators. Generating such data involves information extraction from publications’ natural language content, which makes it a challenging and error-prone process. Existing extraction methods and the data they produce are limited in several ways. This is problematic, because it means that applications and research based on currently available data are of limited scope and validity.

Among the limitations of currently available methods and data, three areas are of particular importance due to their relevance in the academic context. (1) Citation networks are a key characteristic of scientific literature, and are vital for common use cases such as trend analyses and recommender systems. Despite this importance, citation networks of widely used data sets are highly incomplete. (2) Language coverage: science is a global and therefore inherently multilingual endeavor. Despite a growing awareness of this, important platforms, approaches, and data sets in the scholarly domain are still limited to English publications. (3) Research artifacts, such as methods and data sets, are becoming more and more important, as science is increasingly driven by curated data and algorithmic processing. Fine-grained representations of research artifacts hold great potential for applications like faceted academic search and automated reproduction. However, existing extraction methods yield only shallow representations of research artifacts, which are insufficient for these use cases.

To address these issues, we develop data mining and information extraction approaches that enable the creation of machine-readable publication corpora. We furthermore quantify the improvements we achieve in terms of data quality in each area of limitation. In particular, we make the following contributions. As the foundation of our research, we develop a method for creating a large-scale corpus of interlinked, full-text documents from publications’ LaTeX sources. Applying our method to all of arXiv.org, we create the first corpus of interlinked publications with extensive coverage in physics, mathematics, and computer science. Utilizing our corpus, we further present approaches yielding advances in all three of the aforementioned areas of limitation. (1) We develop a methodology for linking bibliographic references, which achieves state-of-the-art citation network completeness. Based on this, we perform novel types of citation analyses. (2) We present a method for identifying cross-lingual citations and, utilizing it, perform the largest analysis of this type of citation to date. Through our analysis, we are able to identify challenges for integrating non-English publications. (3) We develop information extraction approaches for fine-grained representations of research artifacts and their parameters. Our methods achieve an improvement over strong baselines, and their utilization enables novel types of analyses and applications.

Overall, our approaches address key shortcomings of existing methods for the creation of structured data representing publications. Through their use, we achieve significant improvements in terms of data quality. For each of our approaches, we demonstrate its viability and benefits through evaluations and practical large-scale applications. Our methods have already been adopted in several parts of the research community, which further confirms their utility.

General Info

  • Research period: 2019 – 2024
  • Research group: Web Science group at the Institute AIFB, KIT, Germany
  • Writing schedule
    • 2023/06: set up LaTeX project (8f5bac9)
    • 2023/10: start writing (d4a92c1)
    • 2023/12: first complete version, review by supervisor (3fb520d)
    • 2024/02: print, submit to examination committee (4b84dec)
    • 2024/04: make last tweaks, hand in for publication (2bb7dda)

Publications Used

Topic                           Venue                Paper      Author Copy*  Code & Data
Publication Corpus Creation     Scientometrics 2020  Springer*  KIT           GitHub, Zenodo
Cross-lingual Citations         ICADL 2020           Springer   KIT           GitHub
Cross-lingual Citations (ext.)  IJDL 2021            Springer*  arXiv         GitHub
Inter-Reference Matching        ULITE@JCDL 2022      CEUR*      KIT           GitHub
Corpus Creation (improved)      JCDL 2023            IEEE       arXiv         GitHub, Zenodo
Hyperparameter IE               ECIR 2024            Springer   arXiv         GitHub

*open access

Principles & Preferences

  • share openly & be transparent
    • make author copies of publications freely accessible
    • share code and data
    • put author contributions sections in publications
    • publish dissertation under a CC license
  • write locally (e.g. no web-hosted Overleaf instance)
  • create illustrations as vector graphics

Sources of Inspiration

  • André Greiner-Petter
    • Highlight boxes with icons
    • Structure with research objective and research tasks (as opposed to research hypotheses and/or research questions)
    • Overview tables of primary and secondary publications
    • Dedicated reference section for own publications
  • Tobias Weller
    • Chapter marks
  • Someone on the internet
    • Thanking the reader at the end of the acknowledgements section

Cite As

@phdthesis{Saier2024phdthesis,
    author       = {Saier, Tarek},
    year         = {2024},
    title        = {Data Mining and Information Extraction Methods for Large-Scale High-Quality Representations of Scientific Publications},
    doi          = {10.5445/IR/1000170262},
    publisher    = {{Karlsruher Institut für Technologie (KIT)}},
    pagetotal    = {151},
    school       = {Karlsruher Institut für Technologie (KIT)},
    language     = {english}
}
