Variations on EVA algorithm

van den Akker et al. (2020) propose and justify the EVA algorithm: a set of general disambiguation procedures and the order in which they are executed. This general algorithm can be tuned in many ways to suit specific needs. Below we compare several versions of the algorithm and discuss some implications.

The sample test data are based on the data used to test another disambiguation algorithm, described in Tekles & Bornmann (2019). The test data consist of bibliographic records from Web of Science. We reproduced the smaller random sample of Tekles & Bornmann (2019) using the same Researcher IDs, kindly provided to us by the authors. We used these Researcher IDs and the same query parameters to export the sample data from Web of Science.

The table below compares the EVA versions. The motivation for these versions is described in van den Akker et al. (2020). We used several common metrics (precision, recall, F1, and accuracy, reported in the pw_ columns) and the run time (dur_mins) to assess the performance and efficiency of each version.

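| name | true_positives | false_positives | false_negatives | true_negatives | pw_presision | pw_recall | pw_f1 | pw_accuracy | dur_mins |
|---|---|---|---|---|---|---|---|---|---|
| eva-slow-dl11 | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 35 |
| eva-slow-dl01 | 2976 | 38 | 2461 | -5473 | 0.987 | 0.547 | 0.704 | -1248 | 36.3 |
| eva-slow-dl124 | 2977 | 102 | 2460 | -5537 | 0.967 | 0.548 | 0.699 | -1280 | 37.6 |
| eva-slow-dl125 | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 34.2 |
| eva-slow-dl126 | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 33.9 |
| eva-slow-dl11-c | 83 | 4 | 5354 | NA | 0.954 | 0.0153 | 0.0301 | NA | 0.73 |
| eva-slow-dl11-e | 1294 | 0 | 4143 | 1667705 | 1 | 0.238 | 0.384 | 0.998 | 0.741 |
| eva-slow-dl11-k | 503 | 6 | 4934 | NA | 0.988 | 0.0925 | 0.169 | NA | 3.98 |
| eva-slow-dl11-x | 233 | 0 | 5204 | NA | 1 | 0.0429 | 0.0822 | NA | 14.4 |
| eva-slow-dl11-eic2s | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 33.9 |
| eva-slow-dl11-eic2sk2 | 2981 | 92 | 2456 | -5527 | 0.97 | 0.548 | 0.701 | -1273 | 34 |
| eva-slow-dl11-eicsk2 | 2981 | 92 | 2456 | -5527 | 0.97 | 0.548 | 0.701 | -1273 | 33.9 |
| eva-slow-dl11-i | 2362 | 77 | 3075 | 5940768 | 0.968 | 0.434 | 0.6 | 0.999 | 0.757 |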
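
The metric columns can be recomputed from the four count columns. The sketch below uses the standard definitions of precision, recall, F1, and accuracy; the function and argument names are illustrative and are not taken from disambr, whose own computation in `disambr_stats` may differ in detail.

```r
# Pairwise classification metrics from confusion counts.
# Illustrative helper; NOT part of the disambr package.
pairwise_metrics <- function(tp, fp, fn, tn = NA) {
    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    f1        <- 2 * precision * recall / (precision + recall)
    accuracy  <- (tp + tn) / (tp + fp + fn + tn)  # NA when tn is unknown
    c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy)
}

# Counts copied from the eva-slow-dl11-e row of the table
m <- pairwise_metrics(tp = 1294, fp = 0, fn = 4143, tn = 1667705)
print(signif(m, 3))
```

Applied to a row with a negative true_negatives value, the same accuracy formula also yields a value far outside the 0 to 1 range (about -1274 for eva-slow-dl11), which suggests that the true-negative counts of those runs deserve a second look before pw_accuracy is interpreted.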

The bar chart below shows how the run time of each EVA version is distributed across its procedures.

./disambr.analysis.png

The tentative results show that the baseline version of the algorithm, eva-slow-dl01, performs slightly better than the other versions on the F1 score (0.704) but is also one of the most time-consuming.
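
To make this trade-off concrete, the F1 and run-time figures of the full-pipeline versions can be ranked side by side. This is only a sketch: the values are copied by hand from the first comparison table, and the data frame and variable names are illustrative.

```r
# F1 and run time (minutes) of the full-pipeline versions,
# copied by hand from the comparison table
runs <- data.frame(
    name     = c("eva-slow-dl01", "eva-slow-dl11", "eva-slow-dl124",
                 "eva-slow-dl125", "eva-slow-dl126"),
    pw_f1    = c(0.704, 0.700, 0.699, 0.700, 0.700),
    dur_mins = c(36.3, 35.0, 37.6, 34.2, 33.9)
)

# Sort by F1 (descending), breaking ties by run time (ascending)
ranked <- runs[order(-runs$pw_f1, runs$dur_mins), ]
print(ranked, row.names = FALSE)

# Compare the best-scoring run with the fastest run
best    <- ranked[1, ]
fastest <- runs[which.min(runs$dur_mins), ]
trade_off <- c(f1_gain   = best$pw_f1 - fastest$pw_f1,
               extra_min = best$dur_mins - fastest$dur_mins)
print(trade_off)
```

On these figures the differences between the full-pipeline versions are small in both F1 and run time, so the ranking should be read as tentative.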

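The reproducible code snippets for each variation of the EVA algorithm considered in the analysis are given below. Each snippet uses the %>% pipe, so magrittr must be attached and the disambr functions must be loaded before running it. A more in-depth investigation will follow.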

eva-dl11

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11")

ds <-
    disambr_read("../data/wos-slow-export-subset"
               , save_sets_as = "wos-slow-export-subset.rds"
               , save_sets_dir = "../data") %>% 
    disambr_set_tekles_bornmann(
        file_path = "./inst/testdata/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl01

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl01")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>% 
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names(max_dist = 1
                                 , max_dist_short = 0
                                 , min_length = 4) %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl124

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl124")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>% 
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names(max_dist = 2
                                 , max_dist_short = 1
                                 , min_length = 4) %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl125

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl125")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names(max_dist = 2
                                 , max_dist_short = 1
                                 , min_length = 5) %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl126

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl126")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names(max_dist = 2
                                 , max_dist_short = 1
                                 , min_length = 6) %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids
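
The dl variants above differ only in the three arguments passed to `disambr_set_similar_last_names`. How disambr interprets them is not shown in this document, so the following is only a sketch of one plausible reading: names shorter than `min_length` are treated as short and may differ by at most `max_dist_short` edits, while longer names may differ by up to `max_dist` edits. The function `similar_last_names` is a hypothetical stand-in built on base R's `utils::adist` (generalized Levenshtein distance), not the package's implementation.

```r
# Hypothetical stand-in for the matching rule; NOT the disambr implementation.
# Assumption: a pair is "short" when the shorter name has fewer than
# min_length characters; short pairs use max_dist_short, others max_dist.
similar_last_names <- function(a, b, max_dist = 2, max_dist_short = 1,
                               min_length = 5) {
    a <- toupper(a)
    b <- toupper(b)
    d <- as.vector(utils::adist(a, b))   # edit distance between the two names
    is_short <- min(nchar(a), nchar(b)) < min_length
    limit <- if (is_short) max_dist_short else max_dist
    d <= limit
}

# The same pair under two min_length settings (dl124-like vs dl125-like)
similar_last_names("Smit", "Smyth", max_dist = 2, max_dist_short = 1, min_length = 4)
similar_last_names("Smit", "Smyth", max_dist = 2, max_dist_short = 1, min_length = 5)
```

Under this reading, raising `min_length` moves more names into the stricter short-name regime, which is one way the dl124, dl125, and dl126 settings could produce different candidate sets.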

eva-dl11-eic2sk

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-eic2s")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names(max_dist = 1
                                 , max_dist_short = 0
                                 , min_length = 0) %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references(references_in_common = 2) %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl11-eicsk2

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-eicsk2")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names(max_dist = 1
                                 , max_dist_short = 0
                                 , min_length = 0) %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords(keywords_in_common = 2) %>%
    disambr_set_same_researcher_ids

eva-dl11-eic2sk2

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-eic2sk2")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names(max_dist = 1
                                 , max_dist_short = 0
                                 , min_length = 0) %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references(references_in_common = 2) %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords(keywords_in_common = 2) %>%
    disambr_set_same_researcher_ids
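
The three snippets above vary `references_in_common` and `keywords_in_common`. This document does not show how disambr applies them, so the sketch below assumes the simplest threshold rule: two papers are linked when they share at least that many items. The helper `share_at_least` and the example identifiers are hypothetical.

```r
# Hypothetical helper; NOT a disambr function.
# Assumption: two papers are linked when they share at least `in_common`
# items (cited references or keywords).
share_at_least <- function(items_a, items_b, in_common = 1) {
    length(intersect(items_a, items_b)) >= in_common
}

# Made-up reference identifiers for two papers
refs_paper_1 <- c("ref-A", "ref-B", "ref-C")
refs_paper_2 <- c("ref-B", "ref-D")

linked_1 <- share_at_least(refs_paper_1, refs_paper_2, in_common = 1)
linked_2 <- share_at_least(refs_paper_1, refs_paper_2, in_common = 2)
print(c(in_common_1 = linked_1, in_common_2 = linked_2))
```

Under this assumed rule a higher threshold can only remove links, so it would be expected to trade some recall for precision.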

eva-dl11-e

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../data/disambr-data/eva-slow-dl11-e")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../data/disambr-data/disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "./inst/testdata/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    disambr_set_same_email %>% 
    ## disambr_set_same_affiliation %>%
    ## disambr_set_cite_others_paper %>%
    ## disambr_set_common_references %>%
    ## disambr_set_cite_self_citation %>%
    ## disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl11-i

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-i")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    ## disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    ## disambr_set_cite_others_paper %>%
    ## disambr_set_common_references %>%
    ## disambr_set_cite_self_citation %>%
    ## disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl11-c

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-c")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    ## disambr_set_same_email %>% 
    ## disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    ## disambr_set_common_references %>%
    ## disambr_set_cite_self_citation %>%
    ## disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl11-x

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-x")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    ## disambr_set_same_email %>% 
    ## disambr_set_same_affiliation %>%
    ## disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    ## disambr_set_cite_self_citation %>%
    ## disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl11-s

This variant is skipped because the self-citation procedure needs previously matched authors (strong sets) to identify self-citations.

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-s")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    ## disambr_set_same_email %>% 
    ## disambr_set_same_affiliation %>%
    ## disambr_set_cite_others_paper %>%
    ## disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    ## disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids

eva-dl11-k

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
options(disambr_save_set_dir = "../eva-slow-dl11-k")

ds <-
    disambr_read("../data/wos-slow"
               , save_sets_as = "wos-slow.rds"
               , save_sets_dir = "../disambr-data") %>%  
    disambr_set_tekles_bornmann(file_path = "../data/tekles-bornmann-researcher-ids.txt") %>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    ## disambr_set_same_email %>% 
    ## disambr_set_same_affiliation %>%
    ## disambr_set_cite_others_paper %>%
    ## disambr_set_common_references %>%
    ## disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids
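
The single-procedure variants above (-e, -i, -c, -x, -s, -k) are produced by commenting steps out of the pipe. As a sketch of an alternative, the steps can be kept in a named list and selected programmatically. The step functions below are placeholders that only record their label; they are not the `disambr_set_*` procedures, and `make_step` and `run_variant` are hypothetical names.

```r
# Placeholder steps: each takes the labels applied so far and appends its own.
# These are stand-ins, NOT the disambr_set_* procedures.
make_step <- function(label) function(sets) c(sets, label)

steps <- list(
    e = make_step("same_email"),
    i = make_step("same_affiliation"),
    c = make_step("cite_others_paper"),
    x = make_step("common_references"),
    s = make_step("cite_self_citation"),
    k = make_step("common_keywords")
)

# Apply only the selected steps, always in the order defined in `steps`
run_variant <- function(keep, init = character(0)) {
    selected <- steps[names(steps) %in% keep]
    Reduce(function(sets, step) step(sets), selected, init)
}

print(run_variant("e"))            # email only, like the -e variant
print(run_variant(c("k", "e")))    # order follows `steps`, not `keep`
```

Keeping the order fixed in `steps` matters here because, as noted for the -s variant, some procedures depend on the output of earlier ones.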

Run benchmarking

stats <- c("../eva-slow-dl01"
         , "../eva-slow-dl11"
         , "../eva-slow-dl124"
         , "../eva-slow-dl125"
         , "../eva-slow-dl126"
         , "../eva-slow-dl11-c"
         , "../eva-slow-dl11-e"
         , "../eva-slow-dl11-i"
         , "../eva-slow-dl11-k"
         , "../eva-slow-dl11-x"
         , "../eva-slow-dl11-eic2s"
         , "../eva-slow-dl11-eic2sk2"
         , "../eva-slow-dl11-eicsk2"
  ) %>%
      lapply(function(d) {
          print(d)
          disambr_stats(sets_dir = d, save_rds = FALSE)
      }) %>%
      rbindlist
    

stats_dur <- 
    as.data.table(stats$dur_sets) %>%
    setnames(stats$name) %>%
    as.matrix %>%
    barplot(legend.text = paste("procedure", 1:(nrow(.))), las = 2)
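
## Tabulate the statistics for every version (3 significant digits).
## The packages and sources below must also be loaded before running the code above.
library(magrittr)
library(data.table)
source("R/disambr_sets.r")
source("R/disambr_stats.r")
options("digits" = 3)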
c("../eva-slow-dl01"
  , "../eva-slow-dl11"
  , "../eva-slow-dl11-c"
  , "../eva-slow-dl11-e"
  , "../eva-slow-dl11-eic2s"
  , "../eva-slow-dl11-eic2sk2"
  , "../eva-slow-dl11-eicsk2"
  , "../eva-slow-dl11-i"
  , "../eva-slow-dl11-k"
  , "../eva-slow-dl11-x"
  , "../eva-slow-dl124"
  , "../eva-slow-dl125"
  , "../eva-slow-dl126") %>%
      lapply(function(d) {
          print(d)
          disambr_stats(sets_dir = d, save_rds = FALSE)
      }) %>%
      rbindlist %>%
      as.matrix

| name | true_positives | false_positives | false_negatives | true_negatives | pw_presision | pw_recall | pw_f1 | pw_accuracy | dur_mins |
|---|---|---|---|---|---|---|---|---|---|
| eva-slow-dl01 | 2976 | 38 | 2461 | -5473 | 0.987 | 0.547 | 0.704 | -1248 | 16.3 |
| eva-slow-dl11 | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 35 |
| eva-slow-dl11-c | 83 | 4 | 5354 | NA | 0.954 | 0.0153 | 0.0301 | NA | 0.73 |
| eva-slow-dl11-e | 1294 | 0 | 4143 | 1667705 | 1 | 0.238 | 0.384 | 0.998 | 0.741 |
| eva-slow-dl11-eic2s | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 33.9 |
| eva-slow-dl11-eic2sk2 | 2981 | 92 | 2456 | -5527 | 0.97 | 0.548 | 0.701 | -1273 | 34 |
| eva-slow-dl11-eicsk2 | 2981 | 92 | 2456 | -5527 | 0.97 | 0.548 | 0.701 | -1273 | 33.9 |
| eva-slow-dl11-i | 2362 | 77 | 3075 | 5940768 | 0.968 | 0.434 | 0.6 | 0.999 | 0.757 |
| eva-slow-dl11-k | 503 | 6 | 4934 | NA | 0.988 | 0.0925 | 0.169 | NA | 3.98 |
| eva-slow-dl11-x | 233 | 0 | 5204 | NA | 1 | 0.0429 | 0.0822 | NA | 14.4 |
| eva-slow-dl124 | 2977 | 102 | 2460 | -5537 | 0.967 | 0.548 | 0.699 | -1280 | 37.6 |
| eva-slow-dl125 | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 34.2 |
| eva-slow-dl126 | 2977 | 89 | 2460 | -5524 | 0.971 | 0.548 | 0.7 | -1274 | 33.9 |

[2022-04-15 Fri]

library(magrittr)
library(data.table)
  source("R/disambr_sets.r")
  source("R/disambr_stats.r")
options("digits" = 3)

"/mnt/raid5/data/disambr-data" |>
    file.path(c("eva-slow-dl01"
              , "eva-slow-dl11"
              , "eva-slow-dl11-c"
              , "eva-slow-dl11-e"
              , "eva-slow-dl11-eic2s"
              , "eva-slow-dl11-eic2sk2"
              , "eva-slow-dl11-eicsk2"
              , "eva-slow-dl11-i"
              , "eva-slow-dl11-k"
              , "eva-slow-dl11-x"
              , "eva-slow-dl124"
              , "eva-slow-dl125"
              , "eva-slow-dl126")) |>
    lapply(function(d) {
        print(d)
        disambr_stats(sets_dir = d, save_rds = FALSE)
    }) |>
    data.table::rbindlist() |>
    as.matrix()
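
The snippet above relies on two base R behaviours: the native pipe `|>` (available from R 4.1) passes its left-hand side as the first argument, and `file.path` is vectorised over its arguments, so one base directory expands into one path per run. A small self-contained sketch of the same pattern follows; the base directory, the run names, and the `fake_stats` function are hypothetical stand-ins, and base `rbind` is used in place of `data.table::rbindlist`.

```r
# Hypothetical base directory and run names, for illustration only
base_dir <- "/tmp/disambr-demo"
run_dirs <- base_dir |> file.path(c("run-a", "run-b"))
print(run_dirs)

# Stand-in for disambr_stats(): returns one row per run directory
fake_stats <- function(d) {
    data.frame(name = basename(d), path_length = nchar(d))
}

# One row per run, stacked into a single data frame
stats <- do.call(rbind, lapply(run_dirs, fake_stats))
print(stats)
```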

on blade

setwd("S:/disambr")

  options(disambr_get_output_set = TRUE)
  options(disambr_mess_pretty = TRUE)
  options(disambr_save_as = TRUE)
  options(disambr_save_set_dir = "eva-1")

ts_eva_full <-
      disambr_read("wos-slow-export-subset"
                 , save_sets_as = "wos-slow-export-subset-2.rds"
                 , save_sets_dir = "eva-1") %>% 
      disambr_set_tekles_bornmann %>% #59149  vs_1962896
      disambr_set_on_same_paper %>%
      disambr_set_similar_initials %>% 
      disambr_set_similar_last_names %>%
      disambr_set_same_email %>% 
      disambr_set_same_affiliation %>%
      disambr_set_cite_others_paper %>%
      disambr_set_common_references %>%
      disambr_set_cite_self_citation %>%
      disambr_set_common_keywords %>%
      disambr_set_same_researcher_ids


ts_eva_full_try <-
      disambr_read("../data/wos-slow-export-subset"
                 , save_sets_as = "wos-slow-export-subset-2.rds"
                 , save_sets_dir = "../data")

References

van den Akker, O. R., Epskamp, S., & Vlasov, S. A. (2020). The AEV Algorithm—Author name disambiguation for large Web of Science datasets.

Tekles, A., & Bornmann, L. (2019). Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. arXiv:1904.12746 [cs]. http://arxiv.org/abs/1904.12746