EntrezDirect scripting
The "Digital Object Identifier" (DOI) uniquely identifies a research paper (and recently it has been co-opted to reference associated datasets). There are interesting and troublesome exceptions, but the vast majority of papers published in at least the last 10 years or so will have one.
Although NCBI PubMed does a great job of cataloguing biomedical literature, another site, doi.org, provides a consistent gateway to the original source of the paper. You only need to append the DOI to "dx.doi.org/" to generate a working redirection link.
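As a minimal sketch of that link-building step: the helper below (the name `doi2url` is made up for illustration and is not one of the scripts in this post) appends a DOI to the resolver address. It assumes the resolver at `https://doi.org/`, the current form of the "dx.doi.org/" address mentioned above, and strips an optional leading `doi:` label. The DOI in the usage line is only a placeholder.

```shell
# Hypothetical helper: build a resolver link from a DOI.
doi2url() {
  # ${1#doi:} removes a leading "doi:" label if one is present
  printf 'https://doi.org/%s\n' "${1#doi:}"
}

# Placeholder DOI, not a real article
doi2url "doi:10.1234/example"
```

Opening the printed address in a browser (or passing it to `curl -L`) should then follow the redirect to the publisher's page.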
Last week the NCBI posted a webinar detailing the inner workings of Entrez Direct, the command line interface for Unix computers (GNU/Linux, and Macs; Windows users can fake it with Cygwin). It revolves around a custom XML parser written in Perl (typical for bioinformaticians) encoding subtle 'switches' to tailor the output just as you would from the web service (albeit with a bit more of the inner workings on show).
I've pieced together a bit of a home pipeline. It has a function to generate citations from files listing basic bibliographic information and now, as the final piece of the puzzle, a custom function (or several) that does its best to systematically find a single unique article matching the author, publication year, and title of a paper.
Entrez Direct has concise documentation, and this setup can also be used to access gene, genetic disorder (OMIM), protein, and other types of data.
During installation, the setup script added a "source" command to my .bashrc that 'sourced' my .bash_profile, which was already in turn 'sourcing' my .bashrc, effectively putting every new terminal command prompt in an infinite loop. Watch out for this if your terminals freeze then quit after installation!
The scripts below are available here; I'll update them on the GitHub Gist if I make amendments:
function cutf (){ cut -d $'\t' -f "$@"; }
function striptoalpha (){ for thisword in $(echo "$@" | tr -dc 'A-Za-z\n' | tr 'A-Z' 'a-z'); do echo "$thisword"; done; }
function pubmed (){ esearch -db pubmed -query "$@" | efetch -format docsum | xtract -pattern DocumentSummary -present Author -and Title -element Id -first "Author/Name" -element Title; }
function pubmeddocsum (){ esearch -db pubmed -query "$@" | efetch -format docsum; }
function pubmedextractdoi (){ pubmeddocsum "$@" | xtract -pattern DocumentSummary -element Id -first "Author/Name" -element Title SortPubDate -block ArticleId -match "IdType:doi" -element Value | awk '{split($0,a,"\t"); split(a[4],b,"/"); print a[1]"\t"a[2]"\t"a[3]"\t"a[5]"\t"b[1]}'; }
function pubmeddoi (){ pubmedextractdoi "$@" | cutf 4; }
function pubmeddoimulti (){
xtracted=$(pubmedextractdoi "$@")
if [[ $(echo "$xtracted" | cutf 4) == '' ]]
then
xtractedpmid=$(echo "$xtracted" | cutf 1)
pmid2doirestful "$xtractedpmid"
else
echo "$xtracted" | cutf 4
fi
}
function pmid2doi (){ curl -s www.pmid2doi.org/rest/json/doi/"$@" | awk '{split($0,a,",\"doi\":\"|\"}"); print a[2]}'; }
function pmid2doimulti (){
curleddoi=$(pmid2doi "$@")
if [[ $curleddoi == '' ]]
then
pmid2doincbi "$@"
else
echo "$curleddoi"
fi
}
function pmid2doincbi (){
xtracteddoi=$(pubmedextractdoi "$@")
if [[ $xtracteddoi == '' ]]
then
echo "DOI NA"
else
echo "$xtracteddoi"
fi
}
function AddPubTableDOIsSimple () {
old_IFS=$IFS
IFS=$'\n'
for line in $(cat "$@"); do
AddPubDOI "$line"
done
IFS=$old_IFS
}
# Came across NCBI rate throttling while trying to call AddPubDOI in parallel, so added a second attempt for "DOI NA"
# and also writing STDOUT output to STDERR as this function will be used on a file (meaning STDOUT will get silenced)
# so you can see progress through the lines, as in:
# AddPubTableDOIs table.tsv > outputfile.tsv
# Don't overwrite the input file unless you're using version control.
function AddPubTableDOIs () {
old_IFS=$IFS
IFS=$'\n'
for line in $(cat "$@"); do
DOIresp=$(AddPubDOI "$line" 2>/dev/null)
if [[ $DOIresp =~ 'DOI NA' ]]; then
# try again in case it's just NCBI rate throttling, but just the once
DOIresp2=$(AddPubDOI "$line" 2>/dev/null)
if [[ $(echo "$DOIresp2" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
echo "$DOIresp2"
>&2 echo "$DOIresp2"
else
DOIinput=$(echo "$line" | cutf 1-3)
echo -e "$DOIinput\tDOI NA: Parse error"
>&2 echo -e "$DOIinput\tDOI NA: Parse error"
fi
else
if [[ $(echo "$DOIresp" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
echo "$DOIresp"
>&2 echo "$DOIresp"
else
DOIinput=$(echo "$line" | cutf 1-3)
echo -e "$DOIinput\tDOI NA: Parse error"
>&2 echo -e "$DOIinput\tDOI NA: Parse error"
fi
fi
done
IFS=$old_IFS
}
function AddPubDOI (){
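if [[ $(echo "$@" | cutf 4) != '' ]]; then
# this line already has a DOI in column 4: pass it through unchanged
echo "$@"
return
fi
# print columns 1-3 as data rather than as a printf format string (titles may contain %)
printf '%s\t' "$(echo "$@" | cutf 1-3)"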
thistitle=$(echo "$@" | cutf 3)
if [[ $thistitle != 'Title' ]]; then
thisauthor=$(echo "$@" | cutf 1)
thisyear=$(echo "$@" | cutf 2)
round1=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR]")
round1hits=$(echo "$round1" | wc -l)
if [[ "$round1hits" -gt '1' ]]; then
round2=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR] AND ("$thisyear"[Date - Publication] : "$thisyear"[Date - Publication])")
round2hits=$(echo "$round2" | wc -l)
if [[ "$round2hits" -gt '1' ]]; then
round3=$(
xtracted=$(pubmedextractdoi "$@")
xtractedtitles=$(echo "$xtracted" | cutf 3 | tr -dc "[A-Z][a-z]\n")
alphatitles=$(striptoalpha "$xtractedtitles")
thistitlealpha=$(striptoalpha "$thistitle")
presearchIFS=$IFS
IFS=$'\n'
titlecounter=0
for searchtitle in $(echo "$alphatitles"); do
# count first, so titlecounter is the 1-based line number of this title in $xtracted
titlecounter=$(( titlecounter + 1 ))
if [[ "$searchtitle" == *"$thistitlealpha"* ]]; then
echo "$xtracted" | sed $titlecounter'q;d' | cutf 4
fi
done
IFS=$presearchIFS
)
round3hits=$(echo "$round3" | wc -l)
if [[ "$round3hits" -gt '1' ]]; then
echo "ERROR multiple DOIs after 3 attempts to reduce - "$round3
else
echo $round3
fi
else
echo $round2
fi
else
echo $round1
fi
fi
}
function pmid2doirestful (){
curleddoi=$(pmid2doi "$@")
if [[ $curleddoi == '' ]]
then
echo "DOI NA"
else
echo "$curleddoi"
fi
}
function mmrlit { cat ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitedit { vim ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitgrep (){ grep -i "$@" ~/Dropbox/Y3/MMR/Essay/literature_table_with_DOIs.tsv; }
function mmrlitdoi (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | tr -d '\n' | xclip -sel p; clipconfirm; }
function mmrlitdoicite (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | awk '{print "`r citet(\""$0"\")`"}' | tr -d '\n' | xclip -sel p; clipconfirm; }
The main functions in the script are `AddPubDOI` and `AddPubTableDOIs`, the former being executed for every line in the input (reading from a table). Thanks to a weird bug or programming language feature (who knows where), you can't use the traditional `while read variable; do somefunction "$variable"; done < inputfile` construction to handle a file line by line, so I resorted to `cat` trickery. I blame Perl.
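One possible explanation, offered as an assumption rather than a diagnosis of this exact setup: if any command run inside a `while read` loop itself reads standard input, it swallows the rest of the input file and the loop stops after the first line. The sketch below reproduces that with a stand-in function (`greedy`, `loop_naive`, and `loop_safe` are hypothetical names, not part of the scripts above) and shows the usual workaround of giving the inner command `/dev/null` as its input.

```shell
# Stand-in for a command that reads standard input (hypothetical; in the real
# script this role would be played by whatever the per-line function runs).
greedy() {
  cat > /dev/null
  echo "processed: $1"
}

# Naive loop: greedy swallows the remaining lines, so only one line is processed.
loop_naive() {
  while IFS= read -r line; do
    greedy "$line"
  done < "$1"
}

# Same loop, but the inner command takes its stdin from /dev/null instead.
loop_safe() {
  while IFS= read -r line; do
    greedy "$line" < /dev/null
  done < "$1"
}

demo_file=$(mktemp)
printf 'first\nsecond\nthird\n' > "$demo_file"
loop_naive "$demo_file"   # one "processed" line
loop_safe "$demo_file"    # three "processed" lines
rm -f "$demo_file"
```

If that is what happens here, calling `AddPubDOI "$line" < /dev/null` inside a `while read` loop would be an alternative to changing `IFS` and looping over `$(cat ...)`.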
- `cutf` is my shorthand to tell the `cut` command I want a specific column in a tab-separated file or variable.
- `striptoalpha` is a function I made here to turn paper titles into all-lowercase, squished-together strings of letters (no dashes, commas, etc. that might get in the way of text comparison), as a really crude way of checking one name against another. This part of the script could easily be improved, but I was just sorting out one funny case. Usually matching author and year and using a loose title match will be sufficient to find the matching PubMed entry, for which a DOI can be found.
- `pubmed` chains together: `esearch` to search PubMed for the query; `efetch` to get the document (i.e. article) summaries as XML; and `xtract` to get the basic info. I don't use this in my little pipeline setup; rather, I kept my options open and chose to get more information, and match within blocks of the XML for the DOI. It's not so complicated to follow: as well as my code, there's this example on Biostars.
- `pubmeddocsum` just does the first 2 of the steps above, providing full unparsed XML 'docsums'.
- `pubmedextractdoi` gets date and DOI information as columns, then uses GNU awk to rearrange the columns in the output.
- `pubmeddoi` gives just the DOI column from said rearranged output.
- `pubmeddoimulti` has 'multiple' ways to try and get the DOI for an article matched from searching PubMed: firstly from the DOI output, then attempting to use the pmid2doi service output.
- `pmid2doimulti` does as for `pubmeddoimulti` but from a provided PMID.
- `pmid2doi` handles the pmid2doi.org response and `pmid2doincbi` the Entrez Direct side; both feed into `pmid2doimulti`.
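To make the column handling above concrete, here is a small offline sketch. It uses `cutf` as described, a simplified `striptoalpha` variant (it skips the per-word loop of the original), and the same awk rearrangement as `pubmedextractdoi`, applied to a made-up row rather than a real PubMed record. The row contents, the `tab` variable, and the DOI are illustrative assumptions.

```shell
tab=$(printf '\t')

# cutf as in the post, written with a portable tab variable
cutf() { cut -d "$tab" -f "$@"; }

# Simplified striptoalpha variant: keep letters and newlines, then lowercase
striptoalpha() { printf '%s\n' "$*" | tr -dc 'A-Za-z\n' | tr 'A-Z' 'a-z'; }

# Made-up row in the column order requested from xtract in pubmedextractdoi:
# Id, first author, title, SortPubDate, DOI
row=$(printf '12345\tSmith J\tAn Example Title.\t2014/01/01 00:00\t10.1234/example')

# Same rearrangement as pubmedextractdoi: Id, author, title, DOI, year
rearranged=$(printf '%s\n' "$row" | awk '{split($0,a,"\t"); split(a[4],b,"/"); print a[1]"\t"a[2]"\t"a[3]"\t"a[5]"\t"b[1]}')

printf '%s\n' "$rearranged" | cutf 4   # DOI column
printf '%s\n' "$rearranged" | cutf 5   # publication year

# Crude title comparison: does one squished title contain the other?
case "$(striptoalpha 'An example title: with subtitle')" in
  *"$(striptoalpha 'An Example Title')"*) echo "titles match" ;;
esac
```

After the rearrangement the DOI sits in column 4, which is why `pubmeddoi` and `pubmeddoimulti` both use `cutf 4`.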
Rookie's disclaimer: I'm aware pipelines are supposed to contain more, um, pipes, but I can't quite figure out an easy way to make these functions 'pipe' to one another, so I'm sticking with passing the output of one to the next as input (`"$@"` in bash script).
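One common shell pattern that might help here, sketched under the assumption that each function only needs its input as text: use the arguments when there are any, and fall back to reading standard input otherwise. The names `args_or_stdin` and `striptoalpha_pipe` are hypothetical and not part of the scripts above.

```shell
# Hypothetical wrapper pattern: use the arguments if there are any,
# otherwise read standard input, so the function works on either side of a pipe.
args_or_stdin() {
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$@"
  else
    cat
  fi
}

# A pipe-friendly take on striptoalpha built on that pattern
striptoalpha_pipe() {
  args_or_stdin "$@" | tr -dc 'A-Za-z\n' | tr 'A-Z' 'a-z'
}

striptoalpha_pipe "Hello, World"           # argument style
echo "Hello, World" | striptoalpha_pipe    # pipe style
```

Both calls print the same squished string, so existing `"$@"`-style callers would keep working while new code could chain functions with `|`.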