Repo reorganization (castorini#1202)
lintool authored May 16, 2020
1 parent 86f2dce commit 5407b03
Showing 7 changed files with 29 additions and 9 deletions.
11 changes: 10 additions & 1 deletion .gitignore
@@ -44,5 +44,14 @@ runs/
# default directory where logs go
logs/

+# default directory where collections go
+collections/
+
+# default directory where indexes go
+indexes/
+
# default output location of "Neural Hype" experiments: https://github.com/castorini/anserini/blob/master/docs/experiments-forum2018.md
-fine_tuning_results/
+fine_tuning_results/
+
+# directory where we keep throw-away Java classes that we don't want checked into the repo.
+src/main/java/io/anserini/scratch/
3 changes: 3 additions & 0 deletions bin/build.sh
@@ -0,0 +1,3 @@
#!/bin/sh

mvn clean package appassembler:assemble
3 changes: 3 additions & 0 deletions bin/qbuild.sh
@@ -0,0 +1,3 @@
#!/bin/sh

mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true
1 change: 1 addition & 0 deletions collections/.gitkeep
@@ -0,0 +1 @@
# This is the default directory for collections. Placeholder so that directory is kept in git.
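
The `.gitignore` change above ignores `collections/`, while this placeholder is committed inside that same directory. Git refuses a plain `git add` of a path that matches an ignore rule, but it keeps tracking a file once it is in the index. The commit does not say how the placeholder was added; the sketch below assumes a forced add and runs in a throw-away temporary repository, not in the real repo. The `data.txt` file name is hypothetical.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)          # throw-away repository for the demonstration
cd "$repo"
git init -q .

printf 'collections/\n' > .gitignore
mkdir collections
printf '# placeholder\n' > collections/.gitkeep

# A plain add is refused because the path matches an ignore rule.
if git add collections/.gitkeep 2>/dev/null; then
  plain_add=accepted
else
  plain_add=refused
fi

# Forcing the add puts the placeholder in the index despite the rule.
git add -f collections/.gitkeep
tracked=$(git ls-files)

# Other files dropped into the directory are still ignored.
touch collections/data.txt
if git check-ignore -q collections/data.txt; then
  data_ignored=yes
else
  data_ignored=no
fi

echo "plain add: $plain_add; tracked: $tracked; data.txt ignored: $data_ignored"
```

The same reasoning would apply to `indexes/.gitkeep` and `src/main/java/io/anserini/scratch/.gitkeep`, whose directories are ignored as well.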
18 changes: 10 additions & 8 deletions docs/experiments-cord19.md
@@ -29,7 +29,7 @@ First, download the data:

```bash
DATE=2020-05-12
-DATA_DIR=./cord19-"${DATE}"
+DATA_DIR=./collections/cord19-"${DATE}"
mkdir "${DATA_DIR}"

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/document_parses.tar.gz -P "${DATA_DIR}"
@@ -47,9 +47,11 @@ For a sense of how these different methods stack up, refer to the following pape

+ Jimmy Lin. [Is Searching Full Text More Effective Than Searching Abstracts?](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-46) BMC Bioinformatics, 10:46 (3 February 2009).

-The tl;dr — we'd recommend getting started with title + abstract index since it's the smallest in size and easiest to manipulate. Paragraph indexing is likely to be more effective (i.e., better search results), but a bit more difficult to manipulate since some deduping is required to post-process the raw hits (since multiple paragraphs from the same article might be retrieved).
+The tl;dr — we'd recommend getting started with abstract index since it's the smallest in size and easiest to manipulate. Paragraph indexing is likely to be more effective (i.e., better search results), but a bit more difficult to manipulate since some deduping is required to post-process the raw hits (since multiple paragraphs from the same article might be retrieved).
The full-text index overly biases long documents and isn't really effective; this condition is included here only for completeness.

+Note that as of TREC-COVID Round 1, there is some evidence that the abstract index is more effective for search, see results of experiments [here](experiments-covid.md).
+
### Abstract

We can index abstracts (and titles, of course) with `Cord19AbstractCollection`, as follows:
@@ -58,8 +60,8 @@ We can index abstracts (and titles, of course) with `Cord19AbstractCollection`,
sh target/appassembler/bin/IndexCollection \
-collection Cord19AbstractCollection -generator Cord19Generator \
-threads 8 -input "${DATA_DIR}" \
--index "${DATA_DIR}"/lucene-index-cord19-abstract-"${DATE}" \
--storePositions -storeDocvectors -storeContents -storeRaw -optimize > log.cord19-abstract.${DATE}.txt
+-index indexes/lucene-index-cord19-abstract-"${DATE}" \
+-storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-abstract.${DATE}.txt
```

The log should end with something like this:
@@ -85,8 +87,8 @@ We can index the full text, with `Cord19FullTextCollection`, as follows:
sh target/appassembler/bin/IndexCollection \
-collection Cord19FullTextCollection -generator Cord19Generator \
-threads 8 -input "${DATA_DIR}" \
--index "${DATA_DIR}"/lucene-index-cord19-full-text-"${DATE}" \
--storePositions -storeDocvectors -storeContents -storeRaw -optimize > log.cord19-full-text.${DATE}.txt
+-index indexes/lucene-index-cord19-full-text-"${DATE}" \
+-storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-full-text.${DATE}.txt
```

The log should end with something like this:
@@ -112,8 +114,8 @@ We can build a paragraph index with `Cord19ParagraphCollection`, as follows:
sh target/appassembler/bin/IndexCollection \
-collection Cord19ParagraphCollection -generator Cord19Generator \
-threads 8 -input "${DATA_DIR}" \
--index "${DATA_DIR}"/lucene-index-cord19-paragraph-"${DATE}" \
--storePositions -storeDocvectors -storeContents -storeRaw -optimize > log.cord19-paragraph.${DATE}.txt
+-index indexes/lucene-index-cord19-paragraph-"${DATE}" \
+-storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-paragraph.${DATE}.txt
```

The log should end with something like this:
1 change: 1 addition & 0 deletions indexes/.gitkeep
@@ -0,0 +1 @@
# This is the default directory for indexes. Placeholder so that directory is kept in git.
1 change: 1 addition & 0 deletions src/main/java/io/anserini/scratch/.gitkeep
@@ -0,0 +1 @@
# This is a directory where we keep throw-away Java classes that we don't want checked into the repo.
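
The updated indexing commands in `docs/experiments-cord19.md` write indexes under `indexes/` and redirect their output into `logs/`. This commit adds placeholders for `collections/` and `indexes/` but none for `logs/`; whether `logs/` already exists in a given checkout is not shown here, and a shell redirect fails if the target directory is missing. A minimal sketch of preparing the layout first, assuming it is run from the repository root and reusing the `DATE` value from the docs:

```shell
#!/bin/sh
set -eu

DATE=2020-05-12
DATA_DIR=./collections/cord19-"${DATE}"

# -p creates missing parent directories and does nothing
# for directories that already exist.
mkdir -p "${DATA_DIR}" indexes logs

# Report each directory the indexing commands rely on.
for d in "${DATA_DIR}" indexes logs; do
  [ -d "$d" ] && echo "ready: $d"
done
```

Because `mkdir -p` is idempotent, the sketch can be rerun before each indexing command without affecting existing data.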