Change local filename of downloaded pre-built index #2406

Closed
lintool opened this issue Mar 6, 2024 · 4 comments

@lintool
Member

lintool commented Mar 6, 2024

In ~/.cache/pyserini/indexes/, for pre-built indexes, we have:

lucene-index.msmarco-v1-passage.20221004.252b5e
lucene-index.msmarco-v1-passage.20221004.252b5e.c697b18c9a0686ca760583e615dbe450

When Pyserini downloads a pre-built index, it appends the MD5 checksum to the directory name; Anserini does not, so we end up downloading two copies of identical files.

@ArthurChen189 can we change the Anserini download name so it gets the MD5 checksum appended, to be consistent with Pyserini?
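
A minimal sketch of the naming convention being requested, assuming the cached directory name is simply the index name followed by a dot and the 32-character MD5 checksum, as in the two directory names listed above. The helper `cached_index_dirname` is hypothetical and is not part of Pyserini or Anserini.

```python
import re


def cached_index_dirname(index_name: str, md5: str) -> str:
    """Hypothetical helper: build '<index_name>.<md5>' for the local cache."""
    # An MD5 digest in hex is exactly 32 lowercase hex characters.
    if not re.fullmatch(r"[0-9a-f]{32}", md5):
        raise ValueError(f"not a valid MD5 hex digest: {md5!r}")
    return f"{index_name}.{md5}"


if __name__ == "__main__":
    # Values taken from the directory names quoted in this issue.
    print(cached_index_dirname(
        "lucene-index.msmarco-v1-passage.20221004.252b5e",
        "c697b18c9a0686ca760583e615dbe450",
    ))
```

If both toolkits derive the local name this way, they would resolve to the same directory and the second download could be skipped.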

@lintool
Member Author

lintool commented Mar 21, 2024

@16BitNarwhal worked on this: #2412

@lintool
Member Author

lintool commented Mar 21, 2024

@16BitNarwhal I don't think the issue has been fixed though... :(

On master, I'm running:

java -cp target/anserini-0.24.3-SNAPSHOT-fatjar.jar io.anserini.reproduce.RunMsMarco

I'm getting:

jimmylin@ubuntu2204-102:~/.cache/pyserini/indexes$ ls -l
total 24054864
-rw-r----- 1 jimmylin jimmylin 24632179766 Mar 21 14:39 lucene-hnsw.msmarco-v1-passage-cos-dpr-distil.20240108.825148.tar.gz
drwxr-x--x 2 jimmylin jimmylin          19 Oct  6  2022 lucene-index.cacm.20221005.252b5e.cfe14d543c6a27f4d742fb2d0099b8e0
drwxr-x--x 2 jimmylin jimmylin          15 Oct  4  2022 lucene-index.msmarco-v1-passage.20221004.252b5e
drwxr-x--x 2 jimmylin jimmylin          15 Oct  4  2022 lucene-index.msmarco-v1-passage.20221004.252b5e.c697b18c9a0686ca760583e615dbe450
drwxr-x--x 2 jimmylin jimmylin          15 May 24  2023 lucene-index.msmarco-v1-passage-splade-pp-ed.20230524.a59610
drwxr-x--x 2 jimmylin jimmylin          15 May 24  2023 lucene-index.msmarco-v1-passage-splade-pp-ed.20230524.a59610.4b3c969033cbd017306df42ce134c395
...

I.e., we're still getting two copies of the index (I cleared the cache before I started).
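
One way to spot the duplication in a listing like the one above is to look for pairs where one directory name equals another plus a `.<md5>` suffix. The following is a rough sketch under that assumption; `find_duplicate_indexes` is a hypothetical helper, not something provided by Anserini or Pyserini.

```python
import re

# A name ending in a dot followed by 32 lowercase hex characters (an MD5 digest).
MD5_SUFFIX = re.compile(r"^(?P<base>.+)\.(?P<md5>[0-9a-f]{32})$")


def find_duplicate_indexes(names):
    """Return (plain_name, name_with_md5) pairs that are both present in names."""
    present = set(names)
    duplicates = []
    for name in sorted(present):
        match = MD5_SUFFIX.match(name)
        # A duplicate exists when the same name without the checksum is also cached.
        if match and match.group("base") in present:
            duplicates.append((match.group("base"), name))
    return duplicates


if __name__ == "__main__":
    # Entry names copied from the `ls -l` output in this comment.
    listing = [
        "lucene-hnsw.msmarco-v1-passage-cos-dpr-distil.20240108.825148.tar.gz",
        "lucene-index.cacm.20221005.252b5e.cfe14d543c6a27f4d742fb2d0099b8e0",
        "lucene-index.msmarco-v1-passage.20221004.252b5e",
        "lucene-index.msmarco-v1-passage.20221004.252b5e.c697b18c9a0686ca760583e615dbe450",
        "lucene-index.msmarco-v1-passage-splade-pp-ed.20230524.a59610",
        "lucene-index.msmarco-v1-passage-splade-pp-ed.20230524.a59610.4b3c969033cbd017306df42ce134c395",
    ]
    for plain, with_md5 in find_duplicate_indexes(listing):
        print(plain, "<->", with_md5)
```

To check a real cache, the same function could be fed `os.listdir(os.path.expanduser("~/.cache/pyserini/indexes"))`.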

@16BitNarwhal
Member

Just sent a PR for the fix with details!

@lintool
Member Author

lintool commented Mar 24, 2024

Closed by #2413

lintool closed this as completed Mar 24, 2024