
Is there a way to add an index.html for the entire bucket? #18

Closed
LanDeQuHuXi opened this issue Mar 29, 2017 · 13 comments · Fixed by #80

Comments

@LanDeQuHuXi

Hi guys,
Thanks for your great work.
I'm wondering if there is a way to add an Index.html for the entire bucket?
Right now, s3pypi only generates an index.html for each package; is there a way to generate an index.html for the entire S3 bucket?

@pdonis

pdonis commented Sep 11, 2017

I have created a pull request, #26, that addresses this issue.

@derek-adair

Nice idea for sure.

@igable

igable commented Oct 18, 2018

It would be extremely useful to have this available.

@skwashd

skwashd commented May 18, 2021

This is a problem for tools that expect a PEP 503-compliant package repo. For example, dependabot fails if it can't read the root index.html, which it needs in order to know which packages are stored in the repo.
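For reference, PEP 503 only requires the root index page to contain one anchor per project, linking to each package's sub-index. A minimal sketch of generating such a page (the package names here are made up for illustration, not taken from any real bucket):

```python
# Hypothetical sketch of a PEP 503 root index: one anchor per project.
# Package names are illustrative only.
packages = ["my-lib", "another-pkg"]
links = "\n".join(f'    <a href="/{name}/">{name}</a>' for name in packages)
index_html = f"<!DOCTYPE html>\n<html>\n  <body>\n{links}\n  </body>\n</html>"
print(index_html)
```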

While #26 is a good start it no longer merges cleanly. The linked repo (natefoo/s3pypi-root-index) hasn't been updated in two years.

@mdwint @rubenvdb I'm happy to work on getting this functionality into the core project. I would appreciate some guidance on how you think we should approach this before kicking off the work. Should the aim be to update @pdonis' and @natefoo's work so it merges cleanly or would you like to see something different?

@mdwint
Member

mdwint commented May 18, 2021

The reason #26 did not get merged was that it did not handle concurrent updates.

One concern I have is that the top-level index might get overwritten when trying to upload multiple packages at the same time, since S3 does not provide locking.

Implementing this correctly will require some form of locking. Terraform's S3 backend, for example, solves this using a DynamoDB table. It's unfortunate that this adds a dependency on DynamoDB, since not everyone needs this feature. If you decide to go this route, I suggest making it optional, similar to basic auth in #76 (note that this hasn't been merged yet).
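The conditional-write pattern Terraform relies on (a put that succeeds only if no lock item exists yet, as DynamoDB's attribute_not_exists condition provides) can be sketched as follows. An in-memory dict stands in for the DynamoDB table here, purely for illustration:

```python
# Sketch of lock acquire/release via conditional writes. The dict stands in
# for a DynamoDB table; a real implementation would use put_item with a
# ConditionExpression and delete_item instead.
table = {}

def acquire_lock(lock_id: str, owner: str) -> bool:
    # Conditional put: succeeds only if no item with this key exists yet.
    if lock_id in table:
        return False
    table[lock_id] = owner
    return True

def release_lock(lock_id: str, owner: str) -> None:
    # Only the lock holder may release it.
    if table.get(lock_id) == owner:
        del table[lock_id]
```

With this, a second uploader's acquire fails until the first releases, which is exactly the guarantee a plain S3 put cannot give.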

A simpler solution raised in #26 is making it a separate command, without consistency guarantees, as in natefoo/s3pypi-root-index. I would not be against including such a command in s3pypi, provided that the lack of locking is clearly stated in the documentation.

@mdwint

mdwint commented May 18, 2021

Come to think of it, adding a DynamoDB table for locking might not be a bad idea in general, since package-level indexes could benefit from it as well. This could be opt-in, similar to how Terraform does it.

@mdwint

mdwint commented May 18, 2021

@skwashd If you need a short-term solution, I recommend creating a simple script like natefoo/s3pypi-root-index. We can add a general-purpose solution to s3pypi after #76 is merged, and once the DynamoDB locking idea has been worked out.

@skwashd

skwashd commented May 19, 2021

@mdwint thanks for the detailed replies.

While I like DynamoDB, it adds another moving part.

Late last year, Amazon made S3 reads after writes strongly consistent. What do you think about making the locking mechanism configurable? I'm happy to explore adding "NONE" and "S3" as possible values initially, then later we could add support for "DYNAMODB".

What are the blockers for merging #76? Should I be basing my work off that branch?

@mdwint

mdwint commented May 19, 2021

Late last year, Amazon made S3 reads after writes strongly consistent.

Strong read-after-write consistency does not solve the "lost update" problem. Multiple instances of s3pypi could be reading, modifying, and writing the same index file at the same time. The last instance to write would overwrite the updates the others had just written, and any new packages/versions in those updates would be lost. This would be a rare occurrence, but common enough to trigger the occasional bug report at a large enough scale, especially if we add top-level index pages.
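The race can be sketched in a few lines; the package names, and the set standing in for the index file, are made up for illustration:

```python
# Two uploaders read the same index, each adds a different package, and the
# second write clobbers the first. Names are illustrative, not s3pypi's API.
index = {"existing-pkg"}

view_a = set(index)   # uploader A reads the index
view_b = set(index)   # uploader B reads it concurrently
view_a.add("pkg-a")
view_b.add("pkg-b")

index = view_a        # A writes its view back
index = view_b        # B overwrites it: A's update ("pkg-a") is lost
```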

What do you think about making the locking mechanism configurable? I'm happy to explore adding "NONE" and "S3" as possible values initially, then later we could add support for "DYNAMODB".

What would "S3" entail? I don't think multiple locking implementations are necessary. A CLI option like --lock-indexes could be used to toggle it on or off.

To be clear: I don't expect you to solve locking just yet. I'll create a separate issue, and can have a look at this myself.

What are the blockers for merging #76? Should I be basing my work off that branch?

I can merge it in the next few days. If you want to base your work off something, I recommend the develop branch.

@skwashd

skwashd commented May 19, 2021

Sorry, I should have been clearer. I understand it takes time to create the index file. I was proposing that we store the lock file in the S3 bucket. I know there is a small window in which more than one process could perform a lock check, not see the file, and each write one.

That said, writing the top-level index is really only a problem when multiple processes are adding new packages to the repo. The file is just a list of links to the package directories in the bucket. I don't think the risk of collisions there is very high, except during bulk migrations.

@mdwint

mdwint commented May 19, 2021

I was proposing that we could store the lock file in the S3 bucket.

That could work if S3's new consistency model is strong enough, but I'm not sure that is the case. Someone proposed this idea for Terraform as well:

We’ve not yet done any research to see if S3’s new guarantees are sufficient for that model to be safe

@skwashd

skwashd commented May 19, 2021

Sounds like I should give it a shot. If it isn't good enough, we can switch to Dynamo.

@mdwint

mdwint commented May 19, 2021

@skwashd FYI, #76 has been merged. I've also created a starting point for this issue in #80. I'm assuming DynamoDB in this draft, but that can be swapped out for other implementations. From what I've read so far, I don't see this coming together using S3 alone, but I may be mistaken.
