Access Control (Exclusion) System #7

Closed
ikreymer opened this issue Feb 12, 2018 · 9 comments

@ikreymer
Contributor

Support a per-url exclusion system for pywb, including the following modes:

  • A SURT-prefix-based whitelist of allowed content, returning an HTTP 451
    message if content is outside the whitelist.
  • A SURT-based blacklist of disallowed content that should return an HTTP 404.
@ikreymer
Contributor Author

ikreymer commented Feb 12, 2018

I have a few questions regarding the exclusion system requirement (for @anjackson):

  • Are the two modes used separately in different deployments, or within same deployment for different collections?
  • Should the exclusion system also support date ranges?
  • Is there any relation between the exclusion system and the concurrent-lock system (from Support Single-Lock Access for Legal Deposit Restrictions #6), e.g. are items that are publicly excluded internally lockable, or are these completely separate systems, enabled independently as needed?

@anjackson
Contributor

  • They are used within the same deployment.
  • That would be a 'nice to have' - at the moment we just generate a SURT whitelist and blacklist, and that is sufficient.

The last question deserves a bit more detail. We have three deployments:

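| Deployment        | Blacklist | Whitelist                                           | Single-concurrent Use |
|-------------------|-----------|-----------------------------------------------------|-----------------------|
| NPLD Reading Room | Y         | No (everything is whitelisted except the blacklist) | Y                     |
| Open Access       | Y         | Yes (except the blacklist)                          | N                     |
| QA Access         | N         | No (everything is whitelisted)                      | N                     |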

I imagine we will need three different pywb collections, in three different deployments.

Important: We generally expect to run all such services behind NGINX or Apache proxies. In the case of QA Wayback, it's actually proxied through a 'parent' app that performs authentication/authorisation. I've assumed this won't be a problem, but I shouldn't take that for granted.

@ikreymer
Contributor Author

Thanks for the clarifications. It seems the best option is to configure these as three collections for integration testing, using the options above:

  • /reading-room
  • /open-access
  • /qa-access

To clarify, do the allow/disallow list access checks work in this order?

  • if disallow list is present, and SURT prefix in disallow list, return 404
  • if allow list is present, and SURT prefix not in allow list, return 451
  • return content

Does disallow always take precedence?

Or does the most specific match take precedence?
E.g. if com,example)/ is disallowed, but com,example)/allowed is allowed, would com,example)/allowed/test.html be allowed?
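
To make the question concrete, here is a minimal sketch of the disallow-first order listed above. The function name `check_access` and the plain-list representation are assumptions for illustration only, not pywb code. Under this order, the example URL would be refused because the broader disallow prefix is checked first.

```python
def check_access(surt, disallow=None, allow=None):
    """Return an HTTP status for a SURT key, checking the disallow list first.

    Hypothetical helper: both lists hold SURT prefixes, and None means
    "this list is not configured".
    """
    # 1. A disallow list is present and a prefix matches: hide the capture.
    if disallow is not None and any(surt.startswith(p) for p in disallow):
        return 404
    # 2. An allow list is present and no prefix matches: refuse the content.
    if allow is not None and not any(surt.startswith(p) for p in allow):
        return 451
    # 3. Otherwise return the content.
    return 200


status = check_access(
    "com,example)/allowed/test.html",
    disallow=["com,example)/"],
    allow=["com,example)/allowed"],
)
print(status)
```

With a most-specific-match rule instead, the longer prefix com,example)/allowed would win and the same URL would be served.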

@ikreymer
Contributor Author

For maximum flexibility, I'm considering the following approach:

Using a CDXJ-like format for ACL rules, keyed by SURT, but reverse-sorted to facilitate longest-prefix matching.

Each rule would have one of the following access settings:

  • allow - CDX entry allowed
  • block - CDX entry included, but content not allowed (served with a 451)
  • exclude - CDX entry not included, as if it didn't exist (served with a 404)

An example rule set:

com,example)/path/allowed - {"access": "allow"}
com,example)/path - {"access": "exclude"}
com,example)/ - {"access": "block"}
com - {"access": "allow"}

ACL rules can be stored in .aclj or .adxj (a for access) files and can be merge sorted on lookup (like cdx).

There will also be a default rule, which can be set to block so that a 451 is served if no other SURT matches.
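
As a rough illustration of why reverse sorting helps, the sketch below scans the example rule set from top to bottom and takes the first key that is a prefix of the requested SURT. Because a longer prefix always sorts after a shorter one, the first hit in descending order is the longest match. The names `parse_rules`, `find_access`, and `STATUS` are hypothetical, the 200 for allow is an assumption, and a real implementation would likely use a binary search rather than a linear scan.

```python
import json

# The example rule set from above, already in reverse-sorted order.
RULES = """\
com,example)/path/allowed - {"access": "allow"}
com,example)/path - {"access": "exclude"}
com,example)/ - {"access": "block"}
com - {"access": "allow"}
"""

# Status codes as described in the thread; 200 for "allow" is assumed.
STATUS = {"allow": 200, "block": 451, "exclude": 404}


def parse_rules(text):
    """Parse 'surt - json' lines into (surt_key, access) pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, payload = line.split(" ", 2)
        rules.append((key, json.loads(payload)["access"]))
    return rules


def find_access(surt, rules, default="block"):
    """Return the access setting of the longest matching prefix rule."""
    for key, access in rules:
        # Rules are in descending order, so the first prefix hit is the longest.
        if surt.startswith(key):
            return access
    return default


rules = parse_rules(RULES)
for surt in ("com,example)/path/allowed/test.html",
             "com,example)/path/other",
             "com,example)/index.html",
             "org,example)/"):
    access = find_access(surt, rules)
    print(surt, access, STATUS[access])
```

The last lookup matches no rule and so falls through to the default, here assumed to be block.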

@ikreymer
Contributor Author

ikreymer commented Feb 17, 2018

The command-line wb-manager tool can also be extended to support management of ACL rules, to avoid having to manipulate the .acl files manually.

Add a rule:
wb-manager acl add <coll> <url_or_surt>

Remove rule:
wb-manager acl remove <coll> <url_or_surt>

Return matching rule:
wb-manager acl match <coll> <url_or_surt>

Import OpenWayback-style non-surt prefix based rules (eg. excludes.txt):
wb-manager acl importexcludes <coll> <path/to/excludes.txt>

ikreymer added a commit to ukwa/pywb that referenced this issue Feb 18, 2018
…ywb#7)

- .aclj files contain access controls in reverse sorted, CDXJ-like format
- ./sample_archive/acl contains sample acl files
- directory and single-file acl sources (extend directory aggregator and file index source)
- tests for longest-prefix acl match
- tests for acl applied to collection
- pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5)
- acl types:
  * allow - all allowed
  * block - allowed in index (as blocked) but content not allowed, served as 451
  * exclude - removed from index and content, served as 404
- warcserver: AccessChecker inited if 'acl_paths' specified in custom collections
- exceptions:
  * clean up wbexception, subclasses provide the status code, message loaded automatically
  * warcserver handles AccessException with json response (now with 451 status)
  * pass status to template to allow custom handling
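
The reverse merge mentioned in this commit can be illustrated with the standard library alone. This is a minimal sketch assuming Python 3.5+, where `heapq.merge` accepts `reverse=True`; it does not use the `pywb.utils.merge` backport, and the two rule lists are made-up sample data.

```python
import heapq

# Two hypothetical ACL sources, each already reverse-sorted on its own.
blocks = [
    'com,example)/private - {"access": "block"}',
    'com,example)/ - {"access": "allow"}',
]
excludes = [
    'org,example)/ - {"access": "exclude"}',
    'com,example)/private/secret - {"access": "exclude"}',
]

# reverse=True tells merge that the inputs are in descending order and
# yields one combined descending stream without re-sorting everything.
merged = list(heapq.merge(blocks, excludes, reverse=True))

for line in merged:
    print(line)
```

The merged stream keeps the most specific keys ahead of their shorter prefixes, which is what longest-prefix matching over several files relies on.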
@ikreymer ikreymer changed the title from "Exclusion System" to "Access Control (Exclusion) System" Feb 18, 2018
ikreymer added a commit to ukwa/pywb that referenced this issue Feb 18, 2018
- 'acl_paths' config can accept a list of files or directories, a file or a directory string
- tests_acl: test collection with acl list, single file, dir
ikreymer added a commit that referenced this issue Feb 18, 2018
…lections as specified in #7

add access.robot for testing exclude, block rules (blacklist) and allow rules and default block (whitelist)
- reading-rooms has single-use-lock, blacklist
- open-access has whitelist and blacklist
- qa-access has no access controls
test-data: add httpbin.org warc/cdx for access system tests
robot script improvements: use shared init & teardown scripts, parametrize collection name, add reusable
check exclude, check blocked, check allowed functions
@anjackson
Contributor

Thanks @ikreymer this looks good.

ikreymer added a commit to ukwa/pywb that referenced this issue Feb 21, 2018
… files via command-line (ukwa/ukwa-pywb#7)

- support as target an auto-collection, where acl file added automatically in ./collections/<coll>/acl/access-rules.aclj
or specifying an .aclj explicitly for more custom configs
- support adding urls and surts, determine if url is already a surt, otherwise canonicalize
acl commands include:
- acl add <target_file_or_coll> <url_or_surt> <access> -- add (or replace) rule for url/surt with access level <access>
- acl remove <target_file_or_coll> <url_or_surt> -- remove url/surt from target
- acl list <target_file_or_coll> -- list all rules for target
- acl validate <target_file_or_coll> -- ensure sort order is correct, otherwise fix and save
- acl match <target_file_or_coll> <url> -- find matching rule, if any, in target for specified url, or print no match/default rule
- acl importtxt <target_file_or_coll> <filename> -- bulk import of 'excludes.txt' style rules, one url-per-line and add to target
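
A rough sketch of the "URL or SURT" handling and the add-or-replace behaviour described in this commit follows. Everything here is an assumption for illustration: `is_surt`, `to_surt`, and `add_rule` are hypothetical names, the SURT detection is a simple heuristic, and the conversion is heavily simplified compared with real canonicalization, which typically also normalizes things such as query arguments.

```python
from urllib.parse import urlsplit


def is_surt(value):
    """Heuristic (assumed): SURTs have no scheme and contain ',' or ')'."""
    return "://" not in value and ("," in value or ")" in value)


def to_surt(url):
    """Simplified URL-to-SURT conversion: reversed host labels, then path."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    surt = host + ")" + (parts.path or "/")
    if parts.query:
        surt += "?" + parts.query
    return surt


def add_rule(rules, url_or_surt, access):
    """Add or replace a rule, keeping (key, access) pairs reverse-sorted."""
    key = url_or_surt if is_surt(url_or_surt) else to_surt(url_or_surt)
    kept = [(k, a) for k, a in rules if k != key]  # drop any existing rule
    kept.append((key, access))
    kept.sort(key=lambda rule: rule[0], reverse=True)
    return kept


rules = add_rule([], "http://test.example.com/", "block")
rules = add_rule(rules, "uk,org,", "allow")
rules = add_rule(rules, "http://test.example.com/", "exclude")  # replaces
print(rules)
```

Re-sorting on every add is the simplest way to keep the file valid; it also mirrors what a validate step would do when it finds rules out of order.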
ikreymer added a commit that referenced this issue Feb 21, 2018
acl: update acls with cli tool, move block to correct file, include original url in json portion (#7)
@ikreymer
Contributor Author

Added initial support for a CLI command that operates on individual .aclj files.

For example, to list all rules in a file:
wb-manager acl list ./integration-test/pywb/acl/blocks.aclj

To add a new rule:
wb-manager acl add ./integration-test/pywb/acl/blocks.aclj http://test.example.com/ block

It also supports adding SURTs as well as URLs:
wb-manager acl add ./integration-test/pywb/acl/allows.aclj uk,org, allow

and removing surts/urls:
wb-manager acl remove ./integration-test/pywb/acl/allows.aclj uk,org,

ikreymer added a commit to ukwa/pywb that referenced this issue Feb 21, 2018
…ywb#7)

- add, importtxt will create an access file if it doesn't exist
- return status code 1 on errors, including if file doesn't exist (for other commands)
@ikreymer
Contributor Author

And also, an example command for importing OpenWayback-style exclusions (one URL per line):

wb-manager acl importtxt ./integration-test/pywb/acl/blocks.aclj excludes.txt exclude

@ibnesayeed

We are working towards a more flexible and efficient approach for archive profiling that contains aspects of ACLs as well. We were thinking along the lines of a similar CDXJ-style format, but more flexible than what is illustrated above. We have included the idea of wildcards in partial SURT keys to distinguish prefix matches from exact matches. This makes it easier to describe scenarios such as having one rule for a domain (or a path at a certain depth), but other rules for other resources under that path.

N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
…ywb#7)

- .aclj files contain access controls in reverse sorted, CDXJ-like format
- ./sample_archive/acl contains sample acl files
- directory and single-file acl sources (extend directory aggregator and file index source)
- tests for longest-prefix acl match
- tests for acl applied to collection
- pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5)
- acl types:
  * allow - all allowed
  * block - allowed in index (as blocked) but content not allowed, served as 451
  * exclude - removed from index and content, served as 404
- warcserver: AccessChecker inited if 'acl_paths' specified in custom collections
- exceptions:
  * clean up wbexception, subclasses provide the status code, message loaded automatically
  * warcserver handles AccessException with json response (now with 451 status)
  * pass status to template to allow custom handling
N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
- 'acl_paths' config can accept a list of files or directories, a file or a directory string
- tests_acl: test collection with acl list, single file, dir
N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
… files via command-line (ukwa/ukwa-pywb#7)

- support as target an auto-collection, where acl file added automatically in ./collections/<coll>/acl/access-rules.aclj
or specifying an .aclj explicitly for more custom configs
- support adding urls and surts, determine if url is already a surt, otherwise canonicalize
acl commands include:
- acl add <target_file_or_coll> <url_or_surt> <access> -- add (or replace) rule for url/surt with access level <access>
- acl remove <target_file_or_coll> <url_or_surt> -- remove url/surt from target
- acl list <target_file_or_coll> -- list all rules for target
- acl validate <target_file_or_coll> -- ensure sort order is correct, otherwise fix and save
- acl match <target_file_or_coll> <url> -- find matching rule, if any, in target for specified url, or print no match/default rule
- acl importtxt <target_file_or_coll> <filename> -- bulk import of 'excludes.txt' style rules, one url-per-line and add to target
N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
…ywb#7)

- add, importtxt will create an access file if it doesn't exist
- return status code 1 on errors, including if file doesn't exist (for other commands)