Merge pull request #15 from webrecorder/main
Update from master
omgoo authored May 20, 2021
2 parents db98481 + f07d357 commit 8060ac5
Showing 23 changed files with 539 additions and 42 deletions.
147 changes: 138 additions & 9 deletions docs/manual/access-control.rst
@@ -1,15 +1,87 @@
.. _access-control:

Access Control System
---------------------
Embargo and Access Control
--------------------------

The access controls system allows for a flexible configuration of rules to allow,
block or exclude access to individual urls by longest-prefix match.
The embargo system allows for date-based rules to block access to captures based on their capture dates.

The access controls system provides additional URL-based rules to allow, block or exclude access to specific URL prefixes or exact URLs.

The embargo and access control rules are configured per collection.

Embargo Settings
================

The embargo system allows restricting access to all URLs within a collection based on the timestamp of each URL.
Access to these resources is 'embargoed' until the date range is adjusted or the time interval passes.

The embargo can be used to disallow access to captures based on the following criteria:

- Captures before an exact date
- Captures after an exact date
- Captures newer than a time interval
- Captures older than a time interval

Embargo Before/After Exact Date
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To block access to all captures before or after a specific date, use the ``before`` or ``after`` embargo blocks
with a specific timestamp.

For example, the following blocks access to all URLs captured before 2020-12-26 in the collection ``embargo-before``::

  embargo-before:
      index_paths: ...
      archive_paths: ...
      embargo:
          before: '20201226'


The following blocks access to all URLs captured on or after 2020-12-26 in collection ``embargo-after``::

  embargo-after:
      index_paths: ...
      archive_paths: ...
      embargo:
          after: '20201226'

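
The exact-date checks described above can be sketched as a plain comparison of timestamps. This is a minimal illustration, not pywb's implementation: the helper names ``pad_timestamp`` and ``is_embargoed_exact`` are hypothetical, capture timestamps are assumed to be 14-digit strings, and the boundary handling (``before`` is exclusive, ``after`` is inclusive) simply follows the wording of the two examples.

```python
from typing import Optional


def pad_timestamp(ts: str) -> str:
    """Right-pad a partial timestamp such as '20201226' to 14 digits."""
    return ts.ljust(14, "0")


def is_embargoed_exact(capture_ts: str,
                       before: Optional[str] = None,
                       after: Optional[str] = None) -> bool:
    """Return True if the capture falls inside the embargoed date range.

    Assumption: 'before' blocks captures strictly earlier than the date,
    'after' blocks captures on or after the date.
    """
    capture = pad_timestamp(capture_ts)
    if before is not None and capture < pad_timestamp(before):
        return True
    if after is not None and capture >= pad_timestamp(after):
        return True
    return False
```

Because the timestamps are fixed-width digit strings, ordinary string comparison orders them chronologically, so no date parsing is needed for this case.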

Embargo By Time Interval
^^^^^^^^^^^^^^^^^^^^^^^^

The embargo can also be set for a relative time interval, consisting of years, months, weeks and/or days.


For example, the following blocks access to all URLs newer than 1 year::

  embargo-newer:
      ...
      embargo:
          newer:
              years: 1



The following blocks access to all URLs older than 1 year, 2 months, 3 weeks and 4 days::

  embargo-older:
      ...
      embargo:
          older:
              years: 1
              months: 2
              weeks: 3
              days: 4


Any combination of years, months, weeks and days can be used (as long as at least one is provided) for the ``newer`` or ``older`` embargo settings.
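
One way to picture the relative intervals is as a cutoff date computed from the current time. The sketch below uses only the standard library; ``subtract_interval`` and ``is_embargoed_interval`` are hypothetical names, and the exact boundary and month-length handling in pywb may differ from what is assumed here.

```python
import calendar
from datetime import datetime, timedelta


def subtract_interval(now, years=0, months=0, weeks=0, days=0):
    """Step back by calendar years/months, then by weeks/days."""
    # Work in whole months so that year and month offsets combine cleanly.
    total_months = now.year * 12 + (now.month - 1) - (years * 12 + months)
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    # Clamp the day for shorter months (e.g. Feb 29 -> Feb 28).
    day = min(now.day, calendar.monthrange(year, month)[1])
    shifted = now.replace(year=year, month=month, day=day)
    return shifted - timedelta(weeks=weeks, days=days)


def is_embargoed_interval(capture, now, newer=None, older=None):
    """'newer' and 'older' are dicts such as {'years': 1, 'months': 2}."""
    if newer and capture > subtract_interval(now, **newer):
        return True
    if older and capture < subtract_interval(now, **older):
        return True
    return False
```

With ``now`` fixed at 2021-05-20, ``newer: {years: 1}`` would embargo anything captured after 2020-05-20 under these assumptions.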


Access Control Settings
=======================

Access Control Files (.aclj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Access controls are set in one or more access control JSON files (.aclj), sorted in reverse alphabetical order.
URL-based access controls are set in one or more access control JSON files (.aclj), sorted in reverse alphabetical order.
To determine the best match, a binary search is used (similar to a CDXJ lookup), and the best match is then found by scanning forward from that position.
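
The lookup can be illustrated with a small sketch over an in-memory list. It is a simplified model, not the pywb code: rules are assumed to be ``(urlkey, access)`` tuples already sorted in reverse order, the timestamp field is ignored, and ``find_best_rule`` is a hypothetical name.

```python
def find_best_rule(rules, key):
    """Return the longest-prefix rule for key, or None.

    rules must be sorted in reverse (descending) order by urlkey.
    """
    lo, hi = 0, len(rules)
    # Binary search for the first rule whose key sorts at or below the lookup key.
    while lo < hi:
        mid = (lo + hi) // 2
        if rules[mid][0] > key:
            lo = mid + 1
        else:
            hi = mid
    # Scan forward: in descending order, longer prefixes come before shorter
    # ones, so the first prefix match is the most specific rule.
    for urlkey, access in rules[lo:]:
        if key.startswith(urlkey):
            return urlkey, access
    return None


EXAMPLE_RULES = sorted(
    [
        ("org,httpbin)/anything/something", "allow"),
        ("org,httpbin)/anything", "exclude"),
        ("org,httpbin)/", "block"),
        ("com,", "allow"),
    ],
    reverse=True,
)
```

Reverse ordering is what makes the forward scan work: every prefix of the lookup key sorts below the key itself, and among those prefixes the longest one is reached first.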

An .aclj file may look as follows::
@@ -22,6 +94,8 @@ An .aclj file may look as follows::

Each JSON entry contains an ``access`` field and the original ``url`` field that was used to convert to the SURT (if any).

The JSON entry may also contain a ``user`` field, as explained below.

The prefix consists of a SURT key and a ``-`` placeholder (currently reserved for a timestamp/date range field to be added later).
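
To make the line layout concrete, here is a rough sketch that builds one entry from a URL. It is deliberately simplified and does not use the real SURT canonicalization (which also normalizes case, query strings and more); ``simple_surt`` and ``acl_line`` are hypothetical helpers, and the field order in the JSON is an assumption.

```python
import json
from urllib.parse import urlsplit


def simple_surt(url):
    """Very rough SURT-style key: reversed host labels, ')', then the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")


def acl_line(url, access, user=None):
    """Build '<surt key> - <json>' where '-' is the reserved timestamp slot."""
    entry = {"access": access, "url": url}
    if user:
        entry["user"] = user
    return simple_surt(url) + " - " + json.dumps(entry)
```

For real rule files, ``wb-manager acl add`` should be preferred, since it applies the proper canonicalization and keeps the file sorted.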

Given these rules, a user would:
@@ -30,19 +104,55 @@ Given these rules, a user would:
* would receive a 404 not found error when viewing ``http://httpbin.org/anything`` (exclude)


Access Types: allow, block, exclude
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Access Types: allow, block, exclude, allow_ignore_embargo
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The available access types are as follows:

- ``exclude`` - when matched, results are excluded from the index, as if they do not exist. User will receive a 404.
- ``block`` - when matched, results are not excluded from the index and are marked with ``access: block``, but access to the actual resource is blocked. User will see a 451.
- ``allow`` - full access to the index and the resource.
- ``allow`` - full access to the index and the resource, but may be overridden by embargo
- ``allow_ignore_embargo`` - full access to the index and resource, overriding any embargo settings

The difference between ``exclude`` and ``block`` is that when blocked, the user can be notified that access is blocked, while
with exclude, no trace of the resource is presented to the user.

The use of ``allow`` is useful to provide access to more specific resources within a broader block/exclude rule.
The use of ``allow`` is useful to provide access to more specific resources within a broader block/exclude rule, while ``allow_ignore_embargo``
can be used to override any embargo settings.

If both embargo and access control rules are present, the embargo restrictions are checked first and take precedence, unless the ``allow_ignore_embargo`` option is used
to override the embargo.
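
The precedence rule can be summarized in a few lines. This is one reading of the paragraph above rather than the actual pywb logic: ``effective_access`` is a hypothetical name, and treating an embargoed capture as ``block`` (rather than some other status) is an assumption.

```python
def effective_access(rule_access, embargoed):
    """Combine a matched ACL access type with the embargo result.

    rule_access: 'allow', 'block', 'exclude' or 'allow_ignore_embargo'
    embargoed:   True if the capture date falls inside the embargo
    """
    # Only allow_ignore_embargo bypasses the embargo check.
    if rule_access == "allow_ignore_embargo":
        return "allow"
    # Otherwise the embargo is checked first and wins (assumed to block).
    if embargoed:
        return "block"
    return rule_access
```

Under this reading, a plain ``allow`` rule never lifts an embargo; only ``allow_ignore_embargo`` does.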


User-Based Access Controls
^^^^^^^^^^^^^^^^^^^^^^^^^^

The access control rules can further be customized by specifying different permissions for different 'users'. Since pywb does not have a user system,
a special header, ``X-Pywb-ACL-User``, can be used to indicate a specific user.

This setting is designed to allow a more privileged user to access additional settings or override an embargo.

For example, the following access control settings restrict access to ``https://example.com/restricted/`` by default, but allow access for the ``staff`` user::

  com,example)/restricted - {"access": "allow", "user": "staff"}
  com,example)/restricted - {"access": "block"}


Combined with the embargo settings, this can also be used to override the embargo for internal organizational users, while keeping the embargo for general access::

  com,example)/restricted - {"access": "allow_ignore_embargo", "user": "staff"}
  com,example)/restricted - {"access": "allow"}

To make this work, pywb must be running behind an Apache or Nginx system that is configured to set ``X-Pywb-ACL-User: staff`` based on certain settings.

For example, this header may be set based on IP range, or based on password authentication.

Further examples of how to set this header will be provided in the deployments section.

**Note: Do not use the user-based rules without configuring proper authentication on an Apache or Nginx frontend to set or remove this header, otherwise the 'X-Pywb-ACL-User' can easily be faked.**

See the :ref:`config-acl-header` section in Usage for examples on how to configure this header.
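
The effect of the ``user`` field can be sketched as a choice among rules that share the same URL key. This is an illustration under assumptions, not the pywb matching code: ``pick_rule`` is a hypothetical helper, and it assumes a rule without a ``user`` field acts as the fallback for everyone else.

```python
def pick_rule(rules, acl_user=None):
    """Pick the rule for acl_user from rules that share one URL key.

    rules: list of dicts such as {"access": "allow", "user": "staff"}
    """
    fallback = None
    for rule in rules:
        rule_user = rule.get("user")
        # A rule naming the requesting user is the most specific match.
        if rule_user is not None and rule_user == acl_user:
            return rule
        # Remember the first rule with no user as the general default.
        if rule_user is None and fallback is None:
            fallback = rule
    return fallback


RESTRICTED_RULES = [
    {"access": "allow", "user": "staff"},
    {"access": "block"},
]
```

A request carrying ``X-Pywb-ACL-User: staff`` would resolve to the ``allow`` rule here, while any other or missing user would fall through to ``block``.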


Access Error Messages
^^^^^^^^^^^^^^^^^^^^^
@@ -73,6 +183,11 @@ The URL supplied can be a URL or a SURT prefix. If a SURT is supplied, it is use
  wb-manager acl add <collection> com, allow


A specific user for user-based rules can also be specified, for example to add ``allow_ignore_embargo`` for user ``staff`` only, run::

  wb-manager acl add <collection> http://httpbin.org/anything/something allow_ignore_embargo -u staff


By default, access control rules apply to a prefix of a given URL or SURT.

To have the rule apply only to the exact match, use::
@@ -136,6 +251,20 @@ set merge-sorted to find the best match (very similar to the CDXJ index lookup).
Note: It might make sense to separate ``allows.aclj`` and ``blocks.aclj`` into individual files for organizational reasons,
but there is no specific need to keep more than one access control file.

Finally, ACLJ and embargo settings combined for the same collection might look as follows::

  collections:
      test:
          ...
          embargo:
              newer:
                  days: 366

          acl_paths:
              - ./path/to/allows.aclj
              - ./path/to/blocks.aclj


Default Access
^^^^^^^^^^^^^^

2 changes: 1 addition & 1 deletion docs/manual/cdxserver_api.rst
@@ -182,7 +182,7 @@ the following modifiers:


``fields``
^^^^^^
^^^^^^^^^^

The ``fields`` param can be used to specify which fields to include in the
output. The standard available fields are usually: ``urlkey``,
44 changes: 44 additions & 0 deletions docs/manual/usage.rst
@@ -293,6 +293,50 @@ Then, in your config, simply include:
The configuration assumes uwsgi is started with ``uwsgi uwsgi.ini``


.. _config-acl-header:

Configuring Access Control Header
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`access-control` system allows users to be granted different access settings based on the value of an ACL header, ``X-Pywb-ACL-User``.

The header can be set via Nginx or Apache to grant custom access privileges based on IP address, password, or another combination of rules.

For example, to set the value of the header to ``staff`` if the IP of the request is from designated local IP ranges (127.0.0.1, 192.168.1.0/24), the following settings can be added to the configs:

For Nginx::

  geo $acl_user {
    # ensure user is set to empty by default
    default "";

    # optional: add IP ranges to allow privileged access
    127.0.0.1 "staff";
    192.168.1.0/24 "staff";
  }

  ...
  location /wayback/ {
    ...
    uwsgi_param HTTP_X_PYWB_ACL_USER $acl_user;
  }


For Apache::

  <If "-R '192.168.1.0/24' || -R '127.0.0.1'">
    RequestHeader set X-Pywb-ACL-User staff
  </If>
  # ensure header is cleared if no match
  <Else>
    RequestHeader set X-Pywb-ACL-User ""
  </Else>

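
On the Python side, a request header named ``X-Pywb-ACL-User`` shows up in the WSGI environ under the key ``HTTP_X_PYWB_ACL_USER``, which is the same key the Nginx example sets through ``uwsgi_param``. The sketch below shows that mapping. The conversion expression mirrors the one this commit adds in ``pywb/rewrite/rewriteinputreq.py``, while the helper names and the treatment of an empty value as "no user" are assumptions of this example.

```python
ACL_ENV_KEY = "HTTP_X_PYWB_ACL_USER"


def environ_key_to_header(name):
    """Drop the 'HTTP_' prefix, title-case each word, swap '_' for '-'."""
    return name[5:].title().replace("_", "-")


def acl_user_from_environ(environ):
    """Return the ACL user from a WSGI environ, or None if unset or empty."""
    return environ.get(ACL_ENV_KEY) or None
```

Note that the converted name comes out as ``X-Pywb-Acl-User``; HTTP header names are case-insensitive, so this still refers to the same header.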




Running on Subdirectory Path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2 changes: 2 additions & 0 deletions pywb/apps/rewriterapp.py
@@ -704,6 +704,8 @@ def _do_req(self, inputreq, wb_url, kwargs, skip_record):
        headers = {'Content-Length': str(len(req_data)),
                   'Content-Type': 'application/request'}

        headers.update(inputreq.warcserver_headers)

        if skip_record:
            headers['Recorder-Skip'] = '1'

23 changes: 14 additions & 9 deletions pywb/manager/aclmanager.py
@@ -12,7 +12,7 @@
class ACLManager(CollectionsManager):
    SURT_RX = re.compile('([^:.]+[,)])+')

    VALID_ACCESS = ('allow', 'block', 'exclude')
    VALID_ACCESS = ('allow', 'block', 'exclude', 'allow_ignore_embargo')

    DEFAULT_FILE = 'access-rules.aclj'

@@ -167,9 +167,9 @@ def add_rule(self, r):
        :param argparse.Namespace r: The argparse namespace representing the rule to be added
        :rtype: None
        """
        return self._add_rule(r.url, r.access, r.exact_match)
        return self._add_rule(r.url, r.access, r.exact_match, r.user)

    def _add_rule(self, url, access, exact_match=False):
    def _add_rule(self, url, access, exact_match=False, user=None):
        """Adds a rule to the acl file
        :param str url: The URL for the rule
@@ -185,12 +185,14 @@ def _add_rule(self, url, access, exact_match=False):
        acl['timestamp'] = '-'
        acl['access'] = access
        acl['url'] = url
        if user:
            acl['user'] = user

        i = 0
        replace = False

        for rule in self.rules:
            if acl['urlkey'] == rule['urlkey'] and acl['timestamp'] == rule['timestamp']:
            if acl['urlkey'] == rule['urlkey'] and acl['timestamp'] == rule['timestamp'] and acl.get('user') == rule.get('user'):
                replace = True
                break

@@ -255,7 +257,7 @@ def remove_rule(self, r):
        i = 0
        urlkey = self.to_key(r.url, r.exact_match)
        for rule in self.rules:
            if urlkey == rule['urlkey']:
            if urlkey == rule['urlkey'] and r.user == rule.get('user'):
                acl = self.rules.pop(i)
                print('Removed Rule:')
                self.print_rule(acl)
@@ -285,7 +287,7 @@ def find_match(self, r):
        :rtype: None
        """
        access_checker = AccessChecker(self.acl_file, '<default>')
        rule = access_checker.find_access_rule(r.url)
        rule = access_checker.find_access_rule(r.url, acl_user=r.user)

        print('Matched rule:')
        print('')
@@ -344,15 +346,18 @@ def command(name, *args, **kwargs):
                else:
                    op.add_argument(arg)

            if kwargs.get('user_opt'):
                op.add_argument('-u', '--user')

            if kwargs.get('exact_opt'):
                op.add_argument('-e', '--exact-match', action='store_true', default=False)

            op.set_defaults(acl_func=kwargs['func'])

        command('add', 'coll_name', 'url', 'access', func=cls.add_rule, exact_opt=True)
        command('remove', 'coll_name', 'url', func=cls.remove_rule, exact_opt=True)
        command('add', 'coll_name', 'url', 'access', func=cls.add_rule, exact_opt=True, user_opt=True)
        command('remove', 'coll_name', 'url', func=cls.remove_rule, exact_opt=True, user_opt=True)
        command('list', 'coll_name', func=cls.list_rules)
        command('validate', 'coll_name', func=cls.validate_save)
        command('match', 'coll_name', 'url', 'default_access', func=cls.find_match)
        command('match', 'coll_name', 'url', 'default_access', func=cls.find_match, user_opt=True)
        command('importtxt', 'coll_name', 'filename', 'access', func=cls.add_excludes)

6 changes: 6 additions & 0 deletions pywb/rewrite/rewriteinputreq.py
@@ -26,6 +26,7 @@ def __init__(self, env, urlkey, url, rewriter):
        self.url = url
        self.rewriter = rewriter
        self.extra_cookie = None
        self.warcserver_headers = {}

        is_proxy = ('wsgiprox.proxy_host' in env)

@@ -82,6 +83,11 @@ def get_req_headers(self):
            elif name in ('HTTP_IF_MODIFIED_SINCE', 'HTTP_IF_UNMODIFIED_SINCE'):
                continue

            elif name == 'HTTP_X_PYWB_ACL_USER':
                name = name[5:].title().replace('_', '-')
                self.warcserver_headers[name] = value
                continue

            elif name == 'HTTP_X_FORWARDED_PROTO':
                name = 'X-Forwarded-Proto'
                if self.splits:
2 changes: 1 addition & 1 deletion pywb/static/wombat.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion pywb/version.py
@@ -1,4 +1,4 @@
__version__ = '2.6.0.dev0'
__version__ = '2.6.0b0'

if __name__ == '__main__':
    print(__version__)