Sanitize project name #421
Conversation
using mock Twisted Request
don't allow unsafe characters in project name
scrapyd/webservice.py (outdated)

    project_name = project_name.decode()
    allowed_name = re.sub('[^A-Za-z0-9.]+', '-', project_name)
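For context, here is a quick illustration of what that substitution does to a few example names (a sketch; the wrapper function is just for demonstration):

```python
import re

def sanitize(project_name: str) -> str:
    # Collapse every run of characters outside A-Z, a-z, 0-9 and '.' into a single '-'
    return re.sub('[^A-Za-z0-9.]+', '-', project_name)

print(sanitize("myproject"))   # myproject   (unchanged)
print(sanitize("my_project"))  # my-project  (underscore rewritten -> breaks existing names)
print(sanitize("/etc/hosts"))  # -etc-hosts  (path separators neutralized)
```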
As @Digenis wrote in #117, this will break backwards compatibility. If I have an existing project name that contains @ or _ or - or any number of other legal, secure characters, then I can no longer interact with that project using Scrapyd's API. We can maybe make it "opt-in" behavior, and change the default configuration file to opt in to this behavior. That way, old deployments that don't have the setting in the conf file will not perform this sanitization, but new deployments will use this behavior.
> we can maybe make it "opt-in" behavior, and change the default configuration file to opt in to this behavior.

The goal of this is to prevent a project from doing something like POSTing a file to /etc/hosts or, worse, calling deleteversion on /etc/hosts. We should not make this opt-in; it should be built in. There is always an option to override Scrapyd resources: by adding your own handler in the Scrapyd configuration you can, for example, provide your own addversion.json.

> If I have an existing project name that contains @ or _ or - or any number of other legal, secure characters, then I can no longer interact with that project using Scrapyd's API.

Yeah, you're right. After thinking about this, maybe the approach in this commit is too aggressive and can break things for some valid use cases. Maybe we could release a milder version in 1.3 and add a deprecation warning saying that in the next version, say 2.0, project names will not accept some characters. The question now is which characters to allow in the next release; I would say probably all characters that are OK in the setuptools.setup(name=) function? So, for instance, @ won't be allowed, but space should be allowed?
To sum up, I'll change this to disallow only insecure characters and allow some legal, secure characters. Let me know what you think.
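To make the risk concrete, here is a minimal sketch (not Scrapyd's actual storage code; the eggs directory layout is assumed) of how an unsanitized project name can point outside the intended directory when it is joined naively into a filesystem path:

```python
import os

EGGS_DIR = "eggs"  # assumed storage root, for illustration only

def egg_path(project: str, version: str) -> str:
    # Naive join: a crafted project name controls where the egg is written or deleted
    return os.path.join(EGGS_DIR, project, f"{version}.egg")

print(egg_path("myproject", "1.0"))         # eggs/myproject/1.0.egg
print(egg_path("../../etc/cron.d", "1.0"))  # eggs/../../etc/cron.d/1.0.egg (escapes the eggs directory)
print(egg_path("/etc/hosts", "1.0"))        # /etc/hosts/1.0.egg (absolute component discards the prefix)
```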
Or maybe let's do it like this: add project name sanitization in eggstorage in this PR, fixing the project filename there the same way we fix the version, and in the web service just log a deprecation warning about invalid names.
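A rough sketch of what sanitizing at the storage layer could look like. This is a simplified stand-in rather than Scrapyd's actual egg storage class, and the helper name sanitize_name is made up for the example:

```python
import os
import re

def sanitize_name(name: str) -> str:
    # Same substitution as in the reviewed code: collapse unsafe runs into '-'
    return re.sub('[^A-Za-z0-9.]+', '-', name)

class SafeEggStorage:  # simplified stand-in, not scrapyd.eggstorage.FilesystemEggStorage
    def __init__(self, basedir: str = "eggs"):
        self.basedir = basedir

    def _eggpath(self, project: str, version: str) -> str:
        # Sanitize both components before they ever reach the filesystem
        return os.path.join(self.basedir, sanitize_name(project), f"{sanitize_name(version)}.egg")

    def put(self, eggfile, project: str, version: str) -> None:
        eggpath = self._eggpath(project, version)
        os.makedirs(os.path.dirname(eggpath), exist_ok=True)
        with open(eggpath, "wb") as f:
            f.write(eggfile.read())
```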
My thinking was to make it opt-in for existing projects in 1.x, opt-out (i.e. built-in) for new projects in 1.x, and opt-out for all projects in 2.x. This could be done by having a configuration setting whose default value is False in 1.x, but True in 2.x. The default template for the configuration file would set the value to True, such that existing projects (whose configuration file doesn't set a value) would use the default value of False, but new projects would have a configuration file that sets it to True. In short, this progressively changes the default.
That said, if we can relax the validation to avoid breaking changes (your first suggestion), or if we can do something else that progressively changes the behavior (if I understand your second suggestion), that's fine too.
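A minimal sketch of that progressive default, assuming a hypothetical boolean setting named sanitize_project_names in scrapyd.conf and using the standard library configparser (Scrapyd's own Config wrapper may expose this differently):

```python
from configparser import ConfigParser

SANITIZE_DEFAULT = False  # 1.x default; would flip to True in 2.x

def sanitize_enabled(conf_path: str = "scrapyd.conf") -> bool:
    config = ConfigParser()
    config.read(conf_path)
    # Existing deployments without the key keep the old behavior via the fallback;
    # the default configuration file *template* for new projects would set it to True.
    return config.getboolean("scrapyd", "sanitize_project_names", fallback=SANITIZE_DEFAULT)
```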
> That said, if we can relax the validation to avoid breaking changes (your first suggestion), or if we can do something else that progressively changes the behavior (if I understand your second suggestion), that's fine too.

I would prefer progressive enhancement, but there is no way to completely avoid breaking this for some people. Storing projects with unusual characters is always asking for trouble; people who have a project like "some/project" will have problems, but anyone actually doing that is doing it at their own risk.
I think we can trust users to have some knowledge and experience. Someone who can install and update Scrapyd should also be able to read the documentation, revert to the previous version, remove eggs with bad filenames, and deploy new ones using the API.
We will also add a warning about this in the docs.
> That said, if we can relax the validation to avoid breaking changes (your first suggestion)

I relaxed the validation in 49ab5a7 (added space, _, - and " to the allowed characters), but it does not completely avoid breaking changes. I still think there is no way to avoid that entirely. Nobody in their right mind should use "@" or other strange characters in a project name; someone who names a project "my project@" should really be prepared for trouble from the beginning.

> My thinking was to make it opt-in for existing projects in 1.x, opt-out (i.e. built-in) for new projects in 1.x, and opt-out for all projects in 2.x. This could be done by having a configuration setting whose default value is False in 1.x, but True in 2.x.

I thought about it again. It is not a bad idea, but it would add complexity to the project and increase development time and future maintenance. Here is another thought: how about we just launch 2.0 in 2022 and announce this and the other changes? 2.0 gives some room for breaking changes, and people will tolerate them. It won't be a big 2.0, I guess, because nothing big will change, but it would justify making a backwards-incompatible change here.
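A sketch of the relaxed character set described above (the exact set is an assumption based on this comment: letters, digits, '.', '_', '-', '"' and space); rejecting invalid names outright, rather than rewriting them, is shown here only as one possible approach:

```python
import re

# Letters, digits, '.', '_', '-', '"' and space; anything else is refused.
ALLOWED_NAME = re.compile(r'^[A-Za-z0-9._" -]+$')

def is_valid_project_name(name: str) -> bool:
    return bool(ALLOWED_NAME.match(name))

print(is_valid_project_name("my_project-1.0"))  # True
print(is_valid_project_name("my project"))      # True  (space allowed)
print(is_valid_project_name("/etc/hosts"))      # False (path separator rejected)
print(is_valid_project_name("project@"))        # False ('@' rejected)
```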
2.0 instead of 1.3 is fine with me! I'm not too precious about the size of releases :)
Force-pushed from efb6af1 to 49ab5a7.
Codecov Report

@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
+ Coverage   79.63%   79.88%   +0.24%
==========================================
  Files          23       23
  Lines        1051     1069      +18
==========================================
+ Hits          837      854      +17
- Misses        214      215       +1
Tests unrelated to the project name problem were merged to master in 2eb5409; this pull request will only contain the project name sanitization, to be merged later.
…ct is not required for listjobs.json.
Force-pushed from 2f3e529 to 8431cd0.
Approved to squash and merge when working on 2.0.
I fixed the issue in eggstorage instead of in the API, as described in #117. I prefer it, since the eggstorage is the source of the vulnerability. This does mean that a well-intentioned user can pick project, spider, version and job IDs that will cause errors, but I think this has never happened. Changing status codes is backwards-incompatible: Scrapyd (for better or worse) adopts the API style of always returning 200 OK and putting error messages in JSON responses. If we want to change status codes in the future, the patch is:

diff --git a/integration_tests/test_webservice.py b/integration_tests/test_webservice.py
index e3e45d5..cf5c799 100644
--- a/integration_tests/test_webservice.py
+++ b/integration_tests/test_webservice.py
@@ -148,6 +148,7 @@ def test_cancel_nonexistent_project():
"/cancel.json",
{"status": "error", "message": "project 'nonexistent' not found"},
data={"project": "nonexistent", "job": "nojob"},
+ status=404,
)
diff --git a/scrapyd/webservice.py b/scrapyd/webservice.py
index 3c449d0..d7ac936 100644
--- a/scrapyd/webservice.py
+++ b/scrapyd/webservice.py
@@ -33,7 +33,7 @@ def param(
def wrapper(self, txrequest, *args, **kwargs):
if encoded not in txrequest.args:
if required:
- raise error.Error(code=http.OK, message=b"'%b' parameter is required" % encoded)
+ raise error.Error(code=http.BAD_REQUEST, message=b"'%b' parameter is required" % encoded)
value = default
else:
@@ -61,7 +61,9 @@ class WsResource(JsonResource):
try:
return JsonResource.render(self, txrequest).encode('utf-8')
except Exception as e:
- if isinstance(e, error.Error):
+ if isinstance(e, error.UnsupportedMethod):
+ txrequest.setResponseCode(http.NOT_ALLOWED)
+ elif isinstance(e, error.Error):
txrequest.setResponseCode(int(e.status))
if self.root.debug:
return traceback.format_exc().encode('utf-8')
@@ -110,7 +112,7 @@ class Schedule(WsResource):
spider_arguments = {k: v[0] for k, v in native_stringify_dict(copy(txrequest.args), keys_only=False).items()}
spiders = get_spider_list(project, version=version)
if spider not in spiders:
- raise error.Error(code=http.OK, message=b"spider '%b' not found" % spider.encode())
+ raise error.Error(code=http.NOT_FOUND, message=b"spider '%b' not found" % spider.encode())
self.root.scheduler.schedule(
project,
spider,
@@ -132,7 +134,7 @@ class Cancel(WsResource):
@param('signal', required=False, default='INT' if sys.platform != 'win32' else 'BREAK')
def render_POST(self, txrequest, project, job, signal):
if project not in self.root.poller.queues:
- raise error.Error(code=http.OK, message=b"project '%b' not found" % project.encode())
+ raise error.Error(code=http.NOT_FOUND, message=b"project '%b' not found" % project.encode())
prevstate = None
diff --git a/tests/test_endpoints.py b/tests/test_endpoints.py
index 40df630..80aeb78 100644
--- a/tests/test_endpoints.py
+++ b/tests/test_endpoints.py
@@ -62,14 +62,14 @@ class TestEndpoint:
def test_launch_spider_get(self, mock_scrapyd):
resp = requests.get(mock_scrapyd.urljoin("schedule.json"))
- assert resp.status_code == 200
+ assert resp.status_code == 405
assert resp.json()['status'] == 'error'
def test_spider_list_no_project(self, mock_scrapyd):
resp = requests.get(mock_scrapyd.urljoin("listspiders.json"))
data = resp.json()
- assert resp.status_code == 200
+ assert resp.status_code == 400
assert data['status'] == 'error'
assert data['message'] == "'project' parameter is required"
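For reference, under the current always-200 convention a client has to inspect the JSON body rather than the HTTP status to detect failures. A small illustration with requests against a local Scrapyd instance (the URL assumes the default port):

```python
import requests

response = requests.post(
    "http://localhost:6800/cancel.json",
    data={"project": "nonexistent", "job": "nojob"},
)

data = response.json()
assert response.status_code == 200        # errors still come back as 200 OK today
if data["status"] == "error":
    print("API error:", data["message"])  # e.g. "project 'nonexistent' not found"
```

With the patch above applied, the same request would instead return 404, and the status-code check would change accordingly.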
fixes #117
This fixes the important security problem #117 and also adds unit tests for the webservice, increasing coverage to 80%.
There is some discussion in the issue about allowing certain characters, but note that Python won't let you build an egg with illegal characters: if you name your project /etc/hosts and try to run python setup.py bdist_egg, you'll get an error. In my opinion this means there is no valid use case in which someone sets a project name that does not pass our validation, which leaves only invalid cases, i.e. user errors or malicious attempts; for both we should reply with HTTP status 400 Bad Request. The regex here, https://github.com/scrapy/scrapyd/pull/421/files#diff-60f47eb24e330d5afdb3fd80dcee49bc31b957551483a97fa8edd59c222d59cbR27, is the same as the code in setuptools that checks for a valid package name: https://github.com/pypa/setuptools/blob/d508e3841d3a40aeb602b0a9a883b227a27a3842/setuptools/_distutils/command/install_egg_info.py#L61
It would probably also be good to make sure EggStorage is not storing invalid names, because resources are configurable: a user can override our resource and plug in their own resource which is not safe. But I would prefer to do that in another pull request.
Another thing: the version is already handled on the eggstorage side, but we could also return a 400 HTTP code on invalid versions. That would also be nicer to do in another PR, purely from a time-management point of view: separate issues and separate PRs for separate things.
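As a sketch of that follow-up, reusing the error.Error pattern from the patch above (the allowed character set for versions and the helper name are assumptions):

```python
import re

from twisted.web import error, http

VERSION_RE = re.compile(r'^[A-Za-z0-9._-]+$')  # assumed allowed set for versions

def validate_version(version: str) -> str:
    # Reply with 400 Bad Request for user errors and malicious input alike
    if not VERSION_RE.match(version):
        raise error.Error(
            code=http.BAD_REQUEST,
            message=b"version '%b' contains invalid characters" % version.encode(),
        )
    return version
```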