Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize project name #421

Closed
wants to merge 30 commits into from
Closed

Sanitize project name #421

wants to merge 30 commits into from

Conversation

pawelmhm
Copy link
Contributor

@pawelmhm pawelmhm commented Nov 29, 2021

fixes #117

This fixes important security problem #117 and also adds unit tests for webservice, increases coverage to 80%.

There is some discussion there in issue about allowing some characters, but notice that python won't allow you to build egg with illegal characters if you call your project name /etc/hosts and try to run python setup.py bdist_egg you'll get an error. This means IMO that there is no valid use case when someone sets project name to something that does not pass our validation. Which leaves us with only invalid cases, e.g. user errors or malicious attempts, for both we should reply with HTTP status 400 Bad Request. Regex here: https://github.com/scrapy/scrapyd/pull/421/files#diff-60f47eb24e330d5afdb3fd80dcee49bc31b957551483a97fa8edd59c222d59cbR27 is same as code in setuptools checking for valid package name https://github.com/pypa/setuptools/blob/d508e3841d3a40aeb602b0a9a883b227a27a3842/setuptools/_distutils/command/install_egg_info.py#L61

Probably it would be also good to make sure EggStorage is not storing invalid names. That's because resources can be configurable, so user can override our resource and put his own resource which is not safe. but I would prefer to do it on other pull request.

Another thing: version is already handled on eggstorage side, but we could also raise invalid request 400 HTTP code on invalid versions, but also would be nice to do on other PR, just from point of view of time management, to have separate issues, separate PR-s for separate things.


project_name = project_name.decode()

allowed_name = re.sub('[^A-Za-z0-9.]+', '-', project_name)
Copy link
Contributor

@jpmckinney jpmckinney Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @Digenis wrote in #117, this will break backwards compatibility. If I have an existing project name that contains @ or _ or - or any number of other legal, secure characters, then I can no longer interact with that project using Scrapyd's API. We can maybe make it "opt-in" behavior, and change the default configuration file to opt-in to this behavior. That way, old deployments that don't have the setting in the conf file will not perform this sanitization - but new deployments will use this behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can maybe make it "opt-in" behavior, and change the default configuration file to opt-in to this behavior.

the goal of this is to secure project from doing something like POST file to /etc/hosts or worse deleteversion /etc/hosts. We should not make this opt-in, should be built in. There is always an option to override scrapyd resources, and adding your handler in scrapyd configuration, you can for example add your own addversion.json.

If I have an existing project name that contains @ or _ or - or any number of other legal, secure characters, then I can no longer interact with that project using Scrapyd's API.

yeah you're right, after thinking about this, maybe the approach in this commit is too aggressive, and can break things for some valid use cases. Maybe we could release a milder version in 1.3 and add some deprecation warning saying that in next version, say 2.0. Project names will not accept some characters? Now question which characters to allow in next release, I would say probably all characters that are ok in setuptools.setup(name=) function? So for instance @ won't be allowed, but space should be allowed?

To sum up I'll change this to disallow only insecure characters, allow some legal secure characters. Let me know what you think.

Copy link
Contributor Author

@pawelmhm pawelmhm Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe let's do it like this: let's add project name sanitization in eggstorage in this PR. Fix project filename there, like we fix version. And in web service just print some deprecation warning in logs about invalid names

Copy link
Contributor

@jpmckinney jpmckinney Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was to make it opt-in for existing projects in 1.x, opt-out (i.e. built-in) for new projects in 1.x, and opt-out for all projects in 2.x. This could be done by having a configuration setting whose default value is False in 1.x, but True in 2.x. The default template for the configuration file would set the value to True, such that existing projects (whose configuration file doesn't set a value) would use the default value of False, but new projects would have a configuration file that sets it to True. In short, this progressively changes the default.

That said, if we can relax the validation to avoid breaking changes (your first suggestion), or if we can do something else that progressively changes the behavior (if I understand your second suggestion), that's fine too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, if we can relax the validation to avoid breaking changes (your first suggestion), or if we can do something else that progressively changes the behavior (if I understand your second suggestion), that's fine too.

I would prefer progressive enhancement, but there is no way to completely avoid breaking it for some people here. If someone was storing projects with unusual characters, it is always begging for trouble. People who have projects "some/project" will have problem. But if anyone is actually doing something like this, it's on their own risk.

I think we can trust users to have some knowledge and experience. If someone can install and update ScrapyD he should also be able to read documentation, revert to previous version, remove eggs with bad filenames and deploy new ones using API.

We will also give warning in docs about it.

Copy link
Contributor Author

@pawelmhm pawelmhm Dec 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, if we can relax the validation to avoid breaking changes (your first suggestion)

I relaxed validation here 49ab5a7 (added space, _, -, " to allowed characters) but it is not completely avoiding breaking changes. Still, I think there is no way to avoid it. But no one with their right mind should use "@" or some other strange characters in string for project name. If someone uses project name "my project@" he should really be prepared for trouble from the beginning.

My thinking was to make it opt-in for existing projects in 1.x, opt-out (i.e. built-in) for new projects in 1.x, and opt-out for all projects in 2.x. This could be done by having a configuration setting whose default value is False in 1.x, but True in 2.x.

I thought about it again, it is not a bad idea, but it would add complexity to the project and increase development time and future maintenance time. But I thought about something else, how about we just launch 2.0 in 2022 and just announce this and other changes? 2.0 gives some room for breaking changes and people will tolerate them. It won't be a big 2.0 I guess, because nothing big will change, but it would justify making backward incompatible change here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.0 instead of 1.3 is fine with me! I'm not too precious about the size of releases :)

@codecov
Copy link

codecov bot commented Dec 3, 2021

Codecov Report

Merging #421 (5980853) into master (2eb5409) will increase coverage by 0.24%.
The diff coverage is 87.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
+ Coverage   79.63%   79.88%   +0.24%     
==========================================
  Files          23       23              
  Lines        1051     1069      +18     
==========================================
+ Hits          837      854      +17     
- Misses        214      215       +1     
Flag Coverage Δ
unittests 79.88% <87.50%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
scrapyd/webservice.py 72.18% <86.27%> (+2.43%) ⬆️
scrapyd/app.py 91.37% <100.00%> (ø)
scrapyd/utils.py 88.61% <100.00%> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2eb5409...5980853. Read the comment docs.

scrapyd/utils.py Outdated Show resolved Hide resolved
@pawelmhm
Copy link
Contributor Author

pawelmhm commented Dec 14, 2021

Tests unrelated to project name problem merged to master here: 2eb5409 this Pull Request will only contain sanitization of project name, to be merged later.

@jpmckinney jpmckinney added this to the 2.0 milestone Feb 2, 2023
Copy link
Contributor

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved to squash and merge when working on 2.0.

@jpmckinney jpmckinney mentioned this pull request Jul 17, 2024
4 tasks
@jpmckinney
Copy link
Contributor

I instead fixed the issue in eggstorage instead of in the API, as described in #117.

I prefer it, since the eggstorage is the source of the vulnerability. This does mean that a well-intentioned user can pick project, spider, version and job IDs that will cause errors – but I think this has never happened.

Changing status codes is backwards-incompatible. Scrapyd (for better or worse) adopts the API style of always returning 200 OK and putting error messages in JSON responses.

If we want to change status codes in future, the patch is:

diff --git a/integration_tests/test_webservice.py b/integration_tests/test_webservice.py
index e3e45d5..cf5c799 100644
--- a/integration_tests/test_webservice.py
+++ b/integration_tests/test_webservice.py
@@ -148,6 +148,7 @@ def test_cancel_nonexistent_project():
         "/cancel.json",
         {"status": "error", "message": "project 'nonexistent' not found"},
         data={"project": "nonexistent", "job": "nojob"},
+        status=404,
     )
 
 
diff --git a/scrapyd/webservice.py b/scrapyd/webservice.py
index 3c449d0..d7ac936 100644
--- a/scrapyd/webservice.py
+++ b/scrapyd/webservice.py
@@ -33,7 +33,7 @@ def param(
         def wrapper(self, txrequest, *args, **kwargs):
             if encoded not in txrequest.args:
                 if required:
-                    raise error.Error(code=http.OK, message=b"'%b' parameter is required" % encoded)
+                    raise error.Error(code=http.BAD_REQUEST, message=b"'%b' parameter is required" % encoded)
 
                 value = default
             else:
@@ -61,7 +61,9 @@ class WsResource(JsonResource):
         try:
             return JsonResource.render(self, txrequest).encode('utf-8')
         except Exception as e:
-            if isinstance(e, error.Error):
+            if isinstance(e, error.UnsupportedMethod):
+                txrequest.setResponseCode(http.NOT_ALLOWED)
+            elif isinstance(e, error.Error):
                 txrequest.setResponseCode(int(e.status))
             if self.root.debug:
                 return traceback.format_exc().encode('utf-8')
@@ -110,7 +112,7 @@ class Schedule(WsResource):
         spider_arguments = {k: v[0] for k, v in native_stringify_dict(copy(txrequest.args), keys_only=False).items()}
         spiders = get_spider_list(project, version=version)
         if spider not in spiders:
-            raise error.Error(code=http.OK, message=b"spider '%b' not found" % spider.encode())
+            raise error.Error(code=http.NOT_FOUND, message=b"spider '%b' not found" % spider.encode())
         self.root.scheduler.schedule(
             project,
             spider,
@@ -132,7 +134,7 @@ class Cancel(WsResource):
     @param('signal', required=False, default='INT' if sys.platform != 'win32' else 'BREAK')
     def render_POST(self, txrequest, project, job, signal):
         if project not in self.root.poller.queues:
-            raise error.Error(code=http.OK, message=b"project '%b' not found" % project.encode())
+            raise error.Error(code=http.NOT_FOUND, message=b"project '%b' not found" % project.encode())
 
         prevstate = None
 
diff --git a/tests/test_endpoints.py b/tests/test_endpoints.py
index 40df630..80aeb78 100644
--- a/tests/test_endpoints.py
+++ b/tests/test_endpoints.py
@@ -62,14 +62,14 @@ class TestEndpoint:
     def test_launch_spider_get(self, mock_scrapyd):
         resp = requests.get(mock_scrapyd.urljoin("schedule.json"))
 
-        assert resp.status_code == 200
+        assert resp.status_code == 405
         assert resp.json()['status'] == 'error'
 
     def test_spider_list_no_project(self, mock_scrapyd):
         resp = requests.get(mock_scrapyd.urljoin("listspiders.json"))
         data = resp.json()
 
-        assert resp.status_code == 200
+        assert resp.status_code == 400
         assert data['status'] == 'error'
         assert data['message'] == "'project' parameter is required"
 

@jpmckinney jpmckinney closed this Jul 18, 2024
@jpmckinney jpmckinney deleted the sanitize-project branch July 18, 2024 15:07
@jpmckinney jpmckinney added the pr: abandoned for unmerged PRs that were abandoned label Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr: abandoned for unmerged PRs that were abandoned
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Project name sanitization
2 participants