Sanitize project name #421

pawelmhm · 2021-11-29T11:38:02Z

fixes #117

This fixes important security problem #117 and also adds unit tests for webservice, increases coverage to 80%.

There is some discussion there in issue about allowing some characters, but notice that python won't allow you to build egg with illegal characters if you call your project name /etc/hosts and try to run python setup.py bdist_egg you'll get an error. This means IMO that there is no valid use case when someone sets project name to something that does not pass our validation. Which leaves us with only invalid cases, e.g. user errors or malicious attempts, for both we should reply with HTTP status 400 Bad Request. Regex here: https://github.com/scrapy/scrapyd/pull/421/files#diff-60f47eb24e330d5afdb3fd80dcee49bc31b957551483a97fa8edd59c222d59cbR27 is same as code in setuptools checking for valid package name https://github.com/pypa/setuptools/blob/d508e3841d3a40aeb602b0a9a883b227a27a3842/setuptools/_distutils/command/install_egg_info.py#L61

Probably it would be also good to make sure EggStorage is not storing invalid names. That's because resources can be configurable, so user can override our resource and put his own resource which is not safe. but I would prefer to do it on other pull request.

Another thing: version is already handled on eggstorage side, but we could also raise invalid request 400 HTTP code on invalid versions, but also would be nice to do on other PR, just from point of view of time management, to have separate issues, separate PR-s for separate things.

using mock Twisted Request

don't allow unsafe characters in project name

jpmckinney · 2021-11-29T15:41:53Z

scrapyd/webservice.py

+
+        project_name = project_name.decode()
+
+        allowed_name = re.sub('[^A-Za-z0-9.]+', '-', project_name)


As @Digenis wrote in #117, this will break backwards compatibility. If I have an existing project name that contains @ or _ or - or any number of other legal, secure characters, then I can no longer interact with that project using Scrapyd's API. We can maybe make it "opt-in" behavior, and change the default configuration file to opt-in to this behavior. That way, old deployments that don't have the setting in the conf file will not perform this sanitization - but new deployments will use this behavior.

we can maybe make it "opt-in" behavior, and change the default configuration file to opt-in to this behavior.

the goal of this is to secure project from doing something like POST file to /etc/hosts or worse deleteversion /etc/hosts. We should not make this opt-in, should be built in. There is always an option to override scrapyd resources, and adding your handler in scrapyd configuration, you can for example add your own addversion.json.

If I have an existing project name that contains @ or _ or - or any number of other legal, secure characters, then I can no longer interact with that project using Scrapyd's API.

yeah you're right, after thinking about this, maybe the approach in this commit is too aggressive, and can break things for some valid use cases. Maybe we could release a milder version in 1.3 and add some deprecation warning saying that in next version, say 2.0. Project names will not accept some characters? Now question which characters to allow in next release, I would say probably all characters that are ok in setuptools.setup(name=) function? So for instance @ won't be allowed, but space should be allowed?

To sum up I'll change this to disallow only insecure characters, allow some legal secure characters. Let me know what you think.

Or maybe let's do it like this: let's add project name sanitization in eggstorage in this PR. Fix project filename there, like we fix version. And in web service just print some deprecation warning in logs about invalid names

My thinking was to make it opt-in for existing projects in 1.x, opt-out (i.e. built-in) for new projects in 1.x, and opt-out for all projects in 2.x. This could be done by having a configuration setting whose default value is False in 1.x, but True in 2.x. The default template for the configuration file would set the value to True, such that existing projects (whose configuration file doesn't set a value) would use the default value of False, but new projects would have a configuration file that sets it to True. In short, this progressively changes the default.

That said, if we can relax the validation to avoid breaking changes (your first suggestion), or if we can do something else that progressively changes the behavior (if I understand your second suggestion), that's fine too.

That said, if we can relax the validation to avoid breaking changes (your first suggestion), or if we can do something else that progressively changes the behavior (if I understand your second suggestion), that's fine too.

I would prefer progressive enhancement, but there is no way to completely avoid breaking it for some people here. If someone was storing projects with unusual characters, it is always begging for trouble. People who have projects "some/project" will have problem. But if anyone is actually doing something like this, it's on their own risk.

I think we can trust users to have some knowledge and experience. If someone can install and update ScrapyD he should also be able to read documentation, revert to previous version, remove eggs with bad filenames and deploy new ones using API.

We will also give warning in docs about it.

That said, if we can relax the validation to avoid breaking changes (your first suggestion)

I relaxed validation here 49ab5a7 (added space, _, -, " to allowed characters) but it is not completely avoiding breaking changes. Still, I think there is no way to avoid it. But no one with their right mind should use "@" or some other strange characters in string for project name. If someone uses project name "my project@" he should really be prepared for trouble from the beginning.

My thinking was to make it opt-in for existing projects in 1.x, opt-out (i.e. built-in) for new projects in 1.x, and opt-out for all projects in 2.x. This could be done by having a configuration setting whose default value is False in 1.x, but True in 2.x.

I thought about it again, it is not a bad idea, but it would add complexity to the project and increase development time and future maintenance time. But I thought about something else, how about we just launch 2.0 in 2022 and just announce this and other changes? 2.0 gives some room for breaking changes and people will tolerate them. It won't be a big 2.0 I guess, because nothing big will change, but it would justify making backward incompatible change here.

2.0 instead of 1.3 is fine with me! I'm not too precious about the size of releases :)

codecov · 2021-12-03T09:27:29Z

Codecov Report

Merging #421 (5980853) into master (2eb5409) will increase coverage by 0.24%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
+ Coverage   79.63%   79.88%   +0.24%     
==========================================
  Files          23       23              
  Lines        1051     1069      +18     
==========================================
+ Hits          837      854      +17     
- Misses        214      215       +1

Flag	Coverage Δ
unittests	`79.88% <87.50%> (+0.24%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
scrapyd/webservice.py	`72.18% <86.27%> (+2.43%)`	⬆️
scrapyd/app.py	`91.37% <100.00%> (ø)`
scrapyd/utils.py	`88.61% <100.00%> (+0.38%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2eb5409...5980853. Read the comment docs.

scrapyd/utils.py

pawelmhm · 2021-12-14T17:11:40Z

Tests unrelated to project name problem merged to master here: 2eb5409 this Pull Request will only contain sanitization of project name, to be merged later.

…ct is not required for listjobs.json.

…message)

jpmckinney

Approved to squash and merge when working on 2.0.

jpmckinney · 2024-07-18T15:07:07Z

I instead fixed the issue in eggstorage instead of in the API, as described in #117.

I prefer it, since the eggstorage is the source of the vulnerability. This does mean that a well-intentioned user can pick project, spider, version and job IDs that will cause errors – but I think this has never happened.

Changing status codes is backwards-incompatible. Scrapyd (for better or worse) adopts the API style of always returning 200 OK and putting error messages in JSON responses.

If we want to change status codes in future, the patch is:

diff --git a/integration_tests/test_webservice.py b/integration_tests/test_webservice.py
index e3e45d5..cf5c799 100644
--- a/integration_tests/test_webservice.py
+++ b/integration_tests/test_webservice.py
@@ -148,6 +148,7 @@ def test_cancel_nonexistent_project():
         "/cancel.json",
         {"status": "error", "message": "project 'nonexistent' not found"},
         data={"project": "nonexistent", "job": "nojob"},
+        status=404,
     )
 
 
diff --git a/scrapyd/webservice.py b/scrapyd/webservice.py
index 3c449d0..d7ac936 100644
--- a/scrapyd/webservice.py
+++ b/scrapyd/webservice.py
@@ -33,7 +33,7 @@ def param(
         def wrapper(self, txrequest, *args, **kwargs):
             if encoded not in txrequest.args:
                 if required:
-                    raise error.Error(code=http.OK, message=b"'%b' parameter is required" % encoded)
+                    raise error.Error(code=http.BAD_REQUEST, message=b"'%b' parameter is required" % encoded)
 
                 value = default
             else:
@@ -61,7 +61,9 @@ class WsResource(JsonResource):
         try:
             return JsonResource.render(self, txrequest).encode('utf-8')
         except Exception as e:
-            if isinstance(e, error.Error):
+            if isinstance(e, error.UnsupportedMethod):
+                txrequest.setResponseCode(http.NOT_ALLOWED)
+            elif isinstance(e, error.Error):
                 txrequest.setResponseCode(int(e.status))
             if self.root.debug:
                 return traceback.format_exc().encode('utf-8')
@@ -110,7 +112,7 @@ class Schedule(WsResource):
         spider_arguments = {k: v[0] for k, v in native_stringify_dict(copy(txrequest.args), keys_only=False).items()}
         spiders = get_spider_list(project, version=version)
         if spider not in spiders:
-            raise error.Error(code=http.OK, message=b"spider '%b' not found" % spider.encode())
+            raise error.Error(code=http.NOT_FOUND, message=b"spider '%b' not found" % spider.encode())
         self.root.scheduler.schedule(
             project,
             spider,
@@ -132,7 +134,7 @@ class Cancel(WsResource):
     @param('signal', required=False, default='INT' if sys.platform != 'win32' else 'BREAK')
     def render_POST(self, txrequest, project, job, signal):
         if project not in self.root.poller.queues:
-            raise error.Error(code=http.OK, message=b"project '%b' not found" % project.encode())
+            raise error.Error(code=http.NOT_FOUND, message=b"project '%b' not found" % project.encode())
 
         prevstate = None
 
diff --git a/tests/test_endpoints.py b/tests/test_endpoints.py
index 40df630..80aeb78 100644
--- a/tests/test_endpoints.py
+++ b/tests/test_endpoints.py
@@ -62,14 +62,14 @@ class TestEndpoint:
     def test_launch_spider_get(self, mock_scrapyd):
         resp = requests.get(mock_scrapyd.urljoin("schedule.json"))
 
-        assert resp.status_code == 200
+        assert resp.status_code == 405
         assert resp.json()['status'] == 'error'
 
     def test_spider_list_no_project(self, mock_scrapyd):
         resp = requests.get(mock_scrapyd.urljoin("listspiders.json"))
         data = resp.json()
 
-        assert resp.status_code == 200
+        assert resp.status_code == 400
         assert data['status'] == 'error'
         assert data['message'] == "'project' parameter is required"

pawelmhm added 6 commits November 25, 2021 15:33

[tests] unit tests for webservice

546e3dc

using mock Twisted Request

[webservice] safe project name

a13fe9b

[webservice] sanitize project name

8e8e410

don't allow unsafe characters in project name

[webservice] sanitize project name in listproject

bdd1f6a

[webservice] validate project name in schedule.json

0f05b32

[test_webservice] mock list spiders

767fac5

pawelmhm assigned jpmckinney Nov 29, 2021

jpmckinney reviewed Nov 29, 2021

View reviewed changes

webservice: extend allowed character list

49ab5a7

pawelmhm force-pushed the sanitize-project branch from efb6af1 to 49ab5a7 Compare December 3, 2021 09:25

jpmckinney reviewed Dec 3, 2021

View reviewed changes

scrapyd/utils.py Outdated Show resolved Hide resolved

jpmckinney approved these changes Dec 3, 2021

View reviewed changes

pawelmhm mentioned this pull request Dec 10, 2021

[runner.py] don't call get_application #411

Merged

2 tasks

Merge branch 'master' into sanitize-project

5980853

jpmckinney added this to the 2.0 milestone Feb 2, 2023

jpmckinney added 3 commits July 16, 2024 23:34

Merge branch 'master' into sanitize-project

0f45e31

fix(sanitize-project): listprojects.json accepts no parameters. proje…

e86d261

…ct is not required for listjobs.json.

chore: Restore previous style

8431cd0

jpmckinney force-pushed the sanitize-project branch from 2f3e529 to 8431cd0 Compare July 17, 2024 03:47

jpmckinney added 3 commits July 17, 2024 00:53

feat: Clarify error message (no "object" method in UnsupportedMethod …

3720c09

…message)

chore: Refactor project sanitization

a21ce02

feat: Use 405 Method Not Allowed

a89fe27

This was referenced Jul 17, 2024

Check project names in egg storage #516

Closed

Check valid and existing versions in webservices #517

Closed

jpmckinney added 4 commits July 17, 2024 10:35

Merge branch 'master' into sanitize-project

7a5582b

Merge branch 'master' into sanitize-project

7624524

Merge branch 'master' into sanitize-project

fd1c8e7

chore: Refactor webservice.py for easier modification (bis)

ce76bf9

jpmckinney added 4 commits July 17, 2024 17:21

Merge branch 'master' into sanitize-project

67656fe

Merge branch 'master' into sanitize-project

b7dafac

test: Add webservices to project parameter validation test

3f4c1d2

test: Fix test_endpoints expectations

dd2a39c

jpmckinney approved these changes Jul 17, 2024

View reviewed changes

jpmckinney mentioned this pull request Jul 17, 2024

Review old security audit #518

Closed

4 tasks

jpmckinney added 8 commits July 18, 2024 02:59

Merge branch 'master' into sanitize-project

8df8b60

chore: Change DirectoryTraversalError message to not disclose full paths

04bed89

chore: Use generator

7380428

test: Add tests to tests/test_webservice.py for directory traversal

b7d055a

Fix "Merge branch 'master' into sanitize-project"

d8716c7

Fix "Merge branch 'master' into sanitize-project"

287742d

Merge branch 'master' into sanitize-project

69dcf53

Fix "Merge branch 'master' into sanitize-project"

72cdf5b

jpmckinney closed this Jul 18, 2024

jpmckinney deleted the sanitize-project branch July 18, 2024 15:07

jpmckinney added the pr: abandoned for unmerged PRs that were abandoned label Jul 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sanitize project name #421

Sanitize project name #421

pawelmhm commented Nov 29, 2021 •

edited by jpmckinney

Loading

jpmckinney Nov 29, 2021 •

edited

Loading

pawelmhm Nov 30, 2021

pawelmhm Nov 30, 2021 •

edited

Loading

jpmckinney Nov 30, 2021 •

edited

Loading

pawelmhm Dec 1, 2021

pawelmhm Dec 3, 2021 •

edited

Loading

jpmckinney Dec 3, 2021

codecov bot commented Dec 3, 2021 •

edited

Loading

pawelmhm commented Dec 14, 2021 •

edited

Loading

jpmckinney left a comment

jpmckinney commented Jul 18, 2024


		project_name = project_name.decode()

		allowed_name = re.sub('[^A-Za-z0-9.]+', '-', project_name)

Sanitize project name #421

Sanitize project name #421

Conversation

pawelmhm commented Nov 29, 2021 • edited by jpmckinney Loading

jpmckinney Nov 29, 2021 • edited Loading

Choose a reason for hiding this comment

pawelmhm Nov 30, 2021

Choose a reason for hiding this comment

pawelmhm Nov 30, 2021 • edited Loading

Choose a reason for hiding this comment

jpmckinney Nov 30, 2021 • edited Loading

Choose a reason for hiding this comment

pawelmhm Dec 1, 2021

Choose a reason for hiding this comment

pawelmhm Dec 3, 2021 • edited Loading

Choose a reason for hiding this comment

jpmckinney Dec 3, 2021

Choose a reason for hiding this comment

codecov bot commented Dec 3, 2021 • edited Loading

Codecov Report

pawelmhm commented Dec 14, 2021 • edited Loading

jpmckinney left a comment

Choose a reason for hiding this comment

jpmckinney commented Jul 18, 2024

pawelmhm commented Nov 29, 2021 •

edited by jpmckinney

Loading

jpmckinney Nov 29, 2021 •

edited

Loading

pawelmhm Nov 30, 2021 •

edited

Loading

jpmckinney Nov 30, 2021 •

edited

Loading

pawelmhm Dec 3, 2021 •

edited

Loading

codecov bot commented Dec 3, 2021 •

edited

Loading

pawelmhm commented Dec 14, 2021 •

edited

Loading