
Assumptions around Disclosure of Sensitive Information in URIs #142

Closed
kjetilk opened this issue Jan 29, 2020 · 14 comments

Comments

@kjetilk
Member

kjetilk commented Jan 29, 2020

RFC 7231 states that

URIs are intended to be shared, not secured, even when they identify secure resources. URIs are often shown on displays, added to templates when a page is printed, and stored in a variety of unprotected bookmark lists. It is therefore unwise to include information within a URI that is sensitive, personally identifiable, or a risk to disclose.

We need to determine if we should flag a different assumption around this. In particular, a WebID is, in most cases, clearly personally identifiable information (#25), and so care should be taken with WebIDs. Users may be inclined to put sensitive information in URIs, especially if authoring tools do it without them reflecting on it. It is also relevant in the discussion of #116, where containment triples may be visible to agents that have read access to the container, but not to the resources.

OTOH, assuming that URIs can be kept confidential may itself be a flawed assumption: it cannot be done reliably in the general case, and so even if we make that assumption in the specification, it may do users a disservice if it simply cannot be upheld.

@RubenVerborgh
Contributor

RubenVerborgh commented Jan 29, 2020

I think we should be careful here about the intent and interpretation of the above text.

Although it includes the phrase "unwise to include information within a URI that is […] personally identifiable", I think the tacit assumption therein is "…not relevant to the purpose of the URL".

Because, if taken literally, URLs like https://en.wikipedia.org/wiki/Katie_Bouman, https://www.w3.org/People/Berners-Lee/, and https://ruben.verborgh.org/ would all be "unwise".

I think the text rather talks about things like http://domain.example/resource?userid=…

This assumption is substantiated by the text that precedes it:

Clients are often privy to large amounts of personal information,
including both information provided by the user to interact with
resources (e.g., the user's name, location, mail address, passwords,
encryption keys, etc.) and information about the user's browsing
activity over time (e.g., history, bookmarks, etc.). Implementations
need to prevent unintentional disclosure of personal information.

So the personal information pertains to the user interacting with the resource identified by the URI, not the agent identified by the URI.

Given that the purpose of quite a few WebIDs is to identify a person, I think the above does not hold in general. If the purpose is different, it might hold.

@kjetilk
Member Author

kjetilk commented Jan 29, 2020

Given that the purpose of quite a few WebIDs is to identify a person, I think the above does not hold in general. If the purpose is different, it might hold.

Yes indeed, but that is slightly tangential: what we need to figure out is whether there are cases where it doesn't hold, which would then require us to be able to protect URIs from a malicious party.

RFC 7231 looks to me like the simplest case to specify and implement, since it codifies the assumption that URIs are always public and therefore requires no security around them.

The cases you list are arguably OK in their current context, because they are identifiers for public personas. An example of when the agent also matters: if you were under an oppressive regime, it would probably be unwise to use such URIs when posting something controversial about its politics. In that case, you'd use a pseudonym carrying no personally identifiable information that the oppressive regime could discern from the URI; the assumption would have to be that we cannot protect the URI, and so it would be unwise to use real-name URIs.

It is not so much about the user or the agent; we need to find the context, if any, where there is an important reason to protect the URI in order to protect the user.

To take something that shouldn't get you hanged, I'll expand on @ericprud's example from #116: say that Bob creates a photo:
http://bob.example/photos/bob-sleeps-at-work.jpg
and then Bob's boss (who, for some strange reason, doesn't like Bob sleeping at work) has read access to the container and decides to fire Bob: even though they couldn't see the actual photo, they decide the URI is quite enough. Was it "unwise" of Bob to name the resource like that, so that we do not need to create a security mechanism around it? Or does it fall on us to make sure that Bob's boss never sees the URI, even though they may have read access to something around it?

Since we are all very privacy-focused, I think it is easy to say "yes, we need to have security around that", but that's where I think we need to take a step back: even if we wanted to, could we? Or might that just create a false sense of security? It is not just in container listings that the URI may show up; it could also pass through malicious intermediaries, apps may try to infer things from it, etc.

I'm inclined to go with RFC 7231 and say that, yes, Bob messed up; don't be like Bob. Instead, be careful not to put anything in a URI that you wouldn't let the public see.

@RubenVerborgh
Contributor

I'd be inclined to say: let's try, to the extent possible, not to show URIs to agents that do not have read access. But have that as a SHOULD; implementations can go further if desired.

For instance http://bob.example/photos/bob-sleeps-at-work.jpg could be 403 or 404.
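
To make that concrete, here is a minimal sketch in TypeScript, assuming hypothetical authenticate, resourceExists, and hasReadAccess helpers (none of these names come from an actual Solid server): answer 404 for both missing and unreadable resources, so the response never confirms that a URI exists.

  import { IncomingMessage, ServerResponse } from "node:http";

  // Hypothetical stubs; a real server would back these with its actual
  // authentication and WAC/ACP authorization machinery.
  async function authenticate(req: IncomingMessage): Promise<string | null> { return null; }
  async function resourceExists(path: string): Promise<boolean> { return false; }
  async function hasReadAccess(agent: string | null, path: string): Promise<boolean> { return false; }

  async function handleGet(req: IncomingMessage, res: ServerResponse): Promise<void> {
    const agent = await authenticate(req); // WebID of the requesting agent, or null
    const path = req.url ?? "/";
    if (!(await resourceExists(path)) || !(await hasReadAccess(agent, path))) {
      // 404 in both cases: a plain 403 would already confirm that
      // http://bob.example/photos/bob-sleeps-at-work.jpg exists.
      res.writeHead(404).end();
      return;
    }
    // ...serve the representation to agents with read access.
  }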

@kjetilk
Member Author

kjetilk commented Jan 29, 2020

I'd be inclined to say: let's try, to the extent possible, not to show URIs to agents that do not have read access. But have that as a SHOULD; implementations can go further if desired.

Hmmm, OK.

For instance http://bob.example/photos/bob-sleeps-at-work.jpg could be 403 or 404.

Ah, yes, but that's given by the ACL. The real difficulty is when the resource is linked: then what?

So, for example, say that Alice links to the photo: is Alice's Pod then required to check whether the agent has read access to the photo before displaying the link? If not, then what's the point of requiring it elsewhere?

@RubenVerborgh
Contributor

I don't have a direct answer to your question, but I imagine that the Verifiable Credentials people have thought a lot about that. And things like Google Docs have random URLs for that reason.

Maybe it should be a user choice though: I can have scrambled URLs or not.

@kjetilk
Member Author

kjetilk commented Jan 29, 2020

Yes, clients will need to give users a choice to scramble URLs, and it is also a consideration for Slug-less container appends (#96). I would certainly love to hear more arguments for protecting URIs though.

@acoburn
Member

acoburn commented Jan 29, 2020

It is worth noting that the performance characteristics of checking the permissions of each outbound link before displaying it can get quite bad. I have worked with a system that does exactly this, and the performance is spectacularly terrible once you get above a few thousand triples (as in, proxy servers time out after ten minutes...). And changing the response based on who is viewing the resource also makes working with any sort of caching reverse proxy quite difficult, since nearly all requests will be cache misses.
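
To illustrate (a hypothetical sketch; hasReadAccess stands in for whatever authorization lookup the server performs): filtering a representation's links means one authorization decision per link, and the result differs per agent, which is exactly what defeats shared caches.

  // Sketch: per-link filtering costs one authorization check per link, often
  // one network round trip each, and the filtered output is agent-specific,
  // so a caching reverse proxy sees almost nothing but cache misses.
  async function hasReadAccess(agent: string | null, link: string): Promise<boolean> {
    return false; // hypothetical stand-in for a WAC/ACP lookup
  }

  async function visibleLinks(agent: string | null, links: string[]): Promise<string[]> {
    const checks = await Promise.all(links.map((link) => hasReadAccess(agent, link)));
    return links.filter((_, i) => checks[i]);
  }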

@csarven
Member

csarven commented Jan 30, 2020

https://tools.ietf.org/html/rfc3986#section-7 and https://tools.ietf.org/html/rfc3986#section-7.5 are relevant here, and I think we have agreement that URIs (like WebIDs, for instance) are by themselves public and not deemed "sensitive". Certainly there are cases where servers should not disclose their assets (usually protected by access control) or let their identifiers be guessable. Those should be left to the discretion of the server (or resource owners/controllers). Ditto naming resources, which may or may not disclose information. This is not something the spec should state in a way that a test suite can universally verify. Hence, there are good practices that we can highlight, but we probably shouldn't go further.

There is a related category of identifiers: Capability URLs (https://www.w3.org/TR/capability-urls/), whose visibility and access may need some advice. In Solid practice, beyond account-related actions, I'm not aware of anything of that nature (but if you know, please say so). So, something about capability URLs can be noted in the spec (or possibly in Best Practices and Guidelines). Thus far we haven't discussed the possibility of capability URLs; that can be explored in #143, but this issue (142) shouldn't depend on it beyond highlighting the possibility of this category.

Edit: See also httpwg/http-core#278

@kjetilk
Member Author

kjetilk commented Jan 30, 2020

Yes, I agree: checking permissions for every link is likely to cause severe performance problems. And yes, Section 7.5 of the URI spec is pretty strong; I think we're aligning with that interpretation.

@kjetilk added this to the First Public Working Draft milestone Jan 30, 2020
@d-a-v-i--

Secrets leaked via URI are officially classified under Common Weakness Enumeration (CWE) 200: Information Exposure (https://cwe.mitre.org/data/definitions/200.html).

There are levels of risk here to consider. For example, an email username and the password to that user's email may both be considered secrets. However, the username ends up being necessary to share, given the limited protections around routing/completion of an email exchange. The password remains a much higher classification of secret and should never be exposed, even to the email service provider (as they check only a safe hash of the secret).

In other words, a URI can end up carrying different levels of classified secrets. Some may be necessary for access, with limits to how far they can be protected; others should never, ever be exposed. It is with the latter (a password, token, etc.) that URIs get especially dangerous. The risk is not only exposed permanence in user-controlled bookmark lists, but also being sent to logs in places unknown to and uncontrolled by the secret's owner.

Performance considerations (even if they affect availability) must lose out when in competition with privacy risks (the confidentiality of an identity secret never meant to be shared). Otherwise, being high-performance means being more untrustworthy, as described under CWE-200.
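
One narrow mitigation for the logging risk, sketched here under the assumption that the server controls its own log pipeline (the function name is illustrative), is to redact userinfo and query strings before a URL is written anywhere:

  // Sketch: drop credentials and query parameters from URLs before logging,
  // since tokens embedded in URIs otherwise persist in logs the secret's
  // owner neither knows about nor controls (the CWE-200 scenario above).
  function redactUrlForLog(raw: string): string {
    const url = new URL(raw);
    url.username = "";
    url.password = "";
    if (url.search) url.search = "?redacted";
    return url.toString();
  }

  // redactUrlForLog("https://pod.example/inbox?token=s3cret")
  //   -> "https://pod.example/inbox?redacted"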

@elf-pavlik
Member

And things like Google Docs have random URLs for that reason.
Maybe it should be a user choice though: I can have scrambled URLs or not.

I find it really important that Solid can provide great UX even when all URLs are generated from something like a UUID and are intended to be only machine-readable.
For that, we need to make sure users can easily add an rdfs:label to any resource. This also works better if a user wants to change that human-readable label but doesn't want to change the URL of the resource.
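
A minimal sketch of that idea (the container URL and function names are illustrative, not from any spec): mint an opaque, UUID-based URL and keep the human-readable name in an rdfs:label, so renaming never touches the URL.

  import { randomUUID } from "node:crypto";

  // Sketch: the label lives in the data rather than in the URL, so it can
  // change freely and the URI carries nothing an onlooker could read off it.
  function mintResource(container: string, label: string): { url: string; turtle: string } {
    const url = `${container}${randomUUID()}`;
    const turtle = `<${url}> <http://www.w3.org/2000/01/rdf-schema#label> "${label}"@en .`;
    return { url, turtle };
  }

  // mintResource("https://bob.example/photos/", "Bob sleeps at work") yields a
  // URL like https://bob.example/photos/0b9c6f6e-... with the label kept out of it.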

At the same time, we may need to clarify use cases for various vanity URLs, as well as for commonly used URL shorteners.

Last but not least, Google Chrome Developers released a short video a few days ago: Humans can't read URLs. How can we fix it? - HTTP 203. It includes an overview of how various browsers present URLs, mostly on desktop, but it also gives some sense of how this works in mobile browsers. Also, Progressive Web Apps, which users can install on the home screen (and, increasingly, on desktop OSs), take the direction of not showing URLs in the UI at all.

My takeaway from that: URLs mostly matter for security and for users trusting content from a given domain; everything else becomes less relevant for human readability. Safari doesn't even show the full URL, and on mobile even the domain hardly fits on the display.

@kjetilk
Member Author

kjetilk commented Jan 30, 2020

Secrets leaked via URI are officially classified under Common Weakness Enumeration (CWE) 200: Information Exposure (https://cwe.mitre.org/data/definitions/200.html).

Right!

There are levels of risk here to consider. For example, an email username and the password to that user's email may both be considered secrets. However, the username ends up being necessary to share, given the limited protections around routing/completion of an email exchange. The password remains a much higher classification of secret and should never be exposed, even to the email service provider (as they check only a safe hash of the secret).

OK. So, the question is whether we can mitigate those risks without departing completely from the general assumption of RFC 7231 and RFC 3986. My interpretation of those standards is that you should assume that URLs are not confidential, and therefore you should not put anything sensitive in them.
We need to get down into the details here.

In Solid, a WebID is functionally very similar to a username, and by their very nature, most WebIDs are public. I believe that pseudonymous WebIDs will be very important in Solid, where the identity is known only to the IdP. With pseudonyms, we can make sure that they cannot be connected to the real-life identity. In certain constrained applications, we can also make sure that the pseudonym is not exposed. I have had pseudonyms on my agenda for a long time, but it is not something we have discussed much; perhaps we need to get that discussion going.

There's nothing in Solid that would ever expose a password; on the contrary, such credentials will always remain between the client and the user's IdP, which is very constrained.

In other words, a URI can end up carrying different levels of classified secrets. Some may be necessary for access, with limits to how far they can be protected; others should never, ever be exposed. It is with the latter (a password, token, etc.) that URIs get especially dangerous. The risk is not only exposed permanence in user-controlled bookmark lists, but also being sent to logs in places unknown to and uncontrolled by the secret's owner.

Yes. It seems to me, however, that the inclusion of secrets in URIs in Solid falls into these categories:

  1. In Capability URLs (thanks for bringing up that draft, @csarven, I wasn't aware of it).
  2. In the authentication protocol design
  3. The user is being silly, like in the Bob example above.
  4. Some users are doing bad things to other users.

The first category is clearly dangerous, but I think it will have very constrained uses in Solid; it is not part of the core as of now, but we probably need it for recovery mechanisms, confirmation of potentially harmful actions (such as deleting your Pod), etc. I see absolutely no reason to build any powerful mechanisms on capability URLs; that's not where Solid's power lies anyway. Also, capability URLs will never be linked to or appear in container listings. We can be very careful when designing those mechanisms.

I haven't yet reviewed the authentication protocol that is being worked on, but that is a constrained specification that we should carefully review with this in mind.

Beyond this, I think we can say with confidence that no part of Solid will put passwords or tokens into URLs.

I think we need to put a lot of effort into helping users who are being exploited by other users. One example may be that Alice says on her pod:

@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://bob.example/photos/bob-sleeps-at-work.jpg> dct:title "Bob sleeps at work"@en ;
  rdfs:comment "Alice lols @ bob"@en ;
  foaf:depicts <http://bob.example/profile/card#me> .

(i.e. Alice annotates the picture of Bob and links him to it). Statements like that are obviously made on the Web too, but now they can be queried... Protecting URLs cannot help in this case: if Alice had read access to the container and possibly the picture, then there is nothing access control can do to mitigate this. Bob can merely revoke Alice's access to his photos; he cannot remove Alice's annotation.

What could have helped is if Bob hadn't minted that identifier for the picture in the first place; Alice might not have noticed. As a matter of best practice, we could recommend that clients try to help the user make good choices.

What has always baffled me about social media is that the means we have in real life to say that something is not OK have not been ported there. We're not using the means available to tell Alice that what she said is not OK. Say Bob's friend Charlie comes along and notices Alice's annotation; he should be able to tell Alice "this is not OK" and also flag Alice's post, so that good participants in the Solid ecosystem know they should not distribute it any further unless Bob flags it as OK.

I think we should put effort into socio-technical measures like that to mitigate such issues.

Obviously, there is a possibility that some users will decide to create a resource with their tokens in the URI. That will never be part of the Solid design, but it is hard to prevent them from doing something that silly. We could try. Moreover, flawed application designs may do it on the user's behalf, in which case a review and a flag telling people not to trust those applications is, I think, an appropriate response. We should clearly say not to do such things, as there is really no reason to.

In summary, the design of Solid is such that the most dangerous things will not appear in URLs in the general-purpose parts of the specification. Where they are used, it will be in very constrained parts of Solid that we can carefully design and review.

In the general-purpose parts of Solid, secrets can be put into URLs by users by accident, and we should try to mitigate that through best practices; but in general, this does not differ from the wider Web. Thus, the question remains whether we can follow the considerations of RFC 7231 and RFC 3986, or whether we need to be more constrained.

@kjetilk
Member Author

kjetilk commented Feb 4, 2020

Another way to (try to) think about this is in terms of extremes and a continuum between them. The two extremes would be:

  1. All URIs are public, and thus secrets can never be embedded in them,
  2. All URIs are secret until explicitly authorized to be shared.

Extreme number 2 is a radical departure from Web architecture and standards as seen in RFC 7231 and RFC 3986, but it is still possible to make such an assumption in restricted systems (e.g. an intranet). As usual, confidentiality can be breached in such a system if an authorized agent links from the public Web, so the system must never let such a link bypass WAC.

Extreme number 1 is, I believe, a valid assumption for the resource access and lifecycle parts of Solid, though we should make very clear that clients and servers must not arbitrarily assign names to non-public resources (for the purpose of "The Spec", we should mention that in connection with Slug).
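
A sketch of what that could mean for a server (names are hypothetical; the spec does not prescribe this): honour a client-supplied Slug only in public containers, and mint an opaque name otherwise.

  import { randomUUID } from "node:crypto";

  // Hypothetical policy check; a real server would consult the container's ACL.
  async function isPublicContainer(container: string): Promise<boolean> {
    return false;
  }

  // Sketch: a descriptive Slug never becomes part of a URI that containment
  // triples in a non-public container could later expose.
  async function chooseResourceName(container: string, slug?: string): Promise<string> {
    if (slug && (await isPublicContainer(container))) {
      return slug.toLowerCase().replace(/[^a-z0-9-]+/g, "-"); // sanitized Slug
    }
    return randomUUID(); // opaque name for non-public containers
  }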

For other parts of Solid, for example a Pod provisioning API, we should have a design goal of creating the API without requiring secrets in URLs, while acknowledging that short-lived capability URLs may be required. We should strive to minimize the use of secrets in URLs, and where we do use them, the design must undergo security review.
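
For completeness, a sketch of such a short-lived capability URL (all names are illustrative; nothing here is from a Solid specification): a high-entropy, single-use token with an expiry, never linked from container listings.

  import { randomBytes } from "node:crypto";

  const capabilities = new Map<string, number>(); // token -> expiry (ms since epoch)

  // Sketch: 256 bits of randomness make the URL unguessable; the short TTL
  // and single-use redemption limit the damage if it ever leaks into a log.
  function issueRecoveryUrl(base: string, ttlMs: number): string {
    const token = randomBytes(32).toString("base64url");
    capabilities.set(token, Date.now() + ttlMs);
    return `${base}/recover/${token}`;
  }

  function redeemRecoveryToken(token: string): boolean {
    const expires = capabilities.get(token);
    capabilities.delete(token); // single use, whether valid or expired
    return expires !== undefined && Date.now() < expires;
  }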

@timbl
Contributor

timbl commented Jul 27, 2021

These discussions relate to Web architecture at a general level, where it is not really constructive for the Solid community to get into deep discussions. Closing the issue as out of scope.

@timbl closed this as completed Jul 27, 2021
@csarven moved this to Done in Specification Sep 25, 2022