As discussed in Section 1, allowing arbitrary
sites to initiate calls violates the core Web security guarantee;
without some access restrictions on local devices, any malicious site
could simply bug a user. At minimum, then, it MUST NOT be possible for
arbitrary sites to initiate calls to arbitrary locations without user
consent. This immediately raises the question, however, of what should
be the scope of user consent.
In order for the user to
make an intelligent decision about whether to allow a call
(and hence their camera and microphone input to be routed somewhere),
they must understand either who is requesting access, where the media
is going, or both. As detailed below, there are two basic conceptual
models:
1. You are sending your media to entity A because you want to
   talk to entity A (e.g., your mother).

2. Entity A (e.g., a calling service) asks to access the user's
   devices with the assurance that it will transfer the media to
   entity B (e.g., your mother).
In either case, identity is at the heart of any consent decision.
Moreover, the identity of the party the browser is connecting to is all that the browser can meaningfully enforce;
if you are calling A, A can simply forward the media to C. Similarly,
if you authorize A to place a call to B, A can call C instead.
In either case, all the browser is able to do is verify and check
authorization for whoever is controlling where the media goes.
The target of the media can of course advertise a security/privacy
policy, but this is not something that the browser can
enforce. Even so, there are a variety of different consent scenarios
that motivate different technical consent mechanisms.
We discuss these mechanisms in the sections below.
It's important to understand that consent to access local devices
is largely orthogonal to consent to transmit various kinds of
data over the network (see Section 4.2).
Consent for device access is largely a matter of protecting
the user's privacy from malicious sites. By contrast,
consent to send network traffic is about preventing the
user's browser from being used to attack its local network.
Thus, we need to ensure communications consent even if the
site is not able to access the camera and microphone at
all (hence the WebSockets consent mechanism) and similarly,
we need to be concerned with the site accessing the
user's camera and microphone even if the data is to be
sent back to the site via conventional HTTP-based network
mechanisms such as HTTP POST.
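The independence of these two consents can be sketched as a small policy model. This is purely illustrative logic, not a browser API; the function and map names are invented for this sketch.

```javascript
// Illustrative model: device consent and communications consent are
// independent grants. Holding one never implies holding the other.
const consents = new Map(); // origin -> { device: bool, network: bool }

function grant(origin, kind) {
  const c = consents.get(origin) || { device: false, network: false };
  c[kind] = true;
  consents.set(origin, c);
}

// Sending captured media over the network requires BOTH consents; a
// site with device access alone can still be blocked from transmitting.
function maySendCapturedMedia(origin) {
  const c = consents.get(origin);
  return !!c && c.device && c.network;
}

// A site with no device access at all still needs communications
// consent before opening arbitrary network channels (the WebSockets
// case described above).
function mayOpenChannel(origin) {
  const c = consents.get(origin);
  return !!c && c.network;
}
```

The point of the model is that neither check subsumes the other: a site can pass `mayOpenChannel` while failing `maySendCapturedMedia`, and vice versa.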
In addition to camera and microphone access, there has been
demand for screen and/or application sharing functionality.
Unfortunately, the security implications of this
functionality are much harder for users to intuitively
analyze than for camera and microphone access.
(See the referenced analysis for a full treatment.)
The most obvious threats are simply those of "oversharing".
I.e., the user may believe they are sharing a window when
in fact they are sharing an application, or may forget they
are sharing their whole screen, icons, notifications, and all.
This is already an issue with existing screen sharing technologies
and is made somewhat worse if a partially trusted site is responsible for asking
for the resource to be shared rather than having the user propose it.
A less obvious threat involves the impact of screen sharing on the
Web security model. A key part of the Same-Origin Policy is that
HTML or JS from site A can reference content from site B and cause
the browser to load it, but (unless explicitly permitted) cannot
see the result. However, if a Web application from a site is
screen sharing the browser, then this violates that invariant,
with serious security consequences. For example, an attacker site
might request screen sharing and then briefly open up a new
window to the user's bank or webmail account, using screen sharing
to read the resulting displayed content. A more sophisticated
attack would be to open up a source view window to a site and use the
screen sharing result to view anti-cross-site request forgery tokens.
These threats suggest that screen/application sharing might need
a higher level of user consent than access to the camera or
microphone.
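One way to express "a higher level of user consent" is as an ordering of consent strength per capability. The table and names below are hypothetical, not part of any standard; they only illustrate the tiering idea.

```javascript
// Hypothetical consent-tier policy: screen capture demands stronger
// consent than camera/microphone because its risks are harder for
// users to reason about. Levels and names are illustrative only.
const CONSENT_LEVEL = { click: 1, perCallPrompt: 2, elevatedPrompt: 3 };

const required = {
  camera:     CONSENT_LEVEL.perCallPrompt,
  microphone: CONSENT_LEVEL.perCallPrompt,
  // e.g., a dedicated picker plus a persistent sharing indicator
  screen:     CONSENT_LEVEL.elevatedPrompt,
};

function sufficient(capability, levelObtained) {
  return levelObtained >= required[capability];
}
```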
While a large number of calling scenarios are possible, the
scenarios discussed in this section illustrate many of
the difficulties of identifying the relevant scope of consent.
The first scenario we consider is a dedicated calling service. In this
case, the user has a relationship with a calling site
and repeatedly makes calls on it. It is likely
that rather than having to give permission for each call,
the user will want to give the calling service long-term
access to the camera and microphone. This is a natural fit
for a long-term consent mechanism (e.g., installing an
app store "application" to indicate permission for the
calling service).
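A long-term consent mechanism of this kind can be sketched as a durable grant store keyed by origin, with an expiry so the grant is not literally permanent. All names here are illustrative; real browsers keep such state internally.

```javascript
// Sketch of long-term device-access grants for a dedicated calling
// service: granted once, checked on later calls instead of
// re-prompting the user every time. Illustrative, not a browser API.
const longTermGrants = new Map(); // origin -> expiry timestamp (ms)

function grantLongTerm(origin, ttlMs, now = Date.now()) {
  longTermGrants.set(origin, now + ttlMs);
}

function hasDeviceAccess(origin, now = Date.now()) {
  const expiry = longTermGrants.get(origin);
  return expiry !== undefined && now < expiry;
}

// The user can withdraw the grant at any time.
function revoke(origin) {
  longTermGrants.delete(origin);
}
```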
A variant of the dedicated calling service is a gaming site
(e.g., a poker site) which hosts a dedicated calling service
to allow players to call each other.
With any kind of service where the user may use the same
service to talk to many different people, there is a question
about whether the user can know who they are talking to.
If I grant permission to calling service A to make calls
on my behalf, then I am implicitly granting it permission
to bug my computer whenever it wants. This suggests another
consent model in which a site is authorized to make calls
but only to certain target entities (identified via
media-plane cryptographic mechanisms as described in
Section 4.3.2 and especially Section 4.3.2.3). Note that the
question of consent here is related to but
distinct from the question of peer identity: I
might be willing to allow a calling site to in general
initiate calls on my behalf but still have some calls
via that site where I can be sure that the site is not
listening in.
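The target-restricted consent model above can be sketched as a policy keyed by (origin, target identity) pairs rather than by origin alone. The identities stand in for the cryptographically verified peer identities of Section 4.3.2; all names are illustrative.

```javascript
// Sketch: consent scoped to specific call targets, not just to the
// calling site. A site may place calls only to identities the user
// has explicitly permitted. Illustrative logic only.
const allowedTargets = new Map(); // origin -> Set of permitted identities

function allowTarget(origin, identity) {
  if (!allowedTargets.has(origin)) allowedTargets.set(origin, new Set());
  allowedTargets.get(origin).add(identity);
}

// verifiedPeerIdentity would come from media-plane cryptographic
// verification, so the site cannot simply assert it.
function mayCall(origin, verifiedPeerIdentity) {
  const targets = allowedTargets.get(origin);
  return !!targets && targets.has(verifiedPeerIdentity);
}
```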
Another simple scenario is calling the site you're actually visiting.
The paradigmatic case here is the "click here to talk to a
representative" windows that appear on many shopping sites.
In this case, the user's expectation is that they are
calling the site they're actually visiting. However, it is
unlikely that they want to provide a general consent to such
a site; just because I want some information on a car
doesn't mean that I want the car manufacturer to be able
to activate my microphone whenever they please. Thus,
this suggests the need for a second consent mechanism
where I only grant consent for the duration of a given
call. As described in Section 3.1, great care must be taken in
the design of this interface
to avoid the users just clicking through. Note also
that the user interface chrome, which is the representation
through which the user interacts with the user agent itself,
must clearly display elements
showing that the call is continuing in order to avoid attacks
where the calling site just leaves it up indefinitely but
shows a Web UI that implies otherwise.
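Per-call consent can be modeled as a grant whose lifetime is bounded by the call itself: access exists only between call start and call end, matching the period during which the browser chrome shows the in-call indicator. This is an illustrative sketch, not a browser API.

```javascript
// Sketch of consent scoped to the duration of a single call: device
// access is available only while the call the user approved is active.
function makeCallScopedConsent() {
  let active = false;
  return {
    callStart() { active = true; },  // user clicked "allow" for this call
    callEnd()   { active = false; }, // chrome indicator disappears too
    deviceAccessAllowed() { return active; },
  };
}
```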
Now that we have described the calling scenarios, we can start to reason about
the security requirements.
As discussed in Section 3.2, the basic unit of
Web sandboxing is the origin, and so it is natural to scope consent
to the origin. Specifically, a script from origin A MUST only be
allowed
to initiate communications (and hence to access the camera and microphone)
if the user has specifically authorized access for that origin.
It is of course technically possible to have coarser-scoped permissions,
but because the Web model is scoped to the origin, this creates a difficult
mismatch.
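Origin scoping follows the usual scheme/host/port triple, which the WHATWG URL parser exposes directly. A minimal sketch of an origin-keyed grant store, with illustrative function names:

```javascript
// Two URLs share an origin only if scheme, host, and port all match.
function sameOrigin(a, b) {
  const ua = new URL(a), ub = new URL(b);
  return ua.protocol === ub.protocol &&
         ua.hostname === ub.hostname &&
         ua.port === ub.port;
}

// Consent is stored against the origin, so any page from that origin
// (and only that origin) inherits the grant.
const grants = new Set();

function authorize(url)    { grants.add(new URL(url).origin); }
function isAuthorized(url) { return grants.has(new URL(url).origin); }
```

Note how the grant covers every path under the origin, which is exactly the coarseness the next paragraph argues may be too broad.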
Arguably, the origin is not fine-grained enough. Consider the situation where
Alice visits a site and authorizes it to make a single call. If consent is
expressed solely in terms of the origin, then on any future visit to that
site (including one induced via a mash-up or ad network), the site can
bug Alice's computer, use the computer to place bogus calls, etc.
While in principle Alice could grant and then
revoke the privilege, in practice privileges accumulate; if we are concerned
about this attack, something else is needed. There are a number of potential countermeasures to
this sort of issue.
Individual Consent:  Ask the user for permission for each call.

Callee-oriented Consent:  Only allow calls to a given user.

Cryptographic Consent:  Only allow calls to a given set of peer
keying material or to a cryptographically established identity.
Unfortunately, none of these approaches is satisfactory for all cases.
As discussed above, individual consent puts the user's approval
in the UI flow for every call. Not only does this quickly become annoying
but it can train the user to simply click "OK", at which point the consent becomes
useless. Thus, while it may be necessary to have individual consent in some
cases, this is not a suitable solution for (for instance) the calling
service case. Where necessary, in-flow user interfaces must be carefully
designed to avoid the risk of the user blindly clicking through.
The other two options are designed to restrict calls to a given target.
Callee-oriented consent provided by the calling site
would not work well because a malicious site can claim that the
user is calling any user of their choice. One fix for this is to tie calls to a
cryptographically established identity. While not suitable for all cases,
this approach may be useful for some. If we consider the case
of advertising, it's not particularly convenient
to require the advertiser to instantiate an IFRAME on the hosting site just
to get permission; a more convenient approach is to cryptographically tie
the advertiser's certificate to the communication directly. We're still
tying permissions to the origin here, but to the media origin (and/or destination)
rather than to the Web origin.
[RFC8827] describes mechanisms which facilitate this sort of consent.
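Cryptographic consent of this kind can be sketched as pinning the grant to the peer's certificate material as observed at the media plane, so even a compromised calling site cannot silently redirect the media. The fingerprint format and function names below are illustrative.

```javascript
// Sketch: the consent names the peer's certificate fingerprint rather
// than the calling site's origin. Media is allowed to flow only if the
// fingerprint revealed by the media-plane handshake matches a pinned
// value. Illustrative logic only.
const pinnedPeers = new Set();

function pinPeer(fingerprint) {
  pinnedPeers.add(fingerprint.toLowerCase());
}

// Invoked once the media-plane handshake reveals the remote
// certificate's fingerprint.
function mediaMayFlow(remoteFingerprint) {
  return pinnedPeers.has(remoteFingerprint.toLowerCase());
}
```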
Another case where media-level cryptographic identity makes sense is when a user
really does not trust the calling site. For instance, I might be worried that
the calling service will attempt to bug my computer, but I also want to be
able to conveniently call my friends. If consent is tied to particular
communications endpoints, then my risk is limited. Naturally, it
is somewhat challenging to design UI primitives which express this sort
of policy. The problem becomes even more challenging in multi-user
calling cases.
Origin-based security is intended to secure against Web attackers. However, we must
also consider the case of network attackers. Consider the case where I have
granted permission to a calling service hosted at an origin with
the HTTP scheme (i.e., an http:// URL). If I ever use my computer on
an unsecured network (e.g., a hotspot or if my own home wireless network
is insecure), and browse any HTTP site, then an attacker can bug my computer. The attack proceeds
like this:
1. I connect to an arbitrary HTTP site. Note that this site is
   unaffiliated with the calling service.

2. The attacker modifies my HTTP connection to inject an IFRAME
   (or a redirect) to the calling service's origin.

3. The attacker forges the response apparently from the calling
   service's origin to inject JS to initiate a call to themselves.
Note that this attack does not depend on the media being insecure. Because the
call is to the attacker, it is also encrypted to them. Moreover, it need not
be executed immediately; the attacker can "infect" the origin semi-permanently
(e.g., with a Web worker or a popped-up window that is hidden under the main window)
and thus be able to bug me long
after I have left the infected network. This risk is created by allowing
calls at all from a page fetched over HTTP.
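Since an active network attacker can inject script into any page fetched over HTTP, one natural mitigation is to refuse durable consent to insecure origins altogether. A one-line sketch of that gate, with an illustrative function name:

```javascript
// Sketch: durable consent is only available to https: origins, because
// an on-path attacker can impersonate any http: origin and inherit its
// grants. Illustrative logic, not a browser API.
function mayReceiveDurableConsent(originUrl) {
  return new URL(originUrl).protocol === "https:";
}
```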
Even if calls are only possible from HTTPS [RFC2818] sites, if
those sites include active content (e.g., JavaScript) from an
untrusted site, that JavaScript is executed in the security
context of the page [finer-grained]. This could lead to
compromise of a call even if the parent page is safe. Note: This
issue is not restricted to pages which contain untrusted content.
If any page from a given origin ever loads JavaScript from an
attacker, then it is possible for that attacker to infect the
browser's notion of that origin semi-permanently.