<?xml version='1.0' encoding='utf-8'?> | ||||
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent"> | ||||
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" submissionType="IETF" | ||||
category="std" consensus="yes" number="8845" obsoletes="" updates="" | ||||
xml:lang="en" sortRefs="true" symRefs="true" tocInclude="true" | ||||
version="3" ipr="trust200902" docName="draft-ietf-clue-framework-25"> | ||||
<!-- xml2rfc v2v3 conversion 2.45.2 --> | ||||
<front> | ||||
<title abbrev="CLUE Framework">Framework for Telepresence Multi-Streams</tit | ||||
le> | ||||
<seriesInfo name="RFC" value="8845"/> | ||||
<author fullname="Mark Duckworth" initials="M." role="editor" surname="Du | ||||
ckworth"> | ||||
<organization/> | ||||
<address> | ||||
<postal> | ||||
<city></city><region></region><code></code> | ||||
<country></country> | ||||
</postal> | ||||
<email>mrducky73@outlook.com</email> | ||||
</address> | ||||
</author> | ||||
<author fullname="Andrew Pepperell" initials="A." surname="Pepperell"> | ||||
<organization>Acano</organization> | ||||
<address> | ||||
<postal> | ||||
<city>Uxbridge</city> | ||||
<country>United Kingdom</country> | ||||
</postal> | ||||
<email>apeppere@gmail.com</email> | ||||
</address> | ||||
</author> | ||||
<author fullname="Stephan Wenger" initials="S." surname="Wenger"> | ||||
<organization abbrev="Tencent">Tencent</organization> | ||||
<address> | ||||
<postal> | ||||
<street>2747 Park Blvd.</street> | ||||
<city>Palo Alto</city><region>CA</region><code>94306</code> | ||||
<country>United States of America</country> | ||||
</postal> | ||||
<email>stewe@stewe.org</email> | ||||
</address> | ||||
</author> | ||||
<date month="January" year="2021"/> | ||||
<area>ART</area> | ||||
<workgroup>CLUE</workgroup> | ||||
<keyword>Telepresence</keyword> | ||||
<keyword>Conferencing</keyword> | ||||
<keyword>Video-Conferencing</keyword> | ||||
<keyword>MCU</keyword> | ||||
<abstract> | ||||
<t> | ||||
This document defines a framework for a protocol to enable devices | ||||
in a telepresence conference to interoperate. The protocol enables | ||||
communication of information about multiple media streams so a | ||||
sending system and receiving system can make reasonable decisions | ||||
about transmitting, selecting, and rendering the media streams. | ||||
   This protocol is used in addition to SIP signaling and Session
   Description Protocol (SDP) negotiation for setting up a telepresence
   session.</t>
</abstract> | ||||
</front> | ||||
<middle> | ||||
<section anchor="s-1" numbered="true" toc="default"> | ||||
<name>Introduction</name> | ||||
<t> | ||||
Current telepresence systems, though based on open standards such | ||||
as RTP <xref target="RFC3550" format="default"/> and SIP <xref target="RFC326 | ||||
1" format="default"/>, cannot easily interoperate with | ||||
each other. A major factor limiting the interoperability of | ||||
telepresence systems is the lack of a standardized way to describe | ||||
and negotiate the use of multiple audio and video streams | ||||
comprising the media flows. This document provides a framework for | ||||
protocols to enable interoperability by handling multiple streams | ||||
in a standardized way. The framework is intended to support the | ||||
use cases described in "Use Cases for Telepresence Multistreams" | ||||
<xref target="RFC7205" format="default"/> and to meet the requirements in "Re | ||||
quirements for | ||||
Telepresence Multistreams" <xref target="RFC7262" format="default"/>. This in | ||||
cludes cases using | ||||
multiple media streams that are not necessarily telepresence.</t> | ||||
<t> | ||||
   The basic session setup for the use cases is based on SIP
   <xref target="RFC3261" format="default"/>
and SDP offer/answer <xref target="RFC3264" format="default"/>. In addition | ||||
   to basic SIP &amp; SDP
offer/answer, signaling that is ControLling mUltiple streams for | ||||
tElepresence (CLUE) specific is required to exchange the | ||||
information describing the multiple Media Streams. The motivation | ||||
for this framework, an overview of the signaling, and the information | ||||
required to be exchanged are described in subsequent sections of | ||||
this document. Companion documents describe the signaling details | ||||
<xref target="RFC8848" format="default"/>, the data model <xref target="RFC88 | ||||
46" format="default"/>, and the protocol <xref target="RFC8847" format="default" | ||||
/>.</t> | ||||
</section> | ||||
<section anchor="s-2" numbered="true" toc="default"> | ||||
<name>Requirements Language</name> | ||||
<t> | ||||
The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQU | ||||
IRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL | ||||
NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14> | ||||
RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>", | ||||
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to | ||||
be interpreted as | ||||
described in BCP 14 <xref target="RFC2119" format="default"/> <xref tar | ||||
get="RFC8174" format="default"/> | ||||
when, and only when, they appear in all capitals, as shown here. | ||||
</t> | ||||
</section> | ||||
<section anchor="s-3" numbered="true" toc="default"> | ||||
<name>Definitions</name> | ||||
<t> | ||||
The terms defined below are used throughout this document and | ||||
in companion documents. Capitalization is used in order to easily identify a | ||||
defined term.</t> | ||||
<dl newline="false" spacing="normal"> | ||||
<dt>Advertisement:</dt> | ||||
<dd>A CLUE message a Media Provider sends to a Media | ||||
Consumer describing specific aspects of the content of the Media | ||||
and any restrictions it has in terms of being able to provide | ||||
certain Streams simultaneously.</dd> | ||||
<dt>Audio Capture (AC):</dt> | ||||
<dd>Media Capture for audio. Denoted as "ACn" in the | ||||
examples in this document.</dd> | ||||
<dt>Capture:</dt> | ||||
<dd>Same as Media Capture.</dd> | ||||
<dt>Capture Device:</dt> | ||||
<dd>A device that converts physical input, such as | ||||
audio, video, or text, into an electrical signal, in most cases to | ||||
be fed into a Media encoder.</dd> | ||||
<dt>Capture Encoding:</dt> | ||||
<dd>A specific Encoding of a Media Capture, to be | ||||
sent by a Media Provider to a Media Consumer via RTP.</dd> | ||||
<dt>Capture Scene:</dt> | ||||
<dd>A structure representing a spatial region captured | ||||
by one or more Capture Devices, each capturing Media representing a | ||||
portion of the region. The spatial region represented by a Capture | ||||
Scene may correspond to a real region in physical space, such as a | ||||
room. A Capture Scene includes attributes and one or more Capture | ||||
Scene Views, with each view including one or more Media Captures.</dd> | ||||
<dt>Capture Scene View (CSV):</dt> | ||||
<dd>A list of Media Captures of the same | ||||
Media type that together form one way to represent the entire | ||||
Capture Scene.</dd> | ||||
<dt>CLUE:</dt> | ||||
<dd>CLUE is an | ||||
acronym for "ControLling mUltiple streams for tElepresence", which is | ||||
the name of the IETF working group in which this document and certain | ||||
companion documents have been developed. Often, CLUE-* refers to | ||||
something that has been designed by the CLUE working group; for | ||||
example, this document may be called the CLUE-framework document | ||||
herein and elsewhere.</dd> | ||||
<dt>CLUE-capable device:</dt> | ||||
<dd>A device that supports the CLUE data channel | ||||
<xref target="RFC8850" format="default"/>, the CLUE protocol <xref target="RF | ||||
C8847" format="default"/> and the principles of CLUE negotiation; it also seeks | ||||
CLUE-enabled calls.</dd> | ||||
<dt>CLUE-enabled call:</dt> | ||||
<dd>A call in which two CLUE-capable devices have | ||||
successfully negotiated support for a CLUE data channel in SDP | ||||
<xref target="RFC4566" format="default"/>. A CLUE-enabled call is not necessa | ||||
rily immediately able | ||||
to send CLUE-controlled Media; negotiation of the data channel and | ||||
of the CLUE protocol must complete first. Calls between two CLUE-capable devi | ||||
ces that have not yet successfully completed | ||||
negotiation of support for the CLUE data channel in SDP are not | ||||
considered CLUE-enabled.</dd> | ||||
<dt>Conference:</dt> | ||||
<dd>Used as defined in "A Framework for | ||||
   Conferencing within the Session Initiation Protocol (SIP)"
   <xref target="RFC4353" format="default"/>.</dd>
<dt>Configure Message:</dt> | ||||
<dd>A CLUE message a Media Consumer sends to a Media | ||||
Provider specifying which content and Media Streams it wants to | ||||
receive, based on the information in a corresponding Advertisement | ||||
message.</dd> | ||||
<dt>Consumer:</dt> | ||||
<dd>Short for Media Consumer.</dd> | ||||
<dt>Encoding:</dt> | ||||
<dd>Short for Individual Encoding.</dd> | ||||
<dt>Encoding Group:</dt> | ||||
<dd>A set of Encoding parameters representing a total | ||||
Media Encoding capability to be subdivided across potentially | ||||
multiple Individual Encodings.</dd> | ||||
<dt>Endpoint:</dt> | ||||
<dd>A CLUE-capable device that is the logical point of final | ||||
termination through receiving, decoding and Rendering, and/or | ||||
initiation through capturing, encoding, and sending of Media | ||||
Streams. An Endpoint consists of one or more physical devices | ||||
   that source and sink Media Streams, and exactly one
   <xref target="RFC4353" format="default"/>
Participant (which, in turn, includes exactly one SIP User Agent). | ||||
Endpoints can be anything from multiscreen/multicamera rooms to | ||||
handheld devices.</dd> | ||||
<dt>Global View:</dt> | ||||
<dd>A set of references to one or more CSVs | ||||
of the same Media type that are defined within Scenes of the same | ||||
Advertisement. A Global View is a suggestion from the Provider to | ||||
the Consumer for one set of CSVs that provide a useful | ||||
representation of all the Scenes in the Advertisement.</dd> | ||||
<dt>Global View List:</dt> | ||||
<dd>A list of Global Views included in an | ||||
Advertisement. A Global View List may include Global Views of | ||||
different Media types.</dd> | ||||
<dt>Individual Encoding:</dt> | ||||
<dd>a set of parameters representing a way to | ||||
encode a Media Capture to become a Capture Encoding.</dd> | ||||
<dt>Multipoint Control Unit (MCU):</dt> | ||||
<dd>a CLUE-capable device that connects | ||||
two or more Endpoints into one single multimedia | ||||
Conference <xref target="RFC7667" format="default"/>. An MCU includes a Mixe | ||||
r like that described in <xref target="RFC4353" format="default"/>, | ||||
without the requirement of <xref target="RFC4353" format="default"/> to send | ||||
Media to each | ||||
participant.</dd> | ||||
<dt>Media:</dt> | ||||
<dd>Any data that, after suitable encoding, can be conveyed over | ||||
RTP, including audio, video, or timed text.</dd> | ||||
<dt>Media Capture (MC):</dt> | ||||
<dd>A source of Media, such as from one or more Capture | ||||
Devices or constructed from other Media Streams.</dd> | ||||
<dt>Media Consumer:</dt> | ||||
<dd>A CLUE-capable device that intends to receive | ||||
Capture Encodings.</dd> | ||||
<dt>Media Provider:</dt> | ||||
<dd>A CLUE-capable device that intends to send Capture | ||||
Encodings.</dd> | ||||
<dt>Multiple Content Capture (MCC):</dt> | ||||
<dd>A Capture that mixes and/or | ||||
switches other Captures of a single type (for example, all audio or all | ||||
video). Particular Media Captures may or may not be present in the | ||||
resultant Capture Encoding, depending on time or space. Denoted as | ||||
"MCCn" in the example cases in this document.</dd> | ||||
<dt>Plane of Interest:</dt> | ||||
<dd>The spatial plane within a Scene containing the | ||||
most-relevant subject matter.</dd> | ||||
<dt>Provider:</dt> | ||||
<dd>Same as a Media Provider.</dd> | ||||
<dt>Render:</dt> | ||||
<dd>The process of generating a representation from Media, such | ||||
as displayed motion video or sound emitted from loudspeakers.</dd> | ||||
<dt>Scene:</dt> | ||||
<dd>Same as a Capture Scene.</dd> | ||||
<dt>Simultaneous Transmission Set:</dt> | ||||
<dd>A set of Media Captures that can be | ||||
transmitted simultaneously from a Media Provider.</dd> | ||||
<dt>Single Media Capture:</dt> | ||||
<dd>A Capture that contains Media from a single | ||||
source Capture Device, e.g., an Audio Capture from a single | ||||
microphone or a Video Capture from a single camera.</dd> | ||||
<dt>Spatial Relation:</dt> | ||||
<dd>The arrangement of two objects in space, in | ||||
contrast to relation in time or other relationships.</dd> | ||||
<dt>Stream:</dt> | ||||
<dd>A Capture Encoding sent from a Media Provider to a Media | ||||
Consumer via RTP <xref target="RFC3550" format="default"/>.</dd> | ||||
<dt>Stream Characteristics:</dt> | ||||
<dd>The Media Stream attributes commonly used | ||||
in non-CLUE SIP/SDP environments (such as Media codec, bitrate, | ||||
resolution, profile/level, etc.) as well as CLUE-specific | ||||
attributes, such as the Capture ID or a spatial location.</dd> | ||||
<dt>Video Capture (VC):</dt> | ||||
<dd>Media Capture for video. Denoted as VCn in the | ||||
example cases in this document.</dd> | ||||
<dt>Video Composite:</dt> | ||||
<dd>A single image that is formed, normally by an RTP | ||||
mixer inside an MCU, by combining visual elements from separate | ||||
sources.</dd> | ||||
</dl> | ||||
</section> | ||||
<section anchor="s-4" numbered="true" toc="default"> | ||||
<name>Overview and Motivation</name> | ||||
<t> | ||||
This section provides an overview of the functional elements | ||||
defined in this document to represent a telepresence or | ||||
multistream system. The motivations for the framework described | ||||
in this document are also provided.</t> | ||||
<t> | ||||
   Two key concepts introduced in this document are the terms "Media
   Provider" and "Media Consumer". A Media Provider represents the
entity that sends the Media and a Media Consumer represents the | ||||
entity that receives the Media. A Media Provider provides Media in | ||||
the form of RTP packets; a Media Consumer consumes those RTP | ||||
packets. Media Providers and Media Consumers can reside in | ||||
Endpoints or in Multipoint Control Units (MCUs). A Media Provider | ||||
in an Endpoint is usually associated with the generation of Media | ||||
for Media Captures; these Media Captures are typically sourced | ||||
from cameras, microphones, and the like. Similarly, the Media | ||||
Consumer in an Endpoint is usually associated with renderers, such | ||||
as screens and loudspeakers. In MCUs, Media Providers and | ||||
Consumers can have the form of outputs and inputs, respectively, | ||||
of RTP mixers, RTP translators, and similar devices. Typically, | ||||
telepresence devices, such as Endpoints and MCUs, would perform as | ||||
both Media Providers and Media Consumers, the former being | ||||
concerned with those devices' transmitted Media and the latter | ||||
with those devices' received Media. In a few circumstances, a | ||||
CLUE-capable device includes only Consumer or Provider | ||||
functionality, such as recorder-type Consumers or webcam-type | ||||
Providers.</t> | ||||
<t> | ||||
The motivations for the framework outlined in this document | ||||
include the following:</t> | ||||
<ol spacing="normal" type="(%d)"> | ||||
<li>Endpoints in telepresence systems typically have multiple Media | ||||
Capture and Media Render devices, e.g., multiple cameras and | ||||
screens. While previous system designs were able to set up calls | ||||
that would capture Media using all cameras and display Media on all | ||||
screens, for example, there was no mechanism that could associate | ||||
these Media Captures with each other in space and time, in a cross-vendor | ||||
interoperable way.</li> | ||||
<li>The mere fact that there are multiple Media Capture and Media Render | ||||
devices, each of which may be configurable in aspects such as zoom, | ||||
leads to the difficulty that a variable number of such devices can | ||||
be used to capture different aspects of a region. The Capture | ||||
Scene concept allows for the description of multiple setups for | ||||
those multiple Media Capture devices that could represent sensible | ||||
operation points of the physical Capture Devices in a room, chosen | ||||
by the operator. A Consumer can pick and choose from those | ||||
configurations based on its rendering abilities and then inform the | ||||
   Provider about its choices. Details are provided in
   <xref target="s-7" format="default"/>.</li>
<li>In some cases, physical limitations or other reasons disallow | ||||
the concurrent use of a device in more than one setup. For | ||||
example, the center camera in a typical three-camera conference | ||||
room can set its zoom objective to capture either the middle | ||||
few seats only or all seats of a room, but not both concurrently. The | ||||
Simultaneous Transmission Set concept allows a Provider to signal | ||||
such limitations. Simultaneous Transmission Sets are part of the | ||||
   Capture Scene description and are discussed in
   <xref target="s-8" format="default"/>.</li>
<li>Often, the devices in a room do not have the computational | ||||
complexity or connectivity to deal with multiple Encoding options | ||||
simultaneously, even if each of these options is sensible in | ||||
certain scenarios, and even if the simultaneous transmission is | ||||
also sensible (i.e., in case of multicast Media distribution to | ||||
multiple Endpoints). Such constraints can be expressed by the | ||||
   Provider using the Encoding Group concept, which is described in
   <xref target="s-9" format="default"/>.</li>
<li>Due to the potentially large number of RTP Streams required for | ||||
a Multimedia Conference involving potentially many Endpoints, each | ||||
of which can have many Media Captures and Media renderers, it has | ||||
become common to multiplex multiple RTP Streams onto the same | ||||
transport address, so as to avoid using the port number as a | ||||
multiplexing point and the associated shortcomings such as | ||||
NAT/firewall traversal. The large number of possible permutations | ||||
of sensible options a Media Provider can make available to a Media | ||||
Consumer makes a mechanism desirable that allows it to narrow down | ||||
the number of possible options that a SIP offer/answer exchange has | ||||
to consider. Such information is made available using protocol | ||||
mechanisms specified in this document and companion documents. | ||||
The | ||||
Media Provider and Media Consumer may use information in CLUE | ||||
messages to reduce the complexity of SIP offer/answer messages. | ||||
Also, there are aspects of the control of both Endpoints and MCUs | ||||
that dynamically change during the progress of a call, such as | ||||
audio-level-based screen switching, layout changes, and so on, | ||||
which need to be conveyed. Note that these control aspects are | ||||
complementary to those specified in traditional SIP-based | ||||
   conference management, such as Binary Floor Control Protocol (BFCP).
   An exemplary call flow can be found in
   <xref target="s-5" format="default"/>.</li>
</ol> | ||||
<t> | ||||
Finally, all this information needs to be conveyed, and the notion | ||||
of support for it needs to be established. This is done by the | ||||
negotiation of a "CLUE channel", a data channel negotiated early | ||||
during the initiation of a call. An Endpoint or MCU that rejects | ||||
the establishment of this data channel, by definition, does not | ||||
support CLUE-based mechanisms, whereas an Endpoint or MCU that | ||||
accepts it is indicating support for CLUE as specified in this | ||||
document and its companion documents.</t> | ||||
</section> | ||||
<section anchor="s-5" numbered="true" toc="default"> | ||||
<name>Description of the Framework/Model</name> | ||||
<t> | ||||
The CLUE framework specifies how multiple Media Streams are to be | ||||
handled in a telepresence Conference.</t> | ||||
<t> | ||||
A Media Provider (transmitting Endpoint or MCU) describes specific | ||||
aspects of the content of the Media and the Media Stream Encodings | ||||
it can send in an Advertisement; and the Media Consumer responds to | ||||
the Media Provider by specifying which content and Media Streams it | ||||
wants to receive in a Configure message. The Provider then | ||||
transmits the asked-for content in the specified Streams.</t> | ||||
<t> | ||||
This Advertisement and Configure typically occur during call | ||||
   initiation, after CLUE has been enabled in a call, but they
   <bcp14>MAY</bcp14> also
happen at any time throughout the call, whenever there is a change | ||||
in what the Consumer wants to receive or (perhaps less common) what the | ||||
Provider can send.</t> | ||||
<t> | ||||
An Endpoint or MCU typically acts as both Provider and Consumer at | ||||
the same time, sending Advertisements and sending Configurations in | ||||
response to receiving Advertisements. (It is possible to be just | ||||
one or the other.)</t> | ||||
<t> | ||||
The data model <xref target="RFC8846" format="default"/> is based around two | ||||
main concepts: a Capture and an Encoding. A Media Capture, | ||||
such as of type audio or video, has attributes to describe the | ||||
content a Provider can send. Media Captures are described in terms | ||||
of CLUE-defined attributes, such as Spatial Relationships and | ||||
purpose of the Capture. Providers tell Consumers which Media | ||||
Captures they can provide, described in terms of the Media Capture | ||||
attributes.</t> | ||||
<t> | ||||
A Provider organizes its Media Captures into one or more Capture | ||||
Scenes, each representing a spatial region, such as a room. A | ||||
Consumer chooses which Media Captures it wants to receive from the | ||||
Capture Scenes.</t> | ||||
<t> | ||||
In addition, the Provider can send the Consumer a description of | ||||
the Individual Encodings it can send in terms of identifiers that | ||||
relate to items in SDP <xref target="RFC4566" format="default"/>.</t> | ||||
<t> | ||||
The Provider can also specify constraints on its ability to provide | ||||
Media, and a sensible design choice for a Consumer is to take these | ||||
into account when choosing the content and Capture Encodings it | ||||
requests in the later offer/answer exchange. Some constraints are | ||||
   due to the physical limitations of devices; for example, a camera
may not be able to provide zoom and non-zoom views simultaneously. | ||||
Other constraints are system based, such as maximum bandwidth.</t> | ||||
<t> | ||||
The following diagram illustrates the information contained in an | ||||
Advertisement.</t> | ||||
<figure anchor="ref-advertisement-structure"> | ||||
<name>Advertisement Structure</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
................................................................... | ||||
. Provider Advertisement +--------------------+ . | ||||
. | Simultaneous Sets | . | ||||
. +------------------------+ +--------------------+ . | ||||
. | Capture Scene N | +--------------------+ . | ||||
. +-+----------------------+ | | Global View List | . | ||||
. | Capture Scene 2 | | +--------------------+ . | ||||
. +-+----------------------+ | | +----------------------+ . | ||||
. | Capture Scene 1 | | | | Encoding Group N | . | ||||
. | +---------------+ | | | +-+--------------------+ | . | ||||
. | | Attributes | | | | | Encoding Group 2 | | . | ||||
. | +---------------+ | | | +-+--------------------+ | | . | ||||
. | | | | | Encoding Group 1 | | | . | ||||
. | +----------------+ | | | | parameters | | | . | ||||
. | | V i e w s | | | | | bandwidth | | | . | ||||
. | | +---------+ | | | | | +-------------------+| | | . | ||||
. | | |Attribute| | | | | | | V i d e o || | | . | ||||
. | | +---------+ | | | | | | E n c o d i n g s || | | . | ||||
. | | | | | | | | Encoding 1 || | | . | ||||
. | | View 1 | | | | | | || | | . | ||||
. | | (list of MCs) | | |-+ | +-------------------+| | | . | ||||
. | +----|-|--|------+ |-+ | | | | . | ||||
. +---------|-|--|---------+ | +-------------------+| | | . | ||||
. | | | | | A u d i o || | | . | ||||
. | | | | | E n c o d i n g s || | | . | ||||
. v | | | | Encoding 1 || | | . | ||||
. +---------|--|--------+ | | || | | . | ||||
. | Media Capture N |------>| +-------------------+| | | . | ||||
. +-+---------v--|------+ | | | | | . | ||||
. | Media Capture 2 | | | | |-+ . | ||||
. +-+--------------v----+ |-------->| | | . | ||||
. | Media Capture 1 | | | | |-+ . | ||||
. | +----------------+ |---------->| | . | ||||
. | | Attributes | | |_+ +----------------------+ . | ||||
. | +----------------+ |_+ . | ||||
. +---------------------+ . | ||||
. . | ||||
................................................................... | ||||
]]></artwork> | ||||
</figure> | ||||
<t><xref target="ref-basic-information-flow" format="default"/> illustrate | ||||
s the call flow used by a simple system (two Endpoints) in compliance with this | ||||
document. A very brief outline of the call flow is described in the text that f | ||||
ollows.</t> | ||||
<figure anchor="ref-basic-information-flow"> | ||||
<name>Basic Information Flow</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
+-----------+ +-----------+ | ||||
| Endpoint1 | | Endpoint2 | | ||||
+----+------+ +-----+-----+ | ||||
| INVITE (BASIC SDP+CLUECHANNEL) | | ||||
|--------------------------------->| | ||||
| 200 0K (BASIC SDP+CLUECHANNEL)| | ||||
|<---------------------------------| | ||||
| ACK | | ||||
|--------------------------------->| | ||||
| | | ||||
|<################################>| | ||||
| BASIC MEDIA SESSION | | ||||
|<################################>| | ||||
| | | ||||
| CONNECT (CLUE CTRL CHANNEL) | | ||||
|=================================>| | ||||
| ... | | ||||
|<================================>| | ||||
| CLUE CTRL CHANNEL ESTABLISHED | | ||||
|<================================>| | ||||
| | | ||||
| ADVERTISEMENT 1 | | ||||
|*********************************>| | ||||
| ADVERTISEMENT 2 | | ||||
|<*********************************| | ||||
| | | ||||
| CONFIGURE 1 | | ||||
|<*********************************| | ||||
| CONFIGURE 2 | | ||||
|*********************************>| | ||||
| | | ||||
| REINVITE (UPDATED SDP) | | ||||
|--------------------------------->| | ||||
| 200 0K (UPDATED SDP)| | ||||
|<---------------------------------| | ||||
| ACK | | ||||
|--------------------------------->| | ||||
| | | ||||
|<################################>| | ||||
| UPDATED MEDIA SESSION | | ||||
|<################################>| | ||||
| | | ||||
v v | ||||
]]></artwork> | ||||
</figure> | ||||
<t> | ||||
An initial offer/answer exchange establishes a basic Media session, | ||||
for example, audio-only, and a CLUE channel between two Endpoints. | ||||
With the establishment of that channel, the Endpoints have | ||||
   consented to use the CLUE protocol mechanisms and, therefore,
   <bcp14>MUST</bcp14> adhere to the CLUE protocol suite as outlined
   herein.</t>
<t> | ||||
Over this CLUE channel, the Provider in each Endpoint conveys its | ||||
characteristics and capabilities by sending an Advertisement as | ||||
specified herein. The Advertisement is typically not sufficient to | ||||
set up all Media. The Consumer in the Endpoint receives the | ||||
information provided by the Provider and can use it for several | ||||
purposes. It uses it, along with information from an offer/answer | ||||
exchange, to construct a CLUE Configure message to tell the | ||||
Provider what the Consumer wishes to receive. Also, the Consumer | ||||
may use the information provided to tailor the SDP it is going to | ||||
send during any following SIP offer/answer exchange, and its | ||||
reaction to SDP it receives in that step. It is often a sensible | ||||
implementation choice to do so. Spatial relationships associated | ||||
with the Media can be included in the Advertisement, and it is | ||||
often sensible for the Media Consumer to take those spatial | ||||
relationships into account when tailoring the SDP. The Consumer | ||||
can also limit the number of Encodings it must set up resources to | ||||
receive, and not waste resources on unwanted Encodings, because it | ||||
has the Provider's Advertisement information ahead of time to | ||||
determine what it really wants to receive. The Consumer can also | ||||
use the Advertisement information for local rendering decisions.</t> | ||||
<t> | ||||
This initial CLUE exchange is followed by an SDP offer/answer | ||||
exchange that not only establishes those aspects of the Media that | ||||
have not been "negotiated" over CLUE, but also has the effect of | ||||
setting up the Media transmission itself, involving potentially | ||||
   security exchanges, Interactive Connectivity Establishment (ICE), and
   whatnot. This step is considered "plain vanilla SIP".</t>
<t> | ||||
   During the lifetime of a call, further exchanges <bcp14>MAY</bcp14>
   occur over the
CLUE channel. In some cases, those further exchanges lead to a | ||||
modified system behavior of Provider or Consumer (or both) without | ||||
any other protocol activity such as further offer/answer exchanges. | ||||
For example, a Configure Message requesting that the Provider place a | ||||
different Capture source into a Capture Encoding, signaled over the | ||||
CLUE channel, ought not to lead to heavy-handed mechanisms like SIP | ||||
re-invites. In other cases, however, after the CLUE negotiation, an | ||||
additional offer/answer exchange becomes necessary. For example, | ||||
if both sides decide to upgrade the call from one screen to a | ||||
multi-screen call, and more bandwidth is required for the additional | ||||
video channels compared to what was previously negotiated using | ||||
offer/answer, a new offer/answer exchange is required.</t> | ||||
<t> | ||||
One aspect of the protocol outlined herein, and specified in more | ||||
detail in companion documents, is that it makes available to the | ||||
Consumer information regarding the Provider's capabilities to | ||||
deliver Media and attributes related to that Media such as their | ||||
Spatial Relationship. The operation of the renderer inside the | ||||
Consumer is unspecified in that it can choose to ignore some | ||||
information provided by the Provider and/or not Render Media | ||||
Streams available from the Provider (although the Consumer follows | ||||
the CLUE protocol and, therefore, gracefully receives and responds | ||||
to the Provider's information using a Configure operation).</t> | ||||
<t> | ||||
A CLUE-capable device interoperates with a device that does not | ||||
support CLUE. The CLUE-capable device can determine, by the result | ||||
of the initial offer/answer exchange, if the other device supports | ||||
and wishes to use CLUE. The specific mechanism for this is | ||||
described in <xref target="RFC8848" format="default"/>. If the other device | ||||
does | ||||
not use CLUE, then the CLUE-capable device falls back to behavior | ||||
that does not require CLUE.</t> | ||||
<t> | ||||
As for the Media, Provider and Consumer have an end-to-end | ||||
communication relationship with respect to (RTP-transported) Media; | ||||
and the mechanisms described herein and in companion documents do | ||||
not change the aspects of setting up those RTP flows and sessions. | ||||
In other words, the RTP Media sessions conform to the negotiated | ||||
SDP whether or not CLUE is used.</t> | ||||
</section> | ||||
<section anchor="s-6" numbered="true" toc="default"> | ||||
<name>Spatial Relationships</name> | ||||
<t> | ||||
In order for a Consumer to perform a proper rendering, it is often | ||||
necessary (or at least helpful) for the Consumer to have received | ||||
spatial information about the Streams it is receiving. CLUE | ||||
defines a coordinate system that allows Media Providers to describe | ||||
the Spatial Relationships of their Media Captures to enable proper | ||||
scaling and spatially sensible rendering of their Streams. The | ||||
coordinate system is based on a few principles:</t> | ||||
<ul spacing="normal"> | ||||
<li>Each Capture Scene has a distinct coordinate system, unrelated | ||||
to the coordinate systems of other Scenes.</li> | ||||
<li>Simple systems that do not have multiple Media Captures to | ||||
associate spatially need not use the coordinate model, although | ||||
it can still be useful to provide an Area of Capture.</li> | ||||
<li> | ||||
<t>Coordinates can either be in real, physical units (millimeters), | ||||
have an unknown scale, or have no physical scale. Systems that | ||||
know their physical dimensions (for example, professionally | ||||
   installed Telepresence room systems) <bcp14>MUST</bcp14> provide those
   real-world measurements to enable the best user experience for
advanced receiving systems that can utilize this information. | ||||
Systems that don't know specific physical dimensions but still | ||||
   know relative distances <bcp14>MUST</bcp14> use "Unknown Scale".
   "No Scale" is
intended to be used only where Media Captures from different | ||||
devices (with potentially different scales) will be forwarded | ||||
alongside one another (e.g., in the case of an MCU). | ||||
</t> | ||||
<ul spacing="normal"> | ||||
<li>"Millimeters" means the scale is in millimeters.</li> | ||||
<li>"Unknown Scale" means the scale is not necessarily in millimeter | ||||
s, but | ||||
the scale is the same for every Capture in the Capture Scene.</li> | ||||
<li>"No Scale" means the scale could be different for each | ||||
Capture -- an MCU Provider that advertises two adjacent | ||||
Captures and picks sources (which can change quickly) from | ||||
different Endpoints might use this value; the scale could be | ||||
different and changing for each Capture. But the areas of | ||||
capture still represent a Spatial Relation between Captures.</li> | ||||
</ul> | ||||
</li> | ||||
<li>The coordinate system is right-handed Cartesian X, Y, Z with the | ||||
origin at a spatial location of the Provider's choosing. The | ||||
Provider <bcp14>MUST</bcp14> use the same coordinate system with the same | ||||
scale | ||||
and origin for all coordinates within the same Capture Scene.</li> | ||||
</ul> | ||||
<t>The direction of increasing coordinate values is as follows: | ||||
X increases from left to right, from the point of view of an | ||||
observer at the front of the room looking toward the back; | ||||
Y increases from the front of the room to the back of the room; | ||||
Z increases from low to high (i.e., floor to ceiling).</t> | ||||
<t> | ||||
Cameras in a Scene typically point in the direction of increasing | ||||
Y, from front to back. But there could be multiple cameras | ||||
pointing in different directions. If the physical space does not | ||||
have a well-defined front and back, the Provider chooses any | ||||
direction for X, Y, and Z consistent with right-handed | ||||
coordinates.</t> | ||||
</section> | ||||
<section anchor="s-7" numbered="true" toc="default"> | ||||
<name>Media Captures and Capture Scenes</name> | ||||
<t> | ||||
This section describes how Providers can describe the content of | ||||
Media to Consumers.</t> | ||||
<section anchor="s-7.1" numbered="true" toc="default"> | ||||
<name>Media Captures</name> | ||||
<t> | ||||
Media Captures are the fundamental representations of Streams that | ||||
a device can transmit. What a Media Capture actually represents is | ||||
flexible:</t> | ||||
<ul spacing="normal"> | ||||
<li>It can represent the immediate output of a physical source (e.g., | ||||
          camera, microphone) or 'synthetic' source (e.g., laptop computer,
          DVD player).</li>
          <li>It can represent the output of an audio mixer or video
          composer.</li>
<li>It can represent a concept such as 'the loudest speaker'.</li> | ||||
<li>It can represent a conceptual position such as 'the leftmost | ||||
Stream'.</li> | ||||
</ul> | ||||
<t> | ||||
To identify and distinguish between multiple Capture instances, | ||||
Captures have a unique identity. For instance, VC1, VC2, AC1, and | ||||
AC2 (where VC1 and VC2 refer to two different Video Captures and | ||||
AC1 and AC2 refer to two different Audio Captures).</t> | ||||
<t>Some key points about Media Captures: | ||||
</t> | ||||
<ul spacing="normal"> | ||||
<li>A Media Capture is of a single Media type (e.g., audio or | ||||
video).</li> | ||||
<li>A Media Capture is defined in a Capture Scene and is given an | ||||
Advertisement unique identity. The identity may be referenced | ||||
outside the Capture Scene that defines it through an MCC.</li> | ||||
<li>A Media Capture may be associated with one or more CSVs.</li> | ||||
<li>A Media Capture has exactly one set of spatial information.</li> | ||||
<li>A Media Capture can be the source of at most one Capture | ||||
Encoding.</li> | ||||
</ul> | ||||
<t> | ||||
Each Media Capture can be associated with attributes to describe | ||||
what it represents.</t> | ||||
<section anchor="s-7.1.1" numbered="true" toc="default"> | ||||
<name>Media Capture Attributes</name> | ||||
<t> | ||||
Media Capture attributes describe information about the Captures. | ||||
A Provider can use the Media Capture attributes to describe the | ||||
Captures for the benefit of the Consumer of the Advertisement | ||||
message. All these attributes are optional. Media Capture | ||||
attributes include: | ||||
</t> | ||||
<ul spacing="normal"> | ||||
          <li>Spatial information, such as Point of Capture, Point on Line
          of Capture, and Area of Capture (all of which, in combination,
          define the capture field of, for example, a camera).</li>
<li>Other descriptive information to help the Consumer choose | ||||
between Captures (e.g., description, presentation, view, | ||||
priority, language, person information, and type).</li> | ||||
</ul> | ||||
<t> | ||||
The subsections below define the Capture attributes.</t> | ||||
<section anchor="s-7.1.1.1" numbered="true" toc="default"> | ||||
<name>Point of Capture</name> | ||||
<t> | ||||
The Point of Capture attribute is a field with a single Cartesian | ||||
(X, Y, Z) point value that describes the spatial location of the | ||||
capturing device (such as camera). For an Audio Capture with | ||||
   multiple microphones, the Point of Capture defines the nominal
   midpoint of the microphones.</t>
</section> | ||||
<section anchor="s-7.1.1.2" numbered="true" toc="default"> | ||||
<name>Point on Line of Capture</name> | ||||
<t> | ||||
The Point on Line of Capture attribute is a field with a single | ||||
Cartesian (X, Y, Z) point value that describes a position in space | ||||
of a second point on the axis of the capturing device, toward the | ||||
direction it is pointing; the first point being the Point of | ||||
Capture (see above).</t> | ||||
<t> | ||||
Together, the Point of Capture and Point on Line of Capture define | ||||
the direction and axis of the capturing device, for example, the | ||||
optical axis of a camera or the axis of a microphone. The Media | ||||
Consumer can use this information to adjust how it Renders the | ||||
received Media if it so chooses.</t> | ||||
<t> | ||||
For an Audio Capture, the Media Consumer can use this information | ||||
   along with the Audio Capture Sensitivity Pattern to define a
   three-dimensional volume of capture where sounds can be expected to be
picked up by the microphone providing this specific Audio Capture. | ||||
If the Consumer wants to associate an Audio Capture with a Video | ||||
Capture, it can compare this volume with the Area of Capture for | ||||
video Media to provide a check on whether the Audio Capture is | ||||
indeed spatially associated with the Video Capture. For example, a | ||||
video Area of Capture that fails to intersect at all with the audio | ||||
volume of capture, or is at such a long radial distance from the | ||||
microphone Point of Capture that the audio level would be very low, | ||||
would be inappropriate.</t> | ||||
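      <t>
   As a brief non-normative illustration, the values below are
   hypothetical, assume the "Millimeters" scale, and use the same
   informal attribute notation as the examples later in this document
   (the labels "PointofCapture" and "PointonLineofCapture" are shorthand
   for this discussion, not protocol syntax). A front-wall camera could
   advertise the following pair of points, which together define an
   optical axis parallel to the Y axis:</t>
      <figure>
        <name>Illustrative Point of Capture and Point on Line of Capture (Non-Normative)</name>
        <artwork name="" type="" align="left" alt=""><![CDATA[
   VC1   PointofCapture=(2000,0,1200)
         PointonLineofCapture=(2000,3000,1200)
]]></artwork>
      </figure>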
</section> | ||||
<section anchor="s-7.1.1.3" numbered="true" toc="default"> | ||||
<name>Area of Capture</name> | ||||
<t> | ||||
The Area of Capture is a field with a set of four (X, Y, Z) points | ||||
as a value that describes the spatial location of what is being | ||||
"captured". This attribute applies only to Video Captures, not | ||||
other types of Media. By comparing the Area of Capture for | ||||
different Video Captures within the same Capture Scene, a Consumer | ||||
can determine the Spatial Relationships between them and Render | ||||
them correctly.</t> | ||||
<t> | ||||
   The four points <bcp14>MUST</bcp14> be co-planar, forming a
   quadrilateral, which defines the Plane of Interest for the particular
   Media Capture.</t>
<t> | ||||
If the Area of Capture is not specified, it means the Video Capture | ||||
might be spatially related to other Captures in the same Scene, but | ||||
there is no detailed information on the relationship. For a switched | ||||
Capture that switches between different sections within a larger | ||||
area, the Area of Capture <bcp14>MUST</bcp14> use coordinates for the larger | ||||
potential area.</t> | ||||
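      <t>
   As a non-normative illustration, the values below are hypothetical,
   assume the "Millimeters" scale, and use the same informal
   "AreaofCapture=" notation as the examples later in this document. A
   camera covering the left half of a 4-meter-wide, 1.2-meter-high Plane
   of Interest along the front wall of a Scene might advertise:</t>
      <figure>
        <name>Illustrative Area of Capture (Non-Normative)</name>
        <artwork name="" type="" align="left" alt=""><![CDATA[
   VC1   AreaofCapture=(0,0,0)(2000,0,0)
                       (0,0,1200)(2000,0,1200)
]]></artwork>
      </figure>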
</section> | ||||
<section anchor="s-7.1.1.4" numbered="true" toc="default"> | ||||
<name>Mobility of Capture</name> | ||||
<t> | ||||
The Mobility of Capture attribute indicates whether or not the | ||||
Point of Capture, Point on Line of Capture, and Area of Capture | ||||
values stay the same over time, or are expected to change | ||||
(potentially frequently). Possible values are static, dynamic, and | ||||
highly dynamic.</t> | ||||
<t> | ||||
An example for "dynamic" is a camera mounted on a stand that is | ||||
occasionally hand-carried and placed at different positions in | ||||
order to provide the best angle to capture a work task. A camera | ||||
worn by a person who moves around the room is an example for | ||||
"highly dynamic". In either case, the effect is that the Point of Capture, | ||||
Capture Axis, and Area of Capture change with time.</t> | ||||
<t> | ||||
The Point of Capture of a static Capture <bcp14>MUST NOT</bcp14> move for the | ||||
life of | ||||
the CLUE session. The Point of Capture of dynamic Captures is | ||||
categorized by a change in position followed by a reasonable period | ||||
of stability -- in the order of magnitude of minutes. Highly | ||||
dynamic Captures are categorized by a Point of Capture that is | ||||
constantly moving. If the Area of Capture, Point of Capture, and | ||||
Point on Line of Capture attributes are included with dynamic or highly | ||||
dynamic Captures, they indicate spatial information at the time of | ||||
the Advertisement.</t> | ||||
</section> | ||||
<section anchor="s-7.1.1.5" numbered="true" toc="default"> | ||||
<name>Audio Capture Sensitivity Pattern</name> | ||||
<t> | ||||
The Audio Capture Sensitivity Pattern attribute applies only to | ||||
Audio Captures. This attribute gives information about the nominal | ||||
sensitivity pattern of the microphone that is the source of the | ||||
Capture. Possible values include patterns such as omni, shotgun, | ||||
cardioid, and hyper-cardioid.</t> | ||||
</section> | ||||
<section anchor="s-7.1.1.6" numbered="true" toc="default"> | ||||
<name>Description</name> | ||||
<t> | ||||
The Description attribute is a human-readable description (which | ||||
could be in multiple languages) of the Capture.</t> | ||||
</section> | ||||
<section anchor="s-7.1.1.7" numbered="true" toc="default"> | ||||
<name>Presentation</name> | ||||
<t> | ||||
The Presentation attribute indicates that the Capture originates | ||||
from a presentation device, that is, one that provides supplementary | ||||
information to a Conference through slides, video, still images, | ||||
   data, etc. Where more information is known about the Capture, it
   <bcp14>MAY</bcp14>
be expanded hierarchically to indicate the different types of | ||||
presentation Media, e.g., presentation.slides, presentation.image, | ||||
etc.</t> | ||||
<t> | ||||
Note: It is expected that a number of keywords will be defined that | ||||
   provide more detail on the type of presentation. Refer to
   <xref target="RFC8846" format="default"/> for how to extend the model.</t>
</section> | ||||
<section anchor="s-7.1.1.8" numbered="true" toc="default"> | ||||
<name>View</name> | ||||
<t> | ||||
The View attribute is a field with enumerated values, indicating | ||||
what type of view the Capture relates to. The Consumer can use | ||||
this information to help choose which Media Captures it wishes to | ||||
receive. Possible values are as follows:</t> | ||||
<dl newline="false" spacing="normal" indent="12"> | ||||
<dt>Room:</dt> | ||||
<dd>Captures the entire Scene | ||||
</dd> | ||||
<dt>Table:</dt> | ||||
<dd>Captures the conference table with seated people | ||||
</dd> | ||||
<dt>Individual:</dt> | ||||
<dd>Captures an individual person</dd> | ||||
<dt>Lectern:</dt> | ||||
<dd>Captures the region of the lectern including the | ||||
presenter, for example, in a classroom-style conference room | ||||
</dd> | ||||
<dt>Audience:</dt> | ||||
            <dd>Captures a region showing the audience in a classroom-style
            conference room
</dd> | ||||
</dl> | ||||
</section> | ||||
<section anchor="s-7.1.1.9" numbered="true" toc="default"> | ||||
<name>Language</name> | ||||
<t> | ||||
The Language attribute indicates one or more languages used in the | ||||
   content of the Media Capture. Captures <bcp14>MAY</bcp14> be offered
   in different
languages in case of multilingual and/or accessible Conferences. A | ||||
Consumer can use this attribute to differentiate between them and | ||||
pick the appropriate one.</t> | ||||
<t> | ||||
Note that the Language attribute is defined and meaningful both for | ||||
Audio and Video Captures. In case of Audio Captures, the meaning | ||||
is obvious. For a Video Capture, "Language" could, for example, be | ||||
sign interpretation or text.</t> | ||||
<t> | ||||
   The Language attribute is coded per
   <xref target="RFC5646" format="default"/>.</t>
</section> | ||||
<section anchor="s-7.1.1.10" numbered="true" toc="default"> | ||||
<name>Person Information</name> | ||||
<t> | ||||
The Person Information attribute allows a Provider to provide | ||||
specific information regarding the people in a Capture (regardless | ||||
of whether or not the Capture has a Presentation attribute). The | ||||
Provider may gather the information automatically or manually from | ||||
   a variety of sources; however, the xCard
   <xref target="RFC6351" format="default"/> format is used to
convey the information. This allows various information, such as | ||||
Identification information (<xref section="6.2" sectionFormat="of" target="RF | ||||
C6350" format="default"/>), Communication | ||||
Information (<xref section="6.4" sectionFormat="of" target="RFC6350" format=" | ||||
default"/>), and Organizational information | ||||
(<xref section="6.6" sectionFormat="of" target="RFC6350" format="default"/>), | ||||
to be communicated. A Consumer may then | ||||
automatically (i.e., via a policy) or manually select Captures | ||||
based on information about who is in a Capture. It also allows a | ||||
Consumer to Render information regarding the people participating | ||||
in the Conference or to use it for further processing.</t> | ||||
<t> | ||||
The Provider may supply a minimal set of information or a larger | ||||
   set of information. However, it <bcp14>MUST</bcp14> be compliant with
   <xref target="RFC6350" format="default"/> and
supply a "VERSION" and "FN" property. A Provider may supply | ||||
   multiple xCards per Capture of any KIND
   (<xref section="6.1.4" sectionFormat="of" target="RFC6350" format="default"/>).</t>
<t> | ||||
   In order to keep CLUE messages compact, the Provider
   <bcp14>SHOULD</bcp14> use a
URI to point to any LOGO, PHOTO, or SOUND contained in the xCard | ||||
rather than transmitting the LOGO, PHOTO, or SOUND data in a CLUE | ||||
message.</t> | ||||
</section> | ||||
<section anchor="s-7.1.1.11" numbered="true" toc="default"> | ||||
<name>Person Type</name> | ||||
<t> | ||||
The Person Type attribute indicates the type of people contained in | ||||
the Capture with respect to the meeting agenda (regardless of | ||||
whether or not the Capture has a Presentation attribute). As a | ||||
Capture may include multiple people, the attribute may contain | ||||
   multiple values. However, values <bcp14>MUST NOT</bcp14> be repeated
   within the
attribute.</t> | ||||
<t> | ||||
An Advertiser associates the person type with an individual Capture | ||||
when it knows that a particular type is in the Capture. If an | ||||
Advertiser cannot link a particular type with some certainty to a | ||||
Capture, then it is not included. On reception of a | ||||
   Capture with a Person Type attribute, a Consumer knows with some
   certainty that
the Capture contains that person type. The Capture may contain | ||||
other person types, but the Advertiser has not been able to | ||||
determine that this is the case.</t> | ||||
<t>The types of Captured people include: | ||||
</t> | ||||
<dl newline="false" spacing="normal" indent="15"> | ||||
<dt>Chair:</dt> | ||||
<dd>the person responsible for running the meeting | ||||
according to the agenda.</dd> | ||||
<dt>Vice-Chair:</dt> | ||||
<dd>the person responsible for assisting the chair in | ||||
running the meeting.</dd> | ||||
<dt>Minute Taker:</dt> | ||||
<dd>the person responsible for recording the | ||||
minutes of the meeting.</dd> | ||||
<dt>Attendee:</dt> | ||||
<dd>the person has no particular responsibilities with | ||||
respect to running the meeting.</dd> | ||||
<dt>Observer:</dt> | ||||
<dd>an Attendee without the right to influence the | ||||
discussion.</dd> | ||||
<dt>Presenter:</dt> | ||||
<dd>the person scheduled on the agenda to make a | ||||
presentation in the meeting. Note: This is not related to any | ||||
"active speaker" functionality.</dd> | ||||
<dt>Translator:</dt> | ||||
<dd>the person providing some form of translation | ||||
or commentary in the meeting.</dd> | ||||
<dt>Timekeeper:</dt> | ||||
<dd>the person responsible for maintaining the | ||||
meeting schedule.</dd> | ||||
</dl> | ||||
<t> | ||||
Furthermore, the Person Type attribute may contain one or more | ||||
strings allowing the Provider to indicate custom meeting-specific | ||||
types.</t> | ||||
</section> | ||||
<section anchor="s-7.1.1.12" numbered="true" toc="default"> | ||||
<name>Priority</name> | ||||
<t> | ||||
The Priority attribute indicates a relative priority between | ||||
different Media Captures. The Provider sets this priority, and the | ||||
Consumer <bcp14>MAY</bcp14> use the priority to help decide which Captures it | ||||
wishes to receive.</t> | ||||
<t> | ||||
The Priority attribute is an integer that indicates a relative | ||||
priority between Captures. For example, it is possible to assign a | ||||
priority between two presentation Captures that would allow a | ||||
remote Endpoint to determine which presentation is more important. | ||||
Priority is assigned at the individual Capture level. It represents | ||||
the Provider's view of the relative priority between Captures with | ||||
   a priority. The same priority number <bcp14>MAY</bcp14> be used across
   multiple
Captures. It indicates that they are equally important. If no priority | ||||
is assigned, no assumptions regarding relative importance of the | ||||
Capture can be assumed.</t> | ||||
</section> | ||||
<section anchor="s-7.1.1.13" numbered="true" toc="default"> | ||||
<name>Embedded Text</name> | ||||
<t> | ||||
The Embedded Text attribute indicates that a Capture provides | ||||
embedded textual information. For example, the Video Capture may | ||||
contain speech-to-text information composed with the video image.</t> | ||||
</section> | ||||
<section anchor="s-7.1.1.14" numbered="true" toc="default"> | ||||
<name>Related To</name> | ||||
<t> | ||||
The Related To attribute indicates the Capture contains additional | ||||
complementary information related to another Capture. The value | ||||
indicates the identity of the other Capture to which this Capture | ||||
is providing additional information.</t> | ||||
<t> | ||||
For example, a Conference can utilize translators or facilitators | ||||
that provide an additional audio Stream (i.e., a translation or | ||||
description or commentary of the Conference). Where multiple | ||||
Captures are available, it may be advantageous for a Consumer to | ||||
select a complementary Capture instead of or in addition to a | ||||
Capture it relates to.</t> | ||||
</section> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-7.2" numbered="true" toc="default"> | ||||
<name>Multiple Content Capture</name> | ||||
<t> | ||||
The MCC indicates that one or more Single Media Captures are | ||||
multiplexed (temporally and/or spatially) or mixed in one Media | ||||
Capture. Only one Capture type (i.e., audio, video, etc.) is | ||||
allowed in each MCC instance. The MCC may contain a reference to | ||||
the Single Media Captures (which may have their own attributes) as | ||||
well as attributes associated with the MCC itself. An MCC may also | ||||
   contain other MCCs. The MCC <bcp14>MAY</bcp14> reference Captures from
   within the
Capture Scene that defines it or from other Capture Scenes. No | ||||
ordering is implied by the order that Captures appear within an MCC. | ||||
An MCC <bcp14>MAY</bcp14> contain no references to other Captures to indicate | ||||
that | ||||
the MCC contains content from multiple sources, but no information | ||||
regarding those sources is given. MCCs either contain the | ||||
referenced Captures and no others or have no referenced Captures | ||||
and, therefore, may contain any Capture.</t> | ||||
<t> | ||||
One or more MCCs may also be specified in a CSV. This allows an | ||||
Advertiser to indicate that several MCC Captures are used to | ||||
   represent a Capture Scene.
   <xref target="ref-advertisement-sent-to-endpoint-f-two-encodings" format="default"/>
   provides an example of this
case.</t> | ||||
<t> | ||||
As outlined in <xref target="s-7.1" format="default"/>, each instance of the | ||||
MCC has its own | ||||
Capture identity, i.e., MCC1. It allows all the individual Captures | ||||
contained in the MCC to be referenced by a single MCC identity.</t> | ||||
<t>The example below shows the use of a Multiple Content Capture:</t> | ||||
<table anchor="ref-multiple-content-capture-concept" align="center"> | ||||
<name>Multiple Content Capture Concept</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> </th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">{MC attributes}</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC2</td> | ||||
<td align="left">{MC attributes}</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC3</td> | ||||
<td align="left">{MC attributes}</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC1(VC1,VC2,VC3)</td> | ||||
<td align="left">{MC and MCC attributes}</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(MCC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
This indicates that MCC1 is a single Capture that contains the | ||||
Captures VC1, VC2, and VC3, according to any MCC1 attributes.</t> | ||||
<section anchor="s-7.2.1" numbered="true" toc="default"> | ||||
<name>MCC Attributes</name> | ||||
<t> | ||||
Media Capture attributes may be associated with the MCC instance | ||||
and the Single Media Captures that the MCC references. A Provider | ||||
should avoid providing conflicting attribute values between the MCC | ||||
   and Single Media Captures. Where there is a conflict, the attributes
   of the MCC override any that may be present in the individual
   Captures.</t>
<t> | ||||
   A Provider <bcp14>MAY</bcp14> include as much or as little of the
   original source
Capture information as it requires.</t> | ||||
<t> | ||||
There are MCC-specific attributes that <bcp14>MUST</bcp14> only be used with | ||||
Multiple Content Captures. These are described in the sections | ||||
below. The attributes described in <xref target="s-7.1.1" format="default"/> | ||||
<bcp14>MAY</bcp14> also be used | ||||
with MCCs.</t> | ||||
<t> | ||||
The spatial-related attributes of an MCC indicate its Area of | ||||
Capture and Point of Capture within the Scene, just like any other | ||||
Media Capture. The spatial information does not imply anything | ||||
about how other Captures are composed within an MCC.</t> | ||||
<t>For example: a virtual Scene could be constructed for the MCC | ||||
Capture with two Video Captures with a MaxCaptures attribute set | ||||
to 2 and an Area of Capture attribute provided with an overall | ||||
area. Each of the individual Captures could then also include an | ||||
Area of Capture attribute with a subset of the overall area. | ||||
The Consumer would then know how each Capture is related to others | ||||
within the Scene, but not the relative position of the individual | ||||
Captures within the composed Capture. | ||||
</t> | ||||
<table anchor="table_2"> | ||||
<name>Example of MCC and Single Media Capture Attributes</name> | ||||
<thead> | ||||
<tr><th align="left">Capture Scene #1</th><th/></tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td>VC1</td> | ||||
<td align="right"> | ||||
<ul empty="true" spacing="compact"> | ||||
<li>AreaofCapture=(0,0,0)(9,0,0)</li> | ||||
<li>(0,0,9)(9,0,9)</li> | ||||
</ul> | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>VC2</td> | ||||
<td align="right"> | ||||
<ul empty="true" spacing="compact"> | ||||
<li>AreaofCapture=(10,0,0)(19,0,0)</li> | ||||
<li>(10,0,9)(19,0,9)</li> | ||||
</ul> | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC1(VC1,VC2)</td> | ||||
<td align="right"> | ||||
<ul empty="true" spacing="compact"> | ||||
<li>MaxCaptures=2</li> | ||||
<li>AreaofCapture=(0,0,0)(19,0,0)</li> | ||||
<li>(0,0,9)(19,0,9)</li> | ||||
</ul> | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV(MCC1)</td> | ||||
<td/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
The subsections below describe the MCC-only attributes.</t> | ||||
<section anchor="s-7.2.1.1" numbered="true" toc="default"> | ||||
<name>MaxCapture: Maximum Number of Captures within an MCC</name> | ||||
<t> | ||||
The MaxCaptures attribute indicates the maximum | ||||
number of individual Captures that may appear in a Capture Encoding | ||||
at a time. The actual number at any given time can be less than or | ||||
equal to this maximum. It may be used to derive how the Single | ||||
Media Captures within the MCC are composed/switched with regard | ||||
to space and time.</t> | ||||
<t> | ||||
A Provider can indicate that the number of Captures in an MCC | ||||
Capture Encoding is equal ("=") to the MaxCaptures value or that | ||||
there may be any number of Captures up to and including ("<=") the | ||||
MaxCaptures value. This allows a Provider to distinguish between an | ||||
MCC that purely represents a composition of sources and an MCC | ||||
that represents switched sources or switched and composed sources.</t> | ||||
<t> | ||||
MaxCaptures may be set to one so that only content related to one | ||||
of the sources is shown in the MCC Capture Encoding at a time, or | ||||
it may be set to any value up to the total number of Source Media | ||||
Captures in the MCC.</t> | ||||
<t> | ||||
The bullets below describe how the setting of MaxCaptures versus the | ||||
number of Captures in the MCC affects how sources appear in a | ||||
Capture Encoding:</t> | ||||
<ul spacing="normal"> | ||||
<li>A switched case occurs when | ||||
MaxCaptures is set to <= 1 and the number of Captures in | ||||
the MCC is greater than 1 (or not specified) in the MCC. Zero | ||||
or one Captures may be switched into the Capture Encoding. Note: | ||||
zero is allowed because of the "<=".</li> | ||||
<li>A switched case occurs when MaxCaptures is set to = 1 and | ||||
the number of Captures in the MCC is greater than 1 (or not | ||||
specified) in the MCC. Only one Capture source is contained in | ||||
a Capture Encoding at a time.</li> | ||||
<li>A switched and composed case occurs when MaxCaptures is set | ||||
to <= N (with N > 1) and the number of Captures in the | ||||
MCC is greater than N (or not specified). The Capture Encoding | ||||
may contain purely switched sources (i.e., <=2 allows for one | ||||
source on its own), or it may contain composed and switched | ||||
sources (i.e., a composition of two sources switched between the | ||||
sources).</li> | ||||
<li>A switched and composed case occurs when MaxCaptures is set | ||||
to = N (with N > 1) and the number of Captures in the MCC | ||||
is greater than N (or not specified). The Capture Encoding | ||||
contains composed and switched sources (i.e., a composition of | ||||
N sources switched between the sources). It is not possible to | ||||
have a single source.</li> | ||||
<li>A switched and composed case occurs when MaxCaptures is set | ||||
<= to the number of Captures in the MCC. The Capture | ||||
Encoding may contain Media switched between any number (up to | ||||
the MaxCaptures) of composed sources.</li> | ||||
<li>A composed case occurs when MaxCaptures is set = to the number | ||||
of Captures in the | ||||
MCC. All the sources are composed into | ||||
a single Capture Encoding.</li> | ||||
</ul> | ||||
<t> | ||||
If this attribute is not set, then as a default, it is assumed that all | ||||
source Media Capture content can appear concurrently in the Capture | ||||
Encoding associated with the MCC.</t> | ||||
<t> | ||||
        For example, the use of MaxCaptures equal to 1 on an MCC with three
        Video Captures, VC1, VC2, and VC3, would indicate that the Advertiser
        would switch between VC1, VC2, and VC3 in the Capture Encoding, as
        at most one Capture may appear at a time.</t>
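            <t>
            As a non-normative illustration of the cases listed above, the
            following Python sketch (function and value names are illustrative)
            classifies an MCC from its MaxCaptures operator and value and the
            number of referenced Captures.</t>
            <figure>
              <name>Non-Normative Sketch: MaxCaptures Cases (Python)</name>
              <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch of the MaxCaptures cases described above.
# operator is "=" or "<="; num_captures is None when the MCC does not
# reference specific Captures.
def classify(max_captures, operator, num_captures=None):
    more_or_unspecified = num_captures is None or num_captures > max_captures
    if max_captures == 1 and more_or_unspecified:
        return "switched"                 # at most one source at a time
    if more_or_unspecified:
        return "switched and composed"    # up to max_captures sources
    if operator == "<=":
        return "switched and composed"    # MaxCaptures <= number of Captures
    return "composed"                     # MaxCaptures = number of Captures

print(classify(1, "<=", 3))   # switched
print(classify(2, "<=", 5))   # switched and composed
print(classify(3, "=", 3))    # composed
]]></artwork>
            </figure>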
</section> | ||||
<section anchor="s-7.2.1.2" numbered="true" toc="default"> | ||||
<name>Policy</name> | ||||
<t> | ||||
The Policy MCC attribute indicates the criteria that the Provider | ||||
uses to determine when and/or where Media content appears in the | ||||
Capture Encoding related to the MCC.</t> | ||||
<t> | ||||
The attribute is in the form of a token that indicates the policy | ||||
and an index representing an instance of the policy. The same | ||||
index value can be used for multiple MCCs.</t> | ||||
<t> | ||||
The tokens are as follows: | ||||
</t> | ||||
<dl newline="false" spacing="normal"> | ||||
<dt>SoundLevel:</dt> | ||||
<dd>This indicates that the content of the MCC is | ||||
determined by a sound-level-detection algorithm. The loudest | ||||
(active) speaker (or a previous speaker, depending on the index | ||||
value) is contained in the MCC.</dd> | ||||
<dt>RoundRobin:</dt> | ||||
<dd>This indicates that the content of the MCC is | ||||
determined by a time-based algorithm. For example, the Provider | ||||
provides content from a particular source for a period of time and | ||||
then provides content from another source, and so on.</dd> | ||||
</dl> | ||||
<t> | ||||
An index is used to represent an instance in the policy setting. An | ||||
index of 0 represents the most current instance of the policy, i.e., | ||||
the active speaker, 1 represents the previous instance, i.e., the | ||||
previous active speaker, and so on.</t> | ||||
<t> | ||||
The following example shows a case where the Provider provides two | ||||
Media Streams, one showing the active speaker and a second Stream | ||||
showing the previous speaker.</t> | ||||
<table anchor="ref-example-policy-mcc-attribute-usage" align="center | ||||
"> | ||||
<name>Example Policy MCC Attribute Usage</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> </th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC2</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC1(VC1,VC2)</td> | ||||
<td align="left">Policy=SoundLevel:0<br/> | ||||
MaxCaptures=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC2(VC1,VC2)</td> | ||||
<td align="left">Policy=SoundLevel:1<br/> | ||||
MaxCaptures=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(MCC1,MCC2)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
</section> | ||||
<section anchor="s-7.2.1.3" numbered="true" toc="default"> | ||||
<name>SynchronizationID: Synchronization Identity</name> | ||||
<t> | ||||
The SynchronizationID MCC attribute indicates how the | ||||
individual Captures in multiple MCC Captures are synchronized. To | ||||
indicate that the Capture Encodings associated with MCCs contain | ||||
Captures from the same source at the same time, a Provider should | ||||
set the same SynchronizationID on each of the concerned | ||||
MCCs. It is the Provider that determines what the source for the | ||||
Captures is, so a Provider can choose how to group together Single | ||||
Media Captures into a combined "source" for the purpose of | ||||
switching them together to keep them synchronized according to the | ||||
SynchronizationID attribute. For example, when the Provider is in | ||||
an MCU, it may determine that each separate CLUE Endpoint is a | ||||
remote source of Media. The SynchronizationID may be used | ||||
across Media types, i.e., to synchronize audio- and video-related | ||||
MCCs.</t> | ||||
<t> | ||||
Without this attribute it is assumed that multiple MCCs may provide | ||||
content from different sources at any particular point in time.</t> | ||||
<t>For example: | ||||
</t> | ||||
<table anchor="table_4"> | ||||
<name>Example SynchronizationID MCC Attribute Usage</name> | ||||
<tbody> | ||||
<tr><th>Capture Scene #1</th> <th/></tr> | ||||
<tr><td>VC1</td> <td>Description=Left</td></tr> | ||||
<tr><td>VC2</td> <td>Description=Center</td></t | ||||
r> | ||||
<tr><td>VC3</td> <td>Description=Right</td></tr | ||||
> | ||||
<tr><td>AC1</td> <td>Description=Room</td></tr> | ||||
<tr><td>CSV(VC1,VC2,VC3)</td> <td/></tr> | ||||
<tr><td>CSV(AC1)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #2</th> <th/></tr> | ||||
<tr><td>VC4</td> <td>Description=Left</td></tr> | ||||
<tr><td>VC5</td> <td>Description=Center</td></t | ||||
r> | ||||
<tr><td>VC6</td> <td>Description=Right</td></tr | ||||
> | ||||
<tr><td>AC2</td> <td>Description=Room</td></tr> | ||||
<tr><td>CSV(VC4,VC5,VC6)</td> <td/></tr> | ||||
<tr><td>CSV(AC2)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #3</th> <th/></tr> | ||||
<tr><td>VC7</td> <td/></tr> | ||||
<tr><td>AC3</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #4</th> <th/></tr> | ||||
<tr><td>VC8</td> <td/></tr> | ||||
<tr><td>AC4</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #5</th> <th/></tr> | ||||
<tr><td>MCC1(VC1,VC4,VC7)</td> <td>SynchronizationID=1<br/>M | ||||
axCaptures=1</td></tr> | ||||
<tr><td>MCC2(VC2,VC5,VC8)</td> <td>SynchronizationID=1<br/>M | ||||
axCaptures=1</td></tr> | ||||
<tr><td>MCC3(VC3,VC6)</td> <td>MaxCaptures=1</td></tr> | ||||
<tr><td>MCC4(AC1,AC2,AC3,AC4)</td> <td>SynchronizationID=1<br/>M | ||||
axCaptures=1</td></tr> | ||||
<tr><td>CSV(MCC1,MCC2,MCC3)</td> <td/></tr> | ||||
<tr><td>CSV(MCC4)</td> <td/></tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
The above Advertisement would indicate that MCC1, MCC2, MCC3, and | ||||
MCC4 make up a Capture Scene. There would be four Capture | ||||
Encodings (one for each MCC). Because MCC1 and MCC2 have the same | ||||
        SynchronizationID, the Encodings from MCC1 and MCC2
        would together carry content from only Capture Scene 1, or only
        Capture Scene 2, or the combination of VC7 and VC8, at any particular
        point in time. In this case, the Provider has decided the sources
to be synchronized are Scene #1, Scene #2, and Scene #3 and #4 | ||||
together. The Encoding from MCC3 would not be synchronized with | ||||
MCC1 or MCC2. As MCC4 also has the same SynchronizationID | ||||
as MCC1 and MCC2, the content of the audio Encoding will be | ||||
synchronized with the video content.</t> | ||||
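            <t>
            The switching behavior implied by the SynchronizationID attribute
            can be sketched non-normatively as follows (Python; the data
            structure and the grouping of Scenes into sources are illustrative
            and reflect the Provider decision described above):</t>
            <figure>
              <name>Non-Normative Sketch: Switching by SynchronizationID (Python)</name>
              <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: MCCs sharing a SynchronizationID switch together,
# i.e., at any instant they carry Captures from the same chosen source.
mccs = {
    "MCC1": {"sync_id": 1,
             "by_source": {"Scene1": "VC1", "Scene2": "VC4", "Scene3+4": "VC7"}},
    "MCC2": {"sync_id": 1,
             "by_source": {"Scene1": "VC2", "Scene2": "VC5", "Scene3+4": "VC8"}},
    "MCC4": {"sync_id": 1,
             "by_source": {"Scene1": "AC1", "Scene2": "AC2", "Scene3+4": "AC3"}},
    "MCC3": {"sync_id": None,
             "by_source": {"Scene1": "VC3", "Scene2": "VC6"}},
}

def switch_to(source):
    """Provider-side switch: all MCCs with SynchronizationID=1 follow it."""
    return {name: m["by_source"][source]
            for name, m in mccs.items() if m["sync_id"] == 1}

print(switch_to("Scene2"))  # {'MCC1': 'VC4', 'MCC2': 'VC5', 'MCC4': 'AC2'}
]]></artwork>
            </figure>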
</section> | ||||
<section anchor="s-7.2.1.4" numbered="true" toc="default"> | ||||
<name>Allow Subset Choice</name> | ||||
<t> | ||||
The Allow Subset Choice MCC attribute is a boolean value, | ||||
indicating whether or not the Provider allows the Consumer to | ||||
choose a specific subset of the Captures referenced by the MCC. | ||||
If this attribute is true, and the MCC references other Captures, | ||||
then the Consumer <bcp14>MAY</bcp14> select (in a Configure message) a specif | ||||
ic | ||||
subset of those Captures to be included in the MCC, and the | ||||
Provider <bcp14>MUST</bcp14> then include only that subset. If this attribut | ||||
e is | ||||
false, or the MCC does not reference other Captures, then the | ||||
Consumer <bcp14>MUST NOT</bcp14> select a subset.</t> | ||||
</section> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-7.3" numbered="true" toc="default"> | ||||
<name>Capture Scene</name> | ||||
<t> | ||||
In order for a Provider's individual Captures to be used | ||||
effectively by a Consumer, the Provider organizes the Captures into | ||||
one or more Capture Scenes, with the structure and contents of | ||||
these Capture Scenes being sent from the Provider to the Consumer | ||||
in the Advertisement.</t> | ||||
<t> | ||||
A Capture Scene is a structure representing a spatial region | ||||
containing one or more Capture Devices, each capturing Media | ||||
representing a portion of the region. A Capture Scene includes one | ||||
or more Capture Scene Views (CSVs), with each CSV including one or | ||||
more Media Captures of the same Media type. There can also be | ||||
Media Captures that are not included in a CSV. A | ||||
Capture Scene represents, for example, the video image of a group | ||||
of people seated next to each other, along with the sound of their | ||||
voices, which could be represented by some number of VCs and ACs in | ||||
the CSVs. An MCU can also describe in Capture | ||||
Scenes what it constructs from Media Streams it receives.</t> | ||||
<t> | ||||
A Provider <bcp14>MAY</bcp14> advertise one or more Capture Scenes. What | ||||
constitutes an entire Capture Scene is up to the Provider. A | ||||
simple Provider might typically use one Capture Scene for | ||||
participant Media (live video from the room cameras) and another | ||||
        Capture Scene for a computer-generated presentation. In
        more-complex systems, the use of additional Capture Scenes is also
        sensible. For example, a classroom may advertise two Capture
        Scenes involving live video: one including only the camera
        capturing the instructor (and associated audio), the other
        including camera(s) capturing students (and associated audio).</t>
<t> | ||||
A Capture Scene <bcp14>MAY</bcp14> (and typically will) include more than one | ||||
type | ||||
of Media. For example, a Capture Scene can include several CSVs | ||||
for Video Captures and several CSVs for | ||||
Audio Captures. A particular Capture <bcp14>MAY</bcp14> be included in more | ||||
than | ||||
one CSV.</t> | ||||
<t> | ||||
A Provider <bcp14>MAY</bcp14> express Spatial Relationships between Captures | ||||
that | ||||
are included in the same Capture Scene. However, there is no | ||||
Spatial Relationship between Media Captures from different Capture | ||||
Scenes. In other words, Capture Scenes each use their own spatial | ||||
measurement system as outlined in <xref target="s-6" format="default"/>.</t> | ||||
<t> | ||||
A Provider arranges Captures in a Capture Scene to help the | ||||
Consumer choose which Captures it wants to Render. The CSVs | ||||
in a Capture Scene are different alternatives the | ||||
Provider is suggesting for representing the Capture Scene. Each | ||||
CSV is given an advertisement-unique identity. The | ||||
order of CSVs within a Capture Scene has no | ||||
significance. The Media Consumer can choose to receive all Media | ||||
Captures from one CSV for each Media type (e.g., | ||||
audio and video), or it can pick and choose Media Captures | ||||
regardless of how the Provider arranges them in CSVs. | ||||
Different CSVs of the same Media type are | ||||
not necessarily mutually exclusive alternatives. Also note that | ||||
the presence of multiple CSVs (with potentially | ||||
multiple Encoding options in each view) in a given Capture Scene | ||||
does not necessarily imply that a Provider is able to serve all the | ||||
associated Media simultaneously (although the construction of such | ||||
an over-rich Capture Scene is probably not sensible in many cases). | ||||
What a Provider can send simultaneously is determined through the | ||||
        Simultaneous Transmission Set mechanism, described in <xref target="s-8" format="default"/>.</t>
<t> | ||||
Captures within the same CSV <bcp14>MUST</bcp14> be of the same | ||||
Media type -- it is not possible to mix audio and Video Captures in | ||||
the same CSV, for instance. The Provider <bcp14>MUST</bcp14> be | ||||
capable of encoding and sending all Captures (that have an Encoding | ||||
Group) in a single CSV simultaneously. The order of | ||||
Captures within a CSV has no significance. A | ||||
Consumer can decide to receive all the Captures in a single CSV, | ||||
but a Consumer could also decide to receive just a | ||||
subset of those Captures. A Consumer can also decide to receive | ||||
Captures from different CSVs, all subject to the | ||||
constraints set by Simultaneous Transmission Sets, as discussed in | ||||
<xref target="s-8" format="default"/>.</t> | ||||
<t> | ||||
When a Provider advertises a Capture Scene with multiple CSVs, it | ||||
is essentially signaling that there are multiple representations of | ||||
the same Capture Scene available. In some cases, these multiple | ||||
views would be used simultaneously (for instance, a "video view" and | ||||
an "audio view"). In some cases, the views would conceptually be | ||||
alternatives (for instance, a view consisting of three Video | ||||
Captures covering the whole room versus a view consisting of just a | ||||
single Video Capture covering only the center of a room). In this | ||||
latter example, one sensible choice for a Consumer would be to | ||||
indicate (through its Configure and possibly through an additional | ||||
offer/answer exchange) the Captures of that CSV that | ||||
most closely matched the Consumer's number of display devices or | ||||
screen layout.</t> | ||||
<t> | ||||
The following is an example of four potential CSVs for | ||||
an Endpoint-style Provider:</t> | ||||
<ol spacing="normal" type="1"> | ||||
<li>(VC0, VC1, VC2) - left, center, and right camera Video Captures</l | ||||
i> | ||||
<li>(MCC3) - Video Capture associated with loudest room segment</li> | ||||
<li>(VC4) - Video Capture zoomed out view of all people in the room</l | ||||
i> | ||||
<li>(AC0) - main audio</li> | ||||
</ol> | ||||
<t> | ||||
The first view in this Capture Scene example is a list of Video | ||||
Captures that have a Spatial Relationship to each other. | ||||
Determination of the order of these Captures (VC0, VC1, and VC2) for | ||||
rendering purposes is accomplished through use of their Area of | ||||
Capture attributes. The second view (MCC3) and the third view | ||||
(VC4) are alternative representations of the same room's video, | ||||
which might be better suited to some Consumers' rendering | ||||
capabilities. The inclusion of the Audio Capture in the same | ||||
Capture Scene indicates that AC0 is associated with all of those | ||||
Video Captures, meaning it comes from the same spatial region. | ||||
Therefore, if audio were to be Rendered at all, this audio would be | ||||
the correct choice, irrespective of which Video Captures were | ||||
chosen.</t> | ||||
<section anchor="s-7.3.1" numbered="true" toc="default"> | ||||
<name>Capture Scene Attributes</name> | ||||
<t> | ||||
Capture Scene attributes can be applied to Capture Scenes as well | ||||
as to individual Media Captures. Attributes specified at this | ||||
level apply to all constituent Captures. Capture Scene attributes | ||||
include the following:</t> | ||||
<ul spacing="normal"> | ||||
<li>Human-readable description of the Capture Scene, which could | ||||
be in multiple languages;</li> | ||||
<li>xCard Scene information</li> | ||||
<li>Scale information ("Millimeters", "Unknown Scale", "No Scale"), | ||||
as | ||||
described in <xref target="s-6" format="default"/>.</li> | ||||
</ul> | ||||
<section anchor="s-7.3.1.1" numbered="true" toc="default"> | ||||
<name>Scene Information</name> | ||||
<t> | ||||
The Scene Information attribute provides information regarding the | ||||
Capture Scene rather than individual participants. The Provider | ||||
may gather the information automatically or manually from a | ||||
variety of sources. The Scene Information attribute allows a | ||||
        Provider to indicate information, such as organizational or
        geographic details, that lets a Consumer determine which
        Capture Scenes are of interest and then perform Capture
        selection. It also allows a Consumer to Render information
regarding the Scene or to use it for further processing.</t> | ||||
<t> | ||||
As per <xref target="s-7.1.1.10" format="default"/>, the xCard format is used | ||||
to convey this | ||||
information and the Provider may supply a minimal set of | ||||
information or a larger set of information.</t> | ||||
<t> | ||||
        In order to keep CLUE messages compact, the Provider <bcp14>SHOULD</bcp14> use
a | ||||
URI to point to any LOGO, PHOTO, or SOUND contained in the xCard | ||||
rather than transmitting the LOGO, PHOTO, or SOUND data in a CLUE | ||||
message.</t> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-7.3.2" numbered="true" toc="default"> | ||||
<name>Capture Scene View Attributes</name> | ||||
<t> | ||||
A Capture Scene can include one or more CSVs in | ||||
addition to the Capture-Scene-wide attributes described above. | ||||
CSV attributes apply to the CSV as a | ||||
whole, i.e., to all Captures that are part of the CSV. | ||||
</t> | ||||
<t>CSV attributes include the following: | ||||
</t> | ||||
<ul spacing="normal"> | ||||
<li>A human-readable description (which could be in multiple | ||||
languages) of the CSV.</li> | ||||
</ul> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-7.4" numbered="true" toc="default"> | ||||
<name>Global View List</name> | ||||
<t> | ||||
An Advertisement can include an optional Global View list. Each | ||||
item in this list is a Global View. The Provider can include | ||||
multiple Global Views, to allow a Consumer to choose sets of | ||||
Captures appropriate to its capabilities or application. The | ||||
choice of how to make these suggestions in the Global View list | ||||
for what represents all the Scenes for which the Provider can send | ||||
Media is up to the Provider. This is very similar to how each CSV | ||||
represents a particular Scene.</t> | ||||
<t> | ||||
As an example, suppose an Advertisement has three Scenes, and each | ||||
Scene has three CSVs, ranging from one to three Video Captures in | ||||
each CSV. The Provider is advertising a total of nine Video | ||||
Captures across three Scenes. The Provider can use the Global | ||||
View list to suggest alternatives for Consumers that can't receive | ||||
all nine Video Captures as separate Media Streams. For | ||||
accommodating a Consumer that wants to receive three Video | ||||
Captures, a Provider might suggest a Global View containing just a | ||||
single CSV with three Captures and nothing from the other two | ||||
Scenes. Or a Provider might suggest a Global View containing | ||||
three different CSVs, one from each Scene, with a single Video | ||||
Capture in each.</t> | ||||
<t>Some additional rules: | ||||
</t> | ||||
<ul spacing="normal"> | ||||
<li>The ordering of Global Views in the Global View list is | ||||
insignificant.</li> | ||||
<li>The ordering of CSVs within each Global View is | ||||
insignificant.</li> | ||||
<li>A particular CSV may be used in multiple Global Views.</li> | ||||
<li>The Provider must be capable of encoding and sending all | ||||
Captures within the CSVs of a given Global View | ||||
simultaneously.</li> | ||||
</ul> | ||||
<t> | ||||
The following figure shows an example of the structure of Global | ||||
Views in a Global View List.</t> | ||||
<figure anchor="ref-global-view-list-structure"> | ||||
<name>Global View List Structure</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
........................................................ | ||||
. Advertisement . | ||||
. . | ||||
. +--------------+ +-------------------------+ . | ||||
. |Scene 1 | |Global View List | . | ||||
. | | | | . | ||||
. | CSV1 (v)<----------------- Global View (CSV 1) | . | ||||
. | <-------. | | . | ||||
. | | *--------- Global View (CSV 1,5) | . | ||||
. | CSV2 (v) | | | | . | ||||
. | | | | | . | ||||
. | CSV3 (v)<---------*------- Global View (CSV 3,5) | . | ||||
. | | | | | | . | ||||
. | CSV4 (a)<----------------- Global View (CSV 4) | . | ||||
. | <-----------. | | . | ||||
. +--------------+ | | *----- Global View (CSV 4,6) | . | ||||
. | | | | | . | ||||
. +--------------+ | | | +-------------------------+ . | ||||
. |Scene 2 | | | | . | ||||
. | | | | | . | ||||
. | CSV5 (v)<-------' | | . | ||||
. | <---------' | . | ||||
. | | | (v) = video . | ||||
. | CSV6 (a)<-----------' (a) = audio . | ||||
. | | . | ||||
. +--------------+ . | ||||
`......................................................' | ||||
]]></artwork> | ||||
</figure> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-8" numbered="true" toc="default"> | ||||
<name>Simultaneous Transmission Set Constraints</name> | ||||
<t> | ||||
In many practical cases, a Provider has constraints or limitations | ||||
on its ability to send Captures simultaneously. One type of | ||||
limitation is caused by the physical limitations of capture | ||||
mechanisms; these constraints are represented by a Simultaneous | ||||
Transmission Set. The second type of limitation reflects the | ||||
encoding resources available, such as bandwidth or video encoding | ||||
throughput (macroblocks/second). This type of constraint is | ||||
captured by Individual Encodings and Encoding Groups, discussed | ||||
below.</t> | ||||
<t> | ||||
Some Endpoints or MCUs can send multiple Captures simultaneously; | ||||
however, sometimes there are constraints that limit which Captures | ||||
can be sent simultaneously with other Captures. A device may not | ||||
be able to be used in different ways at the same time. Provider | ||||
Advertisements are made so that the Consumer can choose one of | ||||
several possible mutually exclusive usages of the device. This | ||||
type of constraint is expressed in a Simultaneous Transmission Set, | ||||
which lists all the Captures of a particular Media type (e.g., | ||||
audio, video, or text) that can be sent at the same time. There are | ||||
different Simultaneous Transmission Sets for each Media type in the | ||||
Advertisement. This is easier to show in an example.</t> | ||||
<t> | ||||
Consider the example of a room system where there are three cameras, | ||||
each of which can send a separate Capture covering two people | ||||
each: VC0, VC1, and VC2. The middle camera can also zoom out (using an | ||||
optical zoom lens) and show all six people, VC3. But the middle | ||||
camera cannot be used in both modes at the same time; it has to | ||||
either show the space where two participants sit or the whole six | ||||
seats, but not both at the same time. As a result, VC1 and VC3 | ||||
cannot be sent simultaneously.</t> | ||||
<t> | ||||
Simultaneous Transmission Sets are expressed as sets of the Media | ||||
Captures that the Provider could transmit at the same time (though, | ||||
in some cases, it is not intuitive to do so). If a Multiple | ||||
Content Capture is included in a Simultaneous Transmission Set, it | ||||
indicates that the Capture Encoding associated with it could be | ||||
      transmitted at the same time as the other Captures within the
Simultaneous Transmission Set. It does not imply that the Single | ||||
Media Captures contained in the Multiple Content Capture could all | ||||
be transmitted at the same time.</t> | ||||
<t> | ||||
In this example, the two Simultaneous Transmission Sets are shown in | ||||
<xref target="ref-two-simultaneous-transmission-sets" format="default"/>. If | ||||
a Provider advertises one or more mutually exclusive | ||||
Simultaneous Transmission Sets, then, for each Media type, the | ||||
Consumer <bcp14>MUST</bcp14> ensure that it chooses Media Captures that lie w | ||||
holly | ||||
within one of those Simultaneous Transmission Sets.</t> | ||||
<table anchor="ref-two-simultaneous-transmission-sets" align="center"> | ||||
<name>Two Simultaneous Transmission Sets</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left">Simultaneous Sets</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">{VC0, VC1, VC2}</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">{VC0, VC3, VC2}</td> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
      A Provider <bcp14>MAY</bcp14> include the Simultaneous Transmission
Sets in its Advertisement. These constraints apply across all the | ||||
Capture Scenes in the Advertisement. It is a syntax-conformance | ||||
      requirement that the Simultaneous Transmission Sets <bcp14>MUST</bcp14> allow all
      the Media Captures in any particular CSV to be used
      simultaneously. Similarly, the Simultaneous Transmission Sets <bcp14>MUST</bcp14>
      reflect the simultaneity expressed by any Global View.</t>
<t> | ||||
      For shorthand convenience, a Provider <bcp14>MAY</bcp14> describe a Simultaneous
      Transmission Set in terms of CSVs and Capture
Scenes. If a CSV is included in a Simultaneous | ||||
Transmission Set, then all Media Captures in the CSV | ||||
are included in the Simultaneous Transmission Set. If a Capture | ||||
Scene is included in a Simultaneous Transmission Set, then all its | ||||
CSVs (of the corresponding Media type) are included | ||||
in the Simultaneous Transmission Set. The end result reduces to a | ||||
set of Media Captures, of a particular Media type, in either case.</t> | ||||
<t> | ||||
If an Advertisement does not include Simultaneous Transmission | ||||
Sets, then the Provider <bcp14>MUST</bcp14> be able to simultaneously provide | ||||
all | ||||
the Captures from any one CSV of each Media type from each Capture | ||||
Scene. Likewise, if there are no Simultaneous Transmission Sets | ||||
      and there is a Global View list, then the Provider <bcp14>MUST</bcp14> be able to
simultaneously provide all the Captures from any particular Global | ||||
View (of each Media type) from the Global View list.</t> | ||||
<t> | ||||
If an Advertisement includes multiple CSVs in a | ||||
Capture Scene, then the Consumer <bcp14>MAY</bcp14> choose one CSV | ||||
      for each Media type, or it <bcp14>MAY</bcp14> choose individual Captures based on the
Simultaneous Transmission Sets.</t> | ||||
</section> | ||||
<section anchor="s-9" numbered="true" toc="default"> | ||||
<name>Encodings</name> | ||||
<t> | ||||
Individual Encodings and Encoding Groups are CLUE's mechanisms | ||||
allowing a Provider to signal its limitations for sending Captures, | ||||
or combinations of Captures, to a Consumer. Consumers can map the | ||||
Captures they want to receive onto the Encodings, with the Encoding | ||||
      parameters they want. As for the relationship between the CLUE-specified
      mechanisms based on Encodings and the SIP offer/answer
exchange, please refer to <xref target="s-5" format="default"/>.</t> | ||||
<section anchor="s-9.1" numbered="true" toc="default"> | ||||
<name>Individual Encodings</name> | ||||
<t> | ||||
An Individual Encoding represents a way to encode a Media Capture | ||||
as a Capture Encoding, to be sent as an encoded Media Stream from | ||||
the Provider to the Consumer. An Individual Encoding has a set of | ||||
parameters characterizing how the Media is encoded.</t> | ||||
<t> | ||||
Different Media types have different parameters, and different | ||||
encoding algorithms may have different parameters. An Individual | ||||
Encoding can be assigned to at most one Capture Encoding at any | ||||
given time.</t> | ||||
<t> | ||||
Individual Encoding parameters are represented in SDP | ||||
<xref target="RFC4566" format="default"/>, | ||||
not in CLUE messages. For example, for a video Encoding using | ||||
        H.26x compression technologies, this can include parameters such
        as the following:
</t> | ||||
<ul spacing="compact"> | ||||
<li>Maximum bandwidth;</li> | ||||
<li>Maximum picture size in pixels;</li> | ||||
<li>Maximum number of pixels to be processed per second;</li> | ||||
</ul> | ||||
<t> | ||||
The bandwidth parameter is the only one that specifically relates | ||||
to a CLUE Advertisement, as it can be further constrained by the | ||||
maximum group bandwidth in an Encoding Group.</t> | ||||
</section> | ||||
<section anchor="s-9.2" numbered="true" toc="default"> | ||||
<name>Encoding Group</name> | ||||
<t> | ||||
An Encoding Group includes a set of one or more Individual | ||||
Encodings, and parameters that apply to the group as a whole. By | ||||
grouping multiple Individual Encodings together, an Encoding Group | ||||
describes additional constraints on bandwidth for the group. A | ||||
        single Encoding Group <bcp14>MAY</bcp14> refer to Encodings for different Media
        types.</t>
<t>The Encoding Group data structure contains: | ||||
</t> | ||||
<ul spacing="normal"> | ||||
<li>Maximum bitrate for all Encodings in the group combined;</li> | ||||
<li>A list of identifiers for the Individual Encodings belonging to th | ||||
e group.</li> | ||||
</ul> | ||||
<t> | ||||
When the Individual Encodings in a group are instantiated into | ||||
        Capture Encodings, each Capture Encoding has a bitrate that <bcp14>MUST</bcp14> be
        less than or equal to the max bitrate for the particular Individual
        Encoding. The "maximum bitrate for all Encodings in the group"
        parameter gives the additional restriction that the sum of all the
        individual Capture Encoding bitrates <bcp14>MUST</bcp14> be less than or equal to
        this group value.</t>
<t> | ||||
The following diagram illustrates one example of the structure of a | ||||
Media Provider's Encoding Groups and their contents.</t> | ||||
<figure anchor="ref-encoding-group-structure"> | ||||
<name>Encoding Group Structure</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
,-------------------------------------------------. | ||||
| Media Provider | | ||||
| | | ||||
| ,--------------------------------------. | | ||||
| | ,--------------------------------------. | | ||||
| | | ,--------------------------------------. | | ||||
| | | | Encoding Group | | | ||||
| | | | ,-----------. | | | ||||
| | | | | | ,---------. | | | ||||
| | | | | | | | ,---------.| | | ||||
| | | | | Encoding1 | |Encoding2| |Encoding3|| | | ||||
| `.| | | | | | `---------'| | | ||||
| `.| `-----------' `---------' | | | ||||
| `--------------------------------------' | | ||||
`-------------------------------------------------' | ||||
]]></artwork> | ||||
</figure> | ||||
<t>A Provider advertises one or more Encoding Groups. Each Encoding | ||||
Group includes one or more Individual Encodings. Each Individual | ||||
Encoding can represent a different way of encoding Media. For | ||||
example, one Individual Encoding may be 1080p60 video, another could | ||||
be 720p30, with a third being 352x288p30, all in, for example, H.264 | ||||
format.</t> | ||||
<t>While a typical three-codec/display system might have one Encoding | ||||
Group per "codec box" (physical codec, connected to one camera and | ||||
one screen), there are many possibilities for the number of | ||||
Encoding Groups a Provider may be able to offer and for the | ||||
Encoding values in each Encoding Group.</t> | ||||
<t> | ||||
There is no requirement for all Encodings within an Encoding Group | ||||
to be instantiated at the same time.</t> | ||||
</section> | ||||
<section anchor="s-9.3" numbered="true" toc="default"> | ||||
<name>Associating Captures with Encoding Groups</name> | ||||
<t> | ||||
Each Media Capture, including MCCs, <bcp14>MAY</bcp14> be associated with one | ||||
Encoding Group. To be eligible for configuration, a Media Capture | ||||
<bcp14>MUST</bcp14> be associated with one Encoding Group, which is used to | ||||
instantiate that Capture into a Capture Encoding. When an MCC is | ||||
configured, all the Media Captures referenced by the MCC will appear | ||||
in the Capture Encoding according to the attributes of the chosen | ||||
Encoding of the MCC. This allows an Advertiser to specify Encoding | ||||
attributes associated with the Media Captures without the need to | ||||
provide an individual Capture Encoding for each of the inputs.</t> | ||||
<t> | ||||
If an Encoding Group is assigned to a Media Capture referenced by | ||||
the MCC, it indicates that this Capture may also have an individual | ||||
Capture Encoding.</t> | ||||
<t>For example: | ||||
</t> | ||||
<table anchor="ref-example-usage-of-encoding-with-mcc-and-source-capture | ||||
s" align="center"> | ||||
<name>Example Usage of Encoding with MCC and Source Captures</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left">Capture Scene #1</th> | ||||
<th align="left"> </th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">EncodeGroupID=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC2</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC1(VC1,VC2)</td> | ||||
<td align="left">EncodeGroupID=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(VC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(MCC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
This would indicate that VC1 may be sent as its own Capture | ||||
Encoding from EncodeGroupID=1 or that it may be sent as part of a | ||||
Capture Encoding from EncodeGroupID=2 along with VC2.</t> | ||||
<t> | ||||
More than one Capture <bcp14>MAY</bcp14> use the same Encoding Group.</t> | ||||
<t> | ||||
The maximum number of Capture Encodings that can result from a | ||||
particular Encoding Group constraint is equal to the number of | ||||
Individual Encodings in the group. The actual number of Capture | ||||
Encodings used at any time <bcp14>MAY</bcp14> be less than this maximum. Any | ||||
of | ||||
the Captures that use a particular Encoding Group can be encoded | ||||
according to any of the Individual Encodings in the group.</t> | ||||
<t> | ||||
It is a protocol conformance requirement that the Encoding Groups | ||||
<bcp14>MUST</bcp14> allow all the Captures in a particular CSV to | ||||
be used simultaneously.</t> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-10" numbered="true" toc="default"> | ||||
<name>Consumer's Choice of Streams to Receive from the Provider</name> | ||||
<t> | ||||
After receiving the Provider's Advertisement message (which includes | ||||
Media Captures and associated constraints), the Consumer composes | ||||
its reply to the Provider in the form of a Configure message. The | ||||
Consumer is free to use the information in the Advertisement as it | ||||
chooses, but there are a few obviously sensible design choices, | ||||
which are outlined below.</t> | ||||
<t> | ||||
If multiple Providers connect to the same Consumer (i.e., in an | ||||
MCU-less multiparty call), it is the responsibility of the Consumer | ||||
      to compose Configures for each Provider that both fulfill that
      Provider's constraints, as expressed in its Advertisement, and
      respect the Consumer's own capabilities.</t>
<t> | ||||
In an MCU-based multiparty call, the MCU can logically terminate | ||||
the Advertisement/Configure negotiation in that it can hide the | ||||
characteristics of the receiving Endpoint and rely on its own | ||||
capabilities (transcoding/transrating/etc.) to create Media Streams | ||||
that can be decoded at the Endpoint Consumers. The timing of an | ||||
MCU's sending of Advertisements (for its outgoing ports) and | ||||
Configures (for its incoming ports, in response to Advertisements | ||||
received there) is up to the MCU and is implementation dependent.</t> | ||||
<t> | ||||
As a general outline, a Consumer can choose, based on the | ||||
Advertisement it has received, which Captures it wishes to receive, | ||||
and which Individual Encodings it wants the Provider to use to | ||||
encode the Captures.</t> | ||||
<t> | ||||
On receipt of an Advertisement with an MCC, the Consumer treats the | ||||
MCC as per other non-MCC Captures with the following differences:</t> | ||||
<ul spacing="normal"> | ||||
<li>The Consumer would understand that the MCC is a Capture that | ||||
includes the referenced individual Captures (or any Captures, if | ||||
none are referenced) and that these individual Captures are | ||||
delivered as part of the MCC's Capture Encoding.</li> | ||||
<li>The Consumer may utilize any of the attributes associated with | ||||
the referenced individual Captures and any Capture Scene attributes | ||||
from where the individual Captures were defined to choose Captures | ||||
and for Rendering decisions.</li> | ||||
<li>If the MCC attribute Allow Subset Choice is true, then the | ||||
Consumer may or may not choose to receive all the indicated | ||||
Captures. It can choose to receive a subset of Captures indicated | ||||
by the MCC.</li> | ||||
</ul> | ||||
<t>For example, if the Consumer receives: | ||||
</t> | ||||
<ul empty="true" spacing="normal"> | ||||
<li>MCC1(VC1,VC2,VC3){attributes}</li> | ||||
</ul> | ||||
<t> | ||||
A Consumer could choose all the Captures within an MCC; however, if | ||||
the Consumer determines that it doesn't want VC3, it can return | ||||
MCC1(VC1,VC2). If it wants all the individual Captures, then it | ||||
returns only the MCC identity (i.e., MCC1). If the MCC in the | ||||
Advertisement does not reference any individual Captures, or the | ||||
Allow Subset Choice attribute is false, then the Consumer cannot | ||||
choose what is included in the MCC: it is up to the Provider to | ||||
decide.</t> | ||||
<t> | ||||
A Configure Message includes a list of Capture Encodings. These | ||||
are the Capture Encodings the Consumer wishes to receive from the | ||||
Provider. Each Capture Encoding refers to one Media Capture and | ||||
one Individual Encoding.</t> | ||||
<t> | ||||
For each Capture the Consumer wants to receive, it configures one | ||||
of the Encodings in that Capture's Encoding Group. The Consumer | ||||
does this by telling the Provider, in its Configure Message, which | ||||
Encoding to use for each chosen Capture. Upon receipt of this | ||||
Configure from the Consumer, common knowledge is established | ||||
between Provider and Consumer regarding sensible choices for the | ||||
Media Streams. The setup of the actual Media channels, at least in | ||||
the simplest case, is left to a following offer/answer exchange. | ||||
Optimized implementations may speed up the reaction to the | ||||
offer/answer exchange by reserving the resources at the time of | ||||
finalization of the CLUE handshake.</t> | ||||
<t> | ||||
CLUE Advertisements and Configure Messages don't necessarily | ||||
require a new SDP offer/answer for every CLUE message | ||||
exchange. But the resulting Encodings sent via RTP must conform to | ||||
the most-recent SDP offer/answer result.</t> | ||||
<t> | ||||
In order to meaningfully create and send an initial Configure, the | ||||
Consumer needs to have received at least one Advertisement, and an | ||||
SDP offer defining the Individual Encodings, from the Provider.</t> | ||||
<t> | ||||
In addition, the Consumer can send a Configure at any time during | ||||
the call. The Configure <bcp14>MUST</bcp14> be valid according to the most | ||||
recently received Advertisement. The Consumer can send a Configure | ||||
either in response to a new Advertisement from the Provider or on | ||||
its own, for example, because of a local change in conditions | ||||
(people leaving the room, connectivity changes, multipoint related | ||||
considerations).</t> | ||||
<t> | ||||
When choosing which Media Streams to receive from the Provider, and | ||||
the encoding characteristics of those Media Streams, the Consumer | ||||
advantageously takes several things into account: its local | ||||
preference, simultaneity restrictions, and encoding limits.</t> | ||||
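      <t>
      Non-normatively, the Consumer's core task can be sketched as
      mapping each chosen Capture to one Individual Encoding from that
      Capture's Encoding Group (Python; the identities reuse the Endpoint
      example in <xref target="s-12.1.1" format="default"/> and are
      otherwise illustrative):</t>
      <figure>
        <name>Non-Normative Sketch: Composing a Configure (Python)</name>
        <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: a Configure lists (Capture, Encoding) pairs,
# each Encoding taken from the chosen Capture's Encoding Group.
capture_to_group = {"VC1": "EG1", "MCC3": "EG1", "AC0": "EG3"}
group_encodings = {"EG1": ["ENC3", "ENC4", "ENC5"],
                   "EG3": ["ENC9", "ENC10", "ENC11", "ENC12", "ENC13"]}

def build_configure(choices):
    """choices maps a Capture id to the Encoding id the Consumer wants."""
    for capture, encoding in choices.items():
        group = capture_to_group[capture]
        assert encoding in group_encodings[group], "Encoding not in group"
    return [{"capture": c, "encoding": e} for c, e in choices.items()]

configure = build_configure({"VC1": "ENC3", "MCC3": "ENC4", "AC0": "ENC9"})
print(configure)
]]></artwork>
      </figure>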
<section anchor="s-10.1" numbered="true" toc="default"> | ||||
<name>Local Preference</name> | ||||
<t> | ||||
A variety of local factors influence the Consumer's choice of | ||||
Media Streams to be received from the Provider:</t> | ||||
<ul spacing="normal"> | ||||
<li>If the Consumer is an Endpoint, it is likely that it would | ||||
choose, where possible, to receive Video and Audio Captures that | ||||
match the number of display devices and audio system it has.</li> | ||||
<li>If the Consumer is an MCU, it may choose to receive loudest | ||||
speaker Streams (in order to perform its own Media composition) | ||||
and avoid pre-composed Video Captures.</li> | ||||
<li>User choice (for instance, selection of a new layout) may result | ||||
in a different set of Captures, or different Encoding | ||||
characteristics, being required by the Consumer.</li> | ||||
</ul> | ||||
</section> | ||||
<section anchor="s-10.2" numbered="true" toc="default"> | ||||
<name>Physical Simultaneity Restrictions</name> | ||||
<t> | ||||
Often there are physical simultaneity constraints of the Provider | ||||
that affect the Provider's ability to simultaneously send all of | ||||
the Captures the Consumer would wish to receive. For instance, an | ||||
MCU, when connected to a multi-camera room system, might prefer to | ||||
receive both individual video Streams of the people present in the | ||||
room and an overall view of the room from a single camera. Some | ||||
Endpoint systems might be able to provide both of these sets of | ||||
Streams simultaneously, whereas others might not (if the overall | ||||
room view were produced by changing the optical zoom level on the | ||||
center camera, for instance).</t> | ||||
</section> | ||||
<section anchor="s-10.3" numbered="true" toc="default"> | ||||
<name>Encoding and Encoding Group Limits</name> | ||||
<t> | ||||
Each of the Provider's Encoding Groups has limits on bandwidth, | ||||
and the constituent potential Encodings have limits on the | ||||
bandwidth, computational complexity, video frame rate, and | ||||
resolution that can be provided. When choosing the Captures to be | ||||
        received from a Provider, a Consumer device <bcp14>MUST</bcp14> ensure that the
        Encoding characteristics requested for each individual Capture
        fit within the capability of the Encoding it is being configured
to use, as well as ensuring that the combined Encoding | ||||
characteristics for Captures fit within the capabilities of their | ||||
associated Encoding Groups. In some cases, this could cause an | ||||
otherwise "preferred" choice of Capture Encodings to be passed | ||||
over in favor of different Capture Encodings -- for instance, if a | ||||
        set of three Captures could only be provided at a low resolution,
        then a three-screen device could switch to favoring a single,
        higher-quality Capture Encoding.</t>
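        <t>
        A non-normative sketch (Python; the parameter names follow the
        Encoding summary in <xref target="ref-example-encoding-groups-for-video" format="default"/>)
        of the per-Encoding fit check described above:</t>
        <figure>
          <name>Non-Normative Sketch: Fit Check against an Individual Encoding (Python)</name>
          <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: does a requested video configuration fit within
# the limits of the Individual Encoding it would use?
def fits_encoding(enc, width, height, frame_rate, bandwidth):
    return (width <= enc["maxWidth"] and height <= enc["maxHeight"]
            and frame_rate <= enc["maxFrameRate"]
            and width * height * frame_rate <= enc["maxPps"]
            and bandwidth <= enc["maxBandwidth"])

enc0 = {"maxWidth": 1920, "maxHeight": 1088, "maxFrameRate": 60,
        "maxPps": 124_416_000, "maxBandwidth": 4_000_000}
assert fits_encoding(enc0, 1920, 1080, 60, 4_000_000)      # 1080p60 fits
assert not fits_encoding(enc0, 1920, 1088, 60, 4_000_000)  # exceeds maxPps
]]></artwork>
        </figure>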
</section> | ||||
</section> | ||||
<section anchor="s-11" numbered="true" toc="default"> | ||||
<name>Extensibility</name> | ||||
<t> | ||||
      One important characteristic of the Framework is its
extensibility. The standard for interoperability and handling | ||||
multiple Streams must be future-proof. The framework itself is | ||||
inherently extensible through expanding the data model types. For | ||||
example:</t> | ||||
<ul spacing="normal"> | ||||
<li>Adding more types of Media, such as telemetry, can done by | ||||
defining additional types of Captures in addition to audio and | ||||
video.</li> | ||||
<li>Adding new functionalities, such as 3-D Video Captures, may | ||||
require additional attributes describing the Captures.</li> | ||||
</ul> | ||||
<t> | ||||
The infrastructure is designed to be extended rather than | ||||
requiring new infrastructure elements. Extension comes through | ||||
adding to defined types.</t> | ||||
</section> | ||||
<section anchor="s-12" numbered="true" toc="default"> | ||||
<name>Examples - Using the Framework (Informative)</name> | ||||
<t> | ||||
This section gives some examples, first from the point of view of | ||||
the Provider, then the Consumer, then some multipoint scenarios.</t> | ||||
<section anchor="s-12.1" numbered="true" toc="default"> | ||||
<name>Provider Behavior</name> | ||||
<t> | ||||
This section shows some examples in more detail of how a Provider | ||||
can use the framework to represent a typical case for telepresence | ||||
rooms. First, an Endpoint is illustrated, then an MCU case is | ||||
shown.</t> | ||||
<section anchor="s-12.1.1" numbered="true" toc="default"> | ||||
<name>Three-Screen Endpoint Provider</name> | ||||
<t> | ||||
Consider an Endpoint with the following description:</t> | ||||
<t> | ||||
Three cameras, three displays, and a six-person table</t> | ||||
<ul spacing="normal"> | ||||
<li>Each camera can provide one Capture for each 1/3-section of the | ||||
table.</li> | ||||
<li>A single Capture representing the active speaker can be provided | ||||
(voice-activity-based camera selection to a given encoder input | ||||
port implemented locally in the Endpoint).</li> | ||||
<li>A single Capture representing the active speaker with the other | ||||
two Captures shown picture in picture (PiP) within the Stream can | ||||
be provided (again, implemented inside the Endpoint).</li> | ||||
<li>A Capture showing a zoomed out view of all six seats in the room | ||||
can be provided.</li> | ||||
</ul> | ||||
<t> | ||||
The Video and Audio Captures for this Endpoint can be described as | ||||
follows.</t> | ||||
<t> | ||||
Video Captures: | ||||
</t> | ||||
<dl newline="false" spacing="normal" indent="6"> | ||||
<dt>VC0</dt> | ||||
<dd>(the left camera Stream), Encoding Group=EG0, view=table</dd> | ||||
<dt>VC1</dt> | ||||
<dd>(the center camera Stream), Encoding Group=EG1, view=table</dd> | ||||
<dt>VC2</dt> | ||||
<dd>(the right camera Stream), Encoding Group=EG2, view=table</dd> | ||||
<dt>MCC3</dt> | ||||
<dd>(the loudest panel Stream), Encoding Group=EG1, view=table, MaxC | ||||
aptures=1, policy=SoundLevel</dd> | ||||
<dt>MCC4</dt> | ||||
<dd>(the loudest panel Stream with PiPs), Encoding Group=EG1, view=r | ||||
oom, MaxCaptures=3, policy=SoundLevel</dd> | ||||
<dt>VC5</dt> | ||||
<dd>(the zoomed out view of all people in the room), Encoding Group= | ||||
EG1, view=room</dd> | ||||
<dt>VC6</dt> | ||||
<dd>(presentation Stream), Encoding Group=EG1, presentation</dd> | ||||
</dl> | ||||
<t> | ||||
The following diagram is a top view of the room with three cameras, three | ||||
displays, and six seats. Each camera captures two people. The six | ||||
seats are not all in a straight line.</t> | ||||
<figure anchor="ref-room-layout-top-view"> | ||||
<name>Room Layout Top View</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
,-. d | ||||
( )`--.__ +---+ | ||||
`-' / `--.__ | | | ||||
,-. | `-.._ |_-+Camera 2 (VC2) | ||||
( ).' <--(AC1)-+-''`+-+ | ||||
`-' |_...---'' | | | ||||
,-.c+-..__ +---+ | ||||
( )| ``--..__ | | | ||||
`-' | ``+-..|_-+Camera 1 (VC1) | ||||
,-. | <--(AC2)..--'|+-+ ^ | ||||
( )| __..--' | | | | ||||
`-'b|..--' +---+ |X | ||||
,-. |``---..___ | | | | ||||
( )\ ```--..._|_-+Camera 0 (VC0) | | ||||
`-' \ <--(AC0) ..-''`-+ | | ||||
,-. \ __.--'' | | <----------+ | ||||
( ) |..-'' +---+ Y | ||||
`-' a (0,0,0) origin is under Camera 1 | ||||
]]></artwork> | ||||
</figure> | ||||
<t> | ||||
The two points labeled 'b' and 'c' are intended to be at the midpoint | ||||
between the seating positions, and where the fields of view of the | ||||
cameras intersect.</t> | ||||
<t> | ||||
The Plane of Interest for VC0 is a vertical plane that intersects | ||||
points 'a' and 'b'.</t> | ||||
<t> | ||||
The Plane of Interest for VC1 intersects points 'b' and 'c'. The | ||||
plane of interest for VC2 intersects points 'c' and 'd'.</t> | ||||
<t> | ||||
This example uses an area scale of millimeters.</t> | ||||
<t>Areas of capture:</t> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
bottom left bottom right top left top right | ||||
VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) | ||||
VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) | ||||
VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) | ||||
MCC3(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) | ||||
MCC4(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) | ||||
VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) | ||||
VC6 none | ||||
]]></artwork> | ||||
<t>Points of capture:</t> | ||||
<artwork name="" type="" align="left" alt=""> | ||||
VC0 (-1678,0,800) | ||||
VC1 (0,0,800) | ||||
VC2 (1678,0,800) | ||||
MCC3 none | ||||
MCC4 none | ||||
VC5 (0,0,800) | ||||
VC6 none | ||||
</artwork> | ||||
<t> | ||||
In this example, the right edge of the VC0 area lines up with the | ||||
left edge of the VC1 area. It doesn't have to be this way. There | ||||
could be a gap or an overlap. One additional thing to note for | ||||
          this example is that the distance from 'a' to 'b' is equal to the distance
from 'b' to 'c' and the distance from 'c' to 'd'. All these distances are | ||||
1346 mm. This is the planar width of each Area of Capture for VC0, | ||||
VC1, and VC2.</t> | ||||
<t> | ||||
          Note that the text in parentheses (e.g., "the left camera Stream") is
          not explicitly part of the model; it is just explanatory text for
          this example and is not included in the model with the Media
          Captures and attributes. Also, MCC4 doesn't say anything about
how a Capture is composed, so the Media Consumer can't tell based | ||||
on this Capture that MCC4 is composed of a "loudest panel with PiPs".</t> | ||||
<t> | ||||
Audio Captures:</t> | ||||
<t> | ||||
Three ceiling microphones are located between the cameras and the | ||||
table, at the same height as the cameras. The microphones point | ||||
down at an angle toward the seating positions.</t> | ||||
<ul spacing="normal"> | ||||
<li>AC0 (left), Encoding Group=EG3</li> | ||||
<li>AC1 (right), Encoding Group=EG3</li> | ||||
<li>AC2 (center), Encoding Group=EG3</li> | ||||
<li>AC3 being a simple pre-mixed audio Stream from the room (mono), | ||||
Encoding Group=EG3</li> | ||||
<li>AC4 audio Stream associated with the presentation video (mono) | ||||
Encoding Group=EG3, presentation</li> | ||||
</ul> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
Point of Capture: Point on Line of Capture: | ||||
AC0 (-1342,2000,800) (-1342,2925,379) | ||||
AC1 ( 1342,2000,800) ( 1342,2925,379) | ||||
AC2 ( 0,2000,800) ( 0,3000,379) | ||||
AC3 ( 0,2000,800) ( 0,3000,379) | ||||
AC4 none | ||||
]]></artwork> | ||||
<t>The physical simultaneity information is: | ||||
</t> | ||||
<ul empty="true" spacing="normal"> | ||||
<li>Simultaneous Transmission Set #1 {VC0, VC1, VC2, MCC3, MCC4, | ||||
VC6}</li> | ||||
<li>Simultaneous Transmission Set #2 {VC0, VC2, VC5, VC6}</li> | ||||
</ul> | ||||
<t> | ||||
This constraint indicates that it is not possible to use all the VCs at | ||||
the same time. VC5 cannot be used at the same time as VC1 or MCC3 | ||||
or MCC4. Also, using every member in the set simultaneously may | ||||
not make sense -- for example, MCC3 (loudest) and MCC4 (loudest with | ||||
PiP). In addition, there are Encoding constraints that make | ||||
choosing all of the VCs in a set impossible. VC1, MCC3, MCC4, | ||||
VC5, and VC6 all use EG1 and EG1 has only three ENCs. This constraint | ||||
shows up in the Encoding Groups, not in the Simultaneous | ||||
Transmission Sets.</t> | ||||
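        <t>
          As a non-normative illustration of how a Consumer might apply these
          two kinds of constraint together, the Python sketch below tests a
          candidate set of Captures against the Simultaneous Transmission
          Sets and against a per-Encoding-Group limit of three Encodings.
          The capture-to-group mapping follows the statement above that VC1,
          MCC3, MCC4, VC5, and VC6 all use EG1; placing VC0 in EG0 and VC2 in
          EG2 is assumed for this illustration.</t>
        <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative Consumer-side sketch: can these Captures be requested
# at the same time, given simultaneity and Encoding Group limits?

SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "MCC3", "MCC4", "VC6"},
    {"VC0", "VC2", "VC5", "VC6"},
]
ENCODING_GROUP = {"VC0": "EG0", "VC1": "EG1", "VC2": "EG2",
                  "MCC3": "EG1", "MCC4": "EG1", "VC5": "EG1", "VC6": "EG1"}
MAX_ENCODINGS = {"EG0": 3, "EG1": 3, "EG2": 3}

def selectable(captures):
    """True if the Captures can all be transmitted at the same time."""
    if not any(set(captures) <= s for s in SIMULTANEOUS_SETS):
        return False                      # violates physical simultaneity
    per_group = {}
    for capture in captures:
        group = ENCODING_GROUP[capture]
        per_group[group] = per_group.get(group, 0) + 1
    return all(n <= MAX_ENCODINGS[g] for g, n in per_group.items())

print(selectable({"VC0", "VC1", "VC2"}))  # True
print(selectable({"VC1", "VC5"}))         # False: not in any simultaneous set
]]></artwork>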
<t> | ||||
In this example, there are no restrictions on which Audio Captures | ||||
can be sent simultaneously.</t> | ||||
<t> | ||||
Encoding Groups:</t> | ||||
<t> | ||||
This example has three Encoding Groups associated with the Video | ||||
Captures. Each group can have three Encodings, but with each | ||||
potential Encoding having a progressively lower specification. In | ||||
this example, 1080p60 transmission is possible (as ENC0 has a | ||||
maxPps value compatible with that). Significantly, as up to three | ||||
Encodings are available per group, it is possible to transmit some | ||||
Video Captures simultaneously that are not in the same view in the | ||||
Capture Scene, for example, VC1 and MCC3 at the same time. The | ||||
information below about Encodings is a summary of what would be | ||||
conveyed in SDP, not directly in the CLUE Advertisement.</t> | ||||
<figure anchor="ref-example-encoding-groups-for-video"> | ||||
<name>Example Encoding Groups for Video</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
encodeGroupID=EG0, maxGroupBandwidth=6000000 | ||||
encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, | ||||
maxPps=124416000, maxBandwidth=4000000 | ||||
encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, | ||||
maxPps=27648000, maxBandwidth=4000000 | ||||
encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, | ||||
maxPps=15552000, maxBandwidth=4000000 | ||||
encodeGroupID=EG1 maxGroupBandwidth=6000000 | ||||
encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, | ||||
maxPps=124416000, maxBandwidth=4000000 | ||||
encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, | ||||
maxPps=27648000, maxBandwidth=4000000 | ||||
encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, | ||||
maxPps=15552000, maxBandwidth=4000000 | ||||
encodeGroupID=EG2 maxGroupBandwidth=6000000 | ||||
encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, | ||||
maxPps=124416000, maxBandwidth=4000000 | ||||
encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, | ||||
maxPps=27648000, maxBandwidth=4000000 | ||||
encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, | ||||
maxPps=15552000, maxBandwidth=4000000 | ||||
]]></artwork> | ||||
</figure> | ||||
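        <t>
          As a non-normative arithmetic check of the statement above that
          1080p60 is possible within ENC0, the sketch below compares a
          requested width, height, and frame rate against the advertised
          limits: 1920 x 1080 x 60 is exactly ENC0's maxPps, while a full
          1088-line picture at 60 fps would slightly exceed it (the 1088
          maxHeight presumably just allows a 16-pixel-aligned coded height).</t>
        <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: does a resolution/frame-rate pair fit within an
# advertised Encoding (maxWidth, maxHeight, maxFrameRate, maxPps)?

ENC0 = {"maxWidth": 1920, "maxHeight": 1088,
        "maxFrameRate": 60, "maxPps": 124416000}

def fits(enc, width, height, frame_rate):
    return (width <= enc["maxWidth"] and height <= enc["maxHeight"]
            and frame_rate <= enc["maxFrameRate"]
            and width * height * frame_rate <= enc["maxPps"])

print(fits(ENC0, 1920, 1080, 60))  # True: 124,416,000 pps, exactly maxPps
print(fits(ENC0, 1920, 1088, 60))  # False: 125,337,600 pps exceeds maxPps
]]></artwork>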
<t> | ||||
For audio, there are five potential Encodings available, so all | ||||
five Audio Captures can be encoded at the same time.</t> | ||||
<figure anchor="ref-example-encoding-group-for-audio"> | ||||
<name>Example Encoding Group for Audio</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
encodeGroupID=EG3, maxGroupBandwidth=320000 | ||||
encodeID=ENC9, maxBandwidth=64000 | ||||
encodeID=ENC10, maxBandwidth=64000 | ||||
encodeID=ENC11, maxBandwidth=64000 | ||||
encodeID=ENC12, maxBandwidth=64000 | ||||
encodeID=ENC13, maxBandwidth=64000 | ||||
]]></artwork> | ||||
</figure> | ||||
<t> | ||||
Capture Scenes:</t> | ||||
<t> | ||||
The following table represents the Capture Scenes for this | ||||
Provider. Recall that a Capture Scene is composed of alternative | ||||
CSVs covering the same spatial region. Capture | ||||
Scene #1 is for the main people Captures, and Capture Scene #2 is | ||||
for presentation.</t> | ||||
<t>Each row in the table is a separate CSV.</t> | ||||
<table align="center"> | ||||
<name>Example CSVs</name> | ||||
<tbody> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC0, VC1, VC2</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC3</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC4</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC5</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">AC0, AC1, AC2</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">AC3</td> | ||||
</tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th align="left"> Capture Scene #2</th> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC6</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">AC4</td> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
Different Capture Scenes are distinct from each other and do not | ||||
overlap. A Consumer can choose a view from each Capture Scene. In | ||||
this case, the three Captures, VC0, VC1, and VC2, are one way of | ||||
representing the video from the Endpoint. These three Captures | ||||
should appear adjacent to each other. Alternatively, another | ||||
way of representing the Capture Scene is with the Capture MCC3, | ||||
which automatically shows the person who is talking; this is the same for | ||||
the MCC4 and VC5 alternatives.</t> | ||||
<t> | ||||
As in the video case, the different views of audio in Capture | ||||
Scene #1 represent the "same thing", in that one way to receive | ||||
the audio is with the three Audio Captures (AC0, AC1, and AC2), and | ||||
another way is with the mixed AC3. The Media Consumer can choose | ||||
an audio CSV it is capable of receiving.</t> | ||||
<t> | ||||
          The spatial ordering is conveyed by the Media Capture attributes
          Area of Capture, Point of Capture, and Point on Line of Capture.</t>
<t> | ||||
A Media Consumer would likely want to choose a CSV | ||||
to receive, partially based on how many Streams it can simultaneously | ||||
receive. A Consumer that can receive three video Streams would | ||||
probably prefer to receive the first view of Capture Scene #1 | ||||
(VC0, VC1, and VC2) and not receive the other views. A Consumer that | ||||
can receive only one video Stream would probably choose one of the | ||||
other views.</t> | ||||
<t> | ||||
If the Consumer can receive a presentation Stream too, it would | ||||
also choose to receive the only view from Capture Scene #2 (VC6).</t> | ||||
</section> | ||||
<section anchor="s-12.1.2" numbered="true" toc="default"> | ||||
<name>Encoding Group Example</name> | ||||
<t> | ||||
This is an example of an Encoding Group to illustrate how it can | ||||
express dependencies between Encodings. The information below | ||||
about Encodings is a summary of what would be conveyed in SDP, not | ||||
directly in the CLUE Advertisement.</t> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
encodeGroupID=EG0 maxGroupBandwidth=6000000 | ||||
encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, | ||||
maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, | ||||
maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
encodeID=AUDENC0, maxBandwidth=96000 | ||||
encodeID=AUDENC1, maxBandwidth=96000 | ||||
encodeID=AUDENC2, maxBandwidth=96000 | ||||
]]></artwork> | ||||
<t> | ||||
Here, the Encoding Group is EG0. Although the Encoding Group is | ||||
capable of transmitting up to 6 Mbit/s, no individual video | ||||
Encoding can exceed 4 Mbit/s.</t> | ||||
<t> | ||||
          This Encoding Group also allows up to three audio Encodings,
          AUDENC&lt;0-2&gt;. It is not required that audio and video Encodings reside
          within the same Encoding Group, but if so, then the group's overall
          maxGroupBandwidth value is a limit on the sum of all audio and video
Encodings configured by the Consumer. A system that does not wish | ||||
or need to combine bandwidth limitations in this way should | ||||
instead use separate Encoding Groups for audio and video in order | ||||
for the bandwidth limitations on audio and video to not interact.</t> | ||||
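        <t>
          As a non-normative illustration of how these limits combine, the
          Python sketch below checks a hypothetical Consumer configuration
          against the combined audio/video group shown above: no individual
          Encoding may exceed its own maxBandwidth, and the sum of everything
          configured in the group may not exceed maxGroupBandwidth. The
          requested bitrates are invented for this illustration.</t>
        <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: validate a Consumer's requested bitrates against
# EG0 above, where audio and video Encodings share one Encoding Group.

MAX_GROUP_BANDWIDTH = 6000000              # bps, from encodeGroupID=EG0
MAX_ENCODING_BANDWIDTH = {"VIDENC0": 4000000, "VIDENC1": 4000000,
                          "AUDENC0": 96000, "AUDENC1": 96000,
                          "AUDENC2": 96000}

def configuration_ok(requested):
    """requested maps an encodeID to the bitrate (bps) the Consumer wants."""
    if any(bps > MAX_ENCODING_BANDWIDTH[enc] for enc, bps in requested.items()):
        return False                       # one Encoding exceeds its own cap
    return sum(requested.values()) <= MAX_GROUP_BANDWIDTH

print(configuration_ok({"VIDENC0": 4000000, "VIDENC1": 1800000,
                        "AUDENC0": 96000, "AUDENC1": 96000}))      # True
print(configuration_ok({"VIDENC0": 4000000, "VIDENC1": 4000000}))  # False
]]></artwork>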
<t> | ||||
Audio and video can be expressed in separate Encoding Groups, as | ||||
in this illustration.</t> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
encodeGroupID=EG0 maxGroupBandwidth=6000000 | ||||
encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, | ||||
maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, | ||||
maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
encodeGroupID=EG1 maxGroupBandwidth=500000 | ||||
encodeID=AUDENC0, maxBandwidth=96000 | ||||
encodeID=AUDENC1, maxBandwidth=96000 | ||||
encodeID=AUDENC2, maxBandwidth=96000 | ||||
]]></artwork> | ||||
</section> | ||||
<section anchor="s-12.1.3" numbered="true" toc="default"> | ||||
<name>The MCU Case</name> | ||||
<t> | ||||
This section shows how an MCU might express its Capture Scenes, | ||||
intending to offer different choices for Consumers that can handle | ||||
different numbers of Streams. Each MCC is for video. A single | ||||
Audio Capture is provided for all single and multi-screen | ||||
configurations that can be associated (e.g., lip-synced) with any | ||||
combination of Video Captures (the MCCs) at the Consumer.</t> | ||||
<table anchor="ref-mcu-main-capture-scenes" align="center"> | ||||
<name>MCU Main Capture Scenes</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left">Capture Scene #1</th> | ||||
<th align="left"/> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">MCC</td> | ||||
<td align="left">for a one-screen Consumer</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC1, MCC2</td> | ||||
<td align="left">for a two-screen Consumer</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC3, MCC4, MCC5</td> | ||||
<td align="left">for a three-screen Consumer</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">MCC6, MCC7, MCC8, MCC9</td> | ||||
<td align="left">for a four-screen Consumer</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">AC0</td> | ||||
<td align="left">AC representing all participants</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(MCC0)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(MCC1,MCC2)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(MCC3,MCC4,MCC5)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(MCC6,MCC7,MCC8,MCC9)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(AC0)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
If/when a presentation Stream becomes active within the Conference, | ||||
the MCU might re-advertise the available Media as:</t> | ||||
<table anchor="ref-mcu-presentation-capture-scene" align="center"> | ||||
<name>MCU Presentation Capture Scene</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left">Capture Scene #2</th> | ||||
<th align="left">Note</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC10</td> | ||||
<td align="left">Video Capture for presentation</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">AC1</td> | ||||
<td align="left">Presentation audio to accompany VC10</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(VC10)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(AC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-12.2" numbered="true" toc="default"> | ||||
<name>Media Consumer Behavior</name> | ||||
<t> | ||||
This section gives an example of how a Media Consumer might behave | ||||
when deciding how to request Streams from the three-screen | ||||
Endpoint described in the previous section.</t> | ||||
<t> | ||||
The receive side of a call needs to balance its requirements | ||||
(based on number of screens and speakers), its decoding capabilities, | ||||
available bandwidth, and the Provider's capabilities in order | ||||
to optimally configure the Provider's Streams. Typically, it would | ||||
want to receive and decode Media from each Capture Scene | ||||
advertised by the Provider.</t> | ||||
<t> | ||||
          A sane, basic algorithm might be for the Consumer to go through
each CSV in turn and find the collection of Video | ||||
Captures that best matches the number of screens it has (this | ||||
might include consideration of screens dedicated to presentation | ||||
video display rather than "people" video) and then decide between | ||||
alternative views in the video Capture Scenes based either on | ||||
hard-coded preferences or on user choice. Once this choice has been | ||||
made, the Consumer would then decide how to configure the | ||||
Provider's Encoding Groups in order to make best use of the | ||||
available network bandwidth and its own decoding capabilities.</t> | ||||
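        <t>
          A non-normative sketch of that basic algorithm is given below. The
          CSV labels and the tie-breaking rule are illustrative assumptions,
          not part of the framework; they simply restate the alternatives in
          Capture Scene #1 of the Endpoint example.</t>
        <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: pick, per Capture Scene, the CSV whose number of
# Video Captures best matches the number of available screens.

CAPTURE_SCENE_1_VIDEO_CSVS = {
    "csv-three-camera": ["VC0", "VC1", "VC2"],
    "csv-loudest":      ["MCC3"],
    "csv-loudest-pips": ["MCC4"],
    "csv-zoomed-out":   ["VC5"],
}

def choose_csv(csvs, screens):
    """Prefer the CSV that fills the screens without exceeding them;
    otherwise fall back to the smallest CSV on offer."""
    fitting = {k: v for k, v in csvs.items() if len(v) <= screens}
    if not fitting:
        return min(csvs, key=lambda k: len(csvs[k]))
    return max(fitting, key=lambda k: len(fitting[k]))

print(choose_csv(CAPTURE_SCENE_1_VIDEO_CSVS, screens=3))  # csv-three-camera
print(choose_csv(CAPTURE_SCENE_1_VIDEO_CSVS, screens=1))  # a one-Capture CSV
]]></artwork>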
<section anchor="s-12.2.1" numbered="true" toc="default"> | ||||
<name>One-Screen Media Consumer</name> | ||||
<t> | ||||
MCC3, MCC4, and VC5 are all different views by themselves, not | ||||
grouped together in a single view; so, the receiving device should | ||||
choose between one of those. The choice would come down to | ||||
whether to see the greatest number of participants simultaneously | ||||
at roughly equal precedence (VC5), a switched view of just the | ||||
loudest region (MCC3), or a switched view with PiPs (MCC4). An | ||||
Endpoint device with a small amount of knowledge of these | ||||
          differences could offer a dynamic choice of these options, in-call, to the user.</t>
</section> | ||||
<section anchor="s-12.2.2" numbered="true" toc="default"> | ||||
<name>Two-Screen Media Consumer Configuring the Example</name> | ||||
<t> | ||||
Mixing systems with an even number of screens, "2n", and those | ||||
with "2n+1" cameras (and vice versa) is always likely to be the | ||||
problematic case. In this instance, the behavior is likely to be | ||||
determined by whether a "two-screen" system is really a "two-decoder" | ||||
system, i.e., whether only one received Stream can be displayed | ||||
per screen or whether more than two Streams can be received and | ||||
spread across the available screen area. To enumerate three possible | ||||
behaviors here for the two-screen system when it learns that the far | ||||
end is "ideally" expressed via three Capture Streams:</t> | ||||
<ol spacing="normal" type="1"> | ||||
<li>Fall back to receiving just a single Stream (MCC3, MCC4, or VC5 | ||||
as per the one-screen Consumer case above) and either leave one | ||||
screen blank or use it for presentation if/when a | ||||
presentation becomes active.</li> | ||||
<li>Receive three Streams (VC0, VC1, and VC2) and display across two | ||||
screens (either with each Capture being scaled to 2/3 of a | ||||
screen and the center Capture being split across two screens), or, | ||||
as would be necessary if there were large bezels on the | ||||
screens, with each Stream being scaled to 1/2 the screen width | ||||
and height and there being a fourth "blank" panel. This fourth panel | ||||
could potentially be used for any presentation that became | ||||
active during the call.</li> | ||||
            <li>Receive three Streams, decode all three, and use control information
indicating which was the most active to switch between showing | ||||
the left and center Streams (one per screen) and the center and | ||||
right Streams.</li> | ||||
</ol> | ||||
<t> | ||||
For an Endpoint capable of all three methods of working described | ||||
above, again it might be appropriate to offer the user the choice | ||||
of display mode.</t> | ||||
</section> | ||||
<section anchor="s-12.2.3" numbered="true" toc="default"> | ||||
<name>Three-Screen Media Consumer Configuring the Example</name> | ||||
<t> | ||||
This is the most straightforward case: the Media Consumer would | ||||
look to identify a set of Streams to receive that best matched its | ||||
available screens; so, the VC0 plus VC1 plus VC2 should match | ||||
optimally. The spatial ordering would give sufficient information | ||||
for the correct Video Capture to be shown on the correct screen. | ||||
The Consumer would need to divide a single Encoding | ||||
Group's capability by 3 either to determine what resolution and frame | ||||
rate to configure the Provider with or to configure the individual | ||||
Video Captures' Encoding Groups with what makes most sense (taking | ||||
into account the receive side decode capabilities, overall call | ||||
bandwidth, the resolution of the screens plus any user preferences | ||||
such as motion vs. sharpness).</t> | ||||
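        <t>
          A non-normative sketch of that division is shown below. The overall
          receive bandwidth and the candidate operating points are assumptions
          made purely for illustration; in practice, they would follow from
          SDP negotiation and local decode capability.</t>
        <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: split an overall receive budget across VC0, VC1,
# and VC2, then pick the best operating point that fits each share.

OVERALL_CALL_BANDWIDTH = 9000000          # bps available to the Consumer
PER_STREAM_BUDGET = OVERALL_CALL_BANDWIDTH // 3

# (label, required bps), best first; values are illustrative
CANDIDATES = [("1080p60", 4000000), ("720p30", 2500000), ("540p30", 1200000)]

def pick_operating_point(budget_bps):
    for label, bps in CANDIDATES:
        if bps <= budget_bps:
            return label
    return None

for vc in ("VC0", "VC1", "VC2"):
    print(vc, pick_operating_point(PER_STREAM_BUDGET))   # each: 720p30
]]></artwork>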
</section> | ||||
</section> | ||||
<section anchor="s-12.3" numbered="true" toc="default"> | ||||
<name>Multipoint Conference Utilizing Multiple Content Captures</name> | ||||
<t> | ||||
The use of MCCs allows the MCU to construct outgoing Advertisements | ||||
describing complex Media switching and composition scenarios. The | ||||
following sections provide several examples.</t> | ||||
<t> | ||||
Note: in the examples the identities of the CLUE elements (e.g., | ||||
Captures, Capture Scene) in the incoming Advertisements overlap. | ||||
This is because there is no coordination between the Endpoints. | ||||
The MCU is responsible for making these unique in the outgoing | ||||
Advertisement.</t> | ||||
<section anchor="s-12.3.1" numbered="true" toc="default"> | ||||
<name>Single Media Captures and MCC in the Same Advertisement</name> | ||||
<t> | ||||
Four Endpoints are involved in a Conference where CLUE is used. An | ||||
MCU acts as a middlebox between the Endpoints with a CLUE channel | ||||
between each Endpoint and the MCU. The MCU receives the following | ||||
Advertisements.</t> | ||||
<table anchor="ref-advertisement-received-from-endpoint-a" align="cent | ||||
er"> | ||||
<name>Advertisement Received from Endpoint A</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=AustralianConfRoom</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">Description=Audience<br/>EncodeGroupID=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(VC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<table anchor="ref-advertisement-received-from-endpoint-b" align="cent | ||||
er"> | ||||
<name>Advertisement Received from Endpoint B</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=ChinaConfRoom</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">Description=Speaker<br/>EncodeGroupID=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC2</td> | ||||
<td align="left">Description=Audience<br/>EncodeGroupID=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(VC1, VC2)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
        <t keepWithPrevious="true">Note: Endpoint B indicates that it sends two Streams.</t>
<table anchor="ref-advertisement-received-from-endpoint-c" align="cent | ||||
er"> | ||||
<name>Advertisement Received from Endpoint C</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=USAConfRoom</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">Description=Audience<br/>EncodeGroupID=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(VC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
          If the MCU wanted to provide a Multiple Content Capture containing
a round-robin switched view of the audience from the three Endpoints | ||||
and the speaker, it could construct the following Advertisement:</t> | ||||
<table anchor="ref-advertisement-sent-to-endpoint-f-one-encoding"> | ||||
<name>Advertisement Sent to Endpoint F - One Encoding</name> | ||||
<tbody> | ||||
<tr> | ||||
              <th>Capture Scene #1</th> <th>Description=AustralianConfRoom</th>
</tr> | ||||
<tr> | ||||
<td>VC1</td> <td>Description=Audience</td> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV(VC1)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #2</th> <th>Description=ChinaConfRoom</th> | ||||
</tr> | ||||
<tr> | ||||
<td>VC2</td> <td>Description=Speaker</td> | ||||
</tr> | ||||
<tr> | ||||
<td>VC3</td> <td>Description=Audience</td> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV(VC2, VC3)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #3</th> <th>Description=USAConfRoom</th> | ||||
</tr> | ||||
<tr> | ||||
<td>VC4</td> <td>Description=Audience</td> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV(VC4)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #4</th> <th/></tr> | ||||
<tr> | ||||
<td>MCC1(VC1,VC2,VC3,VC4)</td> | ||||
<td>Policy=RoundRobin:1<br/> | ||||
MaxCaptures=1<br/> | ||||
EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV(MCC1)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
Alternatively, if the MCU wanted to provide the speaker as one Media | ||||
Stream and the audiences as another, it could assign an Encoding | ||||
Group to VC2 in Capture Scene 2 and provide a CSV in Capture Scene | ||||
#4 as per the example below.</t> | ||||
<table anchor="ref-advertisement-sent-to-endpoint-f-two-encodings"> | ||||
<name>Advertisement Sent to Endpoint F - Two Encodings</name> | ||||
<tbody> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=AustralianConfRoom</th> | ||||
</tr> | ||||
<tr><td>VC1</td> <td>Description=Audience</td> | ||||
</tr> | ||||
<tr><td>CSV(VC1)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #2</th> <th>Description=ChinaConfRoom</th> | ||||
</tr> | ||||
<tr><td>VC2</td> <td>Description=Speaker | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr><td>VC3</td> <td>Description=Audience</td> | ||||
</tr> | ||||
<tr><td>CSV(VC2, VC3)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #3</th> <th>Description=USAConfRoom</th> | ||||
</tr> | ||||
<tr><td>VC4</td> <td>Description=Audience</td> | ||||
</tr> | ||||
<tr><td>CSV(VC4)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #4</th> <th/> | ||||
</tr> | ||||
<tr><td>MCC1(VC1,VC3,VC4)</td> <td>Policy=RoundRobin:1 | ||||
<br/>MaxCaptures=1 | ||||
<br/>EncodingGroup=1 | ||||
<br/>AllowSubset=True</td> | ||||
</tr> | ||||
<tr><td>MCC2(VC2)</td> <td>MaxCaptures=1 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr><td>CSV2(MCC1,MCC2)</td> <td/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
Therefore, a Consumer could choose whether or not to have a separate | ||||
speaker-related Stream and could choose which Endpoints to see. If | ||||
it wanted the second Stream but not the Australian conference room, | ||||
it could indicate the following Captures in the Configure message:</t> | ||||
<table anchor="table_15"> | ||||
<name>MCU Case: Consumer Response</name> | ||||
<tbody> | ||||
<tr><td>MCC1(VC3,VC4)</td> <td>Encoding</td></tr> | ||||
<tr><td>VC2</td> <td>Encoding</td></tr> | ||||
</tbody> | ||||
</table> | ||||
</section> | ||||
<section anchor="s-12.3.2" numbered="true" toc="default"> | ||||
<name>Several MCCs in the Same Advertisement</name> | ||||
<t> | ||||
Multiple MCCs can be used where multiple Streams are used to carry | ||||
Media from multiple Endpoints. For example:</t> | ||||
<t> | ||||
A Conference has three Endpoints D, E, and F. Each Endpoint has | ||||
three Video Captures covering the left, middle, and right regions of | ||||
each conference room. The MCU receives the following | ||||
Advertisements from D and E.</t> | ||||
<table anchor="ref-advertisement-received-from-endpoint-d" align="cent | ||||
er"> | ||||
<name>Advertisement Received from Endpoint D</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=AustralianConfRoom</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">CaptureArea=Left</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left"/> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC2</td> | ||||
<td align="left">CaptureArea=Center</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left"/> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC3</td> | ||||
<td align="left">CaptureArea=Right</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left"/> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(VC1,VC2,VC3)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<table anchor="ref-advertisement-received-from-endpoint-e" align="cent | ||||
er"> | ||||
<name>Advertisement Received from Endpoint E</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=ChinaConfRoom</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">CaptureArea=Left</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left"/> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC2</td> | ||||
<td align="left">CaptureArea=Center</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left"/> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC3</td> | ||||
<td align="left">CaptureArea=Right</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left"/> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV(VC1,VC2,VC3)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
The MCU wants to offer Endpoint F three Capture Encodings. Each | ||||
Capture Encoding would contain all the Captures from either | ||||
Endpoint D or Endpoint E, depending on the active speaker. | ||||
The MCU sends the following Advertisement:</t> | ||||
<table anchor="ref-advertisement-sent-to-endpoint-f"> | ||||
<name>Advertisement Sent to Endpoint F</name> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #1</th><th>Description=AustralianConfRoom</th> | ||||
</tr> | ||||
<tr><td>VC1</td> <td/></tr> | ||||
<tr><td>VC2</td> <td/></tr> | ||||
<tr><td>VC3</td> <td/></tr> | ||||
<tr><td>CSV(VC1,VC2,VC3)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
            <tr><th>Capture Scene #2</th> <th>Description=ChinaConfRoom</th></tr>
<tr><td>VC4</td> <td/></tr> | ||||
<tr><td>VC5</td> <td/></tr> | ||||
<tr><td>VC6</td> <td/></tr> | ||||
<tr><td>CSV(VC4,VC5,VC6)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr><th>Capture Scene #3</th> <th/></tr> | ||||
<tr><td>MCC1(VC1,VC4)</td> <td>CaptureArea=Left | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>EncodingGroup=1 | ||||
</td> | ||||
</tr> | ||||
<tr><td>MCC2(VC2,VC5)</td> <td>CaptureArea=Center | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>EncodingGroup=1 | ||||
</td> | ||||
</tr> | ||||
<tr><td>MCC3(VC3,VC6)</td> <td>CaptureArea=Right | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>EncodingGroup=1 | ||||
</td> | ||||
</tr> | ||||
<tr><td>CSV(MCC1,MCC2,MCC3)</td> <td/></tr> | ||||
</tbody> | ||||
</table> | ||||
</section> | ||||
<section anchor="s-12.3.3" numbered="true" toc="default"> | ||||
<name>Heterogeneous Conference with Switching and Composition</name> | ||||
<t> | ||||
Consider a Conference between Endpoints with the following | ||||
characteristics:</t> | ||||
<dl newline="false" spacing="normal"> | ||||
<dt>Endpoint A -</dt> | ||||
<dd>4 screens, 3 cameras</dd> | ||||
<dt>Endpoint B -</dt> | ||||
<dd>3 screens, 3 cameras</dd> | ||||
<dt>Endpoint C -</dt> | ||||
<dd>3 screens, 3 cameras</dd> | ||||
<dt>Endpoint D -</dt> | ||||
<dd>3 screens, 3 cameras</dd> | ||||
<dt>Endpoint E -</dt> | ||||
<dd>1 screen, 1 camera</dd> | ||||
<dt>Endpoint F -</dt> | ||||
<dd>2 screens, 1 camera</dd> | ||||
<dt>Endpoint G -</dt> | ||||
<dd>1 screen, 1 camera</dd> | ||||
</dl> | ||||
<t> | ||||
This example focuses on what the user in one of the three-camera | ||||
multi-screen Endpoints sees. Call this person User A, at Endpoint | ||||
A. There are four large display screens at Endpoint A. Whenever | ||||
somebody at another site is speaking, all the Video Captures from | ||||
that Endpoint are shown on the large screens. If the talker is at | ||||
a three-camera site, then the video from those three cameras fills three of | ||||
the screens. If the person speaking is at a single-camera site, then video | ||||
from that camera fills one of the screens, while the other screens | ||||
show video from other single-camera Endpoints.</t> | ||||
<t> | ||||
User A hears audio from the four loudest talkers.</t> | ||||
<t> | ||||
User A can also see video from other Endpoints, in addition to the | ||||
current person speaking, although much smaller in size. Endpoint A has four | ||||
screens, so one of those screens shows up to nine other Media Captures | ||||
in a tiled fashion. When video from a three-camera Endpoint appears in | ||||
the tiled area, video from all three cameras appears together across | ||||
the screen with correct Spatial Relationship among those three images.</t> | ||||
<figure anchor="ref-endpoint-a-4-screen-display"> | ||||
<name>Endpoint A - Four-Screen Display</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
+---+---+---+ +-------------+ +-------------+ +-------------+ | ||||
| | | | | | | | | | | ||||
+---+---+---+ | | | | | | | ||||
| | | | | | | | | | | ||||
+---+---+---+ | | | | | | | ||||
| | | | | | | | | | | ||||
+---+---+---+ +-------------+ +-------------+ +-------------+ | ||||
]]></artwork> | ||||
</figure> | ||||
<t> | ||||
User B at Endpoint B sees a similar arrangement, except there are | ||||
only three screens, so the nine other Media Captures are spread out across | ||||
the bottom of the three displays, in a PiP format. | ||||
When video from a three-camera Endpoint appears in the PiP area, video | ||||
from all three cameras appears together across one screen with | ||||
correct Spatial Relationship.</t> | ||||
<figure anchor="ref-endpoint-b-3-screen-display-with-pips"> | ||||
<name>Endpoint B - Three-Screen Display with PiPs</name> | ||||
<artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
+-------------+ +-------------+ +-------------+ | ||||
| | | | | | | ||||
| | | | | | | ||||
| | | | | | | ||||
| +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ | | ||||
| +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ | | ||||
+-------------+ +-------------+ +-------------+ | ||||
]]></artwork> | ||||
</figure> | ||||
<t> | ||||
When somebody at a different Endpoint becomes the current speaker, | ||||
then User A and User B both see the video from the new person speaking | ||||
appear on their large screen area, while the previous speaker takes | ||||
one of the smaller tiled or PiP areas. The person who is the | ||||
current speaker doesn't see themselves; they see the previous speaker | ||||
in their large screen area.</t> | ||||
<t> | ||||
One of the points of this example is that Endpoints A and B each | ||||
want to receive three Capture Encodings for their large display areas, | ||||
          and nine Encodings for their smaller areas. A and B are each able
          to send the same Configure message to the MCU, and each receive
the same conceptual Media Captures from the MCU. The differences | ||||
are in how they are Rendered and are purely a local matter at A and | ||||
B.</t> | ||||
<t>The Advertisements for such a scenario are described below. | ||||
</t> | ||||
<table anchor="ref-advertisement-received-at-the-mcu-from-endpoints-a- | ||||
to-d" align="center"> | ||||
<name>Advertisement Received at the MCU from Endpoints A to D</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=Endpoint x</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC2</td> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">VC3</td> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">AC1</td> | ||||
<td align="left">EncodingGroup=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV1(VC1, VC2, VC3)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV2(AC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<table anchor="ref-advertisement-received-at-the-mcu-from-endpoints-e- | ||||
to-g" align="center"> | ||||
<name>Advertisement Received at the MCU from Endpoints E to G</name> | ||||
<thead> | ||||
<tr> | ||||
<th align="left"> Capture Scene #1</th> | ||||
<th align="left"> Description=Endpoint y</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="left">VC1</td> | ||||
<td align="left">EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">AC1</td> | ||||
<td align="left">EncodingGroup=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV1(VC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
<tr> | ||||
<td align="left">CSV2(AC1)</td> | ||||
<td align="left"/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
Rather than considering what is displayed, CLUE concentrates more | ||||
on what the MCU sends. The MCU doesn't know anything about the | ||||
number of screens an Endpoint has.</t> | ||||
<t> | ||||
As Endpoints A to D each advertise that three Captures make up a | ||||
Capture Scene, the MCU offers these in a "site switching" mode. | ||||
That is, there are three Multiple Content Captures (and | ||||
Capture Encodings) each switching between Endpoints. The MCU | ||||
switches in the applicable Media into the Stream based on voice | ||||
activity. Endpoint A will not see a Capture from itself.</t> | ||||
<t> | ||||
Using the MCC concept, the MCU would send the following | ||||
Advertisement to Endpoint A:</t> | ||||
<table anchor="ref-advertisement-sent-to-endpoint-a-source-part"> | ||||
<name>Advertisement Sent to Endpoint A - Source Part</name> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #1</th><th>Description=Endpoint B</th> | ||||
</tr> | ||||
<tr><td>VC4</td> <td>CaptureArea=Left</td></tr> | ||||
<tr><td>VC5</td> <td>CaptureArea=Center</td></tr> | ||||
<tr><td>VC6</td> <td>CaptureArea=Right</td></tr> | ||||
<tr><td>AC1</td> <td/></tr> | ||||
<tr><td>CSV(VC4,VC5,VC6)</td> <td/></tr> | ||||
<tr><td>CSV(AC1)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #2</th><th>Description=Endpoint C</th> | ||||
</tr> | ||||
<tr><td>VC7</td> <td>CaptureArea=Left</td></tr> | ||||
<tr><td>VC8</td> <td>CaptureArea=Center</td></tr> | ||||
<tr><td>VC9</td> <td>CaptureArea=Right</td></tr> | ||||
<tr><td>AC2</td> <td/></tr> | ||||
<tr><td>CSV(VC7,VC8,VC9)</td> <td/></tr> | ||||
<tr><td>CSV(AC2)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #3</th><th>Description=Endpoint D</th> | ||||
</tr> | ||||
<tr><td>VC10</td> <td>CaptureArea=Left</td></tr> | ||||
<tr><td>VC11</td> <td>CaptureArea=Center</td></tr> | ||||
<tr><td>VC12</td> <td>CaptureArea=Right</td></tr> | ||||
<tr><td>AC3</td> <td/></tr> | ||||
<tr><td>CSV(VC10,VC11,VC12)</td> <td/></tr> | ||||
<tr><td>CSV(AC3)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #4</th><th>Description=Endpoint E</th> | ||||
</tr> | ||||
<tr><td>VC13</td> <td/></tr> | ||||
<tr><td>AC4</td> <td/></tr> | ||||
<tr><td>CSV(VC13)</td> <td/></tr> | ||||
<tr><td>CSV(AC4)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #5</th><th>Description=Endpoint F</th> | ||||
</tr> | ||||
<tr><td>VC14</td> <td/></tr> | ||||
<tr><td>AC5</td> <td/></tr> | ||||
<tr><td>CSV(VC14)</td> <td/></tr> | ||||
<tr><td>CSV(AC5)</td> <td/></tr> | ||||
</tbody> | ||||
<tbody> | ||||
<tr> | ||||
<th>Capture Scene #6</th><th>Description=Endpoint G</th> | ||||
</tr> | ||||
<tr><td>VC15</td> <td/></tr> | ||||
<tr><td>AC6</td> <td/></tr> | ||||
<tr><td>CSV(VC15)</td> <td/></tr> | ||||
<tr><td>CSV(AC6)</td> <td/></tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
The above part of the Advertisement presents information about the | ||||
sources to the MCC. The information is effectively the same as the | ||||
received Advertisements, except that there are no Capture Encodings | ||||
associated with them and the identities have been renumbered.</t> | ||||
<t> | ||||
In addition to the source Capture information, the MCU advertises | ||||
site switching of Endpoints B to G in three Streams.</t> | ||||
<table anchor="table_22"> | ||||
<name>Advertisement Sent to Endpoint A - Switching Parts</name> | ||||
<thead> | ||||
<tr> | ||||
<th>Capture Scene #7</th><th>Description=Output3streammix</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td>MCC1(VC4,VC7,VC10,&zwsp;VC13)</td> <td>CaptureArea=Left | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC2(VC5,VC8,VC11,&zwsp;VC14)</td> <td>CaptureArea=Center | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC3(VC6,VC9,VC12,&zwsp;VC15)</td> <td>CaptureArea=Right | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC4() (for audio)</td> <td>CaptureArea=whole Scene | ||||
<br/>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC5() (for audio)</td> <td>CaptureArea=whole Scene | ||||
<br/>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:1 | ||||
<br/>EncodingGroup=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC6() (for audio)</td> <td>CaptureArea=whole Scene | ||||
<br/>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:2 | ||||
<br/>EncodingGroup=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC7() (for audio)</td> <td>CaptureArea=whole Scene | ||||
<br/>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:3 | ||||
<br/>EncodingGroup=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV(MCC1,MCC2,MCC3)</td> <td/></tr> | ||||
<tr> | ||||
<td>CSV(MCC4,MCC5,MCC6,&zwsp;MCC7)</td> <td/></tr> | ||||
</tbody></table> | ||||
<t> | ||||
The above part describes the three main switched Streams that relate to | ||||
site switching. MaxCaptures=1 indicates that only one Capture from | ||||
the MCC is sent at a particular time. SynchronizationID=1 indicates | ||||
that the source sending is synchronized. The Provider can choose to | ||||
group together VC13, VC14, and VC15 for the purpose of switching | ||||
according to the SynchronizationID. Therefore, when the Provider | ||||
switches one of them into an MCC, it can also switch the others | ||||
even though they are not part of the same Capture Scene.</t> | ||||
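          <t>
            As a non-normative illustration of that synchronized switching,
            the sketch below maps the site judged active (Endpoint B, C, or
            D, or the single-camera Endpoints E, F, and G grouped together as
            the text above allows) onto the constituents placed into MCC1,
            MCC2, and MCC3. The grouping label "EFG" is an assumption made
            for this sketch.</t>
          <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: with SynchronizationID=1, the three MCCs switch
# together to spatially matching constituents from the active site.

CONSTITUENTS = {          # active site -> (MCC1, MCC2, MCC3)
    "B":   ("VC4",  "VC5",  "VC6"),
    "C":   ("VC7",  "VC8",  "VC9"),
    "D":   ("VC10", "VC11", "VC12"),
    "EFG": ("VC13", "VC14", "VC15"),
}

def switch_site(active_site):
    mcc1, mcc2, mcc3 = CONSTITUENTS[active_site]
    return {"MCC1": mcc1, "MCC2": mcc2, "MCC3": mcc3}

print(switch_site("C"))   # {'MCC1': 'VC7', 'MCC2': 'VC8', 'MCC3': 'VC9'}
]]></artwork>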
<t> | ||||
All the audio for the Conference is included in Scene #7. | ||||
There isn't necessarily a one-to-one relation between any Audio | ||||
Capture and Video Capture in this Scene. Typically, a change in | ||||
the loudest talker will cause the MCU to switch the audio Streams more | ||||
quickly than switching video Streams.</t> | ||||
<t> | ||||
The MCU can also supply nine Media Streams showing the active and | ||||
previous eight speakers. It includes the following in the | ||||
Advertisement:</t> | ||||
<table anchor="table_23"> | ||||
            <name>Advertisement Sent to Endpoint A - 9 Switched Parts</name>
<thead> | ||||
<tr> | ||||
<th>Capture Scene #8</th><th>Description=Output9stream</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="right">MCC8(VC4,VC5,VC6,VC7, | ||||
<br/>VC8,VC9,VC10,VC11, | ||||
<br/>VC12,VC13,VC14,VC15)</td> | ||||
<td>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr><tr> | ||||
<td align="right">MCC9(VC4,VC5,VC6,VC7, | ||||
<br/>VC8,VC9,VC10,VC11, | ||||
<br/>VC12,VC13,VC14,VC15) | ||||
</td> | ||||
<td>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:1 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr><tr> | ||||
<th align="center">to</th><th align="center">to</th> | ||||
</tr><tr> | ||||
<td align="right">MCC16(VC4,VC5,VC6,VC7, | ||||
<br/>VC8,VC9,VC10,VC11, | ||||
<br/>VC12,VC13,VC14,VC15)</td> | ||||
<td>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:8 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr><tr> | ||||
<td align="right">CSV(MCC8,MCC9,MCC10, | ||||
<br/>MCC11,MCC12,MCC13, | ||||
<br/>MCC14,MCC15,MCC16)</td> | ||||
<td/> | ||||
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
The above part indicates that there are nine Capture Encodings. Each | ||||
of the Capture Encodings may contain any Captures from any source | ||||
site with a maximum of one Capture at a time. Which Capture is | ||||
present is determined by the policy. The MCCs in this Scene do not | ||||
have any spatial attributes.</t> | ||||
<t> | ||||
Note: The Provider alternatively could provide each of the MCCs | ||||
above in its own Capture Scene.</t> | ||||
<t> | ||||
If the MCU wanted to provide a composed Capture Encoding containing | ||||
all of the nine Captures, it could advertise in addition:</t> | ||||
<table anchor="ref-advertisement-sent-to-endpoint-a-9-composed-part"> | ||||
            <name>Advertisement Sent to Endpoint A - 9 Composed Parts</name>
<thead> | ||||
<tr> | ||||
<th>Capture Scene #9</th><th>Description=NineTiles</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td align="right">MCC13(MCC8,MCC9,MCC10,<br/> | ||||
MCC11,MCC12,MCC13,<br/> | ||||
MCC14,MCC15,MCC16)</td> | ||||
<td>MaxCaptures=9<br/> | ||||
EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
              <td>CSV(MCC17)</td><td/>
</tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
As MaxCaptures is 9, it indicates that the Capture Encoding contains | ||||
information from nine sources at a time.</t> | ||||
<t> | ||||
The Advertisement to Endpoint B is identical to the above, other | ||||
than the fact that Captures from Endpoint A would be added and the Captures | ||||
from Endpoint B would be removed. Whether the Captures are Rendered | ||||
on a four-screen display or a three-screen display is up to the | ||||
Consumer to determine. The Consumer wants to place Video Captures | ||||
from the same original source Endpoint together, in the correct | ||||
spatial order, but the MCCs do not have spatial attributes. So, the | ||||
Consumer needs to associate incoming Media packets with the | ||||
original individual Captures in the Advertisement (such as VC4, | ||||
VC5, and VC6) in order to know the spatial information it needs for | ||||
correct placement on the screens. The Provider can use the RTCP | ||||
CaptureId source description (SDES) item and associated RTP header extension, | ||||
as | ||||
described in <xref target="RFC8849" format="default"/>, to convey this | ||||
information to the Consumer.</t> | ||||
</section> | ||||
<section anchor="s-12.3.4" numbered="true" toc="default"> | ||||
<name>Heterogeneous Conference with Voice-Activated Switching</name> | ||||
<t> | ||||
This example illustrates how multipoint "voice-activated switching" | ||||
behavior can be realized, with an Endpoint making its own decision | ||||
          about which of its outgoing video Streams is considered the "active
          talker" from that Endpoint. Then, an MCU can decide which is the
active talker among the whole Conference.</t> | ||||
<t> | ||||
Consider a Conference between Endpoints with the following | ||||
characteristics:</t> | ||||
<dl newline="false" spacing="normal"> | ||||
<dt>Endpoint A -</dt> | ||||
<dd>3 screens, 3 cameras</dd> | ||||
<dt>Endpoint B -</dt> | ||||
<dd>3 screens, 3 cameras</dd> | ||||
<dt>Endpoint C -</dt> | ||||
<dd>1 screen, 1 camera</dd> | ||||
</dl> | ||||
<t> | ||||
This example focuses on what the user at Endpoint C sees. The | ||||
user would like to see the Video Capture of the current talker, | ||||
without composing it with any other Video Capture. In this | ||||
example, Endpoint C is capable of receiving only a single video | ||||
Stream. The following tables describe Advertisements from Endpoints A and B | ||||
to the MCU, and from the MCU to Endpoint C, that can be used to accomplish | ||||
this.</t> | ||||
<table anchor="ref-advertisement-received-at-the-mcu-from-endpoints-a- | ||||
and-b"> | ||||
<name>Advertisement Received at the MCU from Endpoints A and B</name | ||||
> | ||||
<thead> | ||||
<tr> | ||||
<th>Capture Scene #1</th><th>Description=Endpoint x</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td>VC1</td> <td>CaptureArea=Left | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>VC2</td> <td>CaptureArea=Center | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>VC3</td> <td>CaptureArea=Right | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC1(VC1,VC2,VC3)</td> <td>MaxCaptures=1 | ||||
<br/>CaptureArea=whole Scene | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1</td> | ||||
</tr> | ||||
<tr> | ||||
<td>AC1</td> <td>CaptureArea=whole Scene | ||||
<br/>EncodingGroup=2</td> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV1(VC1, VC2, VC3)</td><td/> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV2(MCC1)</td><td/> | ||||
</tr> | ||||
<tr> | ||||
<td>CSV3(AC1)</td><td/> | ||||
</tr></tbody> | ||||
</table> | ||||
<t> | ||||
Endpoints A and B are advertising each individual Video Capture, | ||||
and also a switched Capture MCC1 that switches between the other | ||||
three based on who is the active talker. These Endpoints do not | ||||
advertise distinct Audio Captures associated with each individual | ||||
Video Capture, so it would be impossible for the MCU (as a Media | ||||
Consumer) to make its own determination of which Video Capture is | ||||
the active talker based just on information in the audio Streams.</t> | ||||
<table anchor="ref-advertisement-sent-from-the-mcu-to-c"> | ||||
<name>Advertisement Sent from the MCU to Endpoint C</name> | ||||
<thead> | ||||
<tr><th>Capture Scene #1</th><th>Description=conference</th> | ||||
</tr> | ||||
</thead> | ||||
<tbody> | ||||
<tr> | ||||
<td>MCC1()</td> | ||||
<td>CaptureArea=Left | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1 | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC2()</td><td>CaptureArea=Center | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1 | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC3()</td><td>CaptureArea=Right | ||||
<br/>MaxCaptures=1 | ||||
<br/>SynchronizationID=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1 | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC4()</td><td>CaptureArea=whole Scene | ||||
<br/>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=1 | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC5() (for audio)</td><td>CaptureArea=whole Scene | ||||
<br/>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:0 | ||||
<br/>EncodingGroup=2 | ||||
</td> | ||||
</tr> | ||||
<tr> | ||||
<td>MCC6() (for audio)</td><td>CaptureArea=whole Scene | ||||
<br/>MaxCaptures=1 | ||||
<br/>Policy=SoundLevel:1 | ||||
<br/>EncodingGroup=2 | ||||
</td> | ||||
</tr> | ||||
<tr><td>CSV1(MCC1,MCC2,MCC3)</td><td/></tr> | ||||
<tr><td>CSV2(MCC4)</td><td/></tr> | ||||
<tr><td>CSV3(MCC5,MCC6)</td><td/></tr> | ||||
</tbody> | ||||
</table> | ||||
<t> | ||||
The MCU advertises one Scene, with four video MCCs. Three of them | ||||
in CSV1 give a left, center, and right view of the Conference, with | ||||
site switching. MCC4 provides a single Video Capture | ||||
representing a view of the whole Conference. The MCU intends for | ||||
MCC4 to be switched between all the other original source | ||||
          Captures. In this example, the MCU's Advertisement is not giving all
          the information about all the other Endpoints' Scenes and which of
          those Captures are included in the MCCs. The MCU could include all
          that if it wants to give the Consumers more
          information, but it is not necessary for this example scenario.</t>
<t> | ||||
The Provider advertises MCC5 and MCC6 for audio. Both are | ||||
switched Captures, with different SoundLevel policies indicating | ||||
they are the top two dominant talkers. The Provider advertises | ||||
CSV3 with both MCCs, suggesting the Consumer should use both if it | ||||
can.</t> | ||||
<t> | ||||
Endpoint C, in its Configure Message to the MCU, requests to | ||||
receive MCC4 for video and MCC5 and MCC6 for audio. In order for | ||||
the MCU to get the information it needs to construct MCC4, it has | ||||
to send Configure Messages to Endpoints A and B asking to receive MCC1 from | ||||
each of them, along with their AC1 audio. Now the MCU can use | ||||
audio energy information from the two incoming audio Streams from | ||||
Endpoints A and B to determine which of those alternatives is the current | ||||
talker. Based on that, the MCU uses either MCC1 from A or MCC1 | ||||
from B as the source of MCC4 to send to Endpoint C.</t> | ||||
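          <t>
            As a non-normative sketch of that selection, the fragment below
            picks the source of MCC4 by comparing a recent audio level
            estimate for the AC1 Streams received from Endpoints A and B; the
            level values and the function name are illustrative assumptions
            only.</t>
          <artwork name="" type="" align="left" alt=""><![CDATA[
# Non-normative sketch: the MCU chooses which Endpoint's MCC1 feeds MCC4
# based on recent audio energy measured on the incoming AC1 streams.

def pick_mcc4_source(audio_level_dbov):
    """audio_level_dbov maps an Endpoint name to a recent level estimate
    (closer to 0 dBov means louder)."""
    loudest = max(audio_level_dbov, key=audio_level_dbov.get)
    return "MCC1 from Endpoint " + loudest

print(pick_mcc4_source({"A": -28.0, "B": -41.5}))  # MCC1 from Endpoint A
]]></artwork>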
</section> | ||||
</section> | ||||
</section> | ||||
<section anchor="s-14" numbered="true" toc="default"> | ||||
<name>IANA Considerations</name> | ||||
<t> | ||||
This document has no IANA actions. | ||||
</t> | ||||
</section> | ||||
<section anchor="s-15" numbered="true" toc="default"> | ||||
<name>Security Considerations</name> | ||||
<t> | ||||
There are several potential attacks related to telepresence, | ||||
specifically the protocols used by CLUE. This is the case due to | ||||
conferencing sessions, the natural involvement of multiple | ||||
Endpoints, and the many, often user-invoked, capabilities provided | ||||
by the systems.</t> | ||||
<t> | ||||
An MCU involved in a CLUE session can experience many of the same | ||||
attacks as a conferencing system such as the one enabled by | ||||
the Conference | ||||
        Information Data Model for Centralized Conferencing (XCON) framework
        <xref target="RFC5239" format="default"/>. Examples of attacks include the
following: an Endpoint attempting to listen to sessions in which | ||||
it is not authorized to participate, an Endpoint attempting to | ||||
disconnect or mute other users, and theft of service by an | ||||
Endpoint in attempting to create telepresence sessions it is not | ||||
allowed to create. Thus, it is <bcp14>RECOMMENDED</bcp14> that an MCU | ||||
implementing the protocols necessary to support CLUE follow the | ||||
security recommendations specified in the conference control | ||||
protocol documents. | ||||
        In the case of CLUE, SIP is the conferencing protocol; thus, the
        security considerations in <xref target="RFC4579" format="default"/>
        <bcp14>MUST</bcp14> be followed. Other security issues related to MCUs are discussed in
        the XCON framework <xref target="RFC5239" format="default"/>. The use of
        xCard with potentially
        sensitive information provides another reason to implement
        recommendations in <xref section="11" sectionFormat="of" target="RFC5239" format="default"/>.</t>
<t> | ||||
One primary security concern, surrounding the CLUE framework | ||||
introduced in this document, involves securing the actual | ||||
protocols and the associated authorization mechanisms. These | ||||
concerns apply to Endpoint-to-Endpoint sessions as well as | ||||
        sessions involving multiple Endpoints and MCUs.
        <xref target="ref-basic-information-flow" format="default"/> in
        <xref target="s-5" format="default"/> provides a basic flow of information
        exchange for CLUE
and the protocols involved.</t> | ||||
<t> | ||||
As described in <xref target="s-5" format="default"/>, CLUE uses SIP/SDP to | ||||
establish the session prior to exchanging any CLUE-specific | ||||
information. Thus, the security mechanisms recommended for SIP | ||||
<xref target="RFC3261" format="default"/>, including user authentication and | ||||
        authorization, <bcp14>MUST</bcp14> be supported. In addition, the Media
        <bcp14>MUST</bcp14> be
        secured. Datagram Transport Layer Security (DTLS) / Secure Real-time
        Transport Protocol (SRTP) <bcp14>MUST</bcp14> be supported and
        <bcp14>SHOULD</bcp14> be used unless the
        Media, which is based on RTP, is secured by other means (see
        <xref target="RFC7201" format="default"/> <xref target="RFC7202" format="default"/>). Media security is
        also discussed in <xref target="RFC8848" format="default"/> and
        <xref target="RFC8849" format="default"/>. Note that SIP call setup is done before any
CLUE-specific information is available, so the authentication and | ||||
authorization are based on the SIP mechanisms. The entity that will | ||||
be authenticated may use the Endpoint identity or the Endpoint user | ||||
identity; this is an application issue and not a CLUE-specific | ||||
issue.</t> | ||||
<t> | ||||
A separate data channel is established to transport the CLUE | ||||
protocol messages. The contents of the CLUE protocol messages are | ||||
based on information introduced in this document. The CLUE data | ||||
model <xref target="RFC8846" format="default"/> defines, through an XML | ||||
schema, the syntax to be used. One type of information that could | ||||
possibly introduce privacy concerns is the xCard information, as | ||||
described in <xref target="s-7.1.1.10" format="default"/>. The decision about | ||||
which xCard | ||||
information to send in the CLUE channel is an application policy | ||||
for point-to-point and multipoint calls based on the authenticated | ||||
identity that can be the Endpoint identity or the user of the | ||||
Endpoint. For example, the telepresence multipoint application can | ||||
authenticate a user before starting a CLUE exchange with the | ||||
telepresence system and have a policy per user.</t> | ||||
<t> | ||||
In addition, the (text) description field in the Media Capture | ||||
attribute (<xref target="s-7.1.1.6" format="default"/>) could possibly reveal | ||||
sensitive | ||||
information or specific identities. The same would be true for the | ||||
descriptions in the Capture Scene (<xref target="s-7.3.1" format="default"/>) | ||||
and CSV | ||||
(<xref target="s-7.3.2" format="default"/>) attributes. An implementation <bc | ||||
p14>SHOULD</bcp14> give users | ||||
control over what sensitive information is sent in an | ||||
Advertisement. One other important consideration for the | ||||
information in the xCard as well as the description field in the | ||||
Media Capture and CSV attributes is that while the | ||||
Endpoints involved in the session have been authenticated, there | ||||
        is no assurance that the information in the xCard or description
fields is authentic. Thus, this information <bcp14>MUST NOT</bcp14> be used | ||||
to | ||||
make any authorization decisions.</t> | ||||
<t> | ||||
While other information in the CLUE protocol messages does not | ||||
reveal specific identities, it can reveal characteristics and | ||||
capabilities of the Endpoints. That information could possibly | ||||
uniquely identify specific Endpoints. It might also be possible | ||||
for an attacker to manipulate the information and disrupt the CLUE | ||||
sessions. It would also be possible to mount a DoS attack on the | ||||
CLUE Endpoints if a malicious agent has access to the data | ||||
        channel. Thus, it <bcp14>MUST</bcp14> be possible for the Endpoints to
        establish
a channel that is secure against both message recovery and | ||||
message modification. Further details on this are provided in the | ||||
        CLUE data channel solution document <xref target="RFC8850" format="default"/>.</t>
<t> | ||||
There are also security issues associated with the authorization | ||||
to perform actions at the CLUE Endpoints to invoke specific | ||||
capabilities (e.g., rearranging screens, sharing content, etc.). | ||||
However, the policies and security associated with these actions | ||||
are outside the scope of this document and the overall CLUE | ||||
solution.</t> | ||||
    </section>
  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <!-- &I-D.ietf-clue-datachannel; is 8850 -->
        <reference anchor="RFC8850" target="https://www.rfc-editor.org/info/rfc8850">
          <front>
            <title>Controlling Multiple Streams for Telepresence (CLUE) Protocol Data Channel</title>
            <author initials="C." surname="Holmberg" fullname="Christer Holmberg">
              <organization/>
            </author>
            <date month="January" year="2021"/>
          </front>
          <seriesInfo name="RFC" value="8850"/>
          <seriesInfo name="DOI" value="10.17487/RFC8850"/>
        </reference>
        <!-- &I-D.ietf-clue-data-model-schema; is 8846 -->
        <reference anchor="RFC8846" target="https://www.rfc-editor.org/info/rfc8846">
          <front>
            <title>An XML Schema for the Controlling Multiple Streams for Telepresence (CLUE) Data Model</title>
            <author initials="R." surname="Presta" fullname="Roberta Presta">
              <organization/>
            </author>
            <author initials="S.P." surname="Romano" fullname="Simon Pietro Romano">
              <organization/>
            </author>
            <date month="January" year="2021"/>
          </front>
          <seriesInfo name="RFC" value="8846"/>
          <seriesInfo name="DOI" value="10.17487/RFC8846"/>
        </reference>
        <!-- &I-D.ietf-clue-protocol; is 8847 -->
        <reference anchor="RFC8847" target="https://www.rfc-editor.org/info/rfc8847">
          <front>
            <title>Protocol for Controlling Multiple Streams for Telepresence (CLUE)</title>
            <author initials="R." surname="Presta" fullname="Roberta Presta">
              <organization/>
            </author>
            <author initials="S.P." surname="Romano" fullname="Simon Pietro Romano">
              <organization/>
            </author>
            <date month="January" year="2021"/>
          </front>
          <seriesInfo name="RFC" value="8847"/>
          <seriesInfo name="DOI" value="10.17487/RFC8847"/>
        </reference>
        <!-- &I-D.ietf-clue-signaling; is 8848 -->
        <reference anchor="RFC8848" target="https://www.rfc-editor.org/info/rfc8848">
          <front>
            <title>Session Signaling for Controlling Multiple Streams for Telepresence (CLUE)</title>
            <author initials="R." surname="Hanton" fullname="Robert Hanton">
              <organization/>
            </author>
            <author initials="P." surname="Kyzivat" fullname="Paul Kyzivat">
              <organization/>
            </author>
            <author initials="L." surname="Xiao" fullname="Lennard Xiao">
              <organization/>
            </author>
            <author initials="C." surname="Groves" fullname="Christian Groves">
              <organization/>
            </author>
            <date month="January" year="2021"/>
          </front>
          <seriesInfo name="RFC" value="8848"/>
          <seriesInfo name="DOI" value="10.17487/RFC8848"/>
        </reference>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3261.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3264.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3550.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4566.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4579.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5239.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5646.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.6350.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.6351.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
      </references>
      <references>
        <name>Informative References</name>
        <!-- &I-D.ietf-clue-rtp-mapping; is 8849 -->
        <reference anchor="RFC8849" target="https://www.rfc-editor.org/info/rfc8849">
          <front>
            <title>Mapping RTP Streams to Controlling Multiple Streams for Telepresence (CLUE) Media Captures</title>
            <author initials="R." surname="Even" fullname="Roni Even">
              <organization/>
            </author>
            <author initials="J." surname="Lennox" fullname="Jonathan Lennox">
              <organization/>
            </author>
            <date month="January" year="2021"/>
          </front>
          <seriesInfo name="RFC" value="8849"/>
          <seriesInfo name="DOI" value="10.17487/RFC8849"/>
        </reference>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4353.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7667.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7201.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7202.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7205.xml"/>
        <xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7262.xml"/>
      </references>
    </references>
    <section anchor="acks" numbered="false" toc="default">
      <name>Acknowledgements</name>
      <t>
        <contact fullname="Allyn Romanow"/> and <contact fullname="Brian Baldino"/> were
        authors of early draft versions.
        <contact fullname="Mark Gorzynski"/> also contributed much to the initial approach.
        Many others also contributed,
        including <contact fullname="Christian Groves"/>,
        <contact fullname="Jonathan Lennox"/>,
        <contact fullname="Paul Kyzivat"/>,
        <contact fullname="Rob Hanton"/>,
        <contact fullname="Roni Even"/>,
        <contact fullname="Christer Holmberg"/>,
        <contact fullname="Stephen Botzko"/>,
        <contact fullname="Mary Barnes"/>,
        <contact fullname="John Leslie"/>, and
        <contact fullname="Paul Coverdale"/>.</t>
    </section>
  </back>
</rfc>