CLUE WG                                                        C. Groves
Internet-Draft                                                   W. Yang
Intended status: Informational                                   R. Even
Expires: March 14, 2013                                           Huawei
                                                      September 10, 2012


                     CLUE media capture description
                  draft-groves-clue-capture-attr-00.txt

Abstract

   This memo discusses how media captures are described, in particular
   the "Content" attribute in the current CLUE framework document, and
   proposes several alternatives.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 14, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Issues with the Content attribute
     3.1.  Ambiguous definition
     3.2.  Multiple functions
     3.3.  Limited Stream Support
     3.4.  Insufficient information for individual parameters
     3.5.  Insufficient information for negotiation
   4.  Capture description attributes
     4.1.  Presentation
     4.2.  View
     4.3.  Language
     4.4.  Role
     4.5.  Priority
     4.6.  Others
       4.6.1.  Dynamic
       4.6.2.  Embedded Text
       4.6.3.  Supplementary Description
       4.6.4.  Telepresence
   5.  Summary
   6.  Acknowledgements
   7.  IANA Considerations
   8.  Security Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   One of the fundamental aspects of the CLUE framework is the concept
   of media captures.  Media captures are advertised by a provider to a
   consumer.  The consumer then selects which captures it is interested
   in and replies back to the provider.
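   As a rough illustration of this exchange, the advertisement/selection
   flow can be modelled as below.  The class and attribute names are
   hypothetical placeholders for this memo, not the actual CLUE data
   model encoding:

```python
from dataclasses import dataclass, field

@dataclass
class MediaCapture:
    """Illustrative model of an advertised capture (hypothetical names)."""
    capture_id: str
    media_type: str                      # e.g. "audio" or "video"
    attributes: dict = field(default_factory=dict)

# Provider side: advertise captures with descriptive attributes.
advertisement = [
    MediaCapture("VC0", "video", {"view": "room"}),
    MediaCapture("VC1", "video", {"view": "individual", "role": "speaker"}),
    MediaCapture("AC0", "audio", {"language": "en"}),
]

def select(captures, **wanted):
    """Consumer side: pick captures whose attributes match the wanted values."""
    return [c.capture_id for c in captures
            if all(c.attributes.get(k) == v for k, v in wanted.items())]

print(select(advertisement, role="speaker"))   # ['VC1']
```

   The attributes discussed in this memo aim to make this kind of
   selection possible without relying on free-text matching.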
   The question is: how does the consumer choose between what may be
   many different media captures?  In order to choose between the
   different media captures, the consumer must have enough information
   regarding what each media capture represents to distinguish between
   them.

   The CLUE framework draft currently defines several media capture
   attributes which provide information regarding a capture.  The draft
   indicates that "Media Capture Attributes describe static information
   about the captures.  A provider uses the media capture attributes to
   describe the media captures to the consumer.  The consumer will
   select the captures it wants to receive.  Attributes are defined by a
   variable and its value."

   One of the media capture attributes is the "Content" attribute.  As
   indicated in the draft, it is "a field with enumerated values which
   describes the role of the media capture and can be applied to any
   media type.  The enumerated values are defined by [RFC4796].  The
   values for this attribute are the same as the mediacnt values for the
   content attribute in [RFC4796].  This attribute can have multiple
   values, for example content={main, speaker}."

   [RFC4796] defines the values as:

   o  slides: the media stream includes presentation slides.  The media
      type can be, for example, a video stream or a number of instant
      messages with pictures.  Typical use cases for this are online
      seminars and courses.  This is similar to the 'presentation' role
      in H.239.

   o  speaker: the media stream contains the image of the speaker.  The
      media can be, for example, a video stream or a still image.
      Typical use cases for this are online seminars and courses.

   o  sl: the media stream contains sign language.  A typical use case
      for this is an audio stream that is translated into sign language,
      which is sent over a video stream.

   o  main: the media stream is taken from the main source.  A typical
      use case for this is a concert where the camera is shooting the
      performer.

   o  alt: the media stream is taken from the alternative source.  A
      typical use case for this is an event where the ambient sound is
      separated from the main sound.  The alternative audio stream could
      be, for example, the sound of a jungle.  Another example is the
      video of a conference room, while the main stream carries the
      video of the speaker.  This is similar to the 'live' role in
      H.239.

   Whilst the above values appear to be a simple way of conveying the
   content of a stream, the contributors believe that there are multiple
   issues that make the use of the existing "Content" tag insufficient
   for CLUE and multi-stream telepresence systems.  These issues are
   described in Section 3.  Section 4 proposes new capture description
   attributes.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119]
   and indicate requirement levels for compliant implementations.

3.  Issues with the Content attribute

3.1.  Ambiguous definition

   There is ambiguity in the definitions that may cause problems for
   interoperability.  A clear example is "slides", which could be any
   form of presentation media.  Another example is the difference
   between "main" and "alt".  In a telepresence scenario the room would
   be captured by the "main" cameras and a speaker would be captured by
   an alternative camera.  This runs counter to the definition of "alt".

   Another example is a university use case where:

      The main site is a university auditorium which is equipped with
      three cameras.  One camera is focused on the professor at the
      podium.  A second camera is mounted on the wall behind the
      professor and captures the class in its entirety.  The third
      camera is co-located with the second, and is designed to capture a
      close-up view of a questioner in the audience.  It automatically
      zooms in on that student using sound localization.

   For the first camera, it is not clear whether to use "main" or
   "speaker"; according to the definition and example of "speaker" in
   RFC 4796, "speaker" may be the more appropriate choice.  The third
   camera could fit the definition of "main", "alt" or "speaker".

3.2.  Multiple functions

   It appears that the definitions cover disparate functions.  "Main"
   and "alt" appear to describe the source from which media is sent.
   "Speaker" indicates a role associated with the media stream.
   "Slides" and "sl" indicate the actual content.  Indirectly, some
   prioritization is also applied to these parameters; for example, the
   IMTC document on best practices for H.239 indicates a display
   priority between "main" and "alt".

   This mixing of functions per code point can lead to ambiguous
   behavior and interoperability problems.  It is also an issue when
   extending the values.

3.3.  Limited Stream Support

   The values above appear to be defined based on the small number of
   video streams typically supported by legacy video conferencing, e.g.
   a main video stream (main), a secondary one (alt) and perhaps a
   presentation stream (slides).  It is not clear how this attribute
   scales when many media streams are present.  For example, if there
   are several main streams and several presentation streams, how would
   an endpoint distinguish between them?

3.4.  Insufficient information for individual parameters

   Related to the above point, some individual values do not provide
   sufficient information for an endpoint to make an educated decision
   about the content.  For example:

   Sign language (sl) - If a conference provides multiple streams, each
   containing a sign interpretation in a different sign language, how
   does an endpoint distinguish between the languages if "sl" is the
   only label?
   Also, for accessible services, other functions such as real-time
   captioning and video description (where an additional audio channel
   is used to describe the conference for vision-impaired people) should
   be supported.  Note: SDP provides a language attribute.

3.5.  Insufficient information for negotiation

   CLUE negotiation is likely to occur at the start of session
   initiation.  At this point only a very simple set of SDP (i.e.
   limited media descriptions) may be available (depending on the call
   flow).  In most cases the supported media captures may be agreed
   before the full SDP information for each media stream is available.
   The effect of this is that detailed information would not be
   available for the initial decision about which captures to choose.

   The obvious solution is to provide "enough" data in the CLUE provider
   messages so that a consumer can choose the appropriate media
   captures.  The current CLUE framework already partly addresses this
   through the "Content" attribute; however, based on the current
   "Content" values, the information appears insufficient to fully
   describe the content of the captures.

   The purpose of the CLUE work is to supply enough information for
   negotiating multiple streams.  The CLUE framework
   [I-D.ietf-clue-framework] addresses the spatial relation between the
   streams, but it does not appear to provide enough information about
   the semantic content of a stream to allow interoperability.  Some
   information is available in SDP and may be available before the CLUE
   exchange, but some information is still missing.

4.  Capture description attributes

   As indicated above, it is proposed to introduce one or more new
   attributes that allow the definition of various pieces of information
   providing metadata about a particular media capture.  Each piece of
   information should be described in such a way that it supplies only
   one atomic function.  It should also be applicable in a multi-stream
   environment, and extensible to allow new information elements to be
   introduced in the future.

   As an initial list, the following attributes are proposed for use as
   metadata associated with media captures.  Further attributes may be
   identified in the future.  This document proposes to remove the
   "Content" attribute.  Rather than describing the "source device" in
   this way, it may be better to describe its characteristics, i.e.:

   o  An attribute to indicate "Presentation" rather than the value
      "slides".

   o  An attribute to describe the "Role" of a capture rather than the
      value "speaker".

   o  An attribute to indicate the actual language used rather than the
      value "sl" (sign language).  This is also applicable to multiple
      audio streams.

   o  With respect to "main" and "alt": in a multiple-stream environment
      it is not clear these values are needed if the characteristics of
      the capture are described.  An assumption may be that a capture is
      "main" unless described otherwise.

   Note: CLUE may have missed a media type "text".  Consider real-time
   captioning or a real-time text conversation associated with a video
   meeting: it is a text-based service, not necessarily a presentation
   stream, and neither audio nor visual, but a valid component of a
   conference.

   The sections below contain an initial list of attributes.

4.1.  Presentation

   This attribute indicates that the capture originates from a
   presentation device, that is, one that provides supplementary
   information to a conference through slides, video, still images,
   data, etc.  Where more is known about the capture, the attribute may
   be expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image,
   etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.

4.2.  View

   The "area of capture" attribute provides a physical indication of the
   region that a media capture captures.  However, the consumer does not
   know what this physical region relates to.  In discussions on the
   IETF mailing list it is apparent that some people propose to use the
   "Description" attribute to describe a scene.  This is a free-text
   field and as such can be used to signal any piece of information,
   which leads to interoperability problems if the field is processed
   automatically.

   For interoperability purposes it is proposed to introduce a set of
   keywords that can be used as a basis for the selection of captures.
   It is envisaged that this list would be extensible to allow for
   future uses not covered by the initial specification.  Therefore it
   is proposed to introduce a number of keywords (that may be expanded)
   indicating what the spatial region relates to, e.g. room, table, etc.
   The following is an initial description of an attribute introducing
   these keywords.

   This attribute provides a textual description of the area that a
   media capture captures.  It provides supplementary information, in
   addition to the spatial information (i.e. area of capture), regarding
   the region that is captured.

   Room - Captures the entire scene.

   Table - Captures the conference table with seated participants.

   Individual - Captures an individual participant.

   Lectern - Captures the region of the lectern, including the
   presenter, in a classroom-style conference.

   Audience - Captures a region showing the audience in a classroom-
   style conference.

   Others - TBD.

4.3.  Language

   As indicated in the discussion in Section 3.4, captures may be
   offered in different languages in the case of multi-lingual and/or
   accessible conferences.  It is important to allow the remote end to
   distinguish between them.
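   As a sketch of why a machine-readable per-capture language tag
   enables this distinction (the capture structure and attribute
   placement here are hypothetical, not part of the CLUE data model), a
   consumer could filter advertised captures by BCP 47 subtags:

```python
# Hypothetical capture list: (capture_id, media type, language subtag).
# "ase" and "fsl" are the registered subtags for American Sign Language
# and French Sign Language respectively.
captures = [
    ("AC0", "audio", "en"),
    ("AC1", "audio", "fr"),
    ("VC2", "video", "ase"),   # sign-language interpretation
    ("VC3", "video", "fsl"),
]

def by_language(captures, subtag):
    """Return the IDs of captures offering the requested language."""
    return [cid for cid, _media, lang in captures if lang == subtag]

print(by_language(captures, "ase"))   # ['VC2']
```

   With only the bare "sl" label of [RFC4796], VC2 and VC3 above would
   be indistinguishable.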
   It is noted that SDP already contains a language attribute; however,
   this may not be available at the time an initial CLUE message is
   sent.  Therefore a language attribute is proposed for CLUE.

   This attribute indicates which language is associated with the
   capture.  For example, it may provide a language associated with an
   audio capture, or a language associated with a video capture when
   sign interpretation or text is used.  The possible values for a
   language tag are the values of the 'Subtag' column for the "Type:
   language" entries in the "Language Subtag Registry" defined in
   [RFC5646].

4.4.  Role

   The original definition of "Content" allows the indication that a
   particular media stream is related to the speaker.  CLUE should also
   allow this identification for captures.  In addition, with the advent
   of XCON there may be other formal roles that may be associated with
   media/captures.  For instance, a remote end may wish to always view
   the floor controller.  It is envisaged that a remote end may also
   choose captures depending on the role of the person(s) captured.  For
   example, the people at the remote end may wish to always view the
   chairman.

   This attribute indicates that the capture is associated with an
   entity that has a particular role in the conference.  The values are:

   Speaker - indicates that the capture relates to the current speaker.

   Floor - indicates that the capture relates to the current floor
   controller of the conference.

   Chairman - indicates that the capture relates to the chairman of the
   meeting.

   Others - ?

4.5.  Priority

   As has been highlighted in discussions on the CLUE mailing list,
   there appears to be some desire to provide a relative priority
   between captures when multiple alternatives are supplied.  This
   priority can be used to determine which captures contain the most
   important information (according to the provider).
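   As a sketch (the field names and the "lower number = higher priority"
   convention are assumptions for illustration, not defined by this
   memo), a consumer could rank advertised captures by such a value:

```python
# Hypothetical (capture_id, priority) pairs advertised by a provider.
advertised = [("VC1", 2), ("VC2", 1), ("VC3", 3), ("PC0", 1)]

def top_n(captures, n):
    """Pick the n highest-priority captures.  Equal numbers rank
    equally; Python's stable sort keeps their advertised order."""
    return [cid for cid, _p in sorted(captures, key=lambda c: c[1])[:n]]

print(top_n(advertised, 2))   # ['VC2', 'PC0']
```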
   This may be important in cases where the consumer has limited
   resources and can only render a subset of the captures.  Priority may
   also be advantageous in congestion scenarios, where media from one
   capture may be favoured over other captures in any control
   algorithms.

   This could be supplied via "ordering" in a CLUE data structure;
   however, this may be problematic if people assume some spatial
   meaning behind the ordering.  For example, given three captures VC1,
   VC2 and VC3, it would be natural to send VC1, VC2, VC3 if the images
   are composed this way; but if your boss sits in the middle view, the
   priority may be VC2, VC1, VC3.  Explicit signalling is better.
   Additionally, there are currently no hints about relative priority
   among captures from different capture scenes.  In order to prevent
   any misunderstanding with implicit ordering, a numeric value may be
   assigned to each capture.

   The "priority" attribute indicates a relative priority between
   captures.  For example, it is possible to assign a priority between
   two presentation captures that would allow a remote endpoint to
   determine which presentation is more important.  Priority is assigned
   at the individual capture level.  It represents the provider's view
   of the relative priority between captures that carry a priority.  The
   same priority number may be used across multiple captures, indicating
   that they are equally important.  If no priority is assigned, no
   assumptions regarding the relative priority of a capture can be made.

4.6.  Others

4.6.1.  Dynamic

   In the framework it has been assumed that the capture point is a
   fixed point within a telepresence session.  However, depending on the
   conference scenario, this may not be the case.  In tele-medical or
   tele-education cases a conference may include cameras that move
   during the conference.  For example, a camera may be placed at
   different positions in order to provide the best angle to capture a
   work task, or a camera may be worn by a participant.  This has the
   effect of changing the capture point, capture axis and area of
   capture.  In order that the remote endpoint can choose to lay out and
   render the capture appropriately, an indication of whether the camera
   is dynamic should be included in the initial capture description.

   This attribute indicates that the spatial information related to the
   capture may change during the conference.  Captures may be
   characterised as static, dynamic or highly dynamic.  The capture
   point of a static capture does not move for the life of the
   conference.  The capture point of a dynamic capture is characterised
   by a change in position followed by a reasonable period of stability.
   Highly dynamic captures are characterised by a capture point that is
   constantly moving.  This may assist an endpoint in determining the
   correct display layout.  If the "area of capture", "capture point"
   and "line of capture" attributes are included with dynamic or highly
   dynamic captures, they indicate the spatial information at the time
   the CLUE message is sent.  No information regarding future spatial
   information should be assumed.

4.6.2.  Embedded Text

   In accessible conferences, textual information may be added to a
   capture before it is transmitted to the remote end.  Where multiple
   video captures are presented, the remote end may benefit from the
   ability to choose a video stream containing text over one that does
   not.

   This attribute indicates that a capture provides embedded textual
   information.  For example, the video capture may contain speech-to-
   text information composed with the video image.  This attribute is
   only applicable to video captures and presentation streams with
   visual information.

4.6.3.  Supplementary Description

   Some conferences utilise translators or facilitators that provide an
   additional audio stream (i.e. a translation or description of the
   conference).  These persons may not be pictured in a video capture.
   Where multiple audio captures are presented, it may be advantageous
   for an endpoint to select a supplementary stream instead of, or in
   addition to, an audio feed associated with the participants from a
   main video capture.  Therefore an attribute is proposed for this.
   Depending on the results of the discussion of the source device, this
   parameter may be another value for the source.

   This attribute indicates that a capture provides an additional
   description of the conference, for example an additional audio stream
   carrying a commentary of the conference that provides supplementary
   information (e.g. a translation) or extra information to participants
   in accessible conferences.

4.6.4.  Telepresence

   In certain use cases it is important to maintain a feeling of
   "telepresence" associated with captures when they are played at the
   remote end.  For example, in medical use cases it is important to
   maintain the colour of images.  It is important to note that CLUE is
   used to describe multi-stream conferences; these may or may not be
   "telepresence" conferences.  Alternatively, it could be assumed that
   all captures possess this attribute and that the only captures not
   subject to processing to create "telepresence" are those marked with
   "presentation".  We did discuss how an endpoint determines whether a
   capture relates to a computer-generated image or a real environment.
   An endpoint may apply different image processing depending on the
   source, i.e. it may or may not apply image processing to adjust
   lighting levels for a telepresence experience.

   This parameter indicates that "telepresence" should be associated
   with the capture, i.e. that real-world environmental conditions are
   associated with it.  Lighting, spatial and timing information are
   important aspects of the telepresence session.  The remote end should
   apply the appropriate capture processing to maintain the integrity of
   this information.  For example, the colour-related information
   associated with the original capture is important and should be
   replicated when the capture is displayed or played.

5.  Summary

   The main proposal is to remove the "Content" attribute in favour of
   describing the characteristics of captures in a more functional
   (atomic) way, using the attributes above as the metadata describing a
   capture.

6.  Acknowledgements

   Placeholder.

7.  IANA Considerations

   TBD.

8.  Security Considerations

   TBD.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [I-D.ietf-clue-framework]
              Romanow, A., Duckworth, M., Pepperell, A., and B. Baldino,
              "Framework for Telepresence Multi-Streams",
              draft-ietf-clue-framework-06 (work in progress),
              July 2012.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

Authors' Addresses

   Christian Groves
   Huawei
   Australia

   Email: Christian.Groves@nteczone.com

   Weiwei Yang
   Huawei
   P.R. China

   Email: tommy@huawei.com

   Roni Even
   Huawei
   Tel Aviv, Israel

   Email: roni.even@mail01.huawei.com