HTTPbis                                                        K. Morgan
Internet-Draft                                              C. Brunhuber
Intended status: Standards Track                                    IAEA
Expires: December 4, 2014                                   June 2, 2014


                EZFLATE: Token-based DEFLATE Compression
                      draft-morgan-ezflate-00

Abstract

   This specification defines EZFLATE, a token-based DEFLATE compression
   algorithm for secure compression within encrypted communication
   channels.

Editorial Note (To be removed by RFC Editor)

   Discussion of this draft takes place on the HTTPBIS working group
   mailing list (ietf-http-wg@w3.org), which is archived at
   <http://lists.w3.org/Archives/Public/ietf-http-wg/>.

   Working Group information can be found at
   <http://tools.ietf.org/wg/httpbis/>;

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 4, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents


Morgan & Brunhuber      Expires December 4, 2014                [Page 1]

Internet-Draft                   EZFLATE                       June 2014


   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . 3
   2.  Compression Attacks . . . . . . . . . . . . . . . . . . . . . . 3
     2.1.  CRIME . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
     2.2.  BREACH  . . . . . . . . . . . . . . . . . . . . . . . . . . 4
   3.  Tokenization  . . . . . . . . . . . . . . . . . . . . . . . . . 4
     3.1.  "Take a Bite out of CRIME"  . . . . . . . . . . . . . . . . 4
     3.2.  Delimiters & Tokenization Methods . . . . . . . . . . . . . 5
       3.2.1.  Method 1: Keep Delimiters as Separate Tokens  . . . . . 5
       3.2.2.  Method 2: Keep Delimiters with Preceding Token  . . . . 5
       3.2.3.  Method 3: Keep Delimiters with Succeeding Token . . . . 6
     3.3.  Tokenization Rules  . . . . . . . . . . . . . . . . . . . . 6
   4.  EZFLATE Compression Algorithm Details . . . . . . . . . . . . . 6
   5.  Security Considerations . . . . . . . . . . . . . . . . . . . . 8
   6.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . 9
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . . . 9
     7.1.  Normative References  . . . . . . . . . . . . . . . . . . . 9
     7.2.  Informative References  . . . . . . . . . . . . . . . . . . 9


Morgan & Brunhuber      Expires December 4, 2014                [Page 2]

Internet-Draft                   EZFLATE                       June 2014


1.  Introduction

   DEFLATE [DEFLATE] is one of the most widely used compression
   mechanisms.  It forms the basis of ZIP [ZIP], GZIP [GZIP] and ZLIB
   [ZLIB].  However, DEFLATE poses a security vulnerability when
   messages comprised of secret and attacker-controlled data are
   delivered through a secure communication channel, as demonstrated by
   the CRIME [CRIME] and BREACH [BREACH] attacks.

   A key observation of the CRIME and BREACH attacks is that they
   exploited the DEFLATE _algorithm_ as described in RFC 1951 [DEFLATE],
   not the DEFLATE _format_ (also described in RFC 1951).

   The first paragraph of Section 4, "Compression algorithm details", in
   RFC 1951 explicitly states that the "deflate" format is not tied to
   any particular algorithm.  To quote, it says, "While it is the intent
   of this document to define the 'deflate' compressed data format
   without reference to any particular compression algorithm, the format
   is related to the compressed formats produced by LZ77 ... it is
   strongly recommended that the implementor of a compressor follow the
   general algorithm presented here ... [however] a compressor need not
   follow it in order to be compliant [DEFLATE]."

   This document describes EZFLATE, a token-based DEFLATE _algorithm_.
   The key feature of the EZFLATE algorithm (and primary difference to
   the algorithm described in RFC 1951), is that LZ77 <length, backward
   distance> tuples may only reference a duplicated token (octet string)
   occurring in a previous block, up to 32K input bytes before, if the
   current token exactly matches in length _and_ octet values.  In other
   words, LZ77 tuples must not reference partial matches, neither
   smaller nor longer, occurring in a previous block.

2.  Compression Attacks

2.1.  CRIME

   SPDY [SPDY] used DEFLATE [DEFLATE] to compress redundant HTTP
   [HTTP-p1] headers.  The CRIME [CRIME] attack exploited DEFLATE to
   guess secret information contained in the HTTP request headers (e.g.
   "Cookies").  Since DEFLATE looks for matches character-by-character,
   the attackers were able to guess the secret one character at a time.
   Each failed guess resulted in a compressed block of size n.  Correct
   guesses resulted in a compressed block of size m < n.  For example,
   consider the following sample HTTP requests (taken from [CRIME]):


Morgan & Brunhuber      Expires December 4, 2014                [Page 3]

Internet-Draft                   EZFLATE                       June 2014


   Attacker guesses the first character is 'a':

   GET /twid=a
   Host: twitter.com
   User-Agent: Chrome
   Cookie: twid=secret

   ... (more guesses) ...

   Attacker guesses the first character is 's':

   GET /twid=s
   Host: twitter.com
   User-Agent: Chrome
   Cookie: twid=secret

   After guessing 's' is the first character of the secret, the attacker
   sees a corresponding increase in compression (i.e. decrease in
   length) indicating the guess is correct.  The procedure is repeated
   until the entire secret is revealed.

   Although this is a brute-force attack of sorts, the average number of
   guesses required to extract a secret is considerably lower, and
   therefore tractable, thanks to being able to guess character by
   character.  If the number of possible characters is 256, then on
   average it will require 128 guesses per character to make a correct
   guess.  An n-character secret then only requires, on average, n * 128
   guesses to discover the secret.  For example, a 12-character secret
   requires 12 * 128 = 1,536 guesses.  In many cases the set of possible
   characters is smaller than 256, reducing the search space even more!

2.2.  BREACH

   While CRIME exploited SPDY (and TLS compression too), BREACH [BREACH]
   exploited compression of HTTP _response bodies_.  The mechanism of
   BREACH is similar to CRIME with slightly different details, therefore
   a detailed review of BREACH will not be given here.

3.  Tokenization

3.1.  "Take a Bite out of CRIME"

   Taking a "bite out of crime" requires more than a security dog named
   McGruff [MCGRUFF].  EZFLATE fights CRIME by using token matching,
   rather than character-by-character matching, to make the search space
   for an attacker equivalent to a full brute-force attack.  At that
   point there are easier ways to execute the brute-force attack.


Morgan & Brunhuber      Expires December 4, 2014                [Page 4]

Internet-Draft                   EZFLATE                       June 2014


   As stated in Section 1, the key feature of the the EZFLATE algorithm
   (and primary difference to the algorithm described in RFC 1951), is
   that LZ77 <length, backward distance> tuples may only reference a
   duplicated token (octet string) if the current token and referenced
   token exactly match in length _and_ octet values.  By selecting
   appropriate tokenization rules, potential attackers are forced to
   guess n-character secrets n characters at a time rather than one
   character at a time.  Taking the example from Section 2.1, a 12-
   character secret has a search space of 2^12 * 8 = 2^96 =
   79,228,162,514,264,337,593,543,950,336 possible secrets (of course,
   on average it would "only" take an attacker half that many guesses to
   discover the secret by brute force).  Even a Base64-encoded 12-
   character secret has a search space of 64^12 =
   4,722,366,482,869,645,213,696 possible secrets.

3.2.  Delimiters & Tokenization Methods

   The classic tokenization method splits an arbitrary-length octet
   string by a subset of octets that act as "delimiters".  The result is
   typically a set of octet strings with the delimiter octets omitted.

   In the likely case where full reconstruction of the original octet
   string is desired, the delimiters obviously cannot be omitted.  There
   are three different choices, outlined below, for keeping the
   delimiters.  Each has their advantages and disadvantages.  The best
   method for a particular data type can be discovered by experimenting.
   Note also that a sequence of consecutive delimiters are treated as a
   single unit.

3.2.1.  Method 1: Keep Delimiters as Separate Tokens

   The first tokenization method simply keeps the delimiter(s) as
   separate tokens.  For example, consider the following string of ASCII
   text to be tokenized by white space and puncuation.

   ASCII string to be tokenized:

                   "The quick; brown fox."

   Resulting token set after Method 1:

                   { "The", " ", "quick", "; ", "brown", " ", "fox", "." }

3.2.2.  Method 2: Keep Delimiters with Preceding Token

   The second tokenization method keeps the delimiter(s) with the token
   which immediately precedes the delimiter(s).


Morgan & Brunhuber      Expires December 4, 2014                [Page 5]

Internet-Draft                   EZFLATE                       June 2014


   Resulting token set after Method 2:

                   { "The ", "quick; ", "brown ", "fox." }

3.2.3.  Method 3: Keep Delimiters with Succeeding Token

   The third tokenization method keeps the delimiter(s) with the token
   which immediately succeeds the delimiter(s).

   Resulting token set after Method 2:

                   { "The", " quick", "; brown", " fox", "." }

3.3.  Tokenization Rules

   Note that EZFLATE is agnostic to tokenization methods and delimiter
   sets.  Appropriate tokenization rules and delimiter sets depend on
   the type of data to be compressed.  For a particular data type, a
   global set of generally applicable tokenization rules ought to be
   selected if at all possible.  This makes it easier for decompressors
   to verify, if desired, that the compressed data conforms to the rules
   and delimiter sets.  However, specialized rules might be necessary
   for special cases, particularly cases which might negatively affect
   security.  The selection of appropriate rules and delimiter sets for
   a particular data type is beyond the scope of this document.

4.  EZFLATE Compression Algorithm Details

   The EZFLATE algorithm, described in this Section, is simply an
   alternate to the algorithm described in Section 4 of [DEFLATE], it is
   not intended to replace that algorithm, which provides better
   general-purpose compression.

   The compressor MUST always operates on a single token (octet string).
   The compressor uses a chained hash table to find duplicated tokens.
   Any hash function may be used and the hash function may operate on
   the entire octet string or any subset.  The compressor also uses a
   separate "length table" to track the token lengths for each position
   in the input window.  At any given point during compression, let T be
   the next token at position P with length L (not necessarily all
   different octets, of course).  First, the compressor computes the
   hash value of the token T and updates the hash table with the hash
   value of T and updates the "length table" with L at position P. Next,
   the compressor examines the hash chain for the token T. If the chain
   is empty, the compressor simply emits the L literal octets of T. If
   the hash chain is not empty, this indicates that the token T (or, if
   there was a hash collision, some other octet string with the same
   hash value) has occurred recently.  The compressor follows the hash


Morgan & Brunhuber      Expires December 4, 2014                [Page 6]

Internet-Draft                   EZFLATE                       June 2014


   chain to find the first entry within the current window which has
   length L (stored in the length table) and is octet-for-octet
   equivalent to the L octets in T. If an exact match is found, the
   compressor may emit a LZ77 <length, backward distance> tuple, where
   the length is L and the backward distance is the position of the
   current token minus the position of the exactly matching token.  If
   no exact match is found, the compressor simply emits the L literal
   octets of T.

   Just as stated in Section 4 of [DEFLATE], "the compressor searches
   the hash chains starting with the most recent [octet] strings, to
   favor small distances and thus take advantage of the Huffman encoding
   [HUFF].  The hash chains are singly linked.  There are no deletions
   from the hash chains; the algorithm simply discards matches that are
   too old.  To avoid a worst-case situation, very long hash chains are
   arbitrarily truncated at a certain length, determined by a run-time
   parameter."

   To improve overall compression, the compressor may optionally defer
   emitting a match in order to find a longer match that is a
   consecutive sequence of matching tokens.  After a set of one or more
   matches with positions Pm ...  Pn have been found for the current
   token T1 at position P1 with length L1, the compressor may defer
   emitting the match and examine the next token T2 to determine if that
   token appears at any of the positions Pm + L1 ...  Pn + L1, i.e.
   directly after any of the matches for T1.  It is critical to note
   that the same rules apply for checking matches of T2 at positions Pm
   + L1 ...  Pn + L1 as if the compressor were finding matches for T2
   using the hash chain.  In particular the length of a given token at
   position Pi + L1 must be equal to L2 _and_ be octet-for-octet
   equivalent to the L2 octets in T2.  The compressor reduces the set of
   matching positions in Pm ...  Pn to only include those which had an
   exact match for T2 at a given position Pi + L1.  If one or more
   matches remains, the process is repeated until either the set of
   deferred matches cannot be extended by the next token or the length
   of the next token is such that the overall length of the match would
   exceed the maximum length of a match (258 octets) allowed by the
   DEFLATE format (see Section 3.2.5 of [DEFLATE]).  Note that the
   compressor must keep enough information about the currently deferred
   match(es), that once the final match is found the total length and
   backward distance are known or can be computed.  Once the set of
   matches cannot be extended any longer, the compressor returns to the
   normal process of searching for matches using the hash table.

   Just as stated in Section 4 of [DEFLATE], "the compressor terminates
   a block when it determines that starting a new block with fresh trees
   would be useful, or when the block size fills up the compressor's
   block buffer."


Morgan & Brunhuber      Expires December 4, 2014                [Page 7]

Internet-Draft                   EZFLATE                       June 2014


   A token T with length L less than the minimum length of a match (3
   octets) allowed by the DEFLATE format (see Section 3.2.5 of
   [DEFLATE]) is processed by simply updating the "length table" and
   emiting the L literal octets.  These short tokens should, however,
   participate in extending deferred matches.  The same matching rules
   apply as for matching tokens in all other situations, namely the
   length of the potentially matching token stored in the "length table"
   must be equal to L and the L octets of the stored token must be
   octet-for-octet equivalent to the L octets in T.

   Note that the formatting rules and static/dynamic Huffman encoding
   rules in RFC 1951 [DEFLATE], apply to any implementation of the
   EZFLATE algorithm described here.

5.  Security Considerations

   Care must be taken when implementing deferred matching such that
   checking for matches of the current token against the octets
   immediately following the deferred set of matches not only checks for
   octet equivalency, but length equivalency as well.  If the length is
   not also checked, an attacker would be able to perform character-by-
   character guessing by injecting the token immediately preceding the
   secret, which in many cases may be a known identifier.

   [TODO: A lot of the remaining text is completely lifted from the
   HPACK Internet Draft; re-write or give credit somehow.]

   An EZFLATE compressor can still act as an oracle to an attacker
   probing the compression context.  However, the attacker can only find
   out if the guess of the entire token is correct or not.

   Still, an attacker could take advantage of this limited information
   for breaking low-entropy secrets using a brute-force attack.  A
   server usually has some protections against such brute-force attack.
   Here, the attack would target the client, where it would be harder to
   detect.  The attack would be even more dangerous if the attacker is
   able to prevent the traffic generated by its brute-force attack from
   reaching the server.

   To offer protection against such type of attacks, users of an EZFLATE
   compressor SHOULD use non-compressed DEFLATE blocks (see Section
   3.2.4 of [DEFLATE]) disable compression of any low-entropy secrets
   which could be put at risk by a brute-force attack.

   There is currently no known threat taking advantage of the use of a
   fixed Huffman encoding.  A study has shown that using a fixed Huffman
   encoding table created an information leakage, however this same
   study concluded that an attacker could not take advantage of this


Morgan & Brunhuber      Expires December 4, 2014                [Page 8]

Internet-Draft                   EZFLATE                       June 2014


   information leakage to recover any meaningful amount of information
   (see [PETAL]).

6.  Acknowledgements

   Roberto Peon and Herve Ruellan.

7.  References

7.1.  Normative References

   [DEFLATE]  Deutsch, P., "DEFLATE Compressed Data Format Specification
              version 1.3", RFC 1951, May 1996.

7.2.  Informative References

   [BREACH]   Gluck, Y., "BREACH: REVIVING THE CRIME ATTACK", July 2013,
              <http://breachattack.com/resources/
              BREACH%20-%20SSL,%20gone%20in%2030%20seconds.pdf>.

   [CRIME]    Rizzo, J. and T. Duong, "The CRIME Attack",
              September 2012, <https://docs.google.com/a/twist.com/
              presentation/d/
              11eBmGiHbYcHR9gL5nDyZChu_-lCa2GizeuOfaLU2HOU/
              edit#slide=id.g1eb6c1b5_3_6>.

   [GZIP]     Deutsch, P., "GZIP file format specification version 4.3",
              RFC 1951, May 1996.

   [HTTP-p1]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
              Protocol (HTTP/1.1): Message Syntax and Routing",
              draft-ietf-httpbis-p1-messaging-26 (work in progress),
              February 2014.

   [HUFF]     Huffman, D., "A Method for the Construction of Minimum
              Redundancy Codes", Proceedings of the Institute of Radio
              Engineers Volume 40, Number 9, pp. 1098-1101,
              September 1952, <http://ieeexplore.ieee.org/xpl/
              articleDetails.jsp?arnumber=4051119>.

   [MCGRUFF]  McGruff, M., "About McGruff", 2014,
              <http://www.ncpc.org/about/about-mcgruff>.

   [PETAL]    Tan, J. and J. Nahata, "PETAL: Preset Encoding Table
              Information Leakage", April 2013, <http://www.pdl.cmu.edu/
              PDL-FTP/associated/CMU-PDL-13-106.pdf>.

   [SPDY]     Belshe, M. and R. Peon, "SPDY Protocol",


Morgan & Brunhuber      Expires December 4, 2014                [Page 9]

Internet-Draft                   EZFLATE                       June 2014


              draft-mbelshe-httpbis-spdy-00 (work in progress),
              February 2012.

   [ZIP]      Katz, P., ".ZIP File Format Specification",
              September 2012,
              <http://www.pkware.com/documents/casestudies/APPNOTE.TXT>.

   [ZLIB]     Deutsch, P., "ZLIB Compressed Data Format Specification
              version 3.3", RFC 1950, May 1996.

Authors' Addresses

   Keith Shearl Morgan
   International Atomic Energy Agency

   EMail: k.morgan@iaea.org


   Christoph Brunhuber
   International Atomic Energy Agency

   EMail: c.brunhuber@iaea.org


Morgan & Brunhuber      Expires December 4, 2014               [Page 10]