API development: ETags and Conditional GET

My curiosity about ETags (Entity Tags) was first piqued when reading over GitHub's API documentation. Looking into them, I saw that explanations often described them as a "caching" mechanism.

In fact, they are, or at least can be. There are two uses for ETags:

Conditional GET (the so-called "caching mechanism")
Concurrency Control

I will cover Conditional GETs here.

Conditional GETs allow a client to ask a server whether a resource has changed. If it has not changed, the client can assume its stored copy is up to date. If it has changed, the server will send the current resource back to the client.

An Example

Let's say you have a resource. This resource is a To Do item in your To Do list. It's reachable at example.com/api/todo/4.

A client request might look like:

$ curl -i example.com/api/todo/4

The server's response:

HTTP/1.1 200 OK
Date: Sat, 09 Feb 2013 16:09:50 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Sat, 02 Feb 2013 12:02:47 GMT
ETag: "c0947-b1-4d0258df1f625"
Content-Type: application/json

{
	"id": 4,
	"item": "take out the trash",
	"created": "Sat, 02 Feb 2013 08:29:53 GMT",
	"updated": "Sat, 02 Feb 2013 12:02:47 GMT"
}

Pretty standard. We ask for a resource and get it returned to us. Notice, however, that the result has an ETag. How can a client use that information?

Well, if our client stored this To Do response, it can later use the ETag to see if that To Do has changed.

To accomplish this, the client sends the "If-None-Match" header with the ETag as its value. It's asking the server: "Return the resource if its ETag does not match mine." Because an ETag changes when a resource changes, this effectively asks the server to return the resource only if it has changed.

Note that "If-Match" and "If-Range" headers are also defined in the HTTP/1.1 spec.

Let's see an example of that.

Here's what the client requests:

$ curl -i -H 'If-None-Match: "c0947-b1-4d0258df1f625"' example.com/api/todo/4

If we suppose that the To Do item has NOT changed, we may see this result:

HTTP/1.1 304 Not Modified
Date: Sat, 09 Feb 2013 16:09:50 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Sat, 02 Feb 2013 12:02:47 GMT
ETag: "c0947-b1-4d0258df1f625"

If, however, the To Do item HAS changed, we may see this result:

HTTP/1.1 200 OK
Date: Sat, 09 Feb 2013 16:29:24 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Sat, 02 Feb 2013 14:33:21 GMT
ETag: "c7493-d7-a6b64d37f6cc3"	# New ETag!
Content-Type: application/json

{
	"id": 4,
	"item": "Take out the trash, TODAY!",
	"created": "Sat, 02 Feb 2013 08:29:53 GMT",
	"updated": "Sat, 02 Feb 2013 14:33:21 GMT"
}
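From the client's side, the two outcomes above can be handled roughly as follows. This is only a sketch: the function names and the shape of the `$cache` array are my own invention for illustration, and no real HTTP request is made. It assumes the ETag is stored exactly as the server sent it, quotes included.

```php
<?php
// Hypothetical client-side store: URL => ['etag' => ..., 'body' => ...].

// Build the extra request headers for a URL. If-None-Match is only sent
// when a copy with an ETag is already stored.
function conditionalHeaders(array $cache, string $url): array
{
    if (!isset($cache[$url])) {
        return [];
    }
    return ['If-None-Match: ' . $cache[$url]['etag']];
}

// Apply the server's answer: reuse the stored body on a 304, and replace
// the stored copy (with its new ETag) on a 200.
function applyResponse(array &$cache, string $url, int $status, ?string $etag, string $body): string
{
    if ($status === 304 && isset($cache[$url])) {
        return $cache[$url]['body'];
    }
    if ($status === 200 && $etag !== null) {
        $cache[$url] = ['etag' => $etag, 'body' => $body];
    }
    return $body;
}
```

The array returned by `conditionalHeaders` is in the "Name: value" form that most PHP HTTP clients accept for extra headers.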

If the ETag sent by the client matches the ETag the server holds for that resource, the server knows the resource hasn't changed since the client last fetched it. The server can then return a 304 Not Modified response, rather than the resource itself.

This process is called a Conditional GET.
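On the server side, the decision between 304 and 200 might be sketched like this. The function name `conditionalGet` and the array it returns are hypothetical, not part of any framework. It assumes the current ETag is passed in unquoted, and it splits the header on commas, which is a simplification that would break on an ETag containing a comma.

```php
<?php
// Hypothetical helper: decide how to answer a GET, given the client's
// If-None-Match header (or null) and the resource's current ETag value
// without its surrounding quotes.
function conditionalGet(?string $ifNoneMatch, string $currentEtag, string $body): array
{
    $quoted = '"' . $currentEtag . '"';
    $headers = ['ETag' => $quoted];

    if ($ifNoneMatch !== null) {
        // The header may carry several comma-separated tags, or "*".
        foreach (array_map('trim', explode(',', $ifNoneMatch)) as $candidate) {
            // If-None-Match uses weak comparison, so drop any W/ prefix.
            if (strpos($candidate, 'W/') === 0) {
                $candidate = substr($candidate, 2);
            }
            if ($candidate === '*' || $candidate === $quoted) {
                // Match: the client's copy is current, send no body.
                return ['status' => 304, 'headers' => $headers, 'body' => ''];
            }
        }
    }

    // No header or no match: send the full resource.
    $headers['Content-Type'] = 'application/json';
    return ['status' => 200, 'headers' => $headers, 'body' => $body];
}
```

Note that the ETag header is included in both responses, so a client always learns the current value.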

Implementation notes

The devil is in the details with implementing ETags.

Generating an ETag

The generation of ETags is not specified in any HTTP spec. However, it is common to use a hashing mechanism, such as MD5 or a SHA-* variant.

An ETag's job, in the context of a Conditional GET, is to differentiate one version of a resource from another. If a resource is updated (changed), so too must its ETag.

Therefore, generating an ETag needs to be tied to the value of the resource and/or related metadata. An example might be to generate an MD5 hash based on ID, content and update timestamp.

// Find a To Do item
$todo = Todo::find($id);

// Generate its ETag
$etag = md5( $todo->id . $todo->description . $todo->updated_at );
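For a version that runs without a Todo model, here is a minimal self-contained sketch. The function name `todoEtag` and the array keys are assumptions that mirror the JSON fields shown earlier. It serializes the fields with json_encode before hashing, so that adjacent values cannot run together the way they can with plain concatenation, and it wraps the hash in the double quotes that an ETag header value requires.

```php
<?php
// Hypothetical helper: derive a quoted ETag from the fields that define
// the current version of a To Do item.
function todoEtag(array $todo): string
{
    // json_encode keeps the field boundaries, so id 1 with item "23"
    // and id 12 with item "3" produce different fingerprints.
    $fingerprint = json_encode([$todo['id'], $todo['item'], $todo['updated']]);

    // ETag header values are quoted strings.
    return '"' . md5($fingerprint) . '"';
}
```

Calling this with the same fields always yields the same ETag, while changing the item text or the update timestamp yields a different one.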

Performance Gains

On the face of it, ETags have two advantages:

Saving bandwidth by returning a 304 response with no body content
Saving query/server computation time

Bandwidth

This is straightforward. Bandwidth savings are real in this implementation, as the server can send back a bodiless 304 response.

Query/Computation [server load]

This is where it gets interesting. On each request, the server needs to have an ETag to check against.

In order to know if your content has changed, the server needs to generate an ETag for it. This is what I did in the example above.

However, you may note that this (potentially) requires querying data on each request, whether the server sends back a response body or not.

In that respect, our savings seem to be only in bandwidth, and not necessarily in query or computation (server load). Based on this assumption, ETags for Conditional GETs become a feature of scale, where cost savings would be most realized.

Surely, however, ETags can provide benefit for query/computation time? Indeed.

One strategy is to generate and store an ETag anytime a resource is updated or created (on any PUT or POST request).

This results in an ETag always being available to check against. The query on each request then becomes one of looking up and matching the ETag, rather than generating an ETag on the fly. While there is still some querying to do, this is generally less than what it would take to generate an ETag on each request, especially if an in-memory cache is available for use.
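A rough sketch of that strategy follows. The `EtagStore` class is hypothetical, and its private array is only a stand-in for whatever in-memory cache or database column you would actually use. It also assumes the client sends back a single ETag exactly as it received it.

```php
<?php
// Hypothetical store mapping a resource key to its current ETag.
class EtagStore
{
    private $etags = [];

    // Call whenever a resource is created or updated (POST or PUT):
    // the ETag is computed once here instead of on every GET.
    public function refresh(string $key, string $content, string $updatedAt): string
    {
        $etag = '"' . md5(json_encode([$key, $content, $updatedAt])) . '"';
        $this->etags[$key] = $etag;
        return $etag;
    }

    // Call on GET: a cheap lookup that does not load the resource.
    // Returns true when the client's copy is still current (send a 304).
    public function isCurrent(string $key, ?string $ifNoneMatch): bool
    {
        return $ifNoneMatch !== null
            && isset($this->etags[$key])
            && $this->etags[$key] === $ifNoneMatch;
    }
}
```

With this in place, a GET handler would call `isCurrent` first and only query for the full To Do item when it returns false.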

Overview

We've covered ETags for Conditional GETs. ETags:

allow a client to check if a resource has changed on a server
must be generated and updated as resources are created or updated
potentially save bandwidth and server load

However, there is a second, more functional use for ETags. This is Concurrency Control and will be covered soon.