Discussions about designing gestural applications involve many new terms and concepts. I have seen some of these concepts compared with each other when I felt we were really talking about apples and oranges. In an effort to organize some of the ideas present in gestural applications, I came up with this Gestural Application Model. (It is similar in form to the TCP/IP layer model.)
Layer 0: Device
This layer contains raw sensor data, perhaps processed by device drivers or a very low-level API. For example, a stream of X,Y coordinate pairs from a capacitive touch screen would fit into this layer. In the case of Microsoft Surface, the stream consisting of finger, blob, tag, and raw visual data also goes here. Everything on the upper layers depends upon the type of data available from the device.
Layer 1: Event
Raw sensor data streams are grouped into events that describe the type and value of the raw data, basic state transitions, and any other relevant information for the sensor type. State transitions include things like Contact Down, Contact Moved, and Contact Up. Internally, this requires resolving the data stream into persistent objects. For example, is this X,Y coordinate a new touch, or is it the same touch from the previous time step, only moved? In the case of the Surface, additional data might include Position, Size, Orientation, and Object ID (for byte tags).
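To make the "new touch or same touch moved?" question concrete, here is a minimal Python sketch of an event layer. It is only an illustration under stated assumptions: the names (Event, ContactTracker, max_jump) are hypothetical, the device is assumed to deliver one list of X,Y points per time step, and contacts are matched to the previous frame by a simple greedy nearest-neighbour rule. Real hardware and SDKs may do this quite differently.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Event:
    kind: str        # "down", "moved", or "up"
    contact_id: int  # persistent identity assigned by the tracker
    x: float
    y: float
    time: float


class ContactTracker:
    """Turns per-frame raw points (Layer 0) into contact events (Layer 1)."""

    def __init__(self, max_jump=30.0):
        # Assumed matching radius in device units: a point farther than this
        # from every known contact is treated as a new touch.
        self.max_jump = max_jump
        self.contacts = {}   # contact_id -> (x, y) from the previous frame
        self.next_id = 0

    def update(self, points, time):
        events = []
        unmatched = dict(self.contacts)  # previous contacts not yet claimed
        current = {}
        for x, y in points:
            # Closest unclaimed contact from the previous frame, if any.
            best = min(
                unmatched,
                key=lambda c: hypot(unmatched[c][0] - x, unmatched[c][1] - y),
                default=None,
            )
            if best is not None and hypot(
                unmatched[best][0] - x, unmatched[best][1] - y
            ) <= self.max_jump:
                old = unmatched.pop(best)
                if old != (x, y):
                    events.append(Event("moved", best, x, y, time))
                current[best] = (x, y)
            else:
                cid = self.next_id
                self.next_id += 1
                events.append(Event("down", cid, x, y, time))
                current[cid] = (x, y)
        # Contacts that vanished this frame are reported at their last position.
        for cid, (x, y) in unmatched.items():
            events.append(Event("up", cid, x, y, time))
        self.contacts = current
        return events
```

Feeding it three frames, one point, the same point shifted slightly, then nothing, yields a Contact Down, a Contact Moved, and a Contact Up that all carry the same contact_id.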
Layer 2: Gesture
The application collects all the events within a time frame, organizes them, and interprets them as gestures. A gesture can be composed of many events.
Touching the Surface by itself is not a gesture, but if you touch and release within a certain time frame, it becomes a tap gesture. Alternatively, if you touch and hold for longer, or move your finger, and then release, it could become a hold gesture or a move gesture. In every case the gesture is built from multiple events.
The same set of events could be interpreted as different gestures, depending upon what the application is expecting or cares about. That move gesture could instead be a hold gesture if the application doesn't care how far the user moves the finger.
One key difference between an event and a gesture is that an event is instantaneous, but a gesture has a beginning, middle, and end. Gestures can be in progress or completed.
Event: This [sensor data] did [state transition] at [time]
Gesture: [Gesture Type] is happening or happened.
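As a rough illustration of the gesture layer, the Python sketch below interprets the events of a single contact as a tap, hold, or move. Everything in it is an assumption made for the example: the Event fields, the classify function, and the tap_time and move_tolerance thresholds are hypothetical values, not taken from any real SDK.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Event:
    kind: str    # "down", "moved", or "up"
    x: float
    y: float
    time: float  # seconds


def classify(events, tap_time=0.3, move_tolerance=10.0):
    """Interpret one contact's events as (gesture type, status), or None.

    tap_time and move_tolerance are assumed thresholds; the application
    picks them according to what it cares about.
    """
    if not events or events[0].kind != "down":
        return None
    down, last = events[0], events[-1]
    status = "completed" if last.kind == "up" else "in progress"
    # Farthest distance the contact ever got from where it went down.
    travelled = max(hypot(e.x - down.x, e.y - down.y) for e in events)
    if travelled > move_tolerance:
        return ("move", status)
    if last.time - down.time <= tap_time:
        # A short touch is only a tap once it has been released;
        # until then it could still turn into a hold or a move.
        return ("tap", status) if status == "completed" else None
    return ("hold", status)
```

Passing the same events with a larger move_tolerance turns a move into a hold, which is the point made above: the interpretation depends on what the application cares about. The sketch only sees time advance when a new event arrives, so a real implementation would probably also need a clock or timer to notice a stationary hold while it is still in progress.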
Layer 3: Intent
Here gestures are married with context to determine what the user intended to do. Once the application interprets the user's intent, it can take action.
Intent depends highly upon context. The context includes where the gesture was done (relative to visual interface elements) as well as application modes. Compare a tap gesture in the middle of nowhere with a tap gesture over a button interface element. The two gestures might be identical, but without the context it is hard to figure out what the user wants. Similarly, the user might drag a finger over an image but want different things depending upon whether the application is in a pen/drawing mode or a panning/moving mode.
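The two examples above, a tap over a button versus a tap over nothing and a drag in pen mode versus panning mode, can be sketched in a few lines of Python. This is only a hedged illustration: Button, resolve_intent, the mode strings, and the returned intent labels are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Button:
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px, py):
        """True if the point lies inside the button's rectangle."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def resolve_intent(gesture, px, py, buttons, mode):
    """Combine a gesture with context (location and mode) to guess intent."""
    if gesture == "tap":
        for button in buttons:
            if button.contains(px, py):
                return f"press:{button.name}"
        return "none"  # a tap in the middle of nowhere
    if gesture == "move":
        # The same drag means different things in different modes.
        return "draw" if mode == "pen" else "pan"
    return "none"
```

With a single "ok" button at (100, 100) sized 80 by 40, a tap at (120, 110) resolves to "press:ok", while the identical tap at (10, 10) resolves to "none".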
Part of the application designer's job is also to identify situations where a user might use a gesture in the wrong context (i.e., when the user's intent and the interpretation of his or her gestures are not the same) and to minimize or eliminate the effects of misinterpreting intent. Ideally a single gesture will only ever be used for a single action. If the application supports multi-touch hardware, then there are many gestures available, so there should be little need to reuse one.
(I originally drafted this model in a comment at Point & Do. I decided it should be a model, rather than a stack, due to the unfortunate acronym that "stack" creates.)