Theory
In this section I'll detail the different layers of research this project spans, and how I approach each of them:
1. Knowledge representation
2. Language parsing
3. Voice interfaces
4. Kitchen computing
5. Mobile computing
Knowledge representation
What is knowledge representation? It is a scheme for representing information about the world in a structured way. The specific knowledge representation I'm using is the project's innermost layer and is what got me excited to work on the project. It draws upon two main ideas:
1. An entity represents a thing or a concept.
2. A relationship can be used to relate two things via a relation.
For example, DanielBigham, person, and is_a are all entities in the system, and the relationship DanielBigham is_a person links the entities DanielBigham and person via the is_a relation.
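To make the two ideas concrete, here is a minimal sketch in Python. It is not the project's actual code: the class and method names (KnowledgeBase, Relationship, relate, objects) are assumptions made up for illustration. It stores entities as plain string IDs and relationships as subject/relation/object triples.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One fact: subject --relation--> object, all referenced by entity ID."""
    subject: str
    relation: str
    obj: str


@dataclass
class KnowledgeBase:
    """Hypothetical container; the real project's storage is not described."""
    entities: set = field(default_factory=set)
    relationships: set = field(default_factory=set)

    def relate(self, subject: str, relation: str, obj: str) -> Relationship:
        # The relation itself is an entity, so all three IDs are registered.
        self.entities.update((subject, relation, obj))
        rel = Relationship(subject, relation, obj)
        self.relationships.add(rel)
        return rel

    def objects(self, subject: str, relation: str) -> set:
        """Return every entity that `subject` is linked to via `relation`."""
        return {r.obj for r in self.relationships
                if r.subject == subject and r.relation == relation}


kb = KnowledgeBase()
kb.relate("DanielBigham", "is_a", "person")
print(kb.objects("DanielBigham", "is_a"))  # {'person'}
```

Note that is_a is registered as an entity alongside DanielBigham and person, matching the description above.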
Language parsing
Although knowledge representation is an interesting topic, I also wanted to explore language parsing. What that means is breaking a sentence down into a more structured representation that encodes its meaning. While some language parsers simply try to assign attributes such as "possessive noun" or "adjective" to the various words of a sentence, my specific goal was to be able to transform a statement or question into the knowledge representation I describe above.
The first step in parsing a statement is considering what entities the various words in the statement may refer to. For instance, the word "Daniel" may refer to the name "Daniel", or it may refer to me, Daniel Bigham, or it may refer to Daniel of the Bible. What is needed here is a giant mapping of words to entities. In fact, it's a little more complicated than this since some entities are represented by two or more words. For example, the two words "first name" represent an entity with the ID first_name. It should also be noted that words themselves are entities.
At this point, we have a list of possible entities for each word. Before moving forward, we need to realize that, for each word, only one of those entities is correct.
To use an example, the fragment Daniel's first name might ultimately map to the following entities:
DanielBigham "'s" first_name
In other words:
  "Daniel" maps to the entity with the ID DanielBigham
  "'s" doesn't need to be mapped since in this case it refers to the word "'s" itself (remember that words themselves are entities)
  The words "first name" map to the entity with the ID first_name
The language parsing algorithm needs to consider each possibility, but to keep things simple, let's assume for this example that we've settled on the above entities as being correct.
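The word-to-entity step can be sketched as follows. This is only an illustration under stated assumptions: the LEXICON table, the entity IDs Daniel_name and Daniel_of_the_Bible, and the greedy longest-phrase segmentation are all hypothetical, since the text above does not say how the real mapping is stored or how multi-word phrases are detected.

```python
from itertools import product

# Hypothetical lexicon: a phrase (tuple of lowercase words) -> candidate entity IDs.
# Only DanielBigham and first_name are IDs taken from the text.
LEXICON = {
    ("daniel",): ["Daniel_name", "DanielBigham", "Daniel_of_the_Bible"],
    ("first", "name"): ["first_name"],
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in LEXICON)


def candidates(tokens):
    """Segment tokens (longest known phrase first) and list candidates per segment."""
    result, i = [], 0
    while i < len(tokens):
        for length in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(t.lower() for t in tokens[i:i + length])
            if phrase in LEXICON:
                result.append(LEXICON[phrase])
                i += length
                break
        else:
            # Unmapped words stand for themselves, since words are entities too.
            result.append([f'"{tokens[i]}"'])
            i += 1
    return result


def interpretations(tokens):
    """Every way of picking exactly one entity per segment."""
    return [list(choice) for choice in product(*candidates(tokens))]


for option in interpretations(["Daniel", "'s", "first", "name"]):
    print(" ".join(option))
```

With this toy lexicon there are three interpretations, one per meaning of "Daniel", and the one used in the example above (DanielBigham "'s" first_name) is among them.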
Now for the real guts of the process: The primary language parsing strategy I used in this project was the concept of a transformation. A transformation has a left side and a right side. For example, if you want to parse the fragment Daniel's first name, you would:
1. Map the words to entities, as described:
   DanielBigham "'s" first_name
2. Apply the following transformation:
   {noun} "'s" {noun} -> $1.$2
What this says is that if you have a noun followed by an apostrophe-S, followed by another noun, sometimes the second noun refers to a property of the thing that the first noun refers to. You may have noticed that the right side of the transformation uses the project's knowledge representation, which is described above.
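Here is a small sketch of applying such a transformation to a list of entities. It is an assumption-laden illustration rather than the project's implementation: the CATEGORIES table and the function names parse_rule and apply_rule are hypothetical, and I assume $1 and $2 refer to the first and second {noun} placeholders in order.

```python
# Hypothetical category table; the real system would presumably derive this
# from the knowledge representation rather than a hard-coded dictionary.
CATEGORIES = {"DanielBigham": "noun", "first_name": "noun"}

RULE = "{noun} \"'s\" {noun} -> $1.$2"


def parse_rule(rule):
    """Split 'left -> right' into a list of pattern tokens and a template."""
    left, right = rule.split("->", 1)
    return left.split(), right.strip()


def apply_rule(rule, items):
    """Apply the rule at the first position where it matches; None if nowhere."""
    pattern, template = parse_rule(rule)
    for start in range(len(items) - len(pattern) + 1):
        captured = []
        for token, item in zip(pattern, items[start:]):
            if token.startswith("{") and token.endswith("}"):
                # Placeholder: the item must belong to the named category.
                if CATEGORIES.get(item) != token[1:-1]:
                    break
                captured.append(item)
            elif token != item:
                # Literal word entity: must match exactly.
                break
        else:
            result = template
            # Substitute the highest index first so "$1" never clobbers "$10".
            for n in range(len(captured), 0, -1):
                result = result.replace(f"${n}", captured[n - 1])
            return items[:start] + [result] + items[start + len(pattern):]
    return None


print(apply_rule(RULE, ["DanielBigham", "\"'s\"", "first_name"]))
# ['DanielBigham.first_name']
```

The three matched items collapse into the single expression DanielBigham.first_name, which is the form the right side of the transformation describes.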
The project's language parser analyzes a statement and performs a search, considering the different transformations that can be applied to that statement. After a statement has had one transformation applied to it, it may take several more before it has been fully transformed into something that can be represented by the project's knowledge representation. At that point, depending on whether a question is being asked, a statement is being made, or a command is being issued, the software can look up the value being requested, modify its knowledge representation to reflect the statement, or perform the requested action.
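The search over transformations might look roughly like the sketch below. Several things here are assumptions, not facts about the project: I use a breadth-first search (the text does not say which search strategy is used), the second rule (for "what is ...") and the lookup(...) notation are invented for the example, and rules are written as Python predicates rather than in the project's transformation syntax.

```python
from collections import deque

NOUNS = {"DanielBigham", "first_name"}  # hypothetical category membership
APOSTROPHE_S = "\"'s\""                 # the word entity "'s"


def is_noun(item):
    return item in NOUNS


def is_literal(word):
    return lambda item: item == word


# Each hypothetical rule: (predicates for consecutive items, replacement builder).
RULES = [
    ((is_noun, is_literal(APOSTROPHE_S), is_noun), lambda a, _s, b: f"{a}.{b}"),
    ((is_literal('"what"'), is_literal('"is"'), lambda item: "." in item),
     lambda _w, _i, prop: f"lookup({prop})"),
]


def successors(state):
    """Yield every state reachable by applying one rule at one position."""
    for predicates, build in RULES:
        width = len(predicates)
        for start in range(len(state) - width + 1):
            window = state[start:start + width]
            if all(p(x) for p, x in zip(predicates, window)):
                yield state[:start] + (build(*window),) + state[start + width:]


def parse(items, is_goal):
    """Breadth-first search for a fully transformed state; None if none exists."""
    start = tuple(items)
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        if is_goal(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


def fully_transformed(state):
    return len(state) == 1 and state[0].startswith("lookup(")


question = ['"what"', '"is"', "DanielBigham", APOSTROPHE_S, "first_name"]
print(parse(question, fully_transformed))
```

In this toy run the possessive rule fires first and the "what is" rule second, which mirrors the point that a statement may need several transformations before it is fully reduced.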
Voice interfaces
Once a knowledge representation and language parser are in place, a voice interface can be developed.
Ideally, what is known as dictation could be used, whereby the computer tries to hear what the user says, without restriction. But voice recognition isn't good enough to do this yet, at least not using an array microphone, and even with a headset I have never had good results with dictation.
Instead, what is known as a command and control grammar is needed. This is unfortunate because it means that even if the language parser is smart enough to handle a wide spectrum of input, the voice grammar still needs to be developed to handle all possible inputs.
My approach was to develop a simple language that would be an order of magnitude more compact than writing an XML grammar by hand, and to integrate the grammar compiler with the knowledge representation's is_a hierarchy and entity-to-word mappings. For example, you can write:
... and an XML grammar will be produced, allowing the user to say, in place of food_item, any word in the system that maps to an entity with a direct or indirect is_a relationship with food_item. For example, perhaps the words "apple pie" map to the entity apple_pie, which has an is_a relationship with pie, which has an is_a relationship with food_item.
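The "direct or indirect is_a relationship" lookup can be sketched like this. The IS_A and WORDS tables and the function names are hypothetical stand-ins for the knowledge representation and the entity-to-word mappings; the broom and household_item entries are invented sample data.

```python
# Hypothetical data mirroring the apple pie example in the text.
IS_A = {
    "apple_pie": {"pie"},
    "pie": {"food_item"},
    "broom": {"household_item"},
}
WORDS = {
    "apple_pie": ["apple pie"],
    "pie": ["pie"],
    "broom": ["broom"],
}


def is_kind_of(entity, ancestor, seen=None):
    """True if entity reaches ancestor through one or more is_a links."""
    seen = set() if seen is None else seen
    for parent in IS_A.get(entity, ()):
        if parent == ancestor:
            return True
        if parent not in seen:
            # Track visited entities so a cyclic hierarchy cannot loop forever.
            seen.add(parent)
            if is_kind_of(parent, ancestor, seen):
                return True
    return False


def words_for(category):
    """All words whose entity has a direct or indirect is_a link to category."""
    return sorted(word
                  for entity, words in WORDS.items()
                  if is_kind_of(entity, category)
                  for word in words)


print(words_for("food_item"))  # ['apple pie', 'pie']
```

A grammar compiler along these lines would substitute the returned word list wherever {food_item} appears in a pattern.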
Additionally, you can write a word in parentheses to make it optional. For example:
we need (more) {food_item}
And finally, the "|" character can be used to create a list of possible words that can be used. Surrounding this list with round brackets makes it optional that any of these words be spoken, while surrounding this list with square brackets makes it necessary that one of them be spoken. For example:
play the song (named|called) {song_name}
Or:
we need more [{food_item} | {household_item}]
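To show how the three constructs combine, here is a sketch that expands a pattern into every phrase it accepts. Note what it does not do: it does not emit an XML grammar as the real compiler does, it simply enumerates phrases. The expand function, the TOKEN regular expression, and the slots dictionary (standing in for the is_a and word lookups) are all assumptions for illustration.

```python
import re
from itertools import product

# Brackets and "|" are single tokens; {name} is a slot; anything else is a word.
TOKEN = re.compile(r"[()\[\]|]|\{\w+\}|[^\s()\[\]|{}]+")


def expand(pattern, slots):
    """Return every phrase a well-formed pattern accepts, as a list of strings."""
    tokens = TOKEN.findall(pattern)
    pos = 0

    def alternatives(closer):
        # Parse 'sequence | sequence | ...' up to the closing bracket or the end.
        nonlocal pos
        options = sequence(closer)
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            options += sequence(closer)
        return options

    def sequence(closer):
        # Parse consecutive items; each item contributes a list of choices.
        nonlocal pos
        parts = []
        while pos < len(tokens) and tokens[pos] not in ("|", closer):
            tok = tokens[pos]
            pos += 1
            if tok in ("(", "["):
                inner = alternatives(")" if tok == "(" else "]")
                pos += 1  # skip the closing bracket
                # Round brackets are optional, so saying nothing is also a choice.
                parts.append(inner + [""] if tok == "(" else inner)
            elif tok.startswith("{"):
                parts.append(list(slots[tok[1:-1]]))
            else:
                parts.append([tok])
        return [" ".join(w for w in combo if w) for combo in product(*parts)]

    return alternatives(None)


slots = {"food_item": ["milk", "apple pie"], "household_item": ["soap"]}
for phrase in expand("we need more [{food_item} | {household_item}]", slots):
    print(phrase)
```

Enumerating phrases this way grows quickly with the number of optional parts and slot words, which is presumably one reason to compile to a grammar format instead of a flat phrase list.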
Kitchen computing
Once a system has a knowledge representation, a language parser, and a voice interface, there are a variety of applications it can be targeted at. In my mind, the room of the house best suited to this kind of voice interface is the kitchen: most of the time you're busy doing something or your hands are grimy, and stopping to type something into a computer is disruptive. Instead, it is useful to be able to speak a question or command and have the answer spoken back or the command executed.
In my mind, the most interesting applications of this technology are such things as nutritional tracking and calendaring.
I focused on calendaring since I felt it was an attainable goal, and the result is a talking calendar. Here are some example commands:
  "I have a work appointment tomorrow at nine am"
  "My next dentist appointment is April seventh at two forty five pm"
  "I have a get together with Doug on April fifth at seven thirty"
And here are some example queries:
  "What appointments do I have today?"
  "What appointments do I have on Saturday?"
  "What appointments do I have next Wednesday?"
  "What appointments do I have left this week?"
  "What appointments does Meredith have next week?"
  "When is my next massage appointment?"
  etc.
The application integrates with Google Calendar (in one direction) so that appointments can be viewed online. The next step was configuring my work calendar to sync with Google Calendar. The end result is being able to say "I have a massage appointment next Wednesday at four forty five pm" while standing in my kitchen, and then seeing that appointment when I get to work. Voila!
Here are some other interesting applications of this technology:
  The car
  The smart phone
  The entertainment center
  The alarm clock
Mobile computing
Having a voice-enabled system for managing your calendar, grocery list, and so on is great, but it's limited if it can only be accessed in one location. The ideal is for that information to be in the cloud, so to speak, accessible on a variety of devices from a variety of locations.
There are a variety of ways to accomplish this, the simplest being to host the application on an Internet server. The strategy I've used with Grace, primarily, is to do just this: Have a server running the application, at home, accessible via a web interface that allows the application to be used from any device that has a web browser.
Another strategy I've used is to synchronize data from the Grace application out to Google Calendar, and from there, to my work calendar. The main drawback here is that, as of now, calendars can't be edited outside of the Grace application. Rather, the web interface would be used, and synchronization would then carry those changes out to the other views.
Another useful technique is to develop an application targeted for a mobile device. In my case, I targeted an application for the BlackBerry, allowing either voice queries/commands or typed queries/commands to be given.