Taking your example step by step, there are a few operations which I believe are common between human and machine text interpretation. I work on Siri, so I'll use examples from the Assistant domain.

> It is all to do with shape recognition - think about reading. We look at the page and interpret the shapes so as to form words, which are pre-stored in our memory from when we learnt to read. We can connect the words to form meaningful sentences so we can understand. AI can easily have a camera that would do the same and end up with a word, but no understanding. That's not a problem if you are just comparing the word to a table of words where each results in some action, as in a digital system, but that is not AI; to be AI it would need to make the decision based on knowledge, which is a very long way off. AI would need to read and understand in order to learn, which I think is still very sci-fi at the moment.
- Given the shapes, what are the words? Humans and machines do this in functionally equivalent ways - essentially it's as you say: we compare the squiggles on the page to lots of learned examples and make a decision. Both humans and machines are biased towards 'common' sequences of words, e.g. 'I need some groceries so I'm off to the shops' is much more likely than 'I need some groceries so I'm off to the ships' - both humans and AI use their previous experience of language to bias their predictions about the squiggles.
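To make the 'shops vs. ships' bias concrete, here is a toy sketch (nothing like Siri's actual recognizer, and the counts are invented for illustration) of how a simple bigram language model can break a tie between two visually or acoustically similar candidate readings:

```python
# Hypothetical bigram counts from some corpus: how often word B follows word A.
# These numbers are made up purely to illustrate the idea.
bigram_counts = {
    ("off", "to"): 900,
    ("to", "the"): 5000,
    ("the", "shops"): 400,
    ("the", "ships"): 20,
}

def sequence_score(words):
    """Score a word sequence by summing its bigram counts (higher = more likely)."""
    return sum(bigram_counts.get(pair, 0) for pair in zip(words, words[1:]))

# Two candidate interpretations of the same squiggles:
shops = "I'm off to the shops".split()
ships = "I'm off to the ships".split()

# Prior experience of language favours the common sequence.
print(sequence_score(shops) > sequence_score(ships))  # -> True
```

Real recognizers combine a score like this with a score for how well each word matches the shapes/sounds themselves, but the biasing principle is the same.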
- Given the words, what is the meaning? This is where humans have almost unlimited advantages, and where systems like ChatGPT are moving the needle for machines. Does 'hold my beer' really mean that there is a beverage nearby which requires immobilisation? Sometimes yes, but usually no. So how do you learn what 'hold my beer' means? You read and understand the context. If it's a paragraph where somebody is literally drinking a beer and needs to relinquish it temporarily, it might be literal. More likely, you see 'hold my beer' in a meme or other high-frequency graphical motif, or at the end of a social media post, and you learn it's actually a metaphor. AI can learn that too, and from the same signals. The clever bit about GPT and other transformer-based large language models is that the AI can learn it without humans having to label all of the training data - just like humans can.
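The 'no labels needed' point can be sketched in miniature: in next-word prediction, the training target comes from the raw text itself, so no human annotation is required. The toy corpus and counting 'model' below are invented for illustration; real transformers learn vastly richer patterns, but from the same kind of signal:

```python
from collections import Counter, defaultdict

# Unlabeled raw text is the entire training set - nobody tagged anything.
corpus = (
    "hold my beer while I try this . "
    "watch this , hold my beer . "
    "please hold my coat for a moment ."
).split()

# Count which word follows each two-word context (a tiny n-gram 'model').
next_word = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    next_word[(a, b)][c] += 1

def predict(a, b):
    """Most likely continuation of the context (a, b) under the counts."""
    return next_word[(a, b)].most_common(1)[0][0]

# The supervision signal (the next word) came from the text itself:
print(predict("hold", "my"))  # -> 'beer'
```

A transformer replaces the counting table with a learned neural network, which is what lets it generalise from contexts it has seen to contexts it hasn't - but the self-supervised setup is the same.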
- What should be the response? Here the AI is often on more solid ground, simply because its range of responses is smaller. 'Hey X, be a love and turn on the lights while I find my glasses' has a lot of content in it, and if you are X, you can probably construct a whole mental scene around it suggesting how the person feels about you, where they are, what the likely event context is, where they left their glasses, etc., and therefore what the most salient response is. But if X = Siri, the most salient response is smart_lights.TurnOn(). As assistants mature, more of that sentence might matter, e.g. 'turning the lights on, also I can see your glasses on the hall table' (creepy!).
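A minimal sketch of that last step, assuming a keyword-trigger intent table (real assistants use learned classifiers, and the `SmartLights` class and trigger sets here are invented for illustration - only `smart_lights.TurnOn()` comes from the text above). The point is that the assistant can ignore most of the sentence's content and still pick the one salient action:

```python
class SmartLights:
    """Stand-in for a smart-home API; hypothetical, for illustration only."""
    def TurnOn(self):
        return "lights on"

smart_lights = SmartLights()

# Intent table: if all trigger words appear in the utterance, run the action.
INTENTS = [
    ({"turn", "on", "lights"}, smart_lights.TurnOn),
    ({"turn", "off", "lights"}, lambda: "lights off"),
]

def respond(utterance):
    words = set(utterance.lower().replace(",", "").split())
    for triggers, action in INTENTS:
        if triggers <= words:  # every trigger word is present
            return action()
    return "Sorry, I didn't get that."

print(respond("Hey X, be a love and turn on the lights while I find my glasses"))
# -> 'lights on'
```

Everything else in the sentence - the endearment, the glasses, the implied relationship - is simply discarded, which is exactly why the response problem is easier for the machine than for the human listener.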