From the meetings that I have recorded, the ‘AI notes’ seem to lose the essence of some parts of the discussion, and the suggested follow-up tasks often don’t make sense. Aside from the often hilarious Teams chat response suggestions that have been in the product for a while now[1], Teams Premium is my first encounter with the new large language model-based AI products from Microsoft. What concerns me is not how poor the product is today, but how close to perfect it is going to get over time.
I think the most worrisome aspect of AI systems in the short term is that we will give them too much autonomy without being fully aware of their limitations and vulnerabilities.
(Melanie Mitchell, Artificial Intelligence: A Guide for Thinking Humans)
I see the output of these AI large language models on a spectrum. At one end, the tools may spit out complete and utter garbage, perhaps not even words. Their uselessness would be obvious to everyone who uses them. At the other end, the AI could output a perfect response (or summary of a meeting, in the case of Teams Premium) every single time. The problem lies in the middle, and gets worse the closer the system is to being consistently perfect. Picture that middle stretch divided into three zones, running from green through yellow to red as the output gets closer to perfect.
Right now, I think that the quality of the Teams Premium ‘AI notes’ feature sits somewhere in the green area. It’s good and useful a lot of the time. For example, I can scan the notes and check whether a topic was mentioned. If that topic isn’t in the notes, it doesn’t mean that it wasn’t discussed; I’d have to watch the video back to check that the AI didn’t miss it. If the meeting was very important and needed to be formally minuted, I would still rely on the video.
As the product improves over time, we’ll move out of the green zone and into the yellow. At this point, I may consciously or subconsciously decide to stop routinely verifying the AI-generated output. It’s good enough, most of the time. Again, if a meeting is really important, I may watch the video.
The real danger comes in the red zone. Here, the AI output is superb most of the time, so much so that I never check it. I rely on the summary even for my important meeting minutes. But it’s not quite at the ‘completely perfect’ end of the spectrum. Occasionally it will trip up. Something will get missed — maybe one meeting in a hundred — and perhaps that something is critical to the conversation we’ve had. Perhaps it will attribute a comment to the wrong person, or miss the nuance of a discussion which was important to get exactly right. We may only find out that the AI produced flawed output for this meeting when an incident arises down the line.
This isn’t a concern about AI getting ‘too good’ and becoming ‘sentient’ in a general sense.[2] It’s more that we have decided to stop thinking, that we have handed control of some part of our workflow over to the AI and no longer verify its output. For me personally, one bad output every 100 recorded meetings might be tolerable. But if we scale this across a large organisation where hundreds or thousands of meetings take place every day, we’re going to have problems.
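To put rough numbers on that scaling worry, here is a back-of-the-envelope sketch in Python. The 1% failure rate is the ‘one meeting in a hundred’ figure used above; the daily meeting counts, the function names and the assumption that each summary fails independently of the others are mine, purely for illustration, and not measurements of Teams Premium.

```python
def expected_flawed(meetings: int, error_rate: float) -> float:
    """Average number of flawed summaries among `meetings` recordings."""
    return meetings * error_rate


def prob_at_least_one(meetings: int, error_rate: float) -> float:
    """Chance that at least one summary is flawed.

    Assumes (hypothetically) that each summary fails independently.
    """
    return 1.0 - (1.0 - error_rate) ** meetings


if __name__ == "__main__":
    error_rate = 0.01  # illustrative: one bad summary per 100 meetings
    for meetings_per_day in (1, 100, 1000):  # illustrative organisation sizes
        flawed = expected_flawed(meetings_per_day, error_rate)
        chance = prob_at_least_one(meetings_per_day, error_rate)
        print(
            f"{meetings_per_day:>5} meetings/day: "
            f"{flawed:.2f} flawed on average, "
            f"{chance:.1%} chance of at least one"
        )
```

Under these assumptions, a single person recording one meeting a day would rarely hit a bad summary, while an organisation holding a thousand meetings a day should expect around ten flawed summaries daily, and a day with none at all becomes vanishingly unlikely. The exact figures matter less than the shape: an error rate that feels tolerable individually becomes a near certainty at scale.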
Baldur Bjarnason explores this in his book The Intelligence Illusion:
I mentioned two of [the flaws] before, automation and anchoring biases. We, as human beings, have a strong tendency to trust machines over our own judgement. This kills people, as it’s been a major problem in aviation. Anchoring bias comes from our tendency to let the initial perceptions, thoughts, and ideas set the context for everything that follows. AI adds a third issue: anthropomorphism. Even the smartest people you know will fall for this effect as large language models are incredibly convincing. These biases combined lead people to feel even more confident in the AI’s work and believe that it’s done a better job than it has.
We’re using the AI tools for cognitive assistance. This means that we are specifically using them to think less. In every other industry this dynamic inevitably triggers our automation bias and compromises our judgement of the work done by the tools. We use the assistant to think less, so we do.
These models are incredibly fluent and, as we saw at the start of this book, are consistently presented by their vendors as near-AGI. This triggers our instinct towards anthropomorphism, making us feel like we have a fully human-level intelligence assisting us, creating an intelligence illusion that again hinders our ability to properly assess the work it’s doing for us.
AI-generated meeting summaries in Teams Premium are a useful starting point for thinking about this technology. There’s no user input beyond hitting the ‘record’ button during a meeting, and everyone with a Teams Premium licence gets access to exactly the same summary. The scope for getting something wrong is limited to how good or bad the summary of the meeting is. So far, so harmless. But Microsoft 365 Copilot will be arriving soon, vastly expanding the problem space with its interactive, prompt-driven approach. Where on the ‘useless to perfect’ spectrum will it land? What if just being ‘very good’ isn’t good enough?