The Agentic AI Digest (10/10) | Asynchronous Streaming, A Framework for LLM Evaluation & New Standards in AI Security
This week: We provide a framework for evaluating LLMs beyond basic benchmarks, unpack the new wave of agentic security standards, and implement asynchronous streaming in our Go agent.
Hi everyone,
Welcome to your weekly briefing from the Agentic AI Roundtable. Our goal is to cut through the noise and deliver the most relevant signals, patterns, and community wins to help you build more effectively.
Let’s dive in.
🛠️ Community Commits: Building in the Open
After last week’s basic implementation of the synchronous “SendMessage” A2A method, this week we explore its asynchronous streaming equivalent, “SendStreamingMessage”. This streaming method is particularly useful for longer-running tasks that produce incremental results, or where it is useful to provide updates over the task’s lifecycle.
Watch Me Build: A2A Agent in Go - Part 5: Streaming Events
Useful links
A2A Protocol Specification - Streaming and Asynchronous Operations
Community Insights
Discussions on patterns / learnings from the community
This week in the community chat, a discussion sparked by Peter Mbui and Andrew Murdoch offered a deep dive into the practical challenges and solutions for processing long-form audio and video, a common hurdle for agentic applications like meeting summarizers.
Peter shared his experience building an “AI minutes builder,” where the initial approach of sending entire meeting recordings to a multimodal model led to significant bottlenecks:
Model Repetition: The model would often get stuck in a repetitive loop, filling the context window without providing a complete summary.
Truncated & Invalid JSON: Outputs were frequently incomplete, resulting in JSON parsing errors.
This led them to pivot to a more robust, decoupled workflow:
Transcription: The audio was segmented into smaller chunks, each chunk was transcribed individually, and the per-chunk transcripts were then recombined to form the final transcript. Some notes from the implementation:
Model Selection is Key: gemini-2.5-pro proved to be significantly more reliable than gemini-2.5-flash for transcribing segments. Flash was often inconsistent and struggled to follow instructions across the different chunks.
Speaker Identification is Difficult: While diarization works within a single segment, matching speakers across different segments (e.g., knowing “speaker 1” in chunk A is the same as “speaker 2” in chunk B) remains a significant challenge. As such, the transcripts were generated without taking note of who was speaking.
The Segment Length Trade-off: The team found an empirical sweet spot. Longer segments increase the risk of hallucination or hitting “resource exhausted” errors. Shorter segments are more reliable but increase the number of API calls, risking hitting Requests Per Minute (RPM) limits. Their optimal balance was 5-minute chunks processed 10 at a time.
LLM for Analysis: The clean, structured transcript is then passed to Gemini. Freed from the task of transcription, the model can focus on higher-level tasks like summarization, identifying key topics, and structuring the final JSON output.
The key takeaway is that for complex, stateful tasks like analyzing long meetings, a monolithic approach often fails. A more resilient pattern involves decoupling the workflow with specialized tools and implementing a thoughtful segmentation strategy that carefully balances performance, cost, and API limitations.
📒 From the Workbench: Patterns to Pocket
This week, we’re continuing our series on generation configuration. Last week, we covered model choice, but focused only on the difference between two specific models: Gemini 2.5 Pro and Gemini 2.5 Flash. The AI landscape is vast and constantly evolving: Google will eventually release the next version of Gemini, at which point last week’s recommendations will be outdated. Rather than only giving you recommendations that will eventually expire, our goal as the AI Roundtable is to equip you to make better decisions long-term. Therefore, this week we’re going deeper into model evaluation metrics and benchmarks, so you can make better informed choices when selecting a model. There are several key metrics we can use to compare models:
End-to-end generation time
Intelligence
Cost
Capabilities and specialized tools
Developer experience
End-to-end generation time
This is the total time from when you send a request until the last token is generated and returned. It is the sum of two distinct phases:
Latency (Time to First Token): The time it takes from request until the first token is generated. While latency varies between models, it is also a function of the input size. Due to the self-attention mechanism in Transformer architectures, the relationship between input tokens and latency is primarily quadratic.
Output Speed (Tokens per Second): The rate at which subsequent tokens are generated after the first one. For any given model, this speed is typically constant, so the time to generate the output is simply the number of output tokens divided by the output speed.
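The two phases combine into a simple back-of-the-envelope estimate. The figures used below are illustrative placeholders, not measured benchmarks:

```go
package main

import "fmt"

// estimateSeconds approximates end-to-end generation time as the time to
// first token plus the decode time for the output:
//
//	total ≈ TTFT + outputTokens / tokensPerSecond
func estimateSeconds(ttft, tokensPerSecond float64, outputTokens int) float64 {
	return ttft + float64(outputTokens)/tokensPerSecond
}

func main() {
	// e.g. 0.8 s to first token, 150 tok/s, and a 600-token answer
	// (all numbers assumed for illustration):
	fmt.Printf("%.1f s\n", estimateSeconds(0.8, 150, 600))
}
```

Note how a long answer is dominated by the decode phase, while a short answer against a long prompt is dominated by latency; this is why both numbers matter when comparing models.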
A quick note on Context Caching: While not typically part of standard benchmarks, using a context cache can significantly decrease latency for requests that share a large, common prefix (like a lengthy document for Q&A).
Generation time (latency, output speed, and end-to-end) is benchmarked on LLM leaderboards such as Artificial Analysis.
Intelligence
“Intelligence” is a useful proxy for the quality of generated content and the model’s ability to succeed at complex tasks. When comparing models, we expect to see the greatest differences in output quality on tasks that require a high degree of reasoning.
For example, Flash and Pro might both score 100% on a simple grammar test. However, on a graduate-level exam, they might score 50% and 90%, respectively. Therefore, always keep the complexity of your task in mind when evaluating a model’s intelligence.
The Artificial Analysis Intelligence Index is a respected aggregate benchmark for comparing models. It incorporates results from over 10 different evaluations. Here is a sample to give you an idea of what they measure:
MMLU-Pro: (Massive Multitask Language Understanding) A broad, multi-discipline exam spanning subjects like history, math, and law to test general knowledge and problem-solving.
GPQA Diamond: (Graduate-Level Google-Proof Q&A) A set of challenging biology, physics, and chemistry questions designed to be “Google-proof”: even skilled non-experts with web access struggle to answer them, making the benchmark a test of deep expert reasoning.
Humanity’s Last Exam: A diverse, multi-subject exam with creative and puzzle-like problems that test for flexible thinking.
LiveCodeBench: A benchmark consisting of real-world coding challenges to measure a model’s practical programming and problem-solving abilities.
Once again, model intelligence is benchmarked on the LLM leaderboard.
Cost
Model pricing is multifaceted. You are typically billed per token, with different rates for input and output tokens. Different rates also often apply to different input modes: multimodal inputs, such as images and audio files, are typically priced higher than plain text.
As a general rule, intelligence and price go hand in hand. More capable models like Gemini Pro are more expensive than highly efficient models like Gemini Flash.
To save costs on high-volume, repetitive tasks, you can use a context cache. With caching enabled, you are billed a small fee per token, per hour to store the context, but in return, you receive a significant discount (around 75% for Gemini models) on the cached input tokens in subsequent requests. For the most accurate details, check the official Gemini pricing documentation.
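A rough comparison shows why caching pays off for high-volume workloads. All prices below are hypothetical placeholders chosen only to illustrate the shape of the calculation; consult the official pricing page for real rates:

```go
package main

import "fmt"

// Hypothetical prices for illustration only; check the official Gemini
// pricing documentation for real rates.
const (
	inputPerMTok    = 1.25  // $ per 1M uncached input tokens (assumed)
	cachedPerMTok   = 0.31  // $ per 1M cached input tokens (~75% discount)
	storagePerMTokH = 0.045 // $ per 1M cached tokens per hour of storage (assumed)
)

// costWithoutCache: every request re-sends the full shared prefix at the
// standard input rate.
func costWithoutCache(prefixTokens, requests int) float64 {
	return float64(prefixTokens) / 1e6 * inputPerMTok * float64(requests)
}

// costWithCache: the prefix is billed at the discounted cached rate per
// request, plus an hourly storage fee while the cache lives.
func costWithCache(prefixTokens, requests int, hours float64) float64 {
	perRequest := float64(prefixTokens) / 1e6 * cachedPerMTok * float64(requests)
	storage := float64(prefixTokens) / 1e6 * storagePerMTokH * hours
	return perRequest + storage
}

func main() {
	// A 500k-token document queried 200 times over 2 hours (figures assumed):
	fmt.Printf("without cache: $%.2f\n", costWithoutCache(500_000, 200))
	fmt.Printf("with cache:    $%.2f\n", costWithCache(500_000, 200, 2))
}
```

The storage fee is a fixed overhead, so the break-even point depends on request volume: caching a large prefix only pays off once enough requests reuse it within the cache's lifetime.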
Capabilities and Specialized Tools
Models also vary in their built-in capabilities. For instance, the latest Gemini models can be grounded with Google Search, allowing them to access and process information from the internet in real-time. See the Gemini documentation for more information about their capabilities.
Developer experience
Considering the developer experience is crucial, as a powerful model is only as good as the tools and support surrounding it.
SDKs and Documentation: Prioritize models with high-quality official SDKs and clear documentation. A well-supported SDK will drastically speed up development. Google actively maintains the Golang Genai SDK.
Community and Support: An active developer community and accessible official support channels are essential for troubleshooting and resolving issues quickly. The Google AI forums are a useful place to look if you encounter an issue.
Prototyping and Model Selection Tools
Besides looking at benchmarks and leaderboards, we can also interact with models before committing to writing any code. Google provides two powerful, code-free tools to help you experiment and select the right model:
Model Garden: A comprehensive catalog where you can discover, explore, and see demos for a wide variety of Google and third-party models.
Vertex AI Studio: A hands-on UI environment for rapid prototyping. Here, you can design and test prompts, tune model behavior with configuration, and directly compare the outputs of different models for your specific use case.
We hope this guide serves as a durable framework for navigating the model landscape, empowering you to build with confidence today and in the future.
📡 On the Radar: What’s Moving the Needle
A curated look at the articles, papers, resources, and updates that are worth your time this week. With several major announcements, the spotlight is firmly on the evolving landscape of AI security.
Google Releases SAIF, a Cybersecurity Framework for AI: Google has introduced the Secure AI Framework (SAIF), a comprehensive guide inspired by the principles of secure-by-design and secure-by-default. SAIF is designed to help organizations manage the unique security risks associated with AI systems. It outlines six core elements, including securing the AI supply chain, hardening technical infrastructure, and promoting responsible release policies. For builders, it serves as a valuable high-level playbook for integrating security throughout the entire AI development lifecycle.
A2AS: A New Security Standard for Agentic Runtimes: A broad coalition of tech companies—including Google, AWS, Meta, and Salesforce—has introduced the A2AS (Agentic AI Runtime Security and Self-Defense) framework. Positioned as a security layer for AI agents “similar to how HTTPS secures HTTP”, A2AS aims to provide a defense-in-depth strategy without introducing significant latency, model retraining, or architectural complexity. It’s built on the BASIC security model, which includes primitives like (B)ehavior certificates, (A)uthenticated prompts, (S)ecurity boundaries, (I)n-context defenses, and (C)odified policies to ensure context integrity and enforce certified agent behavior. This is a critical step toward a universal standard for agentic security.
Solving “Headless” Authentication for Multi-Agent Systems: A thought-provoking post from Auth0 tackles a key security challenge in agent-to-agent (A2A) communication: how do agents securely take action on a user’s behalf, especially in complex multi-agent workflows? The article highlights that typical OAuth flows, which rely on browser redirects, are not suitable for these “headless” exchanges. It proposes Client Initiated Backchannel Authentication (CIBA) as a plausible and powerful flow. CIBA decouples the device where the user authenticates from the device where the action is consumed, making it a natural fit for enabling the next generation of autonomous and collaborative agentic systems.
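To illustrate the shape of that flow, here is a minimal Go simulation of CIBA polling: the client initiates a backchannel authentication request, receives an `auth_req_id`, and polls the token endpoint until the user approves on their own device. The endpoints, payloads, and timings are simplified stand-ins (a fake in-process server "approves" after a short delay), not a conformant OpenID CIBA implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// newFakeAuthServer simulates an authorization server: /bc-authorize issues
// an auth_req_id, and /token reports authorization_pending until the user
// has "approved" on their own device (here, after a short delay).
func newFakeAuthServer() *httptest.Server {
	approvedAfter := time.Now().Add(50 * time.Millisecond)
	mux := http.NewServeMux()
	mux.HandleFunc("/bc-authorize", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]any{"auth_req_id": "req-123"})
	})
	mux.HandleFunc("/token", func(w http.ResponseWriter, r *http.Request) {
		if time.Now().Before(approvedAfter) {
			w.WriteHeader(http.StatusBadRequest)
			json.NewEncoder(w).Encode(map[string]any{"error": "authorization_pending"})
			return
		}
		json.NewEncoder(w).Encode(map[string]any{"access_token": "token-abc"})
	})
	return httptest.NewServer(mux)
}

// cibaLogin runs the client side: initiate the backchannel request, then
// poll the token endpoint until a token arrives or the server rejects it.
func cibaLogin(base, loginHint string) (string, error) {
	resp, err := http.PostForm(base+"/bc-authorize", url.Values{"login_hint": {loginHint}})
	if err != nil {
		return "", err
	}
	var init struct {
		AuthReqID string `json:"auth_req_id"`
	}
	json.NewDecoder(resp.Body).Decode(&init)
	resp.Body.Close()
	for {
		resp, err := http.PostForm(base+"/token", url.Values{
			"grant_type":  {"urn:openid:params:grant-type:ciba"},
			"auth_req_id": {init.AuthReqID},
		})
		if err != nil {
			return "", err
		}
		var tok struct {
			AccessToken string `json:"access_token"`
			Error       string `json:"error"`
		}
		json.NewDecoder(resp.Body).Decode(&tok)
		resp.Body.Close()
		if tok.AccessToken != "" {
			return tok.AccessToken, nil
		}
		if tok.Error != "authorization_pending" {
			return "", fmt.Errorf("auth failed: %s", tok.Error)
		}
		time.Sleep(20 * time.Millisecond) // back off between polls
	}
}

func main() {
	srv := newFakeAuthServer()
	defer srv.Close()
	token, err := cibaLogin(srv.URL, "user@example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println("got token:", token)
}
```

The key property for agents is visible in the structure: no browser redirect ever touches the client, so the same loop works in a fully headless process while the human approves elsewhere.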
🤝 Want to Get Involved in the Community?
This roundtable is driven by its members. To join the conversation, share your work, or ask a question, you have two great options:
Join our private Google Chat space for real-time discussions and to participate in the weekly Open Thread. [Link to Chat Space]
Send a message to our community Google Group at roundtable-community@agentic-ai.build.
We look forward to hearing from you.
The Agentic AI Roundtable Core Team