Marble, a startup building artificial intelligence agents for tax professionals, has raised $9 million in seed funding as the accounting industry grapples with a deepening labor shortage and mounting regulatory complexity.The round, led by Susa Ventures with participation from MXV Capital and Konrad Capital, positions Marble to compete in a market where AI adoption has lagged significantly behind other knowledge industries like law and software development."When we looked at the economy and asked ourselves where AI is going to transform the way businesses operate, we focused on knowledge ...moreindustries — specifically businesses with hourly fee-based service models," said Bhavin Shah, Marble's chief executive officer, in an exclusive interview with VentureBeat. "Accounting generates $250 billion in fee-based billing in the US every year. There's a tremendous opportunity to increase efficiency and improve margins for accounting firms."The company has launched a free AI-powered tax research tool on its website that converts complex government tax data into accessible, citation-backed answers for practitioners. Marble plans to expand into AI agents that can analyze compliance scenarios and eventually automate portions of tax preparation workflows.Marble's backers share Shah's conviction about the market. "Marble is rethinking the accounting system from the ground up. Accounting is one of the biggest — and most overlooked — markets in professional services," Chad Byers, general partner at Susa Ventures, told VentureBeat. "We've known Bhavin from his time as an executive in the Susa portfolio, and have seen firsthand how sharp and execution-driven he is. He and Geordie bring the perfect mix of operational depth and product instinct to a space long overdue for change — and they see the same massive opportunity we do."The accounting industry lost 340,000 workers in four years — and replacements aren't comingMarble enters a market shaped by structural forces that have fundamentally altered the economics of professional accounting.The accounting profession has shed roughly 340,000 workers since 2019, a 17% decline that has left firms scrambling to meet client demands. First-time candidates for the Certified Public Accountant exam dropped 33% between 2016 and 2021, according to AICPA data, and 2022 saw the lowest number of exam takers in 17 years.The exodus comes as baby boomers exit en masse. The American Institute of CPAs estimates that approximately 75% of all licensed CPAs reached retirement age by 2019, creating a demographic cliff that the profession has struggled to address.“Fewer CPAs are getting certified year over year," Shah said. "The industry is compressing at the same time that there's more work to be done and the tax code is getting more complicated."The National Pipeline Advisory Group, a multi-stakeholder body formed by the AICPA in July 2023, released a report identifying the 150-hour education requirement for CPA licensure as a significant barrier to entry. A separate survey by the Center for Audit Quality found that 57% of business majors who chose not to pursue accounting cited the additional credit hours as a deterrent.Recent legislative changes reflect the urgency. Ohio now offers alternatives to the 150-hour requirement, signaling that states are willing to experiment with pathways that could reverse enrollment declines.Why AI transformed law and software development but left accounting behindDespite the profession's challenges, AI ...
The tools are available to everyone. The subscription is company-wide. The training sessions have been held. And yet, in offices from Wall Street to Silicon Valley, a stark divide is opening between workers who have woven artificial intelligence into the fabric of their daily work and colleagues who have barely touched it.The gap is not small. According to a new report from OpenAI analyzing usage patterns across its more than one million business customers, workers at the 95th percentile of AI adoption are sending six times as many messages to ChatGPT as the median employee at the same compani...morees. For specific tasks, the divide is even more dramatic: frontier workers send 17 times as many coding-related messages as their typical peers, and among data analysts, the heaviest users engage the data analysis tool 16 times more frequently than the median.This is not a story about access. It is a story about a new form of workplace stratification emerging in real time — one that may be reshaping who gets ahead, who falls behind, and what it means to be a skilled worker in the age of artificial intelligence.Everyone has the same tools, but not everyone is using themPerhaps the most striking finding in the OpenAI report is how little access explains. ChatGPT Enterprise is now deployed across more than 7 million workplace seats globally, a nine-fold increase from a year ago. The tools are the same for everyone. The capabilities are identical. And yet usage varies by orders of magnitude.Among monthly active users — people who have logged in at least once in the past 30 days — 19 percent have never tried the data analysis feature. Fourteen percent have never used reasoning capabilities. Twelve percent have never used search. These are not obscure features buried in submenus; they are core functionality that OpenAI highlights as transformative for knowledge work.The pattern inverts among daily users. Only 3 percent of people who use ChatGPT every day have never tried data analysis; just 1 percent have skipped reasoning or search. The implication is clear: the divide is not between those who have access and those who don't, but between those who have made AI a daily habit and those for whom it remains an occasional novelty.Employees who experiment more are saving dramatically more timeThe OpenAI report suggests that AI productivity gains are not evenly distributed across all users but concentrated among those who use the technology most intensively. Workers who engage across approximately seven distinct task types — data analysis, coding, image generation, translation, writing, and others — report saving five times as much time as those who use only four. Employees who save more than 10 hours per week consume eight times more AI credits than those who report no time savings at all.This creates a compounding dynamic. Workers who experiment broadly discover more uses. More uses lead to greater productivity gains. Greater productivity gains presumably lead to better performance reviews, more interesting assignments, and faster advancement—which in turn provides more opportunity and incentive to deepen AI usage further.Seventy-five percent of surveyed workers report being able to complete tasks they previously could not perform, including programming support, spreadsheet automation, and technical troubleshooting. For workers who have embraced these capabilities, the boundaries of their roles are expanding. For those who have not, the boundaries may be contracting by comparison.The corporate AI paradox: $40 billion spent, 95 percent seeing no return...
Engineering teams are generating more code with AI agents than ever before. But they're hitting a wall when that code reaches production.The problem isn't necessarily the AI-generated code itself. It's that traditional monitoring tools generally struggle to provide the granular, function-level data AI agents need to understand how code actually behaves in complex production environments. Without that context, agents can't detect issues or generate fixes that account for production reality.It's a challenge that startup Hud is looking to help solve with the launch of its...more runtime code sensor on Wednesday. The company's eponymous sensor runs alongside production code, automatically tracking how every function behaves, giving developers a heads-up on what's actually occurring in deployment."Every software team building at scale faces the same fundamental challenge: building high-quality products that work well in the real world," Roee Adler, CEO and founder of Hud, told VentureBeat in an exclusive interview. "In the new era of AI-accelerated development, not knowing how code behaves in production becomes an even bigger part of that challenge."What software developers are struggling with The pain points that developers are facing are fairly consistent across engineering organizations. Moshik Eilon, group tech lead at Monday.com, oversees 130 engineer and describes a familiar frustration with traditional monitoring tools."When you get an alert, you usually end up checking an endpoint that has an error rate or high latency, and you want to drill down to see the downstream dependencies," Eilon told VentureBeat. "A lot of times it's the actual application, and then it's a black box. You just get 80% downstream latency on the application."The next step typically involves manual detective work across multiple tools. Check the logs. Correlate timestamps. Try to reconstruct what the application was doing. For novel issues deep in a large codebase, teams often lack the exact data they need.Daniel Marashlian, CTO and co-founder at Drata, saw his engineers spending hours on what he referred to as an "investigation tax." "They were mapping a generic alert to a specific code owner, then digging through logs to reconstruct the state of the application," Marashlian told VentureBeat. "We wanted to eliminate that so our team could focus entirely on the fix rather than the discovery."Drata's architecture compounds the challenge. The company integrates with numerous external services to deliver automated compliance, which creates sophisticated investigations when issues arise. Engineers trace behavior across a very large codebase spanning risk, compliance, integrations, and reporting modules.Marashlian identified three specific problems that drove Drata toward investing in runtime sensors. The first issue was the cost of context switching. "Our data was scattered, so our engineers had to act as human bridges between disconnected tools," he said.The second issue, he noted, is alert fatigue. "When you have a complex distributed system, general alert channels become a constant stream of background noise, what our team describes as a 'ding, ding, ding' effect that eventually gets ignored," Marashlian said.The third key driver was a need to integrate with the company's AI strategy."An AI agent can write code, but it cannot fix a production bug if it can't see the runtime variables or the root cause,&q...
There is no shortage of AI benchmarks in the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others.AI agents excel at solving abstract math problems and passing PhD-level exams that most benchmarks are based on, but Databricks has a question for the enterprise: Can they actually handle the document-heavy work most enterprises need them to do?The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise ...moreworkloads, exposing a critical gap between academic benchmarks and business reality."If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. "So that's why we were looking around. How do we create a benchmark that, if we get better at it, we're actually getting better at solving the problems that our customers have?"The result is OfficeQA, a benchmark designed to test AI agents on grounded reasoning: Answering questions based on complex proprietary datasets containing unstructured documents and tabular data. Unlike existing benchmarks that focus on abstract capabilities, OfficeQA proxies for the economically valuable tasks enterprises actually perform.Why academic benchmarks miss the enterprise markThere are numerous shortcomings of popular AI benchmarks from an enterprise perspective, according to Elsen. HLE features questions requiring PhD-level expertise across diverse fields. ARC-AGI evaluates abstract reasoning through visual manipulation of colored grids. Both push the frontiers of AI capabilities, but don't reflect daily enterprise work. Even GDPval, which was specifically created to evaluate economically useful tasks, misses the target."We come from a pretty heavy science or engineering background, and sometimes we create evals that reflect that," Elsen said. " So they're either extremely math-heavy, which is a great, useful task, but advancing the frontiers of human mathematics is not what customers are trying to do with Databricks."While AI is commonly used for customer support and coding apps, Databricks' customer base has a broader set of requirements. Elsen noted that answering questions about documents or corpora of documents is a common enterprise task. These require parsing complex tables with nested headers, retrieving information across dozens or hundreds of documents and performing calculations where a single-digit error can cascade into organizations making incorrect business decisions.Building a benchmark that mirrors enterprise document complexityTo create a meaningful test of grounded reasoning capabilities, Databricks needed a dataset that approximates the messy reality of proprietary enterprise document corpora, while remaining freely available for research. The team landed on U.S. Treasury Bulletins, published monthly for five decades beginning in 1939 and quarterly thereafter.The Treasury Bulletins check every box for enterprise document complexity. Each bulletin runs 100 to 200 pages and consists of prose, complex tables, charts and figures describing Treasury operations: Where federal money came from, where it went and how it financed government operations. The corpus spans approximately 89,000 pages across eight decades. Until 1996, the bulletins were scans ...
Presented by CelonisThe State of Oklahoma discovered its blind spots the hard way. In April 2023, a legislative report revealed its agencies had spent $3 billion without proper oversight. Janet Morrow, Director of Oklahoma's Risk, Assessment and Compliance Division, set out to track thousands of monthly transactions across dozens of disconnected systems.The Sooner State became the first U.S. state to apply process intelligence (PI) technology for procurement oversight. The transformation, Morrow says, was immediate. Real-time monitoring replaced multi-year audit cycles. The platform from ...moremarket-leader Celonis quickly identified more than $10 million of inappropriate spending. And the oversight team was able to redeploy staff from 13 to 5 members while dramatically increasing effectiveness.“Process for Progress”: A global movementOklahoma's pioneering success using powerful new process technology spotlights an emerging global trend. Morrow was among more than 3,000 leaders gathered at Celosphere, Celonis’s recent annual conference, to explore how AI, powered with business context by PI, can deliver commercial returns as well as environmental and financial benefits worldwide. The vision: process intelligence as a foundation for public and social progress. The movement sees the combination of AI and PI like Oklahoma’s as a powerful way to help governments and other organizations deliver vital services more cost effectively, with improved decisions and better-informed policies. From procurement to juvenile justice to healthcare and environment, scores of organizations are now getting a first look at the famously byzantine, opaque way things get done.For veteran financial leader Aubrey Vaughan — now Vice President of Strategy for Public Sector at Celonis and formerly a top executive at a major financial software firm — the move toward real process improvement has been a long time coming. He recalls testifying proudly before Congress a few years ago about uncovering $10 billion in improper government payments at his previous company. Afterward, a senior government official pulled him aside and suggested he downplay the achievement.The reason, he was told: "The next question they're going to ask you is, ‘Why is that happening?’” says Vaughn. “Today we can answer not only why, but how we fix it." Across the U.S. and the globe, public agencies are tightening budgets. Desire to deploy AI to close the gap is colliding with a hard reality: you can't automate what you don't understand. Here are three real-world examples of organizations using PI and AI for better outcomes. Oklahoma: Real-time AI spending analysis boosts accountability Within just 60 days of implementation, Celonis reviewed $29.4 billion worth of purchase order lines, identifying $8.48 billion in statutory exempt purchases and flagging problematic transactions. The system now provides real-time feedback to buyers within 15 minutes of purchases, allowing immediate course correction.The system revealed agencies were purchasing from a vendor at prices 45% lower than the statewide contract, forcing renegotiation. "Real-time AI analysis has increased accountability by providing key insights into spending patterns and streamlining contract utilization," Morrow explains. Last year, Oklahoma adopted Celonis's Copilot feature, which uses conversational AI to let executives ask questions in plain language. Now, when the Governor or a cabinet member wonders about a contract, they get answers in seconds, not weeks, Morrow says. Her group is exp...
When many enterprises weren’t even thinking about agentic behaviors or infrastructures, Booking.com had already “stumbled” into them with its homegrown conversational recommendation system.
This early experimentation has allowed the company to take a step back and avoid getting swept up in the frantic AI agent hype. Instead, it is taking a disciplined, layered, modular approach to model development: small, travel-specific models for cheap, fast inference; larger large language models (LLMs) for reasoning and understanding; and domain-tuned evaluations built in-house when precision is critica...morel.
With this hybrid strategy — combined with selective collaboration with OpenAI — Booking.com has seen accuracy double across key retrieval, ranking and customer-interaction tasks.
As Pranav Pathak, Booking.com’s AI product development lead, posed to VentureBeat in a new podcast: “Do you build it very, very specialized and bespoke and then have an army of a hundred agents? Or do you keep it general enough and have five agents that are good at generalized tasks, but then you have to orchestrate a lot around them? That's a balance that I think we're still trying to figure out, as is the rest of the industry.”
Check out the new Beyond the Pilot podcast here, and continue reading for highlights. Moving from guessing to deep personalization without being ‘creepy’Recommendation systems are core to Booking.com’s customer-facing platforms; however, traditional recommendation tools have been less about recommendation and more about guessing, Pathak conceded. So, from the start, he and his team vowed to avoid generic tools: As he put it, the price and recommendation should be based on customer context.
Booking.com’s initial pre-gen AI tooling for intent and topic detection was a small language model, what Pathak described as “the scale and size of BERT.” The model ingested the customer’s inputs around their problem to determine whether it could be solved through self-service or bumped to a human agent.
“We started with an architecture of ‘you have to call a tool if this is the intent you detect and this is how you've parsed the structure,” Pathak explained. “That was very, very similar to the first few agentic architectures that came out in terms of reason and defining a tool call.”
His team has since built out that architecture to include an LLM orchestrator that classifies queries, triggers retrieval-augmented generation (RAG) and calls APIs or smaller, specialized language models. “We've been able to scale that system quite well because it was so close in architecture that, with a few tweaks, we now have a full agentic stack,” said Pathak.
As a result, Booking.com is seeing a 2X increase in topic detection, which in turn is freeing up human agents’ bandwidth by 1.5 to 1.7X. More topics, even complicated ones previously identified as ‘other’ and requiring escalation, are being automated.
Ultimately, this supports more self-service, freeing human agents to focus on customers with uniquely-specific problems that the platform doesn’t have a dedicated tool flow for — say, a family that is unable to access its hotel room at 2 a.m. when the front desk is closed.
That not only “really starts to compound,” but has a direct, long-term impact on customer retention, Pathak noted. “One of the things we've seen is, the better we are at customer service, the more loyal our customers are.”
Another recent rollout is personalized filtering. Booking.com has between 200 and 250 search filters on its website — an unrealistic amount...
Remember this Quora comment (which also became a meme)?(Source: Quora)In the pre-large language model (LLM) Stack Overflow era, the challenge was discerning which code snippets to adopt and adapt effectively. Now, while generating code has become trivially easy, the more profound challenge lies in reliably identifying and integrating high-quality, enterprise-grade code into production environments.This article will examine the practical pitfalls and limitations observed when engineers use modern coding agents for real enterprise work, addressing the more complex issues around integration, scal...moreability, accessibility, evolving security practices, data privacy and maintainability in live operational settings. We hope to balance out the hype and provide a more technically-grounded view of the capabilities of AI coding agents. Limited domain understanding and service limits
AI agents struggle significantly with designing scalable systems due to the sheer explosion of choices and a critical lack of enterprise-specific context. To describe the problem in broad strokes, large enterprise codebases and monorepos are often too vast for agents to directly learn from, and crucial knowledge can be frequently fragmented across internal documentation and individual expertise.More specifically, many popular coding agents encounter service limits that hinder their effectiveness in large-scale environments. Indexing features may fail or degrade in quality for repositories exceeding 2,500 files, or due to memory constraints. Furthermore, files larger than 500 KB are often excluded from indexing/search, which impacts established products with decades-old, larger code files (although newer projects may admittedly face this less frequently).For complex tasks involving extensive file contexts or refactoring, developers are expected to provide the relevant files and while also explicitly defining the refactoring procedure and the surrounding build/command sequences to validate the implementation without introducing feature regressions.Lack of hardware context and usage
AI agents have demonstrated a critical lack of awareness regarding OS machine, command-line and environment installations (conda/venv). This deficiency can lead to frustrating experiences, such as the agent attempting to execute Linux commands on PowerShell, which can consistently result in ‘unrecognized command’ errors. Furthermore, agents frequently exhibit inconsistent ‘wait tolerance’ on reading command outputs, prematurely declaring an inability to read results (and moving ahead to either retry/skip) before a command has even finished, especially on slower machines.This isn't merely about nitpicking features; rather, the devil is in these practical details. These experience gaps manifest as real points of friction and necessitate constant human vigilance to monitor the agent’s activity in real-time. Otherwise, the agent might ignore initial tool call information and either stop prematurely, or proceed with a half-baked solution requiring undoing some/all changes, re-triggering prompts and wasting tokens. Submitting a prompt on a Friday evening and expecting the code updates to be done when checking on Monday morning is not guaranteed.Hallucinations over repeated actions
Working with AI coding agents often presents a longstanding challenge of hallucinations, or incorrect or incomplete pieces of information (such as small code snippets) within a larger set of changesexpected to be fixed by a developer with trivial-to-low effort. However, what becomes particularly problematic is when inc...
For all their superhuman power, today’s AI models suffer from a surprisingly human flaw: They forget. Give an AI assistant a sprawling conversation, a multi-step reasoning task or a project spanning days, and it will eventually lose the thread. Engineers refer to this phenomenon as “context rot,” and it has quietly become one of the most significant obstacles to building AI agents that can function reliably in the real world.A research team from China and Hong Kong believes it has created a solution to context rot. Their new paper introduces general agentic memory (GAM), a system built to pres...moreerve long-horizon information without overwhelming the model. The core premise is simple: Split memory into two specialized roles, one that captures everything, another that retrieves exactly the right things at the right moment.Early results are encouraging, and couldn’t be better timed. As the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM is emerging at precisely the right inflection point.When bigger context windows still aren’t enoughAt the heart of every large language model (LLM) lies a rigid limitation: A fixed “working memory,” more commonly referred to as the context window. Once conversations grow long, older information gets truncated, summarized or silently dropped. This limitation has long been recognized by AI researchers, and since early 2023, developers have been working to expand context windows, rapidly increasing the amount of information a model can handle in a single pass.Mistral’s Mixtral 8x7B debuted with a 32K-token window, which is approximately 24 to 25 words, or about 128 characters in English; essentially a small amount of text, like a single sentence. This was followed by MosaicML’s MPT-7B-StoryWriter-65k+, which more than doubled that capacity; then came Google’s Gemini 1.5 Pro and Anthropic’s Claude 3, offering massive 128K and 200K windows, both of which are extendable to an unprecedented one million tokens. Even Microsoft joined the push, vaulting from the 2K-token limit of the earlier Phi models to the 128K context window of Phi-3.
Increasing context windows might sound like the obvious fix, but it isn’t. Even models with sprawling 100K-token windows, enough to hold hundreds of pages of text, still struggle to recall details buried near the beginning of a long conversation. Scaling context comes with its own set of problems. As prompts grow longer, models become less reliable at locating and interpreting information because attention over distant tokens weakens and accuracy gradually erodes.Longer inputs also dilute the signal-to-noise ratio, as including every possible detail can actually make responses worse than using a focused prompt. Long prompts also slow models down; more input tokens lead to noticeably higher output-token latency, creating a practical limit on how much context can be used before performance suffers.Memories are pricelessFor most organizations, supersized context windows come with a clear downside — they’re costly. Sending massive prompts through an API is never cheap, and because pricing scales directly with input tokens, even a single bloated request can drive up expenses. Prompt caching helps, but not enough to offset the habit of routinely overloading models with unnecessary context. And that’s the tension at the heart of the issue: Memory is essential to making AI more powerful.As context windows stretch into the hundreds of thousands or millions of tokens, the financial overhead rises just as sharply. Scaling context is both...
OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer. For real-world applications, this technique evolves the creation of more transparent and steerable AI systems.What are confessions?Many forms of AI deception result from the complex...moreities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.In a blog post, the OpenAI researchers provide a few examples the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them." The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.How confession training worksThe key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.This approach is powerful because it sidesteps a major challenge in AI training. The researchers’ intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem. Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes ...
Presented by CelonisWhen tariff rates change overnight, companies have 48 hours to model alternatives and act before competitors secure the best options. At Celosphere 2025 in Munich, enterprises demonstrated how they’re turning that chaos into competitive advantage — with quantifiable results that separate winners from losers.Vinmar International: Theglobal plastics and chemicals distributor created a real-time digital twin of its $3B supply chain, cutting default expedites by more than 20% and improving delivery agility across global operations.Florida Crystals: One of America's largest...more cane sugar producers, the company unlocked millions in working capital and strengthened supply chain resilience by eliminating manual rework across Finance, Procurement, and Inbound Supply. AI pilots now extend gains into invoice processing, predictive maintenance, and order management.ASOS: The ecommerce fashion giant connected its end-to-end supply chain for full transparency, reducing process variation, accelerating speed-to-market, and improving customer experience at scale. The common thread here: process intelligence that bridges the gap traditional ERP systems can’t close — connecting operational dots across ERP, finance, and logistics systems when seconds matter. “The question isn’t whether disruptions will hit,” says Peter Budweiser, General Manager of Supply Chain at Celonis. “It’s whether your systems can show you what’s breaking fast enough to fix it.”That visibility gap costs the average company double-digit millions in working capital and competitive positioning. As 54% of supply chain leaders face disruptions daily, the pressure is shifting to AI agents that execute real actions: triggering purchase orders, rerouting shipments, adjusting inventory. But an autonomous agent acting on stale or siloed data can make million-dollar mistakes when tariff structures shift overnight. Tariffs, as old as trade itself, have become the ultimate stress test for enterprise AI — revealing whether companies truly understand their supply chains and whether their AI can be trusted to act.Modern ERP: Data rich, insight poorSupply chain leaders face a paradox: drowning in data while starving for insight. Traditional enterprise systems — SAP, Oracle, PeopleSoft — capture every transaction meticulously. SAP logs the purchase order. Oracle tracks the shipment. The warehouse system records inventory movement. Each performs its function, but when tariffs change and companies need to model alternative sourcing scenarios across all three simultaneously, the data sits in silos.“What’s changed is the speed at which disruptions cascade,” says Manik Sharma, Head of Supply Chain GTM AI at Celonis. “Traditional ERP systems weren’t built for today’s volatility.”Companies generate thousands of reports showing what happened last quarter. They struggle to answer what happens if tariffs increase 25% tomorrow and need to switch suppliers within days.Tariffs: The 48-hour scrambleGlobal trade volatility has transformed tariffs from predictable costs into strategic weapons. When new rates drop with unprecedented frequency, input costs spike across suppliers, finance teams scramble to calculate margin impact, and procurement races to identify alternatives buried in disconnected systems where no one knows if switching suppliers delays shipments or violates contracts.By hour 48, competitors who already modeled scenarios execute supplier switches while late movers face capacity constraints and premium pricing. Process intelligence changes that dynamic by allowing businesses to ...
There's no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model on completing various helpful enterprise tasks — from coding to instruction following to agentic web browsing and tool use. But many of these benchmarks have one major shortcoming: they measure the AI's ability to complete specific problems and requests, not how factual the model is in its outputs — how well it generates objectively correct information tied to real-world data — especially when dealing with information contained in imagery or graphics.For industries where ...moreaccuracy is paramount — legal, finance, and medical — the lack of a standardized way to measure factuality has been a critical blind spot.That changes today: Google’s FACTS team and its data science unit Kaggle released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap. The associated research paper reveals a more nuanced definition of the problem, splitting "factuality" into two distinct operational scenarios: "contextual factuality" (grounding responses in provided data) and "world knowledge factuality" (retrieving information from memory or the web).While the headline news is Gemini 3 Pro’s top-tier placement, the deeper story for builders is the industry-wide "factuality wall."According to the initial results, no model—including Gemini 3 Pro, GPT-5, or Claude 4.5 Opus—managed to crack a 70% accuracy score across the suite of problems. For technical leaders, this is a signal: the era of "trust but verify" is far from over.Deconstructing the BenchmarkThe FACTS suite moves beyond simple Q&A. It is composed of four distinct tests, each simulating a different real-world failure mode that developers encounter in production:Parametric Benchmark (Internal Knowledge): Can the model accurately answer trivia-style questions using only its training data?Search Benchmark (Tool Use): Can the model effectively use a web search tool to retrieve and synthesize live information?Multimodal Benchmark (Vision): Can the model accurately interpret charts, diagrams, and images without hallucinating?Grounding Benchmark v2 (Context): Can the model stick strictly to the provided source text?Google has released 3,513 examples to the public, while Kaggle holds a private set to prevent developers from training on the test data—a common issue known as "contamination."The Leaderboard: A Game of InchesThe initial run of the benchmark places Gemini 3 Pro in the lead with a comprehensive FACTS Score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%).However, a closer look at the data reveals where the real battlegrounds are for engineering teams.ModelFACTS Score (Avg)Search (RAG Capability)Multimodal (Vision)Gemini 3 Pro68.883.846.1Gemini 2.5 Pro62.163.946.9GPT-561.877.744.1Grok 453.675.325.7Claude 4.5 Opus51.373.239.2Data sourced from the FACTS Team release notes.For Builders: The "Search" vs. "Parametric" GapFor developers building RAG (Retrieval-Augmented Generation) systems, the Search Benchmark is the most critical metric.The data shows a massive discrepancy between a model's ability to "know" things (Parametric) and its ability to "find" things (Search). For instance, Gemini 3 Pro scores a high 83.8% on Search tasks but only 76.4% on Parametric tasks. This validates the current enterprise architecture standard: do not rely on a model's internal memory for critical facts.If you are building an inter...
A San Francisco-based startup has demonstrated what it calls a breakthrough in hardware development: an artificial intelligence system that designed a fully functional Linux computer in one week — a process that would typically consume nearly three months of skilled engineering labor.Quilter, which has raised more than $40 million from investors including Benchmark, Index Ventures, and Coatue, used its physics-driven AI to automate the design of a two-board computer system that booted successfully on its first attempt, requiring no costly revisions. The project, internally dubbed "Project...more Speedrun," required just 38.5 hours of human labor compared to the 428 hours that professional PCB designers quoted for the same task.The announcement also marks the first public disclosure that Tony Fadell, the engineer who led development of the iPod and iPhone at Apple and later founded Nest, has invested in the company and serves as an advisor."We didn't teach Quilter to draw; we taught it to think in physics," said Sergiy Nesterenko, Quilter's chief executive and a former SpaceX engineer, in an exclusive interview with VentureBeat. "The result wasn't a simulation — it was a working computer."Circuit board design remains the forgotten bottleneck that delays nearly every hardware productThe announcement shines a light on an unglamorous but critical chokepoint in technology development: printed circuit board layout. While semiconductors and software have received enormous attention and investment, the green fiberglass boards that connect chips, memory, and components in virtually every electronic device remain stubbornly manual to design."Besides auto-routers, the technology really hadn't changed since the early '90s," Fadell told VentureBeat. "The best boards are still made by hand. You go to Apple, they've got the tools, and these guys are just pushing traces, checking everything, doing flood fills—and you're like, there's got to be a better way."The PCB design process typically unfolds in three stages. Engineers first create a schematic — a logical diagram showing how components connect. Then a specialist manually draws the physical layout in CAD software, placing components and routing thousands of copper traces across multiple layers. Finally, the design goes to a manufacturer for fabrication.That middle step — the layout — creates a persistent bottleneck. For a board of moderate complexity, the process typically consumes four to eight weeks. For sophisticated systems like computers or automotive electronics, timelines stretch to three months or longer."The timeline was always this elastic thing—they'd say, 'Yeah, that's two weeks minimum,'" Fadell recalled of his experience at Apple and Nest. "And we'd say, 'No, no. Work day and night. It's two weeks.' But it was always this fixed bottleneck."The consequences ripple through hardware organizations. Firmware teams sit idle waiting for physical boards to test their code. Validation engineers cannot begin debugging. Product launches slip. According to Quilter's research, only about 10 percent of first board revisions work correctly, forcing expensive and time-consuming respins.Project Speedrun put Quilter's AI to the test with an 843-component computer that booted on the first tryProject Speedrun was designed to push the technology to its limits while producing an easily understood result: a working computer that could boot Linux, browse t...
Presented by SAPWhen SAP ran a quiet internal experiment to gauge consultant attitudes toward AI, the results were striking. Five teams were asked to validate answers to more than 1,000 business requirements completed by SAP’s AI co-pilot, Joule for Consultants — a workload that would normally take several weeks.Four teams were told the analysis had been completed by junior interns fresh out of school. They reviewed the material, found it impressive, and rated the work about 95% accurate.The fifth team was told the very same answers had come from AI.They rejected almost everything.Only when as...moreked to validate each answer one by one did they discover that the AI was, in fact, highly accurate — surfacing detailed insights the consultants had initially dismissed. The overall accuracy? Again, about 95%.“The lesson learned here is that we need to be very cautious as we introduce AI — especially in how we communicate with senior consultants about its possibilities and how to integrate it into their workflows,” says Guillermo B. Vazquez Mendez, chief architect, RI business transformation and architecture, SAP America Inc.The experiment has since become a revealing starting point for SAP’s push toward the consultant of 2030: a practitioner who is deeply human, enabled by AI, and no longer weighed down by the technical grunt work of the past.Overcoming AI skepticismResistance isn’t surprising, Vazquez notes. Consultants with two or three decades of experience carry enormous institutional knowledge — and an understandable degree of caution.But AI copilots like Joule for Consultants are not replacing expertise. They’re amplifying it.“What Joule really does is make their very expensive time far more effective,” Vazquez says. “It removes the clerical work, so they can focus on turning out high-quality answers in a fraction of the time.”He emphasizes this message constantly: “AI is not replacing you. It’s a tool for you. Human oversight is always required. But now, instead of spending your time looking for documentation, you’re gaining significant time and boosting the effectiveness and detail of your answers.”The consultant time-shift: from tech execution to business insightHistorically, consultants spent about 80% of their time understanding technical systems — how processes run, how data flows, how functions execute. Customers, by contrast, spend 80% of their time focused on their business.That mismatch is exactly where Joule steps in.“There’s a gap there — and the bridge is AI,” Vazquez says. “It flips the time equation, enabling consultants to invest more of their energy in understanding the customer’s industry and business goals. AI takes on the heavy technical lift, so consultants can focus on driving the right business outcomes.”Bringing new consultants up to speedAI is also transforming how new hires learn.“We’re excited to see Joule acting as a bridge between senior consultants, who are adapting more slowly, and interns and new consultants who are already technically savvy,” Vazquez says.Junior consultants ramp up faster because Joule helps them operate independently. Seniors, meanwhile, engage where their insight matters most.This is also where many consultants learn the fundamentals of today’s AI copilots. Much of the work depends on prompt engineering — for instance, instructing Joule to act as a senior chief technology architect specializing in finance and SAP S/4HANA 2023, then asking it to analyze business requirements and deliver the output as tables or PowerPoint slides.Once they grasp how to frame prompts, consultants consistently...
Presented by BlueOceanAI has become a central part of how marketing teams work, but the results often fall short. Models can generate content at scale and summarize information in seconds, yet the outputs are not always aligned with the brand, the audience, or the company’s strategic goals. The problem is not capability. The problem is the absence of context.The bottleneck is no longer computational power. It is contextual intelligence.Generative AI is powerful, but it doesn’t understand the nuances of the business it supports. It doesn’t have the context for why customers choose one brand ove...morer another or what creates competitive advantage. Without that grounding, AI operates as a fast executor rather than a strategic partner. It produces more, but it does not always help teams make better decisions.This becomes even more visible inside complex marketing organizations where insights live in different corners of the business and rarely come together in a unified way.As Grant McDougall, CEO of BlueOcean, explains, “Inside large marketing organizations, the data is vertical. Digital has theirs, loyalty has theirs, content has theirs, media has theirs. But CMOs think horizontally. They need to combine customer insight, competitive movement, creative performance, and sales signals into one coherent view. Connecting that data fundamentally changes how decisions get made.”This shift from vertical data to horizontal intelligence reflects a new phase in AI adoption. The emphasis is shifting from output volume to decision quality. Marketers are recognizing that the future of AI is intelligence that understands who you are as a company and why you matter to your customers.In BlueOcean’s work with global brands across technology, healthcare, and consumer industries, including Amazon, Cisco, SAP, and Intel, the same pattern appears. Teams move faster and make better decisions when AI is grounded in structured brand and competitive context.Why context is becoming the critical ingredientLarge language models excel at producing language. They do not inherently understand brand, meaning, or intention. This is why generic prompts often lead to generic outputs. The model executes based on statistical prediction, not strategic nuance.Context changes that. When AI systems are supplied with structured inputs about brand strategy, audience insight, and creative intent, the output becomes sharper and more reliable. Recommendations become more specific. Creative stays on brief. The AI begins to act less like a content generator and more like a partner that understands the boundaries and goals of the business.This shift mirrors a key theme from BlueOcean’s recent report, Building Marketing Intelligence: The CMO Blueprint for Context-Aware AI. The report explains that AI is most effective when it is grounded in a clear frame of reference. CMOs who design these context-aware workflows see better performance, stronger creative, and more reliable decision-making.For a deeper exploration of these principles, the full report is available here.The industry’s pivot: From execution to understandingMany teams remain in an experimentation phase with AI. They test tools, run pilots, and explore new workflows. This creates productivity gains but not intelligence. Without shared context, every team uses AI differently, and the result is fragmentation.The companies making the clearest progress treat context as a shared layer across workflows. When teams pull from the same brand strategy, insights, and creative guidance, AI becomes more predictable and more valuable. It suppo...
Anthropic on Monday launched a beta integration that connects its fast-growing Claude Code programming agent directly to Slack, allowing software engineers to delegate coding tasks without leaving the workplace messaging platform where much of their daily communication already happens.The release, which Anthropic describes as a "research preview," is the AI safety company's latest move to embed its technology deeper into enterprise workflows — and comes as Claude Code has emerged as a surprise revenue engine, generating over $1 billion in annualized revenue just six months after...more its public debut in May."The critical context around engineering work often lives in Slack, including bug reports, feature requests, and engineering discussion," the company wrote in its announcement blog post. "When a bug report appears or a teammate needs a code fix, you can now tag Claude in Slack to automatically spin up a Claude Code session using the surrounding context."From bug report to pull request: how the new Slack integration actually worksThe mechanics are deceptively simple but address a persistent friction point in software development: the gap between where problems get discussed and where they get fixed.When a user mentions @Claude in a Slack channel or thread, Claude analyzes the message to determine whether it constitutes a coding task. If it does, the system automatically creates a new Claude Code session. Users can also explicitly instruct Claude to treat requests as coding tasks.Claude gathers context from recent channel and thread messages in Slack to feed into the Claude Code session. It will use this context to automatically choose which repository to run the task on based on the repositories you've authenticated to Claude Code on the web.As the Claude Code session progresses, Claude posts status updates back to the Slack thread. Once complete, users receive a link to the full session where they can review changes, along with a direct link to open a pull request.The feature builds on Anthropic's existing Claude for Slack integration and requires users to have access to Claude Code on the web. In practical terms, a product manager reporting a bug in Slack could tag Claude, which would then analyze the conversation context, identify the relevant code repository, investigate the issue, propose a fix, and post a pull request—all while updating the original Slack thread with its progress.Why Anthropic is betting big on enterprise workflow integrationsThe Slack integration arrives at a pivotal moment for Anthropic. Claude Code has already hit $1 billion in revenue six months since its public debut in May, according to a LinkedIn post from Anthropic's chief product officer, Mike Krieger. The coding agent continues to barrel toward scale with customers like Netflix, Spotify, and Salesforce.The velocity of that growth helps explain why Anthropic made its first-ever acquisition earlier this month. Anthropic declined to comment on financial details. The Information earlier reported on Anthropic's bid to acquire Bun.Bun is a breakthrough JavaScript runtime that is dramatically faster than the leading competition. As an all-in-one toolkit — combining runtime, package manager, bundler, and test runner — it's become essential infrastructure for AI-led software engineering, helping developers build and test applications at unprecedented velocity.Since becoming generally available in May 2025, Claude Code has grown from its origins as an internal engineering experiment into a critical tool fo...
Presented by Design.comFor most of history, design was the last step in starting a business — something entrepreneurs invested in once the idea was proven. Today, it’s one of the first. The rise of generative AI has shifted how small businesses imagine, launch, and grow — turning what used to be a months-long creative process into something interactive, iterative, and accessible from day one.Search data tells the story. Since 2022, global interest in “AI business name generator” has surged more than 700%. Searches for “AI logo generator” are up 1,200%, and “AI website generator” 1,600%. Small ...morebusinesses aren’t waiting for enterprise AI trickle-down. They’re adopting these tools en masse to move faster from concept to brand identity.“The appetite for AI-powered design has been extraordinary,” says Alec Lynch, founder and CEO of Design.com. “Entrepreneurs are realizing they can bring their ideas to life immediately — they don’t have to wait for funding, agencies, or a full creative team. They can start now.”The democratization of design powerFor decades, small businesses were boxed out of high-end design. Building a brand required deep pockets and specialized talent. AI has redrawn that map.Large language models and image generators now act as collaborative partners — sparking ideas, testing directions, and handling tedious layout and copy work. For founders, that means fewer barriers and faster iteration.Instead of hiring separate agencies for naming, logo design, and web development, small businesses are turning to unified AI platforms that handle the full early-stage design stack. Tools like Design.com merge naming, logo creation, and website generation into a single workflow — turning an entrepreneur’s first sketch into a polished brand system within minutes.“AI isn’t replacing creativity,” Lynch adds. “It’s giving people the confidence to express it.”The five frontiers of AI-powered entrepreneurshipToday’s AI tools mirror the creative journey every founder takes — from naming a business to sharing it with the world. The five fastest-growing design categories on Google reflect each stage of that journey.1. Naming: From idea to identityAI naming tools do more than spit out clever words — they help founders discover their voice. A good generator blends tone, personality, and domain availability so the result feels like a fit, not a random suggestion.2. Logos: From visuals to meaningLogo creation is one of the most emotionally resonant steps in brand-building. AI has turned it into a playground for experimentation. Entrepreneurs can test dozens of looks and get instant feedback.3. Websites: From static pages to adaptive brandsThe surge in “AI website generator” searches signals a deeper shift. Websites are no longer static brochures; they’re dynamic brand environments. AI-driven builders now create layouts, headlines, and imagery that adapt to a company’s tone and focus — drastically reducing time to launch.4. Business cards and brand collateralEven in a digital age, tangible touchpoints matter. AI-generated business cards give founders an immediate sense of legitimacy while ensuring design consistency across brand assets.5. Presentations: From slides to storytellingFounders aren’t just designing assets; they’re designing narratives. Generative AI turns bullet points into persuasive visual stories — raising the quality of pitches, decks, and demos once out of reach for most small teams.Together, these five frontiers show that small businesses aren’t just using AI to look more polished — they’re using it to think more strategically abou...
Three years ago this week, Chat GPT was born. It amazed the world and ignited unprecedented investment and excitement in AI. Today, ChatGPT is still a toddler, but public sentiment around the AI boom has turned sharply negative. The shift began when OpenAI released GPT-5 this summer to mixed reviews, mostly from casual users who, unsurprisingly, judged the system by its surface flaws rather than its underlying capabilities.Since then, pundits and influencers have declared that AI progress is slowing, that scaling has “hit the wall,” and that the entire field is just another tech bubble inflate...mored by blusterous hype. In fact, many influencers have latched onto the dismissive phrase “AI slop” to diminish the amazing images, documents, videos and code that frontier AI models generate on command.This perspective is not just wrong, it is dangerous.It makes me wonder, where were all these “experts” on irrational technology bubbles when electric scooter startups were touted as a transportation revolution and cartoon NFTs were being auctioned for millions? They were probably too busy buying worthless land in the metaverse or adding to their positions in GameStop. But when it comes to the AI boom, which is easily the most significant technological and economic transformation agent of the last 25 years, journalists and influencers can’t write the word “slop” enough times. Doth we protest too much? After all, by any objective measure AI is wildly more capable than the vast majority of computer scientists predicted only five years ago and it is still improving at a surprising pace. The impressive leap demonstrated by Gemini 3 is only the latest example. At the same time, McKinsey recently reported that 20% of organizations already derive tangible value from genAI. Also, a recent survey by Deloitte indicates that 85% of organizations boosted their AI investment in 2025, and 91% plan to increase again in 2026.This doesn’t fit the “bubble” narrative and the dismissive “slop” language. As a computer scientist and research engineer who began working with neural networks back in 1989 and tracked progress through cold winters and hot booms ever since, I find myself amazed almost every day by the rapidly increasing capabilities of frontier AI models. When I talk with other professionals in the field, I hear similar sentiments. If anything, the rate of AI advancement leaves many experts feeling overwhelmed and frankly somewhat scared. The dangers of AI denialSo why is the public buying into the narrative that AI is faltering, that the output is “slop,” and that the AI boom lacks authentic use cases? Personally, I believe it’s because we’ve fallen into a collective state of AI denial, latching onto the narratives we want to hear in the face of strong evidence to the contrary. Denial is the first stage of grief and thus a reasonable reaction to the very disturbing prospect that we humans may soon lose cognitive supremacy here on planet earth. In other words, the overblown AI bubble narrative is a societal defense mechanism. Believe me, I get it. I’ve been warning about the destabilizing risks and demoralizing impact of superintelligence for well over a decade, and I too feel AI is getting too smart too fast. The fact is, we are rapidly headed towards a future where widely available AI systems will be able to outperform most humans in most cognitive tasks, solving problems faster, more accurately and yes, more creatively than any individual can. I emphasize “creativity” because AI denialists often insist that certain human qualities (particularly creativity...
Model providers want to prove the security and robustness of their models, releasing system cards and conducting red-team exercises with each new release. But it can be difficult for enterprises to parse through the results, which vary widely and can be misleading. Anthropic's 153-page system card for Claude Opus 4.5 versus OpenAI's 60-page GPT-5 system card reveals a fundamental split in how these labs approach security validation. Anthropic discloses in their system card how they rely on multi-attempt attack success rates from 200-attempt reinforcement learning (RL) campaigns. Open...moreAI also reports attempted jailbreak resistance. Both metrics are valid. Neither tells the whole story.Security leaders deploying AI agents for browsing, code execution and autonomous action need to know what each red team evaluation actually measures, and where the blind spots are.What the attack data showsGray Swan's Shade platform ran adaptive adversarial campaigns against Claude models. The attack success rate (ASR) tells the story.Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at ten and 63.0% at one hundred. In computer use with extended thinking, it held at 0% ASR even after 200 attempts. It's the first model to saturate the benchmark.Sonnet 4.5 at the same thresholds showed 70% ASR in coding and 85.7% in computer use.Opus 4.5 delivers roughly 7x improvement in coding resistance and complete resistance in computer use. This illustrates that the gap between model tiers within the same family is now larger than some cross-vendor comparisons. For security teams, that's a procurement factor that didn't exist six months ago.For OpenAI, the Gray Swan Shade platform found that the o1 system card delivered 6% ASR for harmful text and 5% for malicious code, all based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5's raw ASR hit 89% initially, dropping below 1% within two weeks of patching.The report Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the pattern. With sufficient attempts, Claude 3.5 Sonnet showed 78% ASR, and GPT-4o reached 89%.Two ways to catch deceptionAnthropic monitors approximately 10 million neural features during evaluation using dictionary learning. Features map to human-interpretable concepts, including deception, sycophancy, bias, power-seeking and concealment.Meanwhile, OpenAI's approach relies on chain-of-thought (CoT) monitoring. Their deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy on internal validation. GPT-5 reduced CoT deception from 4.8% to 2.1% compared to o3. The method assumes visible reasoning reflects internal processing. Anthropic's interpretability work directly questions that assumption.When models game the testIn Apollo Research's evaluation of o1, the model attempted to turn off oversight mechanisms 5% of the time when it believed it would be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking. In 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal.Meanwhile, Anthropic reports Opus 4.5's evaluation awareness dropped from 26.5% (Opus 4.1) to less than 10% internally. UK AISI measured evaluation awareness decreasing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Previ...
Just a few short weeks ago, Google debuted its Gemini 3 model, claiming it scored a leadership position in multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that — vendor-provided. A new vendor-neutral evaluation from Prolific, however, puts Gemini 3 at the top of the leaderboard. This isn't on a set of academic benchmarks; rather, it's on a set of real-world attributes that actual users and organizations care about. Prolific was founded by researchers at the University of Oxford. The company delivers high-quality, reliable human data to ...morepower rigorous research and ethical AI development. The company's “HUMAINE benchmark” applies this approach by using representative human sampling and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not just technical performance but also user trust, adaptability and communication style.The latest Humane test evaluated 26,000 users in a blind test of models. In the evaluation, Gemini 3 Pro's trust score surged from 16% to 69%, the highest ever recorded by Prolific. Gemini 3 now ranks number one overall in trust, ethics and safety 69% of the time across demographic subgroups, compared to its predecessor Gemini 2.5 Pro, which held the top spot only 16% of the time.Overall, Gemini 3 ranked first in three of four evaluation categories: performance and reasoning, interaction and adaptiveness and trust and safety. It lost only on communication style, where DeepSeek V3 topped preferences at 43%. The Humane test also showed that Gemini 3 performed consistently well across 22 different demographic user groups, including variations in age, sex, ethnicity and political orientation. The evaluation also found that users are now five times more likely to choose the model in head-to-head blind comparisons.But the ranking matters less than why it won."It's the consistency across a very wide range of different use cases, and a personality and a style that appeals across a wide range of different user types," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred by either small subgroups or on a particular conversation type, it's the breadth of knowledge and the flexibility of the model across a range of different use cases and audience types that allowed it to win this particular benchmark."How blinded testing reveals what academic benchmarks missHUMAINE's methodology exposes gaps in how the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don't know which vendors power each response. They discuss whatever topics matter to them, not predetermined test questions.It's the sample itself that matters. HUMAINE uses representative sampling across U.S. and UK populations, controlling for age, sex, ethnicity and political orientation. This reveals something static benchmarks can't capture: Model performance varies by audience."If you take an AI leaderboard, the majority of them still could have a fairly static list," Bradley said. "But for us, if you control for the audience, we end up with a slightly different leaderboard, whether you're looking at a left-leaning sample, right-leaning sample, U.S., UK. And I think age was actually the most different stated condition in our experiment."For enterprises deploying AI across diverse employee populations, this matters. A model that performs well for one demographic m...
One problem enterprises face is getting employees to actually use the AI agents their dev teams have built. Google, which has already shipped many AI tools through its Workspace apps, has made Google Workspace Studio generally available to give more employees access to design, manage and share AI agents, further democratizing agentic workflows. This puts Google directly in competition with Microsoft’s Copilot and undercuts some integrations that brought OpenAI’s ChatGPT into enterprise applications. Workspace Studio is powered by Gemini 3, and while it primarily targets business teams rather t...morehan developers, it offers builders a way to offload lower-priority agent tasks. “We’ve all lost countless hours to the daily grind: Sifting through emails, juggling calendar logistics and chasing follow-up tasks,” Farhaz Karmali, product director for the Google Workspace Ecosystem, wrote in a blog post. “Legacy automation tools tried to help, but they were simply too rigid and technical for the everyday user. That’s why we’re bringing custom agents directly into Workspace with Studio — so you can delegate these repetitive tasks to agents that can reason, understand context and handle the work that used to slow you down.”The platform can bring agents to Workspace apps such as Google Docs and Sheets, as well as to third-party tools like Salesforce or Jira.More AI in applications Interest in AI agents continues to grow, and while many enterprises have begun deploying them in their workflows, they're finding it isn’t as easy to get users on board as expected. The problem is that using agents can sometimes break employees out of their flow, so organizations have to figure out how to integrate agents where users are already fully engaged. The most common way of interacting with agents so far remains a chat screen. AWS released Quick Sight in hopes of attracting more front- and middle-office workers to use AI agents, although access to agents is still through a chatbot. OpenAI has desktop integrations that bring ChatGPT to specific apps. And, of course, Microsoft Copilot helped was ahead of this trend. Google has an advantage that only Microsoft rivals: It already offers applications that most people use. Enterprise employees use Google Workspace applications, host data and documents on Drive and send emails through Gmail. This means Google can easily get the context enterprises need to power their agents and reach millions of users. If people build agents through Workspace Studio, the platform can prove that agents targeting workplace applications, not just Google Docs, but also Microsoft Word, could be a winning strategy to increase agent adoption from employees. Templatizing agent creation
Enterprise employees can choose from a template or write out what they need in a prompt window. A look around the Workspace Studio platform showed templates such as “auto-create tasks when files are added to a folder” or “create Jira issues for emails with action issues.”Karmali said Workspace Studio is being “deeply integrated with Workspace apps like Gmail, Drive and Chat,” and agents built on the platform can “understand the full context of your work.” “This allows them to provide help that matches your company’s policies and processes while generating personalized content in your tone and style," he said. "You can even view your agent activity directly from the side panels of your favorite Workspace apps." Teams can extend agents to third-party enterprise platforms, but they can also configure custom steps to integrate with other tools. ...