Guest Post: Applied AI in healthcare - what 15 digital health teams are actually building and how

Every company right now is feeling pressure to be "doing something with AI." As part of the research for this article I spoke with Domenic Fichera at ianacare who put it like this,

"There was pre-AI when this was on few peoples’ radar… then in a matter of 6 months there's this pressure of 'how do you already not know how to do this? Why isn't our product team building agents?'"

Engineers are getting open checkbooks for Cursor and Claude Code. Clinicians are taking demos from AI scribe vendors. Executives are fielding inbound from startups promising to automate the back office. But when you talk with people about what's actually working, the answers are a lot less flashy.

This blog post comes from my experience as an applied AI software engineer at Pair Team, where I was hired as the first engineer focused on AI. Like most people in this role, I was flying by the seat of my pants trying to figure out how to squeeze real value from this new technology. The most useful learning came not from press releases or vendor pitches, but, unsurprisingly, from conversations with other boots-on-the-ground builders. When I asked one engineer about a blog post his company had published about their, I’m paraphrasing, fleet of AI agents supporting clinical operations, trying to learn from their clearly enlightened agentic system, he laughed and said,

"Those are just OpenAI API calls in deterministic workflows in a trench coat we call agents."

That turned out to be a common theme. To get a fuller picture, I spoke with people from 15 different healthcare companies from the Health Tech Nerds community to understand what applied AI actually looks like right now: where teams are starting, how they're structuring their AI efforts, how they're approaching evals, what's still too cutting-edge to be practical, and how to think about build vs buy when the landscape shifts every few months.

The companies ranged from small practices with a few hundred patients to series B startups with tens of thousands. Most teams had 5-50 engineers and 50-500 clinical or ops staff. The interviews were with the AI builders, software engineers, clinicians, and operators in the weeds implementing these solutions. One caveat: I didn't speak with any very large companies, we reached out to several but it was tricky to connect with a front-lines builder who could talk about the details.

Where most teams start: Internal workflows, scribes, and chat bots

A question I asked every team was, “where did you start?” and almost unanimously people started with one of the following types of projects:

An internal chatbot connected in some way to an internal knowledge base. In a few cases these internal chatbots could also access patient-level data.
An “AI Scribe” for phone calls and video visits.
1. This means transcribing a conversation between a patient and provider, then using an LLM to summarize the conversation and create a structured note for the clinician.
Automation of a simple internal workflow with an LLM API call, like extracting structured information from a PDF, triaging and routing inbound text messages or emails, etc.

These were all pretty intuitive and unsurprising, AI chat is the interface everyone already understands and is well supported with off-the-shelf tooling like CometChat, Mastra, and OpenAI’s ChatKit; creating an AI Scribe workflow is a lower-lift easy win that can be accomplished with an API call to an LLM; its low risk to start with internal workflows where any LLM embarrassing mistakes and hallucinations will be seen by employees, not patients. The learnings can happen here before shipping something customer facing.

How Doro Mind is doing this

I found Doro Mind’s internal chat bot and the use cases enabled by it particularly interesting; they automatically add context to the LLM based on the patient profile you’re viewing and the AI has a tool for writing SQL to query the database to add data to the context window enabling clinicians to ask questions like, “who are the members I haven’t spoken to in the last week? What upcoming appointments are on my calendar, and include a summary of what each customer has been up to recently? Was a [x] task ever created for this member? Do we have images of this patient’s insurance card?, or “how many patients does Doro have?”

Build vs. Buy

Although it may sound like a familiar refrain, a clear pattern emerged from these conversations: many organizations are opting to build instead of buy, as tools like Claude Code and Cursor continue to drive down the cost of software development. This raises two important questions: how are teams building, and in the cases where they choose to buy, what makes that the better path?

Build

Team design: Core AI Eng team

For the teams that were building there was a common pattern in how these orgs are structured: they were almost always anchored on a centralized AI team. The centralized AI team is more often than not the folks who are doing the heavy R&D and AI product strategy, including figuring out how their company would build with AI, designing the shared infrastructure for other teams to use, keeping up with the firehose of new technology releases, and figuring out what best practices actually look like for the company. They’re also acting as internal evangelists, training the rest of the engineering org on how to leverage tools like Cursor or Claude Code.

However, this core team isn't the bottleneck for everything. As the basic skills diffuse outward (learning how to write a decent prompt, building comfort making a simple LLM API call, etc) you start seeing the simpler AI features being built by the whole product engineering team. The central team stays focused on the bleeding edge, the plumbing, the more sophisticated AI projects, while the rest of the org uses the shared infrastructure to automate the workflows they’re subject matter experts on to augment their workflows.

Some exceptions to this centralized approach were the really small teams with just a few engineers or ops members where everyone was an AI builder.

Iterating on prompts process

A common pattern I heard from multiple teams: an engineer gets a prompt from zero to one, but the refinement from one to ten happens with deep involvement (if not full ownership) from subject matter experts. Clinicians and ops teams who deeply understand the workflow, the data, and what good output actually looks like end up driving most of the iteration after initial development

Michael Wasser, co-founder of Neuranimus, described sessions where engineers sit down with physicians using Jupyter notebooks. The physician can see the prompt, review the output, and suggest changes in real time. "The MD will often give feedback on the prompt," he told me. This tight loop between builder and SME was a recurring theme across the teams I spoke with; just regular, good product practice.

Buy

Of course, not everyone is building everything. When I asked teams about what they chose to buy instead of build, a few patterns came up.

The most common pattern was getting AI features for free from existing vendors. Several teams I spoke with use Elation, which now ships AI-powered chart summarization and post-visit clinical notes out of the box. If it's already in the software you’re paying for, it’s hard to make a case to rebuild it without good reason.

Doro Mind is in a similar spot with Canvas, they're piloting Canvas's Hyperscribe feature for AI scribe instead of building their own because their EHR is adding it anyway.

The After Cancer team took a different approach. They built a patient-facing chatbot for cancer survivors, but they didn't build the chat interface from scratch. They bought CometChat, which gave them a HIPAA-compliant frontend with message threading and RAG over their knowledge base. The actual LLM API calls run through self-hosted Bedrock, but the chat UI is entirely off-the-shelf. Their engineer described it as a "building blocks" approach: use vendors for the pieces that aren't core to what you're doing, keep the flexibility to swap things out later.

Some use cases are just genuinely hard to build in-house and better to buy from a vendor with the scale to invest deeply. Zus Health's risk adjustment product is a good example, they're using AI to surface suspect conditions and find supporting evidence in patient records. Under the hood, it's a complex system of rules and LLMs, with human coders reviewing samples and leaving confidence scores, custom data labeling tooling, and eval infrastructure. This is a level of investment that won't make sense for every team to build from scratch. When there's a vendor who's gone deep on a specific problem, it's always worth asking why build. For transparency, the company I work for is a Zus customer.

Another team I spoke with bought Nabla for scribe because it was easy to set up, integrated with Elation, and let them put their scarce AI engineering effort toward more differentiated work. Why innovate on a relatively affordable, off the shelf tool.

And then there are the private practices with no in-house technology team, where buying is typically the only option. For these teams, the decision is both obvious and challenging: they have to buy but the market for AI-powered healthcare software is crowded and moving fast, and evaluating vendors without in-house technical expertise is genuinely tricky.

Getting more advanced, further along the adoption curve

Agents

At its core, an agent is just an LLM in a loop that receives a prompt then can make tool calls (tools can be API calls, triggering another agent, or running a function in code), decides what to do with new data, take an action(s), observes the result, and repeats until it’s decided it’s done. Anthropic published a canonical piece on this called Building Effective Agents that draws a useful distinction:

"Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."

By that definition, most of what I saw teams building are workflows. Eg, an LLM extracts structured data from a fax, a rules engine routes it, another LLM drafts a response, a human reviews it.

The few teams I spoke with who were building something closer to a true agent, where the LLM is genuinely deciding which tools to call and in what order, talked about it as a "new" thing the company was exploring, in an R&D phase. Interestingly, a few of these same teams also described agents as the future of their business. It’s a tension, it's still early, but nobody wants to be late.

One major tailwind is the emergence of MCP (Model Context Protocol) as a standard for how agents connect to external tools and data sources; I always describe MCP to people as just an API with additional written context to explain to the agent how to use each end point (“Tool”). MCP gives you a standardized way to expose your internal systems (EHRs, scheduling tools, databases) as tools an agent can call. The MCP standard, combined with better tooling for building agents that can consume the standard (Claude Code SDK, Mastra, LangGraph, etc) is lowering the barrier. I would expect that at the end of 2026 there are a lot more companies with deployed internal and patient facing agents.

AI voice calls

Pair Team is using ElevenLabs to make outbound phone calls to some patients and handle some inbound calls to do appointment scheduling, for example. This is where we're starting with AI voice, a relatively contained use case where we can learn what works before expanding to other call types. The goal is to gradually take on more call types while keeping humans in the loop for the ones that require more nuance, clinical judgment, to ensure relationship building with social workers.

This is also a good example of the build-and-buy dynamic. We're using an off-the-shelf vendor for their conversational AI product that integrates nicely with Twilio. They've done the hard work of developing the voice models and agent harness, however, we built our own simulation playground and eval pipeline on top of it. We needed a way to eval how the agent handles scenarios where a patient mentions suicidal ideation or domestic abuse before putting it in front of real patients, and there wasn't an off-the-shelf tool that fit our needs. So we built our own safety eval pipeline. A learning from this work was that there's no one-size-fits-all approach to evals, ElevenLabs has a simulation API, which gave us a starting point, but we needed custom tooling that let anyone in the company run our safety protocol against thousands of simulated calls before bringing a new voice agent to production.

How teams have approached evals

I only spoke with a few teams that did any robust and programmatic evals. Many teams would ship something after an engineer played with the workflow or chatbot and self-signed off, or they would do internal red-teaming of the chatbot with a spreadsheet of scenarios to test. Obviously use your own internal judgement on this, but my take is that these simpler protocols are fine for a lot of AI powered workflows if patient safety isn’t at stake.

The more advanced teams are treating evals less like a QA task and more like a data product and a critical part of the process. Building "Golden Datasets" (curated collections of inputs and ideal outputs, often human validated) and running offline evaluations against them prior to shipping. These teams are investing deeply in tools like Langfuse, Arize, or Braintrust for dataset management, logging evaluation runs, production traces, prompt experimentation.

Interestingly, essentially every team built some amount of custom internal software for managing a part of the process of dataset management, experiment running, and human annotation. There’s a Sam Altman quote that came to mind that we’re living in an era of fast fashion software; I wouldn’t say we’ve seen this bare out yet for consumer software, however, this internal tooling felt emblematic of fast-fashion vibe-coded software that’s providing real value. For example, at Pair Team we've built (vibe coded with Claude) custom internal tooling that lets our AI team and embedded clinical SMEs sync data to evaluation datasets in Langfuse, or run scaled evals for AI voice calls for agents built in ElevenLabs. Something to look out for here though, is over investment too early in programmatic evals instead of relying on vibes, SMEs, and taste to improve the pipeline early on.

A common theme was the importance of looking at the data. I know it sounds obvious, but with LLMs producing a huge amount of content, it can be easy to forget the fundamentals of closely reading prompts, getting familiar with the input data, and close review of outputs.

ianacare is a really good example of a team that leans into human review of the whole pipeline. Their secret is a standing weekly meeting where their Data Scientist runs the latest pipeline against a Golden Dataset of about 30 real-world scenarios. He dumps the outputs into a Notion template, and he and the VP of Product, Domenic, sit there and manually score them against a rubric. They read the outputs line-by-line. It sounds unglamorous, but this manual loop does two important things that automated scripts can't:

It builds intuition: they develop a much better understanding of why the certain outputs aren’t what they expect, rather than just seeing a "pass/fail" percentage.
It aligns the team: As Domenic put it, "It’s not just that we’re doing the evals, the important thing is that we’re talking about it together."

This weekly beat allowed a small team (just three people focused on this AI product) to ship a consumer-facing product that handles deeply emotional content safely.

The cutting edge: memory systems, evals, prompt management

Most teams I spoke with weren't building sophisticated "memory systems," running scaled automated evals, implementing multi-agent systems, or using LLM platforms like Langfuse, Arize, or Braintrust for observability and experiment tracking. Only a few teams mentioned using agent orchestration frameworks like LangGraph, Mastra, CrewAI, or Claude Code’s SDK. Only a couple teams were running online LLM-as-a-judge evaluations. No one was building self-improving prompts that automatically refine based on feedback with something like DSPY. Same for online safety guardrails as a separate layer from the system prompt itself. Minimal tool use. If you’re in the same AI Twitter bubble as me, these patterns and others are often talked about, but they haven't diffused into most applied teams yet.

This isn't a criticism. Real world value comes from AI with simpler approaches. But it does mean there's a gap between what's being talked about at the frontier and what's actually being used in production at most care delivery companies. The tooling and best practices around these patterns are still maturing, making it unclear which are worth adopting now versus which to wait on until the ecosystem solidifies.

To underscore the point, fine-tuning is worth calling out specifically. It's frequently mentioned, often poorly understood (I heard several people say they “fine tuned” but actually just meant they edited a prompt), and notably absent from every conversation I had. But that's fine for most use cases(!), true model fine-tuning just isn't necessary for so many use cases. The foundational models have gotten good enough that prompt engineering and in-context learning get you most of the way there, OpenAI even mentions this specifically in their documentation about model optimization.

I'd expect many of these patterns to become more common as tooling matures and best practices solidify and diffuse. Memory systems and RAG context management feel like the most likely candidates for more mainstream adoption in 2026. They're approachable because they build on pains and concepts teams already understand: LLMs hallucinate when they lack context, so providing enriched, relevant information helps. And memory systems solve a practical problem: they let you condense sprawling information into focused snippets that fit in an agent's context window without requiring a completely different mental model for how to build.

Some Takeaways

Don’t forgot the fundamentals of good product development

It surprisingly is easy to forget that we are still just building software for people. The teams finding the most success aren't skipping the boring parts of product development: user research, defining the job to be done, and sitting next to the customer.

Speaking from personal experience at Pair Team, an early mistake we made when standing up our AI focused product team was treating the work as a sort of research initiative that didn't follow the same product workflow. We've turned this around since then. Now our AI development looks like any other product team: a PM, we have a product roadmap, we’re embedding with clinical/ops SMEs, and the same workflow discovery and user research you'd do for non-AI features.

Again Domenic at ianacare is a great example. Their engineers aren’t writing prompts in isolation. They’re sitting side-by-side with their SMEs. He talked about buying Amazon gift cards to recruit real patients for research sessions and practiced "forward deployed engineering" by sitting side-by-side with their internal navigators.

A way to make LLMs work in the real world is to "fall back on proven product practices", and deeply understand the workflow you’re trying to improve. By observing the workflow firsthand, they realized that accuracy wasn't the only metric that mattered. You don't learn that getting the empathy right for a hospice care plan is as important as the medical facts by looking at precision and recall metrics. You learn it by watching a patient react, or maybe cringe, at the initial output of a prompt.

I heard similar sentiments from the team at Zus Health. They’re building complex risk adjustment features, but they were adamant that the heavy lifting wasn't the LLM itself. As one of their engineers put it, the real challenge is "fitting into the workflow." It doesn't matter how large or smart the model is; if it doesn't fit into the chaotic reality of how a doctor or coder spends their day, it’s useless.

If you don’t have a clear job to be done, an agent won't save you. The teams winning right now are treating AI as a tool to solve a business problem, not the solution itself. As the team at Doro reminded me regarding automation: "First you need to create a process. Then I can automate it."

Build vs buy is now speed vs. optionality

The build vs buy decision has always been tricky, but AI development tools have introduced new nuances that most teams are still figuring out. A year ago, the conventional wisdom was clear(er): buy off-the-shelf solutions unless you have a pretty specific need, at some scale where building in-house is cheaper, etc. That heuristic is no longer so straightforward.

Neuranimus, the neurology practice that Michael Wasser co-founded, built an entire EMR from scratch with a team of two engineers, put it bluntly: "The cost of building software has gotten so much cheaper." His practice built their own AI scribe, their own patient invoicing system (in two days, he says), and their own referral management tools. When I asked why they didn't just buy these things, he said, "dev teams used to say 'of course we can build this' and the business side would say 'absolutely not, I know what happened last time.' Now the dev team hears about a vendor search and comes back with a greenfield product they built in a couple days."

One team I spoke with started using OpenAI's API for identity verification during patient onboarding. They're just sending images of driver's licenses to OpenAI to confirm they're valid IDs that match information gathered at signup. As they put it, "OpenAI is so good at extracting details from images that there's no reason for us to pay a third-party provider to do this."

The voice AI space is a particularly stark example. Several teams I spoke with who wanted to build voice agents started by talking to Hippocratic AI, which felt like an obvious choice for a healthcare company. But the foundational models have gotten so good so fast that some teams are going directly to ElevenLabs instead. I spoke with one company who is switching from Retell AI to ElevenLabs because, as they discovered, Retell is using ElevenLabs under the hood anyway, and going direct is half the cost with better latency.

A tricky thing is that making these decisions well requires a decent amount of expertise. You need to understand what the foundational models can do, what specialized vendors add on top, and how fast the landscape is moving and what's likely to change in the next 12 months. An irony is that people who can make informed build vs buy decisions are often the same scarce engineers with AI expertise who could just build the thing themselves and, speaking from experience, can suffer from the desire to build it themselves instead of buy.

Another tension most teams are experiencing is this: you'll start using a vendor that feels obsolete faster than typical software vendors do, or you'll build something in-house because there wasn't good tooling for your particular problem, and then six months later a killer solution emerges. For teams building a lot with AI, there's going to be this constant feeling of buying something you wish you'd built, or building something right before great tooling appears, or betting on a startup that gets made obsolete by a foundational model. This is the exciting and challenging state of the market right now.