AI has become a hot topic across many industries. The hype surrounding tools like ChatGPT has sparked new ideas and experimentation in a wide range of fields, pushing the boundaries of creativity and leading to successes in endeavors that once seemed like science fiction.
However, I always emphasize that AI is not a simple plug-and-play tool. While the advent of foundation models has helped bridge gaps and facilitated both training and adaptation to new domains, every field has its own rules, constraints, and adoption challenges. Mapping and accounting for these factors can be just as challenging as implementing the machine learning models themselves.
This is especially true in clinical applications, where human health—and often life itself—is at stake.
You can’t simply take an off-the-shelf ML model, wrap it in a nice UI, and integrate it into a medical workflow.
There’s an entire landscape of data uncertainties, procedural nuances, and regulatory constraints that demands a customized approach. ML practitioners can’t simply follow standard software development practices out of the box; they need to calibrate their methods to fit these unique demands.
After more than a decade of academic research in machine learning for specific clinical applications, I’ve learned some valuable lessons. In this post, I’ll share a few of them and explain how we’ve adapted our ML implementation strategies to ensure they align with the intricacies of clinical settings.
Clinical needs are the drivers of the technology, not the other way around
One of the first things I learned is that machine learning models, no matter how advanced, must serve real clinical needs if they are to make a meaningful impact. While there are impressive ML models designed to solve complex problems, only those with practical utility—and that integrate seamlessly into clinical workflows—are capable of driving real-world change.
I have encountered models that perform exceptionally well in academic settings but are unlikely to ever be used in real life because they either fail to address a critical clinical need or introduce unnecessary complexity to an established clinical workflow. We can always apply sophisticated mathematics to human data, but is that truly what patients and clinicians need?
The solution I’ve found to prevent this issue is to invest significant time in understanding the clinical processes associated with the (future) ML model before even writing a single line of code. In academia, this often involves conducting a few interviews with key stakeholders and engaging in extensive research. In industry, we leverage Design Thinking tools such as Design Sprints and user-centered workshops, which provide structured insights in a more efficient and targeted manner.
Data quality and availability shape what’s possible

The goal of AI might be exciting and impactful, but it’s not until you collect, clean, and structure the clinical data that you truly understand how difficult the pursuit might be. You may say, “Well, I have thousands of samples for you to work with.” And that’s great! As ML practitioners, we always want as many samples as possible. However, healthcare data is often messy, incomplete, or siloed across various systems and platforms. Dealing with these challenges is no small task and shouldn’t be underestimated. After analyzing data inconsistencies and running through all the samples, we may discover that the actual size of the usable dataset is much smaller than anticipated, making it unsuitable for training the models we initially envisioned.
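To make this concrete, here’s a minimal sketch (in Python with pandas) of the kind of early audit that reveals the gap between the nominal and the usable sample size. The records and column names (patient_id, diagnosis_label, acquisition_site, and so on) are hypothetical toy stand-ins, not a prescription for how your data should look:

```python
import numpy as np
import pandas as pd

# Toy stand-in for an early subsample of clinical records; in practice this
# would come from your own export, and the column names are placeholders.
df = pd.DataFrame({
    "patient_id":       [1, 2, 2, 3, 4, 5, 6, 7],
    "age":              [54, 61, 61, np.nan, 47, 73, np.nan, 38],
    "diagnosis_label":  ["pos", "neg", "neg", "pos", None, "neg", "pos", None],
    "acquisition_site": ["A", "A", "A", "B", "A", "B", "A", "A"],
})
required = ["patient_id", "age", "diagnosis_label", "acquisition_site"]

# 1. Missingness per column: fields that look "collected" on paper are often
#    sparsely populated in practice.
print(df[required].isna().mean().sort_values(ascending=False))

# 2. Duplicate records per patient, which inflate the apparent dataset size.
print("Duplicated patient_ids:", df.duplicated(subset="patient_id").sum())

# 3. The usable dataset: complete cases, one record per patient.
usable = df.dropna(subset=required).drop_duplicates(subset="patient_id")
print(f"Nominal samples: {len(df)}, usable samples: {len(usable)}")

# 4. How the usable data spreads across sites; one dominant site can quietly
#    limit generalizability later on.
print(usable["acquisition_site"].value_counts(normalize=True))
```

Even a crude pass like this tends to shrink the “thousands of samples” figure considerably, and it’s far cheaper to learn that in the first week than after the model architecture has been chosen.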
It’s also important to consider the quality of the target variables. Supervised machine learning algorithms learn predictive models from pairs of inputs and expected outputs in a dataset, so the labels are as much a part of the training signal as the features. Noise in the target variables, such as mislabeled samples, can result in models that underfit the task or that confidently learn the wrong patterns.
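One simple way to surface suspicious labels early is to score every sample with out-of-fold predictions and flag the cases where a model is confidently wrong; those are good candidates to send back to clinicians for re-annotation. The sketch below is only an illustration of that idea, using scikit-learn on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins for a feature matrix X and labels y assembled from
# clinical annotations; here they are generated so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Out-of-fold predicted probabilities: each sample is scored by a model
# that never saw it during training.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

# Samples where the model assigns very low probability to the recorded label
# are worth a second look before trusting the dataset.
confidence_in_given_label = proba[np.arange(len(y)), y]
suspicious = np.argsort(confidence_in_given_label)[:20]
print("Indices of the 20 most suspicious labels:", suspicious)
```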
To navigate this issue, I recommend accessing a subsample of the data as early as possible during the discovery phase to identify potential challenges. Be open-minded during the design process. And, yes, deep learning models are state-of-the-art for automating complex tasks, but it’s essential to have alternative approaches in mind if the data quality doesn’t meet expectations.
Black boxes don’t belong in the clinic

ML models as black boxes are okay(ish) for automating pricing or predicting market trends. If you don’t get the outcome you wanted, you can always blame the app and move on to the next option to improve results. But in clinical applications, this approach doesn’t work. ML models will never make autonomous decisions because you can’t hold a machine accountable for its impact on your health. Physicians and healthcare professionals will always intervene at some point in the process to either validate the outcomes or make a decision based on them.
Therefore, explainability isn’t an optional feature in this field—it’s mandatory.
Clinicians need to trust that the model’s output aligns with their own reasoning and clinical knowledge, and the only way to ensure this is through an explainable tool that shows what factors were considered in generating the output.
Incorporating explainable resources isn’t always straightforward, but it’s not impossible. It’s essential to keep this requirement in mind from the very beginning of the project. First, you need to understand what types of explanations the users require: Do they want heatmaps superimposed on images? Probabilistic outputs? Uncertainty estimates? How do they expect these to be presented? Once you have this information, you can select the right tool from the wide range of explainability options, implement it, and ultimately evaluate its effectiveness during the validation process to ensure it meets user expectations.
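As one illustration of what such a tool can look like, the sketch below uses SHAP, one of several explainability libraries, to attribute a single prediction of a toy tabular model to individual features; the feature names are invented stand-ins for clinical variables. Image-based models would call for a different technique (for example, heatmap methods), but the workflow of choosing, presenting, and validating explanations with users is the same:

```python
import numpy as np
import shap  # one of several explainability libraries (pip install shap)
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-ins for tabular clinical features; names are hypothetical.
rng = np.random.default_rng(0)
feature_names = ["age", "lab_a", "lab_b", "bmi", "heart_rate", "med_count"]
X = rng.normal(size=(300, len(feature_names)))
y = (X[:, 0] - X[:, 2] + 0.3 * rng.normal(size=300) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer attributes each individual prediction to the input features.
# For this model type, shap_values returns an (n_samples, n_features) array
# of contributions in log-odds space; other models or library versions may
# return a different shape.
explainer = shap.TreeExplainer(model)
contributions = explainer.shap_values(X[:1])[0]

print("Why the model scored this particular case the way it did:")
for name, value in zip(feature_names, contributions):
    print(f"  {name:>10}: {value:+.3f}")
```

Per-case attributions like these map directly to the question clinicians actually ask: “what did the model look at for this patient?”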
Regulation and compliance are fundamental, not afterthoughts
Medical ML models are subject to rigorous regulatory oversight, particularly by agencies like the FDA in the U.S. and the EMA in Europe. AI systems in healthcare may be classified as Software as a Medical Device (SaMD), depending on their intended use and impact on patient care. This classification brings them under strict regulatory scrutiny, as they are considered to be part of the medical decision-making process.
The FDA, for example, classifies SaMD into three categories—Class I, II, and III—based on the level of risk they pose to patients. Class I devices are considered low-risk and often require the least regulatory oversight, typically through general controls like proper labeling and adherence to basic manufacturing standards. Class II devices pose moderate risk and typically require special controls, including performance standards, post-market surveillance, and premarket notification via the 510(k) process. Class III devices are considered high-risk and usually require premarket approval (PMA), which involves a rigorous assessment of safety and effectiveness, including clinical trials.
For ML-based SaMD, regulatory approval often necessitates extensive clinical validation. This validation process includes demonstrating not only the accuracy of the model but also its robustness across diverse patient populations and clinical settings. The FDA’s Total Product Lifecycle (TPLC) approach for SaMD emphasizes continuous monitoring and post-market evaluation, ensuring that the software remains effective and safe as it evolves. Additionally, under the FDA’s Digital Health Innovation Action Plan, the agency has introduced programs like the Pre-Cert Pilot Program, aimed at fostering more efficient approval pathways for innovative AI and digital health technologies, while maintaining patient safety standards.
It’s critical to understand these regulatory pathways early in the development process. By considering whether your AI solution will be classified as Class I, II, or III SaMD right from the start, you can align your development steps with the necessary regulatory milestones, ensuring a smoother transition from prototype to clinical deployment.
Clinical validation is a marathon, not a sprint
While developing an ML model may take months, the process of validating it in clinical settings can take significantly longer—often stretching into years. It’s not just about proving that your model works in a controlled environment; it’s about demonstrating that it holds up in real-world, clinical conditions. This requires retrospective analyses, clinical trials, and extensive user testing. Each stage is crucial to ensure that the model not only performs as expected but also generalizes across diverse patient populations and clinical workflows.
One of the key lessons I’ve learned is that you can’t rush this process, but you can make it more efficient by planning ahead. Start thinking about generalizability early. During the development phase, test your model on a variety of data sources to assess its robustness across different patient demographics, imaging equipment, or healthcare settings. This approach helps to identify potential pitfalls long before formal validation begins, preventing nasty surprises when you get to the clinical trial stage.
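In practice, a cheap proxy for this during development is a leave-one-site-out evaluation: train on all sites (or scanners, or cohorts) but one, and test on the held-out site. The sketch below shows the idea with scikit-learn on synthetic data; the site variable is a hypothetical stand-in for whatever grouping your data actually has:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-in for pooled multi-site clinical data; `site` records which
# hospital / scanner / cohort each sample came from (four hypothetical sites).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 15))
y = (X[:, 0] + 0.4 * rng.normal(size=600) > 0).astype(int)
site = rng.integers(0, 4, size=600)

# Train on all sites but one, evaluate on the held-out site. Large swings in
# per-site performance are an early warning that the model may not generalize
# to new hospitals, devices, or populations.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=site):
    held_out = site[test_idx][0]
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
    print(f"Held-out site {held_out}: AUC = {auc:.3f}")
```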
Another recommendation is to involve clinicians from the outset. By incorporating their feedback during the development and early testing phases, you can ensure that the model meets practical clinical needs and integrates smoothly into existing workflows. This helps avoid costly iterations down the line and can significantly shorten the validation timeline.
Finally, try to incorporate clinical validation into the model’s iterative improvement cycles rather than treating it as a final step. If possible, start with retrospective analyses using historical data and move toward small-scale, prospective studies to build confidence in the model incrementally. This phased approach can help speed up regulatory approval and ultimately bring your model to market faster, without sacrificing the rigor needed for patient safety and effectiveness.
And my last note: AI’s role in medicine is to assist, not replace
One of the most important lessons I’ve learned over the years is that AI’s role in medicine is not to replace clinicians, but to augment their abilities. Despite the headlines about machines taking over jobs, the real strength of AI lies in its capacity to support medical professionals by enhancing their decision-making, not by making decisions autonomously. The most effective AI applications in healthcare aren’t about replacing expertise but about streamlining workflows, flagging potential issues, and reducing the cognitive load on clinicians.
This distinction between automation and augmentation is key. Automation is useful when AI can take over repetitive, time-consuming tasks—think data entry, image preprocessing, or basic diagnostic filtering. These are areas where machines excel, and by offloading these tedious tasks, doctors can focus their attention on the more nuanced, complex aspects of patient care that require human insight. Augmentation, on the other hand, involves AI working as a second set of eyes, offering recommendations, highlighting anomalies, or providing data-driven insights that complement a clinician’s judgment.
For clinicians, I’d invite you to think about the tasks you don’t want to keep doing—the ones that eat up your time but don’t require your medical expertise. Or consider areas where human error, fatigue, or data overload lead to inconsistencies in decision-making. These are the places where AI can make a real difference, by not just making your job easier, but by making your outcomes more reliable and less prone to error.
For ML practitioners and innovators, stay alert to these opportunities. The future of AI in medicine, at least for now, isn’t about full automation or autonomy; it’s about collaboration. The products that will succeed are those that identify the gaps where clinicians need support and fill them with precision and reliability. These are the applications that will stand the test of time, not by replacing human expertise but by enhancing it.
In conclusion, building AI for clinical applications is far from simple. It’s not just about developing powerful machine learning models; it’s about ensuring they truly meet the needs of healthcare professionals, comply with stringent regulations, and perform reliably in real-world clinical settings. Each step is challenging, but these challenges are what drive us to create tools that genuinely make a difference for both clinicians and patients.
If you’re considering how AI could transform your healthcare projects, or you’re looking for support to integrate machine learning into clinical workflows, we’re here to help. With our experience and passion for solving these kinds of problems, we’re ready to collaborate with you and turn your ideas into impactful solutions. Reach out to us to make them happen! And if you’re already working with AI to transform your Health Tech products, check out our Arionkoder Reshape Health Grants: an initiative to fuel Health Tech innovation with AI and Design, supporting startups as they make a profound impact. Applications will close soon, so don’t miss your chance!