Tackling Bias and Improving Generalization in AI Models for Skin Lesion Classification

Data Scientist at Arionkoder

One of the challenges I often face when working with AI models today is understanding how the data used for training can introduce unintended biases. We tend to think that a good split between training, validation, and test sets is enough to ensure our model will generalize well to real-world data. But sometimes, that’s simply not the case.

I still remember the first time I encountered this problem. We had built a segmentation model that performed very well on the task it was designed for—segmenting medical images. However, when we evaluated it on a new dataset, collected in a different region or with a different imaging device, we noticed a significant drop in performance. The cause was distribution shift—a common issue in medical imaging, where changes in acquisition hardware, imaging protocols, and even anatomical differences between individuals (due to ethnicity, skin tone, or age) can greatly affect model performance.

So, how can we address this? In short, we want models that perform well regardless of who the patient is or how the image was captured. This is where domain generalization comes into play—a field focused on building models that generalize beyond the distribution of the training data.

About SkinCheck

Skin cancer is one of the most common cancers globally, and its incidence has continued to rise in recent decades. Early detection is critical—when identified and treated early, skin cancer has high survival rates. However, access to timely diagnosis remains a challenge in many regions due to a shortage of dermatologists and limited healthcare infrastructure.

SkinCheck is an innovative mobile application developed to help bridge this gap. Designed to run natively on iOS devices, it empowers users to photograph skin lesions, receive immediate AI-based classifications (benign vs. potentially malignant), and track changes over time. The app also supports teledermatology, enabling users to consult with medical professionals when needed. By combining on-device machine learning with telehealth capabilities, SkinCheck is helping make early skin cancer screening more accessible, especially in underserved communities.

When we partnered with SkinCheck, our shared goal was clear: to enhance the reliability and fairness of the underlying AI models so they could perform well across a wide range of users and environments. From the beginning, the SkinCheck team demonstrated a strong commitment to clinical responsibility and equitable access to care—values that deeply resonated with our own.

In our initial analysis, we discovered that, like many dermatology datasets, the training data used in the original model was imbalanced—particularly in terms of skin tone representation. This resulted in higher classification performance for lighter skin tones, a known and well-documented issue in the field.

Together with the SkinCheck team, we set out to address this challenge with a clear objective: build a model that generalizes across diverse skin tones, imaging devices, and acquisition conditions. What followed was a focused, data-driven effort to reduce bias and improve robustness, grounded in shared principles of transparency, reproducibility, and user safety.

Our Strategy to Reduce Bias and Improve Generalization

To support SkinCheck’s mission of providing equitable AI-powered skin cancer screening, we focused on building a model that performs consistently across diverse populations. Our strategy combined careful data handling with rigorous evaluation:

  • Dataset Diversification: We expanded and restructured the dataset to include greater variability in skin tones, acquisition devices, and lighting conditions.
  • Fair Sampling: We implemented custom batch sampling strategies to balance representation across subgroups—particularly by skin tone—encouraging the model to learn true lesion features rather than relying on spurious visual cues (a minimal sampling sketch follows this list).
  • Reproducibility: All experiments were tracked and managed using PyTorch Lightning and Weights & Biases, ensuring transparency and repeatability (see the logging sketch below).
  • Subgroup Analysis: We didn’t settle for overall accuracy. We analyzed performance across multiple demographic slices, using fairness-aware metrics to identify gaps and validate improvements.
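
To make the fair-sampling idea concrete, here is a minimal sketch of one common way to implement it in PyTorch: a WeightedRandomSampler that over-samples under-represented skin tones so batches are roughly balanced in expectation. The `skin_tone_labels` array and `make_balanced_loader` helper are illustrative assumptions, not SkinCheck's production code:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, skin_tone_labels, batch_size=32):
    """Build a DataLoader whose batches are balanced across subgroups.

    skin_tone_labels: one non-negative integer subgroup id per sample.
    """
    labels = np.asarray(skin_tone_labels)
    counts = np.bincount(labels)          # samples per subgroup
    weights = 1.0 / counts[labels]        # rarer subgroup -> higher weight
    sampler = WeightedRandomSampler(
        weights=torch.as_tensor(weights, dtype=torch.double),
        num_samples=len(labels),
        replacement=True,                 # required for over-sampling
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```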

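On the reproducibility side, a minimal PyTorch Lightning + Weights & Biases setup along these lines is enough to make every run seeded, logged, and comparable (the project name and hyperparameters here are illustrative, not the actual configuration):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Seed python, numpy, and torch (including dataloader workers) so runs repeat.
pl.seed_everything(42, workers=True)

# Stream metrics, configs, and model checkpoints to Weights & Biases.
logger = WandbLogger(project="skin-lesion-classifier", log_model=True)

trainer = pl.Trainer(
    max_epochs=50,
    deterministic=True,   # prefer deterministic ops where available
    logger=logger,
)
# trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```
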
To evaluate the results, we compared our enhanced model against the original SkinCheck baseline using ROC and Precision-Recall (PR) curves. As shown below, the updated model achieved higher AUC values across both metrics:

Our model became much better at spotting the cases we care about, which matters most in an imbalanced dataset. In technical terms, the global AUC rose from 0.510 to 0.656, a relative improvement of roughly 29%, meaning the model now separates positive from negative cases far more reliably while keeping false alarms low.
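
A comparison like this is straightforward to reproduce with scikit-learn and matplotlib; the sketch below assumes arrays of test labels and predicted probabilities for each model (all names are hypothetical):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_curve, precision_recall_curve,
                             roc_auc_score, average_precision_score)

def compare_models(y_true, preds_by_model):
    """Overlay ROC and Precision-Recall curves for several models."""
    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
    for name, p in preds_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, p)
        prec, rec, _ = precision_recall_curve(y_true, p)
        ax_roc.plot(fpr, tpr, label=f"{name} (AUC={roc_auc_score(y_true, p):.3f})")
        ax_pr.plot(rec, prec, label=f"{name} (AP={average_precision_score(y_true, p):.3f})")
    ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC")
    ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-Recall")
    ax_roc.legend(); ax_pr.legend()
    return fig

# Hypothetical usage:
# compare_models(y_test, {"baseline": p_base, "improved": p_new})
```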

To assess statistical significance, we performed bootstrapped AUC estimation over 1,000 resamples. The distributions below illustrate the performance gap:

The separation between distributions confirms the consistency of the improvement. The uplift was not only visible in global metrics, but also statistically significant—reinforcing the importance of deliberate dataset design and fairness-aware evaluation.
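
For readers who want to run a similar check, here is a minimal sketch of a paired bootstrap over ROC AUC, assuming ground-truth labels and predicted probabilities from both models on the same test set (variable names are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_gap(y_true, p_base, p_new, n_boot=1000, seed=0):
    """Paired bootstrap: resample the test set and score both models on the
    same resample, yielding a distribution of AUC differences."""
    y_true, p_base, p_new = map(np.asarray, (y_true, p_base, p_new))
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample with replacement
        if np.unique(y_true[idx]).size < 2:              # AUC needs both classes
            continue
        gaps.append(roc_auc_score(y_true[idx], p_new[idx])
                    - roc_auc_score(y_true[idx], p_base[idx]))
    return np.asarray(gaps)

# Hypothetical usage: a 95% confidence interval that excludes zero
# indicates a statistically significant improvement.
# gaps = bootstrap_auc_gap(y_true, p_baseline, p_improved)
# print("95% CI for AUC gap:", np.percentile(gaps, [2.5, 97.5]))
```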


Fairness-Aware Performance: Skin Tone Analysis

A critical aspect of our evaluation was ensuring that performance gains were equally distributed across subgroups, especially between light and dark skin tones. We conducted a subgroup ROC analysis to assess whether our improvements translated into fairer outcomes.

As shown below, the ROC curves for light and dark skin tones are closely aligned, with nearly identical AUC scores:

  • Global AUC: 0.909
  • Dark Skin AUC: 0.906
  • Light Skin AUC: 0.900
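
A breakdown like this is simple to compute once per-sample skin-tone annotations are available; the helper below is an illustrative sketch, not the project's actual evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(y_true, y_score, groups):
    """ROC AUC overall and per subgroup (e.g. skin-tone bins)."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    results = {"global": roc_auc_score(y_true, y_score)}
    for g in np.unique(groups):
        mask = groups == g
        if np.unique(y_true[mask]).size == 2:   # AUC is undefined with one class
            results[str(g)] = roc_auc_score(y_true[mask], y_score[mask])
    return results

# Hypothetical usage:
# subgroup_auc(y_test, y_prob, skin_tone)  # -> {"global": ..., "dark": ..., "light": ...}
```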

This level of performance parity is not typical in dermatology AI models, which often show a significant drop in accuracy for darker skin tones due to dataset imbalances. Achieving near-identical results across subgroups highlights the effectiveness of our fairness-driven sampling strategy, as well as the importance of targeted evaluation beyond global metrics.


Looking Forward

This is just the beginning. There are many more strategies we hope to explore to further improve generalization and fairness—such as adversarial training, domain adaptation, and uncertainty estimation.

But above all, this experience reminded us that building reliable AI is not just about models—it’s about process. It requires scientific discipline, careful metric design, and a deep understanding of the data. As machine learning engineers and data scientists, it’s our responsibility to treat these systems as more than just code—they’re tools that will eventually affect lives.

By continuing to develop bias-aware, data-driven, and reproducible AI models in healthcare, we take a step closer to creating tools that are accessible, reliable, and inclusive for all—regardless of background or skin type.