Did you say Simpson?

When we push a new version of the Cool Maze mobile app, we get a bit anxious about regression bugs, user frustration, and the need for a hasty rollback. To mitigate this, we closely monitor a key indicator after each release: the percentage of failed or unfinished user actions.

|                  | Successful actions | Unfinished actions | Unfinished ratio |
|------------------|--------------------|--------------------|------------------|
| Previous version | 12981              | 1517               | 10%              |
| New version      | 28422              | 5027               | 15%              |
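
For reference, the indicator is simply the number of unfinished actions divided by the total number of attempted actions. Here’s a minimal sketch in Go (the helper and its counters are illustrative, not our actual monitoring code):

```go
package main

import "fmt"

// unfinishedRatio returns the fraction of actions that were not
// completed, out of all attempted actions.
func unfinishedRatio(successful, unfinished int) float64 {
	total := successful + unfinished
	if total == 0 {
		return 0
	}
	return float64(unfinished) / float64(total)
}

func main() {
	// Counts from the table above.
	fmt.Printf("previous: %.0f%%\n", 100*unfinishedRatio(12981, 1517)) // 10%
	fmt.Printf("new:      %.0f%%\n", 100*unfinishedRatio(28422, 5027)) // 15%
}
```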

The numbers, however, often tell a troubling story: the failure rate is higher in the freshly deployed version. It’s a frustrating situation: we work hard to improve stability and fix bugs, yet the data suggests we’re getting worse. What gives?

Why actions fail

An unfinished action can be caused by a variety of factors: an unstable internet connection, a server error, a frontend JavaScript bug, a mobile app crash, or even a confusing UI change that leads users to tap “Share with > Cool Maze” but then abandon without scanning any QR code.

The likelihood of an action failing is influenced by more than code quality. Two factors matter in particular: which app version a user is running, and how experienced that user is.

App rollouts

Unlike a centralized website, a mobile app rollout can take days or weeks to reach the full user base. This means the previous app version coexists with the new one. Our usage data is a mix of actions from users who have updated and those who haven’t.

To make sense of the data, we segment our users into two groups: “Newcomers” (fewer than 10 app uses) and “Experienced users” (10 or more uses).
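
Here’s a rough sketch of that grouping in Go. The segment helper is illustrative; only the cutoff of 10 uses comes from our actual analysis:

```go
package main

import "fmt"

// Segment names used in the analysis.
const (
	newcomer    = "newcomer"    // fewer than 10 app uses
	experienced = "experienced" // 10 or more uses
)

// segment classifies a user by their lifetime number of app uses.
func segment(useCount int) string {
	if useCount < 10 {
		return newcomer
	}
	return experienced
}

func main() {
	for _, uses := range []int{1, 9, 10, 250} {
		fmt.Printf("%3d uses -> %s\n", uses, segment(uses))
	}
}
```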

Newcomers

|                  | Successful actions | Unfinished actions | Unfinished ratio |
|------------------|--------------------|--------------------|------------------|
| Previous version | 1131               | 466                | 29%              |
| New version      | 10946              | 3990               | 27%              |

Experienced users

|                  | Successful actions | Unfinished actions | Unfinished ratio |
|------------------|--------------------|--------------------|------------------|
| Previous version | 11850              | 1051               | 8%               |
| New version      | 17476              | 1037               | 6%               |

When we segment the data, a surprising and encouraging trend emerges: the unfinished ratio improved in both segments, from 29% to 27% for newcomers and from 8% to 6% for experienced users.

This is a relief! Our bug fixes and stability improvements are working. The situation has improved for both our newest and most loyal users. So, how do we reconcile this segment-level improvement with the overall population’s degraded numbers?

Noncollapsibility

There’s a logical explanation for this apparent contradiction. The key insight is that anyone installing the app for the first time gets the newest version.

This creates a significant bias in our data:

Because new users are unfamiliar with the app, they are inherently more likely to abandon an action. And their weight in the data shifts dramatically between releases: newcomers account for about 11% of the actions recorded on the previous version, but about 45% of those on the new one. We should therefore expect the new version, with its disproportionately high share of new users, to have a higher overall rate of unfinished actions, regardless of the app’s quality.
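
We can replay the reversal directly from the tables above: each segment improves, yet pooling the segments degrades the overall numbers, because newcomers carry far more weight in the new version’s data. A quick sketch in Go:

```go
package main

import "fmt"

type counts struct{ successful, unfinished int }

func (c counts) total() int     { return c.successful + c.unfinished }
func (c counts) ratio() float64 { return float64(c.unfinished) / float64(c.total()) }

func main() {
	// Numbers from the tables above.
	prev := map[string]counts{
		"newcomers":   {1131, 466},
		"experienced": {11850, 1051},
	}
	next := map[string]counts{
		"newcomers":   {10946, 3990},
		"experienced": {17476, 1037},
	}

	for _, seg := range []string{"newcomers", "experienced"} {
		fmt.Printf("%-12s %.0f%% -> %.0f%% (improved)\n",
			seg, 100*prev[seg].ratio(), 100*next[seg].ratio())
	}

	// Pool the segments: the per-segment improvements reverse.
	pool := func(m map[string]counts) counts {
		var p counts
		for _, c := range m {
			p.successful += c.successful
			p.unfinished += c.unfinished
		}
		return p
	}
	pPrev, pNext := pool(prev), pool(next)
	fmt.Printf("%-12s %.0f%% -> %.0f%% (degraded!)\n",
		"overall", 100*pPrev.ratio(), 100*pNext.ratio())

	// Share of all actions performed by newcomers, per version.
	fmt.Printf("newcomer share: %.0f%% -> %.0f%%\n",
		100*float64(prev["newcomers"].total())/float64(pPrev.total()),
		100*float64(next["newcomers"].total())/float64(pNext.total()))
}
```

Running it prints the segment ratios (29% → 27% and 8% → 6%), the pooled ratios (10% → 15%), and the newcomer share of actions (11% → 45%).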

This phenomenon is a classic example of Simpson’s paradox, where a trend that appears in several different groups vanishes or reverses when the groups are combined. While it can be tricky to determine the “right” trend, in our case, the segment-level data provides the clear and correct interpretation: our app’s stability has slightly improved, and we can confidently disregard the misleading overall comparison.