Did you say Simpson?

When we push a new version of the Cool Maze mobile app, we get a bit anxious about regression bugs, user frustration, and the need for a hasty rollback. To mitigate this, we closely monitor a key indicator after each release: the percentage of failed or unfinished user actions.

|                  | Successful actions | Unfinished actions | Unfinished ratio |
|------------------|--------------------|--------------------|------------------|
| Previous version | 12981              | 1517               | 10%              |
| New version      | 28422              | 5027               | 15%              |
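
For reference, the indicator is simply the number of unfinished actions divided by the total number of attempted actions. Here’s a minimal sketch in Go (the helper and its counters are illustrative, not our actual monitoring code):

```go
package main

import "fmt"

// unfinishedRatio returns the fraction of actions that were not
// completed, out of all attempted actions.
func unfinishedRatio(successful, unfinished int) float64 {
	total := successful + unfinished
	if total == 0 {
		return 0
	}
	return float64(unfinished) / float64(total)
}

func main() {
	// Counts from the table above.
	fmt.Printf("previous: %.0f%%\n", 100*unfinishedRatio(12981, 1517)) // 10%
	fmt.Printf("new:      %.0f%%\n", 100*unfinishedRatio(28422, 5027)) // 15%
}
```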

The numbers, however, often tell a troubling story: the failure rate is higher in the freshly deployed version. It’s a frustrating situation: we work hard to improve stability and fix bugs, yet the data suggests we’re getting worse. What gives?

Why actions fail

An unfinished action can be caused by a variety of factors: an unstable internet connection, a server error, a frontend JavaScript bug, a mobile app crash, or even a confusing UI change that leads users to tap “Share with > Cool Maze” but then abandon without scanning any QR code.

The likelihood of an action failing is influenced by more than code quality. Two factors matter in particular: which app version a user is running, and how experienced that user is.

App rollouts

Unlike a centralized website, a mobile app rollout can take days or weeks to reach the full user base. This means the previous app version coexists with the new one. Our usage data is a mix of actions from users who have updated and those who haven’t.

To make sense of the data, we segment our users into two groups: “Newcomers” (fewer than 10 app uses) and “Experienced users” (10 or more uses).
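
Here’s a rough sketch of that grouping in Go. The segment helper is illustrative; only the cutoff of 10 uses comes from our actual analysis:

```go
package main

import "fmt"

// Segment names used in the analysis.
const (
	newcomer    = "newcomer"    // fewer than 10 app uses
	experienced = "experienced" // 10 or more uses
)

// segment classifies a user by their lifetime number of app uses.
func segment(useCount int) string {
	if useCount < 10 {
		return newcomer
	}
	return experienced
}

func main() {
	for _, uses := range []int{1, 9, 10, 250} {
		fmt.Printf("%3d uses -> %s\n", uses, segment(uses))
	}
}
```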

Newcomers

|                  | Successful actions | Unfinished actions | Unfinished ratio |
|------------------|--------------------|--------------------|------------------|
| Previous version | 1131               | 466                | 29%              |
| New version      | 10946              | 3990               | 27%              |

Experienced users

|                  | Successful actions | Unfinished actions | Unfinished ratio |
|------------------|--------------------|--------------------|------------------|
| Previous version | 11850              | 1051               | 8%               |
| New version      | 17476              | 1037               | 6%               |

When we segment the data, a surprising and encouraging trend emerges: the unfinished ratio improved in both segments, from 29% to 27% for newcomers and from 8% to 6% for experienced users.

This is a relief! Our bug fixes and stability improvements are working. The situation has improved for both our newest and most loyal users. So, how do we reconcile this segment-level improvement with the overall population’s degraded numbers?

Noncollapsibility

There’s a logical explanation for this apparent contradiction. The key insight is that anyone installing the app for the first time gets the newest version.

This creates a significant bias in our data:

Because new users are unfamiliar with the app, they are inherently more likely to abandon an action. And their weight in the data shifts dramatically between releases: newcomers account for about 11% of the actions recorded on the previous version, but about 45% of those on the new one. We should therefore expect the new version, with its disproportionately high share of new users, to have a higher overall rate of unfinished actions, regardless of the app’s quality.
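
We can replay the reversal directly from the tables above: each segment improves, yet pooling the segments degrades the overall numbers, because newcomers carry far more weight in the new version’s data. A quick sketch in Go:

```go
package main

import "fmt"

type counts struct{ successful, unfinished int }

func (c counts) total() int     { return c.successful + c.unfinished }
func (c counts) ratio() float64 { return float64(c.unfinished) / float64(c.total()) }

func main() {
	// Numbers from the tables above.
	prev := map[string]counts{
		"newcomers":   {1131, 466},
		"experienced": {11850, 1051},
	}
	next := map[string]counts{
		"newcomers":   {10946, 3990},
		"experienced": {17476, 1037},
	}

	for _, seg := range []string{"newcomers", "experienced"} {
		fmt.Printf("%-12s %.0f%% -> %.0f%% (improved)\n",
			seg, 100*prev[seg].ratio(), 100*next[seg].ratio())
	}

	// Pool the segments: the per-segment improvements reverse.
	pool := func(m map[string]counts) counts {
		var p counts
		for _, c := range m {
			p.successful += c.successful
			p.unfinished += c.unfinished
		}
		return p
	}
	pPrev, pNext := pool(prev), pool(next)
	fmt.Printf("%-12s %.0f%% -> %.0f%% (degraded!)\n",
		"overall", 100*pPrev.ratio(), 100*pNext.ratio())

	// Share of all actions performed by newcomers, per version.
	fmt.Printf("newcomer share: %.0f%% -> %.0f%%\n",
		100*float64(prev["newcomers"].total())/float64(pPrev.total()),
		100*float64(next["newcomers"].total())/float64(pNext.total()))
}
```

Running it prints the segment ratios (29% → 27% and 8% → 6%), the pooled ratios (10% → 15%), and the newcomer share of actions (11% → 45%).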

This phenomenon is a classic example of Simpson’s paradox, where a trend that appears in several different groups vanishes or reverses when the groups are combined. While it can be tricky to determine the “right” trend, in our case, the segment-level data provides the clear and correct interpretation: our app’s stability has slightly improved, and we can confidently disregard the misleading overall comparison.