Monitoring and Outages

2025-10-01

Service quality monitoring

While we prioritize user privacy by not tracking the content of shared data, we diligently collect metrics to monitor the quality and reliability of our service.

Our two core indicators are:

Total number of Share actions
Number of “failed” Share actions

Observability of the service

An action may fail because of a loss of internet connection at (2) or (4). It may also fail because of a hardware or software failure in any of the components (1), (3), or (5). Critically, a component crash often occurs silently and is invisible to our monitoring backend.

To ensure comprehensive oversight, we employ an “end-to-end” definition of failure:

A failed action is any action that is initiated but not ultimately reported as successful.

This broad definition includes both failures due to technical faults and actions abandoned by users (e.g., a user starts a transfer but never proceeds to scan). This approach deliberately casts a wide net, with the aim of minimizing blind spots in our observability.
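The end-to-end definition can be sketched in a few lines of Python. This is only an illustration: the `Event` type, the event names "start" and "success", and the `action_id` field are assumptions made for the example, not a description of our actual backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    action_id: str  # hypothetical identifier tying events to one Share action
    kind: str       # hypothetical event names: "start" or "success"


def count_actions(events):
    """Return (total, failed) under the end-to-end definition of failure."""
    started = {e.action_id for e in events if e.kind == "start"}
    succeeded = {e.action_id for e in events if e.kind == "success"}
    # Failed = initiated but never reported as successful.
    # No explicit "failure" event is needed, so silent crashes are counted too.
    failed = started - succeeded
    return len(started), len(failed)


events = [
    Event("a1", "start"), Event("a1", "success"),
    Event("a2", "start"),  # crashed silently, lost connection, or abandoned
    Event("a3", "start"), Event("a3", "success"),
]
total, failed = count_actions(events)
print(total, failed)  # prints "3 1"
```

Because failure is inferred from the absence of a success report, the count does not depend on a crashing component being able to report anything.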

Issues detected early

Here is how most problems are detected before they cause too much damage to the production system:

we test the backend, frontend, and mobile components before deployment,
we test the production system after deployment,
we monitor the percentage of failed actions: a surge indicates a serious issue,
we monitor the total number of actions: a sudden drop also signals a severe problem.
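The last two checks can be sketched as follows. The function name, the threshold values, and the idea of comparing against a single baseline count are assumptions chosen for illustration; real thresholds would need tuning, especially since abandoned actions keep the normal failure rate above zero.

```python
def check_alerts(total, failed, baseline_total,
                 max_failure_rate=0.30, max_drop=0.50):
    """Return a list of alert names for one monitoring window.

    All thresholds are hypothetical defaults, not production values.
    """
    alerts = []
    # A surge in the share of failed actions indicates a serious issue.
    if total > 0 and failed / total > max_failure_rate:
        alerts.append("failure-rate surge")
    # A sudden drop in the total number of actions also signals a problem.
    if baseline_total > 0 and total < baseline_total * (1 - max_drop):
        alerts.append("volume drop")
    return alerts


print(check_alerts(total=1000, failed=400, baseline_total=1000))
print(check_alerts(total=300, failed=30, baseline_total=1000))
```

The first call reports a failure-rate surge (40% failed) and the second a volume drop (300 actions against a baseline of 1000).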

September 2025 outages

In September 2025, two separate incidents occurred over several days. An unfortunate sequence of events prevented our monitoring components from detecting the issues, exposing critical monitoring blind spots.

Outages at two observability blind spots

The first incident (A) was caused by a misconfigured domain name in the backend, which resulted in an incorrect payload being delivered to the web frontend when data was shared from an Android device. Because the frontend reports an action as successful as soon as it decrypts the payload, these actions were counted as successes even though the decoded data was useless to the user. The bug manifested at the right end of the diagram, in a blind spot, after success had already been reported.

The second incident (B) was an immediate crash of the mobile app for iPhone users running iOS 26, the latest version. The observed failure rate did not change, because the crash happened before any “action start” signal was registered, at the left end of the diagram, in a blind spot. Furthermore, since the iOS 26 user base was initially small, the total number of actions did not drop enough to trigger an alert.
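The arithmetic behind incident (B) can be illustrated with made-up numbers. The segment names and counts below are purely hypothetical and do not reflect our real traffic. They only show how a complete outage in a small segment barely moves the overall total, while a per-segment view would make it obvious.

```python
def relative_drop(baseline, current):
    """Fraction by which `current` fell below `baseline`."""
    return (baseline - current) / baseline


# Hypothetical daily action counts per platform segment (illustrative only).
baseline = {"android": 6000, "ios-older": 3800, "ios-26": 200}
current = {"android": 6000, "ios-older": 3800, "ios-26": 0}

# Aggregate view: the total barely changes.
overall = relative_drop(sum(baseline.values()), sum(current.values()))

# Per-segment view: the affected segment has lost all of its actions.
per_segment = {name: relative_drop(baseline[name], current[name])
               for name in baseline}

print(f"overall drop: {overall:.0%}")
print(f"ios-26 drop: {per_segment['ios-26']:.0%}")
```

With these numbers the overall drop is 2%, far below any plausible alert threshold, while the affected segment shows a 100% drop.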

User reports

Many users were affected, but only a few reported the problems in the bug tracker, which requires creating a GitHub account.

To make reporting seamless, we are adding a direct email option within the mobile apps. Users will be able to report issues by emailing bartalogsoftware@gmail.com. We hope this low-friction channel will improve our ability to detect production issues quickly, though we strive to make it a rarely needed option.