11. Post-Production Monitoring & Debugging
| What | Watching error and crash dashboards after every deploy, and the procedure for debugging live issues. |
| Owner | Each platform's engineer owns their platform; the Manager also checks dashboards daily. |
| Triggers | After every production deployment, and whenever an issue is reported. |
Summary
Shipping is not the end. After every production deploy, the team actively watches its error and crash reporting tools (Crashlytics for Android, New Relic for the backend) to catch new or rising issues early. The watch is frequent right after a deploy, then tapers as the release proves stable. Each platform's engineer owns monitoring for that platform, and the Manager also reviews the dashboards at least once a day. When an issue is reported, there is a defined debugging order: start with the monitoring tools, fall back to the application logs, and feed any gaps found back into the next build.
Full detail
Tools by platform
| Platform | Tool | Watches for |
|---|---|---|
| Android | Crashlytics | Crashes and non-fatal errors. |
| Backend | New Relic | Errors, exceptions, performance regressions. |
The monitoring schedule
After every production deploy:
- First 3 days (minimum): watch the relevant dashboard about 3 times a day, a few minutes each time, looking for new issues or a rise in existing ones.
- After the release is stable: taper to once a day per platform.
Taper trigger
The default is 3 checks a day for at least the first 3 days, then once a day once the release is stable (no new issues and no rise in existing ones across that window). An engineer may hold the higher frequency longer if something looks unsettled. Once a day is a floor on attention, not a reason to stop looking.
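The taper rule can be written down as a small function. This is a minimal sketch, not part of the SOP: the names `checks_per_day`, `MIN_INTENSIVE_DAYS`, `INTENSIVE_CHECKS`, and `STEADY_CHECKS` are hypothetical, and it assumes the deploy day counts as day 0, so days 0 to 2 make up "the first 3 days".

```python
from datetime import date

MIN_INTENSIVE_DAYS = 3  # "first 3 days (minimum)"
INTENSIVE_CHECKS = 3    # about 3 checks a day right after a deploy
STEADY_CHECKS = 1       # once a day per platform after the taper


def checks_per_day(deploy_date: date, today: date, release_stable: bool) -> int:
    """Return the minimum number of dashboard checks the schedule asks for today."""
    # Assumption: the deploy day itself is day 0.
    days_since_deploy = (today - deploy_date).days
    if days_since_deploy < MIN_INTENSIVE_DAYS:
        return INTENSIVE_CHECKS
    # Past the minimum window: taper only once the release is stable.
    return STEADY_CHECKS if release_stable else INTENSIVE_CHECKS
```

The return value is a minimum. An engineer who sees something unsettled can still check more often than the function suggests.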
Ownership
- Platform engineer: owns monitoring for their platform. This is a standing responsibility, treated like any other SOP.
- Manager: ensures the monitoring SOP is followed, as with all SOPs, and additionally MUST review the dashboards themselves at least once a day. This is an extra duty beyond general SOP oversight.
Debugging a reported issue
When an issue is reported, follow this order rather than jumping straight into code:
- Check the monitoring tool first. Look for a spike in errors or crashes on the relevant platform (Crashlytics or New Relic). A spike usually points straight at the cause.
- If nothing shows there, suspect a business-logic issue rather than a runtime error (no error was thrown, but the behavior is wrong). Open the application logs in the Google Cloud Console (ingested via RabbitMQ, with entries from both the app and the server) and trace the user journey for the reported scenario.
- If the logs or user journey do not cover that area, the gap itself is a finding: add the missing logging in the current build so the area is observable next time.
- If the investigation reveals a real error or handled exception that was never reported to Crashlytics or New Relic, close that gap in the fix release: make sure that error path is reported to the monitoring tool from then on.
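The order above can be sketched as a pair of decision functions. This is only an illustration of the sequence, assuming the two questions can be answered as plain booleans; `NextStep`, `next_step`, and `needs_reporting_fix` are hypothetical names, not part of any tool.

```python
from enum import Enum


class NextStep(Enum):
    INVESTIGATE_REPORT = "investigate from the error/crash report"
    TRACE_JOURNEY = "trace the user journey in the application logs"
    ADD_LOGGING = "add the missing logging"


def next_step(spike_in_monitoring: bool, journey_covered_in_logs: bool) -> NextStep:
    """Pick the next debugging step, checking the monitoring tool first."""
    if spike_in_monitoring:
        return NextStep.INVESTIGATE_REPORT
    # No spike: treat it as a business-logic issue and go to the logs.
    if journey_covered_in_logs:
        return NextStep.TRACE_JOURNEY
    # The logs do not cover this area, so the gap itself is the finding.
    return NextStep.ADD_LOGGING


def needs_reporting_fix(found_error: bool, reported_to_monitoring: bool) -> bool:
    """True when a real error or handled exception never reached the monitoring tool."""
    return found_error and not reported_to_monitoring
```

Keeping the second question separate mirrors the text: whether an error path needs to be reported is decided after the investigation, regardless of which branch found it.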
```mermaid
flowchart TD
    A[Issue reported] --> B{Spike in Crashlytics / New Relic?}
    B -->|yes| C[Investigate from the error/crash report]
    B -->|no| D[Suspect business-logic issue]
    D --> E[Check app logs in Google Cloud Console - RabbitMQ ingested]
    E --> F{User journey covered in logs?}
    F -->|no| G[Add missing logging in current build]
    F -->|yes| H[Trace the journey to the root cause]
    C --> I{Was the error reported to monitoring tool?}
    H --> I
    G --> I
    I -->|no, found unreported error/handled exception| J[Cover it in the fix release for future-proofing]
    I -->|yes| K[Proceed to fix via normal or hotfix flow]
    J --> K
```
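The last branch of the flow, an error that was handled but never reported, might look like this once fixed. This is a hedged sketch: `parse_quantity` is a made-up example function, and `report_error` is a hypothetical stand-in for whatever call the real monitoring SDK or agent provides. It is not the Crashlytics or New Relic API.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("app")  # hypothetical logger name


def parse_quantity(raw: str, report_error: Callable[[Exception], None]) -> Optional[float]:
    """Parse user input; recover from bad values without hiding the error."""
    try:
        return float(raw)
    except ValueError as exc:
        # The exception is handled, so nothing crashes for the user...
        logger.warning("could not parse quantity %r", raw)
        # ...but it is still forwarded to the monitoring tool (stand-in callable).
        report_error(exc)
        return None
```

Passing the reporter in as a parameter keeps the sketch independent of any particular SDK. In a real codebase the call would go through the platform's monitoring client.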
Why this matters
This procedure guards against two failure modes: a runtime crash that the team never notices because nobody watched the dashboard, and a silent business-logic bug that throws no error at all. The first is caught by the monitoring schedule; the second is caught by the logs-and-journey fallback. Anything that slipped past monitoring becomes a logging or reporting improvement so it cannot hide twice.
Example
After deploying v1.4.0, the backend engineer checks New Relic three times a day. On day two a client reports that some CSV exports are empty. New Relic shows no error spike, so the engineer suspects business logic and opens the Google Cloud Console logs. The user-journey logs show the export query ran but returned zero rows for a specific account type, with no exception thrown. The fix release both corrects the query and adds an explicit warning log for the empty-result case, so the gap is observable next time.
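The warning log from this example could be as small as the sketch below. It assumes a Python backend purely for illustration; `export_rows`, the `fetch_rows` callable, the logger name, and the log fields are all hypothetical, not taken from the real service.

```python
import logging
from typing import Callable

logger = logging.getLogger("export")  # hypothetical logger name


def export_rows(account_id: str, account_type: str,
                fetch_rows: Callable[[str], list]) -> list:
    """Run the export query and leave a trace when it comes back empty."""
    rows = fetch_rows(account_id)
    if not rows:
        # No exception is thrown here, so the monitoring tools stay silent;
        # the warning makes the empty result visible in the application logs.
        logger.warning(
            "CSV export returned zero rows: account_id=%s account_type=%s",
            account_id,
            account_type,
        )
    return rows
```

Including the account type in the message is a deliberate choice: in the scenario above, that field is what would have narrowed the problem to one group of accounts.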
Related
- Previous phase: Tagging & Deployment
- Related: Production Support, Hotfix, Escaped Defects Analysis
- Runbook template