Morteza Taghdisi

Architecture & Platform Thinking
January 28, 2026

Developers Do Not Mind Errors. They Mind Unclear Errors.

Series

Mobile SDK Design

4 of 7 in the series

An SDK's error model is part of its API contract. A production-grade SDK makes failures diagnosable without requiring source access or a support escalation.

Tags: sdk, mobile, android, ios, error-handling, observability

Opaque Failure Surface is the failure mode where an SDK throws errors that describe internal state rather than the developer's mistake, the likely cause, and the next action.

It happens when the error model is designed for the people who built the SDK rather than for the people who integrate it. The clearest sign is an SDK that, in response to a real production problem, produces a message like "HTTP 403", "auth_module_failure", or "IllegalStateException: null". The developer knows something failed. They have no information about why, what they did wrong, or how to fix it.

Errors Are Part of the API Contract

An SDK's error model is not a secondary concern. It is part of the API contract.

A payment SDK that returns PAYMENT_ERROR for every failure has not provided a result type. It has provided a coin flip. A network SDK that throws an IOException with no message has not implemented error handling. It has transferred the debugging burden to the integrator.

The cost of opaque errors is not felt by the SDK team. It is felt by every developer who integrates the SDK, every time something goes wrong. Support tickets, Slack escalations, and GitHub issues filled with "how do I debug this error?" are the downstream cost of an SDK that did not design its error surface.

What Opaque Failure Looks Like

Consider a concrete illustration of the failure mode.

The backend of an analytics SDK returns 403 Forbidden during event tracking. The SDK propagates this as:

kotlin
throw RuntimeException("HTTP 403")

The developer checks each possibility in sequence. Is their API key wrong? Did they call the SDK before configuring it? Is the user account restricted? Is there a regional policy blocking the request? Is the event schema invalid? The error provides no information to eliminate any of these. The developer opens a support ticket.

The support team asks for logs. The developer sends logs. The support team asks for the correlation ID. The developer does not know what a correlation ID is or where to find one. The support team asks for debug mode output. The developer did not know debug mode existed.

Forty-eight hours after the first error, the root cause turns out to be an invalid event schema that the server rejected during validation. The SDK could have said that in the original exception.
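A minimal sketch of what "saying that in the original exception" could look like, assuming the server includes a machine-readable reason with its rejection. The ServerRejection shape, the reason strings, and EventRejectedException are hypothetical names used only for illustration.

```kotlin
// Hypothetical shape of the server's rejection payload; not taken from a real API.
data class ServerRejection(val status: Int, val reason: String?, val field: String?)

class EventRejectedException(message: String) : Exception(message)

// Turns a raw rejection into a message that names the cause and the next step.
fun toActionableError(rejection: ServerRejection): EventRejectedException {
    val message = when (rejection.reason) {
        "schema_invalid" ->
            "Event rejected: the payload does not match the expected schema" +
                (rejection.field?.let { " (field '$it')" } ?: "") +
                ". Compare the event properties with the schema you registered."
        "key_expired" ->
            "Event rejected: the API key has expired. Generate a new key and reconfigure the SDK."
        else ->
            // Fall back to the status code only when the server gave no reason.
            "Event rejected with HTTP ${rejection.status} and no further detail from the server."
    }
    return EventRejectedException(message)
}
```

With a mapping like this, the developer in the scenario above would have seen the schema problem in the first stack trace instead of after a support exchange.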

Typed Error Hierarchy

An untyped exception is a dead end for the integrator. A typed error hierarchy is a map.

kotlin
// Bad: caller cannot distinguish failure causes programmatically
throw RuntimeException("HTTP 403: auth_failure")
 
// Good: typed, discriminated, actionable
sealed class SDKException(
    override val message: String,
    override val cause: Throwable? = null
) : Exception(message, cause) {
 
    abstract val code: SDKErrorCode
    abstract val retryable: Boolean
    abstract val correlationId: String
 
    data class AuthorizationFailed(
        override val correlationId: String,
        override val code: SDKErrorCode = SDKErrorCode.AUTHORIZATION_FAILED,
        override val retryable: Boolean = false,
        val hint: String = "Verify your API key is valid and has not expired."
    ) : SDKException("Authorization failed. $hint")
 
    data class ConfigurationMissing(
        override val correlationId: String,
        override val code: SDKErrorCode = SDKErrorCode.NOT_CONFIGURED,
        override val retryable: Boolean = false,
        val hint: String = "Call SDK.configure() before using this API."
    ) : SDKException("SDK not configured. $hint")
 
    data class NetworkUnavailable(
        override val correlationId: String,
        override val code: SDKErrorCode = SDKErrorCode.NETWORK_UNAVAILABLE,
        override val retryable: Boolean = true
    ) : SDKException("Network unavailable. This operation can be retried.")
 
    data class RateLimited(
        override val correlationId: String,
        override val code: SDKErrorCode = SDKErrorCode.RATE_LIMITED,
        override val retryable: Boolean = true,
        val retryAfterSeconds: Int
    ) : SDKException("Rate limit exceeded. Retry after ${retryAfterSeconds}s.")
}

The caller can pattern match over the sealed class, handle each error type appropriately, read the retryable field to decide whether to retry, and include the correlationId in a support request. None of this requires knowing SDK internals.

The Swift equivalent, using a protocol and concrete error types:

swift
protocol SDKError: Error {
    var code: SDKErrorCode { get }
    var retryable: Bool { get }
    var correlationId: String { get }
}
 
struct AuthorizationFailed: SDKError {
    let code: SDKErrorCode = .authorizationFailed
    let retryable: Bool = false
    let correlationId: String
    let hint: String = "Verify your API key is valid and has not expired."
}
 
struct NetworkUnavailable: SDKError {
    let code: SDKErrorCode = .networkUnavailable
    let retryable: Bool = true
    let correlationId: String
}

The pattern is the same on both platforms: errors are types, not strings. Callers can check error.retryable before retrying. They can log error.correlationId for support. They can branch on the specific error type to handle each case.
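Both versions reference an SDKErrorCode type without showing it. The following is a minimal Kotlin sketch of what it might look like. The wireName values and the fromWireName helper are assumptions for illustration, not part of the SDK described here.

```kotlin
// Assumed definition: the article references SDKErrorCode but does not show it.
enum class SDKErrorCode(val wireName: String) {
    AUTHORIZATION_FAILED("authorization_failed"),
    NOT_CONFIGURED("not_configured"),
    NETWORK_UNAVAILABLE("network_unavailable"),
    RATE_LIMITED("rate_limited");

    companion object {
        // Returns null for unknown names so callers must handle codes
        // introduced by newer backend versions explicitly.
        fun fromWireName(name: String): SDKErrorCode? =
            values().firstOrNull { it.wireName == name }
    }
}
```

Keeping a stable string next to each constant gives support tooling and logs an identifier that does not depend on the enum's ordinal or on message text.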

What a Good Error Message Contains

Every SDK error message should answer four questions for the developer reading it.

What went wrong? A concise statement of the failure, not an internal exception class name or a raw HTTP status code.

Why did it likely happen? The most probable cause from the developer's perspective, not the internal implementation detail.

What should the developer do next? The specific action to take to fix or investigate.

Where should they look? The specific method, configuration key, or documentation section relevant to the fix.

kotlin
// Bad: answers none of these questions
throw IllegalStateException("null pointer at BootstrapCoordinator.kt:145")
 
// Good: answers all four
throw SDKException.ConfigurationMissing(
    correlationId = correlationId,
    hint = "Call SDK.configure() with a valid userId before calling requestPayment(). " +
        "requestPayment() requires an authenticated user context from configure()."
)

The second message takes the developer from error to fix without a support ticket. The first sends them into a debugger looking at internals they cannot access.
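One way to keep messages from drifting back toward the first form is to make the four answers required fields. This is a sketch under that assumption; ErrorGuidance and its property names are hypothetical.

```kotlin
// Hypothetical helper: forces every error message to answer the four questions.
data class ErrorGuidance(
    val whatWentWrong: String,
    val likelyCause: String,
    val nextAction: String,
    val whereToLook: String
) {
    init {
        // Fail fast in SDK tests if an error is defined without full guidance.
        require(listOf(whatWentWrong, likelyCause, nextAction, whereToLook).all { it.isNotBlank() }) {
            "Every error must state what failed, why, what to do, and where to look."
        }
    }

    // Renders the four answers in a fixed order as a single message string.
    fun render(): String =
        "$whatWentWrong Likely cause: $likelyCause Next step: $nextAction See: $whereToLook"
}

val notConfigured = ErrorGuidance(
    whatWentWrong = "SDK not configured.",
    likelyCause = "requestPayment() was called before SDK.configure().",
    nextAction = "Call SDK.configure() with a valid userId first.",
    whereToLook = "SDK.configure()"
)
```

The rendered string could then be passed as the hint of an error such as ConfigurationMissing.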

Retryable vs Non-Retryable Errors

Not all errors should be retried. Retrying a non-retryable error wastes time and postpones the fix for the actual problem.

kotlin
when (error) {
    is SDKException.NetworkUnavailable -> {
        // retryable: transient failure, wait and try again
        if (error.retryable) retryWithBackoff { requestPayment(request) }
    }
    is SDKException.RateLimited -> {
        // retryable: wait for the specified interval before retrying
        delay(error.retryAfterSeconds.seconds)
        requestPayment(request)
    }
    is SDKException.AuthorizationFailed -> {
        // not retryable: configuration problem, retrying will produce the same result
        showAuthError(error.hint)
    }
    is SDKException.ConfigurationMissing -> {
        // not retryable: developer error that requires a code change
        logError("SDK misconfiguration detected: ${error.message}")
    }
}

Expose this information through a typed retryable field rather than requiring developers to infer it from the error message text. Message text changes across SDK versions. A typed field is part of the API contract.
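The example above calls retryWithBackoff without showing it. A minimal blocking sketch follows, assuming exponential backoff and a retryable flag on the exception. SketchSdkException is a stand-in for SDKException, and a coroutine-based SDK would use a suspending variant instead.

```kotlin
// Minimal stand-in for the article's SDKException: only the retryable flag matters here.
open class SketchSdkException(message: String, val retryable: Boolean) : Exception(message)

// Retries only while the error declares itself retryable; doubles the wait each attempt.
// `sleep` is injectable so tests can run without real delays.
fun <T> retryWithBackoff(
    maxAttempts: Int = 3,
    initialDelayMs: Long = 200,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: () -> T
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delayMs = initialDelayMs
    for (attempt in 1..maxAttempts) {
        try {
            return block()
        } catch (e: SketchSdkException) {
            // Non-retryable errors and the final attempt propagate unchanged.
            if (!e.retryable || attempt == maxAttempts) throw e
            sleep(delayMs)
            delayMs *= 2
        }
    }
    error("unreachable: the loop always returns or throws")
}
```

The decision to retry reads the typed flag and never inspects the message text, which is the property the section argues for.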

Correlation IDs

Every SDK error that involves a network operation or a server-side decision should carry a correlation ID. This is an opaque string, generated by the SDK or the backend, that uniquely identifies the operation.

When a developer opens a support request and includes the correlation ID, the SDK team can find the server-side trace for that specific operation without requiring the developer to reproduce the failure. Without it, the SDK team asks follow-up questions. The developer answers. Multiple round trips happen before the trace is found.

kotlin
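// Generate one ID per operation and reuse it for the request, the logs, and any error.
val correlationId = UUID.randomUUID().toString()

try {
    val response = backendClient.processPayment(request, correlationId = correlationId)
    return PaymentResult.Success(transactionId = response.transactionId)
} catch (e: BackendException) {
    // Simplified for illustration: a full implementation would map the backend
    // failure to the matching SDKException subtype rather than always reporting
    // a network problem, and would keep the same correlationId on whichever error it throws.
    throw SDKException.NetworkUnavailable(correlationId = correlationId)
}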

The correlation ID should appear in error objects, in debug logs, and in the SDK's integration-health event stream if the SDK exposes one.
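A minimal sketch of carrying one correlation ID through the request, the log lines, and the resulting error. The withCorrelationId helper, the OperationFailed type, and the X-Correlation-Id header name are assumptions for illustration, not an established API.

```kotlin
// Hypothetical error type; the name and shape are illustrative only.
class OperationFailed(val correlationId: String, cause: Throwable) :
    Exception("Operation failed [correlationId=$correlationId]", cause)

// Runs one operation under a single correlation ID that reaches the request
// headers, the log lines, and any error that comes back.
fun <T> withCorrelationId(
    log: (String) -> Unit = { println(it) },
    newId: () -> String = { java.util.UUID.randomUUID().toString() },
    operation: (headers: Map<String, String>) -> T
): T {
    val correlationId = newId()
    log("operation started [correlationId=$correlationId]")
    try {
        // The header name is an assumption; use whatever the backend expects.
        return operation(mapOf("X-Correlation-Id" to correlationId))
    } catch (e: Exception) {
        log("operation failed [correlationId=$correlationId]")
        // Preserve the original failure as the cause instead of discarding it.
        throw OperationFailed(correlationId, e)
    }
}
```

Because the ID is created in one place, the value in the error always matches the value in the logs and on the wire.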

Debug Mode

An SDK should have a debug mode that integrators enable in development builds. Debug mode enables verbose logging and additional diagnostic context in error messages and logs. Sensitive values remain redacted even in debug mode: the goal is more diagnostic detail, not looser data handling.

kotlin
SDK.bootstrap(context) {
    apiKey = BuildConfig.SDK_API_KEY
    debugMode = BuildConfig.DEBUG
}

Debug mode should be disabled by default. Production builds should not emit verbose logs. This protects integrator data, reduces log noise in production monitoring, and avoids leaking SDK internals into production log aggregators that may be accessible to more people than the developer intends.
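A sketch of a logger that follows these rules, assuming the SDK routes all of its logging through a single class. SdkLogger and its sink parameter are hypothetical names.

```kotlin
// Hypothetical SDK logger: verbose output is opt-in, errors are always emitted.
class SdkLogger(
    private val debugMode: Boolean = false, // off unless the integrator enables it
    private val sink: (String) -> Unit = { println(it) }
) {
    // The message is a lambda so diagnostic strings are not even built
    // when debug mode is off.
    fun debug(message: () -> String) {
        if (debugMode) sink("D/SDK: ${message()}")
    }

    // Errors are emitted in every build; callers must pass redacted content only.
    fun error(message: String) = sink("E/SDK: $message")
}
```

Taking the message as a lambda means a production build pays no cost for diagnostics it never prints.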

Redacted Logs and PII Safety

SDK logs must not contain values that identify users or expose credentials, in any build configuration.

kotlin
// Bad: a token fragment appears in what looks like a safe info message
Log.i("SDK", "Configuring with userId=$userId, token=${token.take(8)}...")
 
// Good: logs that the operation was attempted, not what it contained
Log.d("SDK", "configure() called [correlationId=$correlationId]")

Categories to redact in all logs:

  • API keys and tokens, including partial values
  • User IDs, email addresses, phone numbers
  • Device identifiers that can be correlated to individuals
  • Payment card numbers, account numbers, or financial identifiers
  • Any value the SDK's data handling policy classifies as personally identifiable

Use placeholder strings where the presence of a value matters for diagnostics but the value itself does not: userId=[REDACTED], apiKey=[PRESENT]. This confirms to the developer that the value was received without exposing it.
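The placeholder approach can be sketched with two small helpers. The function names and the [ABSENT] marker are assumptions added for illustration; only [REDACTED] and [PRESENT] come from the text above.

```kotlin
// Hypothetical helpers for log-safe placeholders; neither ever returns the raw value.
fun redacted(value: String?): String =
    if (value.isNullOrEmpty()) "[ABSENT]" else "[REDACTED]"

fun presence(value: String?): String =
    if (value.isNullOrEmpty()) "[ABSENT]" else "[PRESENT]"

// Builds a log line that confirms which values arrived without exposing them.
fun configureLogLine(userId: String?, apiKey: String?, correlationId: String): String =
    "configure() called [userId=${redacted(userId)}, apiKey=${presence(apiKey)}, " +
        "correlationId=$correlationId]"
```

A missing API key then shows up as apiKey=[ABSENT], which is often the diagnostic the developer needed in the first place.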

Integrator-Facing Events vs Internal Telemetry

These are two different things. Confusing them produces APIs that expose SDK internals and create breaking changes when the SDK team adjusts their internal monitoring.

Integrator-facing events are the callbacks the host app receives to observe the SDK's behavior from the outside. They are part of the API contract. They should be stable, named for the developer's perspective, and documented.

kotlin
interface SDKListener {
    fun onConfigurationComplete(sessionId: String)
    fun onConfigurationFailed(error: SDKException)
    fun onPaymentCompleted(result: PaymentResult)
    fun onSessionExpired()
}

Internal SDK telemetry is the analytics the SDK team uses to monitor SDK health in production. It is not exposed to integrators. It should not appear in the public API surface.

The mistake is exposing internal monitoring events through the public interface:

kotlin
// Bad: internal event type exposed as public API
interface SDKListener {
    fun onInternalEvent(type: String, metadata: Map<String, Any>)
}

When the team renames an internal event type in the next SDK release, the signature of onInternalEvent stays the same, so nothing fails at compile time. Every host app that matches on the old type string simply stops working. Internal event names should not cross the API boundary.
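A sketch of keeping the two channels apart with Kotlin visibility modifiers. All type names and the telemetry event name are hypothetical, and it assumes the SDK ships as a single module so that internal declarations stay out of the published API.

```kotlin
// Public contract: stable, typed callbacks named from the integrator's point of view.
interface PaymentEventsListener {
    fun onPaymentCompleted(transactionId: String)
    fun onSessionExpired()
}

// Internal telemetry: `internal` keeps it out of the published API, so event
// names can change between releases without breaking host apps.
internal interface TelemetrySink {
    fun record(eventName: String, attributes: Map<String, String>)
}

internal class PaymentFlow(
    private val listener: PaymentEventsListener,
    private val telemetry: TelemetrySink
) {
    fun complete(transactionId: String, elapsedMs: Long) {
        // Internal detail goes to telemetry only.
        telemetry.record("payment_flow_v2_completed", mapOf("elapsedMs" to elapsedMs.toString()))
        // The host app sees only the stable, documented event.
        listener.onPaymentCompleted(transactionId)
    }
}
```

Renaming payment_flow_v2_completed here affects only the SDK team's dashboards, because no host app can observe it.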

Crash Reporter Safety

Host apps that use Firebase Crashlytics, Sentry, or similar tooling automatically capture SDK stack frames, log attachments, breadcrumbs, and custom metadata in crash reports.

If the SDK logs a token, user ID, or other sensitive value anywhere in code that can appear in a crash context, that value will appear in the crash report. Crash reports are often accessible to multiple engineering team members and, in some tooling configurations, are retained for extended periods.

kotlin
// Bad: token appears in a stack frame or log message that crash reporters capture
Log.e("SDK", "Request failed with token: ${token.substring(0, 12)}...")
 
// Good: correlation ID only - identifies the operation without exposing credentials
Log.e("SDK", "Request failed [correlationId=$correlationId, code=${error.code}]")

The check to apply before each log statement: if a crash reporter captured this line along with its local variable context, would any sensitive value appear in the report? If yes, change the log statement.

This check applies to breadcrumbs and custom keys in crash reporting SDKs, not just to direct log calls. Breadcrumbs written before a crash appear in the crash report alongside the stack trace.
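One way to apply the same check to breadcrumbs and custom keys is an allow-list placed in front of whatever crash-reporting API the host uses. This is a sketch; the key names are assumptions, and it deliberately does not call any real crash reporter.

```kotlin
// Hypothetical allow-list: only keys known to be non-sensitive pass through.
val SAFE_BREADCRUMB_KEYS = setOf("correlationId", "code", "sdkVersion", "operation")

// Replaces the value of every key that is not allow-listed with a placeholder,
// so a newly added field is redacted by default rather than leaked by default.
fun sanitizeBreadcrumb(data: Map<String, String>): Map<String, String> =
    data.mapValues { (key, value) -> if (key in SAFE_BREADCRUMB_KEYS) value else "[REDACTED]" }
```

An allow-list fails safe: a field someone adds later stays redacted until it is reviewed and added to the set.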

Support Handoff

When a developer opens a support request, the SDK team needs enough information to find the server-side trace without requiring the developer to reproduce the failure. At the same time, developers should not need to send sensitive data to get support.

What the SDK should make easy to include:

  • Correlation ID from the failing operation
  • SDK version and build variant
  • Platform (Android or iOS) and OS version
  • App bundle ID or package name (not sensitive in most contexts)
  • The specific API call that failed and the error code or error type

What developers should not need to send:

  • API keys or tokens
  • User IDs or personal data
  • Full request or response bodies from production traffic
  • Internal SDK logs from production builds

The SDK documentation should include a support template that makes this clear. A developer who follows the template gets support faster. The SDK team gets what they need without a round-trip asking for the correlation ID they should have asked for first.
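The support template could also be generated by the SDK itself. A minimal sketch follows, assuming the SDK can gather these values at the point of failure; SupportReport and its field names are hypothetical.

```kotlin
// Hypothetical support payload: every field is safe to paste into a ticket.
data class SupportReport(
    val correlationId: String,
    val sdkVersion: String,
    val platform: String,
    val osVersion: String,
    val appId: String,
    val failedCall: String,
    val errorCode: String
) {
    // Renders one labeled line per field, in the order a support engineer needs them.
    fun toTemplate(): String = listOf(
        "Correlation ID: $correlationId",
        "SDK version: $sdkVersion",
        "Platform: $platform $osVersion",
        "App ID: $appId",
        "Failed call: $failedCall",
        "Error code: $errorCode"
    ).joinToString("\n")
}
```

The type has no field for keys, tokens, user IDs, or request bodies, so a developer who pastes the output cannot send them by accident.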

A Debuggability Checklist

Before shipping a new SDK version, validate the error surface against these questions:

  • Are all error cases strongly typed? Can callers branch on error type without parsing strings?
  • Does every error message state what went wrong, why it likely happened, what to do next, and where to look?
  • Does every error carry a retryable field?
  • Do network-bound operations carry correlation IDs through to every error they produce?
  • Is debug mode disabled by default and documented?
  • Do production logs redact tokens, user IDs, and personally identifiable information?
  • Are placeholder strings used where value presence matters but value content does not?
  • Is the integrator-facing event interface stable and named for the developer's perspective?
  • Are internal SDK telemetry events kept off the public API surface?
  • Has the SDK been tested with a crash reporter to verify no sensitive values appear in crash metadata or breadcrumbs?
  • Does the README include a support template showing how to collect a correlation ID and SDK version without sending credentials?

The next article in the series covers the distribution side of SDK trust: how packaging, versioning, binary size, dependency management, and release quality determine whether a host-app tech lead approves the SDK for production use.