Outboxes

Outboxes are database backed deferred units of work that drive a large portion of our system's eventual consistency workflows. They were designed with a couple of key features in mind:

  • Durability: Outboxes are stored as database rows that are typically committed transactionally alongside associated model updates. They are retained until the processing code signals that work has been completed successfully.
  • Retriability: If an outbox processor fails for whatever reason, it will be reprocessed indefinitely until it succeeds.
    • Note: We do not support deadlettering, which means a head of line blocking outbox will continue to backlog work until the processing code is fixed.

An outbox consists of the following parts:

shard_scope

: The operational group the outbox belongs to. This tends to be aligned with the models or domains the outbox applies to.

shard_identifier

: The shard, or grouping Identifier.

object_identifier

: The ID of the impacted model object (if applicable).

Note: Not every outbox has an explicit model target, so this can be set to an arbitrary unique value if necessary. Just be aware that this identifier is used in tandem with the shard_identifier for coalescing purposes.

category

: The specific operation the outbox maps to, for example USER_UPDATE and PROVISION_ORGANIZATION .

payload

: An arbitrary JSON blob of additional data required by the outbox processing code.

scheduled_for

: The datetime when the outbox should be run next

scheduled_from

: The datetime of when the scheduled_for date was last set.

Outbox shards are groups of outboxes representing an interdependent chunk of work that must be processed in order. An outbox’s shard is determined via the combination of its shard_scope and shard_identifier columns, and selecting appropriate values for both is essential.

ControlOutbox instances have 1 additional sharding column to consider: region , which ensures that each region processes its own shard of work independent of the other.

Some pragmatic examples:

  • Organization Scoped outboxes use the organization’s ID as the shard identifier, meaning Organization, Project, Org Auth Token Updates (and more!) for a single organization are all processed in order. If any of these different outboxes get stuck, the entire shard will begin to backlog, so be mindful of failure points.
  • Audit log event outboxes live in their own scope and use distinct shard identifiers for every outbox to ensure they are processed individually without the threat of being blocked by other outboxes.

When creating a new outbox message, you can either attempt to immediately process it synchronously, or defer the work for later. Deciding which approach to take depends entirely on the use case and the source of the outbox's creation.

Some quick heuristics for deciding:

  1. Was the outbox created by an API request that needs to report the status of the operation to the requestor? (process it synchronously)
  2. Was the outbox a result of a task, automatic process, or webhook request? (process it asynchronously)

Thankfully, choosing the desired behavior is simple:

Copied
from django.db import router, transaction
from sentry.models.outbox import outbox_context

# Synchronous outbox processing
with outbox_context(transaction.atomic(router.db_for_write(<model_name_here>)):
	RegionOutbox(...).save()

# Asynchronous outbox processing
with outbox_context(flush=False):
	RegionOutbox(...).save()

Both examples require the usage of the outbox_context context manager, but the key difference is in the arguments supplied.

Supplying a transaction to the outbox_context signals our the intent to immediately process the outbox after the provided transaction has been committed. This is handled via Django’s on_commit handlers which are automatically generated by the context manager.

The context manager will attempt to flush all outboxes generated within the context manager, unless their creation operations are wrapped in a nested asynchronous context manager:

Copied
with outbox_context(transaction.atomic(router_db_for_write(<model_type>):
	sync_outbox = RegionOutbox(...).save()

	with outbox_context(flush=False):
		async_outbox = RegionOutbox(...).save()

Because processing occurs after the transaction has been committed, any outboxes that cannot be processed are treated as asynchronous outboxes after their initial flush attempts.

Supplying the outbox_context with flush=False instead of a transaction skips generating the on_commit handlers entirely, meaning any outboxes created within the context manager will not be processed in the current thread. Instead, they will be picked up by a future periodic celery task that queries all outstanding outbox shards and attempts to process them.

Outboxes should live in the same silo and database as the models or processes that produce them.

For example, Organization model changes generate OrganizationUpdate outboxes. Because the Organization is a region-silo model that lives in the sentry database, the OrganizationUpdate outbox is a RegionOutbox that also lives in the sentry database.

Having both models aligned to the same database ensures that both the model change and the outbox message creation can be committed in the same database transaction for consistency.

Outboxes are coalesced in order to prevent backpressure from toppling our systems after a blocked outbox shard is cleared. We accomplish this by assuming the last message in a coalescing group is the source of truth, ignoring any preceding messages in the group.

An outbox’s coalescing group is determined by the combination of its sharding, category and object_identifier columns.

This coalescing strategy means that any outbox payloads that are stateful and order dependent in the same coalescing group will result in data loss when the group is processed. If you want to bypass coalescing entirely, you can set an arbitrary unique object identifier to ensure messages are run individually and in order; however, this can cause severe bottlenecks so be cautious when doing so.

Outboxes processors are implemented as Django signal receivers, with the sender of the receiver set to the outbox category. Here’s an example of a receiver that handles project update outboxes:

Copied
from sentry.models.outbox import OutboxCategory, process_region_outbox

@receiver(process_region_outbox, sender=OutboxCategory.PROJECT_UPDATE)
def process_project_updates(object_identifier: int, shard_identifier: int, payload: Any, **kwargs):
    pass

Each receiver is passed the following outbox properties as arguments:

  • object_identifier
  • shard_identifier
  • payload
  • region_name [ControlOutbox only]

If the receiver raises an exception, the shard’s processing will be halted and the entire shard will be scheduled for future processing. If the receiver returns gracefully, the outbox’s coalescing group will be deleted and the next outbox message in the shard will be processed.

Because any outbox message can be retried multiple times in these exceptional cases, it’s crucial to make these processing receivers idempotent.

  1. Add a category enum
  2. Choose an outbox scope
  3. Choose a shard identifier and object identifier
  4. Create a receiver to process the new outbox type
  5. Update your code to save new outboxes within an outbox context

If an outbox processor issues an RPC request to another silo, which in turn generates a synchronously processed outbox, this can result in a deadlock occurring if any of the outboxes in the newly generated outbox’s shard also generates or processes an outbox targeting the originating silo.

This is fairly rare situation, but one we’ve encountered in production before. Make sure to choose your outbox scopes and synchronous outbox processing use-cases carefully, minimizing cross-silo outbox interdependency whenever possible.

There’s no guarantee that an outbox will be processed in a timely manner. This can cause data between our silos to drift, so consider this when writing code that depends on outboxes to sync data.

  • The operation requires retriability and persistence.
  • The operation can be deferred and made eventually consistent.
  • A higher latency than a typical API or RPC request (<10s) is acceptable.

  • The work can take more than 10 seconds to process. This is a current limitation due to the way we acquire locks on outbox shards when processing them.
  • Data is ordered and has low cardinality, meaning it cannot be sharded efficiently.
  • Dependent code requires strong consistency guarantees cross-silo.
Help improve this content
Our documentation is open source and available on GitHub. Your contributions are welcome, whether fixing a typo (drat!) or suggesting an update ("yeah, this would be better").