
Backend Scalability & Resilience

Comprehensive patterns for rate limiting, circuit breakers, bulkhead isolation, retry policies, and request queuing to handle 500+ concurrent users with graceful degradation.

.NET 10 · Polly · Azure Service Bus · 500+ Users

Overview

ReqVise must handle 500+ concurrent users with complex procurement workflows, real-time auctions, and heavy report generation. Without proper resilience patterns, a single service failure can cascade and bring down the entire system.

Key targets:

  • 500+ concurrent users
  • 100 requests/min per-user rate limit
  • 5-failure circuit breaker threshold
  • 30s maximum retry delay

Failure Scenarios

  • External API timeout (Brevo, Bank APIs)
  • Database connection exhaustion
  • Memory pressure under load
  • Cascade failures across services
  • DDoS or abuse attempts

Resilience Patterns

  • Rate Limiting (per user, per tenant)
  • Circuit Breaker (Polly)
  • Bulkhead Isolation
  • Retry with Exponential Backoff
  • Request Queuing (Service Bus)

Rate Limiting

Protect APIs from abuse and ensure fair resource distribution across users and tenants.

Scope | Limit | Window | Action on Exceed
Per User | 100 requests | 1 minute | HTTP 429 Too Many Requests
Per Tenant | 1,000 requests | 1 minute | HTTP 429 + Alert
Per IP (Unauthenticated) | 20 requests | 1 minute | HTTP 429 + Captcha
Reverse Auction Bids | 10 bids | 10 seconds | Throttle with warning
Report Generation | 5 requests | 1 minute | Queue request
// Program.cs - Rate Limiting Configuration (.NET 10)
builder.Services.AddRateLimiter(options =>
{
    // Per-user rate limit (falls back to client IP, then "anonymous", for unauthenticated callers)
    options.AddPolicy("PerUser", context =>
        RateLimitPartition.GetFixedWindowLimiter(
            partitionKey: context.User?.Identity?.Name
                ?? context.Connection.RemoteIpAddress?.ToString()
                ?? "anonymous",
            factory: _ => new FixedWindowRateLimiterOptions
            {
                PermitLimit = 100,
                Window = TimeSpan.FromMinutes(1),
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 10
            }));

    // Per-tenant rate limit
    options.AddPolicy("PerTenant", context =>
        RateLimitPartition.GetFixedWindowLimiter(
            partitionKey: context.User?.FindFirst("tenant_id")?.Value ?? "anonymous",
            factory: _ => new FixedWindowRateLimiterOptions
            {
                PermitLimit = 1000,
                Window = TimeSpan.FromMinutes(1)
            }));

    options.OnRejected = async (context, token) =>
    {
        context.HttpContext.Response.StatusCode = StatusCodes.Status429TooManyRequests;

        // BR-SCALE-002: every 429 response must carry a Retry-After header.
        context.HttpContext.Response.Headers["Retry-After"] =
            context.Lease.TryGetMetadata(MetadataName.RetryAfter, out var retryAfter)
                ? ((int)retryAfter.TotalSeconds).ToString()
                : "60"; // window length as a safe default

        await context.HttpContext.Response.WriteAsync("Too many requests. Please try again later.", token);
    };
});
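The named policies above only take effect once the rate limiter middleware is in the pipeline and each policy is attached to the endpoints it should protect. A minimal wiring sketch (the endpoint paths and handlers here are illustrative, not from the spec):

```csharp
var app = builder.Build();

// The rate limiting middleware must run before endpoint execution.
app.UseRateLimiter();

// Attach named policies to individual endpoints or whole groups.
app.MapGet("/api/requisitions", () => Results.Ok())
   .RequireRateLimiting("PerUser");

app.MapGroup("/api/tenants")
   .RequireRateLimiting("PerTenant");

app.Run();
```

Endpoints without a policy remain unlimited, so every externally reachable route should be covered by at least the per-tenant policy.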

Circuit Breaker Pattern

Prevent cascade failures by stopping requests to failing services and allowing them time to recover.

CLOSED (normal operation) → 5 failures (threshold reached) → OPEN (requests rejected) → break duration elapses → HALF-OPEN (test request) → success → CLOSED (recovery); a failed test request returns the circuit to OPEN.
External Service | Failure Threshold | Break Duration | Fallback
Brevo (Email/SMS) | 5 failures | 30 seconds | Queue for retry
Bank API | 3 failures | 60 seconds | Return cached status
CRM Webhook | 5 failures | 30 seconds | Queue sync-back
Gift Card API | 3 failures | 60 seconds | Pending fulfillment
// Polly Circuit Breaker with HttpClient
builder.Services.AddHttpClient("BrevoClient", client =>
{
    client.BaseAddress = new Uri("https://api.brevo.com");
})
// Handlers added first are outermost: the breaker below wraps the retry,
// so it counts the final outcome after retries are exhausted.
.AddPolicyHandler(Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode) // consider narrowing to 5xx/408 so client errors don't trip the breaker
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (result, duration) =>
        {
            Log.Warning("Circuit OPEN for Brevo. Duration: {Duration}", duration);
        },
        onReset: () =>
        {
            Log.Information("Circuit CLOSED for Brevo. Service recovered.");
        }
    ))
.AddPolicyHandler(Policy<HttpResponseMessage>   // typed policy: AddPolicyHandler requires IAsyncPolicy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, retryAttempt =>
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) // 2s, 4s, 8s
    ));

Business Rule: BR-SCALE-001

When circuit is OPEN, email/SMS notifications MUST be queued to Azure Service Bus for later delivery instead of being dropped.
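One way to implement this rule with Polly is a fallback policy that catches BrokenCircuitException and diverts the message to the queue. A sketch, assuming the EmailQueueService from the Request Queuing section is injectable and the message travels through the Polly Context; the context key and SendViaBrevoAsync helper are illustrative, not from the spec:

```csharp
// BR-SCALE-001: when the circuit is open, queue the email instead of dropping it.
var queueOnOpenCircuit = Policy<HttpResponseMessage>
    .Handle<BrokenCircuitException>()
    .FallbackAsync(
        fallbackAction: async (context, token) =>
        {
            await _emailQueueService.QueueEmailAsync((EmailMessage)context["email"]);
            return new HttpResponseMessage(HttpStatusCode.Accepted);
        },
        onFallbackAsync: (result, context) =>
        {
            Log.Information("Brevo circuit open; email queued for later delivery.");
            return Task.CompletedTask;
        });

// Usage: pass the email through the Context so the fallback can reach it.
await queueOnOpenCircuit.ExecuteAsync(
    (context, token) => SendViaBrevoAsync(email, token),
    new Context { ["email"] = email },
    CancellationToken.None);
```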

Bulkhead Isolation

Isolate different parts of the system so that failure in one area doesn't exhaust resources for others.

Bulkhead | Max Concurrent | Queue Size | Purpose
Report Generation | 10 | 50 | Prevent CPU exhaustion
Excel Import | 5 | 20 | Memory protection
External API Calls | 20 | 100 | Connection limits
SignalR Broadcasts | 50 | 200 | Real-time capacity
// Bulkhead for Report Generation
private static readonly SemaphoreSlim _reportSemaphore = new(10, 10);

public async Task<ReportResult> GenerateReportAsync(ReportRequest request)
{
    if (!await _reportSemaphore.WaitAsync(TimeSpan.FromSeconds(30)))
    {
        throw new ServiceUnavailableException("Report generation queue full. Please try again.");
    }

    try
    {
        return await _reportService.GenerateAsync(request);
    }
    finally
    {
        _reportSemaphore.Release();
    }
}
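The semaphore above caps concurrency but places no bound on how many callers queue behind it. Polly's built-in bulkhead policy enforces both limits from the table; a sketch, with the rejection handling kept illustrative:

```csharp
// Bulkhead per the table: 10 concurrent report generations, at most 50 queued callers.
var reportBulkhead = Policy.BulkheadAsync<ReportResult>(
    maxParallelization: 10,
    maxQueuingActions: 50,
    onBulkheadRejectedAsync: context =>
    {
        Log.Warning("Report bulkhead full; request rejected.");
        return Task.CompletedTask;
    });

var result = await reportBulkhead.ExecuteAsync(() => _reportService.GenerateAsync(request));
```

When the queue is full, ExecuteAsync throws BulkheadRejectedException, which can be mapped to the same ServiceUnavailableException used above.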

Retry with Exponential Backoff

Handle transient failures by retrying with increasing delays to avoid overwhelming recovering services.

Operation | Max Retries | Backoff Strategy | Max Delay
Database Connection | 3 | Exponential (1s, 2s, 4s) | 4 seconds
Email/SMS Send | 5 | Exponential (2s, 4s, 8s, 16s, 30s) | 30 seconds
Webhook Delivery | 5 | Exponential + Jitter | 60 seconds
Redis Cache | 2 | Immediate, then 1s | 1 second
// Retry Policy with Jitter (prevents thundering herd)
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => r.StatusCode == HttpStatusCode.ServiceUnavailable)
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: (retryAttempt, context) =>
        {
            var baseDelay = TimeSpan.FromSeconds(Math.Pow(2, retryAttempt));
            var jitter = TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000));
            var totalDelay = baseDelay + jitter;
            return totalDelay > TimeSpan.FromSeconds(30)
                ? TimeSpan.FromSeconds(30)
                : totalDelay;
        },
        onRetry: (outcome, timespan, retryAttempt, context) =>
        {
            Log.Warning("Retry {Attempt} after {Delay}ms", retryAttempt, timespan.TotalMilliseconds);
        }
    );

Request Queuing

Handle burst traffic by queuing requests for background processing instead of rejecting them.

Queue | Message Type | Processing Rate | Retention
email-queue | Email notifications | 100/batch | 7 days
sms-queue | SMS notifications | 50/batch | 3 days
report-queue | Report generation | 5/minute | 1 day
webhook-queue | CRM sync-back | 20/minute | 7 days
import-queue | Excel imports | 2/minute | 1 day
// Azure Service Bus Producer
public class EmailQueueService
{
    private readonly ServiceBusSender _sender;

    public EmailQueueService(ServiceBusClient client)
    {
        _sender = client.CreateSender("email-queue");
    }

    public async Task QueueEmailAsync(EmailMessage email)
    {
        var message = new ServiceBusMessage(JsonSerializer.Serialize(email))
        {
            MessageId = Guid.NewGuid().ToString(),
            ContentType = "application/json",
            Subject = email.TemplateCode,
            ScheduledEnqueueTime = email.ScheduledTime ?? DateTimeOffset.UtcNow
        };

        await _sender.SendMessageAsync(message);
    }
}

// Azure Service Bus Consumer (Hangfire Worker)
public class EmailQueueProcessor
{
    [AutomaticRetry(Attempts = 5, DelaysInSeconds = new[] { 10, 30, 60, 120, 300 })]
    public async Task ProcessBatchAsync()
    {
        var messages = await _receiver.ReceiveMessagesAsync(maxMessages: 100);

        foreach (var message in messages)
        {
            try
            {
                // Body is BinaryData; deserialize it directly from JSON.
                var email = message.Body.ToObjectFromJson<EmailMessage>();
                await _brevoService.SendAsync(email);
                await _receiver.CompleteMessageAsync(message);
            }
            catch (Exception ex)
            {
                // Abandon so the message is redelivered; Service Bus dead-letters it
                // once the queue's max delivery count is exceeded.
                Log.Error(ex, "Failed to process email message {MessageId}", message.MessageId);
                await _receiver.AbandonMessageAsync(message);
            }
        }
    }
}
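The Hangfire attribute handles retries, but the batch job still has to be scheduled. A sketch of the recurring registration; the job id and per-minute cadence are assumptions chosen to match the table's processing rates:

```csharp
// Run the email batch once a minute; [AutomaticRetry] covers failed runs.
RecurringJob.AddOrUpdate<EmailQueueProcessor>(
    "process-email-queue",
    processor => processor.ProcessBatchAsync(),
    Cron.Minutely());
```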

Business Rules Summary

Rule ID | Category | Description
BR-SCALE-001 | Circuit Breaker | Failed notifications MUST be queued, not dropped
BR-SCALE-002 | Rate Limiting | HTTP 429 response MUST include Retry-After header
BR-SCALE-003 | Bulkhead | Report generation limited to 10 concurrent requests
BR-SCALE-004 | Retry | Maximum retry delay MUST NOT exceed 30 seconds
BR-SCALE-005 | Queue | Messages MUST be idempotent (safe to replay)
BR-SCALE-006 | Monitoring | Circuit breaker state changes MUST trigger alerts
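BR-SCALE-005 can be enforced at two levels: Service Bus duplicate detection drops re-sent messages with the same MessageId, and the consumer skips work it has already completed. A sketch of the consumer-side guard; _processedStore is a hypothetical abstraction over Redis or a database table, not part of the spec:

```csharp
public async Task ProcessAsync(ServiceBusReceivedMessage message)
{
    // Idempotency guard (BR-SCALE-005): skip messages already handled, so a
    // replay after a crash or an Abandon does not send the same email twice.
    if (await _processedStore.ExistsAsync(message.MessageId))
    {
        await _receiver.CompleteMessageAsync(message);
        return;
    }

    var email = message.Body.ToObjectFromJson<EmailMessage>();
    await _brevoService.SendAsync(email);

    // Record the MessageId before completing; if completion fails, the
    // redelivered message is caught by the guard above.
    await _processedStore.MarkProcessedAsync(message.MessageId);
    await _receiver.CompleteMessageAsync(message);
}
```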