How do I detect an OAuth refresh race condition in logs?

Look for: (1) clusters of invalid_grant errors at the same millisecond or within a 200ms window from different request IDs or goroutine IDs — that timestamp cluster is the clearest signal; (2) one successful token refresh immediately followed by several invalid_grant failures for the same user or client_id; (3) refresh_token_reused errors (Okta/Auth0/Salesforce) — this is the reuse-detection variant that may also revoke the whole token family; (4) a corrupted or truncated token file on disk if multiple processes share one credential file. The key diagnostic: the failures are transient (retry eventually works) and correlate with bursts of parallel requests.

Does the race condition only happen with rotating refresh tokens?

The catastrophic form (invalid_grant on every loser) requires refresh-token rotation — the provider revokes the token the moment it issues a new one. With non-rotating tokens (some older providers) concurrent refreshes may produce redundant access tokens but not errors. However, even without rotation you can still get a token-file corruption race if multiple processes write the credential file concurrently without atomic persistence. Rotation just makes the problem visibly explosive: you go from 'possibly wasting a refresh' to 'definitely failing N-1 callers'.

Can a retry loop hide or worsen the race condition?

A naive retry on invalid_grant can mask the problem in low-concurrency scenarios because the retry fires after a small delay, by which time another process has already written a fresh token that the retry can read. But under high concurrency, retries amplify the problem: each retry is itself a caller, adding to the swarm of concurrent refresh requests. The correct fix is not to retry unconditionally but to re-read the stored token before retrying — if another caller already refreshed, you get the fresh token without hitting the endpoint at all.

OAuth token refresh race condition — detect, diagnose, and fix it

Q: What is an OAuth token refresh race condition?

A token refresh race condition occurs when multiple concurrent callers each independently notice that an access token has expired and each fires its own request to the token endpoint using the same refresh token. With providers that issue single-use, rotating refresh tokens, only one refresh succeeds: that caller gets a new access token and the provider invalidates the old refresh token. Every other in-flight caller then presents a now-revoked refresh token and receives invalid_grant. The symptom is intermittent auth failures under load that don't reproduce locally with a single caller.

Q: What is single-flight for OAuth token refresh?

Single-flight is an in-process coordination pattern where the first caller to detect an expired token starts the refresh and stores the in-flight Promise (or Future/Channel); every subsequent caller within the same process awaits that same Promise rather than starting a new refresh. When the one refresh completes, all waiters receive the result. The Promise is cleared in a finally block so the next expiry can refresh again. This guarantees exactly one refresh per expiry event within a single process, regardless of how many concurrent callers exist.
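A minimal sketch of that pattern, assuming a hypothetical TokenSet shape and a caller-supplied refresh function (neither name comes from a real library). It is meant to illustrate the shared-Promise idea, not to serve as a complete token manager:

```typescript
// Hypothetical token shape; the field names are assumptions for illustration.
interface TokenSet {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // epoch milliseconds
}

type RefreshFn = (current: TokenSet) => Promise<TokenSet>;

class SingleFlightTokens {
  private current: TokenSet;
  private readonly refresh: RefreshFn;
  private inFlight: Promise<TokenSet> | null = null;

  constructor(initial: TokenSet, refresh: RefreshFn) {
    this.current = initial;
    this.refresh = refresh;
  }

  async getValidToken(now: number = Date.now()): Promise<string> {
    if (now < this.current.expiresAt) return this.current.accessToken;

    if (this.inFlight === null) {
      // The first caller to see the expiry starts the refresh and publishes the Promise.
      // There is no await between the check and the assignment, so no second caller
      // can slip in between them on a single event loop.
      this.inFlight = this.refresh(this.current)
        .then((next) => {
          this.current = next;
          return next;
        })
        .finally(() => {
          // Cleared on success or failure so the next expiry can refresh again.
          this.inFlight = null;
        });
    }
    // Every caller, including the first, awaits the same Promise.
    return (await this.inFlight).accessToken;
  }
}
```

If the refresh rejects, every waiter sees the same rejection, and the next call starts a fresh attempt because the slot has been cleared.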

Q: When do I need a cross-process lock instead of single-flight?

Single-flight only coalesces concurrent callers within one process and its event loop. If multiple OS processes — two CLI instances, several workers in a process pool, containers, or background agents — all read the same credential file, they each have their own in-memory single-flight state. Two of them can still race. In that case you need an inter-process lock: an exclusive file lock (O_CREAT|O_EXCL, or flock) around the read-refresh-write cycle, plus a re-read of the token after acquiring the lock so the second process can short-circuit if the first already rotated. Use single-flight for the in-process case; add a cross-process lock when credentials are shared on disk.
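The lock, re-read, and atomic write can be sketched with Node's built-in fs module. The "wx" open flag corresponds to O_CREAT|O_EXCL. The credential file layout, the `.lock` suffix, and the function names below are assumptions for illustration, and the sketch does not handle stale locks left behind by a crashed process:

```typescript
import { open, readFile, writeFile, rename, unlink } from "node:fs/promises";

// Hypothetical on-disk credential layout.
interface Creds {
  access_token: string;
  refresh_token: string;
  expires_at: number; // epoch milliseconds
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run fn while holding an exclusive lock file. "wx" fails with EEXIST if the file exists.
async function withFileLock<T>(lockPath: string, fn: () => Promise<T>, retries = 50): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      const handle = await open(lockPath, "wx");
      await handle.close();
      break;
    } catch (err) {
      const code = (err as NodeJS.ErrnoException).code;
      if (code !== "EEXIST" || attempt >= retries) throw err;
      await sleep(20 + Math.random() * 30); // jittered wait before trying again
    }
  }
  try {
    return await fn();
  } finally {
    await unlink(lockPath).catch(() => {}); // release the lock
  }
}

async function readCreds(path: string): Promise<Creds> {
  return JSON.parse(await readFile(path, "utf8")) as Creds;
}

// Temp file + rename: readers see the old file or the new one, never a partial write.
async function writeCredsAtomic(path: string, creds: Creds): Promise<void> {
  const tmp = `${path}.${process.pid}.tmp`;
  await writeFile(tmp, JSON.stringify(creds), { mode: 0o600 });
  await rename(tmp, path);
}

async function refreshShared(
  credsPath: string,
  refresh: (refreshToken: string) => Promise<Creds>,
  now: () => number = Date.now,
): Promise<Creds> {
  return withFileLock(`${credsPath}.lock`, async () => {
    const latest = await readCreds(credsPath); // re-read after acquiring the lock
    if (latest.expires_at > now()) return latest; // a sibling already rotated: short-circuit
    const next = await refresh(latest.refresh_token);
    await writeCredsAtomic(credsPath, next);
    return next;
  });
}
```

A production version would likely add a lock timeout or an owner/PID check so that a crashed holder cannot block every other process indefinitely.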

TL;DR

The race: N callers each check the token, each sees "expired", and each POSTs the same refresh_token to the provider. The provider processes caller A first: it issues a new access token and rotates (revokes) R0. Callers B–N arrive with R0, which is now revoked → 400 invalid_grant.

Log signal: a tight cluster of invalid_grant errors (same user, milliseconds apart), one successful refresh, all others failed — OR refresh_token_reused on Okta/Auth0/Salesforce.

Fix: single-flight in-process (one shared Promise) + cross-process lock if credential is shared on disk + atomic writes + rotation-merge. See the full guide for code.

The race, step by step

OAuth 2.0 refresh-token rotation — now the default at Okta, Auth0, Microsoft, Salesforce, and recommended by the Security BCP for public clients — makes each refresh token single-use. The moment the provider processes a successful refresh it invalidates the token just consumed and issues a new one. This is the security property that makes concurrent refresh dangerous:

t0  access token is expired (or within the skew window)
t1  caller A reads creds, sees "expired", POSTs refresh_token=R0
t2  caller B reads creds, sees "expired", POSTs refresh_token=R0   // same token!
t3  provider processes A → issues access_token A1, rotates R0→R1, REVOKES R0
t4  provider processes B → R0 is revoked → 400 invalid_grant
                                           (on Okta/Auth0/Salesforce: refresh_token_reused
                                           → entire token family revoked; user logged out everywhere)

Both callers followed the OAuth spec. The defect is that they did it concurrently with a single-use token. The symptoms: random invalid_grant errors, surprise re-login prompts, "works locally, fails under load."
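The timeline can be reproduced locally with a toy provider. MockProvider below is an in-memory stand-in that only models rotation (each refresh token works once); it is not any real provider's behavior, and demoRace is a hypothetical helper name:

```typescript
// A toy in-memory provider with rotation: each refresh token works exactly once.
class MockProvider {
  private valid = new Set<string>(["R0"]);
  private counter = 0;

  async refresh(refreshToken: string): Promise<{ access_token: string; refresh_token: string }> {
    await new Promise<void>((resolve) => setTimeout(resolve, 5)); // simulated network latency
    if (!this.valid.has(refreshToken)) {
      throw new Error("invalid_grant");
    }
    this.valid.delete(refreshToken); // rotation: revoke the token just consumed
    this.counter += 1;
    const next = `R${this.counter}`;
    this.valid.add(next);
    return { access_token: `A${this.counter}`, refresh_token: next };
  }
}

// Fire `callers` concurrent refreshes that all hold the same stale token R0.
async function demoRace(callers: number): Promise<{ ok: number; invalidGrant: number }> {
  const provider = new MockProvider();
  const outcomes = await Promise.all(
    Array.from({ length: callers }, () =>
      provider.refresh("R0").then(
        () => true,
        () => false,
      ),
    ),
  );
  const ok = outcomes.filter((succeeded) => succeeded).length;
  return { ok, invalidGrant: outcomes.length - ok };
}
```

With five callers, demoRace(5) resolves to { ok: 1, invalidGrant: 4 }: one winner and N-1 losers, which is the same shape the logs show under load.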

How to detect it in logs

The race produces distinctive log patterns:

Timestamp cluster — multiple invalid_grant or 401 Unauthorized entries within the same 200ms window for the same client_id or user. One success, many failures at the same instant.
refresh_token_reused — on Okta, Auth0, or Salesforce, this error code (instead of or alongside invalid_grant) means reuse detection fired. A single concurrent race can trigger this and revoke the entire token family.
Transient failures that self-resolve — the failures go away on retry (because another process already wrote a fresh token), which masks the bug until traffic increases.
Corrupted or zero-byte token file — if multiple processes write to the same credential file without atomic persistence (temp file + rename), you may see truncated JSON or a lost refresh_token field, producing a different failure mode.

Don't confuse with legitimate invalid_grant: a user revoking access, changing their password, or a token idle-expiring also produces invalid_grant — but that one is permanent (retry doesn't help) and not correlated with concurrent requests. The race version is transient and correlated with bursts.
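The timestamp-cluster check can be automated once log lines are parsed into records. The sketch below assumes a hypothetical normalized event shape (timestampMs, clientId, outcome); real log formats will need their own parsing, and the 200ms window and failure threshold are tunable parameters rather than fixed rules:

```typescript
// Hypothetical normalized log record; map your own log fields onto it.
interface AuthLogEvent {
  timestampMs: number;
  clientId: string;
  outcome: "refresh_ok" | "invalid_grant" | "refresh_token_reused";
}

interface SuspectedRace {
  clientId: string;
  successAtMs: number;
  failures: number;
}

function findRefreshRaces(
  events: AuthLogEvent[],
  windowMs = 200,
  minFailures = 1,
): SuspectedRace[] {
  // Group events per client so unrelated clients never form a cluster.
  const byClient = new Map<string, AuthLogEvent[]>();
  for (const e of events) {
    const list = byClient.get(e.clientId) ?? [];
    list.push(e);
    byClient.set(e.clientId, list);
  }

  const races: SuspectedRace[] = [];
  byClient.forEach((list, clientId) => {
    const sorted = [...list].sort((a, b) => a.timestampMs - b.timestampMs);
    sorted.forEach((success, i) => {
      if (success.outcome !== "refresh_ok") return;
      // Count failures logged within windowMs after this successful refresh.
      const failures = sorted
        .slice(i + 1)
        .filter((e) => e.outcome !== "refresh_ok" && e.timestampMs - success.timestampMs <= windowMs)
        .length;
      if (failures >= minFailures) {
        races.push({ clientId, successAtMs: success.timestampMs, failures });
      }
    });
  });
  return races;
}
```

This only catches failures logged after the winning refresh. If clocks are skewed or the success is logged late, widening the window or also looking a few milliseconds before the success may be necessary. An isolated invalid_grant with no nearby success is left alone, which matches the "permanent, not correlated with bursts" case above.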

The fix — three layers

Apply all layers that match your deployment topology:

In-process single-flight — one shared in-flight Promise; every concurrent caller awaits it. Covers the common case: one server, one worker, one process. See the single-flight code on the main guide.
Cross-process lock — if several CLIs, workers, or agents share one credential file: O_CREAT|O_EXCL lock file around the refresh; re-read after acquiring the lock and short-circuit if a sibling already rotated; write atomically (temp file + rename). See cross-process code.
Rotation-merge on persist — if the provider omits refresh_token in the response (Google does this), keep the old one. If it returns a new one (Okta/Auth0/Microsoft), save it. See rotation-merge code.
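The rotation-merge rule in the last layer is small enough to show inline. The response fields (access_token, expires_in, optional refresh_token) follow the standard OAuth 2.0 token response; the StoredCreds shape and the mergeRotation name are assumptions for illustration:

```typescript
// Hypothetical stored shape; expires_at is derived, not part of the provider response.
interface StoredCreds {
  access_token: string;
  refresh_token: string;
  expires_at: number; // epoch milliseconds
}

// Subset of the standard OAuth 2.0 token response.
interface TokenResponse {
  access_token: string;
  expires_in: number; // seconds
  refresh_token?: string; // omitted by providers that do not rotate on this response
}

function mergeRotation(
  prev: StoredCreds,
  res: TokenResponse,
  nowMs: number = Date.now(),
): StoredCreds {
  return {
    access_token: res.access_token,
    // Adopt the new refresh token when one is returned; otherwise keep the old one.
    // `||` also treats an empty string as "missing" so a blank value is never stored.
    refresh_token: res.refresh_token || prev.refresh_token,
    expires_at: nowMs + res.expires_in * 1000,
  };
}
```

The point of the merge is that a naive "overwrite the file with the response" loses the refresh token whenever the provider omits it, which then surfaces later as an unrelated-looking auth failure.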

One library that packages two of the three layers

If you'd rather not re-derive the pattern from scratch, refresh-guard is a zero-dependency MIT library that ships in-process single-flight, correct rotation-merge, and atomic file persistence as a single installable primitive. It does not bundle a cross-process lock (that depends on your lock backend — file, Redis, DB); the guide covers that layer separately.

Drop-in for the in-process case

npm i refresh-guard

import { createTokenManager, fileStore } from "refresh-guard";

const tokens = createTokenManager({
  provider: "okta",   // picks the right quirks (rotation + reuse detection)
  store: fileStore("~/.myapp/creds.json"),
  refresh: async (prev) => fetchNewToken(prev.refresh_token)
});

const access = await tokens.getValidToken();  // one refresh per expiry within this process, however many callers await it

Disclosure: refresh-guard is by the same team that wrote this guide. It solves the in-process race. For the multi-process case, combine with the file-lock pattern. Source on GitHub · npm i refresh-guard.

FAQ

What is an OAuth token refresh race condition?

It happens when N concurrent callers each detect that the access token is expired and each POSTs the same refresh token to the provider. Only the first refresh succeeds; the provider revokes the refresh token it just processed. Every other caller then presents a revoked token and gets invalid_grant. The bug can appear wherever refreshes run concurrently: a page loading several API calls in parallel, a worker pool, or multiple CLIs sharing one credential file.

How do I detect a token refresh race condition in logs?

Look for a tight timestamp cluster: multiple invalid_grant or refresh_token_reused errors from the same client_id within 200ms, with exactly one preceding successful refresh. On providers with reuse detection (Okta, Auth0, Salesforce) the error may be refresh_token_reused rather than invalid_grant. The failures are transient — a retry fired a moment later may succeed because another caller already wrote a fresh token.

What is single-flight for OAuth token refresh?

The first caller that sees the access token expired starts the refresh and stores the in-flight Promise in a shared variable. Every other caller within the same process awaits that same Promise instead of starting its own refresh. When the one refresh completes, all waiters get the result. The Promise is cleared in a finally block so the next expiry event can trigger a fresh refresh.

When do I need a cross-process lock instead of single-flight?

Single-flight only works within one process. If multiple OS processes share one credential file — several CLI instances, a process pool, containers on the same host — each process has its own in-memory state and can still race. Add an inter-process lock (O_EXCL lock file or flock), re-read the token after acquiring it (the previous holder may have already rotated), and write atomically (temp file + rename).

Does the race only happen with rotating refresh tokens?

The invalid_grant failure requires rotation (the provider revokes the token after use). Without rotation, concurrent refreshes may be wasteful but do not produce errors. However, if multiple processes write the same credential file without atomic persistence, you can still get file corruption (truncated JSON, lost refresh token field) even without rotation. Rotation just makes the concurrency bug visibly fatal instead of quietly wasteful.

Can a retry loop fix the race condition?

A retry may mask it at low concurrency, because by the time the retry fires another caller has usually written a fresh token. At high concurrency, retries amplify the problem: each retry is a new concurrent caller. The correct approach: on invalid_grant, re-read the stored token first; if another caller already rotated it, use that fresh token. Only re-auth the user if the token is genuinely revoked (permanent failure, not transient race).
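A sketch of re-read-before-retry, assuming the refresh function throws an Error whose message is "invalid_grant" and that load reads the current stored credentials (both are hypothetical conventions; real clients usually expose a structured error code instead):

```typescript
interface Creds {
  access_token: string;
  refresh_token: string;
}

// Signals that the user has to sign in again; the name is illustrative.
class ReauthRequired extends Error {}

async function refreshWithReread(
  used: Creds,
  load: () => Promise<Creds>,
  refresh: (refreshToken: string) => Promise<Creds>,
): Promise<Creds> {
  try {
    return await refresh(used.refresh_token);
  } catch (err) {
    // Anything other than invalid_grant (network error, 5xx) is not this race.
    if (!(err instanceof Error) || err.message !== "invalid_grant") throw err;

    const latest = await load(); // re-read storage before touching the endpoint again
    if (latest.refresh_token !== used.refresh_token) {
      return latest; // a sibling already rotated: use its token, no second refresh
    }
    // Storage still holds the token we just tried, so nobody else rotated it.
    throw new ReauthRequired("refresh token revoked; user must sign in again");
  }
}
```

One caveat: the winner may not have persisted its new token at the moment the loser re-reads, so a real implementation might re-read a bounded number of times with a short delay before concluding the token is revoked.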
