Solving the Async Stack Trace Problem
One of the top pain points in the Rust 2025 Survey was poor async stack traces. async-inspect directly addresses this problem.
The Problem
When an async function panics, traditional stack traces are nearly useless:
thread 'tokio-runtime-worker' panicked at 'database connection failed'
stack backtrace:
0: std::panicking::begin_panic
1: <core::pin::Pin<P> as core::future::future::Future>::poll
2: tokio::runtime::task::core::Core<T,S>::poll
3: tokio::runtime::task::harness::Harness<T,S>::poll
at ~/.cargo/registry/src/tokio-1.0/src/runtime/task/harness.rs:150
4: tokio::runtime::blocking::pool::Inner::run
5: std::sys_common::backtrace::__rust_begin_short_backtrace
What's Wrong?
- ❌ No task context: which async task failed?
- ❌ No async call chain: what function called what?
- ❌ No await point: where was the task blocked?
- ❌ No state information: what was the task doing?
- ❌ Runtime internals only: just tokio/async-std frames
Why Does This Happen?
Async functions are compiled into state machines:
// You write:
async fn fetch_user(id: u64) -> User {
    let profile = db.get_profile(id).await;
    let posts = db.get_posts(id).await;
    User { profile, posts }
}

// The compiler generates (conceptually) a state machine like:
enum FetchUserState {
    Start { id: u64 },
    WaitProfile { id: u64, future: ProfileFuture },
    WaitPosts { profile: Profile, future: PostsFuture },
    Done,
}
When it panics, you only see the poll machinery, not your actual async code.
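To make "poll machinery" concrete, here is a simplified, hand-written equivalent of the Future impl the compiler emits for a two-await state machine. This is a sketch, not the actual generated code (which also handles pinning and self-referential borrows); the names are made up for illustration:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Boxed stand-ins for the awaited sub-futures.
type Step = Pin<Box<dyn Future<Output = u32>>>;

enum TwoAwaits {
    Start,
    WaitFirst(Step),
    WaitSecond { first: u32, fut: Step },
    Done,
}

impl Future for TwoAwaits {
    type Output = u32;

    // This single generic poll frame is all a backtrace ever shows:
    // both .await points collapse into state transitions in one match.
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        loop {
            match &mut *self {
                TwoAwaits::Start => {
                    *self = TwoAwaits::WaitFirst(Box::pin(async { 1 }));
                }
                TwoAwaits::WaitFirst(fut) => match fut.as_mut().poll(cx) {
                    Poll::Ready(first) => {
                        *self = TwoAwaits::WaitSecond { first, fut: Box::pin(async { 2 }) };
                    }
                    Poll::Pending => return Poll::Pending,
                },
                TwoAwaits::WaitSecond { first, fut } => match fut.as_mut().poll(cx) {
                    Poll::Ready(second) => {
                        let out = *first + second;
                        *self = TwoAwaits::Done;
                        return Poll::Ready(out);
                    }
                    Poll::Pending => return Poll::Pending,
                },
                TwoAwaits::Done => panic!("polled after completion"),
            }
        }
    }
}

Every frame in the earlier backtrace is a monomorphized poll function like this one; your async fn survives only as an enum the runtime polls, which is why its name never appears.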
The Solution: async-inspect
async-inspect captures async-specific context that normal stack traces can't provide.
1. Full Async Call Chain
Traditional stack trace:
tokio::runtime::task::harness::Harness<T,S>::poll
at src/runtime/task/harness.rs:150
async-inspect:
$ async-inspect analyze --show-failures
Task #42: handle_request [PANICKED]
├─ Location: src/api/handlers.rs:23
├─ Duration: 5.2s before panic
│
└─ Async Call Chain:
1. main::spawn_handler (src/main.rs:45)
2. handle_request(req) (src/api/handlers.rs:23)
3. ├─ authenticate_user(token) (src/auth.rs:67) [50ms] ✅
4. ├─ fetch_user_data(id: 123) (src/users.rs:34)
5. │ └─ db_query(sql) (src/db.rs:89) [5.1s] ❌ PANICKED
6. └─ ❌ PANIC: "connection timeout"
2. Current Await Point
See exactly where the task was stuck:
Task #42 State:
Status: PANICKED
Blocked At: db_query().await
Source: src/db.rs:89
Code Context:
87: .bind(user_id)
88: .fetch_one(&pool)
89: >>> .await?; ← STUCK HERE FOR 5.2s
90:
91: Ok(user)
3. Task Timeline
Understand what led to the panic:
$ async-inspect timeline --task 42
Task #42 Timeline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0ms │ ● Task spawned
│ handle_request(req: Request)
│
5ms │ ● Entered: authenticate_user
10ms │ ○ Poll::Pending (awaiting auth)
45ms │ ● Poll::Ready(token)
50ms │ ✓ authenticate_user completed
│
55ms │ ● Entered: fetch_user_data
60ms │ └─ db_query started
65ms │ ○ Poll::Pending (awaiting connection)
100ms │ ○ Poll::Pending (waiting...)
200ms │ ○ Poll::Pending (waiting...)
500ms │ ○ Poll::Pending (still waiting...)
│ ... [polled 847 times]
5200ms │ ❌ TIMEOUT → PANIC
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Warning: Task polled 847 times without progress
Possible cause: busy-wait loop or resource starvation
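The classic way to trigger that warning is a future that schedules its own wake-up on every poll. A minimal, illustrative sketch of the busy-wait anti-pattern:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Anti-pattern: returning Pending while immediately re-waking yourself
// makes the executor poll this future in a tight loop with no progress.
struct BusyWait;

impl Future for BusyWait {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        cx.waker().wake_by_ref(); // reschedule immediately...
        Poll::Pending             // ...so we get polled again and again
    }
}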
4. Related Tasks Analysis
Find patterns across failures:
$ async-inspect analyze --correlate
🔍 Failure Analysis
Found 10 related failures (last 5 minutes):
Pattern: Database connection timeout
Tasks: #38, #39, #40, #41, #42, #43, #44, #45, #46, #47
All blocked at: db_query().await
Common cause: Connection pool exhausted
Connection Pool Status:
┌──────────────────────────┐
│ Active: 10/10 [FULL] │ ⚠️
│ Idle: 0/10 │
│ Waiting: 37 tasks │ ← Tasks waiting for connections
└──────────────────────────┘
Diagnosis: Connection pool saturation
- All 10 connections in use
- 37 tasks waiting for available connection
- Average wait time: 5.2s → timeout
Recommendations:
1. Increase max_connections in database config
2. Add connection timeout (currently unlimited)
3. Implement connection retry with backoff
4. Review slow queries holding connections
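Recommendations 1 and 2 are the pool config change shown later in this post. For recommendation 3, here is a minimal retry-with-backoff sketch; with_backoff and run_query are hypothetical stand-ins for a wrapper around your database call:

use std::time::Duration;

// Retry a fallible async operation with exponential backoff.
async fn with_backoff<T, E, Fut>(
    mut run_query: impl FnMut() -> Fut,
    max_retries: u32,
) -> Result<T, E>
where
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(100);
    for _ in 0..max_retries {
        if let Ok(value) = run_query().await {
            return Ok(value);
        }
        tokio::time::sleep(delay).await; // back off before the next attempt
        delay *= 2;                      // 100ms, 200ms, 400ms, ...
    }
    run_query().await // final attempt; propagate its error
}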
Real-World Example
Scenario: Production API Panic
Your production API starts panicking with this error:
thread 'tokio-runtime-worker' panicked at 'database error: connection timeout'
note: run with `RUST_BACKTRACE=1` for a backtrace
Traditional Debugging Process
- ❌ Stack trace shows only runtime internals
- ❌ Add logging to every function manually
- ❌ Reproduce locally (can't replicate production load)
- ❌ Deploy, wait for it to happen again
- ❌ Check logs, still not enough context
- 😫 Hours/days of debugging
With async-inspect
- ✅ Check dashboard immediately:
$ async-inspect monitor
Active Tasks: 47
Failed Tasks (last 5m): 12
Deadlocks: 0
⚠️ High failure rate detected!
Failed Tasks:
Task #42: handle_request [PANICKED] 5.2s
Task #43: handle_request [PANICKED] 5.3s
Task #44: handle_request [PANICKED] 5.1s
... [9 more]
Press 'd' for detailed analysis
- ✅ See the pattern:
$ async-inspect analyze --failures
Common Failure Pattern:
Location: src/db.rs:89 (db_query().await)
Cause: Connection timeout after 5s
Affected: 12 tasks
Root Cause Analysis:
┌─────────────────────────────────────┐
│ Database connection pool exhausted │
│ 10/10 connections active │
│ 35+ tasks waiting │
└─────────────────────────────────────┘
- ✅ Fix immediately:
// Before: small pool, no acquire timeout
let pool = PgPoolOptions::new()
    .max_connections(10)
    .connect(db_url).await?;

// After: larger pool plus an acquire timeout
let pool = PgPoolOptions::new()
    .max_connections(50)                     // ← increase pool
    .acquire_timeout(Duration::from_secs(2)) // ← fail fast instead of hanging
    .connect(db_url).await?;
- ✅ Verify fix:
$ async-inspect monitor
Active Tasks: 52
Failed Tasks (last 5m): 0 ✓
Average response time: 45ms
Connection Pool:
Active: 12/50
Idle: 38/50 ✓ Healthy
Waiting: 0
Total time: 5 minutes instead of hours/days.
How to Use async-inspect for Stack Traces
Setup
- Add instrumentation:
use async_inspect::Inspector;

#[tokio::main]
async fn main() {
    // Initialize the inspector (kept alive for the program's lifetime)
    let _inspector = Inspector::new(Default::default());

    // Your app code
    run_server().await;
}

#[async_inspect::trace] // ← Add to async functions you want traced
async fn handle_request(req: Request) -> Result<Response, Error> {
    let user = authenticate(req).await?;
    let data = fetch_data(user.id).await?;
    Ok(render(data))
}
- Run with monitoring:
# Terminal 1: Run your app
cargo run
# Terminal 2: Monitor tasks
async-inspect monitor
When Something Panics
Immediate triage:
# See what failed
async-inspect analyze --show-failures
# Get detailed trace
async-inspect trace --task <id>
# Export for investigation
async-inspect export --json panic_trace.json
Development Workflow
# During development
cargo run # Inspector automatically enabled in debug mode
# In another terminal
async-inspect tui # Live dashboard
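If your setup does not flip that debug-mode switch for you, a conditional init along these lines works. This is a sketch that uses only the Config fields shown elsewhere in this post:

// Capture everything in debug builds; sample lightly in release builds.
let inspector = if cfg!(debug_assertions) {
    Inspector::new(Default::default())
} else {
    Inspector::new(Config {
        sampling_rate: 0.01,
        mode: Mode::Production,
        ..Default::default()
    })
};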
Production Deployment
// Low-overhead production config
let inspector = Inspector::new(Config {
    sampling_rate: 0.01,       // track only 1% of tasks
    capture_backtraces: false, // backtrace capture is expensive
    mode: Mode::Production,
    ..Default::default()
});

// Export failures automatically
tokio::spawn(async move {
    loop {
        tokio::time::sleep(Duration::from_secs(60)).await;
        let failures = inspector.failed_tasks();
        if !failures.is_empty() {
            // Export to your logging/monitoring system
            log::error!("Task failures: {:#?}", failures);
        }
    }
});
Comparison with Other Solutions
vs. RUST_BACKTRACE=1
| Feature | RUST_BACKTRACE | async-inspect |
|---|---|---|
| Shows async call chain | ❌ No | ✅ Yes |
| Shows await points | ❌ No | ✅ Yes |
| Shows task state | ❌ No | ✅ Yes |
| Shows task relationships | ❌ No | ✅ Yes |
| Time in each state | ❌ No | ✅ Yes |
| Works in production | ✅ Yes | ✅ Yes (low overhead) |
| Zero cost when disabled | ✅ Yes | 🟡 Small residual overhead |
vs. tokio-console
| Feature | tokio-console | async-inspect |
|---|---|---|
| Live task monitoring | ✅ Yes | ✅ Yes |
| Historical analysis | ❌ No | ✅ Yes |
| Panic analysis | ❌ Limited | ✅ Full |
| Deadlock detection | ✅ Yes | ✅ Yes |
| Export traces | ❌ No | ✅ JSON/CSV |
| Production safe | 🟡 High overhead | ✅ Low overhead |
Best approach: Use both!
- tokio-console for runtime observability
- async-inspect for debugging and failure analysis
vs. tracing + tracing-subscriber
| Feature | tracing | async-inspect |
|---|---|---|
| Instrumentation | ✅ Manual, flexible | ✅ Automatic |
| Task relationships | ❌ Limited | ✅ Full graph |
| State machine visibility | ❌ No | ✅ Yes |
| Await point tracking | ❌ No | ✅ Yes |
| Integration | ✅ Ecosystem | ✅ Compatible |
async-inspect works with tracing via the AsyncInspectLayer!
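A minimal wiring sketch: the registry/with/init calls are the real tracing-subscriber API, but the AsyncInspectLayer constructor is an assumption based on the layer name above.

use tracing_subscriber::prelude::*;

fn init_telemetry() {
    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())        // your existing log output
        .with(async_inspect::AsyncInspectLayer::new()) // assumed constructor
        .init();
}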
Future Improvements
The Rust project is working on better async diagnostics:
- RFC: async stack traces
- Improved panic messages for async
- Better debugger integration
async-inspect will complement these improvements:
// Future: Better built-in stack traces
// + async-inspect: Full task context, relationships, timeline
// Best of both worlds!
Common Scenarios
Scenario 1: "My async function panics randomly"
# Run with async-inspect
async-inspect monitor --watch
# When it panics, you see:
Task #123: process_payment [PANICKED]
Blocked at: external_api_call().await
Poll count: 1 (panicked on first poll!)
Error: "TLS handshake failed"
# Diagnosis: Network issue, not your code
Scenario 2: "Tests fail intermittently"
# Run tests with tracing
RUST_LOG=async_inspect=debug cargo test
# Failing test shows:
Task #5: test_user_creation [FAILED]
Deadlock detected!
- Task #5 waiting on Task #6 (mutex)
- Task #6 waiting on Task #5 (channel)
# Diagnosis: Classic deadlock
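The usual fix for this shape of deadlock is to never hold a lock across an await that depends on another task: scope the guard so it drops first. An illustrative sketch (not the failing test itself; the names are hypothetical):

use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

async fn create_user(users: Arc<Mutex<Vec<String>>>, mut done: mpsc::Receiver<()>) {
    {
        let mut guard = users.lock().await;
        guard.push("alice".to_string());
    } // guard dropped here, before waiting on the other task
    done.recv().await; // safe: we no longer hold the mutex
}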
Scenario 3: "Production slow requests"
async-inspect analyze --slow --threshold 1s
Slow Tasks (>1s):
Task #42: handle_checkout [2.3s]
├─ validate_cart [50ms] ✅
├─ charge_payment [2.1s] ⚠️ SLOW
│ └─ external_api [2.0s] ← Problem!
└─ send_confirmation [100ms] ✅
# Diagnosis: External API slow, add timeout
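Acting on that diagnosis with tokio's real timeout combinator; the payment types and call_external_api below are hypothetical stand-ins for the slow call in the trace:

use std::time::Duration;
use tokio::time::timeout;

struct Receipt;
enum PaymentError { Timeout }

// Stand-in for the 2.0s external call shown above.
async fn call_external_api() -> Result<Receipt, PaymentError> { Ok(Receipt) }

async fn charge_payment() -> Result<Receipt, PaymentError> {
    // Bound the call: a slow API now fails this request fast instead of
    // holding it (and a pool connection) for seconds.
    match timeout(Duration::from_secs(2), call_external_api()).await {
        Ok(result) => result,                        // completed within 2s
        Err(_elapsed) => Err(PaymentError::Timeout), // 2s elapsed: give up
    }
}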
Best Practices
1. Annotate Critical Paths
// ✅ GOOD: Annotate user-facing handlers
#[async_inspect::trace]
async fn api_handler() { }
// ✅ GOOD: Annotate error-prone code
#[async_inspect::trace]
async fn risky_operation() { }
// ❌ BAD: Don't annotate everything
#[async_inspect::trace]
async fn tiny_helper() { } // Too fine-grained
2. Use in Tests
#[tokio::test]
#[async_inspect::trace] // ← Add this
async fn test_concurrent_access() {
    // If the test fails, you get full async context
}
3. Production Sampling
Config {
    sampling_rate: 0.01,       // sample 1% of tasks
    capture_backtraces: false, // expensive; skip in production
    mode: Mode::Production,
}
4. Export Failures
// Auto-export failures for analysis
if let Some(failure) = inspector.last_failure() {
    let json = serde_json::to_string(&failure)?;
    log::error!("Task failure: {}", json);
}
Limitations
async-inspect helps tremendously but doesn't solve everything:
- ❌ Doesn't replace proper error handling
- ❌ Doesn't fix bugs, just helps find them faster
- ❌ Adds a small runtime overhead even when tuned for production
- ❌ Requires instrumentation (manual or via tracing)
Get Started
- Install:
  cargo install async-inspect
- Add to project:
  [dependencies]
  async-inspect = "0.1"
- Instrument:
  #[async_inspect::trace]
  async fn your_function() { }
- Monitor:
  async-inspect tui
That's it! Next time you hit an async panic, you'll have the context you need.
Learn More
The async stack trace problem is real, but it's solvable. async-inspect gives you the visibility you need to debug async Rust with confidence.