Skip to content

Runbook: Investigate a Slow Query

This runbook covers diagnosing and resolving slow database queries in production.


Step 1: Identify Slow Queries

From application logs

Queries that exceed a duration threshold are logged by the EF Core query interceptor or by OpenTelemetry traces. Search for slow query log entries:

Terminal window
# Adjust to your logging platform
grep '"duration_ms"' /var/log/api/app.log | jq 'select(.duration_ms > 1000)'

From PostgreSQL directly

-- Queries currently running for more than 5 seconds
SELECT
pid,
now() - query_start AS duration,
left(query, 200) AS query_preview,
state,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE state != 'idle'
AND (now() - query_start) > interval '5 seconds'
ORDER BY duration DESC;
-- Top slow queries from pg_stat_statements (requires the extension)
SELECT
left(query, 200) AS query,
calls,
total_exec_time / calls AS avg_ms,
total_exec_time,
rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 20;

Enable pg_stat_statements if not already enabled:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Step 2: Get the Query Plan

Use EXPLAIN (ANALYZE, BUFFERS) on the slow query to get the actual execution plan. Run this against a replica or during low-traffic if the query is expensive.

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM posts
WHERE author_id = '00000000-0000-0000-0000-000000000001'
AND published_at_utc IS NOT NULL
ORDER BY published_at_utc DESC
LIMIT 20;

Paste the output into a tool like explain.dalibo.com to visualize the plan.

Warning signs in the plan

PatternCause
Seq Scan on a large tableMissing index
High rows= estimate vs. actual rowsStale table statistics — run ANALYZE
Nested Loop with large inner scansN+1 query or missing foreign key index
Hash Join with large memory spillJoin on unindexed columns
buffers: shared hit= very highQuery scanning too much data

Step 3: Identify the Missing or Unused Index

-- Check existing indexes on the table
SELECT
indexname,
indexdef,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_indexes
JOIN pg_stat_user_indexes USING (indexrelid)
WHERE tablename = 'posts';

If idx_scan = 0 for an index that should be used, check whether:

  • The query’s WHERE clause columns match the index column order.
  • The query uses a function on the indexed column (for example LOWER(email) vs. email).
  • The data distribution makes a sequential scan cheaper (small table or very low selectivity).

Step 4: Add or Fix the Index

Add a missing index with CONCURRENTLY to avoid table locks:

-- Create index without locking the table
CREATE INDEX CONCURRENTLY ix_posts_author_id_published
ON posts (author_id, published_at_utc DESC)
WHERE published_at_utc IS NOT NULL;

After verifying the index improves the query, add the corresponding EF Core configuration to PostConfiguration.cs:

builder.HasIndex(p => new { p.AuthorId, p.PublishedAtUtc })
.HasFilter("published_at_utc IS NOT NULL")
.HasDatabaseName("ix_posts_author_id_published");

Create a new migration for this index so it is tracked in EF Core’s migration history.


Step 5: Update Statistics

If the planner is using bad estimates, update table statistics:

ANALYZE posts;

Or update all tables:

ANALYZE VERBOSE;

Step 6: Verify the Fix

Re-run EXPLAIN (ANALYZE, BUFFERS) after adding the index or updating statistics. Confirm:

  • The plan now uses an index scan instead of a sequential scan.
  • The execution time has decreased.
  • The rows estimate is closer to the actual rows.

Step 7: Monitor Over Time

After deploying the fix, monitor the slow query log and pg_stat_statements to confirm the query no longer appears in the top slow queries.

If the query has improved but still appears, consider:

  • Query result caching at the application layer (see docs/conventions/backend/caching.md).
  • Materialized views for expensive aggregations.
  • Moving the query to a read replica.