joellabs // engineering & observability

When analyzing massive datasets like the GitHub events archive, counting specific occurrences across different categories is a highly common task. A classic example is calculating the ratio of closed-to-opened Pull Requests (PRs) for active repositories.

While there are many ways to write this in SQL, the performance characteristics of each approach vary drastically. Let's look at how we can evolve a query from a slow, naive approach into an elegant, high-performance solution using specialized SQL combinators.

1. The Dataset Baseline

First, let's establish the sheer scale we are working with by checking the total number of records in our dataset:

SELECT count() FROM github_events;

2. The Naive Approach: Correlated Subqueries

A correlated subquery executes an inner query once for every single row processed by the outer query. While intuitive to write, it is incredibly inefficient on large datasets.

This query attempts to calculate opened and closed PRs for repositories with significant activity (more than 50 opened PRs):

SELECT
    repo_name,
    (
        SELECT count()
        FROM github_events AS sub
        WHERE (sub.repo_name = main.repo_name) AND (event_type = 'PullRequestEvent') AND (action = 'opened')
    ) AS pr_opened,
    (
        SELECT count()
        FROM github_events AS sub
        WHERE (sub.repo_name = main.repo_name) AND (event_type = 'PullRequestEvent') AND (action = 'closed')
    ) AS pr_closed
FROM github_events AS main
WHERE event_type = 'PullRequestEvent'
GROUP BY repo_name
HAVING pr_opened > 50;

Performance Warning: Avoid this pattern in production analytics. The database is forced to perform a loop-like operation across millions of rows, drastically slowing down execution time. Moreover this query is going to fail in ClickHouse and it is intentional.

3. The Traditional Join Approach (Without -If Combinator)

To avoid correlated subqueries, developers often split the problem into separate datasets using subqueries and then LEFT JOIN them back together.

SELECT
    opened.repo_name,
    opened.cnt AS pr_opened,
    coalesce(closed.cnt, 0) AS pr_closed,
    round(pr_closed / nullIf(pr_opened, 0), 2) AS closed_rate
FROM
(
    SELECT
        repo_name,
        count() AS cnt
    FROM github_events
    WHERE (event_type = 'PullRequestEvent') AND (action = 'opened')
    GROUP BY repo_name
) AS opened
LEFT JOIN
(
    SELECT
        repo_name,
        count() AS cnt
    FROM github_events
    WHERE (event_type = 'PullRequestEvent') AND (action = 'closed')
    GROUP BY repo_name
) AS closed ON opened.repo_name = closed.repo_name
WHERE opened.cnt > 50
ORDER BY closed_rate DESC;

The Good: It scans the table cleanly by separating the metrics.
The Bad: It forces the engine to read the github_events table twice and perform an expensive join operation in memory.

4. The "Standard SQL" Way: Conditional Aggregation

The industry-standard way to solve this in traditional relational databases (like PostgreSQL or MySQL) is to use a single scan with CASE WHEN statements embedded inside aggregate functions.

SELECT
    repo_name,
    SUM(CASE WHEN action = 'opened' THEN 1 ELSE 0 END) AS pr_opened,
    SUM(CASE WHEN action = 'closed' THEN 1 ELSE 0 END) AS pr_closed,
    ROUND(
        SUM(CASE WHEN action = 'closed' THEN 1 ELSE 0 END) / 
        NULLIF(SUM(CASE WHEN action = 'opened' THEN 1 ELSE 0 END), 0), 
        2
    ) AS closed_rate
FROM github_events
WHERE event_type = 'PullRequestEvent'
GROUP BY repo_name
HAVING SUM(CASE WHEN action = 'opened' THEN 1 ELSE 0 END) > 50
ORDER BY closed_rate DESC;

Why this is better: It requires only a single pass over the data. No joins, no secondary subquery table lookups.

5. The Elegant Modern Way: The ClickHouse -If Combinator

Modern columnar analytics engines like ClickHouse take conditional aggregation a step further by introducing syntax extensions called combinators. Append -If to almost any aggregate function, and it takes a condition as its final argument.

This completely eliminates the clunky, repetitive CASE WHEN syntax while keeping the single-pass speed.

SELECT
    repo_name,
    countIf(action = 'opened') AS pr_opened,
    countIf(action = 'closed') AS pr_closed,
    round(pr_closed / nullIf(pr_opened, 0), 2) AS closed_rate
FROM github_events
WHERE event_type = 'PullRequestEvent'
GROUP BY repo_name
HAVING pr_opened > 50
ORDER BY closed_rate DESC
FORMAT Null;

Why This Wins:

Readability: The query is clean, concise, and clearly intent-driven.
Performance: It matches or beats the "Standard SQL" single-scan performance with lower query-parsing overhead.
Safety: Utilizing nullIf(pr_opened, 0) protects the engine from throwing a fatal division-by-zero error.
Note: The FORMAT Null clause at the end is a handy trick to benchmark execution speed without rendering millions of rows to your terminal display.