How we accidentally dropped half our traces: a tale of Cloudflare Workers and WAF
Yesterday morning we noticed something worrying: our queueing servers were processing far fewer spans than usual. After a few hours of debugging, we traced the problem to an unexpected routing change inside Cloudflare. Here's what happened.
Flare receives spans as part of our tracing integration. It shows you exactly what happens within your application: which queries run, which API requests are made, which views get rendered, and a lot more.
We use Flare to monitor Flare: our staging environment watches over the production environment. That morning, the production /v1/traces endpoint was receiving roughly 50% fewer requests than usual.
The day before, we had merged a massive PR: a complete rewrite of our trace ingestion logic. Naturally, that's where we started looking. But here's the strange part: 50% of our traces were still being processed correctly. The new code was working half the time. Strange. Very strange.
Our ingestion pipeline
To understand what went wrong, here's how trace ingestion works in Flare:
1. A client sends a trace in OpenTelemetry format to ingress.flareapp.io/v1/traces
2. A Cloudflare Worker handles the request. Think AWS Lambda: a function running at the edge. The Worker checks:
   - Is there an API key provided?
   - Is the API key valid?
   - Has the API key exceeded its usage quota?
   - Is the API key being rate limited for sending too much at once?
   - Is the trace format valid?
3. If everything checks out, the Worker uploads the trace to R2 (Cloudflare's S3 alternative) and sends a notification to Flare with the filename
4. That notification is a request to ingress.flareapp.io/api/cloudflare-traces, handled by our load balancer and eventually a Laravel application that queues a job to process the file
5. A queue worker picks up the job, fetches the file from R2, processes the trace, and stores the spans in ClickHouse
6. Done, another trace successfully processed
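The Worker's validation step can be sketched as a pure function mirroring the checks above. This is a hypothetical reconstruction, not Flare's actual code: the function name, header handling, and the OTLP shape check are all assumptions, and the quota and rate-limit checks are stubbed out where a real Worker would consult a data store.

```typescript
// Hypothetical sketch of the ingestion Worker's validation step.
// None of these names come from Flare's codebase.

type CheckResult = { ok: true } | { ok: false; status: number; reason: string };

// Pure function so the logic can be exercised without a Workers runtime.
function checkRequest(
  apiKey: string | null,
  validKeys: Set<string>,
  body: unknown,
): CheckResult {
  if (!apiKey) return { ok: false, status: 401, reason: "missing API key" };
  if (!validKeys.has(apiKey)) return { ok: false, status: 403, reason: "invalid API key" };
  // A real Worker would also check usage quota and rate limits here,
  // e.g. against Workers KV or a Durable Object.
  const looksLikeOtlp =
    typeof body === "object" && body !== null && "resourceSpans" in body;
  if (!looksLikeOtlp) return { ok: false, status: 422, reason: "invalid trace format" };
  return { ok: true };
}
```

If the checks pass, the fetch handler would then put the payload into an R2 bucket binding and send the notification request with the resulting filename.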
What went wrong
In step 4, the Cloudflare Worker sends a request to ingress.flareapp.io/api/cloudflare-traces to notify Flare about the new trace. We always assumed that since this request originates from a Worker within the same Cloudflare zone, it would be passed directly to our origin load balancer, bypassing the rest of Cloudflare's infrastructure.
That assumption turned out to be wrong.
Instead of going straight to our origin, Cloudflare routed these requests back through its entire stack, including the WAF (Web Application Firewall). We use Cloudflare's WAF extensively: the entire Flare website, app, and API sit behind it. We have rules to block abusive API keys and, crucially, rate limits.
So what happened is simple: the Worker's internal requests to Flare were being treated as regular external traffic. They hit our rate limits and got dropped. That's why roughly half the traces still made it through: we were sitting right at the edge of the rate limit threshold.
The baffling part? This Worker setup has been running unchanged since we launched performance monitoring over a year ago. We have no idea why Cloudflare suddenly started routing these requests differently.
Looking back, we actually noticed a similar but smaller dip in spans around Valentine's weekend. At the time we shrugged it off; people probably had better plans than visiting websites that weekend. It now looks like Cloudflare briefly rerouted requests then too. Since everything recovered on Monday, nobody investigated.
How we fixed it
We applied two fixes.
First, we changed the internal routing so the Worker sends its notification to the /v1/traces path instead of /api/cloudflare-traces. Since /v1/traces is the path our Worker route already intercepts, we hoped the notification would avoid the WAF entirely. This immediately brought most traces back, but not all of them.
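With both client uploads and internal notifications arriving on the same path, the Worker needs to tell them apart. As a hedged illustration only (the header name and the function are invented, not Flare's actual mechanism), that could look like this:

```typescript
// Hypothetical routing helper: client trace uploads and the Worker's own
// notification callbacks both hit /v1/traces, so an invented internal
// header distinguishes the two. Not Flare's real implementation.

type Route = "ingest-trace" | "notify-origin" | "pass-through";

function routeFor(url: string, headers: Headers): Route {
  const path = new URL(url).pathname;
  if (path !== "/v1/traces") return "pass-through";
  return headers.has("x-internal-notification") ? "notify-origin" : "ingest-trace";
}
```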
Second, we added a WAF skip rule. Cloudflare lets you bypass WAF rules when a request originates from a Worker in a specific zone:
(cf.worker.upstream_zone eq "flareapp.io")
We assumed this would be the default behavior. It's not. This rule isn't even supported by Cloudflare's visual rule builder, and there's barely any documentation mentioning it exists. But it works!
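Because the visual builder can't express this, a rule like ours has to be deployed through Cloudflare's Rulesets API or Terraform. A minimal sketch of the rule payload, assuming the Rulesets API's custom-rules shape (a skip action with `ruleset: "current"` skips the remaining custom rules); treat the field values as an approximation rather than a copy of our configuration:

```typescript
// Sketch of a WAF custom-rules skip rule as a Rulesets API payload.
// Field names follow Cloudflare's Rulesets API; the description and
// exact parameters here are illustrative assumptions.

function buildWorkerSkipRule(zone: string) {
  return {
    description: `Skip WAF custom rules for subrequests from ${zone} Workers`,
    expression: `(cf.worker.upstream_zone eq "${zone}")`,
    action: "skip",
    action_parameters: { ruleset: "current" },
  };
}
```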
After both changes, trace ingestion was back to normal.
Closing thoughts
Despite this hiccup, Cloudflare Workers have been a great part of our infrastructure. They've kept Flare's servers safe from all kinds of malicious traffic for over a year, and they'll keep doing that. We're happy with what Cloudflare provides, even if the configuration can surprise you sometimes.
Up next, we're working on Flare's next big feature: logging. While you can already send logs to Flare, we're going to take it to the next level. Stay tuned!