How we accidentally dropped half our traces: a tale of Cloudflare Workers and WAF
Yesterday morning we noticed something worrying: our queueing servers were processing far fewer spans than usual. After a few hours of debugging, we traced the problem to an unexpected routing change inside Cloudflare. Here's what happened.
Flare receives spans as part of our tracing integration. It shows you exactly what happens within your application: which queries run, which API requests are made, which views get rendered, and a lot more.
We use Flare to monitor Flare: our staging environment watches over the production environment. That morning, the production /v1/traces endpoint was receiving roughly 50% fewer requests than usual.
The day before, we had merged a massive PR: a complete rewrite of our trace ingestion logic. Naturally, that's where we started looking. But here's the strange part: 50% of our traces were still being processed correctly. The new code was working half the time. Strange. Very strange.
Our ingestion pipeline
To understand what went wrong, here's how trace ingestion works in Flare:
1. A client sends a trace in OpenTelemetry format to ingress.flareapp.io/v1/traces
2. A Cloudflare Worker handles the request. Think AWS Lambda: a function running at the edge. The Worker checks:
   - Is there an API key provided?
   - Is the API key valid?
   - Has the API key exceeded its usage quota?
   - Is the API key being rate limited for sending too much at once?
   - Is the trace format valid?
3. If everything checks out, the Worker uploads the trace to R2 (Cloudflare's S3 alternative) and sends a notification to Flare with the filename
4. That notification is a request to ingress.flareapp.io/api/cloudflare-traces, handled by our load balancer and eventually a Laravel application that queues a job to process the file
5. A queue worker picks up the job, fetches the file from R2, processes the trace, and stores the spans in ClickHouse
6. Done, another trace successfully processed
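The Worker's validation step can be sketched as a pure function mirroring the checks above. This is a hypothetical reconstruction, not Flare's actual code: the function name, header handling, and the OTLP shape check are all assumptions, and the quota and rate-limit checks are stubbed out where a real Worker would consult a data store.

```typescript
// Hypothetical sketch of the ingestion Worker's validation step.
// None of these names come from Flare's codebase.

type CheckResult = { ok: true } | { ok: false; status: number; reason: string };

// Pure function so the logic can be exercised without a Workers runtime.
function checkRequest(
  apiKey: string | null,
  validKeys: Set<string>,
  body: unknown,
): CheckResult {
  if (!apiKey) return { ok: false, status: 401, reason: "missing API key" };
  if (!validKeys.has(apiKey)) return { ok: false, status: 403, reason: "invalid API key" };
  // A real Worker would also check usage quota and rate limits here,
  // e.g. against Workers KV or a Durable Object.
  const looksLikeOtlp =
    typeof body === "object" && body !== null && "resourceSpans" in body;
  if (!looksLikeOtlp) return { ok: false, status: 422, reason: "invalid trace format" };
  return { ok: true };
}
```

If the checks pass, the fetch handler would then put the payload into an R2 bucket binding and send the notification request with the resulting filename.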
What went wrong
In step 4, the Cloudflare Worker sends a request to ingress.flareapp.io/api/cloudflare-traces to notify Flare about the new trace. We always assumed that since this request originates from a Worker within the same Cloudflare zone, it would be passed directly to our origin load balancer, bypassing the rest of Cloudflare's infrastructure.
That assumption turned out to be wrong.
Instead of going straight to our origin, Cloudflare routed these requests back through its entire stack, including the WAF (Web Application Firewall). We use Cloudflare's WAF extensively: the entire Flare website, app, and API sit behind it. We have rules to block abusive API keys and, crucially, rate limits.
So what happened is simple: the Worker's internal requests to Flare were being treated as regular external traffic. They hit our rate limits and got dropped. That's why roughly half the traces still made it through: we were sitting right at the edge of the rate limit threshold.
The baffling part? This Worker setup has been running unchanged since we launched performance monitoring over a year ago. We have no idea why Cloudflare suddenly started routing these requests differently.
Looking back, we actually noticed a similar but smaller dip in spans around Valentine's weekend. At the time we shrugged it off; people probably had better plans than visiting websites that weekend. It now looks like Cloudflare briefly rerouted requests then too. Since everything recovered on Monday, nobody investigated.
How we fixed it
We applied two fixes.
First, we changed the internal routing so the Worker sends its notification to the /v1/traces path instead of /api/cloudflare-traces. Since /v1/traces is the path our Worker route already intercepts, we hoped the notification would avoid the WAF entirely. This immediately brought most traces back, but not all of them.
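With both client uploads and internal notifications arriving on the same path, the Worker needs to tell them apart. As a hedged illustration only (the header name and the function are invented, not Flare's actual mechanism), that could look like this:

```typescript
// Hypothetical routing helper: client trace uploads and the Worker's own
// notification callbacks both hit /v1/traces, so an invented internal
// header distinguishes the two. Not Flare's real implementation.

type Route = "ingest-trace" | "notify-origin" | "pass-through";

function routeFor(url: string, headers: Headers): Route {
  const path = new URL(url).pathname;
  if (path !== "/v1/traces") return "pass-through";
  return headers.has("x-internal-notification") ? "notify-origin" : "ingest-trace";
}
```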
Second, we added a WAF skip rule. Cloudflare lets you bypass WAF rules when a request originates from a Worker in a specific zone:
(cf.worker.upstream_zone eq "flareapp.io")
We assumed this would be the default behavior. It's not. This rule isn't even supported by Cloudflare's visual rule builder, and there's barely any documentation mentioning it exists. But it works!
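Because the visual builder can't express this, a rule like ours has to be deployed through Cloudflare's Rulesets API or Terraform. A minimal sketch of the rule payload, assuming the Rulesets API's custom-rules shape (a skip action with `ruleset: "current"` skips the remaining custom rules); treat the field values as an approximation rather than a copy of our configuration:

```typescript
// Sketch of a WAF custom-rules skip rule as a Rulesets API payload.
// Field names follow Cloudflare's Rulesets API; the description and
// exact parameters here are illustrative assumptions.

function buildWorkerSkipRule(zone: string) {
  return {
    description: `Skip WAF custom rules for subrequests from ${zone} Workers`,
    expression: `(cf.worker.upstream_zone eq "${zone}")`,
    action: "skip",
    action_parameters: { ruleset: "current" },
  };
}
```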
After both changes, trace ingestion was back to normal.
Closing thoughts
Despite this hiccup, Cloudflare Workers have been a great part of our infrastructure. They've kept Flare's servers safe from all kinds of malicious traffic for over a year, and they'll keep doing that. We're happy with what Cloudflare provides, even if the configuration can surprise you sometimes.
Up next, we're working on Flare's next big feature: logging. While you can already send logs to Flare, we're going to take it to the next level. Stay tuned!