How we stopped a DDoS attack at Laracon
It is the first day of Laracon EU 2024. Flare is sponsoring the conference for the first time, and we're present with almost the whole Spatie team to promote it. The clock ticks 10:30, and inside the Bimhuis auditorium, Kévin Dunglas's talk about how he built the awesome FrankenPHP server is almost over.
The doors of the auditorium open, and people stream out towards us to talk about Flare or to win a PlayStation 5 / Lamborghini* at our booth. And suddenly, a message from Rias, who's still working at the office in Antwerp, pops up in Slack:
@Freek @Alex @Ruben Flare down? #uptime looks like something DDoS 😬
Oh no! We've been working towards this conference for weeks: preparing a sales talk (something we've never done; we're developers, not salespeople), crafting banners, picking t-shirts, and ordering a bunch of stickers.
This was our moment, the moment for Flare to shine, and it was down. Alex and I rushed behind our promo banner, grabbed our MacBooks, and quickly formed a team with Rias at the office and Mathias from our friends at Oh Dear at the booth next to us. We started investigating the problem.
Taylor and Freek talking about Flare, while Ruben and Mathias try to fix Flare
At that moment, we were receiving 200x the number of requests of a regular day. WTF. Those requests all hit our servers, which spun up PHP-FPM processes as quickly as possible to handle them. Unfortunately, our servers couldn't handle such a load, resulting in downtime.
The first idea that popped into our heads was to spin up more web servers or scale our existing ones so they could manage the load. However, we quickly realized that a load like this couldn't be managed by simply adding extra servers.
While the rest of our team was telling conference attendees that Flare is THE Laravel error tracker, with a small but dedicated team always listening to our clients to create the best error tracker ever, that error tracker had now been down for half an hour.
Then it hit us: Cloudflare! We had had Cloudflare in front of Flare all along. Cloudflare has an "I'm under attack" switch, which basically adds a JavaScript challenge to every request coming into Flare. You, as a human, can pass this challenge by clicking the famous "I'm not a robot" checkbox. Robots are less lucky and can't get past it.
We hit that "I'm under attack" switch, and WOW! We immediately saw a reduction in the number of requests hitting the Flare servers. Cloudflare was blocking more than 99% of the requests at that point. We restarted our FPM services, and Flare was back online!
Now it was time to relax ... or so we thought. The great thing about the "I'm under attack" switch is that it immediately stops all non-human requests to a server. The problem with the switch is that it stops all non-human requests to a server. Flare is an error-tracking service. Many people send us errors from their servers. These requests, albeit non-human, are valid requests. By stopping the DDoS, we stopped the ingestion of errors.
Flare consists of two groups of servers. First, we have our web servers, which allow you to browse the Flare application. The attack was targeted at these servers, specifically the login endpoint.
The other set of servers are the ingest servers. Ignition and the Flare client send errors from our clients' applications to these servers. There's no human interaction, and these servers were not under attack.
The ingestion servers kept working during the attack because their routes are protected by Cloudflare Workers (kind of like Lambda functions). A worker checks whether a request has a valid API key and whether the team connected to that API key has enough quota left to send us its errors. When an API key does not exist, is invalid, or has exceeded its quota, the request is dropped.
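To give you an idea of the kind of check such a worker performs, here's a minimal sketch. The names (`Project`, `shouldForward`, the in-memory `Map`) are illustrative assumptions, not Flare's actual implementation; a real Cloudflare Worker would look the key up in Workers KV or a cache rather than a hardcoded map.

```typescript
// Hypothetical sketch of edge validation for incoming error reports.
// All names and data here are made up for illustration.

type Project = { apiKey: string; quotaUsed: number; quotaLimit: number };

// Stand-in for a real key/quota store (e.g. Workers KV).
const projects = new Map<string, Project>([
  ["key-123", { apiKey: "key-123", quotaUsed: 10, quotaLimit: 100 }],
  ["key-456", { apiKey: "key-456", quotaUsed: 100, quotaLimit: 100 }],
]);

// Decide at the edge whether an error report may pass through
// to the ingestion servers.
function shouldForward(apiKey: string | null): boolean {
  if (!apiKey) return false; // missing key: drop
  const project = projects.get(apiKey);
  if (!project) return false; // unknown or invalid key: drop
  return project.quotaUsed < project.quotaLimit; // over quota: drop
}

console.log(shouldForward("key-123")); // true: valid key with quota left
console.log(shouldForward("key-456")); // false: quota exhausted
console.log(shouldForward("key-999")); // false: unknown key
```

Because requests failing these checks are dropped at the edge, they never reach the origin at all, which is exactly what keeps a flood from spinning up PHP-FPM processes.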
That's why we kept processing errors sent to us during the attack. Our ingestion servers purred like kittens, and they didn't notice anything going on at the web servers.
Back to Cloudflare: the "I'm under attack" mode stopped the DDoS attack on our web servers, but it also stopped legitimate requests to our ingestion servers. Luckily, the Cloudflare challenge can be configured, and after a few minutes, we managed to exclude our ingestion routes from it.
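One way to do this kind of exclusion is a Cloudflare Page Rule that lowers the security level for just the ingestion hostname, while the rest of the zone stays in "I'm under attack" mode. A rough sketch of such a rule payload, with a made-up hostname (this is not Flare's actual setup, just an illustration of the Page Rules API payload shape):

```typescript
// Hypothetical Page Rule payload: keep "under attack" mode on the zone,
// but drop the security level back to "medium" for the ingestion hostname
// so non-human error reports aren't challenged. Hostname is made up.
const pageRule = {
  targets: [
    {
      target: "url",
      constraint: { operator: "matches", value: "ingest.example.com/*" },
    },
  ],
  actions: [{ id: "security_level", value: "medium" }],
  status: "active",
};

// A deploy script would POST this (with an API token) to
// https://api.cloudflare.com/client/v4/zones/{zone_id}/pagerules
console.log(JSON.stringify(pageRule, null, 2));
```

The same effect can be achieved in the Cloudflare dashboard without touching the API; the point is that challenge behavior is configurable per path or hostname.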
For the next two hours, the attacker kept sending the same ridiculous amount of requests to our servers. But Flare didn't break a sweat—we were back online!
We found an email addressed to us in our support system. The attacker wanted a certain amount of ransom money. After the payment, he would stop the attack. A few Google searches later, we found other small companies on Twitter (ehm X) that the same attacker targeted. We're glad we didn't pay.
That was one hell of a ride, our first DDoS attack. To make things run smoother next time, we've updated our internal documentation so that when another DDoS attack happens, everyone on our team can configure Cloudflare to block the attack while letting the right requests through.
To all the other small SaaS companies like us: make sure you have a service like Cloudflare between your servers and the web. It is extremely helpful in situations like these.
* It was a toy Lamborghini, still cool, though 😎