Dramatic
♲ Dennis Schubert - 2024-12-27 00:20:02 GMT
Excerpt from a message I just posted in a #diaspora team internal forum category. The context here is that I recently got pinged because of slowness/load spikes on the diaspora* project web infrastructure (Discourse, Wiki, the project website, ...), and looking at the traffic logs makes me impressively angry.

In the last 60 days, the diaspora* web assets received 11.3 million requests. That equals 2.19 req/s - which honestly isn't that much. I mean, it's more than your average personal blog, but nothing my infrastructure shouldn't be able to handle.
However, here's what's grinding my fucking gears. Looking at the top user agent statistics, these are the leaders:
- 2.78 million requests - or 24.6% of all traffic - is coming from
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
- 1.69 million requests - 14.9% -
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
- 0.49m req - 4.3% -
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
- 0.25m req - 2.2% -
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
- 0.22m req - 1.9% -
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
and the list goes on like this. Summing up the top UA groups, it looks like my server is doing 70% of all its work for these fucking LLM training bots that don't do anything except crawl the fucking internet over and over again.
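If you want to pull the same numbers from your own logs, here's a minimal sketch in Python. It assumes the common "combined" log format (where the user agent is the last double-quoted field) and a file called access.log - both assumptions, adjust for your setup:

# Tally top user agents from an access log in "combined" format.
# Log path and format are assumptions - adjust for your setup.
from collections import Counter

counts = Counter()
total = 0
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        parts = line.split('"')
        if len(parts) >= 6:          # combined format has 3 quoted fields
            counts[parts[5]] += 1    # the 3rd quoted field is the user agent
            total += 1

for ua, n in counts.most_common(10):
    print(f"{n:>10}  {n / total:6.2%}  {ua}")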
Oh, and of course, they don't just crawl a page once and then move on. Oh no, they come back every 6 hours because lol why not. They also don't give a single flying fuck about robots.txt, because why should they. And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki. And I mean that - they indexed every single diff on every page for every change ever made, frequently with spikes of more than 10 req/s. Of course, this made MediaWiki and my database server very unhappy, causing load spikes and effective downtime/slowness for the human users.

If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.
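For the record, asking them nicely looks like this - a robots.txt group disallowing the crawler tokens from the user agent strings above. As said, they ignore it, but well-behaved bots honor it:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Amazonbot
User-agent: meta-externalagent
Disallow: /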
Just for context, here's how sane bots behave - or, in this case, classic search engine bots:
- 16.6k requests - 0.14% -
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- 15.9k req - 0.14% -
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
Because those bots realize that there's no point in crawling the same stupid shit over and over again.
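There's an actual mechanism behind that "no point": a polite crawler revalidates with conditional requests instead of re-downloading every page. A minimal sketch - the URL and timestamp are placeholders:

# Sketch of HTTP revalidation: ask "has this changed since my last visit?"
# instead of re-downloading. URL and timestamp are placeholders.
import urllib.request
import urllib.error

req = urllib.request.Request(
    "https://wiki.diasporafoundation.org/Main_Page",
    headers={"If-Modified-Since": "Fri, 20 Dec 2024 00:00:00 GMT"},
)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()  # 200: page changed, worth re-fetching
except urllib.error.HTTPError as err:
    if err.code == 304:
        pass  # 304 Not Modified: nothing changed, skip it
    else:
        raise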
I am so tired.
OldKid ⁂
In reply to Matthias: @Matthias that's really brutal.
I just checked my instance for this month (2024-12-01 to 2024-12-26):
gptbot: 72 requests
amazonbot: 16 requests
Both of those only requested robots.txt.
Googlebot: 131 requests (robots.txt, /files, /ftp, /backup, /backups, /config, ...)
bingbot: 18 requests (robots.txt only)
Rainer "diasp.org" Sokoll ✅
In reply to Matthias: Howto block AI bots with fail2ban (Apache) - Rainers kleine Welt (rainer.sokoll.com)
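The gist of the linked approach, as a sketch: match the bot UA strings in the Apache access log and let fail2ban ban the source IPs. The regex, paths, and ban time here are assumptions - see the article for the real thing:

# /etc/fail2ban/filter.d/apache-ai-bots.conf
[Definition]
failregex = ^<HOST> .* "[^"]*(GPTBot|ClaudeBot|Amazonbot|meta-externalagent)[^"]*"$

# /etc/fail2ban/jail.d/apache-ai-bots.local
[apache-ai-bots]
enabled  = true
port     = http,https
filter   = apache-ai-bots
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400

As the original post notes, this only catches crawlers that keep their bot UA strings; the ones that switch to browser-like UAs slip through.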
ℝ𝕠𝕓𝕚𝕟
In reply to Matthias: