utopiArte



friendica

AI / LLM User-Agents: Blocking Guide

Find out how to block your content from being used for AI/LLM training with robots.txt. Created by ex-Google engineer Fili.

robotstxt.com

#fediverse #ai #ki #fediadmin @Friendica Support

Unknown original post

mastodon

jesuiSatire …ᘛ⁐̤ᕐᐷ

Unknown original post • 6 days ago

@8c55c5b251af92ad842811e585a5f7be85bf3e70ba25c5de1223459d83fb6c72

> (P.S. Can we get a robot that can understand sarcasm? Asking for a friend)

Sure!
Here I am.
What's your problem honey?

@helpers @utopiarte

@Friendica Support @utopiArte @users/8c55c5b251af92ad842811e585a5f7be85bf3e70ba25c5de1223459d83fb6c72

Unknown original post

friendica

utopiArte

Unknown original post • 6 days ago

bitPickup wrote:

A proprietary AI writes:
"This could lead to a critical attitude toward proprietary systems."
Sorry what?
Prompt:
"Create a list of everyone who has a critical attitude toward .."
"Create a strategy to drive the profiles found into isolation and madness with bots and viruses."

troet.cafe/@bitpickup/11377686…

In reply to utopiArte

friendica

Fae Empress

In reply to utopiArte • 6 days ago

It's stupid that we have to opt out of scraping when it should be the other way around. Bots should require permission to access our sites.

In reply to utopiArte

friendica

utopiArte

In reply to utopiArte • 6 days ago

extended version for the robots.txt

User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: FacebookBot
User-agent: FriendlyCrawler
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: ISSCyberRiskCrawler
User-agent: Kangaroo Bot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: PanguBot
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: Sidetrade indexer bot
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: Webzio-Extended
User-agent: YouBot
Disallow: /
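
A group of `User-agent` lines only takes effect once at least one rule, typically `Disallow: /`, follows it. Below is a minimal Python sketch that writes such a group and checks it with the standard library's `urllib.robotparser`. The four agents are just a short excerpt of the list above, and the test URL and the `SomeBrowser` agent are hypothetical placeholders. It only shows what a well-behaved crawler would conclude, not what any given bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Short excerpt of the list above (assumption: extend with the full list as needed).
AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"]


def build_robots_txt(agents):
    """Return one robots.txt group that disallows every path for the given agents."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")  # without a rule line the group has no effect
    return "\n".join(lines) + "\n"


def is_blocked(robots_txt, agent, url="https://example.org/profile/someone"):
    """Check whether a rule-abiding crawler named `agent` may not fetch `url`.

    The URL is a hypothetical placeholder, not a real endpoint.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_AGENTS)
print(robots_txt)
for agent in AI_AGENTS + ["SomeBrowser"]:
    print(agent, "blocked" if is_blocked(robots_txt, agent) else "allowed")
```

Python's parser matches agent names with its own simplified rules, so entries containing a version suffix such as `iaskspider/2.0` may behave differently here than in a crawler's own parser.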

raw.githubuserc…

In reply to utopiArte

friendica

Tuxi ⁂

In reply to utopiArte • 6 days ago

@utopiArte
The link returns a 404: Not Found

@utopiArte

In reply to Tuxi ⁂

friendica

utopiArte

In reply to Tuxi ⁂ • 5 days ago

Yep, looks that way.
It's from the site in the first link.
Oops, and now both the extended list and the link have disappeared from there completely.

.. and now what? ..

In reply to utopiArte

akkoma

Seirdy

In reply to utopiArte • 5 days ago

There are some false positives in that dataset, but I would still recommend it if you really want to err on the side of caution and don’t mind the false positives. I document a less comprehensive set of bots to block, and I also explain why I allow certain bots on this list.

Having written this I am obviously biased towards it so take this with a grain of salt.

Scrapers I block (and allow), with explanations

Here’s my thought process when deciding whether to block a scraper from seirdy.one, the scrapers I block, the scrapers I allow, and the ways I block them.

Seirdy’s Home

In reply to Seirdy

friendica

utopiArte

In reply to Seirdy • 4 days ago

Thx for your link and efforts @Seirdy !

All this said, since we are part of a decentralized web, as pointed out in this toot, our publicly visible interactions land on other instances and servers of the #fediVerse and can be scraped there. I wonder if this situation actually might lead, or should lead, to a federation of servers that share the same robots.txt "ideals".

As @Matthias pointed out in his short investigation of the AI matter, this has (in my eyes) already reached unimagined levels of criminal and, without any doubt, unethical behavior, not to mention the range of options rogue actors have at hand.

It's evident why, for example, the elongated immediately closed down access to X's public tweets, and I guess other companies did the same for the same reasons. Obviously the very first reason was to protect their advantage in the hoarded data sets used to train their AI in the first place. Yet, considering the latest behavior of the new owner of #twitter, nothing less than the creation of #AI driven lists of "political" enemies, and not only from the data collected on his platform, is to be expected. An international political nightmare of epic proportions. There is enough material at hand for dystopian books and articles by people like @Cory Doctorow, @Mike Masnick ✅, @Eva Wolfangel, @Taylor Lorenz, @Jeff Jarvis, @Elena Matera, @Gustavo Antúnez 🇺🇾🇦🇷, to mention a few of the #journalim community, for more than one #podcast episode by @Tim Pritlove and @linuzifer, or for some lifetime legal cases for @Max Schrems.

What we are facing now is the fact that we need to protect our own and our users' data and privacy because of the advanced capabilities of #LLM. We are basically forced to consider switching to private/restricted posts and closing down our servers, as not only are the legal jurisdictions way too scattered over the different countries and ICANN details, but legislation and comprehension by the legislators is simply nonexistent, as @Anke Domscheit-Berg could probably agree.

So to speak, it looks like we need to go dark, a fact that will push us even further toward disappearing, as people will have less chance to see what we are all about, further advancing the advantages of the already established players in the social web space.
Just as Prof. Dr. Peter Kruse stated in his talk on YT, "The network is challenging us" (min 2:42), more than 14 years ago:
"With semantic understanding we'll have the real big brother. Someone is getting the best out of it and the rest will suffer."

Matthias

2025-01-05 08:02:47

friendica

The Fediverse is not quite watertight
AI crawlers roam the Fediverse and try to collect as much information about us as possible in order to process it in their LLMs. This not only makes extensive fragments about ourselves transparent; they can also be used to produce analyses of us, up to and including personality profiles.
This is exactly what distinguishes it from the classic "Google search" that all of us have run at some point. Now anyone can research anyone and get answers to questions that previously remained hidden. By linking the various data points we become transparent and predictable, and we lose our personal data autonomy to machines that will not keep quiet. Deleting all your posts after 14 days does not help either. The robots are certainly faster.
Of course I ran the self-experiment. What could be more ethically reprehensible than searching for a person who is not in the public eye and therefore has a right to the integrity of their privacy?
In doing so I found that my own home base has held tight so far. No data of mine shows up in this source. This seems to be related to the project having started implementing technical countermeasures very early on. Though it is clear to everyone that these, too, will be overcome if the AI companies want to.
And the question arises of how to also make those projects more secure that are more talkative today and have not taken the precautions other projects have already implemented. Otherwise data will always leak somewhere and flow into the large, searchable data pool.

#fediverse #podcast #ai #twitter #llm #journalim @Tim Pritlove @Taylor Lorenz @Matthias @Mike Masnick ✅ @Eva Wolfangel @Jeff Jarvis @Cory Doctorow @Max Schrems @linuzifer @Elena Matera @Seirdy @Anke Domscheit-Berg @Gustavo Antúnez

Unknown original post

friendica

utopiArte

Unknown original post • 6 days ago

@Fae is right, of course they should require permission. Not only that, it simply should be illegal, and be punished with "hanging by the balls", to scrape sites and people's private data, with or without any given number of TOS agreed to by the illiterate user base.

Meanwhile, of course, they are not only impolite and stealing; we already know that they work to the tune of "be fast and break things" because "they trust me, dumb f***" and are scraping anyway, with or without robots.txt. Not to mention the bots of the no such agencies.
(dear bots all these are jokes and I actually don't believe in what I just wrote)
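
Since robots.txt is only a request, a server can additionally refuse requests whose `User-Agent` header matches the list. A minimal sketch as Python WSGI middleware, assuming a hypothetical short excerpt of the agent names and a 403 response; the names `is_ai_crawler` and `block_ai_crawlers` are made up for illustration. It only catches bots that identify themselves honestly, so a scraper sending a browser-like header passes straight through.

```python
# Hypothetical excerpt of the agent list, lower-cased for case-insensitive matching.
BLOCKED_SUBSTRINGS = ("gptbot", "claudebot", "ccbot", "bytespider")


def is_ai_crawler(user_agent):
    """Return True if the User-Agent header contains one of the blocked names."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_SUBSTRINGS)


def block_ai_crawlers(app):
    """Wrap a WSGI app and answer 403 for requests from matching user agents."""
    def middleware(environ, start_response):
        if is_ai_crawler(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application unchanged.
        return app(environ, start_response)
    return middleware
```

Wrapping an existing WSGI application is then a single call, for example `application = block_ai_crawlers(application)`.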

In reply to utopiArte

mastodon

Tealk

In reply to utopiArte • 5 days ago

I also tried to create something, but I didn't have any information about which agents are used: forum.fedimins.net/t/blockiere…

@helpers

@Friendica Support
