The trouble with openness

November 27, 2024

Back in the early 2000s, there was this nebulous idea called the semantic web. It never really went anywhere, but I found it exciting at the time.

One piece that particularly spoke to me was the notion of including data in websites so that web scrapers could easily get at it. This was supposed to make the web more open and interoperable. The problem was that the data needed to be well structured and rigorously defined. Only a few nerds cared to put in the effort for that, and even those nerds had endless arguments about how it should work.

Fast forward a bit and a few things happened. First, the hippie optimism of the early web wore off. Few people wanted their data to be “open”, and most large companies took pains to keep their data to themselves. Second, APIs took off in a huge way. If you were going to share data with people, you were likely to do it through an API rather than by encoding it directly in the website content. This provided more control over who could access the data. Third, even most of the nerds from earlier gave up on building websites in a rigorous way. It became easier to use query selectors or natural language tools to scrape data from the few sites that would still let you.

The semantic web was an idealistic dream, but it was a nice one for a time. Now things are much more closed, for better and worse.

This morning I read a piece on 404 Media about a machine learning researcher who released a set of Bluesky data for AI training purposes. The community blew up about this and the researcher removed the data set. A lot of people got really upset.

I’ve been ranting all year about the lack of privacy and surprising openness of Bluesky. It’d be stupid of me to assume anyone joining the service would care about that sort of thing, but I think it’s a big reason why Bluesky is interesting. I also think that, for every earnest researcher sharing their work, there are at least a dozen shady companies silently scraping Bluesky data for monetary gain. The data in the Bluesky is free and you can take it home!

The Bluesky team has been clear about this from the start, but no user signing up cares about the system’s architecture. It’s a nicer place than X or Threads, and it’s popping off right now. It really does remind me a bit of the semantic web days. What if the whole system could be completely open, and people could build whatever cool things they wanted on it? Andy Baio recently posted a thread of interesting things people were building on the firehose. I even spent some time this weekend building a tiny demo.

Exciting stuff! But, when I say Bluesky’s data is open, I mean it’s really open. That leads to trade-offs. Anyone can take all the data, and I’m betting they have. The limits are legal and ethical, not technological. I don’t trust the ethics of most companies. Bluesky’s data is currently more open than data on the public web, which is saying something. Most websites try to protect themselves with a robots.txt file. This file is just a suggestion, and while I think there are some web crawlers that respect it, many don’t. However, there are other ways to prevent scraping on the web. If you don’t believe me, I invite you to try scraping Facebook or Google.
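To make the "just a suggestion" point concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt contents, the bot names, and the example.com URLs are all made up for illustration. The thing to notice is that the check runs inside the crawler: a polite crawler asks first, and a rude one simply never calls this function.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site. The rules and the
# bot name "ExampleAIBot" are assumptions made up for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def polite_can_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit this agent to fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named bot is asked to stay away from the whole site.
    print(polite_can_fetch("ExampleAIBot", "https://example.com/posts/1"))
    # Any other bot is only asked to avoid /private/.
    print(polite_can_fetch("SomeOtherBot", "https://example.com/posts/1"))
    print(polite_can_fetch("SomeOtherBot", "https://example.com/private/notes"))
```

Nothing on the server enforces any of this. The file only states a preference, and it works only for crawlers that choose to read it and obey.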

Also, while it seems from the outside that Bluesky is trying to do the right thing, they’re also backed by venture capital money from firms that fund crypto and AI. They’re trying to build the platform in a way that is “billionaire-proof”, but I’m not certain it’s possible. At the very least, a billionaire could — today — easily download all the data and use it in ways many people wouldn’t agree with, but that might be totally legal.

This leads to the “fair use” bit (aka, “fair dealing” in Canada). This goes for all content, but I think it’s especially important when the content can be gathered in bulk. Is training LLM systems a “transformative use” of content? I don’t know, but I think it might end up being declared so. It certainly sucks to have your work slurped up and used to make money by someone else. But if LLM training is deemed to be transformative and fair use, then that’s another trade-off. Do you like that Archive.org preserves old websites and abandoned video games? I sure do! Did you know that they disregard robots.txt files to do it? It probably goes without saying, but many video game copyright holders don’t love that old games are preserved there. I think scraping content for research is reasonable, if tricky. Here’s a great thread from someone who outlines the trade-offs involved.

I also value the idea that people should be able to use a snippet of a movie or some gameplay when critiquing or reviewing it. I believe this type of work is transformative, but some copyright holders hate it. I don’t think the copyright holder should always get the final say in how their works are used. Everything is a remix, as they say.

However you feel about all of this, it would be nice to at least have some of the options available on the public web. Something like a robots.txt for Bluesky would be welcome, and the Bluesky team say they’re looking at ways to do this. But, like robots.txt, I think this will end up being more of a suggestion than a roadblock. The whole Bluesky system is engineered to be open.

I’ve done work for companies that have paywalled content. You want regular visitors to pay for that content, but you also want Google to be able to index it. In many cases, leaving the door open for Google also leaves it open to services like Archive.is or 12ft.io. They pretend to be Google, slurp up the content, and serve it paywall-free. There are ways to prevent this, but it’s a game of cat and mouse. If the Bluesky team come up with a solution, I suspect it will have similar issues.
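Here is a rough Python sketch of why that door is so hard to leave only half open. It assumes the simplest possible paywall, one that trusts the User-Agent header to spot Google's crawler. The function name and the logic are hypothetical and not how any particular site or bypass service actually works. The spoofed string is the user agent Googlebot publicly identifies itself with, which is exactly the problem: anyone can send it.

```python
# A user agent string that Googlebot publicly identifies itself with.
# Any client can send the same header, which is the weakness shown here.
SPOOFED_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def should_serve_full_article(user_agent: str, is_subscriber: bool) -> bool:
    """Naive, hypothetical paywall check.

    Subscribers get the full text, and so does anything that merely
    claims to be Googlebot in its User-Agent header.
    """
    claims_googlebot = "googlebot" in user_agent.lower()
    return is_subscriber or claims_googlebot


if __name__ == "__main__":
    # An ordinary, non-paying browser hits the paywall.
    print(should_serve_full_article("Mozilla/5.0 (Windows NT 10.0)", False))
    # The same non-paying visitor, now claiming to be Googlebot, gets through.
    print(should_serve_full_article(SPOOFED_UA, False))
```

One documented countermeasure is to verify a claimed Googlebot by reverse DNS lookup rather than trusting the header, but every check like this adds work for the site and invites the next workaround.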

So, what to make of all this? Mostly, I think it’s good for more people to understand that Bluesky is extremely open. If you’re posting content there, do so with the knowledge that anyone can access it in bulk. This could potentially lead to very cool things, but it’s equally possible people might do things with it that you might not like. It sure would be nice if people only consumed your works in the way you preferred. On the other hand, I think fair use is important and it’s possible it doesn’t align with all of your preferences. Like everything, it’s a bunch of trade-offs.

It’s also possible that, like the idea of the semantic web, the openness of Bluesky’s protocol might be too idealistic. I guess we’ll all find out together how this shakes out.