The Case for Personal Data Ownership

Where's your data?

If you open LinkedIn or any other professional social feed (risky in 2025!), you will be swamped with takes on AI. Debates about the impact of AI are now a permanent fixture of these feeds, as are discussions about its relationship to labor, craft, and human judgment.

But at its core, this technology is about data processing. Foundation model companies have gathered enormous amounts of data, and their models have seen it, processed it, and learned from it. When you ask Claude or ChatGPT about physics, you're in part querying every physics textbook, paper, Wikipedia page, and website the model absorbed during training.

My question isn’t so much about whether or not this is useful (yes, it is) but more about the ethics of how information gets processed and used.

Oops, we gave away everything

In the late 1990s and early 2000s, I was very inspired by the open source movement. I attended the first Open Source Convention in 2000, and was completely starstruck getting to see heroes like Larry Wall, Bill Joy, and Tim O'Reilly in person. There was tremendous idealism then about sharing knowledge, freedom of information, and the opportunities the internet provided to better human lives through access to information. Ideologically, I still believe in much of that vision.

But I'm starting to wonder if we got some things wrong: "open everything" seems less and less like it benefits people, and more and more like it benefits the few companies with the resources to process information at that scale. In the early 2000s, it felt like copyright law and traditional rights holders were the ones keeping information out of people's hands. But now the intellectual work in the commons has been harvested, repackaged, and sold back to its creators via generative AI.

It’s about infrastructure

Fundamentally, frontier AI boils down to training data and processing power. The processing part, the infrastructure, is extraordinarily expensive, requiring significant skill, talent, and money to develop and maintain. The scale is almost unimaginable: training runs that tie up tens of thousands of GPUs and budgets that run into the hundreds of millions of dollars. Money buys not just the infrastructure but the talent, too.

But the data that's being processed was captured from the commons, from copyrighted works, and from people's everyday activities online.

Most folks are dependent on these platforms for daily tasks. If you're like me, you've clicked through endless user agreements and given away significant rights to your data in exchange for access to inexpensive software. We've been told the value we get from these services justifies this exchange, which, until recently, I believed. There's definitely something to that argument, but lately it's been hard not to wonder if it still adds up.

The math starts to change when we’re sold AI to replace someone’s labor and craft, sometimes competing directly with the very people who created the training data in the first place.

Also, digital identity is a problem

My identity has become increasingly digital, but access to my own data is something I often have to pay for: with money, with ad views, or with more data.

We've extended the boundaries of our personal identities outside our physical bodies, but we don't seem to own the digital parts of ourselves. Doesn’t this feel problematic?

The solution must be person-centric

The question becomes: how do we change the baseline? Can we create a person-centered approach to AI and data? How do we enable sharing and collaboration while making sure that people make the decisions about how their data is consumed, processed, and used?

Don't get me wrong: I’m bullish on AI as a technology, and I believe that there are plenty of valuable exchanges to be made. I’m happy to allow services that I value to process my data, but I’d like a little more say in how and when it happens. The status quo doesn't offer users real control, and that needs to change.

What can we build to fix this?

I think we can solve this problem by helping people gather their content in one place, store it safely, and manage how it gets used through an interface that still lets them actually use it.

The vision: photographers, writers, and musicians have control over their data in a useful, searchable format. They share access to subsets of that data with other artists to create data unions. Or maybe they train models on their own style, with proper attribution: models that could actually produce value because they're built on real work.

This could enable new forms of collaboration and value creation while keeping creators in control.

For most people, a lifetime of digital stuff probably amounts to something on the order of a million artifacts. For those artifacts, we would need:

  • Affordable, encrypted cloud/hybrid storage for individuals (something like Dropbox but without the filesystem baggage or a subscription trap)
  • Local indexing, metadata, and search (a minimal sketch follows this list). I think there is a ton of opportunity here to solve specific, hard product problems that help people. This is basically what folks are trying to do with RAG in the enterprise, right? It's not easy, but it needs to be solved for individuals, not just enterprises. I also see privacy-preserving metadata generation as a huge use case for edge AI.
  • Simple, transparent access control management. There's too much stuff for people to sift through it all, so a lot of this would need to be categorical and based on the metadata (e.g., a blanket set of permissions, or "photos of cats in 2012"). The UX needs to be easy and transparent.
  • Interfaces to make the data useful. This should be an ecosystem: we could provide APIs for structured software, MCP for agentic workflows, etc. The point isn't to hide our data in a bunker and hope that AI goes away. The point is to change the relationship between people, platforms, and data.
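
To make the indexing bullet concrete: below is a minimal sketch of what a local, private index could look like, using SQLite's built-in FTS5 full-text search. Everything here, the schema, the field names, the paths, is a hypothetical illustration; a real system would add embeddings, file watching, and much richer metadata.

```python
# A minimal sketch of a local, private index using SQLite's built-in
# FTS5 full-text search (available in most Python builds). The schema,
# field names, and paths are hypothetical illustrations.
import sqlite3

conn = sqlite3.connect("my_stuff.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS artifacts "
    "USING fts5(path, kind, created, content)"
)

def index_artifact(path: str, kind: str, created: str, content: str) -> None:
    """Store one artifact's metadata and extracted text in the local index."""
    conn.execute(
        "INSERT INTO artifacts (path, kind, created, content) VALUES (?, ?, ?, ?)",
        (path, kind, created, content),
    )
    conn.commit()

def search(query: str, limit: int = 10) -> list:
    """Full-text search that never leaves the device."""
    return conn.execute(
        "SELECT path, kind, created FROM artifacts WHERE artifacts MATCH ? LIMIT ?",
        (query, limit),
    ).fetchall()

index_artifact("photos/2012/cat_01.jpg", "photo", "2012-06-14", "tabby cat on the porch")
print(search("cat"))  # [('photos/2012/cat_01.jpg', 'photo', '2012-06-14')]
```

The property that matters is that the index and the queries both stay on the owner's device. Nothing about my cat photos leaves my machine unless I decide it should.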

The core question

This leads us to a simple question that we need better answers to: "Where's my stuff?"

We should have answers that are understandable and that serve the people who own the data: I should know where my stuff is and how it's being used, and I should be able to use it to benefit from the major advances we're seeing in AI, without feeling like I'm being used.

If someone builds a great AI product that can do zero-shot inference with my content, fantastic. I want to engage with it on my own terms and make clear, understandable exchanges about what's happening. If we're training models on my stuff, I should get attribution and a chance to benefit from the output, or I should be able to opt out if I don't want to be involved.
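
To make that concrete, here's a hedged sketch of the categorical permissions idea from the list above, expressed as data. The Rule fields, consumer names, and allowed() helper are all invented for illustration; the point is just that reading my 2012 cat photos and training a model on them should be separate, explicit grants.

```python
# A hypothetical sketch of categorical, metadata-driven permissions.
# Rule fields, consumer names, and the allowed() helper are invented
# for illustration; nothing here is a real product's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    consumer: str                 # who gets access, e.g. "collage-tool.example"
    kind: str                     # artifact category: "photo", "note", ...
    year: Optional[int] = None    # optional time scope
    allow_training: bool = False  # model training is a separate, explicit grant

# One blanket, categorical grant: this app may read my 2012 photos.
RULES = [Rule(consumer="collage-tool.example", kind="photo", year=2012)]

def allowed(consumer: str, meta: dict, purpose: str) -> bool:
    """Check whether any rule grants this consumer this use of an artifact."""
    for rule in RULES:
        if rule.consumer != consumer or rule.kind != meta["kind"]:
            continue
        if rule.year is not None and meta["year"] != rule.year:
            continue
        if purpose == "training" and not rule.allow_training:
            continue
        return True
    return False

cat_photo = {"kind": "photo", "year": 2012}
print(allowed("collage-tool.example", cat_photo, "read"))      # True
print(allowed("collage-tool.example", cat_photo, "training"))  # False: no opt-in
```

Attribution and revenue sharing could hang off the same rules: if a grant does include training, it can also carry the terms of that exchange.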
