Hacker News | brisky's comments

Does any current open source license address the question of AI/LLM training at all? Some OSS developers have a clear sentiment against it, but currently they cannot even pick a standard OSS license that aligns with their worldview.

One of these things is true:

1. Training AI on copyrighted works is fair use, so it's allowed no matter what the license says.

2. Training AI on copyrighted works is not fair use. Since pretty much every open source license requires attribution (even ones as lax as MIT do; only the nearly PD-equivalent ones like CC0, WTFPL, and the Unlicense don't), and AI doesn't give attribution, it's already disallowed by all of them.

So in either case, having a license mention AI explicitly wouldn't do any good, and would only make the license fail to comply with the OSD.


Point 2 misses the distinction between AI models and their outputs.

Let's assume for a moment that training AI (or, in other words, creating an AI model) is not fair use. That means that all of the license restrictions must be adhered to.

For the MIT license, the requirement is to include the copyright notice and permission notice "in all copies or substantial portions of the Software". If we're going to argue that the model is a substantial portion of the software, then only the model would need to carry the notices. And we've already settled that accessing software over a server doesn't trigger these clauses.

Something like the AGPL is more interesting. Again, if we accept that the model is a derivative work of the content it was trained on, then the AGPL's viral nature would require that the model be released under an appropriate license. However, it still says nothing about the output. In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

So far, though, in the US, it seems courts are beginning to recognize AI model training as fair use. Honestly, I'm not surprised, given that it was seen as fair use to build a searchable database of copyright-protected text. The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

But there is still the ethical question of disclosing the training material. Plagiarism still exists, even for content in the public domain. So attributing the complete set of training material would probably fall under this kind of ethical question, rather than the legal questions around intellectual property and licensing agreements. How you go about obtaining the training material is also a relevant discussion: fair use doesn't allow you to pirate material; it only governs what you can do with material once you've legally obtained it.

There are still questions for output, but those are, in my opinion, less interesting. If you have a searchable copy of your training material, you can do a fuzzy search of that material to return potential cases where the model returned something close to the original content. GitHub already does something similar with GitHub Copilot and finding public code that matches AI responses, but there are still questions there, too. It's more around matches that may not be in the training data or how much duplicated code needs to be attributed. But once you find the original content, working with licensing becomes easier. There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.
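The fuzzy-search idea above can be sketched with a toy pairwise matcher. This is only an illustration: real systems (like whatever Copilot uses) index at the token level rather than diffing every document, and `difflib` here is just a stand-in for that machinery. The corpus and threshold are made up.

```python
from difflib import SequenceMatcher

def flag_near_matches(model_output, corpus, threshold=0.8):
    """Compare a model response against known training snippets and
    return (doc_id, similarity) pairs at or above the threshold."""
    hits = []
    for doc_id, text in corpus.items():
        ratio = SequenceMatcher(None, model_output, text).ratio()
        if ratio >= threshold:
            hits.append((doc_id, round(ratio, 2)))
    return hits

corpus = {
    "snippet-1": "def add(a, b): return a + b",
    "snippet-2": "print('hello world')",
}
# An exact regurgitation scores 1.0 against its source snippet.
print(flag_near_matches("def add(a, b): return a + b", corpus))
```

Once a hit like this surfaces the original document, the licensing question becomes tractable, which is the point made above.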


> The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

> In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

It does if the software copies portions of itself into the output, which seems close enough to what LLMs do. The neuron weights are essentially derived from all the training data.

> There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.

That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The only way to not violate the license on the training data is to treat all output as potentially derived from all training data.


> You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output. The same model can be used across multiple software applications for different purposes. If I were to go to https://huggingface.co/deepseek-ai/DeepSeek-V3.2/tree/main (for example) and download those files, I wouldn't be able to reverse-engineer the training data without building more software.

Compare that to a search database, which needs the full text in an indexable format, directly associated with the document it came from. Although you can encrypt the database, at some point, it needs to have the text mapped to documents, which would make it much easier to reconstruct the complete original documents.

> That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The threshold of originality defines whether something can be protected by copyright. There are plenty of small snippets of code that can't be protected. But there are still questions about these small snippets that were consumed in the context of a larger, protected work, especially when there are only so many ways to express the same concept in a given language. It's definitely easier to reason about in written text than in code.


> The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output.

By that argument, a compressed copy of the Internet doesn't reproduce the Internet, the decompression software does. That's not a useful semantic distinction; the compressed file is the derived work, not the decompression software.


Testing your app/website when it has different behaviour depending on locale
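A concrete instance of locale-dependent behavior worth testing is number formatting, which changes with the active locale. A small sketch using Python's stdlib `locale` module (the `de_DE` locale name is an example and may not be installed on a given machine, hence the fallback):

```python
import locale

def grouped(value, loc):
    """Format a number under a given locale; returns None if the
    locale isn't installed. Thousands separators and decimal marks
    differ, e.g. 1,234.56 in en_US vs 1.234,56 in de_DE."""
    try:
        locale.setlocale(locale.LC_NUMERIC, loc)
    except locale.Error:
        return None  # locale not available on this machine
    try:
        return locale.format_string("%.2f", value, grouping=True)
    finally:
        locale.setlocale(locale.LC_NUMERIC, "C")  # restore default

print(grouped(1234.56, "C"))            # the C locale always exists
print(grouped(1234.56, "de_DE.UTF-8"))  # None if not installed
```

Tests that pin the locale explicitly, instead of inheriting the machine default, are what catch this class of bug.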


Real, but pretty minimal usage.


Great points. With Meta glasses and other similar gadgets, I think manual consent is not enough. There should be a 'protocol' to announce that you don't allow your image to be included in social media. I propose a QR code signifying that you don't want to be filmed. We need to push for legislation allowing (returning) such liberty. Once such automated consent is legally binding, it will be up to social media platforms to blur and anonymize individuals with such preferences. Finally, we will have a job where AI could be put to good use!
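As a sketch of what such a machine-readable opt-out could carry: the QR code might encode a small JSON payload that platforms check before publishing. To be clear, the scheme name and fields below are entirely invented; no such standard exists.

```python
import json

def make_optout_payload(blur_face=True, allow_capture=False):
    """Build a hypothetical 'do not film me' payload; a QR library
    would turn this string into the printable code."""
    return json.dumps(
        {"scheme": "privacy-optout/v0",
         "blur_face": blur_face,
         "allow_capture": allow_capture},
        sort_keys=True,
    )

def platform_should_blur(payload):
    """What a platform-side filter might check before publishing
    footage containing a person who displayed this code."""
    data = json.loads(payload)
    return (data.get("scheme") == "privacy-optout/v0"
            and data.get("blur_face", False))
```

The hard part isn't the payload, of course; it's detecting the code in footage and making the preference legally enforceable.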


You might want to check out Really Simple Decentralized Syndication (RSDS) https://writer.did-1.com/


I think I have it as well. But my theory is that we might have imagination, but it is only accessible to the subconscious. It is as if it is blocked from consciousness. I have ADHD as well; it might be that this is a protection mechanism that allows my kind of brain to survive in the world better (otherwise it would be too entertaining to get lost in your own imagination). As a kid, I used to daydream a lot.


It would be very useful for AI platform customers. You could run prompts with temperature 0 and check whether the results are the same, making sure the AI provider isn't swapping the pro model for a cheap one in the background and ripping you off.
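A minimal sketch of that check, where `ask` is a placeholder for whatever provider call you use (e.g. `lambda p: client.chat(prompt=p, temperature=0)`; the client API is assumed, not any real SDK). Note this is only a heuristic: even greedy decoding isn't always bit-identical across runs because of batching and floating-point nondeterminism.

```python
import hashlib

def responses_consistent(ask, prompt, runs=5):
    """Call the provider `runs` times with the same prompt at
    temperature 0 and compare response hashes. Divergent hashes
    are a red flag that the backing model (or serving config)
    changed between calls."""
    digests = {
        hashlib.sha256(ask(prompt).encode("utf-8")).hexdigest()
        for _ in range(runs)
    }
    return len(digests) == 1
```

In practice you'd run this periodically against a fixed prompt suite and alert when the consistency rate drops.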


Similar situation: I was an independent app publisher on the App Store, but I don't feel comfortable publishing my phone number next to my apps. I don't do customer support. This punishes indie app devs. After I saw this requirement, I decided to remove my app from the App Store.


It is possible that the digitization and improvement of taxi services were inevitable anyway.


They still haven't properly digitized; Curb sucks ass. I had to report a driver to Curb when he made me Zelle him because "Curb payment wasn't working."


Possible, yes. Probable?


Not only was it inevitable; if we were so inclined and willing to use the regulatory pen, we could've simply written into law that for taxis to operate, they must be well maintained and must accept all major forms of payment. And yeah, the taxi industry would've fought it, because every company ever has fought every regulation ever, no matter how much it stands to benefit both their customers and they them-fucking-selves. But companies having a say in how they are regulated is both how a taxi company would fight this, and how Uber, Airbnb, OpenAI, Meta, etc. blatantly and flagrantly violate the law and, instead of consequences, get fines and court hearings. So maybe we just shouldn't be allowing that?

It drives me up the goddamn wall how people will say shit like "the Taxi industry needed to be upended" when like... I mean, maybe? But on balance, given all the negative externalities associated with these companies, are they really a gain? Or are they just a different set of overlords, equally disinterested in providing a good service once they reach the scale where they no longer are required to give a shit?

Just... regulate the fuckers. Are you sick of filthy Taxis that break down? Put a regulation down that says if a cab breaks down during a trip, they owe the customer a free ride and five thousand dollars. You bet your ASS those cabs will be serviced as soon as humanly possible. This isn't rocket science y'all. Make whatever consequence the government is going to dispense immeasurably, clearly worse than whatever the business is trying to weasel out of doing, and boom. Solved.


> Not only was it inevitable; if we were so inclined and willing to use the regulatory pen, we could've simply written into law that for taxis to operate, they must be well maintained and must accept all major forms of payment.

That was frequently already the case. They were required to accept credit cards but then the card reader would be "broken" and it wasn't worth anybody's time to dispute it instead of just paying in cash.

You also... don't really want laws like that. They're required to accept "all payment methods", but which ones? Do they have to take American Express, even though the fees are much higher? Do they have to take PayPal if the customer has funds in a PayPal account? What about niche card networks, like store cards accepted at more than one merchant? If not those, and just Visa and Mastercard, you've now entrenched that duopoly in law.

> Are you sick of filthy Taxis that break down? Put a regulation down that says if a cab breaks down during a trip, they owe the customer a free ride and five thousand dollars. You bet your ASS those cabs will be serviced as soon as humanly possible. This isn't rocket science y'all.

It's not rocket science; it's trade-offs.

Is there a $5000 fine for a breakdown? You just made cab service much more expensive, because operators are either going to pay the fines as a cost of doing business and pass them on, or prophylactically do excessive maintenance, like full engine rebuilds every year, because it costs less than getting caught out once, and then pass on the cost of that. And even then, there is no such thing as perfect. The cabbie paid to have the whole engine rebuilt by the dealership just yesterday, and the dealer under-tightened one of the bolts when putting it back in, so there's a coolant leak? Normally that's just re-tightening the bolt and $20 worth of coolant, but now it's a $5000 fine on top of the $4000 engine rebuild.

The way you actually want to solve this is with competition, not rigid rules and onerous fines. If someone is always having breakdowns, then they get a bad rating; customers can see that when choosing and can opt for a different driver who costs slightly more, but only if the cost is worth the difference to them. Maybe it's worth $2 for the difference between two stars and five, but it isn't worth $50 for the difference between 4.7 and 4.8. Either way, you shouldn't be deciding for people; you should be giving them the choice.


> That was frequently already the case. ...the card reader would be "broken"

I traveled a lot to a smallish town for work before Uber got there and ran into this several times. After the second or third time, I started just saying "well, that sucks for you" and moving to leave. Suddenly it would work.

Yes it sucked, but it didn't really impact much.


> Just... regulate the fuckers.

That's true; however, we must also keep in mind that Uber (and its ilk) happened because the regular institutions failed to do this for one reason or another. I won't try to speculate why, because I have no idea (and of course it looks obvious in hindsight).

There was demand for safer and more reliable taxis. There was not enough supply for that. Governments hadn't paid enough attention to the sector. So, naturally, someone came along and used that whole situation to provide supply for the demand.

Of course it's not this simple, and there were a lot of other things going on. But if we narrow the scope down to just this, then we can see that the core problem here wasn't Uber; it was that governments were too slow to react in time.


I would rather ruin the taxi livelihood than have to argue with my driver about turning on the meter again


It is easy to confuse a mastermind with somebody who is simply willing to break the law.


Recently I posted about RSDS (Really Simple Decentralized Syndication), a protocol that tries to solve the RSS global content discovery problem. Here is the link if you are interested in reading more about it:

https://news.ycombinator.com/item?id=42654891

