
Provenance matters. An LLM cannot certify a Developer Certificate of Origin (https://en.wikipedia.org/wiki/Developer_Certificate_of_Origi...) and a developer of integrity cannot certify the DCO for code emitted by an LLM, certainly not an LLM trained on code of unknown provenance. It is well-known that LLMs sometimes produce verbatim or near-verbatim copies of their training data, most of which cannot be used without attribution (and may have more onerous license requirements). It is also well-known that they don't "understand" semantics: they never make changes for the right reason.

We don't yet know how courts will rule on cases like Does v Github (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.


For a large LLM, I think the science will in the end demonstrate that verbatim reproduction is not coming from verbatim recording, as the structure really isn't set up that way in the models in question here.

This is similar to the ruling by Alsup in the Anthropic books case that the training is “exceedingly transformative”. I would expect a contrary interpretation on this front from another case to be both problematic and likely to be eventually overturned.

I don't actually think provenance is a problem on the axis you suggest if Alsup's ruling holds. That said, I don't think that's the only copyright issue afoot: the Copyright Office's writing on the copyrightability of outputs from the machine essentially requires that the output fail the Feist tests for human copyrightability.

More interesting to me is how this might further realign the notion of copyrightability of human works as time goes on, moving from every trivial derivative bit of trash being potentially copyrightable to some stronger notion, following the Feist test, of independence and creativity. It also raises a fairly immediate question in an open-source setting: do many individual small patch contributions even pass those tests themselves? They may well not, although the general guidance is to set the bar low; but is a typo fix copyrightable either? There is so far to go down this rabbit hole.


I'd be fine with that if that were the way copyright law had been applied to humans for the last 30+ years, but it's not. Look into the OP's link on clean-room reverse engineering. I come from an RE background, and people are terrified of accidentally absorbing "tainted" information through extremely indirect means, because it can potentially be used against them in court.

I swear the ML community is able to rapidly change its mind as to whether "training" an AI is comparable to human cognition, based on whichever view is beneficial to it at any given instant.


So if you can get an LLM to produce music lyrics, for example, or sections from a book, those would be considered novel works given the encoding as well?


Depends if the music is represented by the RIAA or not :)


"an LLM" could imply an LLM of any size, for sufficiently small or focused training sets an LLM may not be transformative. There is some scale at which the volume and diversity of training data and intricacy of abstraction moves away from something you could reasonably consider solely memorization - there's a separate issue of reproduction though.

"novel" here depends on what you mean. Could an LLM produce output that is unique that both it and no one else has seen before, possibly yes. Could that output have perceived or emotional value to people, sure. Related challenge: Is a random encryption key generated by a csprng novel?

In the case of the US Copyright Office, if there wasn't sufficient human involvement in the production, then the output is not copyrightable, and how "novel" it is does not matter; but that doesn't necessarily affect a prior, copyrightable production by a human (whether a copy or not). Novelty also only matters in a subset of the many fractured areas of copyright law affecting this form of digital replication. The Copyright Office wrote: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell....

Where I imagine this approximately ends up is some set of tests oriented around how relevant the "copy" is to the whole. That is, it may not matter whether the method of production involved "copying"; what may matter more is whether the whole work in which it is included is at large a copy, or, for the area contested as a copy, whether it could be replaced with something novel. If it is a small enough piece of the whole, it may not meet some bar of material value to the whole to be relevant, meaning there is no harmful infringement, or it could similarly cross into some notion of fair use.

I don't see much sanity in a world where small snippets become an issue. If models were regularly producing thousands of tokens of exactly duplicated content, that's probably an issue.

I've not seen evidence of the latter outside of research that very deliberately performs an active search for high-probability cases (such as building suffix-tree indices over training sets, then searching for outputs guided by the index). That's very different from arbitrary work prompts doing the same, and the models have various defensive trainings and wrappings attempting to further minimize reproductive behavior. On the one hand you have research metrics like 3.6 bits per parameter of recoverable input; on the other hand, that represents a very small slice of the training set, and many such reproductions require long, carefully crafted prompts, meaning that for arbitrary real-world interaction the chance of large-scale overlap is small.
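To make concrete what that active-search tooling does, here is a minimal sketch in Python. A hash set of fixed-length token n-grams stands in for a real suffix-tree index, and the function names and the n=50 threshold are hypothetical choices for illustration:

    # Minimal sketch: flag model outputs that reproduce long verbatim
    # spans of a training corpus. Published work builds suffix-tree or
    # suffix-array indices over the full training set; a set of
    # fixed-length n-grams is a crude stand-in that only catches exact
    # matches of at least n tokens.

    def build_index(corpus_tokens, n=50):
        """Collect every n-token window of the corpus into a set."""
        return {tuple(corpus_tokens[i:i + n])
                for i in range(len(corpus_tokens) - n + 1)}

    def verbatim_spans(output_tokens, index, n=50):
        """Yield offsets where the model output matches the corpus exactly."""
        for i in range(len(output_tokens) - n + 1):
            if tuple(output_tokens[i:i + n]) in index:
                yield i

Running a check like this over outputs from arbitrary prompts rarely fires; the extraction results above come from first mining the index for promising targets and then prompting toward them.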


By novel, I mean: if I ask a model to write some lyrics or code and it produces pre-existing code or lyrics, is it novel and legally safe to use because the pre-existing code or lyrics aren't precisely encoded in a large enough model, and therefore legally not a reproduction, just coincidentally identical?


No. I don't think "novelty" would be relevant in such a case. How much risk you have depends on many factors, including what you mean by "use". If you mean sell, and you're successful, you're at risk. That would be true even if your content is not the same as other content but just similar. Copyright provides little to no protection from legal costs if someone is motivated to bring a case against you.


In the West you are free to make something that everyone thinks is a “derivative piece of trash” and still call it yours; and sometimes it will turn out to be a hit because, well, it turns out that in real life no one can reliably tell what is and what isn’t trash[0]—if it were possible, art as we know it would not exist. Sometimes what is trash to you is a cult experimental track to me, because people are different.

On that note, I am not sure why creators in so many industries are sitting around while they are being more or less ripped off by massive corporations, when music has got it right.

— Do you want to make a cover song? Go ahead. You can even copyright it! The original composer still gets paid.

— Do you want to make a transformative derivative work (change the composition, really alter the style, edit the lyrics)? Go ahead, you'd just damn well better make sure you license it first. …and you can copyright your derivative work, too. …and the original composer still gets credit in your copyright.

The current wave of LLM-induced AI hype really made the tech crowd bend itself into knots trying to paint this as an unsolvable problem that requires IP abuse, or as not a problem because it’s all mostly “derivative bits of trash” (at least the bits they don’t like, anyway), arguing in courts that it’s transformative, etc., while the most straightforward solution keeps staring them in the face. The only problem is that this solution does not scale, and if there’s anything the industry in which “Do Things That Don’t Scale” is the title of a hit essay hates, it is doing things that don’t scale.

[0] It should be clarified that if art is considered (as I do) fundamentally a mechanism of self-expression then there is, of course, no trash and the whole point is moot.


There's a whole genre of musicians focusing only on creating royalty-free covers of popular songs, so the music can be used in suggestive ways while avoiding royalties.

It's not art. It's parasitism of art.


> There's a whole genre of musicians focusing only on creating royalty-free covers

There is no such thing as a “royalty-free cover”. Either it is a full-on faithful cover, which you can perform as long as license fees are paid, in which case both the performer and the original songwriter get royalties, or it is a “transformative cover”, which requires negotiation with the publisher/rights owner (and in that case IP ownership will probably be split between songwriter and performer depending on their agreement).

(Not an IP lawyer myself so someone can correct me.)

Furthermore, in countries where I know how it works, as a venue owner you pay the rights organization a fixed sum per month or year and you are good to go and play any track you want. It thus makes no difference to you whether you play the original or a cover.

Have you considered that they are simply singer-performers who like to sing and would like to earn a bit of money from it, but don’t have many original songs of their own?

> It's parasitism of art

If we assume covers are parasitism of art, by that logic would your comment, which is very similar to dozens I have seen on this topic in recent months, be parasitism of discourse?

Jokes aside, a significant number of covers I have heard at cafes over years are actually quite decent, and I would certainly not call that parasitic in any way.

Even pretending they were, if you compare between artists specialising in covers and big tech trying to expropriate IP, insert itself as a middleman and arbiter for information access, devalue art for profit, etc., I am not sure they are even close in terms of the scale of parasitism.


> Have you considered that they are simply singer-performers who like to sing and would like to earn a bit of money from it, but don’t have many original songs of their own?

Or, maybe you start to pay attention?

They are selling their songs cheaper for TV, radio or ads.

> Even pretending they were, if you compare between artists specialising in covers and big tech trying to expropriate IP

They're literally working for Spotify.


> They are selling their songs cheaper for TV, radio or ads.

I guess that somehow refutes the points I made, I just can’t see how.

Radio stations, like the aforementioned venue owners, pay the rights organizations a flat annual fee. TV programs do need to license these songs (as, unlike the simple cover case, the use there is substantially transformative), but again: 1) it does not rip off songwriters (the holder of songwriter rights for a song gets royalties for performances of its covers, and the songwriter has a say in any such licensing agreement), and 2) often a cover is a specifically considered and selected choice: it can be miles better fitting for a scene than the original (just remember Motion Picture Soundtrack in that Westworld scene), and unlike the original it does not tend to make the scene all about itself so much. It feels like you have yet to demonstrate how it is particularly parasitic.

Edit: I mean honest covers; modifying a song a little bit and passing it as original should be very sueable by the rights holder and I would be very surprised if Spotify decided to do that even if they fired their entire legal department and replaced it with one LLM chatbot.


I know of restaurants and bars that choose to play cover versions of well-known songs because the costs are so much less.


I really doubt you would ever license any specific songs as a cafe business. You should be able to pay a fixed fee to a PRO and have a blanket license to play almost anything. Is it so expensive in the US, or perhaps they do not know that this is an option? If the former, and those cover artists help those bars keep their expenses low and offer you a better experience while charging less—working with the system, without ripping off the original artists, who still get paid their royalties—does it seem particularly parasitic?


The example I was referring to was not in the US.

A restaurant / cafe may pay a fixed fee and get access to a specific catalog of songs (performances). The fee depends on what the catalog contains. As you can imagine, paying for the right to only play instrumental versions of songs (no singers, no lyrics) is significantly cheaper. Or, having performances of songs by unknown people.


Two countries where I know how it works from a venue business owner's perspective work this way. The fees seemed pretty mild; that’s why I asked if it’s too expensive in your country (which I guess is not the US).


There are several sides to music copyright:

1. The lyrics

2. The composition

3. The recording

These can all be owned by different people or the same person. The "royalty free covers" you mention are cases of people abusing one of those rights. They're not avoiding royalties; they just haven't been caught yet.


I believe performance of a cover still results in relevant royalties paid to the original songwriter, just sans the performance fee, which does not strike me as a terrible ripoff (after all, a cover did take effort to arrange and perform).


What this person is talking about is they write “tvinkle tvinkle ittle stawr” instead of the real lyrics (basically just writing the words phonetically and/or misspelled) to try and bypass the law through “technicalities” that wouldn’t stand up in court.


I doubt that, for a few reasons based on how they described this alleged parasitic activity, but mainly because the commenter alluded to Spotify doing this. It would be very surprising if they decided to do something so blatantly illegal when they could keep extracting money by the truckload with their regular shady shenanigans that do not cross the legality line so obviously.

Regarding what you described, I don’t think I have encountered it in the wild enough to remember. IANAL, but if not cleared/registered properly as a cover, it doesn't seem to be a workaround or abuse; it would probably be found straight-up illegal if the rights holder or relevant rights organization cares to sue. In this case, all I can say is “yes, some people do illegal stuff”. The system largely works.


> For a large LLM I think the science in the end will demonstrate that verbatim reproduction is not coming from verbatim recording

We don't need all this (seemingly pretty good) analysis. We already know what everyone thinks: no relevant AI company has had their codebase or other IP scraped by AI bots they don't control, and there's no way they'd allow that to happen, because they don't want an AI bot they don't control to reproduce their IP without constraint. But they'll turn right around and be like, "for the sake of the future, we have to ingest all data... except no one can ingest our data, of course". :rolleyes:


This is how SQLite handles it:

> Contributed Code

> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.

Source: https://www.sqlite.org/copyright.html


There are only so many ways to code quite a few things. My classmate and I once got in trouble in high school for having identical code for one of the tasks at a coding competition, down to variable names and indentation. There is no way he could or would steal my code, and I sure didn't steal his.


An LLM can be used for a clean room design so long as all (ALL) of its training data is in the clean room (and consequently does not contain the copyrighted work being reverse engineered).

An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This doesn't detract from the core of your point, that LLM output may be copyright-contaminated by LLM training data. Yes, but that doesn't necessarily mean that an LLM's output cannot be a valid clean-room reverse-engineering effort.


> An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This assumes you are only concerned with one particular work, when you actually need to be sure you are not copying any work that might be copyrighted without having a valid license that you are abiding by.


The "clean room" in "clean room reverse engineering" refers to a particular set of trade secrets, yes. You could have a clean room and still infringe if an employee in the room copied any work they had ever seen.

The clean room has to do with licenses and trade secrets, not copyright.


Or you know, they just feel like code should be free. Like beer should be free.

We didn't have this whole issue 20 years ago because nobody gave a shit. If your code was public, and on the internet, it was free for everyone to use by definition.


The colloquial definitions have always been more cultural than technical, but it's become more acute recently.

> I think we should shed the idea that AI is a technological artifact with political features and recognize it as a political artifact through and through. AI is an ideological project to shift authority and autonomy away from individuals, towards centralized structures of power. https://ali-alkhatib.com/blog/defining-ai


Many journals require LaTeX due to their post-acceptance pipeline. I use Typst for letters and those docs for which my PDF is the final version (modulo incomplete PDF/A in Typst), but for many journals in my field, I'd need a way to "compile to LaTeX" or the journal would need to implement a post-acceptance workflow for Typst (I'm not aware of any that have).


Right, I guess that’s my point: If Typst wants to compete with LaTeX, IMO it needs some sort of mechanism by which journals will deem a Typst submission acceptable, along with readily available templates for said submissions. That’s a big hill to climb probably, but probably the single most valuable development they could achieve from a product diffusion perspective.


Typst doesn't even have a stable release yet. Give it some time, I genuinely believe it will get there.


Interestingly enough, Elsevier, for example, accepts LaTeX but has its own typesetting backend. This typically means the last editing steps are quite annoying, because even if one uses the provided LaTeX templates, the actual final typesetting is done by some mechanical-turk editor on a slightly different publishing system.


Exactly - they require LaTeX not only to make it match the style, but because the final document is a prepared LaTeX work. Sometimes you can even see all the hooks and such that are waiting for \include and similar.


Can you use AI tools to convert Typst to LaTeX? That would remove the acceptance issue.


Their developers have intent. That intent is to give the perception of understanding/facts/logic without designing representations of such a thing, and with full knowledge that as a result, it will be routinely wrong in ways that would convey malicious intent if a human did it. I would say they are trained to deceive because if being correct was important, the developers would have taken an entirely different approach.


Generating information without regard to the truth is bullshitting, not necessarily malicious intent.

For example, this is bullshit because it’s words with no real thought behind it: “if being correct was important, the developers would have taken an entirely different approach”.


If you are asking a professional high-stakes questions about their expertise in a work context and they are just bullshitting you, it's fair to impugn their motives. Similarly if someone is using their considerable talent to place bullshit artists in positions of liability-free high-stakes decisions.

Your second comment is more flippant than mine, as even AI boosters like Chollet and LeCun have come around to LLMs being tangential to delivering on their dreams, and that's before engaging with formal methods, V&V, and other approaches used in systems that actually value reliability.


> customers at the national labs are not going to be sharing custom HPC code with AMD engineers

There are several co-design projects in which AMD engineers are interacting on a weekly basis with developers of these lab-developed codes as well as those developing successors to the current production codes. I was part of one of those projects for 6 years, and it was very fruitful.

> I suspect a substantial portion of their datacenter revenue still comes from traditional HPC customers, who have no need for the ROCm stack.

HIP/ROCm is the prevailing interface for programming AMD GPUs, analogous to CUDA for NVIDIA GPUs. Some projects access it through higher level libraries (e.g., Kokkos and Raja are popular at labs). OpenMP target offload is less widespread, and there are some research-grade approaches, but the vast majority of DOE software for Frontier and El Capitan relies on the ROCm stack. Yes, we have groaned at some choices, but it has been improving, and I would say the experience on MI-250X machines (Frontier, Crusher, Tioga) is now similar to large A100 machines (Perlmutter, Polaris). Intel (Aurora) remains a rougher experience.


The point is that LLMs are never right for the right reason. Humans who understand the subject matter can make mistakes, but they are mistakes of a different nature. The issue reminds me of this from Terry Tao (LLMs being not-even pre-rigorous, but adept at forging the style of rigorous exposition):

It is perhaps worth noting that mathematicians at all three of the above stages of mathematical development can still make formal mistakes in their mathematical writing. However, the nature of these mistakes tends to be rather different, depending on what stage one is at:

1. Mathematicians at the pre-rigorous stage of development often make formal errors because they are unable to understand how the rigorous mathematical formalism actually works, and are instead applying formal rules or heuristics blindly. It can often be quite difficult for such mathematicians to appreciate and correct these errors even when those errors are explicitly pointed out to them.

2. Mathematicians at the rigorous stage of development can still make formal errors because they have not yet perfected their formal understanding, or are unable to perform enough “sanity checks” against intuition or other rules of thumb to catch, say, a sign error, or a failure to correctly verify a crucial hypothesis in a tool. However, such errors can usually be detected (and often repaired) once they are pointed out to them.

3. Mathematicians at the post-rigorous stage of development are not infallible, and are still capable of making formal errors in their writing. But this is often because they no longer need the formalism in order to perform high-level mathematical reasoning, and are actually proceeding largely through intuition, which is then translated (possibly incorrectly) into formal mathematical language.

The distinction between the three types of errors can lead to the phenomenon (which can often be quite puzzling to readers at earlier stages of mathematical development) of a mathematical argument by a post-rigorous mathematician which locally contains a number of typos and other formal errors, but is globally quite sound, with the local errors propagating for a while before being cancelled out by other local errors. (In contrast, when unchecked by a solid intuition, once an error is introduced in an argument by a pre-rigorous or rigorous mathematician, it is possible for the error to propagate out of control until one is left with complete nonsense at the end of the argument.)

https://terrytao.wordpress.com/career-advice/theres-more-to-...


The cross-over can be around 500 (https://doi.org/10.1109/SC.2016.58) for 2-level Strassen. It's not used by regular BLAS because it is less numerically stable (a concern that becomes more severe for the fancier fast MM algorithms). Whether or not the matrix can be compressed (as sparse, fast transforms, or data-sparse such as the various hierarchical low-rank representations) is more a statement about the problem domain, though it's true a sizable portion of applications that produce large matrices are producing matrices that are amenable to data-sparse representations.
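For concreteness, here is a one-level-per-recursion Strassen sketch in Python/NumPy, assuming square matrices whose dimension is a power of two; the crossover constant is illustrative, not tuned:

    import numpy as np

    CROSSOVER = 512  # below this, classical (BLAS) matmul wins; illustrative

    def strassen(A, B):
        """Strassen multiply for square, power-of-two-sized matrices.

        Seven recursive products replace eight classical ones; the extra
        additions/subtractions are the source of the weaker
        numerical-stability bounds mentioned above."""
        n = A.shape[0]
        if n <= CROSSOVER:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

For n = 2048 this recurses two levels before handing off to BLAS, matching the 2-level scheme referenced above.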


> It's trivially easy to find a real-world situation where conservation of energy does not hold (any system with friction, which is basically all of them)

Conservation of energy absolutely still holds, but entropy is not conserved so the process is irreversible. If your model doesn't include heat, then discrete energy won't be conserved in a process that produces heat, but that's your modeling choice, not a statement about physics. It is common to model such processes using a dissipation potential.
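The standard worked example is a linearly damped oscillator, where the Rayleigh dissipation potential R(x') = (c/2) x'^2 supplies the drag force -dR/dx' = -c x':

    % Damped oscillator: mechanical energy decays at exactly the rate
    % heat is produced; add a thermal term with dQ/dt = c \dot{x}^2
    % and total energy is conserved again.
    \begin{align*}
      m\ddot{x} + c\dot{x} + kx &= 0, \\
      E_{\mathrm{mech}} &= \tfrac{1}{2} m\dot{x}^2 + \tfrac{1}{2} kx^2, \\
      \frac{dE_{\mathrm{mech}}}{dt} &= \dot{x}\,(m\ddot{x} + kx) = -c\dot{x}^2 \le 0.
    \end{align*}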


Right, but I'm saying that it's all modeling choices, all the way down. Extend the model to include thermal energy and most of the time it holds again - but then it falls down if you also have static electricity that generates a visible spark (say, a wool sweater on a slide) or magnetic drag (say, regenerative braking on a car). Then you can include models for those too, but you're introducing new concepts with each, and the math gets much hairier. We call the unified model where we abstract away all the different forms of energy "conservation of energy", but there are a good many practical systems where making tangible predictions using conservation of energy gives wrong answers.

Basically this is a restatement of Box's aphorism ("All models are wrong, but some are useful") or the ideas in Thomas Kuhn's "The Structure of Scientific Revolutions". The goal of science is to go from concrete observations to abstract principles which ideally will accurately predict the values of future concrete observations. In many cases, you can do this. But not all. There is always messy data that doesn't fit into neat, simple, general laws. Usually the messy data is just ignored, because it can't be predicted and is assumed to average out or generally be irrelevant in the end. But sometimes the messy outliers bite you, or someone comes up with a new way to handle them elegantly, and then you get a paradigm shift.

And this has implications for understanding what machine learning is and why it's important. Few people would think that a model linking background color to likelihood of clicking on ads is a fundamental physical quality, but Google had one 15+ years ago, it was pretty accurate, and it made them a bunch of money. Similarly, most people wouldn't think of a model of the English language as a fundamental physical quality, but that's exactly what an LLM is, and they're pretty useful too.


It's been a long time since I have cracked a physics book, but your mention of interesting "fundamental physical quantities" triggered the recollection of a conservation-of-information result in quantum mechanics, where you can come up with an action whose equations of motion are Schrödinger's equation and whose conserved quantity is a probability current. So I wonder to what extent (if any) it might make sense to approach these things in terms of the really fundamental quantity: information itself.


Approaching physics from a pure information-flow perspective is definitely a current research topic. I suspect we see less popsci treatment of it because almost nobody understands information at all, and trying to apply it to physics, which almost nobody understands either, is probably at least three or four bridges too far for a popsci treatment. But it is a current and active topic.


This might be insultingly simplistic, but I always thought the phrase "conservation of information" just meant that the time-evolution operator in quantum mechanics is unitary. Unitary mappings are always bijective functions, so it makes intuitive sense to say that all information is preserved. However, it does not follow that this information is useful to actually quantify, like energy or momentum. There is certainly a branch of applied mathematics called "information theory", but I doubt it has much relevance to the term "conservation of information" as it is used in fundamental physics.

The links below lend credibility to my interpretation.

https://en.wikipedia.org/wiki/Time_evolution#In_quantum_mech...

https://en.wikipedia.org/wiki/Bijection

https://en.wikipedia.org/wiki/Black_hole_information_paradox
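For concreteness, the property in question, in standard notation (a sketch; U(t) is the time-evolution operator generated by the Hamiltonian H):

    % Unitary evolution U(t) = e^{-iHt/\hbar} satisfies U^\dagger U = I,
    % so total probability is preserved and U is invertible: distinct
    % initial states stay distinct, which is all "conservation of
    % information" asserts in this reading.
    \begin{align*}
      |\psi(t)\rangle &= U(t)\,|\psi(0)\rangle, \\
      \langle\psi(t)|\psi(t)\rangle &= \langle\psi(0)|U^\dagger U|\psi(0)\rangle
        = \langle\psi(0)|\psi(0)\rangle, \\
      |\psi(0)\rangle &= U(t)^\dagger\,|\psi(t)\rangle.
    \end{align*}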


What is the state of tiling in Plasma 6 and how does it compare to Pop Shell (https://github.com/pop-os/shell) for GNOME?


They tripled the price of my grandmother's service over a five-year period despite no speed increases, so I figure they were going to charge you more regardless.

