1. What does "source code" even mean, in this context? I don't buy the requirement that all training data needs to also be open sourced for any of the things that people normally discuss with regard to open source topics (though that kind of openness would be good for different reasons).
Network weights and biases plus the model architecture are things that can be directly built upon, just like graphics are, and I would count a photo licensed under, say, MIT, as "open source" even if the JPEG codec on the camera which took the photo was not.
b. https://opensource.org/license/nasa1-3-php - "Notwithstanding any provisions contained herein, Recipient is hereby put on notice that export of any goods or technical data from the United States may require some form of export license from the U.S. Government. Failure to obtain necessary export licenses may result in criminal liability under U.S. laws. Government Agency neither represents that a license shall not be required nor that, if required, it shall be issued. Nothing granted herein provides any such export license."
Simply put, it should be reproducible with the published material. An ML model is definitely not reproducible without the training data (or a recipe to reproduce the training data).
This was said with source code and binaries in mind. Source code is easy to verify, modify and rebuild. Binaries are in practice non-modifiable and non-verifiable, except through costly reverse engineering.
Large models don’t work like that.
The only practical way to modify a model (finetune, LoRA, merge) is in its binary form. A source dataset may be interesting, but it's non-modifiable and non-reproducible in practice due to training costs. The rebuild process is usually non-deterministic, so "verify" is basically not an option.
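To illustrate the "modify in binary form" point: merging a LoRA adapter into a model needs only the weight matrices themselves, no training data. A minimal sketch (pure-Python matrices, hypothetical helper names, not any particular library's API):

```python
# Sketch: merging a low-rank (LoRA) adapter into base weights.
# W' = W + alpha * (B @ A) -- operates directly on the "binary" weights.

def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, B, A, alpha=1.0):
    delta = matmul(B, A)  # low-rank update, same shape as W
    return [[w + alpha * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# 2x2 base weights, rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # 2x1
A = [[0.5, 0.5]]     # 1x2
merged = merge_lora(W, B, A)
print(merged)  # [[1.5, 0.5], [1.0, 2.0]]
```

Nothing in this operation touches the dataset or the training code, which is why the weights themselves are the practical unit of reuse.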
So technically true, practically complicated. Open weights and open dataset would be better terms.
The term open source fell prey to hype, ignorance and VC money in the case of AI models. The most incorrect uses are for literal binary blobs without a recipe to reproduce them (no training data in the majority of cases), and for products that are basically a thin layer over OpenAI calls, which are neither privacy-preserving, local, nor free.
The point is moot. Like anything else created by an automated computer process, model weights are not protected under US copyright law. The exclusion of machine-generated works from copyright protection has been pretty well established by the US Copyright Office. In fact, going off letters and rulings recently published by the US Copyright Office[1], even the outputs of generative models are excluded from copyright protection, regardless of the amount of human skill (e.g., prompt composition, parameter selection) involved in their production.
IANAL, but at most, publishing weights with a license may amount to little more than a ToS agreement, allowing distributors a bit more leeway in managing their legal/commercial relationship with recipients of said models. In other words, breaking the terms laid out in a text file entitled "LICENSE.txt" and distributed alongside a set of model weights may constitute a breach of contract, but it is in no way a copyright violation.
> What constitutes Open Source is not vague, btw. it's well defined by the OSI [1].
I'm pretty surprised to see objections to my claims here, I kinda thought I was just stating the obvious.
> Who is this organisation that they get to mandate the definition of the english language?
> What authority do they have to define the term “open source”?
This term has meaning, and while its meaning started with a group of people forming an organization and saying "this is what this means", its meaning doesn't derive from OSI (nor from some farcical aquatic ceremony). Its meaning comes from its popular use in language. For example, Wikipedia does describe this term and how it came to be used [1].
IMO it would be unclear and confusing to use the same term Open Source to describe both what it has historically described and how model weights like these are distributed. The term "Open Source" itself was coined to disambiguate merely "open" source from "free-as-in-freedom" source.
Probably bad editorializing on the submitter's part – I don't see open source being mentioned anywhere on Qualcomm's Hugging Face (although at least some models are distributed under an open source license, it seems – not unlike blobs in the Linux kernel).
Sorry, what's not open source about them? I've only checked a few models but they look to be under a BSD-3-clause license. The first few I looked at all have the same BSD-3 license [0] [1] [2].
Are you saying they've just repacked other existing models under their own banner but haven't opened sourced some other component?
From the link the person you're replying to posted:
> The program must include source code, and must allow distribution in source code as well as compiled form.
The compute graph with trained weights is very much a compiled form of the model. The source code would include everything needed to train that model and reproduce it.
> The source code would include everything needed to train that model and reproduce it.
You know these models are trained on internet scrape which contains copyrighted content, so the dataset can't be open sourced. It's either this or bad models.
In theory, you must have written some code to train the models and download the data. Just opening that code, plus adding logging to record the sources trained on, would achieve truly "open source" (anybody could then go scrape and train the same way you did and arrive at the same outcome/model).
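A sketch of what "logging the sources" could look like (hypothetical helper, real pipelines would be far more involved): record each document's URL and a content hash into a manifest, so a future re-scrape can be checked against the original corpus.

```python
import hashlib
import json

def add_to_manifest(manifest, url, content):
    """Record where a training document came from, plus a hash so a
    future re-scrape can verify it fetched the same bytes."""
    manifest.append({
        "url": url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "bytes": len(content),
    })

manifest = []
add_to_manifest(manifest, "https://example.com/doc1", b"hello world")
print(json.dumps(manifest, indent=2))
```

Even without redistributing the copyrighted content itself, a manifest like this would make the training recipe reproducible in principle.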
I'm not saying "opening models is bad" – it's good. However, IMO it would be nice to have a semantic way to differentiate between those two.
> The program must include source code, and must allow distribution in source code as well as compiled form.
Can you reproduce these models? If not, then it's probably not open source. With a model, the closest analog to source code seems to be the training data. Is that all published?
ML training runs are not reproducible: GPUs are non-deterministic when doing large sums, the order in which operands are added changes the result, and thread execution times also depend on caching, which is hard to predict. If you want to force deterministic mode, be prepared for a huge slowdown.
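The order-dependence is just floating-point non-associativity, visible even without a GPU:

```python
# Floating-point addition is not associative, so summation order matters.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # a and b cancel exactly first, then c survives
right = a + (b + c)  # c is absorbed: 1.0 is below the rounding step at 1e16

print(left, right, left == right)  # 1.0 0.0 False
```

On a GPU, thousands of threads accumulate partial sums in an order that varies run to run, so gradients (and therefore the final weights) drift between otherwise identical training runs.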
A lot of work at larger orgs is put into reproducible training runs. It's about the only way to debug 'did the small parameter tweak tank performance because of a hardware failure or is there something special about that parameter?'
Each time I see a comment like this I wonder who declared the open source initiative the single authority on this. I agree with the comment but this really rubs me the wrong way.
Oh, thanks. OSI has been around much longer than I expected.
This post[0] dives into who coined the term (spoiler: it predates OSI by a long, long time), but it’s reasonable that OSI popularized it alongside their specific definition.
Open Source is more like a designation. It is an agreed upon set of requirements that, if you change a requirement, it is something else. This is important.
Some things have legally protected designations such as 'ice cream'. Ice Cream has specific meaning in industry and even a grading system. If someone wants to make a cheaper product than the lowest grade of ice cream, they can't call it ice cream, they have to call it something like: frozen dairy dessert.
This makes it easy for people to understand what they are actually getting and paying for.
I wouldn't get indignant about mandating English language definitions. I would be indignant that AI companies are not fulfilling the requirements to call their models open source, and are providing a cheaper product than the abilities an actual open source model would provide.
It also has an obvious English meaning. The source of the model isn't fully open: it is not possible to inspect and modify the input used to build the models.
I'm reminded of med school when one day on rounds our internal medicine attending remarked that there are scores of treatments for hiccups, none of which have been shown to be superior to the others, which is why there are scores of treatments.
I found something that works for me: hold my breath for a short while, then start breathing slowly and keep it slow. Then I go back to breathing normally, and if the hiccups come back I repeat the process. Usually I can make them stop in a few minutes (sometimes 5+). On certain occasions I can't control them, e.g. after eating very spicy food without realizing it, or exceeding my max limit for spiciness (which used to be very high, but I don't partake as much).
There's probably an economic name for this, but it makes sense for non-leading companies, or companies in an adjacent field, to make public some of the secret sauce behind a leading company's proprietary assets – e.g. companies other than Google/Apple/MS funding or open-sourcing web browsers. AI models could be one of those things where we don't want one company to hold most of the marbles.
There is a pretty interesting system for running the models on actual mobile devices too. It seems they are using a cloud of mobile devices to make sure the models run on-device.
Followed the links to see what Whisper looks like here, and I'm kinda disappointed. They call their model [0] "Whisper-base" but the model checkpoint they're using is 'tiny-en'. There's a pretty significant performance difference between whisper-tiny and whisper-base.
Calling freely downloadable weights "Open Source" quite diminishes the term. It's laudable, so I'd hate to discourage it. But it's not Open Source.
What constitutes Open Source is not vague, btw. it's well defined by the OSI [1].
Let's call it open weights or freely downloadable weights or something.
EDIT: I was mistaken - they are downsampled/right-sized with half precision:
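For reference, "half precision" just means storing each weight in 16 bits instead of 32 or 64, trading accuracy for size. Python's struct module (which supports the IEEE 754 half-float `'e'` format) can demonstrate the round-trip loss:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision (2 bytes)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

w = 0.1
q = to_fp16(w)
print(w, q, abs(w - q))  # fp16 keeps only ~3 decimal digits of precision
```

That small per-weight error is usually tolerable for inference, which is why shipping fp16 (or smaller) checkpoints for mobile hardware is so common.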
BTW props to Qualcomm but these are "just" quantized versions of existing models? Useful, yes, but maybe not that novel.
[1] https://opensource.org/osd