Orpheus is a Llama model trained to understand and emit audio tokens (from SNAC). Those tokens are just added to its tokenizer as extra tokens.
Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the ids, pass them through the SNAC decoder, and you now have a frame of audio from a 'text' pipeline.
The neat thing about this design is you can throw the model into any existing text-to-text pipeline and it just works.
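For anyone curious what that parsing step looks like in practice, here's a minimal sketch using the `snac` package. The `AUDIO_BASE` offset and the interleave order of the 7 codes across SNAC's three codebook levels are assumptions on my part; check `gguf_orpheus.py` for the exact values the model uses.

```python
# Minimal sketch: turn 7 sampled Orpheus audio tokens into one frame of audio.
import re
import torch
from snac import SNAC  # pip install snac

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

AUDIO_BASE = 10       # assumed offset of the first audio code; see gguf_orpheus.py
CODEBOOK_SIZE = 4096  # SNAC codebook size

def frame_to_audio(token_texts):
    """token_texts: 7 strings like '<custom_token_28631>' (one frame)."""
    ids = [int(re.search(r"\d+", t).group()) for t in token_texts]
    # Undo the per-position offset so each id lands back in [0, CODEBOOK_SIZE).
    codes = [i - AUDIO_BASE - pos * CODEBOOK_SIZE for pos, i in enumerate(ids)]
    # SNAC is hierarchical: 1 coarse + 2 mid + 4 fine codes per frame.
    # Assumed interleave: [coarse, mid, fine, fine, mid, fine, fine].
    c0 = torch.tensor([[codes[0]]])
    c1 = torch.tensor([[codes[1], codes[4]]])
    c2 = torch.tensor([[codes[2], codes[3], codes[5], codes[6]]])
    with torch.no_grad():
        audio = snac_model.decode([c0, c1, c2])  # shape [1, 1, samples]
    return audio.squeeze().numpy()
```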
If you run the `gguf_orpheus.py` file in that repository, it will capture the audio tokens and convert them to a .wav file. With a little more work, you can play the streaming audio directly using `sounddevice` and `OutputStream`.
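Something like this should work for streaming playback, assuming a `frame_to_audio()`-style decoder as sketched above and SNAC's 24 kHz sample rate (the `generated_frames()` generator is a hypothetical stand-in for your sampling loop):

```python
import numpy as np
import sounddevice as sd

with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as stream:
    for frame_tokens in generated_frames():  # hypothetical: yields 7-token frames
        samples = frame_to_audio(frame_tokens).astype(np.float32)
        stream.write(samples)  # blocks until the frame is queued for playback
```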
On an NVIDIA 4090, it's producing:
```
prompt eval time =   17.93 ms /  24 tokens (  0.75 ms per token, 1338.39 tokens per second)
       eval time = 2382.95 ms / 421 tokens (  5.66 ms per token,  176.67 tokens per second)
      total time = 2400.89 ms / 445 tokens
```
*A correction to the llama.cpp server command above: there are 29 layers, so it should read "-ngl 29" to load all the layers onto the GPU.
Is there any reason not to just use `-ngl 999` to avoid that error? Thanks for the help though, I didn't realize LM Studio was just llama.cpp under the hood. I have it running now, though decoding is happening on CPU torch because of venv issues; it's still running at about realtime. I'm interested in making a full-fat GGUF to see what sort of degradation the quant introduces. Sounds great though, can't wait to try finetuning and messing with the pretrained model. Have you tried it? I guess you just tokenize the voice with SNAC, transcribe it with whisper, and then feed that in as a prompt? What a fascinating architecture.
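If that guess is right, assembling a fine-tuning example might look roughly like this. This is only a sketch of the guess above: the `<custom_token_N>` numbering mirrors the earlier decode sketch and is an assumption, not Orpheus's confirmed training format.

```python
import torch
import torchaudio
import whisper              # pip install openai-whisper
from snac import SNAC

AUDIO_BASE = 10             # assumed offset, as in the decode sketch
CODEBOOK_SIZE = 4096
asr = whisper.load_model("base")
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def make_example(wav_path):
    text = asr.transcribe(wav_path)["text"].strip()
    audio, sr = torchaudio.load(wav_path)
    audio = audio.mean(0, keepdim=True).unsqueeze(0)   # mono, shape [1, 1, T]
    audio = torchaudio.functional.resample(audio, sr, 24000)
    with torch.no_grad():
        c0, c1, c2 = codec.encode(audio)               # three codebook levels
    tokens = []
    for t in range(c0.shape[-1]):
        # Re-interleave 7 codes per frame, inverting the decode sketch's order.
        frame = [c0[0, t], c1[0, 2*t], c2[0, 4*t], c2[0, 4*t + 1],
                 c1[0, 2*t + 1], c2[0, 4*t + 2], c2[0, 4*t + 3]]
        tokens += [f"<custom_token_{int(c) + AUDIO_BASE + i * CODEBOOK_SIZE}>"
                   for i, c in enumerate(frame)]
    return text + "".join(tokens)  # transcript as prompt, audio tokens as target
```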
This amuses me tremendously. I began programming in the early 1980s and quickly developed an interest in artificial intelligence. At the time there was great interest in advancing AI through the introduction of "Expert Systems" (which would later play a part in the "Second AI Winter").
What Amazon appears to have done here is use a transformer-based neural network (aka an LLM) to translate natural language into symbolic logic rules, which are then used collectively in what could be identified as an expert system.
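To make the pattern concrete, here's a toy illustration: an LLM (stubbed out) turns a natural-language policy into symbolic rules, and a tiny forward-chaining engine, the classic expert-system component, applies them. The rule format and the `llm()` stub are inventions for illustration, not Amazon's actual system.

```python
def llm(policy_text):
    # Stand-in for a real model call; imagine it emits rules like this one
    # for "Refunds are allowed within 30 days unless the item was used."
    return [({"within_30_days": True, "item_used": False},
             {"refund_allowed": True})]

def forward_chain(facts, rules):
    # Repeatedly fire any rule whose conditions all match the known facts.
    changed = True
    while changed:
        changed = False
        for conditions, conclusions in rules:
            if all(facts.get(k) == v for k, v in conditions.items()):
                for k, v in conclusions.items():
                    if facts.get(k) != v:
                        facts[k] = v
                        changed = True
    return facts

rules = llm("Refunds are allowed within 30 days unless the item was used.")
print(forward_chain({"within_30_days": True, "item_used": False}, rules))
# {'within_30_days': True, 'item_used': False, 'refund_allowed': True}
```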
The problem with expert systems (and most KG-type applications) has always been that translating unconstrained natural language into the system requires human-level intelligence.
For years it's been completely obvious that LLMs are a technology that lets us bridge that gap, and many of the best applications of LLMs do exactly that (e.g. code generation).
To be clear, my amusement isn't that I find this technique not to be useful for the purpose it was created for, but that 40 years later, in pursuit of the advancement of AI, we find ourselves somewhat back where we already were; albeit in a more semi-automated fashion, as someone still has to create the underlying rule set.
I do feel that the introduction of generative neural network models, in both natural language and multimedia creation, has been a tremendous boon for the advancement of AI; it just amuses me to see that which was old is new again.
Right. The trouble with that approach is that it's great on the easy cases and degrades rapidly with scale.
This sounds like a fix for a very specific problem. An airline chatbot told a customer that some ticket was exchangeable. The airline claimed it wasn't. The case went to court. The court ruled that the chatbot was acting as an agent of the airline, so ordinary rules of principal-agent law applied. The airline was stuck with the consequences of their chatbot's decision.[1]
Now, if you could reduce the Internal Revenue Code to rules in this way, you'd have something.
Yes, as I said in another comment: "By constraining the field it is trying to solve it makes grounding the natural language question in a knowledge graph tractable."
There are a number of ways this might get solved, but I would speculate that it will generally be solved by adding image metadata that is signed by a certificate authority, similar to the way SSL certificates are issued to domains.
I think eventually all digital cameras and image scanners will securely hash and sign images just as forensic cameras do to certify that an image was "captured" instead of generated.
Of course this leaves a grey area for image-editing applications such as Photoshop, so there may also need to be some other level of certificate-based signing introduced there as well.
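As a rough sketch of the signing half of that idea, using Ed25519 from the `cryptography` package: hash the captured image bytes and sign the digest with a key that would live in the camera's secure element, its public key vouched for by a CA as described above. Key management here is purely illustrative.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

device_key = Ed25519PrivateKey.generate()  # would live in the camera's secure element

def sign_capture(image_bytes: bytes) -> bytes:
    digest = hashes.Hash(hashes.SHA256())
    digest.update(image_bytes)
    return device_key.sign(digest.finalize())  # stored alongside the image as metadata

def verify_capture(image_bytes: bytes, signature: bytes) -> bool:
    digest = hashes.Hash(hashes.SHA256())
    digest.update(image_bytes)
    try:
        device_key.public_key().verify(signature, digest.finalize())
        return True
    except InvalidSignature:
        return False
```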
For those who might not be aware of it, there is also an open-source project on GitHub called "Twinny", which is an offline Visual Studio Code plugin equivalent to Copilot: https://github.com/rjmacarthy/twinny
It can be used with a number of local model services. Currently, for my setup on an NVIDIA 4090, I'm running both the base and instruct models for deepseek-coder 6.7b using Q5_K_M-quantized GGUF files (for performance) through the llama.cpp server, where the base model handles completions and the instruct model handles chat interactions.
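For the curious, the way a plugin like this talks to those two servers is just HTTP against llama.cpp's `/completion` endpoint; something like the following (ports and prompt shapes are my assumptions for illustration):

```python
import requests

def complete(prompt, port, n_predict=64):
    r = requests.post(f"http://localhost:{port}/completion",
                      json={"prompt": prompt, "n_predict": n_predict})
    return r.json()["content"]

# Base model server (assumed port 8080): raw code completion.
suffix = complete("def fib(n):\n    ", port=8080)

# Instruct model server (assumed port 8081): chat-formatted prompt.
answer = complete("### Instruction:\nExplain memoization.\n### Response:\n", port=8081)
```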
Currently, if you disable chat history, you'll see this message:
Chat History is off for this browser.
When history is turned off, new chats on this browser won't appear in your history on any of your devices, be used to train our models, or stored for longer than 30 days. This setting does not sync across browsers or devices.
No, it's not. If they explicitly say they won't train on your data and then they do, it's going to come out in discovery in one of the lawsuits they're fighting, and the consequences would be significant. Plus, there's little incentive for them to lie about it, given that most people leave history on.
Yeah, because no large tech company has ever lied to its customers about how their data is being handled. Oh wait, there are lawsuits surrounding this sort of thing all the time.
I wouldn't trust them with nuclear secrets, but to say it's "insane" to trust that they're going to do what they explicitly say they're going to do just isn't logical.
They hide this link a bit. They completed my opt-out request in about ten minutes and at least claim not to be using any of my data going forward for training.
BTW, for anyone who might not be aware of it, this model trained by Intel based on the Mistral architecture is probably the single best general 7B model available currently:
The Intel one had supervised fine-tuning with the SlimOrca dataset, and then DPO alignment on top of that using a preference dataset.
The technique for generating the preference data is what's so interesting about that one. Instead of having human labelers choose a preferred response, they generated a response from a small model and a large model, and then always selected the large model's response as the preferred one.
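In other words, the preference dataset can be built with no human labeling at all. Here's a hedged sketch of that recipe, with placeholder models standing in for the pair Intel actually used; triples in this shape are the standard input for DPO trainers such as trl's DPOTrainer:

```python
from transformers import pipeline

small = pipeline("text-generation", model="gpt2")        # placeholder small model
large = pipeline("text-generation", model="gpt2-large")  # placeholder large model

def make_preference_pair(prompt):
    # The large model's output is always labeled "chosen", the small one's "rejected".
    rejected = small(prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    chosen = large(prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pairs = [make_preference_pair(p) for p in ["Explain memoization in one paragraph."]]
```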