
I'm glad you find it exciting!

Our intention from the start was for Ray to be general purpose. And the core Ray APIs are quite general (basically just scheduling a Python function somewhere in a cluster or instantiating a Python class as a process somewhere in the cluster).
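
For anyone who hasn't seen the core API, a minimal sketch of those two primitives (a task and an actor) looks roughly like this:

    import ray

    ray.init()

    # A task: an ordinary Python function scheduled somewhere in the cluster.
    @ray.remote
    def square(x):
        return x * x

    # An actor: a Python class instantiated as a long-lived process in the cluster.
    @ray.remote
    class Counter:
        def __init__(self):
            self.count = 0

        def increment(self):
            self.count += 1
            return self.count

    futures = [square.remote(i) for i in range(4)]
    counter = Counter.remote()
    print(ray.get(futures))                      # [0, 1, 4, 9]
    print(ray.get(counter.increment.remote()))   # 1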

We had AI use cases in mind from the start, since we were grad students in AI. But the generality has really been important since AI workloads encompass a huge variety of computational patterns (allreduce style communication patterns on GPUs for training, embarrassingly parallel data processing workloads on spot instances, and so on).


Oh, I know all that, I used to work at Google and give lots of money to the various groups associated with Ion Stoica's groups at Berkeley to help stimulate more open source alternatives to Borg/MapReduce/Flume/TensorFlow. Keep up the good work.


Is there anybody trying to build a SQL database on Ray yet? Asking for a friend.


I'm one of the creators of Ray. A few thoughts :)

1. This is truly impressive work from AWS. Patrick Ames began speaking about this a couple years ago, though at this point the blog post is probably the best reference. https://www.youtube.com/watch?v=h7svj_oAY14

2. This is not a "typical" Ray use case. I'm not aware of any other exabyte scale data processing workloads. Our bread and butter is ML workloads: training, inference, and unstructured data processing.

3. We have a data processing library called Ray Data for ingesting and processing data, often done in conjunction with training and inference. However, I believe that in this particular use case, the heavy lifting is largely done with Ray's core APIs (tasks & actors), which are lower level and more flexible; that flexibility makes sense for highly custom use cases. Most Ray users use the Ray libraries (train, data, serve), but power users often drop down to the Ray core APIs. (A tiny Ray Data sketch follows after this list.)

4. Since people often ask about data processing with Ray versus Spark: Spark use cases tend to be more geared toward structured data and CPU processing. If you are joining a bunch of tables together or running SQL queries, Spark is going to be way better. If you're working with unstructured data (images, text, video, audio, etc.), need mixed CPU & GPU compute, are doing deep learning and running inference, etc., then Ray is going to be much better.
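
To make point 3 concrete, here is a minimal Ray Data sketch (the bucket paths and the "value" column are hypothetical placeholders, and the exact map_batches arguments vary a bit across Ray versions):

    import ray

    ray.init()

    # Read data into a distributed dataset (hypothetical S3 path).
    ds = ray.data.read_parquet("s3://my-bucket/raw/")

    def preprocess(batch):
        # In recent Ray versions, a batch is a dict of column name -> numpy array.
        batch["value"] = batch["value"] * 2
        return batch

    # map_batches runs the function in parallel across the cluster.
    ds = ds.map_batches(preprocess)
    ds.write_parquet("s3://my-bucket/processed/")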


I'm just learning about this tool now and had a brief question if you have the time:

The paper mentions support for zero-copy intranode object sharing which links to serialization in the Ray docs - https://docs.ray.io/en/latest/ray-core/objects/serialization...

I'm really curious how this is performant - I recently tried building a pipeline that leveraged substantial multiprocessing in Python, and found that my process was bottlenecked by the serialization/deserialization that occurs during Python multiprocessing. Would love any reading or explanation you can provide as to how this doesn't also bottleneck a process in Ray, since it seems that data transferred between workers and nodes will need to be serialized and deserialized.

Thanks in advance! Really cool tool, hopefully I'll be able to use it sooner rather than later.


You're right that the serialization / deserialization overhead can quickly exceed the compute time. To avoid this, you have to get a lot of small things right. And given our focus on ML workloads, this is particularly important when sharing large numerical arrays between processes (especially processes running on the same node).

One of the key things is to make sure the serialized object is stored in a data format where the serialized object does not need to be "transformed" in order to access it. For example, a numpy array can be created in O(1) time from a serialized blob by initializing a Python object with the right shape and dtype and a pointer to the right offset in the serialized blob. We also use projects like Apache Arrow that put a lot of care into this.

Example in more detail:

Imagine the object you are passing from process A to process B is a 1GB numpy array of floats. In the serialization step, process A produces a serialized blob of bytes that is basically just the 1GB numpy array plus a little bit of metadata. Process A writes that serialized blob into shared memory. This step of "writing into shared memory" still involves O(N) work, where N is the size of the array (though you can have multiple threads do the memcpy in parallel and be limited just by memory bandwidth).

In the deserialization step, process B accesses the same shared memory blob (process A and B are on the same machine). It reads a tiny bit of metadata to figure out the type of the serialized object, the shape, and so on. Then it constructs a numpy array with the correct shape and type and with a pointer to the actual data in shared memory at the right offset. Therefore it doesn't need to touch all of the bytes of data; it does O(1) work instead of O(N).
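
A simplified sketch of that pattern using Python's standard library (rather than Ray's actual object store, and with a much smaller array than 1GB) might look like this:

    import numpy as np
    from multiprocessing import shared_memory

    # "Process A": write the array directly into a shared-memory segment.
    src = np.random.rand(1024, 1024)
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes, name="demo_blob")
    dst = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    dst[:] = src  # the one unavoidable O(N) memcpy into shared memory

    # "Process B": attach to the same segment and build a view in O(1).
    # No bytes of the array are copied or transformed.
    shm_b = shared_memory.SharedMemory(name="demo_blob")
    view = np.ndarray((1024, 1024), dtype=np.float64, buffer=shm_b.buf)

    # Cleanup (real code would close/unlink when the object is freed).
    del dst, view
    shm_b.close()
    shm.close()
    shm.unlink()

In a real system the shape and dtype metadata travel alongside the blob; here they're hard-coded for brevity.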

That's the basic idea. You can imagine generalizing this beyond numpy arrays, but it's most effective for objects that include large numerical data (e.g., objects that include numpy arrays).

There are a bunch of little details to get right, e.g., serializing directly into shared memory instead of creating a serialized copy in process A and then copying it into shared memory, doing the write into shared memory in parallel with a bunch of threads, and getting the deserialization right. You also have to make sure that the starting addresses of the numpy arrays are 64-byte aligned (if memory serves) so that you don't accidentally trigger a copy later on.

EDIT: I edited the above to add more detail.


This is probably a naive question, but how do two processes share address space? mmap?


Yeah, mmap, I think this is the relevant line [1].

Fun fact, very early on, we used to create one mmapped file per serialized object, but that very quickly broke down.

Then we switched to mmapping one large file at the start and storing all of the serialized objects in that file. But then as objects get allocated and deallocated, you need to manage the memory inside of that mmapped file, and we just repurposed a malloc implementation to handle that.

[1] https://github.com/ray-project/ray/blob/21202f6ddc3ceaf74fbc...
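
For intuition, a stripped-down sketch of the idea (this is not Ray's actual object store code) looks roughly like this:

    import mmap
    import os

    import numpy as np

    # One file-backed region mapped with MAP_SHARED is visible to every
    # process that maps the same file (Unix-only sketch).
    path = "/tmp/toy_object_store"
    size = 8 * 1024 * 1024

    with open(path, "wb") as f:
        f.truncate(size)

    fd = os.open(path, os.O_RDWR)
    buf = mmap.mmap(fd, size, flags=mmap.MAP_SHARED)

    # Interpret a slice of the mapping as a numpy array without copying.
    arr = np.frombuffer(buf, dtype=np.float64, count=1024)

Ray's store does this mapping once at startup and then sub-allocates individual objects inside the mapped region, which is where the repurposed malloc comes in.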


Super cool to see you here.

I've also looked at ray for running data pipelines before (at much much smaller scales) for the reasons you suggest (unstructured data, mixed CPU/GPU compute).

One thing I've wanted is an incremental computation framework (e.g., salsa [1]) built on ray so that I can write jobs that transparently reuse intermediate results from an object store if their dependencies haven't changed.

Do you know if anyone has thought of building something like this?

[1] https://github.com/salsa-rs/salsa


I asked the same question to one of the core devs at a recent event and he (1) said that some people in finance have done related things and (2) suggested using the Ray slack to connect with developers and power users who might have helpful advice.

I agree this is a very interesting area to consider Ray for. There are lots of projects/products that provide core components that could be used but there’s no widely used library. It feels like one is overdue.


Other folks have built data processing libraries on top of Ray: Modin and Daft come to mind.

But I'm not aware of anything exactly like what you're referring to!


Curious if you know how well Ray works with multithreaded python libraries? For example, when using jax with ray, I have to make sure that ray is imported first, as forking a threaded process leads to deadlocks in Python. Do you know how to ensure that ray handles forking the python interpreter correctly?


Multi-threaded libraries (e.g., numpy and PyTorch on CPUs come to mind) are well supported. In scenarios where many processes are each running heavily multi-threaded computations, it can help to pin specific processes to specific cores (e.g., using tools like psutil) to avoid contention.
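
For example, a sketch of the kind of manual pinning I mean (this is not something Ray does automatically, and the core IDs here are just illustrative):

    import numpy as np
    import psutil
    import ray

    ray.init()

    @ray.remote(num_cpus=4)
    def pinned_work(core_ids, n):
        # Pin this worker process to a specific set of cores (Linux/Windows).
        psutil.Process().cpu_affinity(core_ids)
        x = np.random.rand(n, n)
        return float(np.linalg.norm(x @ x))

    # Give each heavily multi-threaded task its own block of 4 cores.
    refs = [pinned_work.remote(list(range(i * 4, (i + 1) * 4)), 2000) for i in range(4)]
    print(ray.get(refs))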

The scenario where a Ray task forks is probably not very well supported. You can certainly start a subprocess from within a Ray task, but I think forking could easily cause issues.

You can definitely use Ray + Jax, but you probably need to avoid forking a process within a Ray worker.


> this is not a typical ray use case

Must be good enough if you're willing to dogfood it though?


To clarify, what I mean is that working with "exabytes" is atypical. Most use cases are at a slightly smaller scale :)

Data processing workloads are quite common on Ray, especially with unstructured data.

Also, I work on Ray, which is the underlying framework used here, but all the work in the post was done by the Amazon team.


We're hosting the model on Anyscale Endpoints. Try it out here [1]

[1] https://docs.endpoints.anyscale.com/supported-models/Meta-Ll...


I'm a huge fan of Jax. The Jax team is incredibly strong!

Just want to share that Ray (an open source project we're developing at Anyscale) can be used to scale Jax (e.g., across TPUs). A toy sketch of the pattern is at the end of this comment.

Some docs from Google on how to do this

https://cloud.google.com/tpu/docs/ray-guide

Alpa is an open source project scaling Jax on 1000+ GPUs

https://www.anyscale.com/blog/training-175b-parameter-langua...

Cohere uses Ray + Jax + TPUs to build their LLMs

https://www.youtube.com/watch?v=For8yLkZP5w

A demo from Matt Johnson on the Jax team

https://www.youtube.com/watch?v=hyQ-tgD5sgc
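
And the toy sketch mentioned above (it assumes jax is installed on the workers; the resource annotation is illustrative and depends on your cluster, e.g., GPUs vs. TPU hosts):

    import ray
    import jax.numpy as jnp

    ray.init()

    @ray.remote(num_gpus=1)  # or a TPU host resource, depending on the cluster
    def jax_matmul(n):
        x = jnp.ones((n, n))
        # Convert to a plain float so the result serializes trivially.
        return float((x @ x).sum())

    print(ray.get(jax_matmul.remote(1024)))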


Here is the blog post accompanying the notebook

https://www.anyscale.com/blog/a-comprehensive-guide-for-buil...


It shouldn't be 100x. We've built an LLM API at Anyscale, and the price comparison works out as follows (per million tokens):

- Llama-2-70B: $1 (on Anyscale Endpoints [1])
- GPT-3.5-turbo: $1.50 - $2 (OpenAI [2])

[1] https://app.endpoints.anyscale.com/
[2] https://openai.com/pricing


It's amazing to see how rapidly things are moving.

You can try out CodeLlama-34B on Anyscale Endpoints (an LLM inference API we're building here at Anyscale for open source LLMs).

https://app.endpoints.anyscale.com/


This looks like it might be neat but it has a pretty sparse intro page and then the email signup goes straight into stripe checkout - is there any more info about the service anywhere? Like which models are available, or more pricing info?


Thanks for the feedback, we'll improve the landing page!

The models (and current prices) right now are:

- Llama-2-7B ($0.25 / million tokens)
- Llama-2-13B ($0.50 / million tokens)
- Llama-2-70B ($1 / million tokens)
- Code Llama ($1 / million tokens)


Awesome, thanks! I've been wanting exactly this service.


I had the same question. Being thrown into a payment page right after providing your email is very jarring. I don't want to pay for a product without even giving it a spin.


From the FAQ:

> What rights do you claim to my queries?

> During the alpha release, we may use your inputs and outputs to improve the service. In a future release, we may provide the ability to opt out of certain data uses.

I understand that you are in alpha, but paying without any privacy guarantees is a little hard to accept right now.

Otherwise your service seems really nice, I'm sure a lot of people have been waiting for something like this. Any ETA on the opt-out and further clarification on the matter?

Thank you.


If you want to try out Code Llama, you can query it on Anyscale Endpoints (this is an LLM inference API we're working on here at Anyscale).

https://app.endpoints.anyscale.com/


We've run experiments on datasets ranging from 5K - 100K examples, which gave fantastic results [1].

Some examples:

- https://huggingface.co/datasets/b-mc2/sql-create-context
- https://huggingface.co/datasets/GEM/viggo

On the other hand, 8K examples was not enough to learn to solve grade school math problems [2], so it is very problem dependent.

[1] https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...

[2] https://huggingface.co/datasets/gsm8k


I think for fine-tuned GPT-3.5 to be competitive with GPT-4 on your use cases (assistance with Angular), you'd have to fine-tune on enough data that it really resembles pre-training more than fine-tuning. And it wouldn't be worth the hassle unless you're building a product around it.

That said, many valuable LLM products / features are more narrow in scope and can see a huge lift from fine-tuning. We've run a bunch of experiments on this (e.g., SQL query generation is a good example), where fine-tuning even the 7B Llama-2 model outperforms GPT-4 (surprisingly) [1]. That's a very different type of problem from teaching software engineering of course.

[1] https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...

