You don't need an A100; you can get a used 32GB V100 for $2K-$3K. It's probably the best bang-for-buck inference GPU at the moment. Not for speed, but because there are models you can actually fit on it that you can't fit on a gaming card, and as long as the model fits, it's still light-years better than CPU inference.
The CPU has much slower memory bandwidth and far less parallelism: a GPU has roughly 8k or more CUDA cores vs ~16 cores on a regular CPU, and there's less memory swapping between operations. The GPU is much, much faster.
There is a PR to that repo for a diffusers implementation, which may run on a cheap L4 GPU w/ enable_model_cpu_offload(): https://huggingface.co/black-forest-labs/FLUX.1-schnell/comm...
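For reference, a minimal sketch of what that could look like once the diffusers support lands, following the usual diffusers pattern (the FluxPipeline class name, model ID, and parameter choices here are assumptions, not taken from that PR):

    import torch
    from diffusers import FluxPipeline

    # Load FLUX.1-schnell in bf16 (assumes a diffusers version with Flux support installed)
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell",
        torch_dtype=torch.bfloat16,
    )

    # Keep submodules on the CPU and move each to the GPU only while it runs,
    # so peak VRAM stays within reach of a 24GB L4
    pipe.enable_model_cpu_offload()

    image = pipe(
        "a photo of a red fox in the snow",
        num_inference_steps=4,   # schnell is distilled for very few steps
        guidance_scale=0.0,      # schnell doesn't use classifier-free guidance
    ).images[0]
    image.save("fox.png")

The key part is enable_model_cpu_offload(): you trade generation speed for the ability to run a model whose weights wouldn't otherwise fit in VRAM.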