It depends. `QueryInfo` is a struct of a `std::time::Instant` and a `usize`, which makes it 24 bytes on x86_64 Linux, for example. So the `Store` type that contains it will grow by 64 Ki * 24 B = 1.5 MiB, which will break if the `Store` is held on the stack on systems with small enough thread stacks. In particular, the default stack size for the main thread on Windows is 1 MiB.
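That arithmetic can be checked directly with `std::mem::size_of`. A sketch, assuming a `QueryInfo` laid out as described (the field names here are made up):

```rust
use std::mem::size_of;
use std::time::Instant;

// Hypothetical layout matching the description: an Instant plus a usize.
pub struct QueryInfo {
    sent: Instant,
    attempts: usize,
}

/// Size of one entry in bytes (24 on x86_64 Linux: 16 for Instant, 8 for usize).
pub fn entry_size() -> usize {
    size_of::<QueryInfo>()
}

/// Size of a 64 Ki-entry table in bytes: 65536 * 24 = 1,572,864 B = 1.5 MiB.
pub fn table_size() -> usize {
    65536 * entry_size()
}
```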
If you decide you can't afford for it to live on the stack, you could use a Vec so that it lives on the heap; it's still a single pointer dereference to get the QueryInfo (or find it absent), which performs similarly.
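A minimal sketch of that heap-allocated version, indexed directly by the 16-bit query ID (`QueryInfo` is a stand-in matching the description, and the method names are made up):

```rust
use std::time::Instant;

// Stand-in for the struct described above: an Instant plus a usize.
#[derive(Clone, Copy)]
pub struct QueryInfo {
    pub sent: Instant,
    pub attempts: usize,
}

pub struct Store {
    // The 1.5 MiB buffer lives on the heap; the Store itself is just the
    // Vec header (pointer, length, capacity).
    entries: Vec<Option<QueryInfo>>,
}

impl Store {
    pub fn new() -> Self {
        Store { entries: vec![None; 65536] }
    }

    // A lookup is one bounds check plus one pointer chase.
    pub fn insert(&mut self, id: u16, info: QueryInfo) {
        self.entries[id as usize] = Some(info);
    }

    pub fn take(&mut self, id: u16) -> Option<QueryInfo> {
        self.entries[id as usize].take()
    }
}
```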
The nice thing about the stack is that it's warmer (more likely to be in cache), but that matters less at these sizes, where the structure is too large to fit in fast cache anyway.
Actually, depending on how sparse the map is, that's a reason not to do this at all. I don't know what an FxHashMap is, but if you've got a u16 key and only a small number of entries in your map, some maps won't need more than a few cache lines to store that data, far smaller than the 1.5 MiB contemplated above.
Anyway, with anything like this: measure three times, mark twice, cut once. If you don't have a performance problem, you needn't do anything; if you do have a performance problem, you need to measure it before you try to fix it.
A `Vec<Option<QueryInfo>>` would be unable to elide the bounds checks, because the optimizer will never remember that the Vec was created with 65536 `None`s. `Box<[Option<QueryInfo>; 65536]>` would be better, though slightly annoying to create without accidentally creating an intermediate `[Option<QueryInfo>; 65536]` on the stack. (`vec![None; 65536].into_boxed_slice().try_into().expect("size is hard-coded to match")` is the easiest way in stable, AFAICT.)
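Spelled out, that construction looks something like this (with a stand-in `QueryInfo`; going through Vec keeps the 1.5+ MiB array off the stack, whereas `Box::new([None; 65536])` would build it on the stack first):

```rust
use std::time::Instant;

// Stand-in matching the description (Instant + usize).
#[derive(Clone, Copy, Debug)]
pub struct QueryInfo {
    pub sent: Instant,
    pub attempts: usize,
}

pub fn make_table() -> Box<[Option<QueryInfo>; 65536]> {
    vec![None; 65536]
        .into_boxed_slice()
        .try_into()
        .expect("size is hard-coded to match")
}

// With a u16 id, `id as usize` is always < 65536, so the optimizer can
// drop the bounds check against the fixed array length.
pub fn lookup(table: &[Option<QueryInfo>; 65536], id: u16) -> Option<QueryInfo> {
    table[id as usize]
}
```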
> the optimizer will never remember that the Vec was created with 65536 `None`s.
I mean, probably today, but:
1. If you're worried about bounds checks, you're likely not thinking about this problem in the right way. The bounds check is a couple of CPU instructions: a register compare and a conditional jump. The jump is never taken, the CPU will remember that if we do this often, and everything is in registers or the instruction cache. Then, having discerned that we're in bounds, we do a memory dereference.
Actually, no modern CPU waits until then: the CPU has concluded that we probably aren't going to branch, so it will issue the memory read, and while that takes its sweet time the CPU can do the comparison and, as anticipated, decide not to take the branch it never takes.
So we're actually waiting on the memory read, which is why we care about not using a hash map: the hash map might incur two memory reads if we get unlucky. On the other hand, if the hash map is small enough to fit in cache, at least the data arrives before the CPU goes into a coma, given how slow reads from main memory are.
2. Optimizers get smarter. It's clearly not impossible to do the analysis and decide this check isn't necessary. So I think that means you shouldn't rule it out.
Yes, you could trust the CPU's branch prediction and the optimizer getting smarter in the future, or you could do the thing that's not only more self-documenting (Vec == need to read the code to notice it never grows, array == obviously fixed size) but also guaranteed to work already.
It is the stdlib hashmap with the default hashing function changed from SipHash to FxHash.
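In other words, FxHashMap is essentially a type alias like the one below, with the `rustc-hash` crate's `FxHasher` in place of `DefaultHasher`. `DefaultHasher` (SipHash) stands in here only so the sketch compiles with the standard library alone:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

// The FxHashMap pattern: same HashMap, different BuildHasher. Swap in
// FxHasher for DefaultHasher to get the real thing.
pub type FastHashMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

pub fn example() -> FastHashMap<u16, usize> {
    // `default()` rather than `new()`, because the hasher type is custom.
    let mut m = FastHashMap::default();
    m.insert(42, 1);
    m
}
```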
I fully expect there to be some low-hanging fruit; for instance, the hashmap should probably be created with some pre-allocated capacity, but instead it's just initialized with `new`. As you said, though, it seems plenty fast (at least to me) at the moment, so while the rabbit holes are interesting, I'd probably be better served by cleaning up certain bits of code.
Yeah, that's fair. If it's the standard library hashmap, then it's an Abseil Swiss Table (well, not actually, but that's the thing it's reimplementing), so yes, it will have great performance if correctly sized. If it's relatively sparse, I'd bet it's faster than the heap-allocated Vec idea, because so much of it will fit in your L1 cache and the Vec would not (unless you've got a complete beast of a CPU dedicated to this task).
I would add the size hint only if your program has some reason to know what hint to give (blind guesses are worse than just calling `new`), and otherwise forget about optimizing it until it's too slow.
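For the record, the hint looks like this; `expected` is a made-up parameter standing for whatever real estimate the program has (say, the number of queries a run keeps in flight):

```rust
use std::collections::HashMap;

// Pre-size the map only when there is a genuine estimate to give.
pub fn sized_map(expected: usize) -> HashMap<u16, usize> {
    // For a map with a non-default hasher (like FxHashMap), the equivalent
    // is HashMap::with_capacity_and_hasher(expected, Default::default()).
    HashMap::with_capacity(expected)
}
```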
It always makes me excited to see what people are doing with the trust-dns libraries. Thanks for your contributions to the project. It's very cool to see some performance testing tools like this coming to Rust.
anyways, fun project