A couple of people seem surprised that the "idiomatic" Swift code seems to perform much worse. Unfortunately the blog post doesn't provide many details on the code it is actually running, so I tried my hand at guessing:
let vC = [Float](repeating: 1, count: 4)

func unidiomatic() -> Float {
    var vA = [Float](repeating: 0, count: 4)
    let vB = [1, 2, 3, 4] as [Float]
    var tempA: Float = 0.0
    for _ in 0..<1000 {
        for i in 0...3 {
            tempA += vA[i] * vB[i]
        }
        for i in 0...3 {
            vA[i] = vA[i] + vC[i]
        }
    }
    return tempA
}
func idiomatic() -> Float {
    var vA = [Float](repeating: 0, count: 4)
    let vB = [1, 2, 3, 4] as [Float]
    var tempA: Float = 0.0
    for _ in 0..<1000 {
        tempA = zip(vA, vB).map(*).reduce(0, +)
        for (index, value) in vA.enumerated() {
            vA[index] = value + vC[index]
        }
    }
    return tempA
}
When compiled with optimizations (it's unclear if the author did this?), the "idiomatic" code here is indeed much worse. It's easy to fix the worst of that by working on a lazy sequence, i.e.
    tempA = zip(vA, vB).lazy.map(*).reduce(0, +)
which will save on an extra array allocation and instead run the sum "in place" (in this case, vB is recognized as a constant in both versions and distributed accordingly). Unfortunately there is still a small amount of overhead because zip doesn't know that the arrays are of the same length, so the compiler inserts some extra bounds checks; also, the second loop gets transformed into what is essentially
temp = __swift_intrinsic_move(vA)
for (index, value) in vA.enumerated() {
    temp[index] = value + vC[index]
}
vA = __swift_intrinsic_move(temp)
but this should ideally bring the performance of the two much closer.
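Putting that together, the "fixed" idiomatic version I have in mind looks roughly like this (same vB/vC setup as above; the function name is just for illustration, not from the original post):

func idiomaticLazy() -> Float {
    var vA = [Float](repeating: 0, count: 4)
    let vB = [1, 2, 3, 4] as [Float]
    var tempA: Float = 0.0
    for _ in 0..<1000 {
        // The lazy zip/map avoids materializing an intermediate array,
        // so the reduce runs the sum directly over the pairs.
        tempA = zip(vA, vB).lazy.map(*).reduce(0, +)
        for (index, value) in vA.enumerated() {
            vA[index] = value + vC[index]
        }
    }
    return tempA
}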
That just means the compiler needs more work around generating optimal code. Rust has some similar idioms and does much better, with the difference between "C-like" Rust and idiomatic Rust usually being small to nonexistent. With Rust you can still sometimes hit edge cases though where the compiler spits out slow code for some idiom that should be zero cost.
Yes, Swift lags behind on this. Part of it is a language design thing (lots of extra allocations by default, certain invariants cannot reasonably be expressed to the compiler, lots of optimization information is thrown away before it hits the middle end), part of it (IMO) is a priority thing. Rust has an extremely strong desire to compare favorably against C/C++, and missed optimizations make it look bad. Swift is working in this area but it seems to be late to the game.
Also, just to be entirely clear, an example like this (once I “fixed” it) is probably the best Swift can reasonably expect to get, since it hits all the hardcoded semantic hints that Array has been annotated with. It actually does get quite close to the unidiomatic version; save for some extra bounds checks from the zip and a dubious double move, it’s basically done as well as it could with constant propagation and inlining. More complex examples are unlikely to fare as well.
Bound checks can be an issue in Rust as well, actually. It's often worthwhile to hoist them manually out of loops, which usually enables the compiler to optimize them out within the loop and even vectorize the resulting code.
(This can't happen automatically because the default semantics generally require a panic or exception to be triggered when the check fails; there's no way to optimize this further.)
A trick I've found to eliminate them in a few cases: pass &[T; size] instead of &[T] when dealing with statically sized arrays. The compiler will elide bounds checking a lot more aggressively with statically sized arrays since the bounds are known at compile time.
In this case none of the accesses should fail because they can all be statically observed to be in-bounds; unfortunately the Swift compiler was not able to see that.
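If you've convinced yourself the accesses really are in bounds and the checks are what's hot, one way to sidestep them in Swift is to drop down to buffer pointers, which usually avoids the per-element checks in optimized builds. A rough sketch with a made-up dot helper (this trades safety for speed, so the caller owns the safety argument):

// Hypothetical helper: dot product over the common prefix of two arrays.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    let n = min(a.count, b.count)   // hoist the bound once, outside the loop
    return a.withUnsafeBufferPointer { (pa) -> Float in
        b.withUnsafeBufferPointer { (pb) -> Float in
            var acc: Float = 0
            for i in 0..<n {
                acc += pa[i] * pb[i]
            }
            return acc
        }
    }
}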
I was under the impression that this category of optimizations ("deforestation") is difficult, and requires some immutability/side-effect-free guarantees to be performed effectively. Is that kind of optimization even feasible in Swift?
Emphasis on similar, though. This sometimes comes up when people are discussing D ranges, and it's important to keep in mind that in some cases the autovectorization cannot happen because a tradeoff was made that we value locality over parallelism, i.e. just reduce rather than map-reduce.
Transforming from reduce to map-reduce is usually very trivial, but it has to be kept in mind.
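To make that concrete in Swift terms (xs/ys are just placeholder arrays, not from the original benchmark), the two shapes look roughly like this:

let xs: [Float] = [1, 2, 3, 4]
let ys: [Float] = [5, 6, 7, 8]

// Fused reduce: a single pass with good locality, but the serial
// accumulator gives the compiler less room to parallelize/vectorize.
let fused = zip(xs, ys).reduce(Float(0)) { acc, pair in acc + pair.0 * pair.1 }

// map then reduce: the elementwise multiply can vectorize on its own,
// at the cost of a temporary (unless it's done lazily) and a second pass.
let split = zip(xs, ys).map(*).reduce(0, +)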
Unfortunately Swift is kind of verbose to read and doesn't really look good in Godbolt (this is why I didn't link it–also, it only does Intel). Instead I compiled on my own machine and opened the binary in Ghidra.
but is 19% slower in single core performance. However, if you consider that AMD uses 7nm and Apple 5nm technology to build their processors, AMD is a lot better.
> AMD 5850U beats Apple M1 in multicore performance by 29% having the same power consumption (15 Watt):
The last part is not entirely accurate. They have the same TDP. Not the same power consumption. The 5850U doesn't use 15W in those tests, and the same goes for the M1, which is closer to 20W max.
For both Intel and AMD, the word TDP means typical TDP, not what it means in the literal sense. That is excluding cTDP and other states like PL2.
Worth mentioning that the M1 achieves that single-thread performance at no more than 5W. If you put the two on an equal footing, even accounting for the possible node improvement, the M1 is still quite far ahead in terms of perf/watt. And the 5850U is already on Zen 3.
Indeed, thank you. It's plainly pathetic this has to be said so often, especially here FFS. Every time this topic comes up it's "AMD can do the same, 15W", as if the "15W" figure were a definitive value from a set of industry-standard "TDP" figures. It is not, and the listed "TDP" at best offers a clue as to the power draw. Not much more. Hell, it can even differ by motherboard config. Intel encourages it.
The next emotional preservation tactic usually cites the old GF IO Die, but that was only on H/desktop series chips anyways and furthermore they still lose to Apple in sheer performance per watt.
It's August 2021 and we still have to have this conversation. Sigh
Don't forget the GPU either. With all the cores going full-out, the M1 can still run its GPU at high clocks.
AMD reduced the CU count down to 8 and ramped the clockspeeds which is terrible for the thermal budget. If you need to offload stuff to the GPU, both GPU clocks and CPU clocks dramatically lower. AMD needs a 16CU design with RDNA2 if they hope to actually compete with current and upcoming designs.
Speaking of upcoming, Apple's next generation will be announced in the next 2-3 weeks. A15 and either M1X or M2 (or maybe both) on N5P which should be 10-15% better than the previous N5 process. That's what 5850U is actually competing against considering how long it took to get out the door.
Things aren't looking pretty for x86. Now if we could just get some nice RISC-V designs shipping...
"The TDP of the [AMD 5850U] APU is specified at 15 Watt (default) and can be configured from 10 to 25 Watt by the laptop vendor (most chips are configured higher than 15 Watt)." [1]
I'd love to see actual measurements for both chips.
TDPs don't mean anything even within a vendor's lineup, and doing cross-vendor comparison is futile. Only once you properly measure power consumption at load yourself can you start doing a real, apples-to-apples efficiency comparison.
‘Beats’, ‘benchmark’ and ‘nm’ are all marketing terms. You can only do so much comparison using such numbers, but what you miss completely is that we are not putting them in a server farm. x86 still comes as a plain CPU offering while the M1 and ARM chips are SoCs. That reduces system power, as far fewer separate chips are needed to do other stuff, while simultaneously being much more performant. A 5850U mini PC uses a 65W power brick. An M1 Mini sips at most 39W. You would have to compare 1.65x M1 with 1x 5850U.
nm might be a marketing term, but it's sure as hell meaningful. This goes double here as we're comparing between TSMC 7nm and TSMC 5nm, of which the latter is definitively superior in power efficiency.
Similarly, benchmarking is a core component of CPU evaluation, allowing for isolated (and combined) analysis of the CPU's performance characteristics.
Handwaving both of these away is extremely misleading.
I don't think that this is correct. My laptop, a Lenovo P14s Gen 2 AMD with the 5850U CPU, has a 65-watt USB-C power adapter. Half of that power will be consumed by the monitor (4K and 500 cd/m²), and you write above that the whole system should consume 65 watts... This is not possible at all, because that leaves only 30-35 watts for the whole system excluding the monitor.
If your system exceeds the maximum amount of power provided by the power adapter, it will use the internal battery. Otherwise it will throttle the CPU.
Laptops are not designed to do that under consumer operating conditions (they can, sure, but they won’t). No manufacturer will ship you a system with an adapter that can’t power the machine at peak load.
That is not true; Apple laptops (and probably others) are known to do exactly this. The infamous i9 MacBook Pro 16" could definitely use more power than the supplied power adapter and drain the battery under load. And that was perfectly within its specification.
Another possibility is cost reduction as these are lower-end models. At least some of higher-end models (G7 and Alienware) have 180W and 240W power adapters, instead of the 130W included with this model.
> In other words, the laptop will throttle back performance automatically regardless of the settings you chose.
So if you change the performance settings, they allow the laptop to draw 10W from the battery while plugged in for a little bit, but it will throttle down to 95W to keep itself running. It still throttles, which is, I think, the GGP's point.
The point is that a laptop with a 65W power adapter can in fact, draw 90W (or more) for a period of time in practice.
Which means we don't really have a good way to benchmark power usage on laptops in a practical sense. We'd likely need to bust out the soldering iron and oscilloscope and measure currents entering the laptop's VRMs to accurately measure power usage over time.
I know laptops / cores have an "amp-counter" on board somewhere, but there's no guarantee that these devices are consistent or accurate across different laptops. It's sufficient for measuring how much energy different bits of code use (e.g. Linux powertop tools), but not sufficient for comparing Apple M1 vs AMD Zen 3 chips. We need a third, trusted and independent measurement of power usage.
We can't just assume a 65W power adapter leads to 65W peak usage. Perhaps in the past when laptop designs were more in spec that was a decent assumption. But that time has passed, and today's laptops often do peak at power usages far in excess of their charger capacities (albeit temporarily, but even then, that makes measurements / benchmarks very difficult).
--------
I guess if you physically remove the battery pack (is that still allowed on these laptops?) and then plug it in, we might be getting somewhere. But the Macbook Pro doesn't have an easily removable battery pack.
That's the reason we didn't review laptop CPUs when I reviewed CPUs. You can get exact CPU power draw on a desktop motherboard (by using an amp clamp on the P8 connector) but it's hard (or not possible) to do that across multiple laptop chassis.
Removing the battery (when possible) is not a solution either, as what you get may differ a lot from classic "plugged in" usage (see the references to the MacBook Pro and the Dell that used an i9 and still drained the battery when plugged in, because they can use more power than the power adapter provides).
On top of that, way too much depends on the OEM design, and the performance of a given CPU will vary greatly from one chassis to another because of the various throttling mechanisms and the various configurable things that OEMs can do (it's not just the cTDP; as an OEM you can play with various turbo times, and another person mentioned PL2 states, which is one of those).
So a given mobile CPU's performance means nothing at the end of the day; only the laptop "as a whole" can be measured, which is why you don't see good quality benchmarks of mobile CPUs.
Anyway, just a small complement:
> laptops / cores have an "amp-counter" on board somewhere
Intel (and AMD to some measure) CPUs all have various sensors on chip that give you the power consumption in watts (or amps, depending). They can be read with software such as hwinfo [1].
Those are usually not incredibly reliable though; they are not calibrated per CPU and it's very much a guesstimate that could in some cases easily be off by +/- 5W.
So sadly, not usable either (especially on mobile).
I've even plugged my M1 Air into the 20W USB-C iPad charger with no ill effects (somewhat slower charging while it's in use). It's possible that the battery would stop charging or deplete if I had the brightness cranked all the way up and the CPU was doing something intensive.
Check the sibling comments: your CPU (like most mobile Intel/AMD parts) can be configured for different TDPs by the OEMs, so you can't compare anything, especially not PassMark results.
Yes, you can throttle the CPU, but then the CPU scores in benchmarks will be lower, and that's not the case... (benchmarks are collected from different machines). So it looks like the Macs consume pretty much the same amount of power.
No, the TDP is configurable up to 25W (and down to 10). Again check the link of the score distribution I put in another comment, it's all over the place, and the passmark results are not a reliable comparison at that point.
No. The AMD Ryzen™ 7 PRO 5850U can boost up to 4.4GHz; it is likely consuming upwards of 50W for that short duration.
An easy way to verify it is to measure the benchmark delta when on battery and when connected to an external power source. (M1 benchmarks remain almost the same.)
The Intel i7 9750H, for example, has a PL2 of above 80W, and only then can it break the 4GHz barrier, even though the processor is technically rated at only 45W. At 45W it can just maintain the base clock, i.e. 2.6GHz, on all cores.
M1 is much more efficient than any x86 chip on the market right now.
The M1 is hitting 4GHz across only 4 of its cores, whereas the 9750 is driving 6 (and 12 threads on top of that). Furthermore, the M1 will have no problem hitting ~20W during peak load, so frankly the math checks out to me. The comparison definitely starts to deteriorate once you consider that the M1 is ~3x as transistor-dense as the Intel chip, and part of me actually wonders why they didn't get more power out of a chip that only needs to worry about a handful of instructions and doesn't know about hyper-threading.
Well in that case both are made by TSMC so it's comparable. But PassMark results in general are not that great (they use micro benchmarks that may not be representative or are optimised for by vendors).
More importantly, as pointed out by a sibling comment, the AMD CPU in question (like most mobile Intel CPUs) has a "configurable" TDP which is set higher on most products sold. And PassMark doesn't differentiate those and only mention the "official" TDP.
To PassMark's credit, they give a distribution of performance scores; just compare the distribution of the Ryzen and the M1 and you'll see (you have to scroll down a bit to see the graphs):
In general, you can't compare TDPs even within a brand; they rarely mean what they used to mean a few years back, as vendors "innovate" with various turbo mechanisms and other OEM-configurable settings.
Just to give an idea of how crazy it is, here is a list of the Synopsys TSMC 7nm cell libraries [1], and of course there are other companies that offer additional cell libraries. Also, I have no way of proving this, but I wouldn't be at all surprised if both AMD and Apple have their own customizations or additions to whatever cell library they use.
I'd consider waiting a couple weeks. There are several rumors that the redesigned Mac Mini is coming soon. It should have an 8 big-core (and 2-4 little-core) variant and probably up to 64GB of RAM as options in a much smaller package (most of the M1 mini was empty space because it reused the x86 case).
Given similar power consumption, cooling is up to the OEM, not the chipmaker. So this question should be directed at Clevo, Dell, HP, Lenovo, ASUS, Acer, Tongfang, etc.
This is physics after all and 15W are 15W no matter if they go into an Apple M1 or an AMD Ryzen.
It’s not as simple as seeing 15W and saying the thermal load is the same: they don’t have constant usage so you have to compare the actual power usage for workloads you care about. Some manufacturers are more conservative so you need uncommon combinations to hit the maximum output whereas other chips will approach that in normal usage.
This is also where design decisions matter: for example, a while back I measured hashing performance for some boxes which needed to check data integrity, and an Intel chip handily lost despite being faster on everything else, because the embedded processor I was comparing it to had dedicated SHA hardware which was both faster and more power efficient than a generic x86 implementation. That's ancient history now but I would expect Apple to aggressively explore opportunities to improve their stack like that since they control it at every level - for example, I believe benchmarks have shown Objective-C message passing is considerably faster on M1.
Optimising applications for specific hardware has nothing to do with the hardware manufacturer if the software and hardware come from different suppliers, though.
Apple has the advantage of developing and deploying hardware, OS and system software completely in-house.
AMD only supplies chips and basic firmware, both of which can be configured by OEMs/ODMs and the OS and software come from entirely different parties again.
So the usage profile depends on external factors, not just the CPU itself. In the end, however, a 15W power budget is a 15W power budget and an M1 under full load and a Ryzen under full load will have the same thermal output if configured the same (as far as power consumption goes).
How well the waste heat is managed is not in the hands of the CPU.
Your position appears to be conflating requiring coordination with impossibility. Apple has it easier in some ways but it’s not like AMD/Intel, Microsoft, Dell, et al. don’t work together, too. Similarly, while you’re not entirely wrong that the ISVs customize firmware and settings, the defaults and range of options are constrained by the CPU design.
Finally, again, 15W TDP is not the same as 15W under normal usage. That misunderstanding appears to be driving most of your disagreements in this thread.
> Finally, again, 15W TDP is not the same as 15W under normal usage. That misunderstanding appears to be driving most of your disagreements in this thread.
No. The CPUs can be configured to consume no more than 15 Watts, even if few OEMs do so. Same goes for the M1 - there's no difference with regards to this: both the MBP 13 and the Mac Mini have higher power limits and active cooling for that reason.
In fact, the latest U-series mobile Ryzen CPUs are even optimised to be most efficient at a 15W power level, contrary to Intel's Ice Lake chips, which get the most performance at a higher wattage configuration of 28 Watts.
The key thing to understand is that this is not measured power consumption while running the benchmarks and you cannot reliably compare the values even across the same product line, much less across chips — especially when we're talking about SoC designs where, for example, the TDP refers to the entire chip but the benchmarks being discussed are all CPU-focused and don't even exercise the GPU at all. We also know that TDP numbers are not a hard ceiling: there are some chips which under some conditions — most commonly but not always synthetic benchmarks — will exceed those figures, possibly by somewhat significant margins.
In theory it's 15W and 15W, but both chips can use more than that amount under certain circumstances and the AMD chip can even be configured by OEMs with different TDPs. Plus they heat up differently under the same workloads due to different chip features and optimizations (as the sibling comment mentioned).
To my knowledge, the only AMD chips approved for fanless designs are their 5-6w embedded chips and the same is true for Intel.
Those chips usually have 3.5-4GHz turbos, but in a fanless config, you'll never see them (and even with active cooling, you won't see them for more than a handful of seconds).
As impressive as the M1 is, it leaves a sour taste in my mouth. I have a similar issue with Qualcomm's chips, and I guess Tensor will have the same problem:
I can't actually use them.
The M1 innovation is locked to Apple. If I have a great idea for a new device that would be enabled by the M1 perf/power/spec, I'd have no chance of building it.
I hope Google continues what they did with Coral with the Tensor chip. A Raspberry Pi style device, or a compute module of this chip would be a fever dream.
The best thing we have right now is the NXP iMX8 when it comes to performance, and still the RPi when it comes to ease of use.
Maybe I have to dig into Qualcomm and check how to get my hands on their 8xx chips. The fact that there are not really any SBCs with them, however, tells me that it's not easy to get them or use them.
If the author is here and able to do so I’d much appreciate if they would share the complete code for the benchmarking as a whole, so that others may use it for benchmarking other code in the same way :)
On another note, I was amazed to see the SHA-1 and SHA-256 (and the Blake2b) performance numbers when running the benchmark of minio's blake2b-simd (with Go on the M1) on a 16GB M1 Air. For some reason, SHA-1 and SHA-256 with Go on the M1 are getting orders of magnitude better numbers (~2.5GB/sec!), and even the Blake2b code is beating the numbers for the same code running on an Intel Air (which uses the SIMD codepath).
TLDR: Based on a single simple synthetic benchmark, the low performance "Icestorm" cores were shown to be as much as 52%—or as little as 18%—of the performance of the primary "Firestorm" cores. Highly efficient assembly showed the least performance drop whereas complex "idiomatic" Swift code showed the greatest performance drop.
However, the Icestorm cores also use substantially less energy, so they are an efficiency win regardless. Plus they take up significantly less physical space, which is a large cost saving for the SoC.
> Highly efficient assembly showed the least performance drop whereas complex "idiomatic" Swift code showed the greatest performance drop.
I wonder what this means. The efficient assembly probably has fewer instructions and leans more on vector instructions and floating-point calculations, while the "idiomatic" Swift probably just has a larger number of instructions that aren't doing heavy calculation. Does that imply then that the high performance cores do much deeper pipelining, but the number of floating point units or whatever is pretty similar across both core types?
My initial guess is that it's because Icestorm CPUs have less L1 and L2 cache, resulting in more frequent cache misses in complex loops. I'm by no means an expert in any of this, so I really have no place hypothesising.
Firestorm has 128KB L1 per core and 12MB shared L2.
I've never written native code for an apple device, but have some low-level experience on other things; my guess would be that idiomatic swift contains safety checks, i.e. predictable branches, but branches nevertheless - and that simpler cores with poor or no branch prediction suffer much more heavily than "smarter" cores from that kind of code.
At the end of the day, if all you're doing is basic adding/multiplying, you just don't need a fancy core (and hence: GPGPU, right?)
So this microbenchmark result isn't very surprising, although it's nice to see the principles in action.
That is correct. Also, the checks have a bit of an unfortunate downside that they need to be emitted for each array access if the loop is unrolled (as it should be here).
Firestorm has 2x the ports across the board (int, fp, load/store, and simd). To keep those ports busy, it has a massive instruction window (630 entries). Keeping those fed requires large I-cache.
x86 instructions are 15-20% more dense than AArch64, but the fastest x86 cores still have only 32KB of I-cache (in fact, AMD went down from 64KB in Zen 1 to 32KB in Zen 2/3), so I doubt doubling that is the major limitation here.
In this particular micro-benchmark, I doubt even extremely bloated code would get anywhere close to 1k of instructions for the loop let alone 64k. There might be something to say about the size of the dedicated loop cache that many chips have, but that's a completely different animal.
It's generally true. Rust and C++ (and Swift to a smaller extent) are all about "zero cost abstractions", but they usually aren't. It's hard to beat the C-style for-loop: they map really closely to the underlying hardware, and modern compilers are starting to become really good at auto-vectorizing them.
> It's hard to beat the C-style for-loop: they map really closely to the underlying hardware, and modern compilers are starting to become really good at auto-vectorizing them.
Except with many compilers and depending on the loop, for-loops may optimise much more poorly than their C++/Rust iterator-based counterparts.
That's because modern language constructs can communicate intent much more clearly to the compiler than a primitive for-loop can; the compiler has to make fewer assumptions (which can be wrong).
That's not a general rule. In Rust iterator chains are frequently faster than for loops, and are just as likely to be auto-vectorized (or not at the whim of the compiler) as for-loops.
This isn't the right lesson to derive from this. These benchmarks show that the high performance cores pack a lot of smarts which can maintain high performance with idiomatic Swift.
Idiomatic Swift is still great for code which doesn't have significant performance consequences and/or is likely to only ever run on the Firestorm cores—i.e. any user-interactive process.
If you're writing a background task which is expected to have some non-negligible load on the CPU, it could make sense to experiment with simpler C-style coding for the hottest loops in your code.
Eyeballing die shots, Firestorm is about 4 times the size of Icestorm, so if you have perfectly parallelizable workloads, you stand to gain if Firestorm is less than 4 times as fast per core. (This ignores L2 cache and maybe some other supporting transistors, which makes things much more complicated.)
In the simple benchmarks, the speed differences range roughly from 2x to 5x. It looks like the current configuration (4 Firestorm + 4 Icestorm) is pretty well balanced: equisized alternatives (5 Firestorm + 0 Icestorm, or 3 Firestorm + 8 Icestorm) can be faster for specific workloads, but probably not across the board. The Apple CPU team really knows what they're doing (but that has been abundantly clear for many years now).
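A back-of-the-envelope sketch to make that concrete, taking the die-area ratio as exactly 4:1 and an Icestorm core as somewhere between 1/5 and 1/2 of a Firestorm core (the numbers and the throughput helper are illustrative, not measurements):

// Aggregate throughput in "Firestorm equivalents" for a given core mix,
// where `ratio` is Icestorm throughput as a fraction of Firestorm's.
func throughput(fire: Int, ice: Int, ratio: Double) -> Double {
    Double(fire) + Double(ice) * ratio
}

for ratio in [0.2, 0.5] {                                     // the ~5x and ~2x cases
    let shipped = throughput(fire: 4, ice: 4, ratio: ratio)   // M1 as shipped
    let allBig  = throughput(fire: 5, ice: 0, ratio: ratio)   // same die area
    let moreIce = throughput(fire: 3, ice: 8, ratio: ratio)   // same die area
    print(ratio, shipped, allBig, moreIce)
}

With those numbers the shipped 4+4 mix tracks the best equisized alternative reasonably closely across the range (4.8/5.0/4.6 at one extreme, 6.0/5.0/7.0 at the other), which fits the "pretty well balanced" observation above.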
Icestorm cores max out at 2GHz, but have 30-50% the performance of the 3.2GHz Firestorm cores. Accounting for that frequency difference, the actual performance difference if Icestorm were to boost clocks would be closer to 50-60% of the Firestorm IPC.
Zen 3 at 4.9GHz trades blows with M1 at 3.2GHz in single-core benchmarks.
It seems when you adjust all the things that Icestorm should be 10-30% slower than Zen 3.
Putting that in perspective, Icestorm cores (based on this one microbenchmark) would have around the same performance per clock as the original Zen cores, but consume only 5-10% of the power.
Yes, that was weird for a language with such a high focus on performance. My guess is that the compiler couldn't inline the "idiomatic" operations and had to fall back on generic implementations, which will be painfully slow in Swift due to the overhead of being ABI independent.
You could do all this in other languages too, but it wouldn't be the "idiomatic" choice during development; it would be clear that you're giving up on performance for the sake of flexibility.
Consider JS. The builtin implementation is naive and chaining calls results in multiple passes over the array.
Lodash is different. It uses an iterator internally, so chaining isn't much of a performance difference (just the overhead of the function calls which are probably inlined by most JITs anyway).
I see no reason why Swift couldn't (or shouldn't) do this in a future version.
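For reference, Swift's lazy sequences already offer this kind of fusion on an opt-in basis; a small sketch (the array and the eager/fused names are just for illustration):

let numbers = Array(0..<1_000)

// Eager chaining: each step materializes an intermediate array,
// so the pipeline makes multiple passes, like the naive JS builtins.
let eager = numbers.map { $0 * 2 }.filter { $0 % 3 == 0 }.reduce(0, +)

// Lazy chaining: map and filter are fused into the final reduce,
// so the whole pipeline runs in a single pass with no temporaries,
// roughly what lodash's chained iterators do.
let fused = numbers.lazy.map { $0 * 2 }.filter { $0 % 3 == 0 }.reduce(0, +)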
I suspect that, for most cases, it won’t matter (not everything needs to be optimized -heresy, I know).
But for some “inner loop” stuff, it could make a big difference.
Another issue is that visitors will hit the entire set, even when it is no longer necessary to continue. We can do some tricks to make subsequent visits shorter, but they'll happen nonetheless. This always bothers me.
I've been very disappointed in my M1 MBA's performance. I had used an 8-year-old PC desktop and a Chromebook before. The PC died, so I tried to get an MBA to replace both, and honestly it's just slower than either at most things. The thing it does well, and the reason I keep using it, is that my Chromebook doesn't support a few software applications I like using.
There are also what appear to be intermittent sound card issues, but every time I make an appointment to get it fixed, it decides not to have them.
I mean it's OK, but I'd wait for v2 of the chip. I just don't get how people's experience with the M1 is so different from mine.
It sounds to me like you probably got a bad unit: I did a simple benchmark of compiling the abcl codebase from scratch and the M1 was significantly faster than my year-old desktop with an AMD 3800 and 128GB of RAM.
With this CPU design some cores are optimised for performance (at the expense of using more power) while some cores are optimised for efficiency (using the least power at the expense of computing performance). This makes sense for laptops and smartphones, as it can save power and thus run longer when being powered by batteries. But (in my opinion) not for Desktop PC's where most people care more about computing performance than saving a few watts.
> not for Desktop PC's where most people care more about computing performance than saving a few watts.
Not for all workloads. Daemons and services don’t usually need much oomph, and being able to run them at double the efficiency and avoid them competing for resources with software which actually wants the higher power cores would be amazing: less context switching, less cache thrashing, less random variability, …
Hell, it would also help make low-power software more reliable e.g. foobar and the audio output don’t need much but they need it. If something else is loading the machine to the gills they can get starved and audio will start breaking down.
I do understand the pros and cons. My point was that it doesn't matter when you prefer peak performance vs efficiency on a desktop PC. I'd rather have 8 Firestorm cores on my PC, with one or two of these fast cores dedicated to the "lighter" workload, even if it consumes a few more watts. When your desktop computer is slow doing a CPU-intensive task, nobody is going to appreciate that it is consuming less power - everyone will be cribbing about the slow speed.
Assuming die area is the constraint, the choice is between 4 firestorm + 4 icestorm and 5 firestorm, because a firestorm core is 4x bigger than an icestorm core.
Or between 8 firestorm and 7 fire + 4 ice (or 6 fire + 8 ice).
I thought the idea was that not all processes are the same class of citizen. Some of 'em go to high performance cores, but some of them always get assigned to what's basically a second computer running in parallel and handling interface stuff or system management, typically a lot less powerful (though according to this article, far from helpless).
In this way the CPU-intensive stuff never has to break its concentration by sullying its tiny processors with utility calculations, but keeping the observable system responsive (a much less demanding task) goes to the second, weaker computer, that's running as if nothing else is happening. So neither thing ends up being slow, because neither thing particularly has to be interleaved with the other.
Source: I'm no CPU or SoC designer, but I own a Mac M1 laptop for the purposes of compiling open source plugins to the new machines, so I've done hundreds of XCode compiles on the machine, and lots of housekeeping work in Finder. None of it seems slow, by any metric or perspective.
I understand that. But that wasn't the point. Historically, laptop and desktop PC CPUs have been different because of the different requirements - laptops prioritise efficiency whereas desktop PC users prioritise performance. Apple instead has used the same CPU in their tablets, laptops and desktop PCs.
(And note that even Intel / AMD processors do the kind of power scaling that you described without any specialised low power cores - all cores are designed to run at varying speed. Thus, when a small task is executing, the core executing it runs it at the minimum acceptable speed and saves power. But the same core can run at full speed on demand. That's a more acceptable tradeoff for me for Desktop PCs, especially considering that ARM CPUs already are very efficient.)
> This makes sense for laptops and smartphones, as it can save power and thus run longer when being powered by batteries. But (in my opinion) not for Desktop PC's where most people care more about computing performance than saving a few watts.
You're probably right about laptops and smartphones. I don't have a view on whether you're right about desktops, but I think it's probably not true for the big datacentre providers, who absolutely optimise for power and heat.
If the only niche that doesn't care about power is the desktop, and desktops are absolutely a niche when compared to the total market of smartphones, laptops and datacentres, I suspect power efficiency is coming to the desktop whether you want it or not, whether it's a positive or not, if only because of economies of scale as far as chip development and production goes.
Yes, I realise that datacenters do care about power consumption too, and that is why they find the idea of ARM server CPUs so attractive. My point of course was directed at home users of desktop PCs - I don't like the idea of sacrificing performance to save a few watts, when ARM CPUs are already very efficient. (I mean, when you are rendering a 3D scene, or encoding a video, or compiling code, no one I know is going to appreciate the longer times it takes to execute these tasks because not all cores are equal.)
I'm not sure that you could make a case for this not making sense in a desktop computer, as everything is ultimately a trade-off.
It's fairly clear that the Icestorm cores represent a gain in terms of performance per watt, but also in die area. The four Icestorm cores and their support infrastructure take up about the same physical space as one Firestorm core with its support infrastructure.
I doubt that an M1 with five Firestorm cores would perform as well as the eight cores we did get.
My case for having 8 fast Firestorm cores instead of 4 fast + 4 slow cores is simple - the majority of Desktop PC users don't care about sacrificing performance to save a few watts. When a PC is running slow because of a cpu intensive task, nobody is going to appreciate that it is saving a few more watts.
And from the figures in the article, it looks like 1 Firestorm core is very roughly equal to 2 Icestorm cores -
Relative to their Firestorm times, Icestorms performed more slowly by:
190% running assembly language
330% running simd (Accelerate) library functions
280% running simple Swift
550% running ‘idiomatic’ Swift
So 5 or 6 Firestorm cores could have matched the current 4 + 4 config (disregarding the valid point you made about the die area).
More Firestorm cores means more die area, which means the SOC gets significantly more expensive and/or supply won't keep up with demand.
Obviously it would be great to have 8 Firestorm cores in a desktop Mac—and this will probably be what we'll get in a future M1X—but as long as we're doing silicon fan fiction, why not ask for 32 Firestorm cores? That would be cool.
The M1 is cheaper than the previous Intel i7 Mac Minis, so I don't think it would have mattered if the CPU was slightly more expensive. People would have been willing to pay that price. It was just less risky, and cheaper, for Apple to use the same chip on both laptops and desktop PCs, even though the CPU trade-offs and consumer expectations for the two are different.
It’s worth remembering that the M1, a chip which has exceeded all expectations of performance, is still just a low end CPU, not merely a “laptop” part. Its purpose is to be efficient with respect to cost. And this includes the parts cost of higher wattage power delivery and the parts cost of more aggressive heat dissipation.
Future M1 derivatives and successors will offer higher performance at a higher price.
The rumours are that this year's MacBook Pros will feature an M1X processor with an 8 + 2 configuration, which seems like it might well be a good tradeoff to me. The 2 small cores reduce context switching on the big cores, making overall performance a lot snappier.
I'm going to guess that if you could have 8 Firestorm, you would always be better off having 6 Firestorm/8 Icestorm.
Always. In practical use. Even for the most isolated tasks there'll be enough 'maintenance work' for the system to do, in practice, that the pool of Icestorms will justify their real estate.
If I'm mistaken, it'll be because there proves to be a maximum number of Icestorm cores for the 'maintenance work' that's called for: say if you had a 64/64 system, the idea is that maybe you get to a point where you can have 'too many cores' on the Icestorm side, wastefully so.
But that assumes they play NO role in setting up the Firestorms to perform optimally. All this stuff, all this load balancing and assigning, isn't magically done by some GrandMasterCore, it's being worked out by the same CPUs that are benefitting from the load balancing.
If you can use Icestorm cores to better set up and load balance the Firestorm cores, if you can burn unused CPU on the Icestorm side to optimize the Firestorm side… then there is NEVER a situation where you'd want the Firestorms to outnumber Icestorms. If that's happening then the Icestorms are doing some grunt work to clear a path for the Firestorms, and if you can burn more cycles to further optimize that, well…
> the majority of Desktop PC users don't care about sacrificing performance to save a few watts
If you ask on those terms, agreed. But if quantified in terms of value delivered, it’s a different story. For example, if your computer can be always on with tasks like updates or if you can dedicate tasks to the slow cores there may be a value there.
A small home server running 24/7 on low load mostly off of the small cores that can seamlessly ramp up to big cores for power when needed sounds exactly like the sort of thing you'd want.
Depending on the specific workload, if the “slow” cores are more efficient per watt it may be preferable to have only a small number of “fast” cores for single-threaded applications and many slow cores. Or possibly vice versa if that’s not true.
The M2 processor could be 4-big 64-LITTLE, or 16-big 8-LITTLE or some other combination.
I understand that perfectly. But how willing are you to sacrifice performance for that, is my point. There's a reason why CPU makers focus on building faster chips - that's what people care about most on Desktop PCs.
You don't need to sacrifice performance, since CPUs step down automatically. For a hybrid design they only use the smaller cores for less demanding tasks. In fact this design gives a perceived increase in performance, as the author has written about.
Of course you are sacrificing performance when you put slower cores alongside fast cores, instead of using all fast cores. And for what - ARM CPU cores are already very efficient. So saving a few more watts with the slower cores, at the expense of performance (faster cores), isn't a worthwhile tradeoff when you are doing some CPU-intensive task on your desktop. Saving a few extra watts matters on phones / tablets / laptops because you can use them longer on battery.
Not only that, I'd prefer to have a small amount of RAM and CPU running 24/7 for always-on features that I'd love to have my PC doing.
I don't like having to run a Pi for some stuff just because I don't want my huge tower running all the time. It would be really neat if it could run at anything between 5 and 600W; I'm not sure though if the PSUs would be able to offer that range.
Also, aren’t desktop CPUs constrained by thermal load at some point or can we use ever bigger coolers?
Personally, I find it almost obscene that my desktop PC consumes roughly as much as a good old incandescent lightbulb (60+W) while idling. My laptop uses as much under full load.
Your PC uses 60W idling? Is that with the screen? It's not too much in that case. CPUs and GPUs have gotten a lot better at idle power consumption, and PSUs are also quite efficient these days.
Downvolt and downclock your RAM when possible.
Turn off VRM phases when not needed (also on the GPU; RAM should be handled properly already on it).
Allow package-C6.
Turn on PCIe power saving stuff.
There is a bunch of stuff that mildly hurts performance and greatly improves efficiency in intermittent workloads, by letting hardware sleep/power-off when not needed.
I guess so - I really don't care about the fan noise (I'd rather PC makers think more about soundproofing than about reducing fan RPM and thus reducing performance).