A couple of people seem surprised that the "idiomatic" Swift code seems to perform much worse. Unfortunately the blog post doesn't provide many details on the code it is actually running, so I tried my hand at guessing:
let vC = [Float](repeating: 1, count: 4)

func unidiomatic() -> Float {
    var vA = [Float](repeating: 0, count: 4)
    let vB = [1, 2, 3, 4] as [Float]
    var tempA: Float = 0.0
    for _ in 0..<1000 {
        for i in 0...3 {
            tempA += vA[i] * vB[i]
        }
        for i in 0...3 {
            vA[i] = vA[i] + vC[i]
        }
    }
    return tempA
}
func idiomatic() -> Float {
    var vA = [Float](repeating: 0, count: 4)
    let vB = [1, 2, 3, 4] as [Float]
    var tempA: Float = 0.0
    for _ in 0..<1000 {
        tempA = zip(vA, vB).map(*).reduce(0, +)
        for (index, value) in vA.enumerated() {
            vA[index] = value + vC[index]
        }
    }
    return tempA
}
When compiled with optimizations (it's unclear if the author did this?), the "idiomatic" code here is indeed much worse. It's easy to fix the worst of that by working on a lazy sequence, i.e.
    tempA = zip(vA, vB).lazy.map(*).reduce(0, +)
which will save on an extra array allocation and instead run the sum "in place" (in this case, vB is recognized as a constant in both versions and distributed accordingly). Unfortunately there is still a small amount of overhead because zip doesn't know that the arrays are of the same length, so the compiler inserts some extra bounds checks; also, the second loop gets transformed into what is essentially
temp = __swift_intrinsic_move(vA)
for (index, value) in vA.enumerated() {
    temp[index] = value + vC[index]
}
vA = __swift_intrinsic_move(temp)
but this should ideally bring the performance of the two much closer.
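Putting that together, the "fixed" idiomatic version I have in mind looks roughly like this (same vB/vC setup as above; the function name is just for illustration, not from the original post):

func idiomaticLazy() -> Float {
    var vA = [Float](repeating: 0, count: 4)
    let vB = [1, 2, 3, 4] as [Float]
    var tempA: Float = 0.0
    for _ in 0..<1000 {
        // The lazy zip/map avoids materializing an intermediate array,
        // so the reduce runs the sum directly over the pairs.
        tempA = zip(vA, vB).lazy.map(*).reduce(0, +)
        for (index, value) in vA.enumerated() {
            vA[index] = value + vC[index]
        }
    }
    return tempA
}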
That just means the compiler needs more work around generating optimal code. Rust has some similar idioms and does much better, with the difference between "C-like" Rust and idiomatic Rust usually being small to nonexistent. With Rust you can still sometimes hit edge cases though where the compiler spits out slow code for some idiom that should be zero cost.
Yes, Swift lags behind on this. Part of it is a language design thing (lots of extra allocations by default, certain invariants cannot reasonably be expressed to the compiler, lots of optimization information is thrown away before it hits the middle end), part of it (IMO) is a priority thing. Rust has an extremely strong desire to compare favorably against C/C++, and missed optimizations make it look bad. Swift is working in this area but it seems to be late to the game.
Also, just to be entirely clear, an example like this (once I “fixed” it) is probably the best Swift can reasonably expect to get, since it hits all the hardcoded semantic hints that Array has been annotated with. It actually does get quite close to the unidiomatic version; save for some extra bounds checks from the zip and a dubious double move, it’s basically done as well as it could with constant propagation and inlining. More complex examples are unlikely to fare as well.
Bound checks can be an issue in Rust as well, actually. It's often worthwhile to hoist them manually out of loops, which usually enables the compiler to optimize them out within the loop and even vectorize the resulting code.
(This can't happen automatically because the default semantics generally require a panic or exception to be triggered when the check fails; there's no way to optimize this further.)
A trick I've found to eliminate them in a few cases: pass &[T; size] instead of &[T] when dealing with statically sized arrays. The compiler will elide bounds checking a lot more aggressively with statically sized arrays since the bounds are known at compile time.
In this case none of the accesses should fail because they can all be statically observed to be in-bounds; unfortunately the Swift compiler was not able to see that.
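If you've convinced yourself the accesses really are in bounds and the checks are what's hot, one way to sidestep them in Swift is to drop down to buffer pointers, which usually avoids the per-element checks in optimized builds. A rough sketch with a made-up dot helper (this trades safety for speed, so the caller owns the safety argument):

// Hypothetical helper: dot product over the common prefix of two arrays.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    let n = min(a.count, b.count)   // hoist the bound once, outside the loop
    return a.withUnsafeBufferPointer { (pa) -> Float in
        b.withUnsafeBufferPointer { (pb) -> Float in
            var acc: Float = 0
            for i in 0..<n {
                acc += pa[i] * pb[i]
            }
            return acc
        }
    }
}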
I was under the impression that this category of optimizations ("deforestation") is difficult, and requires some immutability/side-effect-free guarantees to be performed effectively. Is that kind of optimization even feasible in Swift?
Emphasis on similar, though. This sometimes comes up when people are discussing D ranges, and it's important to keep in mind that in some cases the autovectorization cannot happen because a tradeoff was made that we value locality over parallelism, i.e. just reduce rather than map-reduce.
Transforming from reduce to map-reduce is usually very trivial, but it has to be kept in mind.
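To make that concrete in Swift terms (xs/ys are just placeholder arrays, not from the original benchmark), the two shapes look roughly like this:

let xs: [Float] = [1, 2, 3, 4]
let ys: [Float] = [5, 6, 7, 8]

// Fused reduce: a single pass with good locality, but the serial
// accumulator gives the compiler less room to parallelize/vectorize.
let fused = zip(xs, ys).reduce(Float(0)) { acc, pair in acc + pair.0 * pair.1 }

// map then reduce: the elementwise multiply can vectorize on its own,
// at the cost of a temporary (unless it's done lazily) and a second pass.
let split = zip(xs, ys).map(*).reduce(0, +)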
Unfortunately Swift is kind of verbose to read and doesn't really look good in Godbolt (this is why I didn't link it–also, it only does Intel). Instead I compiled on my own machine and opened the binary in Ghidra.
but is 19% slower in single core performance. However, if you consider that AMD uses 7nm and Apple 5nm technology to build their processors, AMD is a lot better.
> AMD 5850U beats Apple M1 in multicore performance by 29% having the same power consumption (15 Watt):
The last part is not entirely accurate. They have the same TDP. Not the same power consumption. The 5850U doesn't use 15W in those tests, and the same goes for the M1, which is closer to 20W max.
For both Intel and AMD, the word TDP means typical TDP, not what it means in the literal sense. That is excluding cTDP and other states like PL2.
Worth mentioning that the M1 achieves that single-thread performance at no more than 5W. If you put the two on an equal footing, even accounting for the possible node improvement, the M1 is still quite far ahead in terms of perf/watt. And the 5850U is already on Zen 3.
Indeed, thank you. It's plainly pathetic this has to be said so often, especially here FFS. Every time this topic comes up it's "AMD can do the same, 15W", as if the "15W" figure were a definitive value from a set of industry-standard "TDP" figures. It is not, and the listed "TDP" at best offers a clue as to the power draw. Not much more. Hell, it can even differ by motherboard config. Intel encourages it.
The next emotional preservation tactic usually cites the old GF IO Die, but that was only on H/desktop series chips anyways and furthermore they still lose to Apple in sheer performance per watt.
It's August 2021 and we still have to have this conversation. Sigh
Don't forget the GPU either. With all the cores going full-out, the M1 can still run its GPU at high clocks.
AMD reduced the CU count down to 8 and ramped the clockspeeds which is terrible for the thermal budget. If you need to offload stuff to the GPU, both GPU clocks and CPU clocks dramatically lower. AMD needs a 16CU design with RDNA2 if they hope to actually compete with current and upcoming designs.
Speaking of upcoming, Apple's next generation will be announced in the next 2-3 weeks. A15 and either M1X or M2 (or maybe both) on N5P which should be 10-15% better than the previous N5 process. That's what 5850U is actually competing against considering how long it took to get out the door.
Things aren't looking pretty for x86. Now if we could just get some nice RISC-V designs shipping...
"The TDP of the [AMD 5850U] APU is specified at 15 Watt (default) and can be configured from 10 to 25 Watt by the laptop vendor (most chips are configured higher than 15 Watt)." [1]
I'd love to see actual measurements for both chips.
TDPs don't mean anything even within a vendor's lineup, and doing cross-vendor comparison is futile. Only once you properly measure power consumption at load yourself can you start doing a real, apples-to-apples efficiency comparison.
‘Beats’, ‘benchmark’ and ‘nm’ are all marketing terms. You can only do so much comparison using such numbers, but what you miss completely is that we are not putting them in a server farm. x86 still comes as a plain CPU offering while the M1 and ARM chips are SoCs. That reduces system power, as far fewer separate chips are needed to do other stuff, while simultaneously being much more performant. A 5850U mini PC uses a 65W power brick. An M1 Mini sips at most 39W. You would have to compare 1.65x M1 with 1x 5850U.
nm might be a marketing term, but it's sure as hell meaningful. This goes double here as we're comparing between TSMC 7nm and TSMC 5nm, of which the latter is definitively superior in power efficiency.
Similarly, benchmarking is a core component of CPU evaluation, allowing for isolated (and combined) analysis of the CPU's performance characteristics.
Handwaving both of these away is extremely misleading.
I don't think that this is correct. My laptop, a Lenovo P14s Gen 2 AMD with the 5850U CPU, has a 65-watt USB-C power adapter. Half of that power will be consumed by the monitor (4K and 500 cd/m²), and you write above that the whole system should consume 65 watts... This is not possible at all, because that leaves only 30-35 watts for the whole system excluding the monitor.
If your system exceeds the maximum amount of power provided by the power adapter, it will use the internal battery. Otherwise it will throttle the CPU.
Laptops are not designed to do that under consumer operating conditions (they can, sure, but they won’t). No manufacturer will ship you a system with an adapter that can’t power the machine at peak load.
That is not true; Apple laptops (and probably others) are known to do exactly this. The infamous i9 MacBook Pro 16" could definitely use more power than the supplied power adapter and drain the battery under load. And that was perfectly within its specification.
Another possibility is cost reduction as these are lower-end models. At least some of higher-end models (G7 and Alienware) have 180W and 240W power adapters, instead of the 130W included with this model.
> In other words, the laptop will throttle back performance automatically regardless of the settings you chose.
So if you change the performance settings, they allow the laptop to draw 10W from the battery while plugged in for a little bit, but it will throttle down to 95W to keep itself running. It still throttles, which is, I think, the GGP's point.
The point is that a laptop with a 65W power adapter can in fact, draw 90W (or more) for a period of time in practice.
Which means we don't really have a good way to benchmark power usage on laptops in a practical sense. We'd likely need to bust out the soldering iron and oscilloscope and measure currents entering the laptop's VRMs to accurately measure power usage over time.
I know laptops / cores have an "amp-counter" on board somewhere, but there's no guarantee that these devices are consistent or accurate across different laptops. It's sufficient for measuring how much energy different bits of code use (e.g. Linux powertop tools), but not sufficient for comparing Apple M1 vs AMD Zen 3 chips. We need a third, trusted and independent measurement of power usage.
We can't just assume a 65W power adapter leads to 65W peak usage. Perhaps in the past when laptop designs were more in spec that was a decent assumption. But that time has passed, and today's laptops often do peak at power usages far in excess of their charger capacities (albeit temporarily, but even then, that makes measurements / benchmarks very difficult).
--------
I guess if you physically remove the battery pack (is that still allowed on these laptops?) and then plug it in, we might be getting somewhere. But the Macbook Pro doesn't have an easily removable battery pack.
That's the reason we didn't review laptop CPUs when I reviewed CPUs. You can get exact CPU power draw on a desktop motherboard (by using an amp clamp on the P8 connector) but it's hard (or not possible) to do that across multiple laptop chassis.
Removing the battery (when possible) is not a solution either, as what you get may differ a lot from classic "plugged in" usage (see the references to the MacBook Pro and the Dell that used an i9 and still drained the battery when plugged in, because they can use more power than the power adapter provides).
On top of that, way too much depends on the OEM design, and the performance of a given CPU will vary greatly from one chassis to another because of the various throttling mechanisms and the various configurable things that OEMs can do (it's not just the cTDP; as an OEM you can play with various turbo times, and another person mentioned PL2 states, which is one of those).
So a given mobile CPU's performance means nothing at the end of the day; only the laptop "as a whole" can be measured, which is why you don't see good quality benchmarks of mobile CPUs.
Anyway, just a small complement:
> laptops / cores have an "amp-counter" on board somewhere
Intel (and AMD to some measure) CPUs all have various sensors on chip that give you the power consumption in watts (or amps, depending). They can be read with software such as hwinfo [1].
Those are usually not incredibly reliable though; they are not calibrated per CPU and it's very much a guesstimate that could in some cases easily be off by +/- 5W.
So sadly, not usable either (especially on mobile).
I've even plugged my M1 Air into the 20W USB-C iPad charger with no ill effects (somewhat slower charging while it's in use). It's possible that the battery would stop charging or deplete if I had the brightness cranked all the way up and the CPU was doing something intensive.
Check the sibling comments: your CPU (like most mobile Intel/AMD parts) can be configured for different TDPs by the OEMs, so you can't compare anything, especially not PassMark results.
Yes, you can throttle the CPU, but then the CPU scores in benchmarks will be lower, and that's not the case... (benchmarks are collected from different machines). So it looks like the Macs consume pretty much the same amount of power.
No, the TDP is configurable up to 25W (and down to 10). Again check the link of the score distribution I put in another comment, it's all over the place, and the passmark results are not a reliable comparison at that point.
No. The AMD Ryzen™ 7 PRO 5850U can boost up to 4.4GHz; it is likely consuming upwards of 50W for that short duration.
An easy way to verify it is to measure the benchmark delta when on battery and when connected to an external power source. (M1 benchmarks remain almost the same.)
The Intel i7 9750H, for example, has a PL2 of above 80W, and only then can it break the 4GHz barrier, even though the processor is technically rated at only 45W. At 45W it can just maintain the base clock, i.e. 2.6GHz, on all cores.
M1 is much more efficient than any x86 chip on the market right now.
The M1 is hitting 4GHz across only 4 of its cores, whereas the 9750 is driving 6 (and 12 threads on top of that). Furthermore, the M1 will have no problem hitting ~20W during peak load, so frankly the math checks out to me. The comparison definitely starts to deteriorate once you consider that the M1 is ~3x as transistor-dense as the Intel chip, and part of me actually wonders why they didn't get more power out of a chip that only needs to worry about a handful of instructions and doesn't know about hyper-threading.
Well in that case both are made by TSMC so it's comparable. But PassMark results in general are not that great (they use micro benchmarks that may not be representative or are optimised for by vendors).
More importantly, as pointed out by a sibling comment, the AMD CPU in question (like most mobile Intel CPUs) has a "configurable" TDP which is set higher on most products sold. And PassMark doesn't differentiate those and only mention the "official" TDP.
To PassMark's credit, they give a distribution of performance scores; just compare the distribution of the Ryzen and the M1 and you'll see (you have to scroll down a bit to see the graphs):
In general, you can't compare TDPs even within a brand; they rarely mean what they used to mean a few years back, as vendors "innovate" with various turbo mechanisms and other OEM-configurable settings.
Just to give an idea of how crazy it is, here is a list of the Synopsys TSMC 7nm cell libraries [1], and of course there are other companies that offer additional cell libraries. Also, I have no way of proving this, but I wouldn't be at all surprised if both AMD and Apple have their own customizations or additions to whatever cell library they use.
I'd consider waiting a couple weeks. There are several rumors that the redesigned Mac Mini is coming soon. It should have an 8 big-core (and 2-4 little-core) variant and probably up to 64GB of RAM as options in a much smaller package (most of the M1 mini was empty space because it reused the x86 case).
Given similar power consumption, cooling is up to the OEM, not the chipmaker. So this question should be directed at Clevo, Dell, HP, Lenovo, ASUS, Acer, Tongfang, etc.
This is physics after all and 15W are 15W no matter if they go into an Apple M1 or an AMD Ryzen.
It’s not as simple as seeing 15W and saying the thermal load is the same: they don’t have constant usage so you have to compare the actual power usage for workloads you care about. Some manufacturers are more conservative so you need uncommon combinations to hit the maximum output whereas other chips will approach that in normal usage.
This is also where design decisions matter: for example, a while back I measured hashing performance for some boxes which needed to check data integrity, and an Intel chip handily lost despite being faster on everything else, because the embedded processor I was comparing it to had dedicated SHA hardware which was both faster and more power efficient than a generic x86 implementation. That's ancient history now but I would expect Apple to aggressively explore opportunities to improve their stack like that since they control it at every level - for example, I believe benchmarks have shown Objective-C message passing is considerably faster on M1.
Optimising applications for specific hardware has nothing to do with the hardware manufacturer if the software and hardware come from different suppliers, though.
Apple has the advantage of developing and deploying hardware, OS and system software completely in-house.
AMD only supplies chips and basic firmware, both of which can be configured by OEMs/ODMs and the OS and software come from entirely different parties again.
So the usage profile depends on external factors, not just the CPU itself. In the end, however, a 15W power budget is a 15W power budget and an M1 under full load and a Ryzen under full load will have the same thermal output if configured the same (as far as power consumption goes).
How well the waste heat is managed is not in the hands of the CPU.
Your position appears to be conflating requiring coordination with impossibility. Apple has it easier in some ways but it’s not like AMD/Intel, Microsoft, Dell, et al. don’t work together, too. Similarly, while you’re not entirely wrong that the ISVs customize firmware and settings, the defaults and range of options are constrained by the CPU design.
Finally, again, 15W TDP is not the same as 15W under normal usage. That misunderstanding appears to be driving most of your disagreements in this thread.
> Finally, again, 15W TDP is not the same as 15W under normal usage. That misunderstanding appears to be driving most of your disagreements in this thread.
No. The CPUs can be configured to consume no more than 15 Watts, even if few OEMs do so. Same goes for the M1 - there's no difference with regards to this: both the MBP 13 and the Mac Mini have higher power limits and active cooling for that reason.
In fact, the latest U-series mobile Ryzen CPUs are even optimised to be most efficient at a 15W power level, contrary to Intel's Ice Lake chips, which get the most performance at a higher wattage configuration of 28 Watts.
The key thing to understand is that this is not measured power consumption while running the benchmarks and you cannot reliably compare the values even across the same product line, much less across chips — especially when we're talking about SoC designs where, for example, the TDP refers to the entire chip but the benchmarks being discussed are all CPU-focused and don't even exercise the GPU at all. We also know that TDP numbers are not a hard ceiling: there are some chips which under some conditions — most commonly but not always synthetic benchmarks — will exceed those figures, possibly by somewhat significant margins.
In theory it's 15W and 15W, but both chips can use more than that amount under certain circumstances and the AMD chip can even be configured by OEMs with different TDPs. Plus they heat up differently under the same workloads due to different chip features and optimizations (as the sibling comment mentioned).
To my knowledge, the only AMD chips approved for fanless designs are their 5-6w embedded chips and the same is true for Intel.
Those chips usually have 3.5-4GHz turbos, but in a fanless config, you'll never see them (and even with active cooling, you won't see them for more than a handful of seconds).
As impressive as the M1 is, it leaves a sour taste in my mouth. I have a similar issue with Qualcomm's chips, and I guess Tensor will have the same problem:
I can't actually use them.
The M1 innovation is locked to Apple. If I have a great idea for a new device that would be enabled by the M1 perf/power/spec, I'd have no chance of building it.
I hope Google continues what they did with Coral with the Tensor chip. A Raspberry Pi style device, or a compute module of this chip would be a fever dream.
The best thing we have right now is the NXP iMX8 when it comes to performance, and still the RPi when it comes to ease of use.
Maybe I have to dig into Qualcomm and check how to get my hands on their 8xx chips. The fact that there are not really any SBCs with them, however, tells me that it's not easy to get them or use them.
If the author is here and able to do so I’d much appreciate if they would share the complete code for the benchmarking as a whole, so that others may use it for benchmarking other code in the same way :)
On another note, I was amazed to see the SHA-1 and SHA-256 (and the Blake2b) performance numbers when running the benchmark of minio's blake2b-simd (with Go on the M1) on a 16GB M1 Air. For some reason, SHA-1 and SHA-256 with Go on the M1 are getting orders of magnitude better numbers (~2.5GB/sec!), and even the Blake2b code is beating the numbers for the same code running on an Intel Air (which uses the SIMD codepath).
TLDR: Based on a single simple synthetic benchmark, the low performance "Icestorm" cores were shown to be as much as 52%—or as little as 18%—of the performance of the primary "Firestorm" cores. Highly efficient assembly showed the least performance drop whereas complex "idiomatic" Swift code showed the greatest performance drop.
However, the Icestorm cores also use substantially less energy, so they are an efficiency win regardless. Plus they take up significantly less physical space, which is a large cost saving for the SoC.
> Highly efficient assembly showed the least performance drop whereas complex "idiomatic" Swift code showed the greatest performance drop.
I wonder what this means. The efficient assembly probably has fewer instructions and leans more on vector instructions and floating-point calculations, while the "idiomatic" Swift probably just has a larger number of instructions that aren't doing heavy calculation. Does that imply then that the high performance cores do much deeper pipelining, but the number of floating point units or whatever is pretty similar across both core types?
My initial guess is that it's because Icestorm CPUs have less L1 and L2 cache, resulting in more frequent cache misses in complex loops. I'm by no means an expert in any of this, so I really have no place hypothesising.
Firestorm has 128KB L1 per core and 12MB shared L2.
I've never written native code for an apple device, but have some low-level experience on other things; my guess would be that idiomatic swift contains safety checks, i.e. predictable branches, but branches nevertheless - and that simpler cores with poor or no branch prediction suffer much more heavily than "smarter" cores from that kind of code.
At the end of the day, if all you're doing is basic adding/multiplying, you just don't need a fancy core (and hence: GPGPU, right?)
So this microbenchmark result isn't very surprising, although it's nice to see the principles in action.
That is correct. Also, the checks have a bit of an unfortunate downside that they need to be emitted for each array access if the loop is unrolled (as it should be here).
Firestorm has 2x the ports across the board (int, fp, load/store, and simd). To keep those ports busy, it has a massive instruction window (630 entries). Keeping those fed requires large I-cache.
x86 instructions are 15-20% more dense than AArch64, but the fastest x86 cores still have only 32KB of I-cache (in fact, AMD went down from 64KB in Zen 1 to 32KB in Zen 2/3), so I doubt doubling that is the major limitation here.
In this particular micro-benchmark, I doubt even extremely bloated code would get anywhere close to 1k of instructions for the loop let alone 64k. There might be something to say about the size of the dedicated loop cache that many chips have, but that's a completely different animal.
It's generally true. Rust and C++ (and Swift to a smaller extent) are all about "zero cost abstractions", but they usually aren't. It's hard to beat the C-style for-loop: they map really closely to the underlying hardware, and modern compilers are starting to become really good at auto-vectorizing them.
> It's hard to beat the C-style for-loop: they map really closely to the underlying hardware, and modern compilers are starting to become really good at auto-vectorizing them.
Except with many compilers and depending on the loop, for-loops may optimise much more poorly than their C++/Rust iterator-based counterparts.
That's because modern language constructs can communicate intent much more clearly to the compiler than a primitive for-loop can; the compiler has to make fewer assumptions (which can be wrong).
That's not a general rule. In Rust iterator chains are frequently faster than for loops, and are just as likely to be auto-vectorized (or not at the whim of the compiler) as for-loops.
This isn't the right lesson to derive from this. These benchmarks show that the high performance cores pack a lot of smarts which can maintain high performance with idiomatic Swift.
Idiomatic Swift is still great for code which doesn't have significant performance consequences and/or is likely to only ever run on the Firestorm cores—i.e. any user-interactive process.
If you're writing a background task which is expected to have some non-negligible load on the CPU, it could make sense to experiment with simpler C-style coding for the hottest loops in your code.
Eyeballing die shots, Firestorm is about 4 times the size of Icestorm, so if you have perfectly parallelizable workloads, you stand to gain if Firestorm is less than 4 times as fast per core. (This ignores L2 cache and maybe some other supporting transistors, which makes things much more complicated.)
In the simple benchmarks, the speed differences range roughly from 2x to 5x. It looks like the current configuration (4 Firestorm + 4 Icestorm) is pretty well balanced: equisized alternatives (5 Firestorm + 0 Icestorm, or 3 Firestorm + 8 Icestorm) can be faster for specific workloads, but probably not across the board. The Apple CPU team really knows what they're doing (but that has been abundantly clear for many years now).
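A back-of-the-envelope sketch to make that concrete, taking the die-area ratio as exactly 4:1 and an Icestorm core as somewhere between 1/5 and 1/2 of a Firestorm core (the numbers and the throughput helper are illustrative, not measurements):

// Aggregate throughput in "Firestorm equivalents" for a given core mix,
// where `ratio` is Icestorm throughput as a fraction of Firestorm's.
func throughput(fire: Int, ice: Int, ratio: Double) -> Double {
    Double(fire) + Double(ice) * ratio
}

for ratio in [0.2, 0.5] {                                     // the ~5x and ~2x cases
    let shipped = throughput(fire: 4, ice: 4, ratio: ratio)   // M1 as shipped
    let allBig  = throughput(fire: 5, ice: 0, ratio: ratio)   // same die area
    let moreIce = throughput(fire: 3, ice: 8, ratio: ratio)   // same die area
    print(ratio, shipped, allBig, moreIce)
}

With those numbers the shipped 4+4 mix tracks the best equisized alternative reasonably closely across the range (4.8/5.0/4.6 at one extreme, 6.0/5.0/7.0 at the other), which fits the "pretty well balanced" observation above.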
Icestorm cores max out at 2GHz, but have 30-50% the performance of the 3.2GHz Firestorm cores. Accounting for that frequency difference, the actual performance difference if Icestorm were to boost clocks would be closer to 50-60% of the Firestorm IPC.
Zen 3 at 4.9GHz trades blows with M1 at 3.2GHz in single-core benchmarks.
It seems when you adjust all the things that Icestorm should be 10-30% slower than Zen 3.
Putting that in perspective, Icestorm cores (based on this one microbenchmark) would have around the same performance per clock as the original Zen cores, but consume only 5-10% of the power.
Yes, that was weird for a language with such a high focus on performance. My guess is that the compiler couldn't inline the "idiomatic" operations and had to fall back on generic implementations, which will be painfully slow in Swift due to the overhead of being ABI independent.
You could do all this in other languages too, but it wouldn't be the "idiomatic" choice during development; it would be clear that you're giving up on performance for the sake of flexibility.
Consider JS. The builtin implementation is naive and chaining calls results in multiple passes over the array.
Lodash is different. It uses an iterator internally, so chaining isn't much of a performance difference (just the overhead of the function calls which are probably inlined by most JITs anyway).
I see no reason why Swift couldn't (or shouldn't) do this in a future version.
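For reference, Swift's lazy sequences already offer this kind of fusion on an opt-in basis; a small sketch (the array and the eager/fused names are just for illustration):

let numbers = Array(0..<1_000)

// Eager chaining: each step materializes an intermediate array,
// so the pipeline makes multiple passes, like the naive JS builtins.
let eager = numbers.map { $0 * 2 }.filter { $0 % 3 == 0 }.reduce(0, +)

// Lazy chaining: map and filter are fused into the final reduce,
// so the whole pipeline runs in a single pass with no temporaries,
// roughly what lodash's chained iterators do.
let fused = numbers.lazy.map { $0 * 2 }.filter { $0 % 3 == 0 }.reduce(0, +)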
I suspect that, for most cases, it won’t matter (not everything needs to be optimized -heresy, I know).
But for some “inner loop” stuff, it could make a big difference.
Another issue is that visitors will hit the entire set, even when it is no longer necessary to continue. We can do some tricks to make subsequent visits shorter, but they'll happen nonetheless. This always bothers me.
I've been very disappointed in my M1 MBA's performance. I had used an 8-year-old PC desktop and a Chromebook before. The PC died, so I tried to get an MBA to replace both, and honestly it's just slower than either at most things. The thing it does well, and the reason I keep using it, is that my Chromebook doesn't support a few software applications I like using.
There are also what appear to be intermittent sound card issues, but every time I make an appointment to get it fixed, it decides not to have them.
I mean it's OK, but I'd wait for v2 of the chip. I just don't get how people's experience with the M1 is so different from mine.
It sounds to me like you probably got a bad unit: I did a simple benchmark of compiling the abcl codebase from scratch and the M1 was significantly faster than my year-old desktop with an AMD 3800 and 128GB of RAM.
With this CPU design some cores are optimised for performance (at the expense of using more power) while some cores are optimised for efficiency (using the least power at the expense of computing performance). This makes sense for laptops and smartphones, as it can save power and thus run longer when being powered by batteries. But (in my opinion) not for Desktop PC's where most people care more about computing performance than saving a few watts.
> not for Desktop PC's where most people care more about computing performance than saving a few watts.
Not for all workloads. Daemons and services don’t usually need much oomph, and being able to run them at double the efficiency and avoid them competing for resources with software which actually wants the higher power cores would be amazing: less context switching, less cache thrashing, less random variability, …
Hell, it would also help make low-power software more reliable e.g. foobar and the audio output don’t need much but they need it. If something else is loading the machine to the gills they can get starved and audio will start breaking down.
I do understand the pros and cons. My point was that it doesn't matter when you prefer peak performance vs efficiency on a desktop PC. I'd rather have 8 Firestorm cores on my PC, with one or two of these fast cores dedicated to the "lighter" workload, even if it consumes a few more watts. When your desktop computer is slow doing a CPU-intensive task, nobody is going to appreciate that it is consuming less power - everyone will be cribbing about the slow speed.
Assuming die area is the constraint, the choice is between 4 firestorm + 4 icestorm and 5 firestorm, because a firestorm core is 4x bigger than an icestorm core.
Or between 8 firestorm and 7 fire + 4 ice (or 6 fire + 8 ice).
I thought the idea was that not all processes are the same class of citizen. Some of 'em go to high performance cores, but some of them always get assigned to what's basically a second computer running in parallel and handling interface stuff or system management, typically a lot less powerful (though according to this article, far from helpless).
In this way the CPU-intensive stuff never has to break its concentration by sullying its tiny processors with utility calculations, but keeping the observable system responsive (a much less demanding task) goes to the second, weaker computer, that's running as if nothing else is happening. So neither thing ends up being slow, because neither thing particularly has to be interleaved with the other.
Source: I'm no CPU or SoC designer, but I own a Mac M1 laptop for the purposes of compiling open source plugins to the new machines, so I've done hundreds of XCode compiles on the machine, and lots of housekeeping work in Finder. None of it seems slow, by any metric or perspective.
I understand that. But that wasn't the point. Historically, laptop and desktop PC CPUs have been different because of the different requirements - laptops prioritise efficiency whereas desktop PC users prioritise performance. Apple instead has used the same CPU in their tablets, laptops and desktop PCs.
(And note that even Intel / AMD processors do the kind of power scaling that you described without any specialised low power cores - all cores are designed to run at varying speed. Thus, when a small task is executing, the core executing it runs it at the minimum acceptable speed and saves power. But the same core can run at full speed on demand. That's a more acceptable tradeoff for me for Desktop PCs, especially considering that ARM CPUs already are very efficient.)
> This makes sense for laptops and smartphones, as it can save power and thus run longer when being powered by batteries. But (in my opinion) not for Desktop PC's where most people care more about computing performance than saving a few watts.
You're probably right about laptops and smartphones. I don't have a view on whether you're right about desktops, but I think it's probably not true for the big datacentre providers, who absolutely optimise for power and heat.
If the only niche that doesn't care about power is the desktop, and desktops are absolutely a niche when compared to the total market of smartphones, laptops and datacentres, I suspect power efficiency is coming to the desktop whether you want it or not, whether it's a positive or not, if only because of economies of scale as far as chip development and production goes.
Yes, I realise that datacenters do care about power consumption too, and that is why they find the idea of ARM server CPUs so attractive. My point of course was directed at home users of desktop PCs - I don't like the idea of sacrificing performance to save a few watts, when ARM CPUs are already very efficient. (I mean, when you are rendering a 3D scene, or encoding a video, or compiling code, no one I know is going to appreciate the longer times it takes to execute these tasks because not all cores are equal.)
I'm not sure that you could make a case for this not making sense in a desktop computer, as everything is ultimately a trade-off.
It's fairly clear that the Icestorm cores represent a gain in terms of performance per watt, but also in die area. The four Icestorm cores and their support infrastructure take up about the same physical space as one Firestorm core with its support infrastructure.
I doubt that an M1 with five Firestorm cores would perform as well as the eight cores we did get.
My case for having 8 fast Firestorm cores instead of 4 fast + 4 slow cores is simple - the majority of Desktop PC users don't care about sacrificing performance to save a few watts. When a PC is running slow because of a cpu intensive task, nobody is going to appreciate that it is saving a few more watts.
And from the figures in the article, it looks like 1 Firestorm core is very roughly equal to 2 Icestorm cores -
Relative to their Firestorm times, Icestorms performed more slowly by:
190% running assembly language
330% running simd (Accelerate) library functions
280% running simple Swift
550% running ‘idiomatic’ Swift
So 5 or 6 Firestorm cores could have matched the current 4 + 4 config (disregarding the valid point you made about the die area).
More Firestorm cores means more die area, which means the SOC gets significantly more expensive and/or supply won't keep up with demand.
Obviously it would be great to have 8 Firestorm cores in a desktop Mac—and this will probably be what we'll get in a future M1X—but as long as we're doing silicon fan fiction, why not ask for 32 Firestorm cores? That would be cool.
The M1 is cheaper than the previous Intel i7 Mac Minis, so I don't think it would have mattered if the CPU was slightly more expensive. People would have been willing to pay that price. It was just less risky, and cheaper, for Apple to use the same chip on both laptops and desktop PCs, even though the CPU trade-offs and consumer expectations for the two are different.
It’s worth remembering that the M1, a chip which has exceeded all expectations of performance, is still just a low end CPU, not merely a “laptop” part. Its purpose is to be efficient with respect to cost. And this includes the parts cost of higher wattage power delivery and the parts cost of more aggressive heat dissipation.
Future M1 derivatives and successors will offer higher performance at a higher price.
The rumours are that this year's MacBook Pros will feature an M1X processor with an 8 + 2 configuration, which seems like it might well be a good tradeoff to me. The 2 small cores reduce context switching on the big cores, making overall performance a lot snappier.
I'm going to guess that if you could have 8 Firestorm, you would always be better off having 6 Firestorm/8 Icestorm.
Always. In practical use. Even for the most isolated tasks there'll be enough 'maintenance work' for the system to do, in practice, that the pool of Icestorms will justify their real estate.
If I'm mistaken, it'll be because there proves to be a maximum number of Icestorm cores for the 'maintenance work' that's called for: say if you had a 64/64 system, the idea is that maybe you get to a point where you can have 'too many cores' on the Icestorm side, wastefully so.
But that assumes they play NO role in setting up the Firestorms to perform optimally. All this stuff, all this load balancing and assigning, isn't magically done by some GrandMasterCore, it's being worked out by the same CPUs that are benefitting from the load balancing.
If you can use Icestorm cores to better set up and load balance the Firestorm cores, if you can burn unused CPU on the Icestorm side to optimize the Firestorm side… then there is NEVER a situation where you'd want the Firestorms to outnumber Icestorms. If that's happening then the Icestorms are doing some grunt work to clear a path for the Firestorms, and if you can burn more cycles to further optimize that, well…
> the majority of Desktop PC users don't care about sacrificing performance to save a few watts
If you ask on those terms, agreed. But if quantified in terms of value delivered, it’s a different story. For example, if your computer can be always on with tasks like updates or if you can dedicate tasks to the slow cores there may be a value there.
A small home server running 24/7 on low load mostly off of the small cores that can seamlessly ramp up to big cores for power when needed sounds exactly like the sort of thing you'd want.
Depending on the specific workload, if the “slow” cores are more efficient per watt it may be preferable to have only a small number of “fast” cores for single-threaded applications and many slow cores. Or possibly vice versa if that’s not true.
The M2 processor could be 4-big 64-LITTLE, or 16-big 8-LITTLE or some other combination.
I understand that perfectly. But how willing are you to sacrifice performance for that, is my point. There's a reason why CPU makers focus on building faster chips - that's what people care about most on Desktop PCs.
You don't need to sacrifice performance, since CPUs step down automatically. For a hybrid design they only use the smaller cores for less demanding tasks. In fact this design gives a perceived increase in performance, as the author has written about.
Of course you are sacrificing performance when you put slower cores alongside fast cores, instead of using all fast cores. And for what - ARM CPU cores are already very efficient. So saving a few more watts with the slower cores, at the expense of performance (faster cores), isn't a worthwhile tradeoff when you are doing some CPU-intensive task on your desktop. Saving a few extra watts matters on phones / tablets / laptops because you can use them longer on battery.
Not only that, I'd prefer to have a small amount of RAM and CPU running 24/7 for always-on features that I'd love to have my PC doing.
I don't like having to run a Pi for some stuff just because I don't want my huge tower running all the time. It would be really neat if it could run at anything between 5 and 600W; I'm not sure though if the PSUs would be able to offer that range.
Also, aren’t desktop CPUs constrained by thermal load at some point or can we use ever bigger coolers?
Personally, I find it almost obscene that my desktop PC consumes roughly as much as a good old incandescent lightbulb (60+W) while idling. My laptop uses as much under full load.
Your PC uses 60W idling? Is that with the screen? It's not too much in that case. CPUs and GPUs have gotten a lot better at idle power consumption, and PSUs are also quite efficient these days.
Downvolt and downclock your RAM when possible.
Turn off VRM phases when not needed (also on the GPU; RAM should be handled properly already on it).
Allow package-C6.
Turn on PCIe power saving stuff.
There is a bunch of stuff that mildly hurts performance and greatly improves efficiency in intermittent workloads, by letting hardware sleep/power-off when not needed.
I guess so - I really don't care about the fan noise (I'd rather PC makers think more about soundproofing than about reducing fan RPM and thus reducing performance).