I've been playing around with single tone detection using the Minsky circle algorithm. Without roundoff, Minsky and Goertzel are equivalent, but Minsky tolerates roundoff better, especially at low frequencies. You can think of Goertzel as representing the phase or quadrature information as the difference s[n] - s[n-1], which obviously gets fewer bits allocated to it than the s[n] themselves, particularly near the peaks and at low frequencies.
Precisely, you can derive Minsky’s algorithm (modified for frequency detection by adding the input samples into the register you produce the output from) from applying Goertzel’s algorithm to the first backward differences of the input signal.
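For readers who haven't seen it, here's a minimal sketch of the Minsky circle recurrence adapted for tone detection along the lines described above. The exact placement of the input injection is my assumption, not necessarily the poster's code; the rotation step per sample is 2·sin(π·freq/fs), which gives exactly the probe angle per HAKMEM item 149.

```python
import math

def minsky_detect(samples, freq, fs):
    # Lossless Minsky "circle" rotation tuned to the probe frequency,
    # driven by the input: the state magnitude grows roughly linearly
    # while `freq` is present in the signal and stays small otherwise.
    eps = 2 * math.sin(math.pi * freq / fs)  # rotates 2*pi*freq/fs per sample
    x = y = 0.0
    for s in samples:
        x -= eps * y
        y += eps * x
        y += s  # add the sample into the register the output is read from
    return math.hypot(x, y)
```

Because the rotation is a pair of shears with unit determinant, the state never decays; detection is just comparing the final magnitude for matched versus unmatched probe frequencies.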
The problem I'm working on requires something that works well in the presence of chords, and even though I've made really good progress I'm still about 4% away from SOTA, but hundreds of times faster. Not quite ready to give up.
Anything linear should work well in the presence of chords, I would think? (Although maybe PLLs would work better, depending on what you're doing, despite being nonlinear.)
The HN discussion has some good information, but it also has some mathematical errors.
Presumably you don't mean the School of the Arts or Summits on the Air (though both involve detecting frequencies), so what is SOTA?
PLL is an interesting option, never thought of that, that is worth pursuing, thank you. And like the other comment already wrote, state of the art.
The project is automatic music transcription. The 'gold standard' I've found so far is the Magenta-based 'Onsets and Frames' (https://magenta.tensorflow.org/onsets-frames ), which scores pretty well on a large benchmark.
I'm still a ways away from matching their score but there is a real chance that it will happen. The TensorFlow-based software is super slow, many orders of magnitude slower than realtime, and I have a bunch of applications in mind where slower than realtime is not an option.
Here are the two side-by-side across the same test set (first mine):
// di 24 dec 2019 16:49:29 CET overall total 17482 correct 12604 fp(fantasized) 2723 fn(missed) 4878 f_measure 76.833 precision 82.234 recall 72.097 accuracy 62.381
vs (Magenta O&F):
// wo 25 dec 2019 20:09:51 CET overall total 17482 correct 13865 fp(fantasized) 2908 fn(missed) 3617 f_measure 80.952 precision 82.663 recall 79.31 accuracy 67.999
Clearly the O&F software has the edge in detection but it is close enough that I have some hope of beating that score, I'm already way ahead of what they list as the 'previous state of the art'.
Congratulations! So you're equal in precision but still suffer on recall? Anything in common among the notes your system misses? My friend Alastair Porter recently did his dissertation on tune identification, so he told me a bit about the transcription problem. It's hairy!
I think PLLs make sense when you have chirps, like vibrato or portamento. Not sure they'll help in other cases. For piano they can only hurt, I imagine.
Thank you. There ought to be something to show for working this hard between jobs since August... It's been a very long slog.
> So you're equal in precision but still suffer on recall?
That's about the gist of it. But I suspect that if I fix the recall I'll end up being down on precision again, that's roughly how this has been going so far. Two steps forward, one step back.
> Anything in common among the notes your system misses?
Restrikes are really hard, as are octaves. They are conceptually super simple but when you get all the combinations thrown at you it stops being simple in a hurry.
I'm reasonably optimistic about the restrikes, less so about the octaves, especially when the lower note is the louder one (power chords, for instance).
> My friend Alastair Porter recently did his dissertation on tune identification, so he told me a bit about the transcription problem. It's hairy!
No kidding. Tune identification is usually done on the binary encoding of the melody (up or equal = 1, down = 0) so I can see why transcription would come in handy there.
> I think PLLs make sense when you have chirps, like vibrato or portamento. Not sure they'll help in other cases. For piano they can only hurt, I imagine.
They might be useful in verification of notes already generated.
Well, it is not so much a problem as it is a solution; FFTs are a very important part of what I've built. But the low end is very hard to do precisely using FFTs (because the bin size is large compared to the difference between individual notes), so for that I use autocorrelation, and for the verification step I would like to use something other than FFTs to get a 'second opinion' on the notes detected.
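A minimal sketch of the autocorrelation approach for the low end. The lobe-skipping heuristic (start searching where the ACF first goes negative) is my own simplification, not the poster's implementation:

```python
import numpy as np

def acf_pitch(x, fs):
    """Pitch from the lag of the largest autocorrelation peak."""
    acf = np.correlate(x, x, mode='full')[len(x) - 1:]
    # Skip the zero-lag lobe: start where the ACF first goes negative.
    neg = np.where(acf < 0)[0]
    start = neg[0] if len(neg) else 1
    lag = start + int(np.argmax(acf[start:]))
    return fs / lag

# e.g. a low A (110 Hz), whose neighbors a short FFT would lump into one bin:
t = np.arange(2048) / 8000
estimate = acf_pitch(np.sin(2 * np.pi * 110 * t), 8000)
```

The lag resolution is one sample, which at the bottom of the piano is far finer than the spacing between adjacent notes, which is exactly where the FFT bin width hurts.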
We made the first guitar tuner on a mobile phone in 2006 using the algorithm. It ran like a champ on the Motorola Razr and other ‘dumb phones’ of that time. We charged $25 for it and people loved it as it performed way better than the $25 stand alone tuners. My testing showed it was equivalent to a ~$75 model at the time.
I had a background in signals and controls and was familiar with the DFT. Goertzel was a new variant to me but ran much better on the limited resources.
I suppose that made sense because of the very limited resources, but note that it's not ideal because it doesn't estimate pitch at all; it only tells you whether there's any content inside a frequency bin you provide. A guitar tuner is expected to have an arrow pointing up/down to tell you which way you should adjust, and this algorithm can't give you that. Getting the direction of that arrow accurately can be challenging because of harmonic content (which guitars have a lot of).
These days it would make more sense to use an algorithm that takes harmonics into account and gives you an estimated fundamental frequency, like Miller Puckette's pitch detection algorithm.
Yes, I remember the harmonics were an issue. We were excited when the iPhone came out and had the resources to tune all the strings at once.
I can’t remember what we did, but we did provide sharp / flat indication. It definitely worked just like a normal tuner you would find, letting you dial it in quite precisely. Maybe I will crack open the code and look at it...
This is true. Without reaching for state-of-the-art psychoacoustics research, though, even the second autocorrelation peak will take harmonics into account. And you can compute it without the STFT latency, which matters here.
It is possible to estimate where a signal is inside the frequency bin; that's how phase vocoders work.
Couple of ways to do it. If you don't want to deal with the latency required to take the block-to-block phase difference (which somehow ended up with the $50 nomenclature "phase vocoder"), you can just do a parabolic fit on the magnitudes of the central bin and its adjacent pair. 99% of the time, the latter is sufficient. With only three points the math is super trivial [1].
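The three-point parabolic fit is only a couple of lines. Here's a sketch; the test signal and parameters are illustrative, not from the footnote:

```python
import numpy as np

def parabolic_offset(mags, k):
    """Fractional-bin offset of a spectral peak at bin k, from a
    parabola through the three magnitudes around it."""
    a, b, c = mags[k - 1], mags[k], mags[k + 1]
    return 0.5 * (a - c) / (a - 2 * b + c)

# illustrative: a 440.6 Hz tone lands between bins of a 1024-point FFT
fs, n = 8000, 1024
x = np.sin(2 * np.pi * 440.6 * np.arange(n) / fs) * np.hanning(n)
mags = np.abs(np.fft.rfft(x))
k = int(np.argmax(mags))
freq = (k + parabolic_offset(mags, k)) * fs / n  # close to 440.6
```

With a Hann window the residual bias is a small fraction of a bin, which is plenty for a sharp/flat arrow.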
For sharp/flat indication, I'd be inclined to look at the adjacent-bin magnitudes. But it's been explained to me numerous times that tuning an instrument isn't just a matter of measuring its frequency, and while I can't argue with that, I don't think I'm ever going to truly grok it.
> But it's been explained to me numerous times that tuning an instrument isn't just a matter of measuring its frequency, and while I can't argue with that, I don't think I'm ever going to truly grok it.
It depends on the instrument. If the instrument has only a few single physical elements per tone generated and the instrument is frequently tuned, then you are correct (say, a guitar): there is nothing to be done beyond setting the center frequency correctly, depending on how you want to tune.
But tuning a piano or harpsichord is a different matter. First of all, there are many different ways of tuning; for instance, there is 'stretch' tuning, which deals with inharmonicity in the higher regions.
It gets more complex when you have multiple physical resonating elements (such as strings): depending on where you are on the piano keyboard you might have one, two or three strings per note, called a 'unison' (they sound together).
The strings in a single unison not only need to sound good right after tuning, they need to continue to sound good together afterwards and that means that all of them need to degrade gracefully over time as well and that in turn means that they need to lower in tension in a similar way. Good tuners are able to ensure that the string tension is distributed equally between the 'sounding' part of the string (the middle part) and the non-sounding parts (the ends). The amount of tension on the strings and the friction of the pin block and the pins (stick slip) can make this a lot harder than you might think.
Finally, just turning the pins until the frequency matches is going to result in a piano that de-tunes very quickly, the best approach is to go a bit too high and then to slowly lower the pitch while hitting the keys quite hard so that the tension of the string gets distributed evenly.
An added complexity is that the combined 18,000 kg of tension on the frame causes the tuning of one section of the strings to affect all the others. If a piano is very far from being properly tuned you may have to make multiple passes to get the job done.
Even a guitar string has inharmonicity, so the pitch you measure by measuring the lowest resonant frequency is somewhat too low: a sine wave at that frequency will sound lower than the string does. I thought that's what CamperBob2 meant. It's certainly what pierrec was talking about, and it's an advantage of ACF-based pitch estimation algorithms.
That's an interesting question; you'd have to do it either by fret placement or pitch bending with finger pressure. In the latter case it would only happen when you were playing chords with far-apart notes.
Maybe that does give you less latency (if it doesn't require narrower bins for equivalently good results), but it also requires you to compute three frequency bins, which requires 3× the computation with Goertzel (or Minsky), though still far less than a full autocorrelation function.
An amusing thought, which surely JOS has explored: if you use a Karplus–Strong delay line as your resonating element instead of a sinusoid, you can take the overtones into account for free. Maybe that reduces to the ACF, minus some kind of normalization. I'm not sure.
Bizarre that people keep mentioning their "guitar tuners" that use Goertzel. Are you looping over every single possible frequency? You should be using real fundamental frequency estimation/pitch detection algorithms ([1], [2]) instead.
So did you indeed loop over every possible frequency?
I know I attempted a Goertzel-based pitch detector once, where I looped over frequencies and picked the one with the largest magnitude. I picked 25Hz and 4200Hz as my limits (from a piano).
This is a neat trick if you need to detect only a few tones out of the possible range offered by an FFT; it essentially allows you to compute only a single bucket with minimal overhead. As the article states, as long as you run it fewer than log2(n) times, where 'n' is the size of the FFT that would otherwise be required, you come out ahead.
Unfortunately for my application it did not give an improvement but I thought it was interesting anyway, and on embedded platforms something like this may be your only option.
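For reference, the single-bin Goertzel power computation is tiny; this is a standard textbook formulation, not necessarily what either poster used:

```python
import math

def goertzel_power(samples, freq, fs):
    """Signal power near `freq` (Hz), computing just that one bin."""
    coeff = 2 * math.cos(2 * math.pi * freq / fs)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2
```

The classic embedded application is DTMF detection, where 205 samples at 8 kHz are enough to separate the eight dialing tones with a handful of these loops.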
Interesting project. What do you think of https://github.com/sevagh/transcribe ? It seems like in your project the word transcription isn't referring to sheet music transcription?
Well, you can go to sheet music if you want to but there are plenty of packages out there that will display midi files as sheet music so that would seem to be an unnecessary complication. The core problem is polyphonic music -> midi and that is hard enough!
I will take a look at your software, it definitely sounds interesting, the question at the end of https://github.com/sevagh/transcribe/blob/master/doc/snac.md can be answered in the affirmative, the Onsets and Frames package mentioned elsewhere in this thread does just that and works very well.
I used this algorithm to detect the signal sent by heart rate belt sensors a few years ago. Those belts send a quite weakly powered 5kHz "chirp" through a coil for each heartbeat. You're supposed to detect those chirps via another coil, magnetically coupled and tuned with an LC circuit to the 5kHz resonance frequency. Through a combination of a lot of analog amplification and this algorithm I was able to implement detection on an AVR microcontroller, which would never have been possible using FFT or similar techniques.
Whoa. The capability in these little chips blows my mind. Nice to be able to eliminate the LC passives with a bit of code. If the chirps are of adequate length could probably sleep or sample other inputs a good bit of the time too.
For many simple use cases, it is vastly easier to count zero crossings per unit time. One period of a sinusoid contains a single positive-to-negative and a single negative-to-positive transition. No DSP required, though you may first bandpass the audio if you've got a target frequency range. Assumptions: the fundamental is the strongest harmonic present, negligible DC offset.
Simple example: For a 1.0 second audio sample, if the sign changes from +/- or -/+ 880 times, the frequency is 440Hz.
Strong harmonics which overpower the fundamental will fool a zero-cross analysis, resulting in, say, a double or triple frequency output if the 2nd or 3rd harmonic is strongest. The strongest signal present will overwhelm the amplitude sign of the combined signal; so it still works even in the presence of some noise or other signal.
High-amplitude noise will also skew results, producing a higher frequency output. A low-pass or band-pass filter can work wonders.
Again, zero-cross analysis is suitable for simple use cases where you are directly sampling a signal.
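The counting itself is nearly a one-liner; a sketch under the stated assumptions (fundamental is the strongest component, negligible DC offset):

```python
def zero_cross_freq(samples, fs):
    """Frequency estimate from sign changes: a sinusoid crosses zero
    twice per period, so f ~ crossings * fs / (2 * len(samples))."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings * fs / (2 * len(samples))
```

The resolution improves with the length of the window, which is the usual latency/accuracy trade in disguise.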
I wonder if you could take just the sign changes of the input signal and use that in lieu of the zero crossings and get something useful out of it, for instance, a finger print for chords or harmonics.
DSP question: how do I detect signals, such as certain words, in a series? Do I convolve a reference signal? And how do I remove such a signal from the series automatically? Context is I want to remove eye blinks and other such artifacts from an EEG.
I don't know about EEG, but a standard way to perform this type of action is to correlate your signal with a known signal (signature of a blink in your case) and if the correlation result exceeds some arbitrary threshold you set you have a match.
That is getting quite close to the way my software does chord detection, one tone at a time, so yes, that works. What I do is subtract the match from the signal and then re-run the analyzer. This works well enough for signals that are orthogonal, less well for signals that have a lot in common (such as octaves or other close multiples of the original). Correlation is super expensive numerically if you don't know the input phase, by the way: you'd have to hunt for the best match. If you can take it out of the time domain then it gets a lot better.
A bit of research suggests the best way to do what I want is to convolve the signal I want to remove with the series, and subtract that signal at the peak of the convolution, if it passes the threshold.
I also found out that convolution can be computed much more quickly than N^2 (my original concern) by using the Discrete Fourier Transform. The DFT of the convolution of two series is the pointwise product of their DFTs. So taking the inverse DFT of the pointwise product gives me the convolution at a cost of N log N instead of N^2, if I use the Fast Fourier Transform.
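That identity is easy to check numerically. A sketch, with np.convolve standing in for the N^2 direct sum and a made-up "blink" signature (the EEG data and reference shape here are purely illustrative):

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution via the DFT: zero-pad both series to the
    full output length, multiply spectra pointwise, transform back.
    Cost is O(N log N) instead of the O(N^2) direct sum."""
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

blink = np.array([0.0, 1.0, 0.5])            # hypothetical reference signature
series = np.array([1.0, 2.0, 3.0, 2.0, 1.0])  # hypothetical recording
out = fft_convolve(series, blink)
```

Note the zero-padding to length len(a)+len(b)-1: without it the DFT product gives circular, not linear, convolution.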
That's not exactly the genesis of the field but it is an interesting story about genetic algorithms, and the bit that always got me was how there were parts in there used in ways that no human designer would ever use them, and yet they were critical to its functioning.
"Dr. Thompson peered inside his perfect offspring to gain insight into its methods, but what he found inside was baffling. The plucky chip was utilizing only thirty-seven of its one hundred logic gates, and most of them were arranged in a curious collection of feedback loops. Five individual logic cells were functionally disconnected from the rest— with no pathways that would allow them to influence the output— yet when the researcher disabled any one of them the chip lost its ability to discriminate the tones. Furthermore, the final program did not work reliably when it was loaded onto other FPGAs of the same type."
Still amazing to re-read, and taking this into account it will be a long long time before we can finally say that we have cracked the DNA code and all its gory little details.
I always wanted to know what you would get if you developed the same thing, but forced it to deal with different FPGAs during its evolution. Would it still come up with something surprisingly inhuman?
Anything you like. I mean, it precisely calculates the inner product of a constant-amplitude complex exponential with the signal, so the tradeoff is the same as with the FFT, except that you have the option of using a fractional number of cycles.
In the Dercuano note I mentioned in my comment above, I go into some detail about alternative ways to window.
No, not about the same, precisely the same, except for rounding error (which is worse for Goertzel). The numerical result you get is precisely the same. If bin 27 of your FFT over some segment of some signal is 4.557 + 6.288j, then Goertzel for that frequency and that segment will also give you 4.557 + 6.288j.
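That equality is straightforward to verify. This sketch uses one common complex Goertzel formulation (run the resonator over the block, then one final zero-input step recovers the complex bin); it is not the only way to write it:

```python
import numpy as np

def goertzel_dft_bin(x, k):
    """Goertzel recurrence returning complex DFT bin k: the same number
    np.fft.fft(x)[k] gives, up to roundoff."""
    n = len(x)
    w = 2 * np.pi * k / n
    coeff = 2 * np.cos(w)
    s1 = s2 = 0.0
    for xn in x:
        s1, s2 = xn + coeff * s1 - s2, s1
    s0 = coeff * s1 - s2           # one final step with zero input
    return s0 - np.exp(-1j * w) * s1

x = np.random.default_rng(1).standard_normal(64)
bin27 = goertzel_dft_bin(x, 27)    # matches np.fft.fft(x)[27]
```

The fractional-cycle option mentioned upthread falls out of the same code by letting k be non-integer, at which point the FFT equivalence no longer applies but the recurrence still works.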
There are of course FPGA implementations of Goertzel, but for almost any DSP application with 10 MHz or less of bandwidth, a CPU is plenty. No need to deal with Xilinx’s shitty proprietary software and debugging with sigrok for something you can do on a CPU.
In software there is nothing stopping you from changing your Goertzel coefficient, and thus the frequency detected, on the fly.
Unfortunately the electronics industry has adopted the propaganda term “intellectual property” to mean “designs”, so that's what the parent is referring to: decades worth of validated designs.
Not really. After all if you can do something in 'just software' your market size is suddenly a very large multiple of what it would be if you did an FPGA version, the marginal costs are much lower and the delivery can be done over the internet.
Precisely, you can derive Minsky’s algorithm (modified for frequency detection by adding the input samples into the register you produce the output from) from applying Goertzel’s algorithm to the first backward differences of the input signal.
The derivation is in Dercuano in notes/goertzel-minsky-prefix-sums.html. The latest, and probably the penultimate, version of Dercuano is at http://canonical.org/~kragen/dercuano-20191215.tar.gz.
There are some tricks in there for windowing the computation over different timescales, too.