On the project I work on, we have the usual periodic crawls and also use ScrapyRT to let the frontend trigger real-time scrapes of specific items, all with the same spider code.
Edit: Worth noting that we trigger the real-time scrapes via AMQP.
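For a sense of how that wiring can look, here's a minimal sketch of an AMQP consumer that forwards scrape requests to ScrapyRT's HTTP endpoint. The queue name, spider name, and message format are made up for illustration; the /crawl.json endpoint with spider_name/url parameters is ScrapyRT's documented API.

```python
# Sketch: consume scrape requests from AMQP and forward them to ScrapyRT.
# Queue name, spider name, and message format are hypothetical.
import json

import pika      # AMQP client
import requests

SCRAPYRT_URL = "http://localhost:9080/crawl.json"  # ScrapyRT's default HTTP endpoint

def on_request(channel, method, properties, body):
    msg = json.loads(body)  # e.g. {"url": "https://example.com/item/42"}
    resp = requests.get(
        SCRAPYRT_URL,
        params={"spider_name": "items", "url": msg["url"]},  # spider name is made up
        timeout=30,
    )
    print(resp.json().get("items", []))

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="realtime_scrapes")
channel.basic_consume(queue="realtime_scrapes", on_message_callback=on_request, auto_ack=True)
channel.start_consuming()
```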
These are rough guidelines based on the current best practices I know of, and obviously shouldn't be treated as doctrine. Numerical analysis/linear algebra is actually a pretty fast-evolving field as far as applied math goes, though statistics is a bit less dynamic at the moment, I'd say.
Honestly, after a certain amount of time, I'd expect a new hire would be able to teach me what is state-of-the-art in the field based on new literature.
But the questions I'd have in mind would be couched like this:
Linear system (a code sketch of these choices follows the list):
- Is it small, square, and numerically well-conditioned? Use LU - it's pretty fast to write and pretty fast to use in practice.
- Is it small, but rectangular (i.e. overdetermined), or not as well conditioned? QR is a good choice.
- Is it small, but terribly conditioned? Do you want rank-revealing behavior or a low-rank approximation while you're at it? Will you be using this matrix to solve many problems (multiple right-hand sides)? SVD fits the bill.
- Is it large, or sparse, or implicitly defined (i.e. you don't actually have access to the elements of the matrix defining the system - you just have a surrogate function that gives you vectors in its range, or something)? Use an iterative algorithm. Krylov subspace methods (MINRES, GMRES, conjugate gradient, etc.) are your friends here.
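To make that concrete, here's a minimal sketch of how those choices map onto library calls, assuming Python with NumPy/SciPy. The matrices are random stand-ins and the tolerance is arbitrary.

```python
# Sketch of how the decision points above map to NumPy/SciPy calls.
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)

# Small, square, well-conditioned: LU (factor once, reuse for many right-hand sides).
lu, piv = lu_factor(A)
x_lu = lu_solve((lu, piv), b)

# Small but rectangular (overdetermined) or not as well conditioned: QR least squares.
A_rect = rng.standard_normal((80, 50))
b_rect = rng.standard_normal(80)
Q, R = np.linalg.qr(A_rect)                # reduced QR: Q is 80x50, R is 50x50
x_qr = solve_triangular(R, Q.T @ b_rect)

# Terribly conditioned, rank-revealing, or low-rank approximation: SVD.
U, s, Vt = np.linalg.svd(A_rect, full_matrices=False)
keep = s > s.max() * 1e-10                 # drop directions with tiny singular values
x_svd = Vt.T[:, keep] @ ((U.T @ b_rect)[keep] / s[keep])

# Large, sparse, or implicitly defined: Krylov methods only need a matrix-vector product.
spd = A @ A.T + 50 * np.eye(50)            # stand-in symmetric positive definite matrix
op = LinearOperator((50, 50), matvec=lambda v: spd @ v, dtype=float)
x_cg, info = cg(op, b)                     # conjugate gradient; info == 0 means converged
```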
Pattern matching (more specific in question formulation; a sketch of the periodic case follows the list):
- If you wanted to determine the "strength" of a waveform (in a finite uniform sampling of data) that recurs in a fairly regular way (like the arterial pulse in an array of data taken from an oximeter), what type of transformation would you use, and how would you use the resulting information in the transform domain?
- What if you wanted to determine the strength of a waveform that is short-lived/impulsive in nature, but recurs without any known periodicity (e.g. eye-blink artifacts in a sample of EEG data)?
- How would your answers to the above questions change if there were n separate channels of data collected simultaneously (i.e. sampled in different locations), which may be analyzed together?
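For the periodic case, one flavor of answer a candidate might sketch is a power-spectral-density estimate; here's a minimal example using scipy.signal.welch, with a made-up sampling rate and a synthetic ~1.2 Hz "pulse". The impulsive, aperiodic case would point toward wavelets or matched filtering, and the n-channel case toward something like PCA/ICA across channels; neither is shown here.

```python
# Sketch of the periodic case: gauge the "strength" of a recurring waveform
# via its power spectral density. Sampling rate, frequencies, and noise level
# are made up for illustration.
import numpy as np
from scipy.signal import welch

fs = 250.0                                   # assumed sampling rate in Hz
t = np.arange(0, 30, 1 / fs)
rng = np.random.default_rng(0)
x = 0.8 * np.sin(2 * np.pi * 1.2 * t) + 0.5 * rng.standard_normal(t.size)  # "pulse" + noise

f, psd = welch(x, fs=fs, nperseg=1024)       # Welch-averaged periodogram
band = (f > 0.8) & (f < 2.0)                 # band where the pulse is expected
strength = psd[band].sum() * (f[1] - f[0])   # integrated band power as a "strength" measure
peak_hz = f[band][np.argmax(psd[band])]
print(f"dominant frequency ~{peak_hz:.2f} Hz, band power {strength:.3f}")
```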
Statistical analysis (might seem vague; I'd be more interested in a good discussion with a candidate here than in actual whiteboard writing):
- What does statistical significance mean in the context of decision making? Is it a property of the test you perform, or a property of your data? (Sort of a trick question; this basically rehashes the Fisher vs. Neyman & Pearson debates of the 20th-century stats community.)
- Some canned problems on when to use z, t, and F tests. Basically, you use them when your situation matches the appropriate inference model (comparing the means of two normally distributed samples with identical variances? t-test.)
- How do you construct an optimal test from scratch, if one doesn't already exist for your particular situation? (Basically, if minimizing type II error at a fixed type I error rate makes sense for your problem, can you use the Neyman-Pearson lemma to construct a likelihood ratio test correctly? A worked sketch follows the list.)
- What does a p-value actually mean? What if you instead wanted actual probabilities for your hypotheses, or you had a priori information that you wanted to use? (Bayesian inference is the winner here.)
- Probably something from point estimation: least squares, the minimax criterion, Bayesian MAP estimates, general model fitting, that sort of thing. This brings it all back around to numerics (where I'm most comfortable). For example, when you get down to implementation, applying ridge regression is just knowing how to code up Tikhonov regularization (see the second sketch below).
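As one worked instance of that Neyman-Pearson construction, here's a sketch assuming the simplest possible setting: two simple hypotheses about a Gaussian mean with known variance, where the likelihood ratio test reduces to thresholding the sample mean. All the numbers are made up.

```python
# Sketch: Neyman-Pearson test for H0: mu = 0 vs H1: mu = 1 with i.i.d. Gaussian
# data and known sigma. The likelihood ratio is monotone in the sample mean, so
# fixing the type I error rate alpha gives a threshold on x_bar directly.
import numpy as np
from scipy.stats import norm

sigma, n, alpha = 1.0, 25, 0.05
mu0, mu1 = 0.0, 1.0

# Reject H0 when x_bar > c, with c chosen so P(x_bar > c | H0) = alpha.
c = mu0 + norm.ppf(1 - alpha) * sigma / np.sqrt(n)
power = 1 - norm.cdf((c - mu1) / (sigma / np.sqrt(n)))  # P(reject H0 | H1)

rng = np.random.default_rng(0)
x = rng.normal(mu1, sigma, size=n)  # made-up data drawn under H1
print(f"threshold c = {c:.3f}, power = {power:.3f}, reject H0: {x.mean() > c}")
```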
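And for that last point, a minimal sketch, assuming Python/NumPy and made-up data, of ridge regression written out as Tikhonov-regularized least squares, both via the regularized normal equations and via the equivalent augmented least-squares system:

```python
# Sketch: ridge regression as Tikhonov regularization. Data and lambda are
# made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta_true = np.array([1.0, 0.0, -2.0, 0.5, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)

lam = 0.1  # regularization strength (Tikhonov parameter)

# minimize ||X b - y||^2 + lam * ||b||^2  =>  (X^T X + lam I) b = X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Equivalent, numerically nicer route: stack [X; sqrt(lam) I] and solve the
# augmented least-squares problem with an orthogonal factorization.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(X.shape[1])])
y_aug = np.concatenate([y, np.zeros(X.shape[1])])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_aug))  # same estimate, different numerics
```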