Hacker News
Babbage: A Clojure library for accumulation and graph computation (github.com/readyforzero)
89 points by ithayer on Feb 2, 2013 | hide | past | favorite | 8 comments


I actually wrote something similar in bash which I use frequently when I need to munge a table of numbers on the command line [1]. The whole time I was thinking I should really be doing this in Common Lisp.

[1] http://eschulte.github.com/data-wrapper/


Here's the announcement email (essentially a very boiled-down version of the README):

babbage is a library for easily gathering data and computing summary measures in a declarative way.

The summary measure functionality allows you to compute multiple measures over arbitrary partitions of your input data simultaneously and in a single pass. You just say what you want to compute:

    > (def my-fields {:y (stats :y count)
                      :x (stats :x count)
                      :both (stats #(+ (or (:x %) 0) (or (:y %) 0)) count sum mean)})
and the sets that are of interest:

    > (def my-sets (-> (sets {:has-y #(contains? % :y)})
                       (complement :has-y))) ;; could also take intersections, unions
And then run it with some data:

    > (calculate my-sets my-fields [{:x 1 :y 2} {:x 10} {:x 4 :y 3} {:x 5}])
    {:not-has-y
     {:y {:count 0}, :x {:count 2}, :both {:mean 7.5, :sum 15, :count 2}},
     :has-y
     {:y {:count 2}, :x {:count 2}, :both {:mean 5.0, :sum 10, :count 2}},
     :all
     {:y {:count 2}, :x {:count 4}, :both {:mean 6.25, :sum 25, :count 4}}}
The functions :x, :y, and #(+ (or (:x %) 0) (or (:y %) 0)) defined in the fields map are called once per input element no matter how many sets the element contributes to. The function #(contains? % :y) is also called once per input element, no matter how many unions, intersections, complements, etc. the set :has-y contributes to.

A variety of measure functions, and structured means of combining them, are supplied; it's also easy to define additional measures.
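To illustrate the idea behind defining measures (this is only a sketch of the concept; babbage's actual extension API differs, and the names here are invented): a measure can be modeled as an init value, a step function, and a finisher, so arbitrarily many measures fold over the input in a single pass.

```clojure
;; Sketch of the measure idea, not babbage's extension API: each
;; measure is {:init, :step, :finish}; every measure is stepped once
;; per element, so the input is traversed exactly once.
(def count-m {:init 0 :step (fn [acc _] (inc acc)) :finish identity})
(def sum-m   {:init 0 :step +                      :finish identity})

(defn run-measures [measures xs]
  (let [init  (into {} (for [[k m] measures] [k (:init m)]))
        step1 (fn [accs x]
                (into {} (for [[k acc] accs]
                           [k ((:step (measures k)) acc x)])))]
    (into {} (for [[k acc] (reduce step1 init xs)]
               [k ((:finish (measures k)) acc)]))))

(run-measures {:count count-m :sum sum-m} [1 2 3 4])
;; => {:count 4, :sum 10}
```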

babbage also supplies a method for running computations structured as dependency graphs; this can make gathering the initial data for summarizing simpler to express. To give an example that's probably familiar from another context:

    > (defgraphfn sum [xs]
        (apply + xs))
    > (defgraphfn sum-squared [xs]
        (sum (map #(* % %) xs)))
    > (defgraphfn count-input :count [xs]
        (count xs))
    > (defgraphfn mean [count sum]
        (double (/ sum count)))
    > (defgraphfn mean2 [count sum-squared]
        (double (/ sum-squared count)))
    > (defgraphfn variance [mean mean2]
        (- mean2 (* mean mean)))
    > (run-graph {:xs [1 2 3 4]} sum variance sum-squared count-input mean mean2)
    {:sum 10
     :count 4
     :sum-squared 30
     :mean 2.5
     :variance 1.25
     :mean2 7.5
     :xs [1 2 3 4]}
Options are provided for parallel, sequential, and lazy computation of the elements of the result map, and for resolving the dependency graph in advance of running the computation for a given input, either at runtime or at compile time.
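The dependency-graph idea can be sketched in plain Clojure (this is only an illustrative sketch, not babbage's implementation; `resolve-node` and `run-graph*` are made-up names): each node declares its inputs, and resolution recurses through dependencies, caching each result in the accumulating map so every node's function runs at most once.

```clojure
;; Illustrative sketch, not babbage's implementation: nodes declare
;; their dependencies; resolution recurses and caches each result in
;; the accumulating map, so every node's fn runs at most once.
(def graph
  {:sum   {:deps [:xs]         :f (fn [xs] (apply + xs))}
   :count {:deps [:xs]         :f count}
   :mean  {:deps [:count :sum] :f (fn [c s] (double (/ s c)))}})

(defn resolve-node [graph done k]
  (if (contains? done k)
    done
    (let [{:keys [deps f]} (graph k)
          done (reduce (partial resolve-node graph) done deps)]
      (assoc done k (apply f (map done deps))))))

(defn run-graph* [init graph]
  (reduce (partial resolve-node graph) init (keys graph)))

(run-graph* {:xs [1 2 3 4]} graph)
;; => {:xs [1 2 3 4], :sum 10, :count 4, :mean 2.5}
```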


Cutting to the chase, does this make the summary results available in the midst of the sequence? E.g., if it takes two hours to gather pressure data (or any other time-series data), does this expose the running variance 10 minutes in, an hour in, etc.?


Not currently, but it would certainly be possible to add something like that. Exposing running stats for partial subsequences of the input sequence would mostly be a matter of replacing the "reduce" in the definition of calculate with "reductions" (plus at least one other change of similar complexity). That wouldn't give you answers ten, sixty, etc. minutes into the data gathering, because it isn't tied to how long the actual computation of the elements of the input seq takes---something outside calculate's purview, ATM---but it would start delivering running answers right away.
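To make the reduce/reductions point concrete, here is a toy single-pass accumulator (plain Clojure, not babbage's actual calculate): swapping reduce for reductions yields one running summary per input element instead of only the final one.

```clojure
;; Toy accumulator (not babbage's calculate): reductions emits every
;; intermediate accumulator state, so running summaries come for free.
(defn step [acc x]
  (let [acc (-> acc (update :count inc) (update :sum + x))]
    (assoc acc :mean (double (/ (:sum acc) (:count acc))))))

(reduce step {:count 0 :sum 0} [1 2 3 4])
;; => {:count 4, :sum 10, :mean 2.5}

(rest (reductions step {:count 0 :sum 0} [1 2 3 4]))
;; => ({:count 1, :sum 1, :mean 1.0}
;;     {:count 2, :sum 3, :mean 1.5}
;;     {:count 3, :sum 6, :mean 2.0}
;;     {:count 4, :sum 10, :mean 2.5})
```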


Running answers right away is fairly useful. A bit of a challenge in that problem domain is with multichannel sensors ("cameras" with multiple frequency bands, satellites like MODIS, radiometric spectrometers, etc.), where the sharpest "image" is produced by using an SVD (singular value decomposition) type transform to reduce (say) 256 input channels to (say) 6 major dimensions and using those to recreate an enhanced image. Producing branchless code to generate basic running stats (min, mean, max, variance, trends) on multiple input channels is a bit of a puzzle; generating an efficient rolling SVD enhancement (best image based on the most recent observations) is trickier still.
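For the running-variance case specifically, Welford's online algorithm is the standard single-pass, history-free approach; the sketch below (plain Clojure, with names invented here, not taken from babbage) combines it with reductions. On the input [1 2 3 4] the final population variance is 1.25, matching the run-graph example earlier in the thread.

```clojure
;; Welford's online algorithm (sketch; names invented here): gives
;; numerically stable running mean/variance in one pass, with O(1)
;; state per channel and no branching in the update step.
(defn welford-step [{:keys [n mean m2]} x]
  (let [n'    (inc n)
        delta (- x mean)
        mean' (+ mean (/ delta n'))]
    {:n n' :mean mean' :m2 (+ m2 (* delta (- x mean')))}))

(defn running-variance [xs]
  (for [{:keys [n mean m2]} (rest (reductions welford-step
                                              {:n 0 :mean 0.0 :m2 0.0}
                                              xs))]
    {:n n :mean mean :variance (/ m2 n)}))

(last (running-variance [1 2 3 4]))
;; => {:n 4, :mean 2.5, :variance 1.25}
```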

The application areas are continuous processing of continuously arriving data, infinite unbounded sequences.


This will be great for building our stats dashboard. Thanks!


We use this actively internally @ReadyForZero for a variety of analyses; hopefully it's helpful for you and others.


This is great!

