Hacker News
Babbage: A Clojure library for accumulation and graph computation (github.com/readyforzero)
89 points by ithayer on Feb 2, 2013 | hide | past | favorite | 8 comments


I actually wrote something similar in bash which I use frequently when I need to munge a table of numbers on the command line [1]. The whole time I was thinking I should really be doing this in Common Lisp.

[1] http://eschulte.github.com/data-wrapper/


Here's the announcement email (essentially a very boiled-down version of the README):

babbage is a library for easily gathering data and computing summary measures in a declarative way.

The summary measure functionality allows you to compute multiple measures over arbitrary partitions of your input data simultaneously and in a single pass. You just say what you want to compute:

    > (def my-fields {:y (stats :y count)
                      :x (stats :x count)
                      :both (stats #(+ (or (:x %) 0) (or (:y %) 0)) count sum mean)})
and the sets that are of interest:

    > (def my-sets (-> (sets {:has-y #(contains? % :y)})
                       (complement :has-y))) ;; could also take intersections, unions
And then run it with some data:

    > (calculate my-sets my-fields [{:x 1 :y 2} {:x 10} {:x 4 :y 3} {:x 5}])
    {:not-has-y
     {:y {:count 0}, :x {:count 2}, :both {:mean 7.5, :sum 15, :count 2}},
     :has-y
     {:y {:count 2}, :x {:count 2}, :both {:mean 5.0, :sum 10, :count 2}},
     :all
     {:y {:count 2}, :x {:count 4}, :both {:mean 6.25, :sum 25, :count 4}}}
The functions :x, :y, and #(+ (or (:x %) 0) (or (:y %) 0)) defined in the fields map are called once per input element no matter how many sets the element contributes to. The function #(contains? % :y) is also called once per input element, no matter how many unions, intersections, complements, etc. the set :has-y contributes to.

A variety of measure functions, and structured means of combining them, are supplied; it's also easy to define additional measures.
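To illustrate the idea behind defining measures (this is only a sketch of the concept; babbage's actual extension API differs, and the names here are invented): a measure can be modeled as an init value, a step function, and a finisher, so arbitrarily many measures fold over the input in a single pass.

```clojure
;; Sketch of the measure idea, not babbage's extension API: each
;; measure is {:init, :step, :finish}; every measure is stepped once
;; per element, so the input is traversed exactly once.
(def count-m {:init 0 :step (fn [acc _] (inc acc)) :finish identity})
(def sum-m   {:init 0 :step +                      :finish identity})

(defn run-measures [measures xs]
  (let [init  (into {} (for [[k m] measures] [k (:init m)]))
        step1 (fn [accs x]
                (into {} (for [[k acc] accs]
                           [k ((:step (measures k)) acc x)])))]
    (into {} (for [[k acc] (reduce step1 init xs)]
               [k ((:finish (measures k)) acc)]))))

(run-measures {:count count-m :sum sum-m} [1 2 3 4])
;; => {:count 4, :sum 10}
```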

babbage also supplies a method for running computations structured as dependency graphs; this can make gathering the initial data for summarizing simpler to express. To give an example that's probably familiar from another context:

    > (defgraphfn sum [xs]
        (apply + xs))
    > (defgraphfn sum-squared [xs]
        (sum (map #(* % %) xs)))
    > (defgraphfn count-input :count [xs]
        (count xs))
    > (defgraphfn mean [count sum]
        (double (/ sum count)))
    > (defgraphfn mean2 [count sum-squared]
        (double (/ sum-squared count)))
    > (defgraphfn variance [mean mean2]
        (- mean2 (* mean mean)))
    > (run-graph {:xs [1 2 3 4]} sum variance sum-squared count-input mean mean2)
    {:sum 10
     :count 4
     :sum-squared 30
     :mean 2.5
     :variance 1.25
     :mean2 7.5
     :xs [1 2 3 4]}
Options are provided for parallel, sequential, and lazy computation of the elements of the result map, and for resolving the dependency graph in advance of running the computation for a given input, either at runtime or at compile time.
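The dependency-graph idea can be sketched in plain Clojure (this is only an illustrative sketch, not babbage's implementation; `resolve-node` and `run-graph*` are made-up names): each node declares its inputs, and resolution recurses through dependencies, caching each result in the accumulating map so every node's function runs at most once.

```clojure
;; Illustrative sketch, not babbage's implementation: nodes declare
;; their dependencies; resolution recurses and caches each result in
;; the accumulating map, so every node's fn runs at most once.
(def graph
  {:sum   {:deps [:xs]         :f (fn [xs] (apply + xs))}
   :count {:deps [:xs]         :f count}
   :mean  {:deps [:count :sum] :f (fn [c s] (double (/ s c)))}})

(defn resolve-node [graph done k]
  (if (contains? done k)
    done
    (let [{:keys [deps f]} (graph k)
          done (reduce (partial resolve-node graph) done deps)]
      (assoc done k (apply f (map done deps))))))

(defn run-graph* [init graph]
  (reduce (partial resolve-node graph) init (keys graph)))

(run-graph* {:xs [1 2 3 4]} graph)
;; => {:xs [1 2 3 4], :sum 10, :count 4, :mean 2.5}
```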


Cutting to the chase, does this make the summary results available in the midst of the sequence? E.g., if it takes two hours to gather pressure data (or any other time-series data), does this expose the running variance 10 minutes in, an hour in, etc.?


Not currently, but it would certainly be possible to add something like that. Exposing running stats for partial subsequences of the input sequence would mostly be a matter of replacing the "reduce" in the definition of calculate with "reductions" (plus at least one other change of similar complexity). That wouldn't give you answers ten, sixty, etc. minutes into the data gathering, because it isn't tied to how long the actual computation of the elements of the input seq takes---something outside calculate's purview, ATM---but it would start delivering running answers right away.
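To make the reduce/reductions point concrete, here is a toy single-pass accumulator (plain Clojure, not babbage's actual calculate): swapping reduce for reductions yields one running summary per input element instead of only the final one.

```clojure
;; Toy accumulator (not babbage's calculate): reductions emits every
;; intermediate accumulator state, so running summaries come for free.
(defn step [acc x]
  (let [acc (-> acc (update :count inc) (update :sum + x))]
    (assoc acc :mean (double (/ (:sum acc) (:count acc))))))

(reduce step {:count 0 :sum 0} [1 2 3 4])
;; => {:count 4, :sum 10, :mean 2.5}

(rest (reductions step {:count 0 :sum 0} [1 2 3 4]))
;; => ({:count 1, :sum 1, :mean 1.0}
;;     {:count 2, :sum 3, :mean 1.5}
;;     {:count 3, :sum 6, :mean 2.0}
;;     {:count 4, :sum 10, :mean 2.5})
```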


Running answers right away is fairly useful. A bit of a challenge in that problem domain is with multichannel sensors ("cameras" with multiple frequency bands, satellites like MODIS, radiometric spectrometers, etc.), where the sharpest "image" is produced by using an SVD (singular value decomposition) type transform to reduce (say) 256 input channels to (say) 6 major dimensions and using those to recreate an enhanced image. Producing branchless code to generate basic running stats (min, mean, max, variance, trends) on multiple input channels is a bit of a puzzle; generating an efficient rolling SVD enhancement (best image based on the most recent observations) is trickier still.
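For the running-variance case specifically, Welford's online algorithm is the standard single-pass, history-free approach; the sketch below (plain Clojure, with names invented here, not taken from babbage) combines it with reductions. On the input [1 2 3 4] the final population variance is 1.25, matching the run-graph example earlier in the thread.

```clojure
;; Welford's online algorithm (sketch; names invented here): gives
;; numerically stable running mean/variance in one pass, with O(1)
;; state per channel and no branching in the update step.
(defn welford-step [{:keys [n mean m2]} x]
  (let [n'    (inc n)
        delta (- x mean)
        mean' (+ mean (/ delta n'))]
    {:n n' :mean mean' :m2 (+ m2 (* delta (- x mean')))}))

(defn running-variance [xs]
  (for [{:keys [n mean m2]} (rest (reductions welford-step
                                              {:n 0 :mean 0.0 :m2 0.0}
                                              xs))]
    {:n n :mean mean :variance (/ m2 n)}))

(last (running-variance [1 2 3 4]))
;; => {:n 4, :mean 2.5, :variance 1.25}
```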

The application areas are continuous processing of continuously arriving data, infinite unbounded sequences.


This will be great for building our stats dashboard. Thanks!


We use this actively internally @ReadyForZero for a variety of analyses; hopefully it's helpful for you and others.


This is great!

