Data Structures in Clojure: Singly-Linked Lists

aredington · on Jan 17, 2014

One of Clojure's fundamental opinions, asserted by the language eagerly, is that vector quantities should be treated as values the way scalar quantities are in most languages. Specifically, this means that when you operate with a list and append to it, instead of modifying the value of the next (cdr) pointer in the last node, you create a new list and return the new list as the result of the append operation. These immutable collections afford certain classes of algorithm that destructively modified ones do not. Rich reviews this concept pretty well here: http://www.infoq.com/presentations/Value-Values

This story doesn't spend much time discussing this particular opinion of Clojure's or why you would choose to circumvent it by making a mutable linked list. Additionally by virtue of skipping the immutability discussion, it fails to discuss how append to the head of a list is crucially different from append to the tail when working with data structurally sharing underlying data.

Additionally it prefers to define an interface over a protocol (using definterface instead of defprotocol), and it uses camelCase naming instead of Clojure's culturally embraced hyphen-cased naming.

In short this describes a way to build a data structure in Clojure, but is not a particularly good as an example to be followed of how you should build data structures in Clojure. johnwalker cited a couple more idiomatic examples, e.g. data.finger-trees: https://github.com/clojure/data.finger-tree/blob/master/src/...

stiff · on Jan 17, 2014

Those days I guess the phrase "linked list" shouldn't really appear without the phrase "locality of reference" somewhere near. From what I understand, the only practical context in which classic linked lists could be better than dynamic arrays on current hardware, maybe except some really atypical applications (huge objects in the nodes?), is when there is a lot of appending and removal done on the ends of the list, or if you need a lot of insertions somewhere in the middle.

The wikipedia article on linked lists linked in the article explores a lot of angles and is really excellent. There are some newer, cache-friendlier versions of linked lists, I hope they will start being covered more commonly:

http://en.wikipedia.org/wiki/Unrolled_linked_list

http://en.wikipedia.org/wiki/CDR_coding

Given linked lists are the workhorse of every Lisp, I wonder if there has been work on using those kinds of data structures in Lisp implementations, or is for example Clojure still using classic linked lists internally.

johnwalker · on Jan 17, 2014

Some time ago, hypirion wrote about how persistent vectors are implemented in Clojure.

http://hypirion.com/musings/understanding-persistent-vector-...

There are actually several different implementations of the core data structures:

https://github.com/clojure/data.finger-tree

https://github.com/clojure/core.rrb-vector

Of course, the short answer is that Clojure isn't using classic linked lists, and that regular Java collections can be used and created pretty easily. [0] [1]

[0] https://stackoverflow.com/questions/4313505/converting-cloju...

[1] http://blog.raek.se/2011/01/24/executors-in-clojure/

stiff · on Jan 17, 2014

Thanks for the great resources. I still wonder to what extent is it at all possible to do cache-friendly data structures on the JVM. Besides the somewhat abstract data structure considerations of the kind posted, there is still the issue of whether in the end references are stored in the collections memory or the values themselves. On the surface level it looks like Java collections generally use references for everything, except perhaps for arrays and (maybe) ArrayLists of primitive types. See for example:

http://java.dzone.com/articles/slab-guaranteed-heap-alignmen...

pron · on Jan 17, 2014

Java currently only stores primitives inline (in arrays or in objects). However, the memory allocation scheme (at least in HotSpot) places objects that are created one after another by the same thread consecutively in memory.

Nevertheless, cache-friendly data structures are still very much possible on the JVM. Consider something like a B-Tree node, with an array of references to the children and an array of keys. The node object and the arrays will be stored together in memory, so the only cache-missing pointer following you'd do is when you jump from one node to another, just as would happen in C/C++. So the memory placement is certainly not as flexible as in C/C++, but you can certainly be pretty cache friendly. In fact, this is a very active topic, as a lot of people are starting to use Java in low latency applications. For example, look at the discussions here: https://groups.google.com/forum/#!forum/mechanical-sympathy

There is work being done to support value types in Java, that are immutable, cannot be referenced (hence only compared by value), and stored inline in arrays (well, the latter is an implementation issue and up to the particular JVM to decide). This may be included in Java 9.

johnwalker · on Jan 17, 2014

I'll have to defer to this paper. It mentions that the core persistent vectors are cache-friendly, but I would suspect that the llvm will always be an upper-bound for the jvm.

infoscience.epfl.ch/record/169879/files/RMTrees.pdf‎

I'm very interested in better answers to this question, though.

drcode · on Jan 17, 2014

Certainly a nice intro to Clojure types, though it might be a bit confusing to a beginner, given that linked lists are already built in to Clojure and there's nothing I can see in the tutorial saying "You don't actually have to do any of this, this is just an exercise it for pedagogical purposes."

aaronem · on Jan 17, 2014

I was wondering that myself. "Isn't Clojure a Lisp? Doesn't it thus give you linked lists for free? What's the point of reimplementing them at the JVM level?"

dschiptsov · on Jan 18, 2014

Why not just stop at the point that a Pair is just two pointers and a List is a bunch of nested Pairs, without any classes, deftypes or monads?)

Btw, the classic trick from SICP lectures, where a Pair was implemented as a closure (a procedure) that accepts messages (to illustrate code-is-data principle) is much more interesting abstraction than this overengineered OO stuff.)

dschiptsov · on Jan 18, 2014

Now I have a full keyboard instead of that clumsy phone.)

The very idea "to implement a List" in what is supposed to be called a Lisp is.. well, I do respect the amount of work that has been done, but.

Such kind of confusion, it seems, comes from a wrong perspective of what a Lisp is. Let's look first at what it is not.

It is not some "modern OO" language with parenthesized prefix notation, it could be so boring. Second, a List is not some "data structure in Lisp", it is the most important, non-separable fundamental feature of the language. Without "native conses" a Lisp is nonsense.

The explanations is going to grow lengthy, but it worth a try.

A Lisp is a set of layered DSLs and ADTs. Its "miracle" in what it is small and good-enough, in accordance with "Less is more" maxima.

There is no fucking "Class Node" in what a List is. It is just (cons 1 (cons 2 (cons 3 '()))) or 1:2:3:[] if you wish.

The big ideas is to separate clearly the use from the implementation, which is called an Abstraction Barrier.

A complete definition of the basic building block - a cons cell, or the Pair ADT is cons, car, cdr, pair? and nil while a List is just one more "constructor" - list. This is minimal and it is good-enough.

It is so "good" that it "created" another small miracle - cadr/cddr notation, which is just an accidental piece of beauty created by a human mind (like a Lisp itself). Only complete tasteless person could abandon this in order to use dull and mediocre first/rest or any other names for selectors that doesn't "chain" and reflect the notion that almost everything could be made out of conses.

The problem is that we have way too many of those who cannot grasp the beauty of what a cadr or cddr is.

Implementation in a language's runtime is quite another topic. There we probably should have a notion of a pool-based allocation (in order to not call malloc for each cons) and locality (to properly implement persistence) etc. but these implementation specific things must never leak int "high-level" use. Again, there is no fucking Class Node.

Speaking of Lisp apart form Lists is like speaking of water apart from the waves. Water has waves, Lisp has Lists in its own nature.

Giving this nature of Lists we could have a general mapper procedure, or map. The idea of a Functor from Math (or these control abstractions from Haskell) has nothing to do with it. It is just the recursive nature of List's structure allows us to define recursive processes.

Even trees not necessarily must have any Class Node. Here is a classic example:

   (define make-tree cons)
   (define datum car)
   (define children cdr)
   (define (leaf? n)
      (null? (children n)))

   (define (tree-map fn tree)
      (make-tree (fn (datum tree))
	     (map (lambda (ts) (tree-map fn ts))
		  (children tree))))

There is no "Single-Linked List Data Structure in Lisp" because there is no Lisp without conses.

Last but not least. This very site is build in a dialect of Lisp which never tries to redefine what a List is.

More about Lisps - http://karma-engineering.com/lab/wiki/Bootstraping

More about piling up useless abstractions - http://karma-engineering.com/lab/wiki/Monads

djb_hackernews · on Jan 17, 2014

Pretty indepth stuff, it'd be useful if you point out why you even need the INode interface, unless I missed that.

Anderkent · on Jan 17, 2014

`deftype` only lets you implement functions that match an interface, protocol, or override Object. It's meant to narrow down the meaning of a type, and usually any non-interface methods would just be clojure functions that take an object.

It does play weirdly with the fact that mutable fields are private, I agree. But that is a very rare case, and in general you'd usually create a type to implement an interface or protocol.

Quoting Rich Hickey: (https://groups.google.com/forum/#!msg/clojure/pZFl8gj1lMs/qV...)

"AOT deftype vs gen-class touches on the same Clojure semantics vs Java/ host semantics, with the objectives from before - support implementing interfaces but not concrete derivation. So, no concrete base classes, no super calls, self-ctor calls, statics, methods not implementing interface methods etc."

johnwalker · on Jan 17, 2014

That's just the way records work in Clojure. Methods that are implemented within a record have to come after their interface.

blt · on Jan 17, 2014

Get rid of that s-t ligature. It's distracting.

aaronem · on Jan 17, 2014

What ligature would that be? I'm afraid I don't follow.

mej10 · on Jan 17, 2014

On words that have "st" there is a ligature between the s and t. It looks like this: http://typophile.com/files/Letter_st_6439.jpg

There is also one on "ct".

simcop2387 · on Jan 17, 2014

I'm not seeing that either, can you take a screenshot?

eurleif · on Jan 18, 2014

I'm seeing it too: http://imgur.com/puK8OiW

simcop2387 · on Jan 19, 2014

Hmm, i was using firefox. I wonder if it's a webkit only thing?