It ended up being longer than I had planned. The main reason is that I wanted to demonstrate that Object#clone doesn't work as I would have anticipated, and to show a possible solution.
Greg Brown was kind enough to steer me onto the right track, though: this is mainly a design problem. I should be avoiding copying wherever possible.
The Io language has a similar gotcha - when I cloned one of my prototypes that contained a List member, I was surprised to find that all the clones shared the same List! It actually makes sense - it was the reference to the list that was copied, but all the references pointed to the same list.
The problem with always defaulting to deep copy in a language where all object slots are by reference is "where do you stop?" Do you copy all the objects known to that object? If the object holds a data file or a resource, do you deep copy that too? What about object graphs? What about two objects mutually holding references to each other? What if the object holds a reference to the global application object?
So the most common way to do it is default to only a shallow copy. It's up to the user to define a deep copy if they need it because only the user knows what members are semantically "part of" the parent object and what are "pointers" to unowned objects.
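To make the "user defines the deep copy" point concrete in Ruby terms, here is a minimal sketch using the `initialize_copy` hook, which Ruby calls on the new object during `dup` and `clone`. The `Playlist` class and its attributes are hypothetical names made up for this example, and the choice to treat the tracks array as owned by the playlist is an assumption of the sketch.

```ruby
# Hypothetical class for illustration only.
class Playlist
  attr_reader :name, :tracks

  def initialize(name, tracks)
    @name = name
    @tracks = tracks
  end

  # Called on the copy by dup and clone, with the original as argument.
  # We decide the tracks array is "part of" the playlist, so the copy
  # gets its own array. The strings inside it are still shared.
  def initialize_copy(source)
    super
    @tracks = source.tracks.dup
  end
end

original = Playlist.new("road trip", ["intro", "outro"])
copy = original.dup
copy.tracks << "bonus"

original.tracks  # still ["intro", "outro"]
copy.tracks      # ["intro", "outro", "bonus"]
```

The name string is left shared on purpose: only the class author can say which members need their own copy.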
I ran into the exact same problem when trying to solve some (not very complicated in retrospect) errors in a body of text. Shallow copying a string that I was applying permutations to and storing in an array meant that every single string in the array was the same, so I could never have more than one permutation without pulling some hacks like this.
I would be very interested in a cleaner variant on the marshal load hack for non-primitives, or even some interesting doc/writeup on how this works.
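For anyone who hasn't seen it, the marshal hack is usually written roughly like this. This is only a sketch with made-up sample data; it works by serializing the whole object graph to bytes and rebuilding a separate graph from them, so it only works for objects Marshal can dump.

```ruby
# Sample data invented for illustration.
original = { "names" => ["foo", "bar"], "count" => 2 }

# dump walks the entire object graph and serializes it;
# load builds brand new objects from the serialized bytes.
copy = Marshal.load(Marshal.dump(original))
copy["names"].first.upcase!

original["names"].first  # still "foo"
copy["names"].first      # "FOO"

# Limitation: procs, IO objects and objects with singleton methods
# cannot be marshalled and raise TypeError.
begin
  Marshal.dump(proc { 1 })
  proc_dumped = true
rescue TypeError
  proc_dumped = false
end
```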
Are you kidding me? The "solution" is to serialize and deserialize? That's an incredible waste.
edit: I thought it was not uncommon for higher level languages to pass arrays and objects by reference, so this post wasn't particularly new or interesting. Unless you're coming from PHP, which is, IMO, a nightmare because everything is passed by value (by default) except objects.
I actually came to Ruby from PHP, so maybe that's where I got my incorrect assumptions from. It definitely wouldn't be the first bad habit PHP has left me with.
I'm actually looking for feedback on this. I know there are quite a few Ruby hackers hanging round HN.
Is there something I'm missing here? Some idiom that lets you side-step this problem? I'm questioning it because this problem/solution seems very much at odds with the elegance and thoroughness of the rest of the language.
You're missing a few details, especially around "primitive" data types. There's not really any such thing in Ruby - everything is an object, but some objects - numbers, booleans, nil - are immutable.
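A quick way to check this for yourself, as a rough sketch (the variable names are just for illustration):

```ruby
# Integers, symbols, true and nil are ordinary objects that are always frozen.
immutables_frozen = [1, :sym, true, nil].all?(&:frozen?)

# There is only one object for a given small integer, so two
# variables holding 1 refer to the very same object.
a = 1
b = 1
same_object = a.equal?(b)

# Strings, by contrast, are mutable unless explicitly frozen.
# String.new is used so the example does not depend on
# frozen-string-literal settings.
s = String.new("foo")
mutable_before = !s.frozen?
s << "bar"  # mutates s in place; s is now "foobar"
```

Since the immutable objects can never change, sharing a reference to them is harmless, which is why they feel like "values" even though they are objects.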
When you copy an object, all you do is copy its set of instance variables, which are just references to other objects. For an array, the instance variables are its set of indexes, which again are just references. Copying an array just means making a new list of references, but the objects they point to remain unmodified and uncopied.
Consider:
<pre>
array = ["foo"]
copy = array.dup
</pre>
array and copy are independently mutable - modifying the index in one does not affect the indexes in the other - but they still both contain references to the single string "foo". Thus:
<pre>
copy.first.gsub!(/foo/, "bar")
</pre>
modifies the string referenced by copy, which is the same string referenced by array. So array becomes ["bar"].
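If it helps, the same thing can be verified with `equal?`, which compares object identity rather than contents. A small sketch (`String.new` is used only so the string is mutable regardless of frozen-string-literal settings):

```ruby
array = [String.new("foo")]
copy = array.dup

arrays_same = array.equal?(copy)               # false: two distinct Array objects
strings_same = array.first.equal?(copy.first)  # true: one shared String object

copy << "baz"                    # changes only copy's list of references
copy.first.gsub!(/foo/, "bar")   # mutates the shared String in place

array  # ["bar"]
copy   # ["bar", "baz"]
```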
If you want a true deep copy, do something like this:
<pre>
def deep_copy(object)
  case object
  when Array
    object.map { |item| deep_copy(item) }
  when Hash
    object.inject({}) do |hash, (key, value)|
      hash[deep_copy(key)] = deep_copy(value)
      hash
    end
  # handle other data structures if need be
  else
    object.respond_to?(:dup) ? object.dup : object
  end
end
</pre>
No worries, thanks for taking the time to write it out. I'm glad that I wrote the post (and that I'm getting hammered a little for my assumptions) because making mistakes is probably the only way I'm going to get a deeper understanding of the language...
One thing that confused the issue a little for me is the fact that some objects in Ruby are actually only really 'pretend objects'. ie:
Well in fairness, I did mention that the realization made me feel stupid :-p
It's not necessarily obvious if you're coming from other languages that don't behave this way. That being said I'm surprised that I had never run into this problem before. I think that most of the time I had the right idea with not copying objects, but in this case I had memoized a method call and the Hash 'cache' was getting corrupted which was what brought it to my attention... A slightly more unusual situation.
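For illustration, here is roughly how a memoized Hash cache can get corrupted this way, plus one possible guard. The `Lookup` class and its methods are invented for this sketch and are not the code from the blog post.

```ruby
# Hypothetical example class.
class Lookup
  def initialize
    @cache = {}
    @frozen_cache = {}
  end

  # Memoized: the same Array object is handed out on every call.
  def tags_for(key)
    @cache[key] ||= [key.to_s, key.to_s.upcase]
  end

  # Possible guard: freeze the cached value so accidental mutation raises.
  def frozen_tags_for(key)
    @frozen_cache[key] ||= [key.to_s, key.to_s.upcase].freeze
  end
end

lookup = Lookup.new
tags = lookup.tags_for(:ruby)
tags << "oops"           # caller mutates what it thinks is its own array
lookup.tags_for(:ruby)   # the cached value is now ["ruby", "RUBY", "oops"]

begin
  lookup.frozen_tags_for(:ruby) << "oops"
  mutated = true
rescue RuntimeError      # FrozenError is a RuntimeError subclass on Ruby 2.5+
  mutated = false
end
```

Freezing only protects the array itself; its elements would need freezing too if they are mutable and the caller might change them.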
This is not a surprise. I have the same issues with Strings too.
A String is passed around your app and someone changes it (capitalizes it, chomps it, etc.). The String is then changed throughout the app! You have to dup it if you want to ensure no one changes it.
This means if you have classes returning Strings, such as first_name, last_name, address etc, your getter should return a dup() if you want to ensure no accidental change to it. That sucks, if you ask me.
I cannot remember the exact cases, but the point is that you have an API on one hand, and the user of an API on the other.
The API returns strings to you, the user at some point needs to (say) perform multiple operations on that String. Say, multiple gsubs. So rather than create a new string with each, he uses a gsub!.
I actually once had a discussion about this on ruby-forum when I faced this issue. We talked about a copy-on-write string, but I did not want to change my entire application.
It is inefficient for the API to keep returning dup()'ed strings. On the other hand, if the user accidentally changes the string (which she can), your API can throw an error or malfunction.
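One middle ground, sketched here under the assumption that the API owns its strings, is to freeze them once and return the frozen object. Accidental in-place changes then raise instead of silently spreading, and only callers who really need to mutate pay for a dup. The `Person` class is a hypothetical example, not anyone's actual API.

```ruby
# Hypothetical API class.
class Person
  def initialize(first_name)
    # Keep a private frozen copy so neither the caller's original string
    # nor the one we hand out can be changed behind our back.
    @first_name = first_name.dup.freeze
  end

  attr_reader :first_name  # returns the frozen string, no per-call dup
end

person = Person.new(String.new("ada"))

begin
  person.first_name.capitalize!  # in-place mutation of a frozen string
  changed = true
rescue RuntimeError              # FrozenError (Ruby 2.5+) is a RuntimeError
  changed = false
end

# A caller that wants to mutate makes its own copy first.
# dup returns an unfrozen copy; clone would keep the frozen state.
mine = person.first_name.dup
mine.capitalize!                 # mine is "Ada", the API's string is untouched
```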
I actually just had a chat via email with Greg Brown (author of Ruby Best Practices). I updated the blog post to get to the point quicker and I included his take on the matter...
It's not a bug ... the references are copied by value (don't know what the big deal is ... it's the same thing happening in Java), the cloning done on the basic collections is shallow (again, same thing happening in many other languages) and the basic types like Fixnum are immutable.
When learning a new language, after playing around with code-snippets I then usually read the language's reference.