The problem is that here the "generic" tools like "apply" are not really generic. Hence you cannot use your intelligence and creativity to deduce a working program from a small toolset that you know fully.
Instead you must use a huge number of special tools that each do only a few things. The code is hard to read, hard to write, slow, and the compiled version is big. It is also error-prone, since you must use a large number of different paradigms, and some take their arguments in slightly different ways. It gets produced by googling for everything (or, if there's a decent built-in help system, using that).
It seems strange that concepts like generic data types, operators and functions are not more widespread. For example, if you can calculate "max(a-b)" even when "a" is a matrix and "b" is a scalar, that's already quite nice. This can increase productivity a lot.
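(R, for what it's worth, handles this case via recycling; a tiny sketch with made-up values:)

    a <- matrix(1:6, nrow = 2)   # a 2x3 matrix
    b <- 2                       # a scalar
    max(a - b)                   # b is recycled across every element of a; result is 4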
Although such design models are so fundamental, they are rarely found in "matlab-like" extensions for other programming languages: instead you must constantly do type conversions and transformations by hand and use awkward middleman functions to access data types. Most of your code is housekeeping, with little actual progress.
See for example the problems with Julia where column and row vectors are totally different types: http://2pif.info/op/julia.html
This is pretty much the only thing that I do not like about Julia. Matlab also uses this concept, and I have had my Matlab code broken several times because of this issue.
Plyr is wonderfully expressive, but it breaks down when you have hundreds of levels of a factor. tapply, on the other hand, keeps trucking up to a couple of thousand levels with very few problems. Hadley says that he may use data.table for plyr2 though, which would make my life a lot better.
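For readers who haven't used both, a rough sketch of equivalent calls on toy data (illustrative only, not the poster's code):

    library(plyr)

    df <- data.frame(group = rep(letters[1:3], each = 100),
                     value = rnorm(300))

    # base R: stays fast even with thousands of factor levels
    tapply(df$value, df$group, mean)

    # plyr: more expressive, but historically slower with very many levels
    ddply(df, "group", summarise, m = mean(value))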
R has a rich heritage of trolling unsuspecting programmers coming from other languages. In ancient versions of R from the early 2000s, the underscore operator was synonymous for '<-', aka assignment. This is why in R some use periods for separators (ice.cream) rather than underscores (ice_cream), though just to make things more lulzy some package authors have started to use Java style camel case (iceCream) too.
This is the very same criticism that was (and still is) made about Perl.
>> I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).
Having many ways to do one thing is good - whatever floats your boat, you know. Just for the given examples, I can see how one may be better in a loop (the one where you use 1), while another may be better when you type it on the command line to check stuff (the one with flavor).
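To make the comparison concrete, here are a few of the equivalent forms on a toy ice.cream data frame (the original article lists more):

    ice.cream <- data.frame(flavor = c("vanilla", "chocolate"),
                            rating = c(4, 5))

    ice.cream[, 1]          # by position -- handy inside a loop over columns
    ice.cream[[1]]          # by position, always returns a plain vector
    ice.cream$flavor        # by name -- convenient for interactive checks
    ice.cream[["flavor"]]   # by name, safer inside functions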
>> The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–
No. You learn by doing. You write some stuff, test it, and if you don't get the result you expect go back to try and figure out what's wrong.
And even before doing that, you set aside some time to LEARN about the language.
But maybe, just maybe, the problem is not with the tool but with the person using it?
>> I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of Bacon Google when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.
At least that's honest. Maybe you don't do well with a language (any language!) because huh, you didn't take time to study it and expect it to magically work??
I know I'm no good in Java, but I also know why - I never took the time to actually learn it. I can do things with Java, but not very complex things, and when I fight my way out of a problem I caused, I won't blame Java but myself and my lack of knowledge of Java.
> No. You learn by doing. You write some stuff, test it, and if you don't get the result you expect go back to try and figure out what's wrong.
No. This is how you learn to do whatever simple things you want in a language. The way you learn how to do things correctly is mainly by reading code. If you have an expert to review your code that's better, but almost no one has that opportunity.
Edit:
I guess I should come right out and say what I'm thinking: not everyone's opinions about a programming language are equal. I suppose this is why people are constantly having discussions about the pluses and minuses of "dynamically typed" languages, while type theorists don't even recognize "dynamic typing" as a form of typing. Expert analysis has consistently shown problems with the R language design and implementation. It's not a good language. Its features are often misused or poorly used and it doesn't have a strong sense of what support it wants to give to its users.
R works really remarkably well and clearly for folks who work in this domain day to day. It's easier to learn than a lot of other languages for people who are coming from subject knowledge and need to program to implement that knowledge.
It may be a terrible language by some objective external computer science criteria. But as a domain specific language, that's not really important. What is important is that the domain specific aspect works well for people with domain knowledge.
It's not the right tool for every job or every person, but it's incredibly expressive for people working with a specific set of problems.
What you're talking about is the environment. R seems to have a very nice environment -- a lot of people have written libraries for it and it is popular among the community.
I was quite clearly criticizing the language, which is bad. That doesn't mean you shouldn't use it if it's the best option for your use case, just that it's bad.
I've been interested in type theory for a while, but haven't found a good avenue for getting into it.
Could people here list any resources on type theory that come to their mind? Books, blogs, people, etc. A book that explains the fundamentals would be great.
While not being exactly about type theory like Pierce books (you can't waste your money on those), I really like the part about types in "The Implementation of Functional Programming Languages" by Peyton Jones (the chapters about type system are written by Peter Hancock). It's seen mostly from a technical point of view. The book notably contains a fully explained implementation of a type checker.
Sorry for the ambiguity. It's certainly not a waste of money. I found "Types and Programming Languages" really good (for whatever that means, I'm far from being an authority on the subject). It seems to be widely used as a textbook and is a reference in the subject. While I didn't read it, I expect the second one to be in the same vein.
The second link is relevant but only speaks to the performance of the language, which is something fixable in the implementation. Personally, I think the R language is pretty nice, and well suited to its domain. It has first-class functions and first-class environments, and an object system inspired by Common Lisp and Dylan. While not homoiconic, it's pretty easy to compute on the language. For statistics, having support for missing values built in at a very low level is important.
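A couple of small examples of the features mentioned (my own illustrations):

    # Missing values are built into every basic type
    mean(c(1, NA, 3))                 # NA -- missingness propagates by default
    mean(c(1, NA, 3), na.rm = TRUE)   # 2

    # Computing on the language: expressions are ordinary data
    e <- quote(x + y)
    eval(e, list(x = 1, y = 2))       # 3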
But maybe you just don't like "dynamically typed" languages?
Please reread the paper. It contains many things about the language semantics, e.g.
> As for sharing, the semantics clearly demonstrates that R prevents sharing by performing copies at assignments. The R implementation uses copy-on-write to reduce the number of copies. With superassignment, environments can be used as shared mutable data structures. The way assignment into vectors preserves the pass-by-value semantics is rather unusual and, from personal experience, it is unclear if programmers understand the feature.
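A small sketch of the behaviours that passage describes (my own illustration, not code from the paper):

    # Vectors behave as pass-by-value: modifying y does not touch x
    x <- c(1, 2, 3)
    y <- x
    y[1] <- 99
    x                     # still 1 2 3 (the implementation copies on write)

    # Environments are the exception: they are shared, mutable references
    e <- new.env()
    f <- function(env) env$counter <- 1
    f(e)
    e$counter             # 1 -- the caller sees the change

    # Superassignment writes into an enclosing scope instead of copying
    make_counter <- function() {
      n <- 0
      function() { n <<- n + 1; n }
    }
    counter <- make_counter()
    counter(); counter()  # 1, then 2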
And no, I did my master's thesis in Racket. I'm fine with unityped languages, I just describe them properly.
That's a consequence of having first class environments - but given that it's rarely used, shouldn't a sufficiently intelligent compiler be able to optimise it away in the common case? I'm sure I read a paper along those lines, but I can't find it again.
But seriously, every under-used feature harbors a ton of bugs in the compiler and drains implementation resources. Promises, for example, seem like a "feature" in R that only really serves to destroy performance and that no one uses.
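For context, promises are what implement R's lazy argument evaluation; a minimal sketch of the behaviour:

    # An argument arrives as a promise: it is only evaluated when (and if) it is used
    f <- function(x) if (FALSE) x else "never forced"
    f(stop("this error never fires"))   # returns "never forced"

    # Default arguments can refer to other arguments, thanks to laziness
    g <- function(n, m = n * 2) m
    g(3)                                # 6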
Agreed, but it's hard to know in advance what features will be useful, and once a language is established, it's hard to remove less-used features without breaking existing code.
They do; the parent was just wrong, ignoring the fact that people use "language" colloquially to include the operational semantics, and pretending that only denotational semantics exist.
Dynamically typed languages have a typed runtime, but no types in the static program text. Hence the commonly used and correct qualifiers: "dynamic" and "static".
A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute.
The word “static” is sometimes added explicitly--we speak of a “statically typed programming language,” for example--to distinguish the sorts of compile-time analyses we are considering here from the dynamic or latent typing found in languages such as Scheme (Sussman and Steele, 1975; Kelsey, Clinger, and Rees, 1998; Dybvig, 1996), where run-time type tags are used to distinguish different kinds of structures in the heap. Terms like “dynamically typed” are arguably misnomers and should probably be replaced by “dynamically checked,” but the usage is standard.
Perhaps I should have been more clear that we use the words "dynamically typed" because they have entered the lexicon, but they are misnomers -- they do not properly capture what type theorists mean when they say "type".
> This is the very same criticism that was (and still is) made about Perl.
It's funny how similar the arguments are.
They can both be described as languages with some serious gotchas, but that are simple enough, and have such great package collections, that it's easy to get a lot of things done very quickly.
Perl is a language that pursued extreme flexibility and programming-style agnosticism as a design goal...that made it less than ideal for large scale software engineering...but enabled a lot of experimentation with powerful, sometimes sweet and expressive and sometimes obfuscated idioms.
R is a bit of a train wreck. It makes hard things easy (if you know the right library or idiom), but often makes easy things hard.
Perl is the duct tape of languages, R is more like the A-10 Warthog, ugly but powerful for a specific job.
R has great capabilities, but it is not great at encouraging readability. For learners, it tends to feel confusing and lawless, and it requires an investment. That is a significant fact about R, and it is not equally true of all languages. It isn't ideal; we can do better. But for years there was nothing comparable in open source (perhaps even outside open source). The same is true of Perl.
R is a great set of statistics wrapped in a crime of a programming language. You have to fight the latter to get to the former. Is the fight worth it? I and many others say yes... but it sucks that we even have to make that decision.
The metaphor I use to explain R is that most programming languages are like different varieties of Swiss Army knife: general tools that can do a lot of different stuff with relative ease. R, on the other hand, is like a fillet knife: it's totally obtuse for most things, but in spite of its oddities it outshines all other choices at one specific task (statistical data analysis / filleting a fish, respectively). Depending on what sort of camping trip you're going on you might want to bring one or the other or both.
Language issues aside, R is being propelled by an awesome group of developers (such as Hadley Wickham, Dirk Eddelbuettel, John Myles White, Brian Ripley, etc.). These people are why R has become as successful as it is - both through continued work on various aspects and packages and through direct interaction with users new and old. Frankly, there should be some kind of slightly awkward parade for them all.
I think R's biggest strength is the package ecosystem. I would not trade the large community of active package developers for slightly friendlier syntax.
Indeed, the ecosystem of actively developed numerical libs is the only upside of R over other tools. It's a pretty large cliff for any other toolchain to climb over.
-- someone who's writing numeric / data analysis tools in Haskell.
Off-topic but: Can you comment on the significance of this cliff for a team considering moving from R to Haskell for data analysis? Is the availability of statistics packages really sparse in Haskell?
The answer is: it depends! Shoot me an email at Wellposed and I can try to better answer your question.
I am quite literally building a full data analysis stack (as a product) in Haskell, some parts of which will be available as a sort of proprietary augmented version of the Haskell Platform, and some parts are / will be open source.
I do think that there are compelling reasons to consider Haskell / GHC for analytical workloads, but whether they apply really depends on the details.
The principal cliff is just the HUGE number of (mostly poorly designed) libraries for many standard analyses written in R. There are some nice engineering approaches to circumvent this, and there are some really exciting libs that are uniquely awesome and handy in Haskell land.
A notable example is AD, a really easy-to-use automatic differentiation lib by Edward Kmett, which has a really exciting refactor that's nearly done and will make it usable by mortal Haskellers :)
http://hackage.haskell.org/package/ad
and https://github.com/ekmett/ad
(I have some neat bits I'll hopefully be adding to AD myself in the next month.)
The package management is weird. It makes sense, but it is weird: they go into more detail with dependencies like "enhances" or "enhanced by" and other relationships. It is a more complex graph and lends itself more to discovery, which is cool, but it threw me for a loop for a second. You can still (after a bit of perusing the model) figure out what the concrete dependencies are, but some packages do some circular weirdness. There is also still somewhat of a binaries problem for the various flavors of Mac, Windows, and Linux. You can get source for most things, but when system dependencies bubble up you're on your own, like on other platforms, although Java & Maven seem to have a somewhat better solution for distributing some binaries.
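For reference, those relationships live in a package's DESCRIPTION file; a hypothetical example (package name and field values made up):

    Package: examplepkg
    Title: Hypothetical Package Showing the Dependency Fields
    Version: 0.1.0
    Depends: R (>= 3.0.0)
    Imports: MASS
    Suggests: testthat
    Enhances: Matrix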
When I think about R I find it difficult to stop making analogies between it and JavaScript. Both are dynamic "script" languages with C-based syntax and a Lisp-like nature lurking underneath. Both have functions as first-class citizens. Both are inconsistent, in different ways. Just to name a few that first come to mind... I think "discovering" these analogies reduced the frustration I had once felt about R and helped me to adopt and learn it.
That makes sense - they were both inspired by Scheme (R is in many ways just a Lisp that uses M-expressions), but evolved to deal with immediate concerns that, in retrospect, have created difficulties as the years have gone by. It makes R and JavaScript great and frustrating at the same time.
At this point, mostly library support. Python is quickly getting closer to being a viable R replacement, but it's simply not there yet IMO. The two biggest holes in python for me are:
* The lack of built-in dataframes and libraries to work on them (like plyr). Pandas seems to be getting pretty good, but it's still not as mature as R's solutions.
* Visualization. ggplot for R is great, matplotlib for python not so good IMO. I've heard Bokeh and rplot are attempting to bring ggplot functionality to python. Again, not nearly as mature as R's solution.
I'd love to move to Python because R is not a fun language to develop software in. But at this point, R is by far the better tool for working with data (for my needs at least).
A large pain point for me is remembering the huge number of useful commands in R - I use at least 5-10x as many keywords in R as I did in C++. As I mention in another comment I basically need a reference (usually google and ?) at all times to keep track of these commands and their parameters. In many ways it's great that all these functions exist because I can analyze datasets an order of magnitude more quickly (in development time) than I could in a more traditional language, but developing in R is far more stressful to me. When I'm writing production-level code I constantly have to worry about readability and whether someone else who is reading my code will intuitively understand what I'm doing. With so many commands and so many ways to do things this can be challenging. I find my R code is often lower 'quality' than what I write in something like Python or C.
There are many little idiosyncrasies in R's syntax; I feel like I never grokked the language. For example, pretty much any time I see '~' I have to relearn what is going on. From a mathematical perspective I appreciate that vectors are indexed from 1 instead of 0, but from a programming perspective it can be annoying.
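(For anyone else who keeps relearning it: '~' builds a formula object, read roughly as "the left side modelled in terms of the right side"; my.data below is a made-up data frame.)

    fit <- lm(y ~ x1 + x2, data = my.data)         # "y as a function of x1 and x2"
    aggregate(y ~ x1, data = my.data, FUN = mean)  # formulas also drive aggregation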
BTW, thanks for contributing so many great things to R, I owe a lot of what I do to you.
I think a big problem is not just the language, but how it's usually taught - in other words, I think you shouldn't need such a large vocabulary if you learn the right primitives (which don't always exist in base R). The lack of a solid common foundation is also what makes code readability a challenge.
Part of the problem is the base packages: as soon as you open R you have ~1600 functions you can use, and you obviously can't memorise a significant proportion of them. Learning R is as much about what you don't learn as what you do.
Great point, I'll have to reevaluate how I use R. My training in R was very informal and mostly involved reading other people's code, so my "vocabulary" is probably a union of other people's and is probably too large.
Actually one thing that has helped in the past year was reading your split/apply/combine paper and using the plyr package more.
I'm working on a curriculum for basic programming in R, hopefully that will eventually help outline what everyone should know about R, regardless of what they use it for.
I agree. I work in Python mostly but started doing all my exploratory work + simple regressions in R just because of ggplot. I think that eventually matplotlib + IPython + NumPy + SciPy + pandas will be a great competitor, but they really need to improve matplotlib; it's nowhere near ggplot yet.
Use both and you'll see. Python can do modeling, can do exploratory work and is great in production. R is brilliant at exploratory work and modeling. Having a core in functional programming helps too.
As the article discusses, R has a steep learning curve. Mastering R requires knowing dozens of commands and tricks, but once you learn those the language becomes very expressive and quick to work with.
I agree it's frustrating. I've been working with R on and off for more than five years and almost exclusively for 1-2 years, and I still need to google every ten minutes to remember some esoteric command that is perfect for the problem at hand. In my experience this is still quicker and less error-prone than Python, which often requires you to effectively roll your own solutions for small data processing needs.
I'm about where you are. I struggled with doing things in R for a year or so while I was initially using it to do some cool graphics that I couldn't do elsewhere, mostly following examples. After spending enough time using it, things began to "click" for me, and now I prefer to do most of my work in R.
I still struggle sometimes to write nice looking code in R, but I've known people who are pretty good at managing it. If you have to inherit an R program from someone else though, it's almost guaranteed to be a nightmare to comprehend.
That's just part of learning programming languages. Learn a language well, then decide. I thought Python was better than R, but a homoiconic functional language just makes sense for statistics. You get really neat functions like with, within and local that let you keep everything tidy. S3/S4 classes are hugely inspired by CLOS. It's also a great language to extend with C.
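A quick illustration of those three, on a throwaway data frame:

    df <- data.frame(x = 1:5, y = 6:10)

    with(df, mean(x + y))           # evaluate an expression using df's columns
    df2 <- within(df, z <- x + y)   # like with(), but returns a modified copy of df
    r <- local({ a <- 10; a * 2 })  # run a block in its own environment; a never leaks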
I've just been learning R for my grad studies (and general data exploration), and was really torn between R and Python, especially with IPython Notebook, pandas etc. I came from a Ruby background so both would be newish to me. What made me start with R (although I'm still following IPython dev, and will probably end up switching eventually) is the massive amount of material available to learn stats etc with R.
I've seriously got something like 100 textbooks on R on my system: intro to stats with R, machine learning with R, questionnaire analysis with R, Bayesian stats with R... There are Coursera classes, the amazing r-bloggers.com, etc. For someone who is simultaneously learning statistics and a tool, this is invaluable.
For Python, there's still very little. There's Wes' book, which is mostly about pandas and a lot of finance/time series stuff... I haven't seen a single "intro to stats in social sciences with Python"... There must be one book out there that is open source, which could just be rewritten with Python examples? (The most useful thing I've seen is people reworking examples from Machine Learning for Hackers, or one of the Coursera R courses, in Python).
I don't know how Python handles it, but R uses vectorized operations for almost every function, meaning that for large data sets (10,000+ elements) the per-element work happens in compiled code rather than in an interpreted loop; a hand-written loop in Python (or in R, for that matter) pays interpreter overhead on every iteration.
I have a function in R that, given a list of keywords (~100 keywords), checks a data frame that stores lists of keywords (~5000 lists, ~10 keywords per list) to see which keywords match the keywords in each row of the data frame.
Doing that with straight for loops would be O(m*n*o) interpreted iterations and could take a while. But in R all those operations are vectorized, and it takes less than a second.
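My guess at the shape of that computation (hypothetical data and names, not the poster's actual code):

    keywords <- c("apple", "banana", "cherry")    # stand-in for the ~100-word query list
    df <- data.frame(id = 1:3)
    df$kw <- list(c("apple", "pear"),             # stand-in for the ~10 keywords per row
                  "grape",
                  c("cherry", "banana"))

    # one vectorised membership test per row, instead of nested loops over every word pair
    hits <- vapply(df$kw, function(k) sum(k %in% keywords), numeric(1))
    hits   # number of matches per row: 1 0 2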
I imagine your mileage may vary depending on the functions and operations that you are performing. I did a simple test of Python vs R and Python was overwhelmingly faster. I believe both languages are primarily written in C, and as such, the speed of your program will depend on the 'efficiency' of the underlying code for the functions you are using.
R is horrifically slow on loops. If you can vectorize the operation, then it will be reasonably fast. If you have to use an explicit loop, then Python will be faster. My previous company had a saying that "R is fast if you write it in C". If you have many nested loops, you'll probably need to write a C extension.
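A toy illustration of the difference (not a benchmark):

    x <- runif(1e6)

    # explicit loop: every iteration goes through the interpreter
    total <- 0
    for (v in x) total <- total + v

    # vectorised equivalent: a single call into compiled code
    total <- sum(x)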
R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply()
Exactly that. I've always found R a horribly confusing mess, compared to general programming languages (I'm proficient in Ruby, JS, Obj-C).
On the other hand, compared to some of the other commercial stats packages, it's beautiful and logical and reasonable. I regularly use Stata, where you're only allowed one data table in memory at once, and almost everything relies on side effects and Byzantine macros. E.g. want to calculate a mean? First, 'summarise' the variable, then assign 'r(mean)' to a var name, then quote that the right way to be substituted into an expression where it's needed.