Software: SAS

In the previous posts I talked primarily about what to do and what not to do when working with data. Now I want to switch gears a bit and discuss the tools of the trade, i.e. the packages and programming environments one could use to this end. I have had a chance to work with the vast majority of existing analytics solutions and have formed an opinion on each. Before I start, however, I would like to share a nice summary of what your statistical language of choice reveals about you. As usual with these kinds of lists, about eighty percent of the information in it is pretty accurate.

I will start with SAS. Many companies depend heavily on SAS for their data analytics, and understandably so: for a long time, SAS was the only statistical package that could deal with large datasets. Unlike most other high-level packages such as R or Stata, SAS does not keep the data in RAM, so disk space is effectively the only limit on the amount of data it can process. This probably explains why most data analysts with over ten years of experience are proficient in SAS: when they started their careers, other tools were not an option.
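
To make the RAM-versus-disk distinction concrete, here is a minimal Python sketch of the general out-of-core idea: rows are read and processed one at a time, so memory use stays flat no matter how large the file is. This only illustrates the principle and is not a description of SAS internals; the function name `streaming_mean`, the column name `amount`, and the sample data are all made up for the example.

```python
import csv
import io


def streaming_mean(rows, column):
    """Average one numeric column while holding only one row in memory at a time."""
    total = 0.0
    count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    # Guard against an empty input instead of dividing by zero.
    return total / count if count else float("nan")


# Hypothetical sample data; for a real file this would be
# open("some_large_file.csv", newline="") instead of StringIO.
sample = io.StringIO("id,amount\n1,10.0\n2,20.0\n3,60.0\n")
mean_amount = streaming_mean(csv.DictReader(sample), "amount")
print(mean_amount)
```

An in-memory package would instead load the whole table first and then compute on it, which is faster per operation but caps the dataset size at whatever fits in RAM.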

Personally, I am not a huge fan of SAS, for three major reasons. First, it is slow relative to other packages that keep data in memory. These days getting a host with 60+ GB of RAM is trivial, and hosts with over 200 GB of RAM are not unheard of. While raw data sizes have grown accordingly, most datasets shrink to tens of gigabytes once they are processed and ready for analysis, and thus a host with 64 GB of RAM works out fine nine times out of ten.

Second, SAS has its own peculiar programming language that is unlike anything else out there. This implies a steep learning curve for anyone new to the environment. This is not to say that the problem is unique to SAS: Stata's syntax is perhaps even less intuitive. However, it is still a downside, because it takes time to get new people up to speed on the team's current work, which is never a good thing.

Finally, SAS is outrageously expensive. A relatively modest license can easily cost in the low six figures for the first year and in the mid five figures every year thereafter. While SAS offers an amazingly rich array of modules designed to take care of the grunt work, such as loading data from databases directly into SAS, these price tags are still hard to justify.

For a while, SAS was able to benefit from the fact that CPU speed grew roughly at the same rate as the size of the average dataset. As long as this was true, one could solve the “big data” problem by allocating a larger host for the analysis. In the early 2000s, however, datasets started growing much faster, and reading from disk became the hardware bottleneck. This gave birth to distributed computing frameworks (read: MapReduce), and the people at SAS Institute were for some reason remarkably oblivious to this structural change. As a result, SAS is now hopelessly behind when it comes to reading data from a distributed storage system such as HDFS. It will be interesting to see what they can make of the situation at hand.
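
For readers who have not met MapReduce, the following toy Python sketch shows the shape of the idea: each partition of the data is mapped to key/value pairs independently, the pairs are grouped by key, and each group is reduced to a single result. It runs in a single process on made-up data and does not use any Hadoop or HDFS API; the function names and the `partitions` list are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition):
    """Emit (key, value) pairs for one partition; on a cluster, each
    partition would be handled by a different machine, close to its data."""
    return [(region, amount) for region, amount in partition]


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical data, already split into two partitions.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
]
mapped = chain.from_iterable(map_phase(p) for p in partitions)
totals = reduce_phase(shuffle(mapped))
print(totals)
```

The point of the pattern is that the map step never needs to see more than one partition, so the disk reads can be spread across many hosts instead of being funneled through a single one.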