By Roman Zykov, Founder/Data Scientist @ TopDataLab
Choosing the right tool for data analysis is very important. On the Kaggle.com forums, where international data science competitions are held, people often ask which tool is best, and R and Python usually top the list. In this article we describe an alternative data analysis stack, based on the Scala programming language and the Spark distributed computing platform.
How did we arrive at it? At Retail Rocket we do a lot of machine learning on very large data sets. We used to develop prototypes with a stack of IPython + Pyhs2 (a Hive driver for Python) + Pandas + Sklearn. At the end of summer 2014 we made the fundamental decision to switch to Spark, as experiments showed we would get a 3-4x performance improvement on the same fleet of servers.
Another advantage is that we can use a single programming language both for modeling and for the code that runs on production servers. This was a huge benefit for us, since before that we were using four languages simultaneously (Hive, Pig, Java, Python), which is a problem for a small team of engineers.
Spark has good API support for Python, Scala and Java. We chose Scala because it is the language Spark itself is written in, which means we can read its source code and fix bugs if needed. Scala also runs on the JVM, the same platform Hadoop runs on.
I must say that the choice was not easy, since no one in the team knew Scala at the time.
It is a well-known fact that to learn to communicate well in a language, you have to immerse yourself in it and use it as much as possible. So we abandoned the Python stack in favor of Scala for modeling and quick data analysis.
The first step was to find a replacement for IPython notebooks. The options were as follows:
- Zeppelin, an IPython-like notebook for Spark;
- Spark Notebook;
- IBM’s Spark IPython Notebook;
- Apache Toree;
- ISpark.
So far the choice has fallen on ISpark because of its simplicity: it is IPython for Scala/Spark. It was relatively easy to bolt HighCharts and R graphics onto it, and we had no problems connecting it to a YARN cluster.
Let’s try to answer the question: does the average order value (AOV) in your online store depend on static customer parameters, such as locality, platform type (mobile/desktop), operating system and browser version? You can answer this with mutual information.
We use entropy a lot in our recommendation algorithms and analyses: the classical Shannon formula, Kullback-Leibler divergence, mutual information. We even submitted a paper on the topic. Murphy’s famous machine learning textbook devotes a separate, albeit small, section to these measures.
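To make these measures concrete, here is a minimal sketch in plain Scala (our own helper functions written for this article, not Spark or library code); the inputs are assumed to be valid probability distributions:

```scala
// Base-2 logarithm, so entropy is measured in bits.
def log2(x: Double): Double = math.log(x) / math.log(2)

// Shannon entropy: H(p) = -sum_i p_i * log2(p_i), skipping zero terms.
def entropy(p: Seq[Double]): Double =
  -p.filter(_ > 0).map(pi => pi * log2(pi)).sum

// Kullback-Leibler divergence: D(p || q) = sum_i p_i * log2(p_i / q_i).
def kl(p: Seq[Double], q: Seq[Double]): Double =
  p.zip(q).filter(_._1 > 0).map { case (pi, qi) => pi * log2(pi / qi) }.sum

// Mutual information from a joint distribution given as a matrix:
// I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
def mutualInfo(joint: Seq[Seq[Double]]): Double = {
  val px = joint.map(_.sum)            // marginal of X (row sums)
  val py = joint.transpose.map(_.sum)  // marginal of Y (column sums)
  (for {
    (row, i) <- joint.zipWithIndex
    (pxy, j) <- row.zipWithIndex
    if pxy > 0
  } yield pxy * log2(pxy / (px(i) * py(j)))).sum
}
```

If a customer parameter carries no information about AOV, the joint distribution factorizes and the mutual information is zero; the higher the value, the stronger the dependence.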
Let’s analyze this on real Retail Rocket data. Beforehand, I copied a sample from our cluster to my computer as a CSV file.
Here we use ISpark with Spark running in local mode, which means that all computation is performed locally and distributed across the processor cores. Everything is described in the comments to the code. The most important thing is that the output is an RDD (Spark’s data structure) containing a collection of case classes of type Row, which is defined in the code. This lets you refer to fields via “.”, for example _.categoryId.
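A sketch of how this looks; the case class fields and file name below are hypothetical illustrations, not Retail Rocket’s actual schema, and the Spark-specific lines are shown commented out because they require a SparkContext and the spark-core dependency:

```scala
// Hypothetical schema for the CSV sample; field names are assumptions.
case class Row(sessionId: String, categoryId: Long, os: String, orderValue: Double)

// Parse one CSV line into a Row case class.
def parse(line: String): Row = {
  val f = line.split(",")
  Row(f(0), f(1).toLong, f(2), f(3).toDouble)
}

// With Spark in local mode (sketch; needs spark-core on the classpath):
// import org.apache.spark.{SparkConf, SparkContext}
// val sc   = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("aov"))
// val rows = sc.textFile("sample.csv").map(parse)  // RDD[Row]
// rows.map(_.categoryId)                           // fields accessible via "."
```

The `local[*]` master URL tells Spark to run on the local machine using all available cores, which is exactly the setup described above.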