Wyncode Academy Blog Posts
Read about South Florida’s first development bootcamp
Five Things Developers Should Know About Programming for Data Science
Written by Administrator on 12th July 2016, 3:18 PM
By Ajit Jaokar
According to Harvard Business Review, the role of the Data Scientist is one of the hottest jobs for the foreseeable future. However, as a developer, learning to program for Data Science can be daunting: it differs from traditional programming in important ways, such as its emphasis on statistics.
Here are five insights for programmers based on my role in teaching Data Science:
R vs. Python: Does it really matter when programming for Data Science?
There is often a debate about R versus Python as programming languages for Data Science. However, this is a contrived argument. The actual programming for Data Science is (relatively) easy; what matters more is understanding the statistics behind the code. In that sense, I prefer R to Python. I started working with Python before I started working with R, and in my view Python is certainly a lot easier than R. This is partly because there is one (very good) way to create data science programs in Python: the scikit-learn library, a free machine learning library for the Python programming language.
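To illustrate that point, here is a minimal sketch of scikit-learn's uniform fit/predict pattern (assuming scikit-learn is installed; the toy data is made up):

```python
# Minimal scikit-learn sketch: nearly every estimator in the library
# follows the same fit/predict pattern shown here. Toy data is made up.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]  # feature matrix: n samples, one feature each
y = [2, 4, 6, 8]          # targets lying exactly on the line y = 2x

model = LinearRegression()
model.fit(X, y)           # estimate the coefficients from the data
prediction = model.predict([[5]])[0]
print(prediction)         # close to 10, since the data follow y = 2x
```

The same two calls — fit, then predict — work for classifiers, clusterers and other estimators, which is what makes scikit-learn such a consistent entry point.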
In contrast, R has many ways to do the same thing: a vast number of CRAN packages. These packages keep growing and are written by different people, and their functionality often overlaps.
Then why learn R? Here are two reasons:
I believe that, commercially, R will be more valuable because of the uptake from companies like Microsoft (Azure), HPE (Vertica), SAP (Hana), Oracle, Hitachi (Pentaho) etc.
R has a vast set of libraries for specific verticals that are open sourced. (ex. genomics libraries for R)
If I were to start off with one algorithm, which one would it be and why?
Typically, Data Science involves understanding and applying a set of algorithms for tasks such as classification, regression, clustering etc. We also need to understand more specialized algorithms such as support vector machines, random forests, gradient boosting, k-means and others. Many of these algorithms follow similar principles. I always recommend starting with regression and learning regression in detail.
In its simplest form, regression can be understood with even basic high school maths. An example of a linear regression model is the equation
y = β0 + β1x
Ignoring the error component for simplicity, fitting a model means obtaining estimators for the unknown population parameters β0 and β1.
The equation describes a straight line, with β0 the intercept and β1 the slope.
The first step is to obtain a sample of size n from the relevant population, giving measurements (y1, x1), (y2, x2), …, (yn, xn). To fit the model to this sample, we wish to find estimators b0 and b1 that are 'best' in some sense. One method that produces such estimates is the Method of Least Squares, also called Ordinary Least Squares (OLS). In this context, 'best' means the values of b0 and b1 that produce the line closest to as many of the n observations as possible. To find them, we minimize a function of the distances from each observation to the line.
In symbols, we minimize the sum of squared vertical distances: Q(b0, b1) = Σ (yi − b0 − b1xi)², summing over the n observations.
Thus, even with the most basic of high school maths, we can get started with regression.
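As a sketch of how little machinery this takes, the closed-form OLS estimators can be computed in plain Python (no libraries; the sample data below is made up):

```python
# Ordinary Least Squares for y = b0 + b1*x, using only high-school
# algebra via the closed-form estimators:
#   b1 = sum((xi - mean_x) * (yi - mean_y)) / sum((xi - mean_x)**2)
#   b0 = mean_y - b1 * mean_x
def ols_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up sample: points lying near the line y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.0, 9.1, 10.9]
b0, b1 = ols_fit(xs, ys)
print(b0, b1)  # close to 1 and 2, recovering the underlying line
```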
Regression can be used for both prediction and classification (using Logistic Regression) and we can then explore this algorithm in detail.
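As a sketch of the classification side (again assuming scikit-learn is installed), logistic regression fits the same linear form b0 + b1x and passes it through a sigmoid to produce class probabilities; the toy data below is invented:

```python
# Logistic regression reuses the linear form b0 + b1*x, passed through
# a sigmoid to yield class probabilities. Toy data: label is 1 when x > 3.
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)
labels = list(clf.predict([[1.5], [5.5]]))
print(labels)  # [0, 1]: one point each side of the learned boundary
```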
How does Data Science fit in with Big Data?
To address this, let us clarify some terminology. (Definitions adapted from Wikipedia.)
Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
Machine Learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.
Big Data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Architecturally, big data is built on a set of technologies around Hadoop. Thus, by definition, big data is distributed (i.e. involves processing across multiple nodes).
Apache Spark was designed to overcome the limitations of the MapReduce paradigm in the original Hadoop by providing a more real-time architecture.
Based on the above, when machine learning algorithms are applied to big data, we can consider them inherently distributed. For example, Apache Spark has machine learning libraries such as Spark ML, which can be used to apply machine learning to distributed datasets.
What is your view on SQL?
SQL is fast becoming the lingua franca of big data. By extension, SQL also enables Data Science. For example, Spark machine learning libraries can be used with SQL, and accessing Hadoop via SQL is the norm.
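The SQL itself transfers across engines. As a small illustration, here is an aggregation query using Python's built-in sqlite3 as a stand-in for a big-data SQL engine (the table and values are invented):

```python
# The same SQL that runs on Hive or Spark SQL across a cluster can be
# prototyped locally; sqlite3 (Python stdlib) stands in for the
# big-data engine here. Table and values are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (user TEXT, n INTEGER)")
con.executemany("INSERT INTO clicks VALUES (?, ?)",
                [("a", 3), ("b", 5), ("a", 2)])
rows = con.execute(
    "SELECT user, SUM(n) FROM clicks GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('a', 5), ('b', 5)]
```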
Where do I begin learning Data Science as a developer?
Because R, Python and other languages are open source, there is a large amount of information on the Web. This, in itself, can be challenging. As a developer, I believe the best place to start is with simple programs and with understanding the libraries and APIs. This works because the libraries encapsulate the statistical models. You can thus start by doing, build on small successes, and explore the statistics behind the models later.
Wyncode is hosting a full-day data science workshop led by Ajit Jaokar in the Wyncode classroom at The LAB Miami on October 29th for individuals with coding experience. Participants can sign up here.