Image source: Pixabay (free image)


Are you looking for “GPU-powered data science”?

Imagine yourself to be a data scientist, or a business analyst, or an academic researcher in Physics/Economics/Neuroscience…

You do a lot of data wrangling, cleaning, statistical tests, and visualization on a regular basis. You also tinker with a lot of linear models fitting data and occasionally venture into RandomForest. You are also into clustering large datasets. Sound familiar?

However, given the nature of the datasets you work on (mostly tabular and structured), you don’t venture into deep learning that much. You would rather put all the hardware resources you have into the things that you actually do on a day-to-day basis than spend them on some fancy deep learning model. Again, sound familiar?

You hear about the awesome power and the blazing-fast computational prowess of GPU systems like the ones from NVIDIA for all kinds of industrial and scientific applications.

And, you keep on thinking — “What’s there for me? How can I take advantage of these powerful pieces of semiconductor in my specific workflow?”

You are searching for GPU-powered data science.

One of your best (and fastest) options to evaluate this approach is to use the combination of Saturn Cloud + RAPIDS. Let me explain in detail…


GPUs in the AI/ML folklore have primarily been for deep learning

While the use of GPUs and distributed computing is widely discussed in the academic and business circles for core AI/ML tasks (e.g. running a 1,000-layer deep neural network for image classification or a billion-parameter BERT language model), they have found less coverage when it comes to their utility for regular data science and data engineering tasks.

Nonetheless, data-related tasks are the essential precursor to any ML workload in an AI pipeline, and they often constitute the majority of the time and intellectual effort spent by a data scientist or even an ML engineer. Recently, the famous AI pioneer Andrew Ng talked about moving from a model-centric to a data-centric approach to AI tool development. This means spending much more time with the raw data and preprocessing it before an actual AI workload executes in your pipeline.

So, the important question is: Can we leverage the power of GPU and distributed computing for regular data processing jobs?

Image source: Author created collage from free images (Pixabay)


While the use of GPUs and distributed computing is widely discussed in the academic and business circles for core AI/ML tasks, they have found less coverage in their utility for regular data science and data engineering tasks.


The fantastic RAPIDS ecosystem

The RAPIDS suite of software libraries and APIs gives you — a regular data scientist (and not necessarily a deep learning practitioner) — the option and flexibility to execute end-to-end data science and analytics pipelines entirely on GPUs.

This open-source project was incubated by NVIDIA, which built tools to take advantage of CUDA primitives. It specifically focuses on exposing GPU parallelism and high-bandwidth memory speed through the data-science-friendly Python language.

Common data preparation and wrangling tasks are first-class citizens in the RAPIDS ecosystem. It also offers significant support for multi-node, multi-GPU deployment and distributed processing. Wherever possible, it integrates with other libraries that make out-of-core (i.e. dataset size larger than an individual computer’s RAM) data processing easy and accessible for individual data scientists.

Image source: Author created collage


The three most prominent (and Pythonic) components — that are of particular interest to common data scientists — are,

  • CuPy: A CUDA-powered array library that looks and feels just like NumPy, while using CUDA libraries such as cuBLAS, cuDNN, cuRAND, cuSOLVER, cuSPARSE, cuFFT, and NCCL under the hood to take full advantage of the GPU architecture.
  • CuDF: This is a GPU DataFrame library for loading, aggregating, joining, filtering, and manipulating data with a pandas-like API. Data engineers and data scientists can use it to easily accelerate their task flows using powerful GPUs without ever learning the nuts and bolts of CUDA programming.
  • CuML: This library enables data scientists, analysts, and researchers to run traditional/classical ML algorithms and associated processing tasks while fully leveraging the power of a GPU. Naturally, it is used mostly with tabular datasets. Think about Scikit-learn and what it could do with all those hundreds of CUDA and Tensor cores on your GPU card! In line with that, in most cases, cuML’s Python API matches that of Scikit-learn. Furthermore, it offers multi-GPU and multi-node GPU support by integrating gracefully with Dask, wherever it can, to take advantage of true distributed processing/cluster computing.
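To make the pandas-like claim concrete, here is a minimal sketch of the correspondence. The cuDF lines are commented out because they require a CUDA-capable GPU; the method chain itself is identical on both libraries.

```python
import pandas as pd
# import cudf  # GPU DataFrame: a near drop-in for the pandas calls below

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
# gdf = cudf.DataFrame.from_pandas(df)  # copy the frame into GPU memory

# The same groupby/aggregate chain works on both df and gdf
totals = df.groupby("key")["val"].sum()
print(totals)  # a -> 4, b -> 6
```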

Can we leverage the power of GPU and distributed computing for regular data processing jobs and machine learning with structured data?


Is it different from using Apache Spark?

You may ask how this GPU-powered data processing is different from using Apache Spark. Actually, there are some subtle differences, and only recently, with Spark 3.0, have GPUs become a mainstream resource for Spark workloads.

Accelerating Apache Spark 3.0 with GPUs and RAPIDS | NVIDIA Developer Blog

We do not have the time or space to discuss the differences between this GPU-powered data science approach and the Big Data tasks that are particularly suitable for Apache Spark. But ask yourself these questions, and you will probably understand the subtle difference:

“As a data scientist who models economic transactions and portfolio management, I want to solve a linear system of equations with 100,000 variables. Do I use a pure linear algebra library or Apache Spark?”

“As part of an image compression pipeline, I want to use Singular Value Decomposition on a large matrix with millions of entries. Is Apache Spark a good choice for that?”

Big problem size does not always mean Apache Spark or Hadoop ecosystem. Big Computation is not equivalent to Big Data. As a well-rounded data scientist, you need to know both to tackle all kinds of problems.


RAPIDS specifically focuses on exposing GPU parallelism and high-bandwidth memory speed features through Python APIs.


What are we showing in this article?


Crisp examples of CuPy and CuML only

So, in this article, we will just demonstrate crisp examples of CuPy and CuML:

  • how they compare (in speed) with the corresponding NumPy and Scikit-learn functions/estimators
  • how the data/problem size matters in this speed comparison.


CuDF examples in a later article

Although data engineering examples akin to Pandas data processing are of high interest to many data scientists, we will cover the CuDF examples in a later article.


What is my GPU-based hardware platform?

I am using a Saturn Cloud Tesla T4 GPU instance, as it takes literally 5 minutes to spin up a fully featured compute resource (loaded with DS and AI libraries) on the cloud with their service. As long as I don’t exceed 10 hours of Jupyter Notebook usage per month, it’s free! If you want to read more about their service:


Saturn Cloud Hosted Has Launched: GPU Data Science for Everyone!



Apart from the Tesla T4 GPU, it is a 4-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz machine with 16 GB of RAM and a 10 GB persistent disk. So, this is a quite normal setup from a hardware-config point of view (the hard drive is limited because of the free tier), i.e. any data scientist may have this kind of hardware in their possession. The only distinguishing factors are the presence of the GPU and setting up all the CUDA and Python libraries properly so that the RAPIDS suite works without any hiccups.

Big problem size does not always mean Apache Spark or Hadoop ecosystem. Big Computation is not equivalent to Big Data. As a well-rounded data scientist, you need to know both to tackle all kinds of problems.


Solving a linear system of equations

We create linear systems of equations of varying sizes and use the NumPy (and CuPy) linalg.solve routine to solve them with the following code,
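A minimal sketch of this benchmark (the `timed_solve` helper is an assumption for illustration, not the author’s exact code; the CuPy path is commented out since it requires a CUDA GPU):

```python
import time
import numpy as np
# import cupy as cp  # GPU path: pass xp=cp below (requires a CUDA GPU)

def timed_solve(n, xp=np, seed=0):
    """Build a random n-variable system Ax = b and time xp.linalg.solve."""
    rng = np.random.default_rng(seed)
    a = xp.asarray(rng.standard_normal((n, n)))  # cp.asarray copies a NumPy array to the GPU
    b = xp.asarray(rng.standard_normal(n))
    start = time.perf_counter()
    x = xp.linalg.solve(a, b)
    return x, time.perf_counter() - start

for n in (500, 2000):
    x, elapsed = timed_solve(n)  # timed_solve(n, xp=cp) for the GPU run
    print(f"n={n}: solved in {elapsed:.4f} s")
```

One caveat: CuPy launches GPU kernels asynchronously, so a careful GPU benchmark should call `cp.cuda.Stream.null.synchronize()` before reading the clock.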

And, the code changes by a single letter (in multiple invocations) for the CuPy implementation!

Also note how we can create CuPy arrays from NumPy arrays passed as arguments.

The result is dramatic, though. CuPy starts slow, or at a similar pace to NumPy, but beats it squarely for large problem sizes (number of equations).


Singular value decomposition

Next, we tackle the problem of singular value decomposition using a randomly generated square matrix (drawn from a normal distribution) of varying sizes. We don’t repeat the code block here but just show the result for brevity.
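For reference, the call pattern is a one-liner in both libraries. This is a sketch, not the author’s exact benchmark; the CuPy line assumes a CUDA GPU and is commented out.

```python
import numpy as np
# import cupy as cp  # cp.linalg.svd has the same signature

m = np.random.default_rng(0).standard_normal((300, 300))
u, s, vt = np.linalg.svd(m)  # GPU: cp.linalg.svd(cp.asarray(m))

# Sanity checks: descending singular values and faithful reconstruction
assert np.all(np.diff(s) <= 0)
assert np.allclose(u @ np.diag(s) @ vt, m)
```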

It is significant that the CuPy algorithm does not show markedly superior performance to the NumPy algorithm for this problem class. Perhaps this is something for the CuPy developers to improve upon.


Going back to the basics: Matrix inversion

Lastly, we go back to the basics and consider the fundamental problem of matrix inversion (used in almost all machine learning algorithms). The result again shows a strongly favorable performance gain for the CuPy algorithm over the NumPy package.
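Again, a sketch of the call under test; with CuPy, the only change is the module prefix (and it requires a CUDA GPU, hence the comment).

```python
import numpy as np
# import cupy as cp  # GPU: cp.linalg.inv(cp.asarray(a))

a = np.random.default_rng(1).standard_normal((500, 500))
a_inv = np.linalg.inv(a)

# A @ A^-1 should be numerically close to the identity matrix
max_err = np.max(np.abs(a @ a_inv - np.eye(500)))
print(f"max deviation from identity: {max_err:.2e}")
```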


Tackling a K-means clustering problem

Next, we consider an unsupervised learning problem of clustering using the all-too-familiar k-means algorithm. Here, we are comparing a CuML function with an equivalent estimator from the Scikit-learn package.

Just for reference, here is the API comparison between these two estimators.

Image source: Scikit-learn and CuML websites (open-source projects)
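A sketch of that shared API, using Scikit-learn here; the cuML import is commented out, and on a RAPIDS setup it is a near drop-in replacement (all other lines stay the same). The blob dataset below is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
# from cuml.cluster import KMeans  # near drop-in replacement on a GPU

rng = np.random.default_rng(0)
# Two well-separated 10-feature blobs, 500 points each
x = np.vstack([rng.normal(0.0, 1.0, (500, 10)),
               rng.normal(8.0, 1.0, (500, 10))]).astype(np.float32)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(x)
```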


Here is the result for a dataset with 10 features/dimensions.

And, here is the result of another experiment with a 100-feature dataset.

Clearly, both the sample size (number of rows) and the dimensionality (number of columns) matter for how much the GPU-based acceleration outperforms the CPU.


All-too-familiar linear regression problem

Who can ignore a linear regression problem for speed comparison while dealing with tabular datasets? Following the same cadence as before, we vary the problem size — this time both the number of samples and the number of dimensions simultaneously — and compare the performance of the cuML LinearRegression estimator to that obtained from the Scikit-learn stable.

The X-axis in the following figure represents the problem size — from 1,000 samples/50 features to 20,000 samples/1000 features.
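A hedged sketch of one point in that sweep (scaled down here; the sizes and the synthetic data are illustrative assumptions, not the author’s exact script):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
# from cuml.linear_model import LinearRegression  # same fit/score API on GPU

rng = np.random.default_rng(0)
n_samples, n_features = 5_000, 200  # scale up toward 20,000 x 1,000 to see the GPU win
x = rng.standard_normal((n_samples, n_features)).astype(np.float32)
w = rng.standard_normal(n_features).astype(np.float32)
y = x @ w + 0.01 * rng.standard_normal(n_samples).astype(np.float32)

model = LinearRegression().fit(x, y)
print(f"R^2 on training data: {model.score(x, y):.5f}")
```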

Again, the CuML estimator performs much better as the problem complexity (sample size and dimensionality) grows.



We focused on two of the most fundamental components of the RAPIDS framework, which aims to bring the power of GPU to the everyday tasks of data analysis and machine learning, even when the data scientist does not perform any deep learning task.

Image source: Made by the author with free Pixabay images (Link-1, Link-2, Link-3)


We used a Saturn Cloud Tesla T4 based instance for easy, free, and quick setup and showed a few features of CuPy and CuML libraries and performance comparisons of widely used algorithms.

  • Not all algorithms from the RAPIDS libraries are vastly superior, but most are.
  • In general, the performance gain increases rapidly as the problem complexity (sample size and dimensionality) grows.
  • If you have a GPU, always give RAPIDS a try: compare and test whether you are gaining any performance, and make it a trusted workhorse of your data science pipeline.
  • The code change for switching over is minimal, almost non-existent.

Let the power of GPU jumpstart your analytics and data science workflow.

You can check the author’s GitHub repositories for code, ideas, and resources in machine learning and data science. If you are, like me, passionate about AI/machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

Thanks to Mel.

Original. Reposted with permission.

