By Dario Radečić, Consultant at NEOS



Photo by David Emrich on Unsplash

 

Everyone and their grandmother know what a CSV file is. Parquet, not so much. Is it a type of flooring? Well, yes, but no — it’s a highly efficient data storage format, and you’ll learn all about it today.

CSVs are everywhere — from company reports to machine learning datasets. It’s a data format that’s simple and intuitive to work with — just open a file, and you have direct access to the data. It’s not as good as it sounds, for the reasons you’ll read in a bit.

There are many alternatives to CSVs — such as Excel, SQLite databases, and HDFs — but one in particular stands above the competition. That’s Parquet.

Today’s article answers the following questions:

  • What’s wrong with CSVs?
  • What’s a Parquet data format?
  • CSV or Parquet? Which one should you use?

 

What’s wrong with CSVs?

 
 
I love CSVs, and so do you, probably. Pretty much any piece of software can generate CSV files, even plain text editors. If you’re into data science and machine learning, just head over to Kaggle.com — almost all tabular datasets are in this format.

CSVs are row-orientated, which means they’re slow to query and difficult to store efficiently. That’s not the case with Parquet, which is a column-orientated storage option. The size difference between those two is enormous for identical datasets, as you’ll see shortly.

Adding insult to injury, anyone can open and modify a CSV file. That’s why you should never use this type of format as a database. There are safer alternatives.

But let’s dial into what’s really important to companies — time and money. You’ll likely store data in the cloud, either as a basis for web applications or machine learning models. Cloud service providers will charge you based on the amount of data scanned or the amount of data stored. Neither looks good for CSVs.

Let’s look at the storage pricing for Amazon S3 first. The numbers were taken from here:



Image 1 — Amazon S3 storage pricing for different data formats (image by author)

 

Yikes. Parquet files take much less disk space than CSVs (column Size on Amazon S3) and are faster to scan (column Data Scanned). As a result, the identical dataset is 16 times cheaper to store in Parquet format!

Next, let’s take a look at the speed increase with Apache Parquet:



Image 2 — Amazon S3 storage and query pricing comparison for different data formats (image by author)

 

Once again — yikes! Parquet is 99.7% cheaper if your original CSV file is 1TB in size.

So no, CSVs are nowhere near a good option when your data is big — both time and dollar wise.

 

But what’s a Parquet data format exactly?

 
 
It’s an alternative format for storing data. It’s open source and licensed under Apache.

Both CSV and Parquet formats are used to store data, but they can’t be any more different internally. CSVs are what you call row storage, while Parquet files organize the data in columns.

Here’s what that means. Imagine you have the following data:



Image 3 — Dummy table data (image by author)

 

Here’s how the previous table would be organized in both row and column storage:



Image 4 — Row vs. column storage (image by author)

 

In a nutshell, column storage files are more lightweight, as adequate compression can be done for each column. That’s not the case with row storage, as one row usually contains multiple data types.

Apache Parquet is designed for efficiency. The column storage architecture is the reason why, as it allows you to skip data that isn’t relevant quickly. This way both queries and aggregations are faster, resulting in hardware savings (read: it’s cheaper).

In a nutshell, Parquet is a more efficient data format for bigger files. You will save both time and money by using Parquet over CSVs.

If you’re a Python guy, working with CSV files in Pandas should sound familiar. Let’s see how the library handles Parquet files next.

 

Hands-on CSV and Parquet comparison

 
 
You’ll use the NYSE stock prices dataset for this hands-on part. The CSV file is around 50MB in size — nothing big — but you’ll see how much disk space Parquet will save.

To start, load in the CSV file with Python’s Pandas:

import pandas as pddf = pd.read_csv('data/prices.csv')
df.head()

 

Here’s how it looks like:



Image 5 — Head of NYSE stock prices dataset — CSV (image by author)

 

You’ll have to use the to_parquet() function to save the dataset. It works the same as the to_csv(), for the most part, so using it won’t feel different:

df.to_parquet('data/prices.parquet')

 

The dataset is now saved. Let’s see how to load it before comparing file sizes. Pandas has read_parquet() function for reading Parquet data files:

df_parquet = pd.read_parquet('data/prices.parquet')
df_parquet.head()

 

Here’s how the dataset looks like:



Image 6 — Head of NYSE stock prices dataset — Parquet (image by author)

 

Both datasets are identical. The command df.equals(df_parquet) will print True to the console, so feel free to use it for verification.

So, are there any decreases in file size? Umm, yeah:



Image 7 — CSV vs. Parquet file sizes on NYSE stock prices dataset (image by author)

 

That’s roughly four times less in disk space usage.

Does it matter for 50MB datasets? Probably not, but the savings scale on larger datasets, too. That’s especially important if you’re storing data on the cloud and paying for the overall size.

That’s something to think about.

Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you: https://medium.com/@radecicdario/membership

 

Stay connected

 
 

 
Bio: Dario Radečić is a consultant at NEOS.

Original. Reposted with permission.

Related:



Source link

Leave a Reply

Your email address will not be published.