“You can have data without information, but you cannot have information without data.” — Daniel Keys Moran

The UCI Machine Learning Repository is a database of datasets that you can use to structure a self-study program and build a solid foundation in machine learning.

The repository contains more than 497 datasets, each labeled with attributes such as domain and problem type (classification or regression). You can use these labels as filters to find datasets that fit your needs.

The UCI Machine Learning Repository is one of the oldest sources of data sets on the web. Although the data sets are user-contributed, and thus have varying levels of documentation and cleanliness, the vast majority are clean and ready for machine learning to be applied. UCI is a great first stop when looking for interesting data sets.

You can download data directly from the UCI Machine Learning Repository, without registration. These data sets tend to be fairly small and don’t have a lot of nuance, but they’re great for machine learning practice.
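As a rough sketch of what a direct download can look like, the snippet below fetches a plain-text data file and splits it into rows using only the Python standard library. The function names `fetch_text` and `parse_rows` are hypothetical helpers, and the sample rows are made up for illustration. No real repository URL is assumed here: copy the link of the data file you want from its page on the repository and pass it in yourself.

```python
import csv
import io
import urllib.request


def fetch_text(url, timeout=30):
    """Download a plain-text data file and return its contents as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_rows(text, delimiter=","):
    """Split delimited text into a list of rows, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if row]


# Made-up sample standing in for the text of a downloaded file.
sample = "5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n\n"
rows = parse_rows(sample)
print(len(rows), "rows,", len(rows[0]), "columns")
```

To work with a real file, you would call `parse_rows(fetch_text(url))` with the link you copied, and change `delimiter` if the file is not comma-separated.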

Here are some examples:

  • Wine classification: contains various attributes of 178 different wines.
  • Diabetes Data Set: Diabetes files consist of four fields per record. Each field is separated by a tab and each record is separated by a newline.
  • Iris Data Set: The aim is to classify iris flowers among three species (setosa, versicolor or virginica) from measurements of length and width of sepals and petals.
  • SMS Spam Collection Data Set: The collection consists of a single text file in which each line has the correct class followed by the raw message.
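The two file layouts described above (tab-separated records with four fields, and lines that start with a class label followed by a message) can be read with a few lines of Python. This is only a sketch: the helper names are hypothetical, the sample lines are made up, and the tab between the class and the message in the spam file is an assumption, since the description does not name the separator.

```python
def parse_tab_record(line, expected_fields=4):
    """Split one tab-separated record and check that it has the expected number of fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != expected_fields:
        raise ValueError(f"expected {expected_fields} fields, got {len(fields)}")
    return fields


def parse_labeled_message(line, separator="\t"):
    """Split a line into (class label, raw message).

    Assumption: the label and message are separated by a tab. Only the
    first separator is used, so the message itself may contain more.
    """
    label, message = line.rstrip("\r\n").split(separator, 1)
    return label, message


# Made-up lines for illustration, not taken from the real files.
record = parse_tab_record("field1\tfield2\tfield3\tfield4\n")
label, message = parse_labeled_message("spam\tYou won a prize\tclaim now\n")
print(record)
print(label, "->", message)
```

Raising an error on a record with the wrong number of fields is a design choice that makes malformed lines visible early; you could also skip such lines if you prefer.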

