Beginners Data Science Projects

One of the most valuable things an aspiring data scientist can do is build a portfolio of work. You’ve likely read this advice all over the internet, and for good reason: it is genuinely good advice! But the word ‘portfolio’ itself causes problems for a lot of people.

When people talk about a portfolio for an aspiring data scientist, what they really mean is just some data science mini-projects that you’ve had a go at. For example, once you’ve learnt some Python (other languages are available), try taking a simple data set and running machine learning algorithms on it, creating visualisations from it, or cleaning, transforming, and formatting it.

The equivalent would be a wannabe web developer creating a website for the first time. The first attempt will be pretty terrible and may not even get finished. And that’s fine! For your first data science project, do whatever you can, save your work (code, graphs, data), and move on to another one with your new-found knowledge. Repeat, and eventually you’ll find a mini-project that is especially interesting and a bit meatier. Then you can invest some real time and get a great result to show off.

Below, we’ve listed some simple data science mini-project ideas for beginners.

Iris Categorisation

The iris classification problem is an absolute classic; you’ll be hard-pressed to find a machine learning expert who hasn’t heard of it! It involves using a very simple, perfectly formatted dataset about iris flowers. The aim is to build an algorithm that sorts each flower into one of three types of iris based on its measurements. The only tricky bit for beginners may be loading the data into Python.

Level – very easy

Skills – supervised machine learning

Dataset – https://archive.ics.uci.edu/ml/datasets/Iris

Tutorial – https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
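
If the loading step is what trips you up, here is a minimal sketch using only Python’s standard library. It assumes the data arrives as comma-separated lines of four measurements followed by the species name, and it embeds a handful of rows in that layout as a stand-in for the downloaded file. The function names and the nearest-neighbour rule are our own choices for illustration, not the method used in the tutorial.

```python
import csv
import io
import math

# A few rows in the assumed layout (four measurements in cm, then species).
# In a real project you would open the downloaded file instead of this string.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

def load_iris_rows(text):
    """Parse CSV text into a list of (features, label) pairs."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        *measurements, label = record
        rows.append(([float(m) for m in measurements], label))
    return rows

def predict_nearest(training_rows, features):
    """Return the label of the training row closest to `features`."""
    nearest = min(training_rows, key=lambda row: math.dist(row[0], features))
    return nearest[1]

training_rows = load_iris_rows(SAMPLE)
prediction = predict_nearest(training_rows, [5.0, 3.4, 1.5, 0.2])
print(prediction)
```

Once this works, a sensible next step is to hold back some rows for testing so you can measure how often the predictions are right.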

California Cities

The aim here is to practise your statistics and visualisation skills. The data set contains simple numerical information about many cities in California. Try to find correlations between variables, handle any ‘null’ values, and create a visualisation of populations on a map! Note that there are several options when it comes to creating the map, so if the one in the tutorial below doesn’t work for you, jump onto Google and find another.
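
To give a flavour of the null handling and correlation steps, here is a rough sketch in plain Python. The rows, column names, and values are entirely made up for illustration (they are not taken from the real data set), and dropping incomplete rows is just one of several ways to deal with missing values.

```python
import math

# Hypothetical rows: the column names and values are invented for illustration.
cities = [
    {"name": "A", "population": 10000, "area_sq_km": 12.0},
    {"name": "B", "population": 52000, "area_sq_km": 40.5},
    {"name": "C", "population": None, "area_sq_km": 8.3},  # missing value
    {"name": "D", "population": 230000, "area_sq_km": 150.0},
]

def drop_missing(rows, columns):
    """Keep only rows where every listed column has a value."""
    return [row for row in rows if all(row[c] is not None for c in columns)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

complete = drop_missing(cities, ["population", "area_sq_km"])
r = pearson([c["population"] for c in complete],
            [c["area_sq_km"] for c in complete])
print(f"{len(complete)} complete rows, correlation = {r:.3f}")
```

A value of r near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. Libraries such as pandas can do both steps in a line or two, but writing them out once helps you understand what is going on underneath.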

Handwritten Numbers

Like the iris dataset, the MNIST digit recognition dataset provides another absolute classic data science project. It is essentially a collection of images of handwritten digits which you must train an algorithm to classify as 0, 1, 2, 3, and so on up to 9. This is an excellent place to begin playing with neural networks. Once you’ve managed it with ‘normal’ neural networks, try it again using convolutional neural networks (CNNs). Be warned, you might find getting the data into Python a bit confusing at first.

Level – easy

Skills – neural networks

Dataset – http://yann.lecun.com/exdb/mnist/

Tutorial – https://www.python-course.eu/neural_network_mnist.php

Tutorial (advanced) – https://towardsdatascience.com/image-classification-in-10-minutes-with-mnist-dataset-54c35b77a38d
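
The confusing part of loading is usually the file format. The original MNIST files use a simple binary layout called IDX: two zero bytes, a type code, the number of dimensions, then each dimension’s size as a 4-byte big-endian integer, followed by the raw values. Below is a minimal sketch of a reader for it. The function name is our own, and the example bytes are a tiny hand-built stand-in rather than real MNIST data.

```python
import struct

def parse_idx(data):
    """Decode IDX-format bytes into (dimensions, flat list of values).

    Assumes unsigned-byte data (type code 0x08), which is what MNIST uses.
    """
    zero1, zero2, type_code, n_dims = struct.unpack(">BBBB", data[:4])
    if (zero1, zero2) != (0, 0) or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    header_end = 4 + 4 * n_dims
    dims = struct.unpack(f">{n_dims}I", data[4:header_end])
    values = list(data[header_end:])
    expected = 1
    for d in dims:
        expected *= d
    if len(values) != expected:
        raise ValueError("data length does not match header")
    return dims, values

# Tiny hand-built example: two 2x2 "images" (not real MNIST data).
fake_file = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
dims, pixels = parse_idx(fake_file)
print(dims, pixels)

# Scale 0-255 pixel values to 0-1, a common step before training a network.
scaled = [p / 255 for p in pixels]
```

The downloadable files are gzip-compressed, so in practice you would read the bytes with Python’s gzip module before passing them to a function like this. Many libraries also ship their own MNIST loaders, which is a perfectly reasonable shortcut.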

Movie Recommendations

This project is a little more involved. You will use a large dataset of movie information and ratings in order to predict which movies a particular person will enjoy. This is exactly what Netflix got extremely good at doing and is a large part of the reason they are so successful! There are loads of different ways of achieving this, so feel free to deviate from the tutorial.
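
One of those many approaches is user-based collaborative filtering: find people whose ratings resemble yours, then see what they thought of movies you haven’t watched. Here is a toy sketch of that idea. The users, movie titles, and scores are all invented, and comparing people by cosine similarity is our choice for illustration rather than what any particular tutorial or company does.

```python
import math

# Hypothetical ratings (user -> {movie: score out of 5}); all names are made up.
ratings = {
    "ana":   {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "ben":   {"Movie A": 4, "Movie B": 5, "Movie D": 5},
    "chloe": {"Movie A": 1, "Movie C": 5, "Movie D": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        weight = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[movie]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

predicted = predict_rating("ana", "Movie D")
print(predicted)
```

Here ana’s tastes are much closer to ben’s than to chloe’s, so the prediction for "Movie D" is pulled towards ben’s high score. With a real dataset you would also want to check the predictions against ratings you have deliberately hidden.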

Pro Tips

Some practical tips to make your data science projects easier and more useful.

  • Save your code to a public GitHub repository. This way, you can share your code with whoever you want (e.g. potential employers) and you’ll learn how to use the massively popular GitHub.
  • Start by copying projects, especially for more difficult tasks. Rather than spending hours making little progress, copy what others have already done and then use the same logic to complete similar tasks with different datasets. Reusing code in programming is to be encouraged!
  • Don’t skip the data cleaning and preparation! In the real world, you’ll spend most of your time battling to format the data, cleaning up bad data points, and uploading it to the correct place. These are VERY useful skills to develop, no matter how pointless they may seem at the time.
  • Comments are your friend. In your code, you should include comments to let readers and your future self understand what is going on. To be specific, you should include comments at the beginning of your code to describe the big picture, and then in-line comments throughout your code to explain how the programme is working. It’s easy for us to advise this, but most people end up learning the value of comments the hard way!
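
As a small illustration of the commenting tip, here is what a big-picture description plus in-line comments might look like. The task itself (rescaling numbers) is just a hypothetical example we picked to have something to comment on.

```python
"""Rescale a list of measurements to the range 0-1.

Big picture: many algorithms behave better when every input is on a similar
scale, so this script squeezes the values between 0 (smallest) and 1 (largest).
"""

def rescale(values):
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Every value is identical, so there is no spread to rescale;
        # return zeros rather than dividing by zero.
        return [0.0 for _ in values]
    # Subtract the minimum, then divide by the range.
    return [(v - lowest) / (highest - lowest) for v in values]

print(rescale([2, 4, 6]))  # prints [0.0, 0.5, 1.0]
```

Notice that the comments explain why the code does something, not just what each line says.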