7 Fun & Friendly Datasets To Kick-Start Your First Project in Data Science
Work Hard, Have Fun & Grow (quicker than you think)
Much can be said about data science and data analysis. However, I doubt that the words “fun” and “friendly” are the first associations for anyone entering this field. I would say data science is more like a marathon: sweat, tears, effort, and suffering are our everyday life.
Mathematics, statistics, programming skills… with so much to master, data science seems difficult, to say the least. Nevertheless, it is also one of the most sought-after professions of recent years. It can’t be that bad, can it?
One thing is certain: it is an extremely demanding field. Today, I am here to pat you on the shoulder and say: don’t worry, we’re all in the same boat. While it will take a lifetime to improve in this area, don’t quit halfway through the race. There is something we can do to make the learning process a little more enjoyable. In fact, it’s about having fun (don’t forget it!). Rule number 1: choose a captivating dataset.
“In God we trust. All others must bring data” — W. Edwards Deming
Using an engaging dataset will help you stay motivated when things get tough.
Alright, stop talking, start doing!
Today we take a look at some of Kaggle’s most engaging, fun, and publicly available datasets to take your analytical skills to the next level. Below is a list of 7 fun and interesting datasets. I hope they pique your curiosity too!
1. Disney Movies
Level: Basic
Dataset: https://www.kaggle.com/prateekmaj21/disney-movies
We are all children (at least at heart), so I am sure you will love this dataset at first sight. The data is simple, which makes it a great choice for practicing your analytical skills. It includes all Disney movies released between 1937 and 2016. I bet you know the majority of them. Start with some exploratory analysis in pandas and numpy, then use the data to practice your visualization skills in matplotlib and seaborn.
“If you can dream it, you can do it” — Walt Disney
What you can do with this data:
- Which Disney movie made the most revenue? (is it your favorite childhood memory?)
- Which genre was the most profitable?
- Which year was the most profitable for Disney, and what contributed to it?
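To get going on the questions above, here is a minimal pandas sketch. The column names (`movie_title`, `genre`, `release_date`, `total_gross`) and the four sample rows are placeholders made up for illustration, so check the header of the CSV you download and adjust them. Note that it uses gross revenue as a stand-in for "profitable", since profit would also need budget figures.

```python
import pandas as pd

# Made-up rows standing in for the real file. With the Kaggle CSV you would
# load the data with pd.read_csv(...) and rename columns as needed.
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Musical", "Adventure", "Musical", "Comedy"],
    "release_date": ["1937-12-21", "1994-06-15", "1994-11-25", "2016-03-04"],
    "total_gross": [100, 400, 250, 300],
})
movies["release_date"] = pd.to_datetime(movies["release_date"])


def top_movie(df):
    """Return the title of the movie with the highest gross."""
    return df.loc[df["total_gross"].idxmax(), "movie_title"]


def gross_by(df, key):
    """Sum the gross per group and sort from largest to smallest."""
    return df.groupby(key)["total_gross"].sum().sort_values(ascending=False)


by_genre = gross_by(movies, "genre")
by_year = gross_by(movies, movies["release_date"].dt.year)

print(top_movie(movies))
print(by_genre)
print(by_year)
```

Once the grouped totals look sensible, `by_genre.plot(kind="bar")` is a quick way to move on to the matplotlib practice mentioned above.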
2. Parties in New York
Level: Basic
Dataset: https://www.kaggle.com/somesnm/partynyc
“The city that never sleeps” is a famous nickname for New York City, popularized by Frank Sinatra. After analyzing this dataset, you will have no doubt that the nickname still held in 2016. You will find data on complaints received by the police, along with the time of each call, the police response time, and the part of the city where the incident took place. Again, basic but informative data. Perfect for your first EDA (Exploratory Data Analysis).
What you can do with this data:
- Determine the loudest neighborhoods in New York
- How long did it take to respond to the complaints?
- Does this city genuinely never sleep? Did you notice any pattern?
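The three questions above map onto three short pandas operations. This is only a sketch: the column names (`created`, `closed`, `borough`) and the sample rows are assumptions for illustration, and treating the gap between opening and closing a complaint as the "response time" is my own simplification.

```python
import pandas as pd

# Made-up complaints; replace with pd.read_csv(...) on the downloaded file.
calls = pd.DataFrame({
    "created": ["2016-01-01 23:10", "2016-01-02 01:30",
                "2016-01-02 14:00", "2016-01-03 23:45"],
    "closed": ["2016-01-02 00:10", "2016-01-02 02:00",
               "2016-01-02 16:00", "2016-01-04 00:15"],
    "borough": ["Brooklyn", "Manhattan", "Brooklyn", "Brooklyn"],
})
calls["created"] = pd.to_datetime(calls["created"])
calls["closed"] = pd.to_datetime(calls["closed"])

# Loudest areas: number of complaints per borough.
complaints_per_area = calls["borough"].value_counts()

# Response time in minutes, from opening to closing a complaint.
calls["response_min"] = (calls["closed"] - calls["created"]).dt.total_seconds() / 60
median_response = calls["response_min"].median()

# Does the city sleep? Count complaints per hour of the day.
complaints_per_hour = calls["created"].dt.hour.value_counts().sort_index()

print(complaints_per_area)
print(median_response)
print(complaints_per_hour)
```

The median is used instead of the mean because a handful of complaints that stay open for days would otherwise dominate the result.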
3. Fantasy Premier League
Level: Advanced
I have to confess something here. My fiancé is a very loving man, but in his order of priorities, I surely come second. Indeed, he lives for football. If it weren’t for him, I wouldn’t know this game existed. Fantasy Premier League is the most famous free football game of our time, with more than 7 million players worldwide! You can create your own leagues and compete with friends or other players from all over the world, which makes the experience even more exciting. Definitely the no. 1 dataset for all football fans! ⚽️
What you can do with this data:
- Exploratory data analysis
- Predict player performance with supervised machine learning or a neural network
- Build your dream team to win against your friends
4. Women’s Shoe Prices
Level: Advanced
Dataset: https://www.kaggle.com/datafiniti/womens-shoes-prices
“Give a Girl The Right Shoes, And She Can Conquer The World” — Marilyn Monroe
An astonishing sample of 10,000 women’s shoes with brand data, provided by Datafiniti. Here you will find detailed information such as colors, categories, and a full description of each shoe. Very promising and impressive data. Check it out now! 👠
What you can do with this data:
- Determine big brand strategies for luxury shoes
- Do you have your favorite brand? See how it was ranked
- Which brand has the widest price distribution? Why do you think that is?
- Is there a pattern for shoe price distribution per brand?
- What is the most expensive shoe brand in the dataset?
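For the price questions above, a per-brand summary table is a reasonable first step. The sketch below assumes two columns named `brand` and `price`, and the rows are invented. The real file may store prices differently (for example as minimum and maximum amounts), so treat the names as placeholders.

```python
import pandas as pd

# Invented listings; replace with the real data once you know its columns.
shoes = pd.DataFrame({
    "brand": ["Brand X", "Brand X", "Brand X", "Brand Y", "Brand Y", "Brand Z"],
    "price": [40.0, 60.0, 200.0, 90.0, 110.0, 500.0],
})

# One row per brand: how many listings, typical price, and spread.
summary = shoes.groupby("brand")["price"].agg(["count", "median", "min", "max"])
summary["range"] = summary["max"] - summary["min"]

# A brand with a single listing has no meaningful spread, so filter first.
min_listings = 2
reliable = summary[summary["count"] >= min_listings]

widest = reliable["range"].idxmax()    # brand with the widest price range
priciest = summary["median"].idxmax()  # brand with the highest median price

print(summary)
print(widest, priciest)
```

Ranking by median price is a deliberate choice here: one extravagant pair would be enough to put an otherwise ordinary brand on top if you ranked by the maximum.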
5. Formula 1 World Championship
Level: Advanced
Dataset: https://www.kaggle.com/rohanrao/formula-1-world-championship-1950-2020
Another one that may appeal to a male audience (but not only!). The FIA Formula 1 World Championship has been the premier form of motor racing in the world since 1950. The dataset includes circuits, constructor performance, driver standings, seasons, and qualifying results. In this pure war of constructors, data is the bread and butter of the F1 teams. Challenge yourself to be an analyst for your favorite team. Just dive in and get the most out of it.
What you can do with this data:
- Choose your favorite driver, check their statistics, and analyze their results during the season
- Pick out the most dominant drivers of all time and see what helped them get there
- Find patterns in the performance of the best teams
- Build a model that will predict results for future seasons
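Because this data is spread over several tables, most analyses begin with a join. Here is a small sketch of that step, assuming a results table and a drivers table that share a driver id. The column names (`race_id`, `driver_id`, `position`, `name`) and all rows are hypothetical, so map them to whatever the actual files use.

```python
import pandas as pd

# Hypothetical lookup table of drivers.
drivers = pd.DataFrame({
    "driver_id": [1, 2, 3],
    "name": ["Driver A", "Driver B", "Driver C"],
})

# Hypothetical finishing positions for three races.
results = pd.DataFrame({
    "race_id": [10, 10, 10, 11, 11, 11, 12, 12, 12],
    "driver_id": [1, 2, 3, 2, 1, 3, 1, 3, 2],
    "position": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

# Attach driver names to every result row.
merged = results.merge(drivers, on="driver_id", how="left")

# Number of wins per driver.
wins = merged.loc[merged["position"] == 1, "name"].value_counts()

# Average finishing position per driver (lower is better).
avg_finish = merged.groupby("name")["position"].mean().sort_values()

print(wins)
print(avg_finish)
```

The same merge pattern extends to circuits or constructors, which is where the "patterns in the best teams" question starts.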
6. Netflix
Level: Basic
Dataset: https://www.kaggle.com/shivamb/netflix-shows/notebooks
The way we watch TV has changed forever. This dataset includes our favorite Netflix movies and shows up to 2019. It’s worth noting how general TV preferences have shifted in the last few years. You will find information such as rating, type, country, release year, and duration for each title. Grab the popcorn, sit back on the couch, and dive into it!
What you can do with this data:
- Discover what content is available in different countries around the world
- Find interesting insights on actors/directors from your favorite show
- What has Netflix’s business strategy been recently? Anticipate how it will change in the years to come
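A quick way into the country and strategy questions above is to count titles per country and per year. The sketch assumes columns named `type`, `country`, and `release_year`, and that a title made in several countries lists them in one comma-separated cell. Both are assumptions to verify against the file, and the rows are invented.

```python
import pandas as pd

# Invented catalogue rows; one title has no country recorded.
titles = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "country": ["United States, India", "United Kingdom", "United States", None],
    "release_year": [2017, 2018, 2018, 2019],
})

# Split multi-country cells into one row per country, then count.
per_country = (
    titles["country"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

# Movies vs. TV shows per release year: a first look at how the mix shifts.
per_year_type = pd.crosstab(titles["release_year"], titles["type"])

print(per_country)
print(per_year_type)
```

Without the `explode` step, "United States, India" would be counted as a country of its own, which quietly skews every ranking.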
7. Friends
Level: Basic
Dataset: https://www.kaggle.com/rezaghari/friends-series-dataset/notebooks
Who has not heard of Friends? It’s one of the most recognizable and popular series ever, and of course one of the funniest I have ever seen. Among the many interesting columns, such as year of production, episode title, duration, and votes, you will find a full summary of each episode. Take advantage of it and develop your first NLP (Natural Language Processing) project!
What you can do with this data:
- Sentiment analysis by season/episode
- Which season was the longest?
- Who is the most famous director?
- Which episodes are rated the highest?
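Here is a toy version of the sentiment, season length, and rating questions above. Everything in it is an assumption for illustration: the column names (`season`, `title`, `duration`, `stars`, `summary`), the invented rows, and above all the tiny hand-written word lists. A real project would swap the word counting for a proper sentiment library.

```python
import pandas as pd

# Invented episodes; replace with the real file and its column names.
episodes = pd.DataFrame({
    "season": [1, 1, 2, 2],
    "title": ["Episode A", "Episode B", "Episode C", "Episode D"],
    "duration": [22, 23, 22, 30],
    "stars": [8.1, 8.5, 9.0, 7.9],
    "summary": [
        "A happy reunion turns into a fun party",
        "A sad breakup and an awkward dinner",
        "Great news and a happy surprise",
        "A long trip to the beach",
    ],
})

# Hand-picked word lists, far too small for real use.
POSITIVE = {"happy", "fun", "great", "love"}
NEGATIVE = {"sad", "awkward", "terrible", "fight"}


def toy_sentiment(text):
    """Count positive words minus negative words in a summary."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


episodes["sentiment"] = episodes["summary"].apply(toy_sentiment)

sentiment_by_season = episodes.groupby("season")["sentiment"].mean()
runtime_by_season = episodes.groupby("season")["duration"].sum()
top_rated = episodes.nlargest(2, "stars")["title"].tolist()

print(sentiment_by_season)
print(runtime_by_season.idxmax())
print(top_rated)
```

Here "longest season" is read as total runtime. Counting episodes per season with `episodes["season"].value_counts()` would be an equally fair reading.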
Conclusion
Starting a marathon requires good preparation. To achieve your goals, you need to be resilient and work very hard. Learning data science, like learning in many other fields, is no different. However, I have good news for those of you who, like me, are embarking on this journey. Data science can also be fun, positive, and extremely addictive, on one condition: choose your first dataset wisely. I believe that using an engaging and fun dataset, especially at the beginning of your data science journey, will help you stay on track.
It doesn’t hurt to try, does it?
Thank you for reading!