dplyr is R's counterpart to the Pandas library in Python, enabling easy data exploration and manipulation

I started my data science journey by learning the Pandas library and, truthfully, there is everything to love about it: it is easy to use, straightforward, and has functionality for just about any task that involves manipulating and exploring a data frame.

Heck, I even made a full video series on YouTube teaching other people how to use Pandas. Feel free to check it out (shameless plug)!

Lately, however, I have found myself spending more and more time on R, primarily because I am preparing for my actuarial exams but also because I am curious to learn the…

Linear regression is one of the most fundamental concepts in statistics. Here’s how to perform and interpret it in R

It’s been a while since my last article on here and that’s because I have been busy preparing for my actuarial exam that is coming up in just two months. In the process of studying these past couple of weeks, I ran into a good old friend from way back in my first ever statistics class, linear regression.

As I learn more complex machine learning algorithms, I sometimes get caught up in building the fanciest model to solve a problem when, in reality, a simple linear model could easily have done the job. …

A beginner’s guide to the great and powerful k-means algorithm

Photo by Markus Spiske on Unsplash

In this article, we will discuss k-means clustering, an unsupervised learning algorithm and learn how to implement it in R.

Introduction to unsupervised learning and k-means clustering

First of all, what is unsupervised learning?

In contrast to supervised learning, where the label (output) of a predictive model is explicitly specified in advance, unsupervised learning lets the algorithm identify clusters within the data itself and then label them accordingly.

K-means clustering is an example of an unsupervised learning algorithm and it works as follows:

  1. Choose a number of clusters, K (this is what the k stands for in k-means clustering), into which the data are to…

MinMaxScaler vs StandardScaler vs RobustScaler

Feature scaling is the process of normalising the range of features in a dataset.

Real-world datasets often contain features that vary in magnitude, range and units. Therefore, in order for machine learning models to interpret these features on the same scale, we need to perform feature scaling.
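To make the three approaches in the title concrete, here is a minimal pure-Python sketch of the arithmetic each one performs on a single feature. The helper names (`min_max_scale`, `standard_scale`, `robust_scale`) and the `incomes` data are hypothetical and made up for illustration. They are not the Scikit-learn API, only an approximation of what MinMaxScaler, StandardScaler and RobustScaler compute with their default settings.

```python
import statistics


def min_max_scale(values):
    """Rescale to the [0, 1] range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical feature with one extreme outlier.
# Note: a constant feature would cause a division by zero in all three helpers.
incomes = [30_000, 35_000, 40_000, 45_000, 1_000_000]

print(min_max_scale(incomes))
print(standard_scale(incomes))
print(robust_scale(incomes))
```

In this made-up example the outlier squeezes the other four min-max values into a narrow band near zero, whereas the median and interquartile range used by the robust version are unaffected by it, so the typical values stay well spread.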

Photo by Stepan Babanin on Unsplash

In the world of science, we all know the importance of comparing apples to apples and yet many people, especially beginners, have a tendency to overlook feature scaling as part of their data preprocessing for machine learning. …

There are better ways to impute missing values than just taking the average

Photo by Vilmos Heim on Unsplash

Missing values are one of the most common problems in data analysis and machine learning. Most machine learning models require that a dataset does not contain any missing values before they can be fitted to the data. Therefore, it is crucial that we learn how to properly handle them.

I released a video tutorial a while back about handling missing data using Pandas and in that video, I spoke about the two main ways to deal with missing values in a dataset:

  1. If there are only a few rows with missing values or if a column has an overwhelming number…

And why you should stop using Pandas get_dummies

One of the most crucial preprocessing steps in any machine learning project is feature encoding. It is the process of turning categorical data in a dataset into numerical data. It is essential that we perform feature encoding because most machine learning models can only interpret numerical data and not data in text form.

In this article, we will learn:

  • The difference between a nominal variable and an ordinal variable
  • How OneHotEncoder and OrdinalEncoder can be used to encode these variables respectively
  • Why the Scikit-learn library is preferred over the Pandas library when it comes to encoding categorical features

As usual…

A comprehensive guide to selecting the most important features in any dataset

Picture this. You’re excited to finally start on your first machine learning project, having spent the last couple of weeks completing an online machine learning course. You come up with a problem that you would like to solve using machine learning, one on which you think you can properly put your new knowledge to the test. You happily jump onto Kaggle and find a dataset that you can work with. You open up a Jupyter notebook, then import and read the dataset.

All of a sudden, the initial confidence that you had disappears as you stare hopelessly at the hundreds of features…

Seaborn is an immensely powerful Python data visualisation library that is built on top of Matplotlib

I recently caught up with a data scientist working in the consulting industry, and he told me about a grossly underrated part of his job that he wished he had known about before starting his role: communication. Contrary to his initial belief that technical skills are all that is required to be a successful data scientist, he quickly realised that the ability to communicate is equally, if not more, important.

In the real world, clients often do not have the luxury of time to go through the code that has been written for a project line by…

My solution and analysis of the Titanic survival prediction competition on Kaggle

If you know me, you know I am a big fan of Kaggle. God only knows how many times I have brought up Kaggle in my previous articles here on Medium.

But my journey on Kaggle wasn’t always filled with roses and sunshine, especially in the beginning. Due to my lack of experience, I was initially struggling to grasp the workflow behind a machine learning project as well as the different terminologies that were being used.

It was only after months of sifting through online resources which included many technical articles, documentation and tutorial videos that I slowly started to learn concepts…

Pandas Zero to Hero is a video tutorial series aimed at teaching beginner-friendly ways of using Pandas

Before I get into the shameless promotion, I want to first share my discovery of Pandas, why it is so popular amongst the data science community and my motivation for starting the video series.

At the start of this year, I embarked on my journey to self-learn data science. After sifting through the seemingly unlimited online resources, there is one platform I frequently find myself returning to. It is also the very place I started out on my journey, and that is Kaggle.

Kaggle is a subsidiary of Google and an online community of data scientists and machine…

Jason Chong

Actuarial Science Student & Aspiring Data Scientist
