Take a deep breath! Think about the next 365 days: every day we will have various tasks, hopefully many good moments, a few bad ones, and invaluable memories. Among all of those things, we have decided to put the #365daysMLChallenge into action.
To succeed at this challenge, we have defined an ML framework. In this post, we will also look at the resources we will use and the tasks for Week 1.
What resources do we have?
ML people tend to fall into two groups: Python users and R users. I personally use Python heavily. However, whichever language you use, the end results should be the same.
For Python, there are multiple integrated development environments (IDEs) to choose from. I use Jupyter Notebook and am quite satisfied with it. However, if you are familiar with RStudio, you may find the Spyder IDE more comfortable.
In any case, you can download Anaconda Navigator and choose whichever IDE you want. You can also find many standalone IDEs with a quick Google search.
For R, there are not many options, so I'd go straight to RStudio, the official IDE for the R language. Be aware that you have to download and install both R and RStudio.
I mostly Google the datasets or dataset ideas I have in mind. However, many websites offer free datasets. The one I really love is the UCI Machine Learning Repository. Another source I favor is Kaggle. For more, please use Google.
Where to Save Our Work?
365 days, 365 code snippets. We should keep track of each day and have a backup of every snippet. The tool I use and recommend is GitHub. After signing up, create two folders: one for the datasets you will use, and one for the code. Make sure each day has a consistent, clear name (e.g., Day1_Code.ipynb, Day1_Dataset.csv). You may also use GitHub Gists to save each snippet and post it anywhere.
Episode 1 – The Realization
01.01 – Import and Practice the Libraries
Import the main libraries and practice a couple of basic functions. For R, try data.table, dplyr, and ggplot2; for Python, pandas and NumPy.
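On the Python side, a minimal warm-up might look like the sketch below (the toy DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small DataFrame to exercise a few core functions.
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "score": [91, 85, 78],
})

# NumPy: vectorized arithmetic on the underlying array.
arr = df["score"].to_numpy()
print(arr.mean())    # average score
print(np.sqrt(arr))  # element-wise square root

# pandas: filtering and summary statistics.
print(df[df["score"] > 80])
print(df["score"].describe())
```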
02.01 – Data Import from a Local Drive
Try multiple ways of importing data from your local drive: .csv, .xlsx, and other file types. You can choose any dataset you want.
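For Python, `pandas.read_csv` covers the common case. The sketch below writes a tiny CSV to a temporary file first, purely so the example is self-contained; in practice you would point it at your own file path:

```python
import os
import tempfile

import pandas as pd

# Create a throwaway CSV so the example runs anywhere.
csv_text = "id,value\n1,10\n2,20\n3,30\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(csv_text)
    path = f.name

df = pd.read_csv(path)
print(df.shape)  # (3, 2)

# Excel files work the same way, provided an engine like openpyxl is installed:
# df = pd.read_excel("my_data.xlsx")

os.remove(path)  # clean up the temporary file
```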
03.01 – Data Import from the Internet
Try multiple means of importing data from the internet. You can choose any dataset URL you want.
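With pandas, `read_csv` accepts a URL directly. A sketch using the Iris dataset from the UCI repository mentioned above (this assumes the URL is reachable from your machine; the file ships without a header row, so we supply column names):

```python
import pandas as pd

# The classic Iris dataset, hosted by the UCI Machine Learning Repository.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

iris = pd.read_csv(url, header=None, names=cols)
print(iris.shape)   # 150 rows, 5 columns
print(iris.head())
```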
04.01 – Describe Your Dataset and Its Characteristics
Get a summary of your dataset: description, data types, statistical summary, etc.
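In pandas, three methods cover most of this step: `info`, `dtypes`, and `describe`. A small sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 31, 45, 29],
    "city": ["Rome", "Oslo", "Rome", "Lima"],
})

df.info()                # column names, dtypes, non-null counts
print(df.dtypes)         # dtype per column
print(df.describe())     # numeric summary: mean, std, quartiles
print(df.describe(include="object"))  # categorical summary: unique, top, freq
print(df.head())         # first few rows
```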
05.01 – Check for Missing Values and Eliminate Them
Check whether there are any missing values and, based on your reasoning, try multiple ways to handle them.
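A sketch of three common strategies in pandas (the toy data is invented; which strategy is right depends on your dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, "x", "y", "y"],
})

print(df.isna().sum())  # missing count per column

# Option 1: drop any row containing a NaN.
dropped = df.dropna()

# Option 2: impute a numeric column with its mean.
filled_mean = df.assign(a=df["a"].fillna(df["a"].mean()))

# Option 3: impute a categorical column with its mode.
filled_mode = df.assign(b=df["b"].fillna(df["b"].mode()[0]))

print(dropped.shape, filled_mean["a"].tolist(), filled_mode["b"].tolist())
```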
06.01 – Check for Outliers
Check whether your data has any outliers. You can use visualization methods as well.
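One simple rule of thumb is the 1.5×IQR fence used by box plots; a sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]

# A box plot makes the same point visually:
# s.plot.box()
```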
07.01 – Use Multiple Visualization Methods on Your Dataset
Practice visualizing your dataset with any package you find comfortable (e.g., I prefer ggplot2 for R and matplotlib for Python). Visualize numerical values, categorical values, statistical summaries, etc.
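A matplotlib sketch covering one numeric and one categorical view (the data and output filename are invented; the `Agg` backend is set so the script also runs without a display):

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe on headless machines
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "height": [4.1, 4.5, 7.0, 6.8, 7.2],
})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
df["height"].plot.hist(ax=axes[0], title="Numeric: histogram")
df["species"].value_counts().plot.bar(ax=axes[1], title="Categorical: bar chart")
fig.tight_layout()
fig.savefig("day7_plots.png")
print(os.path.getsize("day7_plots.png"), "bytes written")
```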