ML Pipeline for #365daysMLChallenge

by Heydarov

Starting a new learning phase is always complicated and rarely as comfortable as it seems at the beginning. To succeed at any challenge, a clear framework has to be defined.

Within the next few days I will post more about the resources and constraints we will probably face over the 365 days of commitment.

Now I would like to define a conceptual framework of the ML workflow that we will aim to excel at. Since interest in the field is growing rapidly, dozens of papers are published almost every day, and it is easy to get lost without ever reaching the target. Therefore, to stay on target, we will go through the pipeline step by step, and only once we are done with all the concepts within a step will we move on to the next one.

Let’s have a rough draft of what each step in the ML pipeline looks like.

  1. Data Import -> Choosing the right data is not always possible, either because business cases are mostly unstructured or because the target process is not compatible with the data. However, there are multiple sources from which we can get structured data to start with. Learning how to import various data types is an important skill in the process. Even after importing, the data is not ready to be processed, which brings us to the next step.
  2. Clean the Data -> also known as data pre-processing, this step, together with the preparation work in Step 3, takes up roughly 80% of the overall solution effort, so it makes sense to spend a lot of time here.
  3. Prepared Data -> this is the outcome of Step 2, where most of the time we also have to do feature engineering. Once our data is structured and vectorized, the following steps are mainly about applying various algorithms and tuning their parameters to get an optimized result. The dataset is also split into training, validation and test sets, and this has to be done properly to get correct results (see the first sketch after this list).
  4. Apply ML Algorithm -> At this point we apply our models, either supervised or unsupervised learning. I would suggest starting with supervised learning models; a few examples in that category are linear regression, logistic regression, random forests, neural networks, support vector machines, etc. Of course, to apply them well it is important to understand the mathematical and intuitive ideas behind the models, so this is one of the challenges we should get ourselves ready for. In this step we will also perform grid search, a technique for finding the optimal hyperparameters (see the second sketch after this list).
  5. Candidate Model(s) -> After the initial models have been applied, a few models with the best evaluation metrics will remain. At this step we should come up with the reasoning for choosing one specific model over the others. This decision is made based on various factors, such as interpretability of the model, robustness and so on.
  6. Model Application -> After we have chosen our model, the final step is to build an automated process and produce results. Creating an end-to-end pipeline is just as important as having a single predicted output (see the last sketch after this list).
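
To make Steps 1-3 more concrete, here is a minimal sketch in Python with pandas and scikit-learn. The file name, column names and the churn example are made up purely for illustration and are not part of the challenge itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Data Import -- pandas can read CSV, Excel, JSON, SQL tables, etc.
df = pd.read_csv("customers.csv")  # hypothetical file

# Step 2: Clean the Data -- remove duplicates and fill missing numeric values.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# Step 3: Prepared Data -- simple feature engineering plus the train/validation/test split.
df["spend_per_visit"] = df["total_spend"] / df["visits"].clip(lower=1)
X = df[["age", "visits", "total_spend", "spend_per_visit"]]
y = df["churned"]

# First split off a test set, then carve a validation set out of the remainder.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
```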
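For Step 4, a rough sketch of fitting a supervised model and tuning it with grid search might look like this. It reuses `X_train`, `y_train`, `X_val` and `y_val` from the previous sketch, and the parameter grid is only an example, not a recommendation.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values for the regularization strength.
param_grid = {"C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validated grid search over the parameter grid.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("validation accuracy:", search.best_estimator_.score(X_val, y_val))
```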
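Finally, for Step 6, here is one possible way to wrap preprocessing and the chosen model into a single end-to-end object and save it, so that predictions can be automated later. The model choice and file name are assumptions for the sake of the example.

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One pipeline object: preprocessing followed by the candidate model from Step 5.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Persist the whole pipeline and reload it to serve predictions automatically.
joblib.dump(pipeline, "churn_model.joblib")
loaded = joblib.load("churn_model.joblib")
print(loaded.predict(X_test[:5]))
```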

Now we are familiar with the conceptual framework of the ML pipeline. The next post will list some resources needed to start the challenge. If you have more ideas, write them in the comments over the next 13 days and we will surely consider them. The idea is that everyone can independently apply everything in their own way. The main goal here is to learn one concept per day and share it with each other…