How to start the Data Scientist Journey?

by Verena Ojeda, Computer Engineer, Willdom

This is a brief overview of the key steps in embarking on a Data Science journey, based on my professional and personal experience. Nearly everyone in this field, particularly those who have been in it for a while, likes to say that there is no single way of doing things in Data Science. The magic of Data Science lies in the variety of analyses, solutions and interpretations.

Image 1. Data Science environment.

Breaking down Data Science into key steps

What is the first step to start venturing into this world that sounds so complex? The answer is easy: the best starting point is the problem. Data Science aims to find solutions to an array of real-life, everyday problems, ranging from those that seem irrelevant at first glance and are often overlooked to those of vital importance. While it is common to associate Data Science with areas like health, environment or economy, it is also widely applied in areas such as finance, marketing, language processing, and audio and image processing. But none of these problems could be solved without information about them, and information resides in data. If we want to solve a problem and we have (or can collect) data related to it, we can apply Data Science techniques to find interesting and innovative solutions.

Image 2. Data Science lifecycle.

Once we are clear about the problem at hand and have the necessary data, the next step comes into play: the search for solutions. We will spend a considerable amount of time leveraging existing knowledge about the problem, algorithms, programming languages and tools, among other things, to seek a solution. In this phase we will also do a lot of research, considering previous work, solutions already applied to the problem and how they can be improved.

Image 3. Machine Learning techniques.

Last but not least, there is an analysis phase, which serves as a general evaluation of the results obtained or as an evaluation of intermediate solutions, and which generally leads to new experiments. This phase requires deep knowledge of the problem, the data and the solution. It involves tables, visualizations, metrics and every other tool that can help us understand and compare results. In the end, this contributes to the selection of the best-performing solution.
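As a minimal sketch of this comparison step, assume a binary classification task and two hypothetical candidate models. The labels, predictions and model names below are made up purely for illustration and are not results from any real project. The idea is simply to compute the same metrics for every candidate and pick the best one by a chosen criterion (here F1, which is itself an assumption):

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for one set of predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative data only: true labels and the predictions of two candidates.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
candidates = {
    "model_a": [1, 0, 0, 1, 0, 1, 1, 0],
    "model_b": [1, 0, 1, 1, 0, 1, 1, 0],
}

results = {name: evaluate(y_true, preds) for name, preds in candidates.items()}
best = max(results, key=lambda name: results[name]["f1"])

# Print a small comparison table, one row per candidate.
for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
print("best by F1:", best)
```

Which metric decides the winner depends on the problem: when missing a positive case is costly, recall may matter more than accuracy.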

Image 4. Evolution of techniques and rise of Deep Learning.

So far everything sounds like an established cycle with a set of steps to follow, but when it comes to practice, the diversity of problems and solutions makes the game more entertaining. Let’s share a couple of case studies to make things more concrete.

Case study 1: Dengue outbreak prediction in Paraguay

In Paraguay, dengue is an endemic disease, causing epidemics year after year. Hence, as part of my graduate thesis, my partner and I decided to focus on the prediction of dengue outbreaks in the country under the hypothesis that the geographical situation, climate and population were variables involved in the spread of the virus.

A simple overview of my process for this project:

In this case, we needed existing data; otherwise we would have had to wait a couple of years for new data to be collected. Luckily, the Ministry of Health was able to provide us with data on dengue cases since 2013, the Climate Directorate contributed data on climate variables for those years, and the Directorate of Surveys and Census provided us with population data. However, we were not ready to model the solution just yet: although we had a good amount of data, not all of it was standardized. That was the first mishap, and we had to spend a good amount of time cleaning and preprocessing the data. In addition, we had to anonymize the dengue case data, since health data is considered sensitive. Once everything was ready, we moved on to the solution search phase, in which we used high-level tools to run supervised learning algorithms such as decision trees and random forests and analyzed their results until we arrived at an effective solution.
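To make the cleaning and anonymization steps more tangible, here is a minimal sketch using only the Python standard library. It is not the code we used. The field names, date formats, sample records and secret key are all hypothetical. Note also that it shows pseudonymization (replacing an identifier with a keyed hash), which is a weaker technique than full anonymization and is used here only as an illustration:

```python
import hashlib
import hmac
from datetime import datetime

# Hypothetical secret key; in practice it would be stored outside the code and data.
SECRET_KEY = b"replace-with-a-private-key"

# Assumed input date formats for this illustration.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(raw):
    """Return an ISO date string, trying each assumed format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def pseudonymize(patient_id):
    """Replace an identifier with a keyed hash: same input, same opaque token."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def clean_record(record):
    """Standardize one raw record and drop the original identifier."""
    return {
        "patient": pseudonymize(record["patient_id"]),
        # Collapse extra whitespace and unify capitalization.
        "district": " ".join(record["district"].split()).title(),
        "date": normalize_date(record["date"]),
    }


# Made-up records showing typical inconsistencies between sources.
raw_records = [
    {"patient_id": "A-001", "district": "  asunción ", "date": "03/02/2015"},
    {"patient_id": "A-001", "district": "ASUNCIÓN", "date": "2015-02-10"},
]

cleaned = [clean_record(r) for r in raw_records]
for row in cleaned:
    print(row)
```

After this step both records share the same district spelling, date format and patient token, so they can be grouped and joined with climate or population data without exposing the original identifier.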

While this work was not intended to produce the best possible solution, we did learn a lot about the problems that can arise when obtaining and processing data. In addition, we were able to find answers to some of our key questions: for example, we realized that climatic variables had relatively little impact on the number of cases compared with geographic location and population density.

Case study 2: Automatic detection of toxoplasmosis in fundus images

Over the last year I have been part of a project that seeks to automate the detection of toxoplasmosis from fundus images. To start the research we had to establish an imaging protocol and begin collecting data. After the first few months, we realized that the process was slow and that we would not obtain the number of images needed to apply the techniques we planned to consider. We learned that when dealing with very particular data, such as images of the retina that are usually captured with special and expensive cameras, the time available and the quantity and quality of the data all matter, because moving on to the next stages depends on them. Thanks to members of another project who also worked with fundus images and happily shared their data with us, we were able to continue with our project.

Once an acceptable amount of data had been obtained, we started the experimental phase. For this case we decided to use deep learning: we trained residual neural networks to identify images with lesions caused by toxoplasmosis and to categorize those lesions as active or inactive. The project is still in the experimental phase, but the results are extremely interesting. As we move forward, we not only get answers to our initial questions but also find new questions, and the cycle begins again. Currently, we are analyzing experiments on interpretability, which is a relatively new area of interest, especially for deep learning techniques. Another common characteristic of this field is that the cycle described above can be repeated several times until a favorable solution (or answer) is found.
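For readers unfamiliar with residual networks, the core idea is the skip connection: a block outputs its input plus a learned correction, rather than the correction alone. The sketch below illustrates that single idea in plain Python on small vectors. It is not our model: real residual networks use convolutional layers trained with a deep learning framework, and the weights and input here are hypothetical values chosen for illustration.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(x + F(x)), where F(x) = w2 * relu(w1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    # Skip connection: add the block's input back to the transformed signal.
    return relu([xi + fi for xi, fi in zip(x, fx)])


x = [1.0, 2.0, 0.5]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
zeros = [[0.0] * 3 for _ in range(3)]
passthrough = residual_block(x, zeros, zeros)

# With identity weights F(x) equals x here, so the output is x + x.
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
doubled = residual_block(x, identity, identity)

print(passthrough)
print(doubled)
```

Because a block can fall back to passing its input through, stacking many of them is less likely to degrade the signal, which is one reason this architecture is popular for image tasks.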

Conclusions

I hope these short examples have helped illustrate how Data Science can be applied to different problems and situations. To build viable solutions, a deep understanding of the problem is necessary. However, it is not enough. Beyond understanding the problem, we may need to collect or generate data. After this stage, we enter the Data Science phase, in which we leverage tools to seek solutions and then translate those solutions into actionable insights. Hence, it is data that will help us answer our key questions and ensure that the proposed solutions make a difference. Data is the reason this field bears its name, and it is indeed the beginning of the journey.
