What is involved in Data Science and where do I start?


This post is in response to a question asked multiple times by my co-workers and other professionals who are interested in exploring Data Science opportunities.

What is involved in Data Science and where do I start?

I often try to give an elaborate answer, packed with lots of information, and I've sensed an information overload a number of times. This article lays that answer out in a more digestible form.

Slide 1 – What I was trying to convey through this slide is a simple, very basic end-to-end flow for addressing a specific business requirement in a Predict/Forecast scenario.

Some of these scenarios include predictions like real-time face recognition or end-of-day predictions of future demand for a product. What is shown here is a very basic flow, and it can get pretty complicated depending on the processes and variables involved.

Data Science experiments always depend on the data. That makes it important to understand the basics of how data is captured and stored.

Q: Do I need to know how the data is captured and where and how it is stored?

Not necessarily, but having some idea of where the data is coming from will definitely give you a starting point in your Data Science journey.

Slide 2 – Shows data capture and storage. This slide shows some common data origin points: a mobile app capturing a user's activity, a website recording a user's actions, an IoT device recording instrumentation data, a Point of Sale device capturing transactions, and so on.

Each of these activities may be capturing data into its respective on-premises databases/files, maintained and managed by the organization, or writing data to a cloud-based database.


Because these databases are designed to provide fast response times to end users, you may be restricted from querying them directly. From a design perspective, the majority of these databases are configured to provide CDC (Change Data Capture) tables separate from the main transactional tables.

Extracting data from CDC tables, or extracting data from the transactional tables during non-peak hours, and dumping it into a central area is called Raw Data Collection. Only very limited transformations are applied, to preserve the raw state of the data.
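To make this concrete, below is a minimal sketch of that collection step in Python, assuming pyodbc and pandas are available; the DSN, table name, column names, watermark value, and file paths are all illustrative, not a prescribed setup.

```python
# A minimal sketch of Raw Data Collection: pull the rows changed since the
# last extract from a hypothetical CDC table and land them untouched in a
# central raw area. DSN, table, columns, and paths are illustrative.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=SalesDB")      # assumes an ODBC DSN is configured
watermark = "2024-01-14 22:00:00"         # time of the last successful extract

# Read only the changes captured since the previous run.
changes = pd.read_sql(
    "SELECT * FROM cdc_orders WHERE captured_at > ?",
    conn,
    params=[watermark],
)

# Preserve the raw state: no transformations beyond landing the data.
changes.to_parquet("raw/orders/2024-01-15.parquet", index=False)
```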

HDFS was one of the preferred storage options, but many technologies are now available in this space. Over time this raw data becomes a crucial data asset for the organization.

Well, do I need to learn or work in this area to explore Data Science?

Let me answer this question in parts.

  1. This is not a requirement, but since the area demands ETL skills, a high-level knowledge of ETL helps. Many tools fill the gaps or attempt to automate ETL processes, so you do not need deep, hands-on expertise in this area.
  2. If your data science experiment demands that data be extracted directly from the Raw Data, you need some basic skills in retrieving data using ODBC or OLE DB.
  3. If you are dealing with Raw Data directly, it is expected that YOU will take care of all the required transformations on the data. This calls for some understanding of ETL concepts and data transformation techniques (a sketch follows this list).
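To make points 2 and 3 concrete, here is a minimal sketch that continues from the extraction example above: it reads the landed raw data and applies a few typical transformations with pandas. The file paths and column names are illustrative assumptions.

```python
# A minimal sketch of the transformations you may own when working against
# raw data directly; paths and column names are illustrative.
import pandas as pd

raw = pd.read_parquet("raw/orders/2024-01-15.parquet")

clean = (
    raw.drop_duplicates(subset="order_id")                            # drop duplicate captures
       .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))  # fix data types
       .dropna(subset=["customer_id", "amount"])                      # drop incomplete rows
)

clean.to_parquet("staged/orders/2024-01-15.parquet", index=False)
```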

What if this ETL work is done for you and now you have access to the more summarized data?

The majority of organizations opt to implement a Data Warehouse solution or set up Analytical Base Tables to support their Data Analysis and Reporting needs. From a Data Science point of view, this is a good place to start.

Slide 3 – Do I need to learn Data Warehouse concepts or work in this area?

Well, you don't need to be involved in designing or developing Data Warehouses/Data Marts/Analytical Base Tables, but a good understanding of Fact Tables, Dimensions, and Analytical Base Tables helps you get the required data out of the Data Warehouse in the required format.
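As an illustration, the sketch below pulls an Analytical Base Table out of a warehouse by joining a hypothetical fact table to a dimension. The DSN, table, and column names are assumptions, not a fixed schema.

```python
# A hedged sketch of building an Analytical Base Table from a warehouse:
# join a fact table to a dimension and aggregate. Names are illustrative.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=WarehouseDB")  # assumed ODBC DSN for the warehouse

abt = pd.read_sql(
    """
    SELECT d.product_category,
           d.region,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_key = d.product_key
    GROUP BY d.product_category, d.region
    """,
    conn,
)
```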

One of the key aspects of Data Science is having a deep understanding of the data, and profiling it is how you get there.
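A first profiling pass in pandas can be as simple as the sketch below; the file path is an illustrative example.

```python
# A minimal profiling pass with pandas; the path is illustrative.
import pandas as pd

df = pd.read_parquet("staged/orders/2024-01-15.parquet")

print(df.shape)           # number of rows and columns
print(df.dtypes)          # data type of each column
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column
```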

Slide 4 – Assuming that you have a good understanding of the data, what are the next steps in the learning path that you would need to take?

Build a Model. This is the most important work you would be doing as part of your Data Science experiments. In this task you apply various statistical methods to the data to identify or discover hidden patterns, and use them to predict/forecast future outcomes.

Which language should I learn to perform Data Science experiments?

Popular languages for Data Science are Python and R. I would recommend Python, as it is easy to learn and its syntax reads almost like plain English. Apart from learning the language, you also need to invest some time and effort in learning an IDE (Integrated Development Environment) to make it easier to code, debug, version control, and organize projects.

There are a number of IDEs for both R and Python; Python has more IDEs than R.

  • R – RStudio and some others
  • Python – Spyder, Jupyter, Visual Studio Code, PyCharm, Visual Studio Community, Azure Data Studio (Notebooks)

In this area you would be developing models using various Machine Learning modules. As part of building a model, you also train it and test it against test data to validate its accuracy.

Using existing modules like pandas and scikit-learn will help you generate a model with a few lines of code, but having knowledge of statistics, and applying that knowledge to refine your model, is what Data Science is all about.
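To show what "a few lines of code" looks like, here is a minimal scikit-learn sketch that trains a model and validates it against held-out test data. It uses a toy dataset bundled with scikit-learn so it runs as-is; a real experiment would use your own data.

```python
# A minimal train/test sketch with scikit-learn using a bundled toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back a test set so the model is validated against unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)          # train the model
print(model.score(X_test, y_test))   # accuracy against the test data
```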

Understanding the statistics of the data is just as crucial as understanding the data itself. In the beginning simple datasets may help, but in the real world you may be training and testing against terabyte- or petabyte-sized datasets.

Slide 5 – Once the model is validated with the results from training and testing (aka Supervised Learning), the next step is to publish the model. There are modules to perform this task; for example, in Python we can use the pickle module to save the finalized model to the file system.
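Continuing from the sketch above, publishing the finalized model with pickle takes only a few lines; the file name is illustrative.

```python
# Save (publish) the finalized model to the file system with pickle.
import pickle

with open("demand_model.pkl", "wb") as f:
    pickle.dump(model, f)   # 'model' is the trained model from the earlier sketch
```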

SQL Server (assuming Python is enabled) allows execution of Python code directly from T-SQL and can store the model in a table, where it is kept as a binary object.
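The slide describes running Python inside T-SQL; as a client-side alternative, the hedged sketch below stores a pickled model as a binary object in a SQL Server table via pyodbc. The DSN and the table schema are hypothetical.

```python
# A hedged sketch of storing a pickled model in SQL Server from Python.
# Assumes a table like: CREATE TABLE dbo.models (name NVARCHAR(100),
# model VARBINARY(MAX)); the DSN and names are illustrative.
import pickle
import pyodbc

model_bytes = pickle.dumps(model)   # 'model' from the earlier sketch

conn = pyodbc.connect("DSN=ModelsDB")
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO dbo.models (name, model) VALUES (?, ?)",
    "demand_model",
    pyodbc.Binary(model_bytes),
)
conn.commit()
```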

If your Data Science architecture involves SQL Server, then SQL Server knowledge is required.

Keep in mind that you may be the Model Builder but not necessarily the consumer of the Model.

Slide 6 – Consuming the model, in other words putting your model to work. You can still help/guide/assist in this area, but the business team that initiated the experiment will be more interested in consuming the model and using it to predict/forecast based on their requirements.

Consumers will be using the same language that was used to develop the model. The predict/forecast process can be automated as well; I will cover automation in a separate article.
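For the consumer side, here is a minimal sketch of loading the published model and scoring new data; the file name and feature values are illustrative.

```python
# Load the published model and predict on a new observation.
import pickle

with open("demand_model.pkl", "rb") as f:
    model = pickle.load(f)

new_observation = [[5.1, 3.5, 1.4, 0.2]]   # one row of illustrative feature values
print(model.predict(new_observation))
```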

Hope this article has given you a basic understanding of what Data Science expects of you. Having a good understanding of Math and Statistics positions you to become a smarter Data Scientist.