Introduction
Data in its raw form is difficult to understand. The process of understanding data, or extracting information out of it, is called data analysis.
The purpose of the analysis process is to extract non-obvious information from data; this information can then be used for predictions. The objective of data analysis is to predict future behavior. Usually the analysis produces a model, which is used to make the predictions. A model is a mathematical representation of the system under study.
To understand data there are several visualization methods you should know. Visualization is usually done with different types of charts.
In this document we explain the data analysis process in general terms, trying to avoid tying it to any particular language. You can apply these ideas with any language or library you want; those are just tools. The short code sketches below use Python, but only as an illustration.
Types of data
Data can be divided in many different ways. We chose this particular categorization because it is convenient for the analysis process; you can find other ways to do it. We split data into two main categories: categorical and numerical. Categorical data can be divided into groups, while numerical data consists of values or measures.
Categorical data can be divided into two subcategories:
- Nominal: has no intrinsic order (e.g. colors, country names).
- Ordinal: has a specified order (e.g. small < medium < large).
Numerical data can also be divided into two categories:
- Discrete: can only take certain values (e.g. counts).
- Continuous: can take any value within a range (e.g. temperature).
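As a minimal sketch (in Python with pandas, using made-up columns and values), the four kinds of data might look like this:

```python
import pandas as pd

# Hypothetical survey records illustrating the four kinds of data.
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green"],   # nominal: no intrinsic order
    "shirt_size": ["S", "L", "M"],             # ordinal: S < M < L
    "num_pets": [0, 2, 1],                     # discrete: only whole numbers
    "height_cm": [172.5, 180.1, 165.0],        # continuous: any value in a range
})

# Telling pandas that shirt_size is ordered lets it sort and compare correctly.
df["shirt_size"] = pd.Categorical(df["shirt_size"],
                                  categories=["S", "M", "L"], ordered=True)
print(df.dtypes)
```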
The process
These are the steps we follow to process data and obtain information from it:
Problem definition
Data analysis arises from a problem that needs to be solved. Before doing any analysis you need to know which problem you have to solve. As a data analysis engineer this step is usually given to you, but that doesn't mean you have nothing to do. It's a good idea to think about the problem and understand its domain; this will help you consider alternatives and avoid over-engineering. Data analysis is typically used when you need to predict something unknown. The problem definition should give you a guideline for solving the entire problem. Remember that having a guideline and using the proper solution will keep you out of analysis paralysis. Some documentation is usually written at this step; we don't recommend long documents, a page or two is enough.
Data extraction
A successful analysis needs data that not only has good quality and quantity, but is also relevant to the problem. Sometimes obtaining this data requires more than simple technical extraction skills; previous experience in data analysis projects helps.
Data can be extracted from private datasets, sensor logs, or even the internet. Today it is pretty common to extract information from APIs and the web, which forces the data extractor to learn how to call APIs or parse HTML to build scrapers.
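As a rough sketch of both approaches in Python, assuming the requests and beautifulsoup4 libraries and purely hypothetical URLs and fields:

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

# Hypothetical API endpoint returning a JSON list of records.
resp = requests.get("https://example.com/api/measurements", timeout=10)
resp.raise_for_status()
records = resp.json()

# Hypothetical HTML page with a <table> we want to scrape.
page = requests.get("https://example.com/stats.html", timeout=10)
soup = BeautifulSoup(page.text, "html.parser")
rows = []
for tr in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip header rows that only contain <th> cells
        rows.append(cells)
```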
Data cleaning
Since most of the data will come from different sources, whoever is doing the analysis will have to merge it into a single dataset. If the data comes from the web, the HTML is usually cleaned and converted into a simpler, friendlier format. Most of the time this step produces a table where every relevant column holds values in a standard format. The step is simple and mostly technical, but take special care when parsing null values, since they are a frequent source of errors.
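A minimal cleaning sketch with pandas (the column names and sources are made up): merge two sources, coerce a dirty numeric column, and handle nulls explicitly:

```python
import pandas as pd

# Two hypothetical sources keyed by the same id.
api_df = pd.DataFrame({"id": [1, 2, 3], "value": ["10", "n/a", "30"]})
web_df = pd.DataFrame({"id": [1, 2], "label": ["a", "b"]})

# Merge the sources into a single table.
df = api_df.merge(web_df, on="id", how="left")

# Coerce bad strings like "n/a" to NaN instead of failing the parse.
df["value"] = pd.to_numeric(df["value"], errors="coerce")

# Decide explicitly what to do with nulls: fill, drop, or flag them.
df["label"] = df["label"].fillna("unknown")
df = df.dropna(subset=["value"])
print(df)
```

The errors="coerce" option turns unparseable values into NaN so you can handle them all in one place instead of crashing mid-parse.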
Data exploration
Data exploration is about finding patterns, relationships, or correlations in the data. At this step it is common to use charts to visualize these patterns. Some correlation analysis can be done to reduce redundancy or to report that some inputs are correlated. We recommend writing a summary report of the data at this step; it will help you build the correct model in the next step. The report should contain:
- A summary of all the data, such as means and the correlation of each value against all others.
- For categorical data, some grouping reports: sums, clustering, etc.
- Anomalies in the data.
- Any trend or relationship between values.
Note that at this step you are obtaining information!
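Continuing the pandas sketch with a small hypothetical table, the pieces of such a report could be produced like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned table from the previous step.
df = pd.DataFrame({
    "label": ["a", "a", "b", "b", "b"],
    "value": [10.0, 12.0, 30.0, 28.0, 31.0],
    "size":  [1.1, 1.3, 2.9, 2.7, 3.0],
})

print(df.describe())               # means, std, quartiles
print(df.corr(numeric_only=True))  # pairwise correlations of numeric columns
print(df.groupby("label")["value"].agg(["count", "mean", "sum"]))  # grouping report

# A quick chart often reveals trends and anomalies.
df.plot.scatter(x="size", y="value")
plt.show()
```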
Building predictive models
With the summary from the previous step it should be easier and more intuitive to choose which model to start with. Usually the model depends a lot on the problem, as we said in the problem definition step. Some people like to choose the model based on the type of its output, like this:
- Classification models: categorical output.
- Regression models: numeric output.
- Clustering models: descriptive output.
As a rule of thumb, always start with simple models, and if they perform well in the validation step, keep using them. If the model's predictive quality is very low, try moving to more complex models. We usually start with k-means or regression models.
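A minimal sketch of both starting points with scikit-learn, on synthetic data (the features and their relationship are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # hypothetical features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Simple regression model: predicts a numeric value.
reg = LinearRegression().fit(X, y)
print("R^2:", reg.score(X, y))

# Simple clustering model: groups similar rows together.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```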
Model validation/test
Usually your model takes data to learn from (in the supervised case), and it's very common to split your input data in two:
- Train data: used by the model to learn.
- Test data: used to assess the model's quality.
Test data should be selected with some care, since it can easily be a biased sample of the data. There are several ways to split the data in two; some of them are:
- Random split.
- Stratified sampling.
- Cluster sampling.
- Time-based split.
At this step you will also be able to compare your model with other models.
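A minimal sketch of a random and a stratified split with scikit-learn, again on synthetic data (the labels array is a hypothetical categorical target used only for stratification):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # hypothetical features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
labels = rng.integers(0, 2, size=100)  # hypothetical class labels

# Random split: 80% train, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))

# Stratified split: preserves the label proportions in both halves.
X_tr, X_te, lab_tr, lab_te = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0)
```

Keeping random_state fixed makes the split reproducible, so comparisons between models are made on the same held-out data.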