Set up a Python data analysis environment



With Python, installing large libraries that need to be compiled, such as numpy, can be a mess: installing numpy or similar packages with pip often requires first installing many OS-level libraries with apt-get or another system package manager. In this tutorial we show how to use Anaconda to install numpy and other tools quickly.

Step 1: Installing Anaconda and cookiecutter

First, download the Anaconda installer from the official Anaconda website.

When the download is ready, execute the downloaded .sh file (in my case on OS X). During installation, be very careful when it asks about adding Anaconda to the PATH environment variable: if you choose yes, it will change your default Python interpreter! We don't recommend overwriting PATH; instead, run export PATH=/Users/username/...:$PATH each time you need to do data analysis. But if you always use Python for data analysis, go ahead and let the installer overwrite the PATH variable.

bash Anaconda3-4.2.0-MacOSX-x86_64.sh

When the installation finishes, make sure your shell is picking up Anaconda's Python:

 $ which python
 /usr/bin/python
 $ export PATH=/Users/username/anaconda/bin:$PATH
 $ which python
 /Users/username/anaconda/bin/python
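The export works because the shell searches PATH entries left to right and uses the first match, so prepending Anaconda's bin directory makes its interpreter win. A minimal Python illustration of that lookup (not part of the setup itself):

```python
# Illustrate PATH resolution: directories are searched left to right,
# so prepending Anaconda's bin directory makes its python win.
import os
import shutil

search_dirs = os.environ["PATH"].split(os.pathsep)
print(search_dirs[0])          # searched first; after the export above,
                               # this would be the Anaconda bin directory
print(shutil.which("python"))  # full path the bare command `python` resolves to
```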

Now you can install cookiecutter with pip:

pip install cookiecutter

Next we are going to use the conda command to install numpy, which is much faster than building it from source:

conda install numpy

After a few minutes the prebuilt numpy binaries will be downloaded; no compilation or extra OS libraries are required.
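To confirm the install worked, a quick smoke test with the Anaconda interpreter (the array contents here are just an example):

```python
# Verify the conda-installed numpy imports and computes correctly.
import numpy as np

a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
print(np.__version__)
print(a.mean())                 # -> 2.5
```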

Note the difference between pip and the conda command. You can always use pip with Anaconda, but that defeats the purpose for compiled packages. We used pip for cookiecutter because it is a pure-Python project and no compilation is required.

Step 2: Using cookiecutter to start a data analysis project

Now we are going to use cookiecutter, a command-line utility that creates projects from project templates. For this type of project I like to use Cookiecutter Data Science.

Let's create a new project:

cookiecutter https://github.com/drivendata/cookiecutter-data-science

Step 3: Understanding the directory structure

The most important thing is to respect each directory's responsibility, especially within the src directory. For example, when your results are ready, save them in CSV format in data/processed; save the original downloaded data, in its raw format, in data/raw.
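As a sketch of that convention, here is how a processed result might be written out with only the standard library (the file name and rows are invented for illustration; run from the project root):

```python
# Write final results as CSV into data/processed, per the template's layout.
import csv
from pathlib import Path

processed = Path("data/processed")
processed.mkdir(parents=True, exist_ok=True)  # already exists in a fresh project

rows = [["city", "temp_c"], ["Oslo", 3.1], ["Lima", 19.4]]
with open(processed / "results.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```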

If you are working with Jupyter notebooks, save them in the notebooks directory; generated reports go in the reports directory (very useful for keeping the LaTeX files!).

Here is a brief summary of each directory:

├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

For more information on the Cookiecutter Data Science template, see the project's page on GitHub.