Introduction
Installing large Python libraries that need to be compiled, such as numpy, can be a mess: with pip you often also have to install a pile of OS-level libraries through apt-get or a similar package manager. In this tutorial we are going to show how to use Anaconda to install numpy and other tools quickly.
Step 1: Installing Anaconda and cookiecutter
First we need to download the Anaconda installer from the official Anaconda webpage.
When the download finishes, execute the .sh file (in my case on OS X). While installing, be very careful when the installer asks about adding Anaconda to the PATH environment variable: if you answer yes, it will change your default Python interpreter! We don't recommend overwriting PATH; instead, run export PATH=/Users/username/...:$PATH each time you need to do a data analysis. But if you always use Python for data analysis, go ahead and let it overwrite the PATH variable.
bash Anaconda3-4.2.0-MacOSX-x86_64.sh
When the installation finishes, make sure you are using the Anaconda interpreter:
$ which python
/Users/username/.pyenv/versions/analyzer/bin/python
$ export PATH=/Users/username/anaconda/bin:$PATH
$ which python
/Users/username/anaconda/bin/python
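You can also confirm from inside Python that the Anaconda interpreter is the active one (a quick sanity check; the exact path will depend on where you installed Anaconda):
import sys
print(sys.executable)  # should point at /Users/username/anaconda/bin/python after the export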
Now you can install cookiecutter with pip:
pip install cookiecutter
Now we are going to use the conda command to install numpy quickly:
conda install numpy
After a few minutes conda will have downloaded prebuilt numpy binaries; no compilation or extra OS libraries are required.
Note the difference between pip and the conda command. You can always use pip inside an Anaconda environment, but conda is the preferred way to install compiled packages. We used pip for cookiecutter because it is a pure Python project and needs no compilation.
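To double-check that the conda-installed numpy is importable, run a quick test from the Python interpreter (a minimal sketch; the version printed will depend on what conda downloaded):
import numpy as np
print(np.__version__)        # version installed by conda
print(np.arange(5).mean())   # tiny sanity check, prints 2.0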
Step 2: Using cookiecutter to start a data analysis project
Now we are going to use cookiecutter, which is a command-line utility that creates projects from project templates. For this type of project I like to use the Cookiecutter Data Science template.
Let's create a new project:
cookiecutter https://github.com/drivendata/cookiecutter-data-science
Step 3: Understanding the directory structure
The most important thing is to respect each directory's responsibility, especially inside the src directory. For example, when your results are ready, save them in CSV format in data/processed, and keep the original downloaded data in its raw format in data/raw.
If you are working with Jupyter notebooks, save them in the notebooks directory; generated reports go in the reports directory (very useful for keeping the LaTeX files!).
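As a concrete illustration of this convention, a small script could read the original dump from data/raw and write the final result to data/processed. This is just a sketch assuming pandas and hypothetical file names (raw_sales.csv, sales_clean.csv):
import pandas as pd

# data/raw holds the original, immutable dump; never modify it in place
df = pd.read_csv("data/raw/raw_sales.csv")

# whatever cleaning / aggregation the analysis needs
df = df.dropna()

# the final, canonical result goes to data/processed
df.to_csv("data/processed/sales_clean.csv", index=False)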
Just check this brief summary of each directory.
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
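One nice consequence of this layout is that src/__init__.py makes src an importable package, so notebooks can reuse project code instead of copy-pasting it. A minimal sketch, assuming a hypothetical clean() helper defined in src/data/make_dataset.py and a notebook started from the project root:
# first cell of notebooks/1.0-abc-initial-data-exploration.ipynb
from src.data.make_dataset import clean  # hypothetical helper, not part of the template

df = clean("data/raw/raw_sales.csv")
df.head()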
For more information on the Data Science template, see the Cookiecutter Data Science project page.