Exploring big data tools for Python: Apache Arrow



Apache Arrow is a columnar in-memory analytics platform. The core of the Arrow project is the specification for an in-memory columnar data layout; the specification itself is not a piece of software. Arrow was designed to exploit CPU efficiency: cache locality, superscalar execution, and vectorized operations. Its main feature is that two programs written in different languages that both speak Arrow can share data with low overhead (cross-language, cross-system communication). This tutorial focuses on that idea of sharing data with low overhead: we will show how to put data in memory and how to use it.
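To make the columnar idea concrete, here is a minimal sketch that uses only the Python standard library. It is not Arrow's actual memory format, and the field names and values are made up for illustration. It only shows why storing each column contiguously suits operations that scan a single field.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": 12.5},
    {"id": 3, "price": 11.0},
]

# Column-oriented: each field lives in its own contiguous typed buffer.
columns = {
    "id": array("q", [r["id"] for r in rows]),        # signed 64-bit integers
    "price": array("d", [r["price"] for r in rows]),  # 64-bit floats
}

# Summing one field reads a single contiguous buffer in the columnar layout,
# instead of visiting every record as in the row layout.
total_from_rows = sum(r["price"] for r in rows)
total_from_columns = sum(columns["price"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; what changes is how the values are laid out in memory, which is the part the Arrow specification standardizes.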

If you want to check the specification, click here.

Apache Arrow was designed for in-memory data; for on-disk storage, take a look at the Parquet project.

Step 0: Install with conda

pyarrow can also be built from source; however, if you want the shortest path and you use the Anaconda Python distribution, install it with:

conda install -c conda-forge pyarrow

If you want to install from source, check the tutorial on how to compile Arrow from source here.

Step 1: Download CSV and load into pandas data frame

Most classes in the PyArrow package warn that you should not call the constructor directly; use one of the from_* methods instead.

Let's first review the main from_* class methods of Table:

  • from_pandas: Convert pandas.DataFrame to an Arrow Table
  • from_arrays: Construct a Table from Arrow Arrays
  • from_batches: Construct a Table from a list of Arrow RecordBatches

The from_pandas method seems to be the friendliest for the task we need to accomplish in step 1. We are going to use the Python requests library to download the CSV data of a time series.

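import requests
from io import StringIO

# Download the CSV. This endpoint may no longer be served; substitute any CSV URL if needed.
response = requests.get('http://www.google.com/finance/historical?output=csv&q=AAPL')
# .text is the decoded body (str); splitlines() gives one entry per CSV row.
csv_data = response.text.splitlines()
# The first row holds the column names.
columns = csv_data.pop(0).split(',')
# pandas read_csv takes a file-like object; use StringIO to avoid touching the file system.
csv_file = StringIO()
csv_file.write('\n'.join(csv_data))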

Let's load the CSV with pandas first:

import pandas as pd
csv_file.seek(0)  # we just wrote to the file, so the position is at the end; rewind to the beginning before reading.

data_frame = pd.read_csv(csv_file, names=columns)
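If the download is not available, the same loading step can be tried offline. The sketch below assumes a small inline CSV whose values are made-up placeholders, not real quotes; only the column layout mimics a typical price history file. Here the header row is left in the text, so read_csv picks up the column names by itself.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: placeholder values, not real market prices.
sample_csv = (
    "Date,Open,Close,Volume\n"
    "2017-01-03,10.0,10.5,1000\n"
    "2017-01-04,10.5,10.2,1200\n"
)

# read_csv accepts any file-like object; the first row is used as the header.
sample_frame = pd.read_csv(StringIO(sample_csv))

print(sample_frame.shape)
print(list(sample_frame.columns))
```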

Step 2: Load PyArrow table from pandas data frame

Now that we have all our data in data_frame, let's use the from_pandas method to fill a pyarrow table:

from pyarrow import Table
table = Table.from_pandas(data_frame)

Our Arrow table object now holds all the content of the data frame.

Step 3: Fill pandas data frame with Arrow data

In this step we do the opposite: we fill a pandas data frame from the Arrow table.

data_frame_2 = table.to_pandas()

Step 4: Seeing the impact

If you followed the previous steps, you may start to ask why you should use Apache Arrow and what its advantage is. In most cases, if you want to access a database you are forced to use a driver between your application and the database, which adds the overhead of transforming (serializing and deserializing) the data. To see the impact, imagine that databases start to use the Apache Arrow format and your software accesses it directly, skipping the serialization step.

Keep up to date with the Apache Arrow project to learn more about this.