How to compile and install Apache Arrow from source code



Apache Arrow is a columnar in-memory analytics layer. In this tutorial we are going to compile PyArrow from source code. We don't recommend doing this, but it can be a good learning experience. In some cases you can't use Anaconda to install it, and then building from source is the path to follow.

Step 1: Clone the arrow repository

First we will clone the arrow repository, which has the C++ and Python code that we require.

sudo apt-get update
sudo apt-get install git
git clone https://github.com/apache/arrow.git
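
If you rerun the commands of this step, git clone fails because the arrow directory already exists. The following is a minimal sketch of a guard around the clone; the helper name clone_once is hypothetical and not part of git.

```bash
# clone_once URL DIR: clone URL into DIR unless DIR is already a git checkout.
# clone_once is a helper name made up for this tutorial.
clone_once() {
    if [ -d "$2/.git" ]; then
        echo "$2 is already cloned, skipping"
    else
        git clone "$1" "$2"
    fi
}

# Usage (needs network access):
#   clone_once https://github.com/apache/arrow.git arrow
```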

Step 2: Install OS Requirements

We need to have Python Anaconda installed, because the last command below uses conda.

sudo apt-get install g++ libboost-all-dev libncurses5-dev wget
sudo apt-get install libtool flex bison pkg-config libssl-dev automake
conda install cython numpy
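
Before moving on, it can help to confirm that the tools used in the rest of this tutorial are really on the PATH. This is only a sketch: the function name missing_tools is hypothetical, and the tool list is simply the commands that the following steps call.

```bash
# missing_tools NAME...: print every command from the list that is NOT
# on the PATH. The function name is made up for this tutorial.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools that the following steps call directly.
missing=$(missing_tools git g++ wget make pkg-config conda)
if [ -n "$missing" ]; then
    echo "Still missing:" $missing
else
    echo "All required tools found."
fi
```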

Step 3: Update Ubuntu cmake

This step is optional; if you run into problems with cmake in the next steps, you can come back to it. Since we are going to use the latest Boost libraries, we need a newer cmake than the one that comes with Ubuntu (see the appendix entry for the "Could not find the following Boost libraries" error). Download the cmake source code to install version 3.7.1:

wget https://cmake.org/files/v3.7/cmake-3.7.1.tar.gz
tar xf cmake-3.7.1.tar.gz
cd cmake-3.7.1
./bootstrap --system-curl
make
sudo make install
cd ..

Verify that you are using the new version with:

cmake --version

The output should be:

cmake version 3.7.1

CMake suite maintained and supported by Kitware (kitware.com/cmake).
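
Instead of reading the version by eye, the check can be scripted. The sketch below assumes GNU coreutils (for sort -V) and that the first line of the cmake --version output has the form shown above; version_ge is a hypothetical helper name.

```bash
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort) from GNU coreutils, which Ubuntu ships.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=3.7.1
if command -v cmake >/dev/null 2>&1; then
    # The first output line looks like: cmake version 3.7.1
    found=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_ge "$found" "$required"; then
        echo "cmake $found is new enough"
    else
        echo "cmake $found is older than $required"
    fi
else
    echo "cmake not found on PATH"
fi
```

If the script still reports the old version after installing, the shell may be caching the old cmake location; opening a new terminal usually clears that.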

Step 4: Install Thrift and Parquet-cpp

First we will install Apache Thrift, which is required by the parquet-cpp package:

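git clone https://github.com/apache/thrift.git
cd thrift
# a fresh git checkout has no configure script; bootstrap.sh generates it
./bootstrap.sh
./configure --without-php_extension --without-tests --without-qt4
make
sudo make install
thrift --help
cd ..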

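Before installing parquet-cpp we need to create symbolic links for some Boost libraries. The build looks for the multi-threaded variants (names ending in -mt), and the Boost libraries installed above are already thread-safe, so the links can simply point at them.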

sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_regex.a /usr/lib/x86_64-linux-gnu/libboost_regex-mt.a
sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_system.a /usr/lib/x86_64-linux-gnu/libboost_system-mt.a
sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_filesystem.a /usr/lib/x86_64-linux-gnu/libboost_filesystem-mt.a
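
Running the three ln commands a second time fails because the links already exist. A minimal sketch of a loop that skips existing links and missing libraries is shown below; the function name link_boost_mt and the LN variable are hypothetical names introduced for this example.

```bash
# link_boost_mt DIR NAME...: for each NAME, link DIR/libboost_NAME.a to
# DIR/libboost_NAME-mt.a unless the source is missing or the link exists.
# LN is overridable, e.g. LN="sudo ln" for a root-owned directory.
LN=${LN:-ln}

link_boost_mt() {
    libdir=$1
    shift
    for name in "$@"; do
        src="$libdir/libboost_${name}.a"
        dst="$libdir/libboost_${name}-mt.a"
        if [ ! -e "$src" ]; then
            echo "skip: $src not found" >&2
        elif [ -e "$dst" ] || [ -L "$dst" ]; then
            echo "skip: $dst already exists" >&2
        else
            $LN -s "$src" "$dst"
        fi
    done
}
```

For the system directory used above, the call would be LN="sudo ln" link_boost_mt /usr/lib/x86_64-linux-gnu regex system filesystem.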

Now we will install parquet-cpp. However, you will get an error at the download_thirdparty.sh step, so the parquet scripts that download the third-party libraries need some fixes first.

git clone https://github.com/apache/parquet-cpp.git
cd parquet-cpp
cmake . -DCMAKE_BUILD_TYPE=Release
make
sudo make install
cd ..

Step 5: Install arrow-cpp

Now we are ready to install arrow-cpp:

cd arrow/cpp
mkdir release
cd release
cmake .. -DCMAKE_BUILD_TYPE=Release
make unittest
sudo make install
cd ../../..
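
The build above runs on a single core, which can take a long time. A common way to speed it up is to pass a job count to make. This is only a sketch, assuming GNU make; build_jobs is a hypothetical helper name.

```bash
# build_jobs: print the number of CPU cores to use for `make -j`,
# falling back to 1 when neither nproc nor getconf can tell us.
build_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
        getconf _NPROCESSORS_ONLN
    else
        echo 1
    fi
}

echo "would run: make -j$(build_jobs) unittest"
```

Inside the release directory, the make unittest line would then become make -j"$(build_jobs)" unittest.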

Step 6: Install PyArrow

The arrow repository that we cloned in Step 1 includes the Arrow libraries for several languages, each in its own directory. Here we are going to install PyArrow, which lives in the python directory.

cd arrow/python
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
pip install -r requirements.txt
python setup.py build_ext --inplace
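
Because the extension is built in place, a quick way to check the result is to try the import from inside the arrow/python directory. The sketch below assumes the python on your PATH is the one used for the build; can_import and the PYTHON variable are hypothetical names for this example.

```bash
# can_import MODULE: succeeds when the Python interpreter can import MODULE.
# PYTHON defaults to `python`, the interpreter used in this tutorial.
can_import() {
    "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1
}

if can_import pyarrow; then
    echo "pyarrow imports correctly"
else
    echo "pyarrow cannot be imported yet; check LD_LIBRARY_PATH and the build log"
fi
```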


Appendix: Common errors

CMake Error: Could not find the following Boost libraries:

If the Boost files do exist in your filesystem, this usually means that cmake is missing some gcc library; check the cmake log carefully. Possible solution: update cmake to a newer version, as described in Step 3.

Protocol "https" not supported or disabled in libcurl

The following steps are for Ubuntu. Download libcurl and compile it with SSL support:

sudo apt-get update
sudo apt-get install libssl-dev
wget https://curl.haxx.se/download/curl-7.52.1.tar.bz2
tar xf curl-7.52.1.tar.bz2
cd curl-7.52.1
./configure --with-ssl
make
sudo make install
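
After the rebuild you can check whether the curl on your PATH lists https among its protocols. This is a sketch that assumes curl --version prints a line starting with "Protocols:"; has_https is a hypothetical helper name.

```bash
# has_https: read `curl --version` output on stdin and succeed when the
# "Protocols:" line lists https.
has_https() {
    grep '^Protocols:' | grep -qw 'https'
}

if command -v curl >/dev/null 2>&1; then
    if curl --version | has_https; then
        echo "curl supports https"
    else
        echo "curl was built without https support"
    fi
fi
```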

Could not find the Arrow library. Looked for headers in /include, and for libs in /lib

After you run make install, libarrow.a and the other Arrow libraries should be somewhere in your filesystem. Search for the path with:

sudo find / -name "libarrow.a"

When you have the path, add its directory to LD_LIBRARY_PATH.

If the find command does not find anything, go back to the arrow-cpp installation step and run it again.
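
The search and the export can be combined in one small script. The sketch below assumes the library was installed under /usr/local, which is only the usual default for make install; lib_dir is a hypothetical helper name.

```bash
# lib_dir ROOT FILE: print the directory of the first FILE found under ROOT,
# or print nothing and fail when there is no match.
lib_dir() {
    match=$(find "$1" -name "$2" 2>/dev/null | head -n 1)
    [ -n "$match" ] && dirname "$match"
}

# /usr/local is an assumption; pass / to search the whole filesystem.
if dir=$(lib_dir /usr/local libarrow.a); then
    echo "libarrow.a found in $dir"
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir"
else
    echo "libarrow.a not found under /usr/local"
fi
```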

Error while loading shared libraries: libthriftc.so.0: cannot open shared object file: No such file or directory

You need to add the directory that contains libthriftc.so to LD_LIBRARY_PATH. First locate the file:

sudo find / -name 'libthriftc.so.0'

In my case libthriftc.so.0 was located at /usr/local/lib, so we need to add that path to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
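
Note that when LD_LIBRARY_PATH starts out empty, the export above produces a value with a leading colon, and the dynamic linker treats that empty entry as the current directory. Running it repeatedly also appends the same path again. A minimal sketch that avoids both follows; ld_path_add is a hypothetical helper name.

```bash
# ld_path_add DIR: append DIR to LD_LIBRARY_PATH unless it is already listed.
# An empty variable becomes just DIR, with no leading colon.
ld_path_add() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

ld_path_add /usr/local/lib
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

To make the setting survive new terminals, the function and the call can go into your shell startup file.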

Error: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC

Edit CMakeCache.txt and change the CMAKE_CXX_FLAGS entry to CMAKE_CXX_FLAGS:STRING=-fPIC, then recompile and install.
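
The edit can also be scripted. The sketch below rewrites only the exact CMAKE_CXX_FLAGS:STRING line and, like the manual edit, overwrites any flags already on it; set_fpic is a hypothetical helper name, and the script assumes it is run from the build directory that holds CMakeCache.txt.

```bash
# set_fpic CACHE_FILE: rewrite the CMAKE_CXX_FLAGS entry of a CMakeCache.txt
# so that it reads CMAKE_CXX_FLAGS:STRING=-fPIC. Entries such as
# CMAKE_CXX_FLAGS_DEBUG are left untouched.
set_fpic() {
    cache=$1
    sed 's|^CMAKE_CXX_FLAGS:STRING=.*$|CMAKE_CXX_FLAGS:STRING=-fPIC|' \
        "$cache" > "$cache.tmp" && mv "$cache.tmp" "$cache"
}

if [ -f CMakeCache.txt ]; then
    set_fpic CMakeCache.txt
    echo "updated CMakeCache.txt"
fi
```

An alternative that avoids touching the cache by hand is to pass -DCMAKE_CXX_FLAGS=-fPIC to cmake when configuring a fresh build directory.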