Introduction
Apache Arrow is a columnar in-memory analytics layer. In this tutorial we are going to compile PyArrow from source code. We don't recommend doing this, but it can be a good learning experience. In some cases you can't use Anaconda to install it, so right now this is the path to follow.
Step 1: Clone arrow repository
First we will clone the arrow repository, which has the C++ and Python code that we require.
sudo apt-get update
sudo apt-get install git
git clone https://github.com/apache/arrow.git
Step 2: Install OS Requirements
You need to have Python Anaconda installed.
sudo apt-get install g++ libboost-all-dev libncurses5-dev wget
sudo apt-get install libtool flex bison pkg-config libssl-dev automake
conda install cython numpy
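Before moving on, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch, not part of the original build; the helper name check_tools is hypothetical, and the list of tools to check is up to you.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 if all are present, 1 if at least one is missing.
# (check_tools is a hypothetical helper name used for illustration.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
#   check_tools git g++ wget pkg-config flex bison conda
```

If the function prints nothing and returns 0, every listed command was found.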
Step 3: Update Ubuntu cmake
This step is optional; if you run into problems with cmake in the following steps, come back to this one. Because we are going to use the latest Boost libraries, we need a newer cmake than the one that ships with Ubuntu (see the "Could not find the following Boost libraries" error in the Appendix). Download the cmake source code to install version 3.7.1:
wget https://cmake.org/files/v3.7/cmake-3.7.1.tar.gz
tar xf cmake-3.7.1.tar.gz
cd cmake-3.7.1
./bootstrap --system-curl
make
sudo make install
Verify that you are using the new version with:
cmake --version
The output should be:
cmake version 3.7.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).
Step 4: Install Thrift and Parquet-cpp
First we will install Apache Thrift, which is required by the parquet-cpp package:
git clone https://github.com/apache/thrift.git
cd thrift
./bootstrap.sh
./configure --without-php_extension --without-tests --without-qt4
make
sudo make install
thrift --help
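Before installing parquet-cpp we need to create symbolic links that expose some Boost libraries under their -mt names. The Boost libraries installed above are already multi-threading safe, but they are installed without the -mt suffix, so the plain libraries can stand in for the -mt ones.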
sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_regex.a /usr/lib/x86_64-linux-gnu/libboost_regex-mt.a
sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_system.a /usr/lib/x86_64-linux-gnu/libboost_system-mt.a
sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_filesystem.a /usr/lib/x86_64-linux-gnu/libboost_filesystem-mt.a
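The three links above can also be created in a loop that skips libraries that are missing or already linked. This is only a sketch under assumptions: link_mt_libs is a hypothetical helper name, and it takes the library directory as its first argument so it can be tried on a scratch directory first.

```shell
# Create libboost_<name>-mt.a symlinks next to existing libboost_<name>.a files.
# $1 is the library directory; the remaining arguments are Boost component names.
# (link_mt_libs is a hypothetical helper name used for illustration.)
link_mt_libs() {
    libdir=$1
    shift
    for name in "$@"; do
        src="$libdir/libboost_${name}.a"
        dst="$libdir/libboost_${name}-mt.a"
        if [ ! -e "$src" ]; then
            echo "skipping $name: $src not found"
        elif [ -e "$dst" ] || [ -L "$dst" ]; then
            echo "skipping $name: $dst already exists"
        else
            ln -s "$src" "$dst"
        fi
    done
}

# Example call (run as root when targeting a system directory):
#   link_mt_libs /usr/lib/x86_64-linux-gnu regex system filesystem
```

Because a shell function cannot be prefixed with sudo directly, one option is to save the function and the call in a script and run that script with sudo.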
Now we will install parquet-cpp. Note that you will get an error at the download_thirdparty.sh step, so we need to make some fixes to the parquet-cpp scripts that download the third-party libraries.
git clone https://github.com/apache/parquet-cpp.git
cd parquet-cpp
thirdparty/download_thirdparty.sh
thirdparty/build_thirdparty.sh
cmake . -DCMAKE_BUILD_TYPE=Release
make
sudo make install
Step 5: Install arrow-cpp
Now we are ready to install arrow-cpp:
cd arrow/cpp
mkdir release
cd release
cmake .. -DCMAKE_BUILD_TYPE=Release
make unittest
sudo make install
Step 6: Installing PyArrow
The arrow repository we cloned in Step 1 contains libraries for several languages, each in its own directory. Here we are going to install PyArrow, the Python library.
cd arrow/python
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
pip install -r requirements.txt
python setup.py build_ext --inplace
Appendix
Cmake Error: Could not find the following Boost libraries:
If the files are present on your filesystem, this usually means that cmake needs some gcc library; check the cmake log carefully. Possible solution: update cmake to a newer version, see Step 3.
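To decide whether a cmake update is worth trying, you can compare the installed version with 3.7.1, the version this tutorial builds. The sketch below assumes GNU sort (for the -V option) and treats 3.7.1 as the threshold only because it is the version used here, not because it is a documented minimum; both helper names are hypothetical.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print the version number from the first line of `cmake --version`,
# which has the form "cmake version X.Y.Z".
cmake_version() {
    cmake --version | head -n 1 | awk '{print $3}'
}

# Example call:
#   if version_ge "$(cmake_version)" 3.7.1; then
#       echo "cmake is at least 3.7.1"
#   else
#       echo "cmake is older than 3.7.1"
#   fi
```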
Protocol "https" not supported or disabled in libcurl
The following steps are for Ubuntu. Download libcurl and compile with ssl support:
sudo apt-get update
sudo apt-get install libssl-dev
wget https://curl.haxx.se/download/curl-7.52.1.tar.bz2
tar xf curl-7.52.1.tar.bz2
cd curl-7.52.1
./configure --with-ssl
make
sudo make install
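One way to check whether the curl found on your PATH was built with https support is to look at the "Protocols:" line printed by curl --version. This is a sketch with a hypothetical helper name; it only inspects the curl binary on the PATH, which may not be the same libcurl that another tool links against.

```shell
# Reads `curl --version` output on stdin and succeeds if the
# "Protocols:" line lists https as a separate word.
# (has_https_protocol is a hypothetical helper name used for illustration.)
has_https_protocol() {
    grep '^Protocols:' | grep -qw 'https'
}

# Example call:
#   if curl --version | has_https_protocol; then
#       echo "https is supported"
#   else
#       echo "https is NOT supported"
#   fi
```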
Could not find the Arrow library. Looked for headers in /include, and for libs in /lib
After you run make install, the file libarrow.a and the other libraries should be somewhere in your filesystem. Search for the path with:
sudo find / -name "libarrow.a"
Once you have the path, add its directory to LD_LIBRARY_PATH.
If the find command does not find anything, go back to the "Install arrow-cpp" step.
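The search and the directory lookup can be combined in a small helper. This is a sketch assuming GNU find (for -quit); find_lib_dir is a hypothetical name, and searching under a narrower root than / is usually faster.

```shell
# Print the directory of the first file named $2 found under root $1.
# Returns 1 if no such file is found. Uses GNU find's -quit to stop early.
# (find_lib_dir is a hypothetical helper name used for illustration.)
find_lib_dir() {
    path=$(find "$1" -name "$2" -print -quit 2>/dev/null)
    [ -n "$path" ] || return 1
    dirname "$path"
}

# Example call:
#   if dir=$(find_lib_dir /usr/local libarrow.a); then
#       export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$dir"
#   else
#       echo "libarrow.a not found"
#   fi
```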
Error while loading shared libraries: libthriftc.so.0: cannot open shared object file: No such file or directory
You need to add the directory that contains libthriftc.so to LD_LIBRARY_PATH. Find it with:
sudo find / -name 'libthriftc.so.0'
In my case libthriftc.so.0 was located in /usr/local/lib, so we need to add that path to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
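Running the export line repeatedly appends the same directory again each time. If that bothers you, a guarded version could look like the sketch below; add_to_ld_library_path is a hypothetical helper name, not something provided by Arrow or Thrift.

```shell
# Append directory $1 to LD_LIBRARY_PATH unless it is already listed.
# (add_to_ld_library_path is a hypothetical helper name used for illustration.)
add_to_ld_library_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*)
            ;;  # already present, nothing to do
        *)
            # Only insert the ':' separator when the variable is non-empty.
            LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Example call:
#   add_to_ld_library_path /usr/local/lib
```

To make the setting survive new shell sessions, the call (or the plain export) would need to go into a startup file such as ~/.bashrc.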
Error: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
Edit CMakeCache.txt and change the entry to CMAKE_CXX_FLAGS:STRING=-fPIC, then re-compile and install.
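The edit to CMakeCache.txt can be scripted with sed. The sketch below uses a hypothetical helper name, set_fpic_flag, and replaces whatever value CMAKE_CXX_FLAGS currently has, so any flags already stored there are lost; a backup copy is kept with a .bak suffix.

```shell
# Rewrite the CMAKE_CXX_FLAGS entry of the given CMakeCache.txt to -fPIC.
# Overwrites any existing value of that entry and keeps a backup in <file>.bak.
# (set_fpic_flag is a hypothetical helper name used for illustration.)
set_fpic_flag() {
    sed -i.bak 's|^CMAKE_CXX_FLAGS:STRING=.*$|CMAKE_CXX_FLAGS:STRING=-fPIC|' "$1"
}

# Example call, from inside the build directory:
#   set_fpic_flag CMakeCache.txt
#   make
#   sudo make install
```

An alternative that avoids editing the file by hand is to pass the value when configuring, for example cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=-fPIC, since -D sets cache entries.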