Apache Arrow is a columnar in-memory analytics layer. In this tutorial we are going to compile PyArrow from source. We don't recommend doing this, but it can be a good learning experience. In some cases you can't use Anaconda to install it, and then building from source is currently the path to follow.
Step 1: Clone arrow repository
First we will clone the arrow repository, which has the C++ and Python code that we require.

    sudo apt-get update
    sudo apt-get install git
    git clone https://github.com/apache/arrow.git
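If you come back to this step later, git clone stops with an error when the arrow directory already exists. The following is a minimal sketch of a guard for that case; clone_if_missing is a hypothetical helper name used only for illustration, not something git provides.

```shell
# clone_if_missing URL DIR: clone URL into DIR unless DIR already exists.
# (Hypothetical helper for this tutorial; not part of git.)
clone_if_missing() {
    url=$1
    dir=$2
    if [ -d "$dir" ]; then
        echo "$dir already exists, skipping clone"
    else
        git clone "$url" "$dir"
    fi
}

# Usage (commented out because it needs network access):
# clone_if_missing https://github.com/apache/arrow.git arrow
```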
Step 2: Install OS Requirements
We require Anaconda Python to be installed, since the last command below uses conda.

    sudo apt-get install g++ libboost-all-dev libncurses5-dev wget
    sudo apt-get install libtool flex bison pkg-config g++ libssl-dev automake
    conda install cython numpy
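Before moving on, it can help to confirm that the tools installed above are actually on your PATH. This is a small sketch; missing_tools is a hypothetical helper name, and the list of commands is an assumption derived from the packages above (some packages install libraries only, so they have no command to check).

```shell
# missing_tools NAME...: print each command name that is not found on PATH.
# (Hypothetical helper; the tool list below is an assumption based on the
# packages installed in this step.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools g++ wget flex bison pkg-config automake conda cython)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All build tools found"
fi
```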
Step 3: Update ubuntu cmake
This step is optional; if you have problems with cmake in the next steps, you can come back to it. Since we are going to use the latest Boost libraries, we need a newer cmake than the one that comes with Ubuntu. See the appendix entry at the end of this tutorial for the "Could not find the following Boost libraries" error. Download the cmake source code and install version 3.7.1:

    wget https://cmake.org/files/v3.7/cmake-3.7.1.tar.gz
    tar xf cmake-3.7.1.tar.gz
    cd cmake-3.7.1
    ./bootstrap --system-curl
    make
    sudo make install
Verify that you are using the new version with:
``` bash
cmake --version
```
The output should be:
``` bash
cmake version 3.7.1

CMake suite maintained and supported by Kitware (kitware.com/cmake).
```
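The version check can also be scripted instead of read by eye. This is a sketch under two assumptions: that your sort supports the -V (version sort) option, as GNU sort on Ubuntu does, and that the first line of the cmake output has the form shown above. version_ge and cmake_version are hypothetical helper names.

```shell
# version_ge HAVE WANT: succeed when version HAVE is at least WANT.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    # After a version sort the smaller version comes first, so HAVE >= WANT
    # exactly when WANT is the first line.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cmake_version: read `cmake --version` output on stdin, print the number.
cmake_version() {
    awk 'NR == 1 { print $3 }'
}

if command -v cmake >/dev/null 2>&1; then
    have=$(cmake --version | cmake_version)
    if version_ge "$have" 3.7.1; then
        echo "cmake $have is new enough"
    else
        echo "cmake $have is older than 3.7.1"
    fi
else
    echo "cmake not found on PATH"
fi
```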
Step 4: Install Thrift and Parquet-cpp
First we will install Apache Thrift, which is required by the parquet-cpp package:
``` bash
git clone https://github.com/apache/thrift.git
cd thrift
./bootstrap.sh
./configure --without-php_extension --without-tests --without-qt4
make
sudo make install
thrift --help
```
Before installing parquet-cpp, we need to create symbolic links under the -mt names for some Boost libraries. This is safe because the installed Boost libraries are already multi-threading safe.

    sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_regex.a /usr/lib/x86_64-linux-gnu/libboost_regex-mt.a
    sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_system.a /usr/lib/x86_64-linux-gnu/libboost_system-mt.a
    sudo ln -s /usr/lib/x86_64-linux-gnu/libboost_filesystem.a /usr/lib/x86_64-linux-gnu/libboost_filesystem-mt.a
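Running ln -s a second time fails because the link already exists, and it silently creates a dangling link if the source library is missing. The sketch below wraps the three links in a loop that checks both cases first. link_boost_mt and the SUDO variable are hypothetical names introduced for illustration.

```shell
# link_boost_mt DIR: create libboost_<name>-mt.a links in DIR for the three
# libraries linked above, skipping any link that cannot or need not be made.
# (Hypothetical helper. Set SUDO=sudo when DIR is a system directory.)
link_boost_mt() {
    dir=$1
    for name in regex system filesystem; do
        src="$dir/libboost_${name}.a"
        dst="$dir/libboost_${name}-mt.a"
        if [ ! -e "$src" ]; then
            echo "missing $src, skipping"
        elif [ -e "$dst" ] || [ -L "$dst" ]; then
            echo "$dst already exists, skipping"
        else
            ${SUDO:-} ln -s "$src" "$dst"
        fi
    done
}

# Usage on Ubuntu (commented out because it changes system directories):
# SUDO=sudo link_boost_mt /usr/lib/x86_64-linux-gnu
```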
Now we will install parquet-cpp. Be aware that you will get an error at the thirdparty/download_thirdparty.sh step, so some fixes are needed in the parquet-cpp scripts that download the third-party libraries.

    git clone https://github.com/apache/parquet-cpp.git
    cd parquet-cpp
    thirdparty/download_thirdparty.sh
    thirdparty/build_thirdparty.sh
    cmake . -DCMAKE_BUILD_TYPE=Release
    make
    sudo make install
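The make steps in this tutorial run one job at a time, which can be slow for the C++ builds. A possible speed-up is to pass -j to make with the number of CPU cores. This is a sketch, not something the original steps require: build_jobs is a hypothetical helper name, and if a parallel build fails you can fall back to plain make.

```shell
# build_jobs: print the number of parallel jobs to pass to `make -j`.
# Uses nproc (GNU coreutils) when available, then getconf, then falls back to 1.
build_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
        getconf _NPROCESSORS_ONLN
    else
        echo 1
    fi
}

# Print the command that would be used in place of a plain `make`.
echo "make -j$(build_jobs)"
```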
Step 5: Install arrow-cpp
Now we are ready to install arrow-cpp. Run the commands from the directory that contains the arrow clone from Step 1.

    cd arrow/cpp
    mkdir release
    cd release
    cmake .. -DCMAKE_BUILD_TYPE=Release
    make unittest
    sudo make install
Step 6: Installing PyArrow
The arrow repository that we cloned in Step 1 includes libraries for using Arrow from several languages. Here we are going to install PyArrow; the other directories contain the libraries for other languages. Again, start from the directory that contains the arrow clone.

    cd arrow/python
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
    pip install -r requirements.txt
    python setup.py build_ext --inplace
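Once the build finishes, a quick way to confirm that it worked is to try importing the module. The sketch below assumes you run it from arrow/python, because build_ext --inplace leaves the compiled extension in the source tree. check_pyarrow is a hypothetical helper name.

```shell
# check_pyarrow [PYTHON]: succeed when PYTHON (default: python) can import pyarrow.
check_pyarrow() {
    py=${1:-python}
    if "$py" -c 'import pyarrow' >/dev/null 2>&1; then
        echo "pyarrow imports correctly with $py"
    else
        echo "pyarrow could not be imported with $py"
        return 1
    fi
}

# Run from arrow/python, since build_ext --inplace leaves the module there.
check_pyarrow python || true
```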
Appendix: Common errors
CMake Error: Could not find the following Boost libraries:
If the files do exist in your filesystem, this usually means that cmake needs some gcc library; check the cmake log carefully. Possible solution: update cmake to a newer version, see Step 3.
Protocol "https" not supported or disabled in libcurl
The following steps are for Ubuntu. Download libcurl and compile it with SSL support:

    sudo apt-get update
    sudo apt-get install libssl-dev
    wget https://curl.haxx.se/download/curl-7.52.1.tar.bz2
    tar xf curl-7.52.1.tar.bz2
    cd curl-7.52.1
    ./configure --with-ssl
    make
    sudo make install
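To see whether the rebuilt curl really has https enabled, you can inspect the Protocols line that curl --version prints. This sketch checks the curl command-line tool, which reports the protocols of the libcurl it is linked against; has_https is a hypothetical helper name.

```shell
# has_https: read `curl --version` output on stdin and succeed when the
# Protocols line lists https.
has_https() {
    grep '^Protocols:' | grep -qw 'https'
}

if command -v curl >/dev/null 2>&1; then
    if curl --version | has_https; then
        echo "curl supports https"
    else
        echo "curl was built without https support"
    fi
else
    echo "curl not found on PATH"
fi
```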
Could not find the Arrow library. Looked for headers in /include, and for libs in /lib
After you run make install, the file libarrow.a and the other libraries should be somewhere in your filesystem. Search for the path with:

    sudo find / -name "libarrow.a"
When you have the path, add it to LD_LIBRARY_PATH.
If the find command does not find anything, go back to the "Install arrow-cpp" step.
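When you add a directory to LD_LIBRARY_PATH several times in the same shell, the variable fills up with duplicates. The following sketch appends a directory only when it is not listed yet. add_lib_path is a hypothetical helper name, and /usr/local/lib is used only as an example value.

```shell
# add_lib_path DIR: append DIR to LD_LIBRARY_PATH unless it is already listed.
# (Hypothetical helper name.)
add_lib_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;    # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

add_lib_path /usr/local/lib
echo "$LD_LIBRARY_PATH"
```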
Error while loading shared libraries: libthriftc.so.0: cannot open shared object file: No such file or directory
You need to add the directory that contains libthriftc.so to LD_LIBRARY_PATH. Find it with:

    sudo find / -name 'libthriftc.so.0'
In my case libthriftc.so.0 was located at /usr/local/lib, so that is the path we need to add to LD_LIBRARY_PATH, as in the export command of the "Installing PyArrow" step.
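The search and the export can be combined so that the directory does not have to be copied by hand. This is a sketch: find_lib_dir is a hypothetical helper name, and limiting the search to /usr/local and /usr/lib is an assumption made to keep it fast.

```shell
# find_lib_dir NAME ROOT...: print the directory of the first file called NAME
# found under the given ROOT directories; fail when nothing is found.
# (Hypothetical helper name.)
find_lib_dir() {
    name=$1
    shift
    hit=$(find "$@" -name "$name" 2>/dev/null | head -n 1)
    [ -n "$hit" ] && dirname "$hit"
}

# Searching /usr/local and /usr/lib is an assumption; widen it to / if needed.
if dir=$(find_lib_dir 'libthriftc.so.0' /usr/local /usr/lib); then
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir"
    echo "added $dir"
else
    echo "libthriftc.so.0 not found"
fi
```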
Error: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
Edit CMakeCache.txt and change the CMAKE_CXX_FLAGS entry to CMAKE_CXX_FLAGS:STRING=-fPIC, then re-compile and install.
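The edit can be scripted if you prefer not to open the file by hand. The sketch below appends -fPIC to the existing CMAKE_CXX_FLAGS entry rather than overwriting it, which is a small variation on the instruction above so that flags already in the cache are kept. add_fpic_to_cache is a hypothetical helper name. Passing -DCMAKE_CXX_FLAGS=-fPIC on the cmake command line sets the same cache entry.

```shell
# add_fpic_to_cache FILE: make sure the CMAKE_CXX_FLAGS entry of a
# CMakeCache.txt contains -fPIC, keeping any flags that are already there.
# (Hypothetical helper name.)
add_fpic_to_cache() {
    cache=$1
    # Nothing to do when the entry already mentions -fPIC.
    if grep -q '^CMAKE_CXX_FLAGS:STRING=.*-fPIC' "$cache"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Append " -fPIC" to the entry, then drop the space if the value was empty.
    if sed -e '/^CMAKE_CXX_FLAGS:STRING=/ s/$/ -fPIC/' \
           -e 's/^\(CMAKE_CXX_FLAGS:STRING=\) /\1/' "$cache" > "$tmp"; then
        cat "$tmp" > "$cache"
    fi
    rm -f "$tmp"
}

# Usage, from the build directory: add_fpic_to_cache CMakeCache.txt
```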