Installing Apex Docker sandbox
The Apex Docker sandbox is built on Apache Bigtop (http://bigtop.apache.org/), which brings together many of the Hadoop ecosystem components from an infrastructure perspective by providing support for packaging, testing, and deployment. Apex is part of Bigtop, and a Docker image with a single-node Hadoop cluster and Apex pre-installed is available at https://hub.docker.com/r/apacheapex/sandbox/.
If Docker isn't already installed, visit https://docs.docker.com/engine/installation/.
The Apex Docker sandbox is a lightweight, low-friction way to get a working single-node Hadoop cluster, and it is a suitable option for a typical local Apex development environment. The laptop should ideally have 16 GB or more of RAM, with 8 GB allocated to Docker, so that the development environment on the host OS and the single-node Hadoop cluster in the Docker container both have sufficient resources.
The following steps are for macOS. The install command assumes a workspace directory under your home directory (~/workspace) that is shared with the sandbox. That workspace directory should contain the Git repositories and the example project, so that the application package can be accessed directly from the build directory.
We will also expose some of the Hadoop service ports so that we can access them directly later. The SSH port will be useful for adding SSH tunnels later, as the need for additional ports arises, without taking down the Docker container. Install and start the sandbox:
docker run -v ~/workspace:/workspace --expose=22 -p 8022:22 -p 8088:8088 -p 50070:50070 -it -h apex-sandbox --name=apex-sandbox apacheapex/sandbox:3.6.0
The port mappings in the run command make the Hadoop services that run in the container appear as if they were running on the host machine. We will rely on this later when accessing the web UI from the browser.
The directory mapping makes the workspace folder (with the example project) in your host's home directory available inside the sandbox. We will later access files (including the previously built application package) from there.
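To confirm that the port mappings took effect, the following sketch can be run in a second terminal on the host while the sandbox is running. The function names sandbox_ports and check_web_ui are hypothetical helpers introduced here for illustration; the container name and ports are taken from the run command above.

```shell
# Hypothetical helper: list the host ports that Docker published for the sandbox.
sandbox_ports() {
  docker port apex-sandbox
}

# Hypothetical helper: probe the two mapped web UI ports on the host and
# print the HTTP status code that each one returns.
check_web_ui() {
  for port in 8088 50070; do
    printf '%s: ' "$port"
    curl -s -o /dev/null -w '%{http_code}\n' "http://localhost:$port/"
  done
}
```

Calling sandbox_ports followed by check_web_ui shows the published ports and one status line per web UI port. A status of 000 means curl could not connect, which usually indicates that the cluster is still starting or that the mapping is missing.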
Depending on your internet connection, it may take a while to download the image and start the cluster inside the Docker container. Once the cluster is up, the command prompt will be inside the container. Should the container process ever stop and you get kicked out to the host terminal (for example, because the MacBook goes to sleep), you can restart and re-attach using the following:
docker start -ai apex-sandbox
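If it is unclear whether the container has actually stopped, its state can be checked from the host first. This is a minimal sketch; sandbox_status is a hypothetical helper name.

```shell
# Hypothetical helper: show the sandbox container and its current state,
# including stopped containers (-a).
sandbox_status() {
  docker ps -a --filter name=apex-sandbox --format '{{.Names}}: {{.Status}}'
}
```

A status that begins with Exited means the container is stopped and the docker start command above applies.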
If, after an unexpected termination and restart of the sandbox, you see an error about not being able to write to HDFS because the NameNode is in safe mode, run the following inside the container:
hdfs dfsadmin -safemode leave
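The safe mode state can also be queried before forcing the NameNode out of it. The sketch below is meant for the container prompt; leave_safemode_if_on is a hypothetical helper name, and it assumes that the get subcommand reports the state as "Safe mode is ON" or "Safe mode is OFF".

```shell
# Hypothetical helper: leave safe mode only when the NameNode reports it is on.
leave_safemode_if_on() {
  if hdfs dfsadmin -safemode get | grep -q 'Safe mode is ON'; then
    hdfs dfsadmin -safemode leave
  else
    echo "NameNode is not in safe mode"
  fi
}
```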
When working with the sandbox, some links in the Hadoop web UI may not work because they refer to Docker container internal addresses (for example, http://apex-sandbox:8042/ for the node manager). A way to overcome this without having to recreate the sandbox with additional port mappings is to use SSH port forwarding (tunneling). It allows us to expose further ports as needed while the sandbox keeps running.
We define the extra ports in the SSH configuration. Use your favorite editor (vi, nano and so on) and edit ~/.ssh/config to add the following:
Host apex-sandbox
HostName localhost
User apex
Port 8022
DynamicForward 8157
LocalForward 8042 127.0.0.1:8042
With this change in place, open a new terminal window or tab and establish the SSH session that makes port 8042 available outside the container (the password is apex):
ssh apex-sandbox
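When another port is needed later, it can be forwarded with a one-off SSH session instead of editing the configuration again. The following is a sketch under that assumption; forward_port is a hypothetical helper name. It connects to localhost:8022 directly rather than through the apex-sandbox alias, so it does not try to bind the forwards from ~/.ssh/config a second time.

```shell
# Hypothetical helper: forward one additional container port to the same
# port number on the host. -N keeps the session open without a remote shell.
forward_port() {
  port="$1"
  ssh -N -p 8022 -L "$port:127.0.0.1:$port" apex@localhost
}
```

For example, forward_port 8188 would expose container port 8188 on the host until the session is interrupted with Ctrl+C. The port number 8188 is only an illustration.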
Now the port can be accessed as localhost:8042, but the resource manager generates its links with apex-sandbox:8042, so we still need a solution for the host name. Since the port forwarding (both the SSH configuration and the original docker run command) uses the same port numbers on the host and in the container for the Hadoop service ports, we can define apex-sandbox as an alias for localhost. Next, we will edit the hosts file to accomplish this, which works not only for the browser but also for other tools and protocols. An alternative that does not require changes to the hosts file is to install a browser proxy extension such as the freely available FoxyProxy (https://getfoxyproxy.org/). It should be configured to forward all traffic that matches apex-sandbox to the SOCKS proxy at localhost:8157 (the DynamicForward port defined earlier in ~/.ssh/config).
If you go with the hosts file option, edit the /etc/hosts file to define apex-sandbox as an alias for localhost, as shown in the following snippet. Editing the file requires root access and a password:
sudo vi /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost apex-sandbox
255.255.255.255 broadcasthost
::1 localhost
In conjunction with the port forwarding that we set up earlier, this ensures that the node manager links generated by the RM web UI work in the browser. Our single-node cluster with the Apex CLI is now ready to run and monitor applications.
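To double-check the hosts file change, the alias can be looked up with a small script on the host. This is a minimal sketch; has_host_alias is a hypothetical helper name, and it only inspects the 127.0.0.1 entries of the file it is given.

```shell
# Hypothetical helper: succeed if the given hosts file maps the given
# name to 127.0.0.1. Comment lines never match the address test.
# Usage: has_host_alias FILE NAME
has_host_alias() {
  awk -v name="$2" '
    $1 == "127.0.0.1" { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit !found }
  ' "$1"
}
```

Running has_host_alias /etc/hosts apex-sandbox && echo "alias present" prints the message only when the entry exists. With the SSH session from earlier still open, http://apex-sandbox:8042/ should then load in the browser.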