As a big data enthusiast, if you are interested in learning Apache Hive, this blog will help you set it up on your local machine in a few easy steps. This setup is a good starting point for beginners because it simulates a production environment. This tutorial uses Docker containers to spin up Apache Hive. Before we jump right in, here is a quick overview of the critical components in this cluster.
# Apache Hive:
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics of large datasets residing in distributed storage using SQL.
# Docker:
Docker is an open-source technology to package an application and all its dependencies into a container.
# NameNode:
The NameNode is at the core of the Hadoop cluster. It keeps the directory tree of all files in the file system, and tracks where the file data is actually kept in the cluster.
# DataNode:
The DataNode, on the other hand, stores the actual file data. Files are replicated across multiple DataNodes in a cluster for reliability. In Hive terms, the data stored in Hive tables is spread across the DataNodes within the cluster, while the NameNode keeps track of which blocks of data live on which DataNodes. We are using a single DataNode in this tutorial for the sake of simplicity.
# Hive Metastore:
Hive uses a relational database to store the metadata (e.g. schema and location) of all its tables. The default database is Derby, but we will be using PostgreSQL in this tutorial. The key benefit of keeping this metadata in a relational database rather than on HDFS is the low latency of metadata lookups, which improves query performance.
# Volumes:
Volumes are the preferred mechanism for persisting data generated by and used by Docker containers, and they allow us to preserve container state between subsequent docker runs. If you don't explicitly create it, a volume is created the first time it is mounted into a container, and it continues to exist after that container stops or is removed. In this setup, we mount volumes into the NameNode and DataNode containers to persist the HDFS file system, and a volume on the PostgreSQL container to persist the metadata of previously created Hive tables.
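Note that the Compose file in the next step uses bind mounts (host paths like ./hdfs/namenode), which map a directory on your machine into the container. A Docker-managed named volume is the alternative; a minimal, hypothetical Compose fragment (the service and volume names here are illustrative, not part of this tutorial) looks like this:

```yaml
services:
  db:
    image: postgres
    volumes:
      # named volume: Docker manages where the data actually lives
      - db-data:/var/lib/postgresql/data
volumes:
  db-data: {}   # created automatically on first use; survives container removal
```

Either mechanism persists data across container restarts; bind mounts simply keep the files visible in your project directory, which is convenient for a local setup like this one.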
# Prerequisites:
The only prerequisite is Docker. I'm running it on a MacBook Pro with 8 GB RAM. Installing Docker is pretty straightforward: follow the instructions at https://docs.docker.com/docker-for-mac/install/ if you do not already have it on your machine.
# 1. Directory Structure:
Create the following directory structure on your local machine:
mkdir Hive
cd Hive
touch docker-compose.yml
touch hadoop-hive.env
mkdir employee
cd employee
touch employee_table.hql
touch employee.csv
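The same layout can also be created non-interactively. The sketch below recreates it with `mkdir -p` and lists the result, so you can confirm the structure matches before moving on:

```shell
# Recreate the layout in one go (equivalent to the commands above)
mkdir -p Hive/employee
touch Hive/docker-compose.yml Hive/hadoop-hive.env
touch Hive/employee/employee_table.hql Hive/employee/employee.csv

# List everything to confirm the structure
find Hive | sort
```

You should see the two config files directly under Hive and the two employee files under Hive/employee.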
# 2. Edit files:
Open each file in your favorite editor and paste the code snippets below into them.
# docker-compose.yml:
Compose is a Docker tool for defining and running multi-container Docker applications. We are using the below YAML file to configure the services required by our Hive cluster. The biggest advantage of using Compose is that it creates and starts all the services using a single command.
version: '3'
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
    container_name: namenode
    volumes:
      - ./hdfs/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=hive
    env_file:
      - ./hadoop-hive.env
    ports:
      - "50070:50070"
  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
    container_name: datanode
    volumes:
      - ./hdfs/datanode:/hadoop/dfs/data
    env_file:
      - ./hadoop-hive.env
    environment:
      SERVICE_PRECONDITION: "namenode:50070"
    depends_on:
      - namenode
    ports:
      - "50075:50075"
  hive-server:
    image: bde2020/hive:2.3.2-postgresql-metastore
    container_name: hive-server
    volumes:
      - ./employee:/employee
    env_file:
      - ./hadoop-hive.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    depends_on:
      - hive-metastore
    ports:
      - "10000:10000"
  hive-metastore:
    image: bde2020/hive:2.3.2-postgresql-metastore
    container_name: hive-metastore
    env_file:
      - ./hadoop-hive.env
    command: /opt/hive/bin/hive --service metastore
    environment:
      SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432"
    depends_on:
      - hive-metastore-postgresql
    ports:
      - "9083:9083"
  hive-metastore-postgresql:
    image: bde2020/hive-metastore-postgresql:2.3.0
    container_name: hive-metastore-postgresql
    volumes:
      - ./metastore-postgresql/postgresql/data:/var/lib/postgresql/data
    depends_on:
      - datanode
# hadoop-hive.env:
The .env file sets the environment variables used by the containers. The bde2020 images translate each variable into the corresponding Hadoop/Hive configuration property: the prefix (e.g. HIVE_SITE_CONF_, CORE_CONF_) selects the target config file, single underscores become dots, and triple underscores become dashes. For example, CORE_CONF_fs_defaultFS sets fs.defaultFS in core-site.xml.
HIVE_SITE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore
HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName=org.postgresql.Driver
HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
HIVE_SITE_CONF_hive_metastore_uris=thrift://hive-metastore:9083
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
# employee_table.hql:
This HQL script will be used later in the tutorial to demonstrate creating a sample Hive database and table. As soon as Docker spins up the hive-server container, this file is automatically mounted inside it, ready for use.
create database if not exists testdb;
use testdb;
create external table if not exists employee (
  eid int,
  ename string,
  age int,
  jobtype string,
  storeid int,
  storelocation string,
  salary bigint,
  yrsofexp int
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location 'hdfs://namenode:8020/user/hive/warehouse/testdb.db/employee';
# employee.csv:
This CSV file contains some sample records that will be loaded into the employee table. Again, this file is automatically mounted by Docker inside the hive-server container at startup.
1,Rudolf Bardin,30,cashier,100,New York,40000,5
2,Rob Trask,22,driver,100,New York,50000,4
3,Madie Nakamura,20,janitor,100,New York,30000,4
4,Alesha Huntley,40,cashier,101,Los Angeles,40000,10
5,Iva Moose,50,cashier,102,Phoenix,50000,20
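As a quick sanity check on the sample data, you can inspect it locally with plain awk before the cluster is even running. The salary is field 7, so counting the rows and averaging that column should give 5 rows and 42000:

```shell
# Write the five sample rows to a temp file (same data as employee.csv)
cat > /tmp/employee.csv <<'EOF'
1,Rudolf Bardin,30,cashier,100,New York,40000,5
2,Rob Trask,22,driver,100,New York,50000,4
3,Madie Nakamura,20,janitor,100,New York,30000,4
4,Alesha Huntley,40,cashier,101,Los Angeles,40000,10
5,Iva Moose,50,cashier,102,Phoenix,50000,20
EOF

# Count the rows and average the salary column (field 7)
awk -F',' '{ n++; total += $7 } END { print n " rows, avg salary " total/n }' /tmp/employee.csv
# -> 5 rows, avg salary 42000
```

Knowing these numbers up front makes it easy to verify later that Hive sees exactly the same data.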
# 3. Create & Start all services:
Navigate into the Hive directory on your local machine and run a single docker-compose command to create and start all the services required by our Hive cluster.
docker-compose up
# 4. Verify container status:
Allow Docker a few minutes to spin up all the containers. I use a couple of ways to confirm that the required services are up and running. First, look for the startup messages in the logs once you run docker-compose up.
[Snapshot of docker-compose up logs]
Second, run the command docker stats in another terminal. As Docker brings up the containers, their CPU and memory utilization will stabilize after a few minutes in the absence of any other activity on the cluster.
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
8cfdc9ea3002 hive-server 0.39% 184.7MiB / 1.941GiB 9.30% 25.5kB / 30.5kB 1.63MB / 659kB 27
938e98165aa7 hive-metastore 0.39% 191.7MiB / 1.941GiB 9.65% 182kB / 213kB 2.75MB / 164kB 22
1e8ddecb96de hive-metastore-postgresql 0.46% 26.36MiB / 1.941GiB 1.33% 207kB / 176kB 2.13MB / 2.14MB 10
2d4c1ed7ed9d datanode 0.13% 112.9MiB / 1.941GiB 5.68% 67.8kB / 324kB 623kB / 156kB 44
c4b2af14a4cb namenode 0.13% 121MiB / 1.941GiB 6.09% 350kB / 82.6kB 1.56MB / 1.23MB
# 5. Demo:
It's time for a quick demo! Log onto the hive-server and create a sample database and Hive table by executing the command below in a new terminal.
docker exec -it hive-server /bin/bash
Navigate to the employee directory on the hive-server container.
root@dc86b2b9e566:/opt# ls
hadoop-2.7.4 hive
root@dc86b2b9e566:/opt# cd ..
root@dc86b2b9e566:/# ls
bin boot dev employee entrypoint.sh etc hadoop-data home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
root@dc86b2b9e566:/# cd employee/
Execute employee_table.hql to create a new external Hive table named employee under a new database testdb.
root@dc86b2b9e566:/employee# hive -f employee_table.hql
Now, let's add some data to this Hive table: simply push the employee.csv present in the employee directory on the hive-server into HDFS.
root@dc86b2b9e566:/employee# hadoop fs -put employee.csv hdfs://namenode:8020/user/hive/warehouse/testdb.db/employee
# 6. Validate the setup:
On the hive-server, launch hive to verify the contents of the employee table.
root@df1ac619536c:/employee# hive
hive> show databases;
OK
default
testdb
Time taken: 2.363 seconds, Fetched: 2 row(s)
hive> use testdb;
OK
Time taken: 0.085 seconds
hive> select * from employee;
OK
1 Rudolf Bardin 30 cashier 100 New York 40000 5
2 Rob Trask 22 driver 100 New York 50000 4
3 Madie Nakamura 20 janitor 100 New York 30000 4
4 Alesha Huntley 40 cashier 101 Los Angeles 40000 10
5 Iva Moose 50 cashier 102 Phoenix 50000 20
Time taken: 4.237 seconds, Fetched: 5 row(s)
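Beyond select *, the table supports ordinary HiveQL analytics. As a sample query (not part of the tutorial files), you could compute the head count and average salary per store location:

```sql
-- run inside the hive> shell after `use testdb;`
select storelocation,
       count(*)    as employees,
       avg(salary) as avg_salary
from employee
group by storelocation;
```

On the five sample rows this should report an average of 40000.0 for New York and Los Angeles and 50000.0 for Phoenix.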
# 7. Validate the container state between subsequent docker runs:
All good so far: you have successfully created an Apache Hive cluster on Docker. It is critical to verify that the Hive tables are maintained between subsequent Docker runs and that we do not lose our progress when the containers are stopped. So, let's restart the containers and check whether the Hive data persists. In a new terminal, execute the command below to stop all running containers.
docker-compose down
Stopping hive-server ... done
Stopping hive-metastore ... done
Stopping hive-metastore-postgresql ... done
Stopping datanode ... done
Stopping namenode ... done
Removing hive-server ... done
Removing hive-metastore ... done
Removing hive-metastore-postgresql ... done
Removing datanode ... done
Removing namenode ... done
Removing network hive_default
Once all containers are stopped, run docker-compose up one more time.
docker-compose up
Wait for a few minutes as suggested in step 4 for the containers to come back online, then log onto the hive-server and run the select query.
docker exec -it hive-server /bin/bash
root@df1ac619536c:/opt# hive
hive> select * from testdb.employee;
OK
1 Rudolf Bardin 30 cashier 100 New York 40000 5
2 Rob Trask 22 driver 100 New York 50000 4
3 Madie Nakamura 20 janitor 100 New York 30000 4
4 Alesha Huntley 40 cashier 101 Los Angeles 40000 10
5 Iva Moose 50 cashier 102 Phoenix 50000 20
Time taken: 4.237 seconds, Fetched: 5 row(s)
Hurray! We still have our data! Congratulations on setting up your Hive server on Docker. Keep practicing!