As a big data enthusiast, if you are interested in learning Apache Hive, this blog will help you set it up on your local machine in a few easy steps. This setup is a good starting point for beginners because it simulates a production environment. This tutorial uses Docker containers to spin up Apache Hive. Before we jump right in, here is a quick overview of the critical components in this cluster.
# Apache Hive:
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics of large datasets residing in distributed storage using SQL.
# Docker:
Docker is an open-source technology to package an application and all its dependencies into a container.
# NameNode:
The NameNode is at the core of the Hadoop cluster. It keeps the directory tree of all files in the file system, and tracks where the file data is actually kept in the cluster.
# DataNode:
The DataNode, on the other hand, stores the actual file data. Files are replicated across multiple DataNodes in a cluster for reliability. In Hive terms, the data stored in Hive tables is spread across the DataNodes within the cluster, while the NameNode keeps track of which blocks of data live on which DataNodes. We are using a single DataNode in this tutorial for the sake of simplicity.
# Hive Metastore:
Hive uses a relational database to store the metadata (e.g. schema and location) of all its tables. The default database is Derby, but we will be using PostgreSQL in this tutorial. The key benefit of keeping this metadata in a relational database rather than on HDFS is the low latency of metadata lookups, which improves query performance.
# Volumes:
Volumes are the preferred mechanism for persisting data generated by and used by Docker containers, and they allow us to preserve container state between subsequent docker runs. If you don't explicitly create it, a volume is created the first time it is mounted into a container, and it continues to exist after that container stops or is removed. In this setup, we mount volumes into the NameNode and DataNode containers to persist the HDFS file system, and a volume on the PostgreSQL container to persist the metadata of previously created Hive tables.
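Note that the Compose file in the next step uses bind mounts (host paths like ./hdfs/namenode), which map a directory on your machine into the container. A Docker-managed named volume is the alternative; a minimal, hypothetical Compose fragment (the service and volume names here are illustrative, not part of this tutorial) looks like this:

```yaml
services:
  db:
    image: postgres
    volumes:
      # named volume: Docker manages where the data actually lives
      - db-data:/var/lib/postgresql/data
volumes:
  db-data: {}   # created automatically on first use; survives container removal
```

Either mechanism persists data across container restarts; bind mounts simply keep the files visible in your project directory, which is convenient for a local setup like this one.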
# Prerequisites:
The only prerequisite is Docker. I'm running it on a MacBook Pro with 8 GB RAM. Installing Docker is pretty straightforward: follow the instructions at https://docs.docker.com/docker-for-mac/install/ if you do not already have it on your machine.
# 1. Directory Structure:
Create the following directory structure on your local machine:
mkdir Hive
cd Hive
touch docker-compose.yml
touch hadoop-hive.env
mkdir employee
cd employee
touch employee_table.hql
touch employee.csv
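The same layout can also be created non-interactively. The sketch below recreates it with `mkdir -p` and lists the result, so you can confirm the structure matches before moving on:

```shell
# Recreate the layout in one go (equivalent to the commands above)
mkdir -p Hive/employee
touch Hive/docker-compose.yml Hive/hadoop-hive.env
touch Hive/employee/employee_table.hql Hive/employee/employee.csv

# List everything to confirm the structure
find Hive | sort
```

You should see the two config files directly under Hive and the two employee files under Hive/employee.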
# 2. Edit files:
Open each file in your favorite editor and paste the code snippets below into them.
# docker-compose.yml:
Compose is a Docker tool for defining and running multi-container Docker applications. We are using the below YAML file to configure the services required by our Hive cluster. The biggest advantage of using Compose is that it creates and starts all the services using a single command.
version: '3'
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
    container_name: namenode
    volumes:
      - ./hdfs/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=hive
    env_file:
      - ./hadoop-hive.env
    ports:
      - "50070:50070"
  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
    container_name: datanode
    volumes:
      - ./hdfs/datanode:/hadoop/dfs/data
    env_file:
      - ./hadoop-hive.env
    environment:
      SERVICE_PRECONDITION: "namenode:50070"
    depends_on:
      - namenode
    ports:
      - "50075:50075"
  hive-server:
    image: bde2020/hive:2.3.2-postgresql-metastore
    container_name: hive-server
    volumes:
      - ./employee:/employee
    env_file:
      - ./hadoop-hive.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    depends_on:
      - hive-metastore
    ports:
      - "10000:10000"
  hive-metastore:
    image: bde2020/hive:2.3.2-postgresql-metastore
    container_name: hive-metastore
    env_file:
      - ./hadoop-hive.env
    command: /opt/hive/bin/hive --service metastore
    environment:
      SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432"
    depends_on:
      - hive-metastore-postgresql
    ports:
      - "9083:9083"
  hive-metastore-postgresql:
    image: bde2020/hive-metastore-postgresql:2.3.0
    container_name: hive-metastore-postgresql
    volumes:
      - ./metastore-postgresql/postgresql/data:/var/lib/postgresql/data
    depends_on:
      - datanode
# hadoop-hive.env:
The .env file sets the environment variables used by the containers. The bde2020 images translate each variable into the corresponding Hadoop/Hive configuration property: the prefix (e.g. HIVE_SITE_CONF_, CORE_CONF_) selects the target config file, single underscores become dots, and triple underscores become dashes. For example, CORE_CONF_fs_defaultFS sets fs.defaultFS in core-site.xml.
HIVE_SITE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore
HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName=org.postgresql.Driver
HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
HIVE_SITE_CONF_hive_metastore_uris=thrift://hive-metastore:9083
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
# employee_table.hql:
This HQL script will be used later in the tutorial to demonstrate creating a sample Hive database and table. As soon as Docker spins up the hive-server container, this file is automatically mounted inside it, ready for use.
create database if not exists testdb;
use testdb;
create external table if not exists employee (
  eid int,
  ename string,
  age int,
  jobtype string,
  storeid int,
  storelocation string,
  salary bigint,
  yrsofexp int
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location 'hdfs://namenode:8020/user/hive/warehouse/testdb.db/employee';
# employee.csv:
This CSV file contains some sample records that will be loaded into the employee table. Again, this file is automatically mounted by Docker inside the hive-server container at startup.
1,Rudolf Bardin,30,cashier,100,New York,40000,5
2,Rob Trask,22,driver,100,New York,50000,4
3,Madie Nakamura,20,janitor,100,New York,30000,4
4,Alesha Huntley,40,cashier,101,Los Angeles,40000,10
5,Iva Moose,50,cashier,102,Phoenix,50000,20
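As a quick sanity check on the sample data, you can inspect it locally with plain awk before the cluster is even running. The salary is field 7, so counting the rows and averaging that column should give 5 rows and 42000:

```shell
# Write the five sample rows to a temp file (same data as employee.csv)
cat > /tmp/employee.csv <<'EOF'
1,Rudolf Bardin,30,cashier,100,New York,40000,5
2,Rob Trask,22,driver,100,New York,50000,4
3,Madie Nakamura,20,janitor,100,New York,30000,4
4,Alesha Huntley,40,cashier,101,Los Angeles,40000,10
5,Iva Moose,50,cashier,102,Phoenix,50000,20
EOF

# Count the rows and average the salary column (field 7)
awk -F',' '{ n++; total += $7 } END { print n " rows, avg salary " total/n }' /tmp/employee.csv
# -> 5 rows, avg salary 42000
```

Knowing these numbers up front makes it easy to verify later that Hive sees exactly the same data.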
# 3. Create & Start all services:
Navigate into the Hive directory on your local machine and run a single docker-compose command to create and start all the services required by our Hive cluster.
docker-compose up
# 4. Verify container status:
Allow Docker a few minutes to spin up all the containers. I use a couple of ways to confirm that the required services are up and running. First, look for the startup messages in the logs once you run docker-compose up.
[Snapshot of docker-compose up logs]
Second, run the command docker stats in another terminal. As Docker brings up the containers, their CPU and memory utilization will stabilize after a few minutes in the absence of any other activity on the cluster.
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
8cfdc9ea3002 hive-server 0.39% 184.7MiB / 1.941GiB 9.30% 25.5kB / 30.5kB 1.63MB / 659kB 27
938e98165aa7 hive-metastore 0.39% 191.7MiB / 1.941GiB 9.65% 182kB / 213kB 2.75MB / 164kB 22
1e8ddecb96de hive-metastore-postgresql 0.46% 26.36MiB / 1.941GiB 1.33% 207kB / 176kB 2.13MB / 2.14MB 10
2d4c1ed7ed9d datanode 0.13% 112.9MiB / 1.941GiB 5.68% 67.8kB / 324kB 623kB / 156kB 44
c4b2af14a4cb namenode 0.13% 121MiB / 1.941GiB 6.09% 350kB / 82.6kB 1.56MB / 1.23MB
# 5. Demo:
It's time for a quick demo! Log onto the hive-server and create a sample database and Hive table by executing the command below in a new terminal.
docker exec -it hive-server /bin/bash
Navigate to the employee directory on the hive-server container.
root@dc86b2b9e566:/opt# ls
hadoop-2.7.4 hive
root@dc86b2b9e566:/opt# cd ..
root@dc86b2b9e566:/# ls
bin boot dev employee entrypoint.sh etc hadoop-data home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
root@dc86b2b9e566:/# cd employee/
Execute employee_table.hql to create a new external Hive table named employee under a new database testdb.
root@dc86b2b9e566:/employee# hive -f employee_table.hql
Now, let's add some data to this Hive table: simply push the employee.csv present in the employee directory on the hive-server into HDFS.
root@dc86b2b9e566:/employee# hadoop fs -put employee.csv hdfs://namenode:8020/user/hive/warehouse/testdb.db/employee
# 6. Validate the setup:
On the hive-server, launch hive to verify the contents of the employee table.
root@df1ac619536c:/employee# hive
hive> show databases;
OK
default
testdb
Time taken: 2.363 seconds, Fetched: 2 row(s)
hive> use testdb;
OK
Time taken: 0.085 seconds
hive> select * from employee;
OK
1 Rudolf Bardin 30 cashier 100 New York 40000 5
2 Rob Trask 22 driver 100 New York 50000 4
3 Madie Nakamura 20 janitor 100 New York 30000 4
4 Alesha Huntley 40 cashier 101 Los Angeles 40000 10
5 Iva Moose 50 cashier 102 Phoenix 50000 20
Time taken: 4.237 seconds, Fetched: 5 row(s)
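Beyond select *, the table supports ordinary HiveQL analytics. As a sample query (not part of the tutorial files), you could compute the head count and average salary per store location:

```sql
-- run inside the hive> shell after `use testdb;`
select storelocation,
       count(*)    as employees,
       avg(salary) as avg_salary
from employee
group by storelocation;
```

On the five sample rows this should report an average of 40000.0 for New York and Los Angeles and 50000.0 for Phoenix.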
# 7. Validate the container state between subsequent docker runs:
All good so far: you have successfully created an Apache Hive cluster on Docker. It is critical to verify that the Hive tables are maintained between subsequent Docker runs and that we do not lose our progress when the containers are stopped. So, let's restart the containers and check whether the Hive data persists. In a new terminal, execute the command below to stop all running containers.
docker-compose down
Stopping hive-server ... done
Stopping hive-metastore ... done
Stopping hive-metastore-postgresql ... done
Stopping datanode ... done
Stopping namenode ... done
Removing hive-server ... done
Removing hive-metastore ... done
Removing hive-metastore-postgresql ... done
Removing datanode ... done
Removing namenode ... done
Removing network hive_default
Once all containers are stopped, run docker-compose up one more time.
docker-compose up
Wait for a few minutes as suggested in step 4 for the containers to come back online, then log onto the hive-server and run the select query.
docker exec -it hive-server /bin/bash
root@df1ac619536c:/opt# hive
hive> select * from testdb.employee;
OK
1 Rudolf Bardin 30 cashier 100 New York 40000 5
2 Rob Trask 22 driver 100 New York 50000 4
3 Madie Nakamura 20 janitor 100 New York 30000 4
4 Alesha Huntley 40 cashier 101 Los Angeles 40000 10
5 Iva Moose 50 cashier 102 Phoenix 50000 20
Time taken: 4.237 seconds, Fetched: 5 row(s)
Hurray! We still have our data! Congratulations on setting up your Hive server on Docker. Keep practicing!