What is security? We could work on data security forever; this part of IT infrastructure is practically infinite, but companies are usually interested in a few practical aspects of it.
What is data security? It consists of three major parts: authentication, authorization, and auditing. These three properties give us a clue as to who did something, where they did it, and which privileges were used. Setting up Hadoop security is still a non-trivial task, so let's try to implement basic authentication and authorization.
There are several approaches to securing your clusters. We recommend isolating clusters with a VPN, since it's one of the easiest ways to protect data and get basic authentication with audit logging. You can always do a non-trivial installation, but the best approach is to first estimate realistically how much time you would spend setting up Kerberos. Usually you don't have that time; it's all devoted to reaching business goals. There are plenty of examples on the Internet of a young startup's security failing and exposing private data to hackers. If you need really strong security, you definitely need to read a dedicated book devoted to Hadoop security with Kerberos enabled.
There is a joke about Kerberos and Hadoop. You can join Hadoop user groups or just search them using the keywords Kerberos, enable, help, stopped working. Sometimes a person asks a group, "How can I set up Kerberos for my cluster?" and gets some general considerations and links to documents. Then, a few days later, the same person appears in the group with another question: "How can I disable Kerberos on my Hadoop? Everything has stopped working!"
We are going to touch on the following topics briefly:
The default Hadoop security is designed to stop good people from accidentally doing the wrong things, so security is the wrong term for the default functionality. HDFS acts as the base system for storing data, with redundancy and even distribution features, which leads to an even cluster load and resilience to node failures. Consider HDFS as a traditional file system that spans many nodes. You can configure HDFS to have no SPOF (see the NameNode HA configuration). Be aware that you can totally lose some of your nodes with data, without noticing it, if you don't have any monitoring and alerting. Let's see what Hadoop HDFS suggests you do out of the box. We need to open the terminal on our VM:
Type:
hadoop fs -ls /
And get the output:
[cloudera@quickstart ~]$ hadoop fs -ls /
Found 9 items
drwxr-xr-x   - cloudera supergroup          0 2015-03-16 06:27 /applications
drwxr-xr-x   - hdfs     supergroup          0 2015-03-16 07:26 /backup
some output removed to reduce size of logs.
drwxr-xr-x   - hdfs     supergroup          0 2014-12-18 04:32 /var
You see the listing for the root of HDFS. HDFS is not fully POSIX-compliant, but you can see the traditional user and group ownership for catalogs, along with permission masks. The x permission is used only for listing a catalog; there are no executables on HDFS.
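To see what the x bit does in practice, here is a minimal sketch; the catalog name /tmp/no_x_demo is just an example, and the check only kicks in once permission checks are enabled, which we do later in this section:
hadoop fs -mkdir /tmp/no_x_demo
hadoop fs -chmod 700 /tmp/no_x_demo
# With permission checks enabled, any other user (here we borrow the
# mail system account) gets a "Permission denied" error on listing:
sudo -u mail hadoop fs -ls /tmp/no_x_demo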
[cloudera@quickstart ~]$ whoami
cloudera
Your current user should be cloudera.
[cloudera@quickstart ~]$ groups
cloudera default
Your cloudera user is a member of two groups: cloudera and default.
The following command creates an empty file in your current HDFS home directory. Yes, HDFS also has home directories for users:
[cloudera@quickstart ~]$ hadoop fs -touchz file_on_hdfs
The hadoop fs command points to your HDFS home directory by default:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - cloudera cloudera          0 2015-03-16 06:41 .Trash
-rw-r--r--   1 cloudera cloudera          0 2015-04-05 10:14 file_on_hdfs
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 2 items
drwxr-xr-x   - cloudera cloudera          0 2015-03-16 06:41 /user/cloudera/.Trash
-rw-r--r--   1 cloudera cloudera          0 2015-04-05 10:14 /user/cloudera/file_on_hdfs
The file_on_hdfs file has been created and its size is 0 bytes. The file belongs to the cloudera user because you are logged in as cloudera. Members of the group named cloudera can read this file (-rw-r--r--). Others can also read it (-rw-r--r--). Only the cloudera user can modify the file (-rw-r--r--).
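If you want to tighten the mask, hadoop fs -chmod works much like its POSIX counterpart. A minimal sketch (the mode values here are just examples):
# Remove read access for group and others; the owner keeps read/write:
hadoop fs -chmod 600 file_on_hdfs
hadoop fs -ls file_on_hdfs
# The listing should now show -rw-------.
# Restore the original mask before continuing with the examples below:
hadoop fs -chmod 644 file_on_hdfs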
echo "new line for file" > append.txt
sudo su hdfs hadoop fs –appendToFile append.txt /user/cloudera/file_on_hdfs
cloudera
user. Let's double-check that we did the append operation:[cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/file_on_hdfs new line for file
Why were we able to modify a file owned by another user? The answer is in the namenode configuration. The explanation is easy: namenode is the service responsible for HDFS metadata, such as access permissions. Right now it's up and running, but it doesn't take care of the owners of files and catalogs. Let's fix it. There are two properties responsible for the behavior we need:
dfs.permissions
dfs.permissions.enabled
The first one is old and will be deprecated in the near future; the second one is its newer replacement. Let's set both to true.
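Before editing anything, you can check which value the configuration currently resolves to; hdfs getconf reads the effective client configuration, so this is just a convenience check:
hdfs getconf -confKey dfs.permissions.enabled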
Open /etc/hadoop/conf/hdfs-site.xml. Find and change both properties to true. You can use a visual editor and file browser if you are not familiar with the console vi tool:
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
The two hadoop.proxyuser.root.* properties allow the root user to impersonate other users from any host and any group; we will rely on impersonation later when we configure pass-through authentication in Hunk.
Restart the namenode so that it picks up the new configuration:
[cloudera@quickstart ~]$ sudo /etc/init.d/hadoop-hdfs-namenode stop
stopping namenode
Stopped Hadoop namenode:                                   [  OK  ]
[cloudera@quickstart ~]$ sudo /etc/init.d/hadoop-hdfs-namenode start
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.out
Started Hadoop namenode:                                   [  OK  ]
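If the namenode does not come back up, a quick look at its log usually explains why. Assuming the quickstart VM's default log location (the init script prints the matching .out path above):
sudo tail -n 50 /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.log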
Now let's repeat the append operation as the mail user, who definitely doesn't have any rights to modify the file:
[cloudera@quickstart ~]$ sudo -u mail hadoop fs -appendToFile append.txt /user/cloudera/file_on_hdfs
appendToFile: Permission denied: user=mail, access=WRITE, inode="/user/cloudera/file_on_hdfs":cloudera:cloudera:-rw-r--r--
Great, we see that owner permissions are now enforced: you are not allowed to modify the file as the mail user, since only the cloudera user has write access to it.
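As a quick sanity check, the owner is still allowed to write. Since we are logged in as cloudera, the same append now succeeds (reusing the append.txt file created earlier):
hadoop fs -appendToFile append.txt /user/cloudera/file_on_hdfs
hadoop fs -cat /user/cloudera/file_on_hdfs
# The file should now contain the line twice.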
We've installed Hunk with a default user named admin. Security was totally disabled while we were adding the first virtual index on top of the aggregated Milano city telco data. Now that we've enabled permission checks, the admin user cannot access data stored in HDFS. We are going to map the Hunk user named admin to the OS user named mail and see that it's impossible to access data in HDFS through the mail user, since this user doesn't have the proper access rights.
Let's prepare a test catalog with a file that only its owner can access:
# create a catalog on HDFS
hadoop fs -mkdir -p /staging/test_access_using_mail
# create a file locally
echo "new line for file" > file_with_data.txt
# copy the locally created file to HDFS
hadoop fs -copyFromLocal file_with_data.txt /staging/test_access_using_mail
# change permissions to restrict access to anyone except the owner
hadoop fs -chmod -R 700 /staging/test_access_using_mail
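You can verify from the command line that the restriction works before involving Hunk at all; this is just a sanity check:
# cloudera (the owner) can read the file:
hadoop fs -cat /staging/test_access_using_mail/file_with_data.txt
# mail cannot even enter the catalog; expect a "Permission denied" error:
sudo -u mail hadoop fs -cat /staging/test_access_using_mail/file_with_data.txt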
We created a provider previously. Now it's time to enable pass-through authentication there. Hunk will impersonate the mapped OS user while interacting with HDFS. The idea is simple:
- We log in to Hunk as the admin user and impersonate the mail user to interact with HDFS.
- We create a mapping between the Hunk admin user and the OS user named mail: mail is used to impersonate the admin user, so Hunk's requests to HDFS are executed as mail.
- The mail user can't access the data because it lacks the necessary permissions.
The following screenshot shows the settings for the new virtual index that we can't access via the mail user:
Now let's use the cloudera user for impersonation and get access to the data. The test catalog on HDFS belongs to cloudera and its access rights are set to 700, which means only the cloudera user can access the file. Open the Pass Through Authentication form and replace mail with cloudera:
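Once the mapping points at cloudera, searches against the virtual index should work again, because the impersonated user owns the data. Roughly the command-line equivalent of what Hunk now does on our behalf is:
# explicit impersonation; redundant if you are already logged in as cloudera
sudo -u cloudera hadoop fs -cat /staging/test_access_using_mail/file_with_data.txt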