What is security? We could work on data security forever; this part of IT infrastructure is practically infinite, but companies are usually interested in a few practical aspects of it.
What is data security? It consists of three major parts: authentication, authorization, and auditing. These three properties give us a clue as to who did something, where they did it, and which privileges were used. Setting up Hadoop security is still a non-trivial task, so let's try to implement basic authentication and authorization.
There are several approaches to securing your clusters. We recommend isolating clusters with a VPN, since it's one of the easiest ways to protect data and get basic authentication with audit logging. You can always do a non-trivial installation, but the best approach is to first estimate realistically how much time you would spend setting up Kerberos. Usually you don't have that time; it's all devoted to reaching business goals. There are plenty of examples on the Internet of a young startup's security failing and exposing private data to hackers. If you need really strong security, you definitely need to read a dedicated book devoted to Hadoop security with Kerberos enabled.
There is a joke about Kerberos and Hadoop. You can join Hadoop user groups or just search them using the keywords Kerberos, enable, help, stopped working. Sometimes a person asks a group, "How can I set up Kerberos for my cluster?" and gets some general considerations and links to documents. Then, a few days later, the same person appears in the group with another question: "How can I disable Kerberos on my Hadoop? Everything has stopped working!"
We are going to touch on the following topics briefly:
The default Hadoop security is designed to stop good people from accidentally doing the wrong things, so security is the wrong term for the default functionality. HDFS acts as the base system for storing data, with redundancy and even distribution features, which leads to an even cluster load and resilience to node failures. Consider HDFS as a traditional file system that spans many nodes. You can configure HDFS to have no SPOF (see the NameNode HA configuration). Be aware that you can totally lose some of your nodes with data, without noticing it, if you don't have any monitoring and alerting. Let's see what Hadoop HDFS suggests you do out of the box. We need to open the terminal on our VM:
Type:
hadoop fs -ls /
And get the output:
[cloudera@quickstart ~]$ hadoop fs -ls /
Found 9 items
drwxr-xr-x   - cloudera supergroup          0 2015-03-16 06:27 /applications
drwxr-xr-x   - hdfs     supergroup          0 2015-03-16 07:26 /backup
some output removed to reduce size of logs.
drwxr-xr-x   - hdfs     supergroup          0 2014-12-18 04:32 /var
You see the listing for the root of HDFS. HDFS is not fully POSIX-compliant, but you can see the traditional user and group ownership for catalogs, along with permission masks. The x permission is used only for listing a catalog; there are no executables on HDFS.
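To see what the x bit does in practice, here is a minimal sketch; the catalog name /tmp/no_x_demo is just an example, and the check only kicks in once permission checks are enabled, which we do later in this section:
hadoop fs -mkdir /tmp/no_x_demo
hadoop fs -chmod 700 /tmp/no_x_demo
# With permission checks enabled, any other user (here we borrow the
# mail system account) gets a "Permission denied" error on listing:
sudo -u mail hadoop fs -ls /tmp/no_x_demo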
[cloudera@quickstart ~]$ whoami
cloudera
Your current user should be cloudera.
[cloudera@quickstart ~]$ groups
cloudera default
Your cloudera user is a member of two groups: cloudera and default.
The following command creates an empty file in your current HDFS home directory. Yes, HDFS also has home directories for users:
[cloudera@quickstart ~]$ hadoop fs -touchz file_on_hdfs
The hadoop fs command points to your HDFS home directory by default:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - cloudera cloudera          0 2015-03-16 06:41 .Trash
-rw-r--r--   1 cloudera cloudera          0 2015-04-05 10:14 file_on_hdfs
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 2 items
drwxr-xr-x   - cloudera cloudera          0 2015-03-16 06:41 /user/cloudera/.Trash
-rw-r--r--   1 cloudera cloudera          0 2015-04-05 10:14 /user/cloudera/file_on_hdfs
The file_on_hdfs file has been created and its size is 0 bytes. The file belongs to the cloudera user because you are logged in as cloudera. Members of the group named cloudera can read this file (-rw-r--r--). Others can also read it (-rw-r--r--). Only the cloudera user can modify the file (-rw-r--r--).
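If you want to tighten the mask, hadoop fs -chmod works much like its POSIX counterpart. A minimal sketch (the mode values here are just examples):
# Remove read access for group and others; the owner keeps read/write:
hadoop fs -chmod 600 file_on_hdfs
hadoop fs -ls file_on_hdfs
# The listing should now show -rw-------.
# Restore the original mask before continuing with the examples below:
hadoop fs -chmod 644 file_on_hdfs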
echo "new line for file" > append.txt
sudo su hdfs hadoop fs –appendToFile append.txt /user/cloudera/file_on_hdfs
cloudera
user. Let's double-check that we did the append operation:[cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/file_on_hdfs new line for file
Why were we able to modify a file owned by another user? The answer is in the namenode configuration. The explanation is easy: namenode is the service responsible for HDFS metadata, such as access permissions. Right now it's up and running, but it doesn't take care of the owners of files and catalogs. Let's fix it. There are two properties responsible for the behavior we need:
dfs.permissions
dfs.permissions.enabled
The first one is old and will be deprecated in the near future; the second one is its newer replacement. Let's set both to true.
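Before editing anything, you can check which value the configuration currently resolves to; hdfs getconf reads the effective client configuration, so this is just a convenience check:
hdfs getconf -confKey dfs.permissions.enabled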
Open /etc/hadoop/conf/hdfs-site.xml. Find and change both properties to true. You can use a visual editor and file browser if you are not familiar with the console vi tool:
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
The two hadoop.proxyuser.root.* properties allow the root user to impersonate other users from any host and any group; we will rely on impersonation later when we configure pass-through authentication in Hunk.
Restart the namenode so that it picks up the new configuration:
[cloudera@quickstart ~]$ sudo /etc/init.d/hadoop-hdfs-namenode stop
stopping namenode
Stopped Hadoop namenode:                                   [  OK  ]
[cloudera@quickstart ~]$ sudo /etc/init.d/hadoop-hdfs-namenode start
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.out
Started Hadoop namenode:                                   [  OK  ]
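If the namenode does not come back up, a quick look at its log usually explains why. Assuming the quickstart VM's default log location (the init script prints the matching .out path above):
sudo tail -n 50 /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.log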
Now let's repeat the append operation as the mail user, who definitely doesn't have any rights to modify the file:
[cloudera@quickstart ~]$ sudo -u mail hadoop fs -appendToFile append.txt /user/cloudera/file_on_hdfs
appendToFile: Permission denied: user=mail, access=WRITE, inode="/user/cloudera/file_on_hdfs":cloudera:cloudera:-rw-r--r--
Great, we see that owner permissions are now enforced: you are not allowed to modify the file as the mail user, since only the cloudera user has write access to it.
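As a quick sanity check, the owner is still allowed to write. Since we are logged in as cloudera, the same append now succeeds (reusing the append.txt file created earlier):
hadoop fs -appendToFile append.txt /user/cloudera/file_on_hdfs
hadoop fs -cat /user/cloudera/file_on_hdfs
# The file should now contain the line twice.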
We've installed Hunk with a default user named admin. Security was totally disabled while we were adding the first virtual index on top of the aggregated Milano city telco data. Now that we've enabled permission checks, the admin user cannot access data stored in HDFS. We are going to map the Hunk user named admin to the OS user named mail and see that it's impossible to access data in HDFS through the mail user, since this user doesn't have the proper access rights.
Let's prepare a test catalog with a file that only its owner can access:
# create a catalog on HDFS
hadoop fs -mkdir -p /staging/test_access_using_mail
# create a file locally
echo "new line for file" > file_with_data.txt
# copy the locally created file to HDFS
hadoop fs -copyFromLocal file_with_data.txt /staging/test_access_using_mail
# change permissions to restrict access to anyone except the owner
hadoop fs -chmod -R 700 /staging/test_access_using_mail
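You can verify from the command line that the restriction works before involving Hunk at all; this is just a sanity check:
# cloudera (the owner) can read the file:
hadoop fs -cat /staging/test_access_using_mail/file_with_data.txt
# mail cannot even enter the catalog; expect a "Permission denied" error:
sudo -u mail hadoop fs -cat /staging/test_access_using_mail/file_with_data.txt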
We created a provider previously. Now it's time to enable pass-through authentication there. Hunk will impersonate the mapped OS user while interacting with HDFS. The idea is simple:
- We log in to Hunk as the admin user and impersonate the mail user to interact with HDFS.
- We create a mapping between the Hunk admin user and the OS user named mail: mail is used to impersonate the admin user, so Hunk's requests to HDFS are executed as mail.
- The mail user can't access the data because it lacks the necessary permissions.
The following screenshot shows the settings for the new virtual index that we can't access via the mail user:
Now let's use the cloudera user for impersonation and get access to the data. The test catalog on HDFS belongs to cloudera and its access rights are set to 700, which means only the cloudera user can access the file. Open the Pass Through Authentication form and replace mail with cloudera:
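Once the mapping points at cloudera, searches against the virtual index should work again, because the impersonated user owns the data. Roughly the command-line equivalent of what Hunk now does on our behalf is:
# explicit impersonation; redundant if you are already logged in as cloudera
sudo -u cloudera hadoop fs -cat /staging/test_access_using_mail/file_with_data.txt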