The video we posted on June 23rd, 2015 (link here) introduced our efforts to open up our big data platform to other teams within AppNexus. Data Platform as a Service (DPaaS) is our internal offering that allows other teams at AppNexus to run analytics on our wealth of data. Our users want to be confident that they’re using the platform safely and appropriately. They only want to see the jobs and resources that are relevant to them, and they don’t want to impact other users or mainline production processes. As operators of the platform, we want to ensure the safety and stability of the system as a whole and reasonable isolation between our users.
This clearly points to requiring an AAA solution - authentication, authorization and accounting.
- Authentication - identifying an entity acting upon your system
- Authorization - allowing/disallowing that entity to perform actions
- Accounting - keeping track of which entities have performed which actions
Practically speaking, we will be tackling these one at a time in the natural order listed above. Each item has its own complexity and intricacies, and this post will discuss those around the first A - authentication.
Which systems need authentication?
The short answer is “all of them!” Actually - that’s the only answer. Security is definitely one of the things in this world where “the chain is only as strong as its weakest link.” Luckily we had already included authentication in many of the systems we’d built in DPaaS - requiring entities (both humans and systems) to log in with credentials. In order to strengthen the chain further, we wanted to have authentication for our Hadoop infrastructure as well.
Non-authenticated Hadoop means that Hadoop simply trusts the self-identification of any entity making a call into the cluster. HDFS API calls, MapReduce jobs, Hive queries, etc. will simply take the username as trusted input and will run the operation as that user. There’s no check whatsoever for that user’s credentials or real identity.
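To make the risk concrete, here is a minimal sketch of how an identity gets picked when nothing verifies it. It is modeled on the behavior of Hadoop's "simple" authentication, where a client can name itself through the HADOOP_USER_NAME environment variable and otherwise falls back to the local OS login. The class and method names are made up for illustration, and this is not Hadoop's actual code.

```java
import java.util.Map;

final class SimpleAuthSketch {
    /** Environment variable a simple-auth Hadoop client honors. */
    static final String USER_ENV = "HADOOP_USER_NAME";

    /**
     * Picks the effective user the way an unauthenticated client would:
     * whatever the caller claims wins, otherwise the local OS login is used.
     * Nothing here proves that the caller really is that user.
     */
    static String effectiveUser(Map<String, String> env, String osUser) {
        String claimed = env.get(USER_ENV);
        return (claimed != null && !claimed.isEmpty()) ? claimed : osUser;
    }

    public static void main(String[] args) {
        System.out.println(effectiveUser(System.getenv(), System.getProperty("user.name")));
    }
}
```

In other words, anyone who can reach the cluster and set one environment variable can act as any user, including the HDFS superuser.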
We explored two ways of locking down our Hadoop infrastructure.
The first option was to skirt the issue a bit by simply not letting users have logical access to the Hadoop cluster at all. We would gate all access through controlled DPaaS-specific API endpoints and that’s it. Only operators and specific systems would be able to access Hadoop directly. We’d add the proper authentication at the API level and we’d be done.
This solution was great on paper, but very difficult to implement in practice - both technically and logistically. On the technical side it would have been an infrastructural change. Our clusters share subnet space with other systems and would have had to be peeled off - a herculean effort on its own. We’d also have needed to manage a whole other network along with bastion hosts and poke holes for whitelisted services. Ugh! Logistically, executing the project would have required a lot of cross-team dependencies which are always challenging to manage and schedule. Additionally, some of our existing users are accustomed to having logical access to Hadoop, and we didn’t want to shut them out.
Kerberos authentication in Hadoop
The second option was to bite the bullet and turn on Hadoop’s support for Kerberos authentication. This would close our security loopholes and give us the control we need over access to Hadoop. We already use Kerberos in-house for some other systems, so the core infrastructure was already in place. Implementation and rollout would mostly be handled by one team, so the logistics were manageable. And looking at the documentation, it didn’t seem like it would be too hard…
Turning on Kerberos in Hadoop had two major pieces - each with its own major challenges. The first was enabling authentication within Hadoop - ensuring HDFS, YARN, and Hive would all interact properly in a kerberized environment. The second piece was ensuring that all our systems and processes that interact with Hadoop could authenticate and work seamlessly with the locked-down cluster.
Locking down Hadoop looked like it was just about following the recipe published on the Cloudera web site (our Hadoop distribution provider). The Cloudera docs helped a lot and gave us a jump start. During implementation we ran into many challenges which required code deep-diving, debugging and trial & error.
The most complicated of these was adapting to a quirk in our network/host infrastructure - our hostnames don’t match their fully qualified domain names in DNS. This causes consternation for Hadoop. Without going into excruciating detail, Hadoop makes some assumptions about how your Kerberos principals are named and configured, and about the validation it performs when connecting to any given server in the cluster. Because AppNexus’ infrastructure doesn’t conform to those assumptions, it took us a while to find a workaround (other than completely revamping how our hostnames are set up). The trick was an obscure feature that allows a glob pattern to be used for server principals (see the Hadoop Jira issue or the code itself). Every Kerberos principal configuration value also allows a .pattern version of it.
After discovering that, it took a lot of trial and error to find all of the configuration values which needed an additional .pattern version.
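To show why a pattern helps when hostnames and DNS names disagree, here is a small self-contained sketch of the matching idea. It assumes a property pair along the lines of dfs.namenode.kerberos.principal and dfs.namenode.kerberos.principal.pattern, with made-up hostnames and realm. The glob translation below only understands * and ?, so treat it as an illustration of the concept rather than Hadoop's own matcher.

```java
import java.util.regex.Pattern;

final class PrincipalPatternSketch {
    /** Translates a glob into a regex; only '*' and '?' are treated as wildcards. */
    static Pattern globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /**
     * Decides whether the principal a server presents is acceptable.
     * With no pattern, only the exact expected principal passes.
     * With a pattern, any principal matching the glob passes.
     */
    static boolean accepts(String expected, String pattern, String presented) {
        if (pattern == null) {
            return expected.equals(presented);
        }
        return globToRegex(pattern).matcher(presented).matches();
    }
}
```

With an expected principal of hdfs/nn01@EXAMPLE.COM and a server that presents hdfs/nn01.prod.example.com@EXAMPLE.COM, the exact comparison fails, while a pattern of hdfs/*@EXAMPLE.COM accepts it and still rejects principals for other services.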
The next big challenge was Hive. We’d been happily running our production queries using the hive command line tool in embedded mode for years. Unfortunately, in a kerberized Hadoop environment, this was no longer going to fly for us. We had to migrate to using the beeline client connecting to hive-server2. We also enabled authentication on Hive metastore server, which caused some fun difficulties on the client side detailed below.
AppNexus client systems
Updating our data loading system was not terribly complex. Our data loading code is written in Java, so we were able to easily leverage the Hadoop libraries to make properly authenticated HDFS calls into our cluster. We created a Kerberos principal for a data loader user and gave it Hadoop impersonation privileges, because our data loading system loads data which can potentially be owned by several different users, and we wanted to keep this flexibility going forward. The following code creates a properly authenticated connection to HDFS with impersonation:
More information on secure impersonation in Hadoop can be found here.
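As a rough illustration of what "impersonation privileges" mean, the sketch below models the check a cluster makes before letting one authenticated user act as another. It assumes the hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups settings from Hadoop's proxy user mechanism. The user, group and host names are invented, and the logic is a simplified model, not Hadoop's implementation.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ProxyUserSketch {
    /** Splits a comma-separated config value into a set; a missing value yields an empty set. */
    private static Set<String> parse(String value) {
        Set<String> items = new HashSet<>();
        if (value != null) {
            for (String item : value.split(",")) {
                String trimmed = item.trim();
                if (!trimmed.isEmpty()) {
                    items.add(trimmed);
                }
            }
        }
        return items;
    }

    /** True if the allow-list is the wildcard "*" or contains at least one candidate. */
    private static boolean allows(Set<String> allowed, Set<String> candidates) {
        if (allowed.contains("*")) {
            return true;
        }
        for (String candidate : candidates) {
            if (allowed.contains(candidate)) {
                return true;
            }
        }
        return false;
    }

    /**
     * May realUser (already authenticated, e.g. via Kerberos) act on behalf of
     * a user who belongs to targetGroups, when connecting from clientHost?
     */
    static boolean mayImpersonate(Map<String, String> conf, String realUser,
                                  Set<String> targetGroups, String clientHost) {
        String prefix = "hadoop.proxyuser." + realUser;
        Set<String> hosts = parse(conf.get(prefix + ".hosts"));
        Set<String> groups = parse(conf.get(prefix + ".groups"));
        return allows(hosts, Collections.singleton(clientHost)) && allows(groups, targetGroups);
    }
}
```

The point of the design is that only the proxy user needs a keytab, while the cluster configuration limits which users it can act for and from which hosts.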
Our job execution is a mix of custom MapReduce jobs and Hive queries. Kerberizing it required refactoring our job launching software and, luckily for us, we’d had the foresight to make it a single framework which all our jobs use. The framework is written in Java, so we leveraged the same libraries as above. We hoped all we’d have to do was put a big doAs() block around our job launcher.
Here’s a pared-down version of what we ended up doing. Some of the full class names are specified where there might be ambiguity.
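For readers who want to see the general shape of the doAs() idea, here is a minimal sketch that uses only JDK classes. It is not the launcher code referred to above. In a real Hadoop client you would normally go through Hadoop's UserGroupInformation class (for example loginUserFromKeytab and doAs) rather than raw JAAS. The principal name, keytab path and "job-launcher" label are placeholders.

```java
import java.security.PrivilegedAction;
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.Subject;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;
import javax.security.auth.login.LoginContext;
import javax.security.auth.login.LoginException;

final class KeytabLoginSketch {
    /** Builds an in-memory JAAS configuration for a non-interactive keytab login. */
    static Configuration keytabConfig(String principal, String keytabPath) {
        Map<String, String> options = new HashMap<>();
        options.put("useKeyTab", "true");
        options.put("keyTab", keytabPath);
        options.put("principal", principal);
        options.put("storeKey", "true");
        options.put("doNotPrompt", "true");
        final AppConfigurationEntry entry = new AppConfigurationEntry(
                "com.sun.security.auth.module.Krb5LoginModule",
                AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
                options);
        return new Configuration() {
            @Override
            public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
                return new AppConfigurationEntry[] {entry};
            }
        };
    }

    /** Logs in against the KDC using the keytab; needs a reachable KDC and a valid keytab. */
    static Subject login(String principal, String keytabPath) throws LoginException {
        LoginContext context =
                new LoginContext("job-launcher", null, null, keytabConfig(principal, keytabPath));
        context.login();
        return context.getSubject();
    }

    /** Runs the given work with the subject's credentials attached to the calling thread. */
    @SuppressWarnings({"deprecation", "removal"})
    static <T> T runAs(Subject subject, PrivilegedAction<T> work) {
        return Subject.doAs(subject, work);
    }
}
```

A launcher built this way would call login(...) once at startup and then wrap each job submission in runAs(...), so that every call made inside the block carries the Kerberos credentials.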
The above was good enough for our MapReduce jobs, but not quite for Hive. Our system was using the ‘hive’ command line tool in embedded mode for all production Hive queries. This was not going to work at all anymore in a kerberized environment. The ‘hive’ CLI tool has been deprecated for some time and we had to migrate to hive-server2 (as mentioned above). We switched all our jobs to use the ‘beeline’ client instead of the ‘hive’ CLI. This was a pretty straightforward change.
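For reference, a kerberized beeline connection differs from a plain one mainly in the JDBC URL, which carries the service principal of the hive-server2 instance. The sketch below builds such a URL and the matching command line. The host, port, database, realm and script path are placeholders, and the helper names are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

final class HiveServer2UrlSketch {
    /**
     * Builds a JDBC URL for a kerberized hive-server2.
     * serverPrincipal is the principal of the server, not of the calling user;
     * the caller's own identity comes from its Kerberos ticket.
     */
    static String kerberosUrl(String host, int port, String database, String serverPrincipal) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database
                + ";principal=" + serverPrincipal;
    }

    /** Command line that runs a query file through beeline against that URL. */
    static List<String> beelineCommand(String jdbcUrl, String scriptPath) {
        return Arrays.asList("beeline", "-u", jdbcUrl, "-f", scriptPath);
    }
}
```

The returned list can be handed to ProcessBuilder by a job launcher that shells out to beeline, provided the process already holds a valid Kerberos ticket.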
Hive metastore was another story, unfortunately. The metastore client we were using (or the way we were using it) wasn’t communicating properly with a Kerberos-enabled metastore server. We ended up switching from using the org.apache.hadoop.hive.service.ThriftHive.Client client to using the org.apache.hadoop.hive.metastore.HiveMetaStoreClient. ‘HiveMetaStoreClient’ uses ‘ThriftHive.Client’ under the covers, but it gets configured directly from ‘hive-site.xml’ (as long as it’s on the classpath - a hard-learned lesson). We came to this conclusion by inspecting the Hive Metastore code itself and seeing which client code it uses internally, and used the same library.
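Since a missing ‘hive-site.xml’ tends to fail quietly (the client just falls back to defaults), a cheap guard is to check the classpath at startup. The following is a minimal sketch of such a check using only the JDK. The class and method names are invented for this example.

```java
import java.net.URL;
import java.util.Optional;

final class ClasspathCheckSketch {
    /** Looks up a resource on the classpath, returning empty if it is not there. */
    static Optional<URL> locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathCheckSketch.class.getClassLoader();
        }
        return Optional.ofNullable(loader.getResource(resourceName));
    }

    /** Fails fast with a clear message instead of silently running on defaults. */
    static URL require(String resourceName) {
        return locate(resourceName).orElseThrow(() ->
                new IllegalStateException(resourceName + " is not on the classpath"));
    }
}
```

Calling ClasspathCheckSketch.require("hive-site.xml") before creating the metastore client turns a confusing connection failure into an obvious configuration error.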
Data Extraction API
This was identical to the “Data Loading” use case above, except we didn’t need to do any impersonation. We just used the Kerberos authenticated user to directly access HDFS.
Throughout this process, we kept in mind the complexity of the production deployment. This feature falls into that undesirable “big bang” category where you have to flip it all on at once, or not at all. There’s no staged roll-out approach where we just secure part of the cluster, or just some of our client software - it was all-or-nothing.
The key piece for us was to ensure that the Kerberos authentication could be easily turned on and off across the board. The deployment would turn it on, and if something went awry we’d turn it off. Therefore we made sure that the authentication method (“kerberos” vs. “simple”) was a configuration parameter in all our software. The actual code required to authenticate was deployed to production in a dormant state well ahead of flipping the switch. Both the authenticated and non-authenticated code paths were well tested and exercised. This also enabled us to easily practice the deployment in our pre-production environments.
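A configuration switch like this can be very small. The sketch below shows one way to model it, assuming a hypothetical property named dpaas.auth.method in the client software's own configuration (Hadoop's own switch is the hadoop.security.authentication property, which takes the same two values). The returned strings stand in for the real connection logic.

```java
import java.util.Locale;
import java.util.Properties;

final class AuthToggleSketch {
    enum AuthMethod { SIMPLE, KERBEROS }

    /** Hypothetical property name used by this sketch. */
    static final String KEY = "dpaas.auth.method";

    /** Reads the switch, defaulting to "simple"; unknown values fail loudly. */
    static AuthMethod fromConfig(Properties props) {
        String raw = props.getProperty(KEY, "simple").trim().toUpperCase(Locale.ROOT);
        return AuthMethod.valueOf(raw); // throws IllegalArgumentException on typos
    }

    /** Both code paths ship together; the switch decides which one runs. */
    static String describeConnect(AuthMethod method) {
        switch (method) {
            case KERBEROS:
                return "log in from keytab, then connect";
            case SIMPLE:
            default:
                return "connect without logging in";
        }
    }
}
```

Keeping the default at "simple" means the Kerberos path stays dormant until the property is changed, and changing it back is the rollback.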
There was no magic to deploying this. It required a lot of planning and coordination. We staged as many things up front as we could such as the Kerberos principals and keytabs, and all the code & configuration as mentioned above. The deployment was an orchestrated procedure to:
- Pause our data pipeline
- Safely bring down the Hadoop cluster
- Flip from “simple” authentication to “kerberos”
- Bring the cluster back up
- Resume the data pipeline
- Breathe :)
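The procedure above can be sketched as an ordered list of steps, each paired with the action that undoes it. This is only an illustration of the "turn it on, and turn it off if something goes awry" idea, assuming that undoing completed steps in reverse order is an acceptable rollback. The step names and the runner itself are invented for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class CutoverSketch {
    /** One deployment step plus the action that undoes it. */
    static final class Step {
        final String name;
        final Runnable apply;
        final Runnable undo;

        Step(String name, Runnable apply, Runnable undo) {
            this.name = name;
            this.apply = apply;
            this.undo = undo;
        }
    }

    /**
     * Runs the steps in order. If one throws, the steps that already completed
     * are undone in reverse order. Returns true only if every step succeeded.
     */
    static boolean run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.apply.run();
                log.add("done: " + step.name);
                completed.push(step);
            } catch (RuntimeException e) {
                log.add("failed: " + step.name);
                while (!completed.isEmpty()) {
                    Step previous = completed.pop();
                    previous.undo.run();
                    log.add("undone: " + previous.name);
                }
                return false;
            }
        }
        return true;
    }
}
```

Practicing the same list of steps in a pre-production environment exercises both the forward path and the rollback path before the real cutover.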
Now that we have all of our authentication pieces in place, we’re tackling the second A - authorization - next. Stay tuned to the AppNexus blog as we’ll be sure to share our designs and trials & tribulations around authorization.