Building VMware Identity Manager On-Prem

Hi everyone!

So, who hates OVAs?! ::raises hand::

I’ve been able to work a few back channels and get access to non-OVA installers for VMware Identity Manager that run on Windows instances. Here’s the catch: it REALLY sucks to build it. I’m going to document the full process end-to-end to give you a better chance than I had while working on this. There’s no documentation at AirWatch for this, so it’s really me feeling my way through and bugging people at VMware when I get stuck.

Let Me Introduce You…

Realistically, before you can consider rolling this out internally, it is important that you understand the various components that deliver VIDM. They made some really good choices picking some of the best open-source technologies out there, but if you don’t know how they work, you will fail.

Elasticsearch

Elasticsearch is responsible for sync logs, reporting, and auditing. You should probably get used to it, since it’s a major component of AirWatch’s reporting strategy and how they are solving the insanely slow report generation that they have today.

The simple way of putting it is that Elasticsearch is a RESTful search and analytics engine. It’s distributed, so it can scale out across nodes extremely effectively. A ton of people are using this technology now. I really like the screenshot below to help you understand it.

[screenshot]

You want to visualize the Elasticsearch index as a database. Inside of that index, you have shards, aka Lucene indices; Lucene is the Java full-text search library underneath. As an example, in my 6-node cluster I have 25 active primary shards and 50 replica shards in VIDM. Shards are interesting because they are BOTH a data store and a search engine. Elasticsearch indices also have types, which are like DB tables: a type logically partitions the data within an index, and all data within a type shares the same properties.

The other thing you want to understand about Elasticsearch is replication, which is a major aspect of building a resilient cluster. Once you configure the elasticsearch.yml file in the /oss/elasticsearch/config folder to use your group of servers, data will replicate across your instances. The graphic below shows you how a new item is added or deleted: the document ID determines which primary shard it belongs to, and once the primary completes the operation, it sends the data to the replica shards on the other nodes.
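To make that concrete, here is a rough sketch of what the clustering portion of elasticsearch.yml can look like. The exact key names depend on the Elasticsearch version VIDM bundles, and the node names are just examples from my lab, so treat this as illustrative only:

```yaml
# Illustrative sketch only; exact keys depend on your bundled Elasticsearch version.
cluster.name: horizon                    # every node in the cluster must share this name
node.name: node1.hs.trcint.com           # unique per node
# If you are not relying on multicast discovery, list the other members explicitly:
discovery.zen.ping.unicast.hosts: ["node2.hs.trcint.com", "node3.hs.trcint.com"]
```

Whatever the keys look like in your version, the point is the same: every node must agree on the cluster name and be able to find its peers.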

Overall, just remember that this command is SO useful for you: curl ‘http://localhost:9200/_cluster/health?pretty’ since it will tell you the status of your cluster. I would DEFINITELY suggest setting up monitoring for that URL in your environment, expecting an HTTP/200 response and a “green” status in the body. Anything outside of that and you know something is WRONG!
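If you want to script that check instead of eyeballing curl output, here’s a minimal sketch. The endpoint is the one above; the helper name and the green/yellow/red mapping are my own:

```python
import json

def classify_cluster_health(http_status, body):
    """Map a _cluster/health response to an alert level.
    Helper name and mapping are my own; the endpoint is from the post."""
    if http_status != 200:
        return "CRITICAL"  # anything other than HTTP/200 means something is WRONG
    status = json.loads(body).get("status", "red")
    # green = all shards allocated, yellow = unassigned replicas, red = unassigned primaries
    return {"green": "OK", "yellow": "WARNING"}.get(status, "CRITICAL")

# Shaped like a real response from: curl 'http://localhost:9200/_cluster/health?pretty'
sample = json.dumps({"cluster_name": "horizon", "status": "green",
                     "number_of_nodes": 6, "active_primary_shards": 25,
                     "active_shards": 75})
print(classify_cluster_health(200, sample))  # OK
```

Wire whatever comes out of this into your monitoring tool of choice and you’ll know about a yellow cluster before your users do.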

[screenshot]

Ehcache

Ehcache is another pivotal piece of software under the covers. It is the service that caches your database objects. Guess what, people? That is why you can do stuff so quickly! SQL is not exactly quick sometimes, big surprise I know!

So… what does Ehcache actually do? Ehcache boosts performance, offloads your DB, and makes scaling much easier. Ehcache’s scalability comes in a number of flavors, such as in-process caching, off-heap storage, mixed in-process/out-of-process deployments with terabyte-sized caches, and much more. It’s pretty complicated, as you can see in this diagram, but the thing we will mainly focus on is RMI replication.

[screenshot]

RMI (Remote Method Invocation) replication is a favorable choice for a few reasons: (1) it’s the default remoting mechanism in Java, (2) you can tune TCP socket options, (3) it passes through firewalls cleanly, and (4) it implements some nifty batching to improve performance.

Ehcache is very interesting. The thing I love most about it is that there is no primary or master cache. On each server you configure its peers, and they all work together as a group of equals. It’s the epitome of teamwork, which is beautiful when you think about it logically. The RMI influence comes into play as VMware sets up two listeners: port 40002 is the endpoint where remote objects register, and port 40003 is the port for the remote objects themselves.

[screenshot]

RMI is very interesting for non-developers like myself. The entire point is to create objects that remote servers can invoke methods on, which facilitates the replication across nodes that is essential to the technology. The weird thing to me is that VMware has a tendency to make technological decisions that make no sense: most people run Ehcache with automatic peer discovery, whereas VIDM requires you to list the peers in the runtime configuration file. As an FYI, if you want to verify it’s working properly, you can check horizon.log for a line like “Added ehcache replication peer: //node3.hs.trcint.com:40002”.
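That log check is easy to automate. Here’s a small sketch that scans horizon.log lines for the message above; the log format comes straight from the post, but the helper itself is hypothetical:

```python
import re

# The log message format is straight from horizon.log; this helper is my own.
PEER_LINE = re.compile(r"Added ehcache replication peer: //(\S+)")

def find_ehcache_peers(log_lines):
    """Return every peer that horizon.log reports as added to the cache group."""
    peers = []
    for line in log_lines:
        match = PEER_LINE.search(line)
        if match:
            peers.append(match.group(1))
    return peers

log = [
    "INFO  Added ehcache replication peer: //node3.hs.trcint.com:40002",
    "INFO  some unrelated startup chatter",
]
print(find_ehcache_peers(log))  # ['node3.hs.trcint.com:40002']
```

If the list it returns doesn’t contain every other node in your cluster, go back and double-check your peer configuration.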

RabbitMQ

RabbitMQ is the third technology to be aware of with VIDM. They have simplified it in recent versions, which no longer require clustering (thank god!). RabbitMQ is the messaging service for VIDM, and it is very attractive because it works cleanly with several programming languages. It is similar to Microsoft Message Queuing, except substantially more powerful and with more features.

Many people smarter than I feel that RabbitMQ is becoming the gold standard in cloud messaging. It helps broker messaging between servers and clients along with delivering monitoring notifications. It’s the epitome of middleware that helps bridge the gap between programming languages while facilitating communication over AMQP (the Advanced Message Queuing Protocol). Luckily, we don’t have to manage it for the most part anymore. We used to have to do a bunch of work on it, but that is no longer required.

Installing the Initial Server

The initial installation of your first server is sort of like going on a first date. You are trying to figure out where the “boundaries” are and what “works or doesn’t”. It can be very tiring, but once you successfully install one of them you will feel vindicated. I’m not going to walk you through the entire install, but I will hit all the “gotchas”.

The first thing you might miss is exporting the config.xml from the initial build. Make sure you do this so you’re set up for success.

[screenshot]

Next, make sure you click “Browse” to verify you can connect to your database. Also, if you are doing SQL Always On clustering, make sure to check the box here!

[screenshot]

My recommendation on the next screen is NOT to have VIDM run as a service account. Remember, ideally you want to run VIDM in your DMZ and then have connections run inbound to VIDM Connectors to handle your authentication methods, i.e. Kerberos and friends.

[screenshot]

This one is probably the trickiest of all. They implement (mistakenly and poorly, might I add) the proxy/non-proxy hosts on the Tomcat server, which means as of 3.1 they can only support a single non-proxy host. The places you usually get tripped up are: (1) make sure the external and internal server hostnames match so you don’t mess yourself up, AND (2) make sure you set non-proxy hosts like *.test.com and not test.com; the latter will fail and break your install.

[screenshot]

This is where you input the local account for your VIDM instance. Make sure you write it down. Also, do NOT click the cluster checkbox, since you can’t create the cluster yet.

[screenshot]

That covers the install. The best way to know whether it worked is pretty simple: go to D:\VIDM\VMwareIdentityManager\usr\local\horizon\conf\state\ and make sure you see files there. If you do, the install actually worked and you didn’t get screwed over by your proxy.

[screenshot]

The last thing to check is that the system.config file, back in the conf folder a few levels up, contains all of the lines shown here:

[screenshot]

After the install finishes, you can use my wiki here for setting up the initial server. It’s fairly straightforward: (1) log in, (2) set up the directory, (3) set up auth methods and policies, and (4) set up a test application. Either way, onto building the additional nodes!

Installing the 2nd and 3rd Nodes

Before you start installing the other two nodes, you have a few prerequisites to complete:

  • Generate the Cluster Config File
  • Install IIS on the other two servers (I know it’s dumb, and no one knows why you need it, but it’s part of the installation process)
  • Make sure the proper ports are opened between the member servers

Generating the Cluster File

  1. You need to run a valuable script to create the cluster file, as you can see below. I obviously blanked much of it out, but the format is .\generateClusterFiles.bat filename password. Don’t get cute and try to have it write the file elsewhere. It needs to be written in place, and you can see it puts the output in the horizon folder.
  2. Copy the file to the other two servers
  3. Installing IIS on those servers is easy enough.

Firewall/Port Setup aka GIANT WALL OF TEXT

This isn’t a fun time for anyone. The firewall/port list you need for this application is an absolute monster.

Source                Destination           Port                        Service
External Devices      VIDM Load Balancer    HTTPS/443                   VIDM
VIDM Load Balancer    VIDM Instances        HTTPS/443                   VIDM
Admin Desktops        VIDM Load Balancer    HTTPS/8443                  VIDM
VIDM Load Balancer    VIDM Instances        HTTPS/8443                  VIDM
VIDM Instances        VIDM Instances        HTTPS/8443                  VIDM Clustering
VIDM Instances        VIDM Instances        TCP/40002-40003             Ehcache replication
VIDM Instances        VIDM Instances        TCP/9300-9400, UDP/54328    Elasticsearch
VIDM Instances        SQL Server            TCP/1433                    Database connectivity
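If you’d rather script the connectivity verification than telnet to each port by hand, here’s a minimal sketch. The TCP ports come from the table above; the hostname is an example from my lab, and `port_open` is my own helper (note that UDP/54328 needs a different kind of check):

```python
import socket

def port_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds.
    Only covers the TCP rows above; UDP/54328 needs a separate check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or hostname didn't resolve
        return False

# Hostname is an example from my lab; ports come from the table above.
for port in (443, 8443, 40002, 40003, 9300, 1433):
    state = "open" if port_open("node2.hs.trcint.com", port) else "CLOSED"
    print(f"node2.hs.trcint.com:{port} {state}")
```

Run it from each member server against every other member and you’ll know your firewall rules are right before the installer tells you the hard way.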

A few other things that will help your cluster build go smoothly:

  • Don’t forget to have the SSL certificates ready for your other servers (strongly recommend including your load balancer and cluster member names in there; don’t forget CNAMEs!)
  • Don’t forget, when running the installer, to change it to the right certificate
  • Don’t forget to click the box to add the server to the cluster! (You won’t have one without it)
  • Don’t add members to your load balancer until you’re finished prepping your cluster for the data center (next section)
  • After you successfully install the other members, go in and make sure they are properly associated with the right identity providers, and all that stuff, as a good check before proceeding.
  • RabbitMQ no longer needs clustering so don’t worry about that when you see it in older docs.

Preparing your Cluster for the Second Data Center

Okay, so now this is where it gets interesting. All of the official steps are for Linux, so I have Towlesified them for you.

Prep Elasticsearch for DR

  • First, stop the service; ignore the “could not be stopped” message and just check it in services.msc if necessary
  • Remove the plugin
  • Start the service
  • Modify the Elasticsearch configuration file found here: D:\VIDM\VMwareIdentityManager\oss\elasticsearch\config\elasticsearch.yml
  • Restart the Identity Manager service
  • Test the cluster

Setting up your Cluster Runtime Properties to finalize replication

  • Navigate to D:\VIDM\VMwareIdentityManager\usr\local\horizon\conf\runtime-config.properties and edit the file
  • Configure your Ehcache replication peers (don’t forget to omit the server you are on; they’re PEERS, which means never list your own server or you create a VIDM paradox)
  • Add the 2nd data center’s load balancer FQDN to the property file
  • Restart the Identity Manager service one more time
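Putting the bullets above together, the edit to runtime-config.properties looks roughly like this. The property names here are HYPOTHETICAL (check the keys already present in your own file), but the peer rule (list everyone except yourself) and the 40002 registry port come straight from the setup above:

```properties
# HYPOTHETICAL key names; mirror whatever your runtime-config.properties already uses.
# On node1, list only the OTHER nodes (never yourself) at the RMI registry port:
ehcache.replication.rmi.servers=node2.hs.trcint.com:40002,node3.hs.trcint.com:40002
# FQDN of the second data center's load balancer (again, key name is illustrative):
datacenter2.loadbalancer.fqdn=vidm-dc2.hs.trcint.com
```

After the restart, remember you can grep horizon.log for the “Added ehcache replication peer” lines to confirm the peers actually joined.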

Setting up the 2nd Cluster

I had a TON of challenges with this. One thing I will recommend: if your 4th through 6th servers give you issues, recreate the cluster file RIGHT AWAY. I made this mistake and learned my lesson after trying to rebuild servers eight ways to Sunday. I’m mostly just going to give you a few tips here, since it’s largely the same as building the first three.

The rules below need to be implemented from your servers 4-6 back to your first three. It is vital that all 6 of your servers talk to each other. This is by no means a VMware recommendation, but I feel like I’m sort of an expert at this point. You can certainly build a cluster of just three servers, which you can learn about here.

My belief, after countless hours on the architecture and design of my environment, is that if all 6 nodes can maintain latency under 20ms, you get more resilience. If you can keep your Ehcache and Elasticsearch clusters in sync across sites, there’s substantial benefit. Which servers actually take traffic is managed by your load balancer, so you don’t have to worry about it hitting bad servers, provided your health checks are sound.

Source            Destination       Port                        Service
VIDM Instances    VIDM Instances    HTTPS/8443                  VIDM Clustering
VIDM Instances    VIDM Instances    TCP/40002-40003             Ehcache replication
VIDM Instances    VIDM Instances    TCP/9300-9400, UDP/54328    Elasticsearch

The process for servers 4-6 is the same as for the first 3 servers in the preparing-the-cluster section. There are a few things you need to do after servers 4-6 are set up:

  1. Set those servers into read-only mode in the runtime-config.properties file by adding the line read.only.service=true, then restart services with the horizon script

[screenshot]

2.  Once you complete those tasks, clear the cache on those servers so they are properly prepared for a DR event. You can do it one of two ways: via the API, or by rebooting the server. I think it’s important to make things easy on your operations teams, so I prefer to just tell people to reboot the server vs. teaching them how to use Postman.
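Step 1 can be scripted if you’re staging several nodes at once. A minimal sketch (the property name comes from the post; the helper itself is mine):

```python
def set_read_only(properties_text, enabled=True):
    """Rewrite runtime-config.properties content with read.only.service set.
    The property name comes from the post; the helper itself is just a sketch."""
    key = "read.only.service"
    kept = [line for line in properties_text.splitlines()
            if not line.startswith(key + "=")]  # drop any existing setting
    kept.append(f"{key}={'true' if enabled else 'false'}")
    return "\n".join(kept) + "\n"

before = "some.other.setting=1\nread.only.service=false\n"
after = set_read_only(before, enabled=True)
print("read.only.service=true" in after)  # True
```

Flip `enabled=False` during a DR failover when the secondary site needs to take writes, then restart services as described above.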

Once you have setup your entire environment, you should definitely perform a DR test to ensure you didn’t mess up. Ideally if you built a solid environment it would look like this:

  1. Flip your Global DNS (recommendation is that it has a TTL of 30s)
  2. Failover to your secondary database
  3. Set Read Only to False
  4. Use the horizonService.bat to restart your backup cluster
  5. Reboot the primary servers after the secondary takes over

Some might suggest just failing over and letting the secondary operate in read-only mode (doesn’t apply to SQL Always On), but I think that is just setting yourself up for failure.

The VIDM Lifeline

Something that goes truly unsaid, and that people aren’t even aware of, is the amazing hznAdminTool. It helps you perform a number of tasks, as you can see below:

[screenshot]

So let’s discuss this in context. I know most people in mobility are not particularly strong with command-line tools, and documentation for this isn’t available, so let’s walk through a few of the commands. You access the utility, hznAdminTool.bat, from the /usr/local/horizon/bin folder.

The ones you should care about most are:

  • generateKeys -masterKey: this is amazing! It will regenerate your master encryption key, which is essentially a keystore that links up your cluster and does lots of other stuff we likely don’t understand. Once you run this, you just have to copy the master key file over to your other nodes and ta-da!
  • setSystemAdminPassword: this will let you reset the admin account password used to log into the admin console over 8443. This is an ABSOLUTE lifesaver. Trust me I’ve tried unlocking accounts from the DB and it doesn’t work!
  • generateSuiteTokenKey: this lets you regenerate the suite token key that the API uses. If you see errors in the logs about generating a new suite token, this might be useful.
  • encryptionServiceHealth: this command will help you with checking the health of the encryption service. It’s a good troubleshooting tool if your console health is failing.
  • clusterInstances: lists what nodes are part of your cluster
  • orgInvalidStateCheck: will scan your org to look for any apps that are in a bad state

Overall there are plenty of utilities in here, but I can give you one good real-world example from building my environment. I found that node monitoring in the console wasn’t working. I reviewed the logs and saw my master encryption key was hosed. Using this tool and manually copying the key file to my other nodes resolved the issue.

In Closing

I wanted to share a few DB queries that I’ve found to be very useful for cleanup/fixing issues.

This will let you delete connectors that are orphaned, aka cannot be removed. Sometimes when you have to reinstall a connector, the old entry will still show up in there (run a SELECT against the table first so you’re sure you have the right id):

DELETE FROM saas.Connector WHERE id = 5

You can use this one similarly to delete bad entries in the VIDM server monitoring GUI:

DELETE FROM saas.ServiceInstance WHERE id = 3

Overall, the great thing about the design I have laid out here is that you can achieve a 15-minute SLA on a total region failure. This is substantially better than the SaaS SLA VMware offers, which allows for up to 3 hours. In a world where we cannot get dedicated SaaS, and they aren’t building their cloud in AWS with cross-region replication/failover, we must do better. It’s vital that we as technologists strive to deliver something that is a legitimate game changer. VIDM can be a difference maker if you know how to deliver positive business outcomes with it.
