Managing cryptographic keys at an enterprise level: What to love and what to hate?
The hurdles of managing cryptographic keys at an enterprise level. A deep dive into the maintenance challenges of HSMs and the advantages of Key Management solutions.
This post is focused on the most important components that play a role in the key management lifecycle: Hardware Security Modules (HSMs) and Key Management Services (KMSs). I start by giving a very short introduction to what cryptographic key management is and what the key management lifecycle entails. The rest of the post focuses mainly on my experience with HSMs and KMSs with regard to architectural, operational and maintainability aspects.
What is Key Management?
Key Management is about managing the lifecycle of cryptographic keys.
Cryptographic keys (simply referred to as keys in this post) are one of the most essential parts of today’s security protocols, services and products. Just to give an example of their importance: you cannot set up a secure connection without a cryptographic key, and you cannot encrypt data or perform API authentication without a key. Despite keys being one of the most essential parts of the security foundations, more often than not they are poorly managed, leading to numerous vulnerabilities. If you’re not familiar with these topics, please find more information here.
Key Management Lifecycle
Key management involves managing the entire lifecycle of all the keys, from generation until destruction. The main steps of the key lifecycle are defined in the picture below.
These steps include: key generation, key distribution & registration, key storage & backup, key deployment & usage, key recovery, key rotation (sometimes referred to as re-keying), key revocation & archiving, key de-registration, and finally, key destruction. All these steps are spread across different phases: pre-operational, operational and post-operational. NIST provides guidelines and best practices for managing cryptographic keys, including definitions and specifications of all the steps. More information can be found here.
Hardware Security Modules (HSM)
HSMs play an important role in managing the key lifecycle. These are physical devices that protect and manage cryptographic keys, for example by offering the possibility of generating keys with sufficient entropy (i.e., pseudo-random keys). They can come in the form of plug-in cards or external devices. Either way, HSMs can be attached directly to physical servers or to the network. They support a set of standard cryptographic algorithms and must be FIPS 140–2/3 certified. More information about the technical characteristics of HSMs can be found here and here.
There are two main types of HSMs: (1) general-purpose and (2) payment HSMs. This post is mostly focused on general-purpose HSMs.
The usage of HSMs for key protection is mandatory in certain business lines, like card payment systems and PKI environments. But these devices can also be used to accelerate SSL-offloading (CDN/Proxy Services) or even enhance the key protection in the context of crypto-currency environments.
HSMs in practice
HSMs and Key Protection
Each HSM has a Master (symmetric) Key, either using AES or 3DES (some payment HSMs still use 3DES keys). Keys can be protected by specific smart cards (in a physical or virtual form) that are configured to enable access to the specific application keys or to authorise administrative tasks in the HSM. HSMs of the same cluster share the same master key. Different applications can have different keys that are protected under the same master key of the HSM.
Although card-protected keys provide enhanced key protection, keys protected by physical smart cards can be a nightmare to maintain. For certain HSM vendors, whenever the application needs to restart (due to a scheduled upgrade, for instance), the card needs to be inserted in the HSM so the application key can be loaded and access can be granted. This requires, in certain cases, the Crypto Officer to be physically present at the HSM’s (physical) location. Since this is not always realistic and can actually lead to a denial of service for the application, some (or even most) application keys are configured without the additional access control mechanisms.
The typical deployment architecture of an HSM
Typical HSMs have a couple of default interfaces that can be used to connect the applications: Java JCE, OpenSSL, MS CAPI/CNG, PKCS#11, domain-specific command-driven interfaces, etc. In order to connect a client-application, they require an agent (daemon process) to be installed client-side.
This agent connects the client-application to the HSM and must be deployed either (1) in the same server — virtual machine or container — where the application is running or (2) in some sort of gateway — between the client application and the HSM. This gateway service must enable the communication between the client-application and the agent.
Let’s assume for now that the client-application is deployed in the same server where the agent is deployed. When setting up a redundant and highly-available environment, we need at least two HSMs and two servers where the client-application is deployed.
In these environments we need to connect each of the servers to each one of the HSMs, which means that each application requires at least four connections to the HSMs (more, depending on the number of servers). Setting up these connections might entail, for instance, configuring the IP of the server in the HSM. Although most HSM vendors provide remote management of HSMs, configuring these connections, whether done physically at the HSMs’ location or remotely, is mostly a manual task performed by the Crypto Officer or Key Custodian.
HSMs deployment at scale
This typical HSM architecture model doesn’t scale when we have hundreds of applications to connect to an HSM cluster (with multiple HSMs), where each one of them has multiple servers for high-availability and redundancy.
Please note that the architecture model described above needs to be replicated in most of the environments: development, test, acceptance and production.
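To get a feeling for how quickly this grows, here is a back-of-the-envelope sketch in Python. The function name and all the numbers are illustrative assumptions of mine, not figures from a real deployment. It simply multiplies out the full mesh described above, where every server of every application connects to every HSM, in every environment.

```python
def hsm_connections(applications: int, servers_per_app: int,
                    hsms_per_cluster: int, environments: int = 1) -> int:
    """Number of server-to-HSM connections in a full mesh.

    Assumes every server of every application is connected to every
    HSM of the cluster, and that the setup is repeated per environment.
    """
    return applications * servers_per_app * hsms_per_cluster * environments


# The minimal highly-available setup from the text: 1 application, 2 servers, 2 HSMs.
minimal = hsm_connections(applications=1, servers_per_app=2, hsms_per_cluster=2)
print(minimal)  # 4

# Hypothetical enterprise: 200 applications, 2 servers each, 4 HSMs,
# replicated across development, test, acceptance and production.
at_scale = hsm_connections(200, 2, 4, environments=4)
print(at_scale)  # 6400
```

Each of those connections is, in the worst case, a manual configuration step for the Crypto Officer.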
Furthermore, some HSMs require configuring in the HSM the specific IP of the server where the client-application is running. This means that this tightly coupled architecture is not directly compatible with containerised environments or other cloud-native resources.
The effort we have to put in to configure each and every single application when using this typical deployment architecture for HSMs is huge.
The maintainability nightmare of the HSMs world
Next, I list some of the major issues I have identified with regard to the maintainability of the typical deployment architecture for HSMs described above.
1. Tightly coupled architecture issues when doing patch management or upgrading
This tightly coupled architecture is ridiculously painful when you need to upgrade the HSMs. If the HSM needs an upgrade, you need to go through the Herculean task of synchronising with all the client-application owners for the upgrade to take place. Sometimes, HSM upgrades require the agent to be upgraded, and because an update in the agent and its associated libraries/interfaces can require changes in the client-application, the application itself may need to be updated as well. Sometimes the whole upgrade is nearly impossible, because the client-application upgrade is not compatible with the HSM upgrade.
Practical examples:
Example 1. When using the Java JCE interface and the Java library has an update, the client-application might need to be rebuilt and redeployed. So the upgrade itself requires:
1. changing the code of the client-application and redeploying it
2. redeploying a new agent in the server where the application is running
3. upgrading the HSM itself
4. testing everything end-to-end and hoping nothing breaks, so you don’t have to spend endless hours debugging the code or in meetings with the vendor.
Example 2. Consider the situation where the new upgrade of the HSM implements the PKCS#11 interface in a different way from the client-application. This leads to incompatible versions of the interface in the client-application and the HSM. This may sound strange to you, but not all applications are tested against all HSM vendors, and PKCS#11 is a standard, not a library. Therefore, each vendor may implement it in slightly different ways, as described in this video.
2. HSMs licensing models and support contracts are expensive
Some vendors charge per HSM connection, and enterprise licenses (which allow an unlimited number of connections) are only cost-efficient when you have more than a certain number of clients connected to an HSM. On top of that, in some cases, each additional functionality increases the licensing costs of the HSMs. E.g.: enabling ECC may cost thousands of euros extra.
Also, additional functionalities incur additional costs in the HSM support contracts. E.g.: if you enable ECC, you may pay more for the support of ECC on top of the amount you’re already paying for HSM support. This tightly coupled architecture is not just a nightmare to maintain, it’s also very expensive.
3. Scalability: provisioning a new HSM is hard
Adding a new HSM to the cluster is also not an easy process. HSMs are physical devices, so configuring a new HSM may in some cases require considerable effort, because HSM configuration can hardly be automated. This becomes even harder when you have custom firmware running in the HSM (some applications may require custom firmware for enhanced performance). That’s why most HSM environments tend to have overcapacity by default.
4. Crypto Officers are like unicorns
Finding people with actual knowledge of how to configure and maintain an HSM is like finding a needle in a haystack. Managing an HSM requires cryptographic knowledge and expertise on how a specific HSM model has been implemented. On top of that, Crypto Officers must also understand how modern applications work, including typical application stacks, infrastructure as code, CI/CD, etc.; otherwise conversations with the DevOps teams become almost impossible. This means that Crypto Officers are like unicorns: exquisite and rare. The more heterogeneous your HSM environment is, the more difficult it becomes to maintain and to find personnel with suitable skills.
5. HSMs are not secure by default
This is a bold claim and for sure a controversial one, but the fact that HSMs are cryptographic devices doesn’t mean they are secure devices. Graham Steel from Cryptosense explains in this and this video some of the attacks on these devices. One of the most recent known attacks is from 2019, when a group of French researchers found a number of security vulnerabilities in HSMs that allowed them to upload unsigned (unauthenticated) firmware to the HSM (using a client-application and the PKCS#11 interface) and to dump all its secret keys (see more information here). If an attacker is able to access an HSM, either via the client-application or through the network, the impact can be catastrophic. So, if the client-application that directly connects to the HSM is externally exposed, the attack surface of the HSM increases considerably. Therefore, securing this whole tightly coupled architecture is a really laborious task.
I could keep going and mention things like firmware versions with unsolved (unsolvable) bugs, the susceptibility of HSMs to network attacks, and so on, but I believe that the aspects listed above suffice to illustrate the lack of maintainability of HSM environments. This is why Key Management Services are such an important piece of the puzzle in these environments.
Key Management Services (KMS)
The KMS creates an abstraction layer for the HSMs, solving connectivity and configuration issues, enabling scalability in terms of consumption of HSM services, improving long-term maintainability and, most importantly, improving the security of these environments. KMSs are in fact software-based solutions aiming at automating the key management lifecycle.
Although some KMS solutions have been on the market for quite a while, HSMs were never used at scale until Cloud Service Providers (CSPs) like Google, AWS and Azure made it possible. As mentioned above, HSMs are expensive, hard to set up and configure, require specialised knowledge to manage and mostly do not fit into today’s cloud environments and DevOps culture. CSPs changed this game by placing easy-to-use Key Management Services (KMS), like AWS KMS or Azure Key Vault, in front of the HSMs. In fact, the KMSs from CSPs made HSM services more of a commodity, allowing the usage of HSMs within any security context that requires cryptographic keys.
In the majority of the use cases, KMSs in the cloud came as a way of improving key protection and, to some extent, the management of the keys. Two main types are offered by CSPs:
- Single-tenant: the KMS and associated HSM are dedicated to a single tenant. Not sharing HSMs across different tenants reduces the attack surface in case of vulnerabilities in the KMS. E.g.: AWS CloudHSM and Azure Managed HSM are examples of single-tenant KMS solutions.
- Multi-tenant: in the multi-tenant environment, KMSs from different subscriptions share the same HSM; i.e., a single HSM cluster is shared across multiple tenants instead of dedicated to a single one. E.g.: Azure Key Vault and AWS KMS (HSM-backed) are examples of multi-tenant solutions. In this scenario, the master key of the HSM cluster is used to protect all the secrets of all the tenants connected to it.
In both scenarios, the KMS is offered as a SaaS solution, integrating natively with services and functionalities of the CSP, such as authentication and authorisation services, PaaS services, etc. Some CSPs, like Google, for instance, now offer the possibility of integrating specific external KMSs and HSM services with their native SaaS and PaaS services (Google EKM).
Please note that KMSs in the cloud also support software-based key protection instead of HSM-based protection. However, this post is mostly focused on HSM-backed KMSs.
What to love about cloud KMSs?
Cloud KMS provides well-maintained and tested SDKs and (REST) APIs. The environment where the client application is running doesn’t have to be preconfigured in order to connect to the HSM. The client-applications only need to consume the APIs to perform the cryptographic operations.
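To make the loose coupling concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `KeyService` interface, the `InMemoryKeyService` stand-in and the key name are mine and do not correspond to the SDK of any particular CSP. The stand-in computes an HMAC with the standard library only so that the example runs on its own; a real KMS would perform the operation inside the HSM behind an authenticated API.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Protocol


class KeyService(Protocol):
    """Hypothetical interface: all the application sees of the KMS."""

    def generate_mac(self, key_id: str, message: bytes) -> bytes: ...
    def verify_mac(self, key_id: str, message: bytes, tag: bytes) -> bool: ...


class InMemoryKeyService:
    """Stand-in for a real KMS, for illustration only."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def create_key(self, key_id: str) -> None:
        # Key material is generated inside the service and never returned.
        self._keys[key_id] = secrets.token_bytes(32)

    def generate_mac(self, key_id: str, message: bytes) -> bytes:
        return hmac.new(self._keys[key_id], message, hashlib.sha256).digest()

    def verify_mac(self, key_id: str, message: bytes, tag: bytes) -> bool:
        return hmac.compare_digest(self.generate_mac(key_id, message), tag)


def protect_payment(kms: KeyService, payload: bytes) -> bytes:
    # The application only knows a key identifier: no agent, no HSM address,
    # no key material.
    return kms.generate_mac("payments-mac-key", payload)


kms = InMemoryKeyService()
kms.create_key("payments-mac-key")
tag = protect_payment(kms, b"pay 10 EUR to Alice")
print(kms.verify_mac("payments-mac-key", b"pay 10 EUR to Alice", tag))  # True
print(kms.verify_mac("payments-mac-key", b"pay 99 EUR to Alice", tag))  # False
```

Because the application depends only on the interface, the HSMs behind the service can be upgraded or replaced without touching the application code, which is exactly what the tightly coupled agent model makes so hard.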
Cloud KMS aligns perfectly with the DevOps culture. It integrates with continuous delivery pipelines and allows a more repeatable, testable and secure deployment and maintenance of keys. The key management lifecycle can be automated and easily integrated with change and configuration management, and it provides auditability out of the box. KMS in the cloud facilitates DevOps practices, which has led to a massive adoption of these services. Secrets and keys are stored in a more secure way these days thanks to these products.
But, KMSs in the cloud are not for everything…
Decentralised key management. Key management in the cloud becomes mostly decentralised. Each DevOps team is responsible for maintaining the lifecycle of the keys used by their application. This comes as both an advantage and a problem. If you’re able to create templates to properly manage the keys and get the DevOps teams to use those templates, then decentralised key management can be a blessing: it provides more autonomy to the teams and enables faster delivery of secure products. On the other hand, if templates do not exist, things like misconfigurations of the KMS (e.g.: accessible from public end-points), faulty procedures to handle keys (e.g.: developers generate keys on their laptops and import them into the KMS manually) and poor or non-existent access management policies can lead to greater key exposure.
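As a rough idea of what such a template could enforce, here is a small Python sketch that checks a key request against the three pitfalls mentioned above. The `KeyRequest` fields and the rule wording are hypothetical and not tied to any CSP; in practice this kind of check would typically live in the infrastructure-as-code pipeline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeyRequest:
    """Hypothetical description of a key a DevOps team wants to create."""
    name: str
    origin: str = "generated-in-kms"   # or "imported"
    public_endpoint: bool = False
    allowed_principals: List[str] = field(default_factory=list)


def check_against_template(request: KeyRequest) -> List[str]:
    """Return the list of findings; an empty list means the request passes."""
    findings = []
    if request.public_endpoint:
        findings.append("KMS must not be reachable from public end-points")
    if request.origin != "generated-in-kms":
        findings.append("keys must be generated in the KMS, not imported manually")
    if not request.allowed_principals:
        findings.append("an access policy with explicit principals is required")
    return findings


good = KeyRequest("orders-db-key", allowed_principals=["orders-service"])
bad = KeyRequest("legacy-key", origin="imported", public_endpoint=True)

print(check_against_template(good))      # []
print(len(check_against_template(bad)))  # 3
```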
When proper key management knowledge is not available, it is preferable to centralise the key management lifecycle.
Centralisation is possible but… CSPs do offer centralised KMS solutions — e.g.: single-tenant KMSs described above. However, these products are only configured with HSMs that are FIPS 140–2/3 Level 3 certified and some applications can only work with FIPS 140–2/3 Level 2 HSMs. Additionally, the cloud KMSs do not always support all the cryptographic operations and algorithms, nor all types of keys. E.g.: it is not possible to create symmetric keys in Azure Key Vault backed by an HSM — see here.
Not suitable for applications with low-latency requirements. KMSs provided by the CSPs cannot be used in certain contexts such as SSL-offloading. This is simply because they were not designed for that purpose. HSMs for SSL-offloading are placed next to the SSL server with the goal of speeding up the (asymmetric) cryptographic operations. When connecting an SSL server to a regular KMS service in the cloud, we introduce (network) latency, so instead of speeding up the computations we are in fact delaying them, and consequently degrading performance. Thus, for SSL-offloading environments, having dedicated HSMs is actually desirable, despite all the issues that may arise when maintaining that architecture.
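The effect of the extra network hop is easy to see with some simple arithmetic. In the Python sketch below, the timings are placeholders that I picked only to make the numbers easy to follow; they are not measurements of any HSM or cloud service, and the model assumes operations are issued one after another on a single connection.

```python
def ops_per_second(operation_ms: float, round_trip_ms: float = 0.0) -> float:
    """Sequential operations per second on a single connection.

    Each operation costs the HSM computation time plus the network
    round trip needed to reach the device.
    """
    return 1000.0 / (operation_ms + round_trip_ms)


# Placeholder values, not measurements.
local = ops_per_second(operation_ms=1.0, round_trip_ms=0.0)    # HSM next to the SSL server
remote = ops_per_second(operation_ms=1.0, round_trip_ms=19.0)  # KMS behind a network hop

print(local)   # 1000.0
print(remote)  # 50.0
```

With these assumed values the cryptographic operation itself is equally fast in both cases, yet the remote setup is twenty times slower per connection, purely because of the round trip.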
Lack of full control over the key material. Storing and managing keys using cloud KMSs can also pose challenges with regard to the lack of full control over the keys. For instance, because the cloud KMS integrates natively with the cloud IAM service, the CSP is technically able to elevate its privileges and access the keys on behalf of the tenant, without the tenant’s knowledge or consent. If these keys are being used to protect personal data, this may introduce data protection related issues (e.g.: non-compliance with GDPR).
Getting the most out of KMSs and HSMs
Since relying solely on the KMS of a CSP might not be sufficient for some industries, replicating the KMS architecture model of the CSP (to a certain extent) in the private cloud is, for some business lines, the desired situation. This KMS architecture model should then:
- Support all standard cryptographic operations and algorithms needed;
- Be centrally managed and fully backed by an HSM;
- Provide scalability, elasticity and more maintainable loosely coupled environments;
- Support applications running in both the private and the public cloud;
- Enable (full) automation of key management lifecycle;
- Keep access to keys under the full control of the organisation, i.e., access management fully depends on internal IAM services;
- Be able to integrate with the cloud KMS, such that we can use HSMs in the private cloud or in the public one, whenever we need/want to.
This centralised KMS architecture, which allows managing and visualising all the key lifecycle operations across all platforms in an enterprise, provides the so-called single pane of glass for key management operations.
This architecture allows better control over the key management lifecycle, enhanced crypto agility, better auditing and reporting and, most importantly, safer and more secure environments.
Please note that the cloud KMS does have limitations, but it also has advantages. Depending on the requirements, the cloud KMS is in fact a good and easy solution for managing cryptographic keys. So we should be able to keep using it, even when having a KMS deployed in our internal/private cloud.
Can we actually implement a hybrid KMS architecture model?
The architecture model described above is actually possible to achieve with some KMS solutions on the market. But don’t forget that KMS scalability is directly related to HSM scalability. As mentioned, HSMs do not easily scale up or down when deployed in the private cloud, but for every decision, there’s a compromise. So, you must decide whether it makes sense to have HSMs deployed in the private cloud always running with overcapacity or not, by looking into your specific requirements (functionality, load, compliance, security, etc.).
The most popular HSM vendors do offer the possibility of having HSM as a Service (HSMaaS). In these services, HSMs are deployed in the datacenter of the vendor and the client can manage the keys remotely. In my opinion, HSMaaS coupled with an internally managed KMS provides a better architecture model (with enhanced control) for applications that do not have strict performance requirements and/or do not require HSMs with custom configurations (e.g.: some mainframe applications for digital transactions wouldn’t fit into this model). Elasticity is still an issue, but at least scaling up and down becomes more straightforward.
KMS as a Service (KMS as SaaS) is also provided by some HSM vendors. This solution can be a good option for some use cases. However, KMS as SaaS may not always provide the desired level of control required in the implementation of certain regulatory requirements, nor the ability to integrate with distinct HSM clusters.
Moving forward… perhaps with virtual HSMs
Not long ago, two very well known cryptographers (Yehuda Lindell and Nigel Smart) built what is called a virtual HSM (vHSM). This product from Unbound Security leverages the capabilities of Secure Multi-Party Computation (SMPC) and instead of requiring a physical device to perform the cryptographic operations, the key is split into two or more parts called shares, and each share is placed in a different server. The SMPC protocol enables the different servers to interact with each other and perform the computations in a secure way without revealing the key. Yehuda Lindell gives a good explanation of this product in this white paper.
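To illustrate only the key-splitting part of this idea, here is a toy Python sketch using XOR secret sharing. This is my own simplification and not the scheme used by Unbound Security: it shows that a key can be split into two shares that are individually just random bytes, but it does not show the SMPC protocol itself, whose whole point is that the servers compute with the key without ever recombining the shares. The `combine` function exists solely to check that the split is correct.

```python
import secrets
from typing import Tuple


def split_key(key: bytes) -> Tuple[bytes, bytes]:
    """Split a key into two shares; each share alone reveals nothing about the key."""
    share_a = secrets.token_bytes(len(key))               # uniformly random
    share_b = bytes(k ^ a for k, a in zip(key, share_a))  # key XOR share_a
    return share_a, share_b


def combine(share_a: bytes, share_b: bytes) -> bytes:
    """Recombine the shares. A real SMPC deployment never does this in one place."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))


key = secrets.token_bytes(32)
share_a, share_b = split_key(key)  # in practice, each share lives on a different server

print(combine(share_a, share_b) == key)  # True
```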
vHSMs are in fact software products, not hardware ones. This means that the key management lifecycle becomes more agile, easier to maintain, and quicker and smoother to adapt to changes (even in the event of quantum computers and post-quantum crypto).
vHSMs are, in my opinion, the evolution of the traditional HSMs when we consider cloud adoption and DevOps practices.
I truly love the idea behind vHSMs and the product of Unbound Security, but I still see a lot of resistance to their adoption across organisations and even among the crypto community. There are political, commercial and, of course, technical reasons behind this discussion. I would rather focus on the technical ones, but I’m not going to discuss them further in this post.
Conclusions and further remarks
Traditional operational key management can be really painful. As explained, the typical deployment architecture of HSMs is hard to maintain and doesn’t scale. When coupling HSMs with KMSs we can, in fact, turn HSM services into a commodity. KMSs make key management easier, more controlled and more secure and, above all, widely available to all applications managing key material.
After reading this post, you could ask: why do all applications handling keys need an HSM (whether it’s a virtual one or not)?
Some advocate that HSMs must not be used for protecting all the keys in an enterprise. I’m certainly not of the same opinion. HSMs are extremely powerful and expensive computation devices, with a capacity of more than 30,000 cryptographic operations per second.
So I think the question we should ask is: why not use an HSM to protect all the key material?
HSMs provide a better mechanism for key protection, so leveraging HSM capabilities across multiple applications is actually needed and desired.
Remarks
It goes without saying, but please note that this is my personal view on the topic and is fully based on my own experience as a cryptography consultant.