The MPC Custody Handoff Problem: How Enterprises Maintain Operational Continuity During Key Shard Rotation and Infrastructure Upgrades
When enterprises rotate MPC key shards or upgrade custody infrastructure, the primary risk is not a security breach during the transition itself. The real risk is operational disruption: frozen transaction pipelines, signing failures, governance gaps, and compliance blind spots that open up precisely when the system is in flux. For institutions managing significant digital asset flows, even a brief interruption carries material consequences. The solution is not to avoid these transitions but to architect them so that continuity is never in question.
TL;DR
- Key shard rotation and infrastructure upgrades are necessary for long-term security, but they introduce a window of operational risk that most enterprises underestimate.
- Continuity depends on parallel shard availability, policy-layer governance, and pre-defined handoff protocols rather than manual coordination.
- Compliance obligations do not pause during infrastructure changes, making audit-trail integrity a non-negotiable requirement throughout any transition.
- Institutions should treat custody transitions as a governed operational event, not a technical maintenance task.
- A top-tier security standard is not just about how assets are protected at rest, but about whether that protection holds through every point of change.
About the Author: This perspective on MPC operational continuity is grounded in nine years of real-world deployment experience across enterprise custody infrastructure, serving over 3,500 businesses across 50+ countries including banks, payment service providers, and institutional trading desks globally.
Why Does Key Shard Rotation Create Operational Risk?
MPC custody works by splitting a private key into separate cryptographic shares, called shards, distributed across multiple parties or systems [chainup.com]. No single shard can reconstruct the key on its own. This design eliminates the single point of failure that makes traditional key storage dangerous [blockdaemon.com]. But the same distribution that makes MPC secure also makes lifecycle management more complex.
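To make the share model concrete, and to show why rotation need not move assets, the toy Python sketch below uses simple n-of-n additive sharing with the secp256k1 group order as a prime modulus. It is a deliberate simplification under stated assumptions: production MPC protocols often use threshold (t-of-n) schemes, never reconstruct the key, and generate refresh offsets jointly so that no single party sees them. The `split`, `refresh`, and `combine` helpers are hypothetical names for this illustration only.

```python
from __future__ import annotations

import secrets

# Order of the secp256k1 group, used here only as a prime modulus for share
# arithmetic. Real protocols operate on curve points, not bare integers.
Q = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def split(secret: int, n: int) -> list[int]:
    """Split `secret` into n additive shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares


def refresh(shares: list[int]) -> list[int]:
    """Rotate shares by adding random offsets that sum to zero modulo Q.

    Every share changes, but their sum (the key) does not.
    """
    deltas = [secrets.randbelow(Q) for _ in range(len(shares) - 1)]
    deltas.append((-sum(deltas)) % Q)
    return [(s + d) % Q for s, d in zip(shares, deltas)]


def combine(shares: list[int]) -> int:
    """Reconstruct the secret. Demonstration only: MPC signing never does this."""
    return sum(shares) % Q


if __name__ == "__main__":
    key = secrets.randbelow(Q)
    old_shares = split(key, 3)
    new_shares = refresh(old_shares)
    print(combine(new_shares) == key)  # True: same key after rotation
```

Because `refresh` changes every share while preserving their sum, the key, and therefore the public key and on-chain address, stays the same. A share leaked before the rotation no longer combines correctly with the refreshed shares.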
When a shard needs to be rotated, whether because a signing party is being replaced, hardware is being decommissioned, or a compliance policy requires periodic refresh, the enterprise faces a narrow coordination window. During that window, the old shard set is being retired and the new shard set is not yet fully active. If the handoff is uncoordinated, signing requests that arrive mid-rotation may fail, queue indefinitely, or trigger exception states that require manual intervention.
The operational cost compounds quickly. Payment flows stall. Settlement windows are missed. Internal teams scramble to diagnose whether a failed transaction reflects a security event or a configuration lag. This ambiguity is itself a governance problem.
What Makes a Shard Rotation "Safe" From an Operational Standpoint?
A safe rotation preserves two things simultaneously: signing availability and audit continuity.
Signing availability means the system can still authorize transactions throughout the rotation window. This requires:
- Parallel shard activation, where the new shard set is provisioned and verified before the old set is deactivated.
- A defined overlap period during which both shard sets are valid, long enough to confirm successful signing under the new configuration.
- Automated fallback logic that routes signing requests to the stable configuration if the new set returns an error during its validation phase.
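The parallel-activation and fallback rules above can be sketched as a small router. This is a minimal Python illustration under stated assumptions: the signer callables, the `SigningError` type, and the rule that the new set becomes sole authority after a fixed number of successful signatures (`required_successes`) are hypothetical stand-ins for a real validation policy.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class SigningError(Exception):
    """Hypothetical error a signer raises when it cannot produce a signature."""


class Phase(Enum):
    OLD_ONLY = "old_only"  # new shard set not yet provisioned and verified
    OVERLAP = "overlap"    # both sets valid; new set is being validated live
    NEW_ONLY = "new_only"  # new set is sole authority; old set may be deactivated


@dataclass
class RotationRouter:
    old_signer: Callable[[bytes], bytes]
    new_signer: Callable[[bytes], bytes]
    required_successes: int = 3  # assumption: validation = N clean signatures
    phase: Phase = Phase.OLD_ONLY
    new_successes: int = 0

    def begin_overlap(self) -> None:
        """Call only after the new set has been provisioned and verified."""
        self.phase = Phase.OVERLAP

    def sign(self, payload: bytes) -> tuple[str, bytes]:
        """Return (which_set_signed, signature)."""
        if self.phase is Phase.OLD_ONLY:
            return "old", self.old_signer(payload)
        if self.phase is Phase.NEW_ONLY:
            return "new", self.new_signer(payload)
        # Overlap: prefer the new set, but fall back to the stable set on error.
        try:
            signature = self.new_signer(payload)
        except SigningError:
            return "old", self.old_signer(payload)
        self.new_successes += 1
        if self.new_successes >= self.required_successes:
            self.phase = Phase.NEW_ONLY  # validation met; new set takes over
        return "new", signature
```

Keeping the old set live until the new set has proven itself is what turns a mid-rotation error into a fallback rather than a stalled payment.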
Audit continuity means every transaction authorized during the transition is fully traceable. Regulators and compliance teams need to verify that no transaction was signed under an unverified or partially active shard configuration. This requires:
- Timestamped records of when the old shard set was deactivated and when the new set became the sole signing authority.
- Immutable logs tied to each transaction showing which shard configuration authorized it.
- Policy-layer controls that prevent any transaction from being signed outside of a validated configuration state [bitgo.com].
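One way to meet these audit-continuity requirements is a hash-chained, append-only log in which every signing decision records the shard configuration that authorized it, and signing is refused for any configuration not yet validated. The sketch below uses only the Python standard library; the event fields, the `authorize` helper, and the `PolicyViolation` exception are hypothetical, and a production system would typically also persist the chain to write-once storage.

```python
from __future__ import annotations

import hashlib
import json
import time

GENESIS = "0" * 64


class PolicyViolation(Exception):
    """Raised when a transaction targets a configuration that is not validated."""


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log; each entry commits to the hash of the previous one."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, event: dict) -> dict:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {**event, "ts": time.time(), "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


def authorize(log: AuditLog, tx_id: str, config_id: str, validated: set[str]) -> dict:
    """Log and permit signing only under a validated shard configuration."""
    if config_id not in validated:
        log.append({"type": "rejected", "tx": tx_id, "config": config_id})
        raise PolicyViolation(f"{config_id} is not a validated configuration")
    return log.append({"type": "signed", "tx": tx_id, "config": config_id})
```

Recording configuration activation and deactivation events in the same log gives compliance teams the timestamps described above, chained to the transactions they governed.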
MPC implementations that embed policy enforcement at the protocol level, rather than layering it on top, are significantly better positioned to maintain this standard through a rotation [bitgo.com].
How Should Enterprises Govern Infrastructure Upgrades Without Disrupting Custody?
Beyond the mechanics of shard rotation, a separate concern is what happens when the underlying infrastructure itself changes: hardware replacement, cloud migration, protocol upgrades, or changes to the HSM layer.
Infrastructure upgrades carry a different risk profile from shard rotations. The shard structure may remain intact, but the environment executing the cryptographic operations changes. This can affect:
- Hardware trust anchors: If an HSM is replaced, the trust relationship between the HSM and the signing nodes must be re-established before the new hardware is used in production. A FIPS 140-compatible HSM combined with a trusted execution environment provides the hardware-level attestation needed to verify this re-establishment without manual inspection [cobo.com].
- Network topology changes: Upgrades that alter how signing nodes communicate can introduce latency or packet-loss conditions that cause signing timeouts. These should be stress-tested in a staging environment that mirrors the production transaction load.
- API versioning: When custody infrastructure is exposed through APIs, a version upgrade that changes signing request formats can silently break integrations downstream. Enterprises should maintain backward compatibility for at least one API version during any upgrade window.
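A common way to honor that backward-compatibility rule is to accept both the previous and the current request formats at the API edge, normalize them into one internal type, and reject unknown versions loudly rather than guessing. In the Python sketch below, the `api_version` field and both payload layouts are hypothetical; they stand in for whatever a real custody API actually defines.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SignRequest:
    """Internal, version-independent signing request."""
    request_id: str
    wallet_id: str
    chain: str
    payload_hex: str


def parse_sign_request(body: dict) -> SignRequest:
    version = str(body.get("api_version", "1"))  # assumption: missing field means v1
    if version == "2":
        return SignRequest(
            request_id=body["request_id"],
            wallet_id=body["wallet_id"],
            chain=body["chain"],
            payload_hex=body["payload_hex"],
        )
    if version == "1":
        # Hypothetical v1 layout with older field names, kept for one upgrade window.
        return SignRequest(
            request_id=body["id"],
            wallet_id=body["wallet"],
            chain=body["network"],
            payload_hex=body["tx_data"],
        )
    # Fail loudly instead of silently mis-parsing a format we do not understand.
    raise ValueError(f"unsupported api_version: {version!r}")
```

Because both versions map to the same internal type, downstream signing logic does not change during the upgrade window, and retiring v1 later means deleting one branch.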
The governance principle here is consistent: treat every infrastructure upgrade as a governed operational event with defined entry and exit criteria, not as a background maintenance task. This means:
- Document the current state configuration before any change begins.
- Define the acceptance criteria the new configuration must meet before it becomes the signing authority.
- Establish a rollback path that can be executed without losing transaction history.
- Obtain sign-off from compliance and security stakeholders, not just engineering, before go-live.
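These entry and exit criteria translate naturally into a go-live gate that stays closed until every condition is met. The Python sketch below is illustrative only; the `ChangeEvent` fields and the required sign-off roles are assumptions modeled on the four points above, not a prescribed schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumption: roles whose approval is required before go-live.
REQUIRED_SIGNOFFS = frozenset({"engineering", "security", "compliance"})


@dataclass
class ChangeEvent:
    name: str
    baseline_snapshot: dict | None = None  # documented pre-change configuration
    acceptance_results: dict[str, bool] = field(default_factory=dict)  # criterion -> passed?
    rollback_plan: str | None = None  # must not discard transaction history
    signoffs: set[str] = field(default_factory=set)

    def blockers(self) -> list[str]:
        """List every unmet exit criterion; empty means the change may go live."""
        issues: list[str] = []
        if not self.baseline_snapshot:
            issues.append("current-state configuration not documented")
        if not self.acceptance_results:
            issues.append("no acceptance criteria defined")
        failed = sorted(k for k, ok in self.acceptance_results.items() if not ok)
        if failed:
            issues.append("acceptance criteria failing: " + ", ".join(failed))
        if not self.rollback_plan:
            issues.append("no rollback path defined")
        missing = sorted(REQUIRED_SIGNOFFS - self.signoffs)
        if missing:
            issues.append("missing sign-off: " + ", ".join(missing))
        return issues

    def can_go_live(self) -> bool:
        return not self.blockers()
```

Returning a list of blockers, rather than a bare yes or no, gives compliance and security reviewers a concrete record of why a change was or was not cleared.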
What Role Does the Policy Engine Play During Transitions?
Building on the governance framework above, the harder question is how automated controls behave when the custody layer is in a transitional state.
A policy engine that converts risk signals into automated transaction controls is a stability asset during any handoff [vaultody.com], but only if it is designed to treat transitional states as a distinct operational mode rather than an error condition.
During a shard rotation or infrastructure upgrade, the policy engine should:
- Temporarily hold transactions above a high-value threshold until the signing environment has been fully validated, reducing the blast radius of any configuration error.
- Flag, not block, mid-range transactions so that compliance teams have visibility without creating a full payment freeze.
- Maintain AML screening continuity regardless of what is happening at the custody layer. Compliance obligations do not pause because infrastructure is changing [ripple.com].
- Log all policy decisions made during the transition window with a marker indicating the custody state at the time of each decision.
This separation of the policy layer from the signing layer is what allows enterprises to keep risk controls active and auditable even when the cryptographic infrastructure beneath them is changing.
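The transition-mode behavior described above might look like the following Python sketch. The thresholds, the three custody states, the sanctioned-set lookup standing in for real AML screening, and the `evaluate` function are all hypothetical; the point is the ordering: AML screening runs first in every state, high-value transactions are held only while the environment is unvalidated, and every decision carries a custody-state marker.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum

# Hypothetical thresholds; real values come from the institution's risk policy.
HIGH_VALUE = 1_000_000
MID_VALUE = 50_000


class CustodyState(Enum):
    STABLE = "stable"
    TRANSITION_UNVALIDATED = "transition_unvalidated"
    TRANSITION_VALIDATED = "transition_validated"


@dataclass(frozen=True)
class Decision:
    action: str         # "allow", "flag", "hold" or "reject"
    reason: str
    custody_state: str  # custody-state marker logged with every decision


def evaluate(amount: int, counterparty: str, state: CustodyState,
             sanctioned: set[str], log: list[Decision]) -> Decision:
    if counterparty in sanctioned:
        # Stand-in for AML screening, which applies in every custody state.
        decision = Decision("reject", "AML screening failed", state.value)
    elif state is CustodyState.TRANSITION_UNVALIDATED and amount >= HIGH_VALUE:
        # Hold, not reject: release once the signing environment is validated.
        decision = Decision("hold", "high-value hold during unvalidated transition", state.value)
    elif state is not CustodyState.STABLE and amount >= MID_VALUE:
        # Flag for compliance visibility without freezing payments.
        decision = Decision("flag", "elevated-value transaction during transition", state.value)
    else:
        decision = Decision("allow", "within policy", state.value)
    log.append(decision)
    return decision
```

Because `evaluate` never calls into the signing layer, it behaves the same whichever shard set or hardware happens to be active underneath it.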
Frequently Asked Questions
What is MPC key shard rotation? It is the process of replacing the cryptographic shares that collectively represent a private key. It is performed to reduce long-term key exposure, comply with security policies, or replace a signing party without moving the underlying assets.
Can transactions continue during a shard rotation? Yes, if the rotation is designed with a parallel activation window. Signing requests should be routed to the validated shard set at all times. Unvalidated configurations should never be given live signing authority.
How long does a typical shard rotation take? Duration depends on the number of signing parties, the complexity of the validation checks, and the volume of in-flight transactions. There is no universal figure; enterprises should benchmark their own environment in staging before executing in production.
Does a shard rotation affect compliance reporting? It should not, provided the audit trail records which shard configuration authorized each transaction. Compliance teams should receive a transition summary documenting the exact deactivation and activation timestamps.
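A transition summary of this kind can be derived mechanically from the audit trail. The Python sketch below assumes hypothetical event records with `type`, `config`, and `ts` keys (plus `tx` for signing events); the event type names are illustrative, not a standard.

```python
from __future__ import annotations

from collections import Counter


def transition_summary(events: list[dict], old_cfg: str, new_cfg: str) -> dict:
    """Summarize one shard rotation for compliance from ordered audit events."""

    def first_ts(event_type: str, cfg: str):
        # Timestamp of the first matching event, or None if it never happened.
        return next(
            (e["ts"] for e in events if e["type"] == event_type and e["config"] == cfg),
            None,
        )

    signed = [e for e in events if e["type"] == "signed"]
    return {
        "old_config": old_cfg,
        "old_deactivated_at": first_ts("config_deactivated", old_cfg),
        "new_config": new_cfg,
        "new_sole_authority_at": first_ts("config_sole_authority", new_cfg),
        "transactions_by_config": dict(Counter(e["config"] for e in signed)),
        # Anything listed here was signed outside the rotation's two configs.
        "unrecognized_configs": sorted({e["config"] for e in signed} - {old_cfg, new_cfg}),
    }
```

An empty `unrecognized_configs` list is a quick check that no transaction was signed outside the two configurations involved in the rotation.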
What is the biggest mistake enterprises make during custody infrastructure upgrades? Treating the upgrade as purely a technical event and excluding compliance and risk stakeholders from the sign-off process. Governance gaps during transitions are among the most common sources of audit findings [ripple.com].
How does HSM replacement differ from a shard rotation? A shard rotation changes the cryptographic material. An HSM replacement changes the hardware environment executing the cryptographic operations. Both require validation before the new component is given signing authority, but the verification steps are different.
Is MPC custody suitable for enterprises that need frequent infrastructure changes? Yes. MPC is specifically designed to support key lifecycle management without asset movement [blockdaemon.com]. Frequent upgrades are manageable when the governance framework treats them as routine operational events rather than exceptional ones.
About Cregis
Cregis is an enterprise-grade infrastructure layer for digital asset custody, built on nine years of institutional deployment experience across 3,500+ institutions in 50+ countries. Its Trust Vault Security Framework integrates MPC with FIPS 140-compatible HSMs and trusted execution environments to deliver a top-tier industry security standard for institutions that cannot afford gaps in custody continuity. Cregis serves banks, payment service providers, OTC desks, and digital asset exchanges through deployment models tailored to institutional infrastructure requirements, with certifications including SOC 2 Type II, ISO 27001, and PCI DSS.
Speak with a Cregis infrastructure specialist to learn how the Trust Layer supports operational continuity through every custody transition. Visit Cregis.
Founded in 2017, Cregis is a global leader in enterprise-grade digital asset infrastructure, providing secure, scalable and efficient management solutions for institutional clients.
Built to solve the challenges of fragmented blockchain systems and asset security risks, Cregis delivers MPC-based self-custody wallets, WaaS solutions, and Payment Engine, featuring collaborative asset control and a compliance-ready ecosystem.
To date, Cregis has served over 4,000 institutional clients globally. Our solutions empower exchanges, fintech platforms, and Web3 enterprises to adopt blockchain technology with confidence. Backed by years of proven expertise in blockchain and security, Cregis helps businesses accelerate their Web3 transformation and unlock global digital asset opportunities.

