Reproducible codesigning on Apple Silicon
For people who expect reproducible builds, Apple Silicon machines
provide an interesting challenge. Apple Silicon requires arm64
binaries, including command line tools you build yourself, be
codesigned. This change is mostly transparent to developers, because
Apple updated their linker to automatically ad-hoc sign
binaries1. Unfortunately, if you're interested in producing binaries
that support both Intel Macs and Apple Silicon Macs, you likely want to
produce a fat binary
. When codesigning this binary you hit
some behavior that depends on your current machine's architecture.
Example
You can consistently produce the same result across multiple machines when compiling a binary without signing it. Here's an example with a simple C program:
$ echo "int main() { return 0; }" > main.c
$ clang main.c -Wl,-no_adhoc_codesign -arch arm64 -arch x86_64 -o main
$ shasum main
113033b3d9a247210b49a476bbfadb2e347846fe main
The shasum
of main
should always be the same regardless of your host
machine2. On Apple Silicon machines you can see this binary has the
same sha1
even if you run clang
under Rosetta 23:
$ arch -x86_64 clang main.c -Wl,-no_adhoc_codesign -arch arm64 -arch x86_64 -o main
$ shasum main
113033b3d9a247210b49a476bbfadb2e347846fe main
The issue is introduced when you codesign
the binary on Apple Silicon
machines versus Intel machines. You can immediately see the
difference3:
$ codesign --force --sign - main
$ shasum main
84631e812bd480c306766ba03a728dd2565dd672 main
% arch -x86_64 codesign --force --sign - main
% shasum main
f631b6c0daf3ffd0bb5f65d19fa045acf447a72d main
We get closer to identifying the problem when you compare the details of these differences:
$ codesign --force --sign - main
$ codesign -dvvv main > arm.txt 2>&1
$ arch -x86_64 codesign --force --sign - main
$ codesign -dvvv main > intel.txt 2>&1
$ diff -Nur intel.txt arm.txt
--- intel.txt 2021-10-05 21:26:32.731918710 -0700
+++ arm.txt 2021-10-05 21:26:29.473702845 -0700
@@ -1,14 +1,14 @@
Executable=/private/tmp/main
-Identifier=main-555549445413bc88d3b13dfa855a7f47cf229020
+Identifier=main-55554944e4472179d9ec3bff92ff3f8e6d013184
Format=Mach-O universal (x86_64 arm64)
CodeDirectory v=20400 size=358 flags=0x2(adhoc) hashes=5+2 location=embedded
Hash type=sha256 size=32
-CandidateCDHash sha256=78fab81d4fef72ae942e3272ba2c9085e1893828
-CandidateCDHashFull sha256=78fab81d4fef72ae942e3272ba2c9085e1893828a77095ae517ab7d3b8229ad0
+CandidateCDHash sha256=1a9fcc4eabde35d87326380f5eca6672a1c96f78
+CandidateCDHashFull sha256=1a9fcc4eabde35d87326380f5eca6672a1c96f78f252bbecec7f19ffdd56e420
Hash choices=sha256
-CMSDigest=78fab81d4fef72ae942e3272ba2c9085e1893828a77095ae517ab7d3b8229ad0
+CMSDigest=1a9fcc4eabde35d87326380f5eca6672a1c96f78f252bbecec7f19ffdd56e420
CMSDigestType=2
-CDHash=78fab81d4fef72ae942e3272ba2c9085e1893828
+CDHash=1a9fcc4eabde35d87326380f5eca6672a1c96f78
Signature=adhoc
Info.plist=not bound
TeamIdentifier=not set
There are a few fields that differ here, but, given that most of them
seem to be hashes, it felt natural to focus on the Identifier
field
(especially since that appears to contain our file name).
In the manual page for codesign(1)
, we see some useful
information related to identifier
:
-i, --identifier identifier
During signing, explicitly specify the unique identifier string that is
embedded in code signatures. If this option is omitted, the identifier
is derived from either the Info.plist (if present), or the filename of
the executable being signed, possibly modified by the --prefix option.
It is a very bad idea to sign different programs with the same
identifier.
This gives us further hints to what is going on. Specifically, since we
do not have an Info.plist
, the logic must not be deriving the
identifier in that way. Given this information we can assume
codesign
is falling back to some other value here, but it's still
not clear why it isn't reproducible across architectures.
Luckily, Apple's open source page has quite a few internal
libraries, which in this case covers our issue. Looking through the
Security
project's source code4, we can immediately see
some useful information for this field:
@constant kSecCodeSignerIdentifier If present, a CFString that explicitly specifies
the unique identifier string sealed into the code signature. If absent, the identifier
is derived implicitly from the code being signed.
This begs the question: How is this information derived if there is no
explicit identifier? Tracing this constant through the code, we can see
it sets the mIdentifier
field, which is otherwise only set through
this logic:
identifier = rep->recommendedIdentifier(*this);
if (identifier.find('.') == string::npos)
identifier = state.mIdentifierPrefix + identifier;
if (identifier.find('.') == string::npos && isAdhoc())
identifier = identifier + "-" + uniqueName();
The prefix referenced here is from the --prefix
field mentioned in the
codesign(1)
man page. Since we're also not passing that we can
ignore this logic and assume the uniqueName()
logic is key here (which
also explains the -
we see after our filename). As we trace this logic
through the codebase, we begin to see the core issue:
//
// Generate a unique string from our underlying DiskRep.
// We could get 90%+ of the uniquing benefit by just generating
// a random string here. Instead, we pick the (hex string encoding of)
// the source rep's unique identifier blob. For universal binaries,
// this is the canonical local architecture, which is a bit arbitrary.
// This provides us with a consistent unique string for all architectures
// of a fat binary, *and* (unlike a random string) is reproducible
// for identical inputs, even upon resigning.
//
std::string SecCodeSigner::Signer::uniqueName() const {
CFRef<CFDataRef> identification = rep->identification();
...
}
//
// We choose the binary identifier for a Mach-O binary as follows:
// - If the Mach-O headers have a UUID command, use the UUID.
// - Otherwise, use the SHA-1 hash of the (entire) load commands.
//
CFDataRef MachORep::identification() {
std::unique_ptr<MachO> macho(mainExecutableImage()->architecture());
return identificationFor(macho.get());
}
CFDataRef MachORep::identificationFor(MachO *macho) {
// if there is a LC_UUID load command, use the UUID contained therein
if (const load_command *cmd = macho->findCommand(LC_UUID)) {
const uuid_command *uuidc = reinterpret_cast<const uuid_command *>(cmd);
// uuidc->cmdsize should be sizeof(uuid_command), so if it is not,
// something is wrong. Fail out.
if (macho->flip(uuidc->cmdsize) != sizeof(uuid_command))
MacOSError::throwMe(errSecCSSignatureInvalid);
char result[4 + sizeof(uuidc->uuid)];
memcpy(result, "UUID", 4);
memcpy(result+4, uuidc->uuid, sizeof(uuidc->uuid));
return makeCFData(result, sizeof(result));
}
// otherwise, use the SHA-1 hash of the entire load command area (this is way, way obsolete)
...
}
The gist of this logic is to fetch the UUID embedded in every binary and use that to derive the identifier. The reason this isn't reproducible across architectures is because the UUID is based on the content of each binary, which differs across architectures. You can print these UUIDs with:
$ dwarfdump -u main
UUID: 5413BC88-D3B1-3DFA-855A-7F47CF229020 (x86_64) main
UUID: E4472179-D9EC-3BFF-92FF-3F8E6D013184 (arm64) main
The fallback logic we see above relies on the contents of binary's load commands which unfortunately also differs based on architecture.
Summary
While this was a very informative deep dive into this logic, if you rely
on reproducible binaries and want to support Apple Silicon machines, you
need to do 2 things for binaries without Info.plist
files:
- Don't allow the linker to automatically sign your binaries by passing
-no_adhoc_codesign
- Pass an explicit identifier when linking binaries with
--identifier
to thecodesign
invocation
I filed a radar about this behavior: FB9681559 (Make codesign for fat binaries reproducible across architectures).
Updates
- After working around this codesign issue it was noted that the same
issue with random binary names also affects the UUID computation. To
workaround this you can pass
-Wl,-no_uuid
but note this has a negative impact on LLDB attach times for binaries if you need to debug them. - I landed a fix in clang to make these temporary binary names reproducible. It will hopefully be part of a future version of Apple's clang fork.
-
This behavior can be disabled by passing
-no_adhoc_codesign
to link invocations. WithOTHER_LDFLAGS
in Xcode you need to pass-Wl,-no_adhoc_codesign
since link invocations go throughclang
, not directly told
↩ -
As long as you're using the same version of
clang
. This example was built with Xcode 13.0 13A233 ↩ -
Luckily if you run this on an Intel machine, instead of through
arch
's Intel emulation, you'll get the same results ↩ ↩2 -
I was looking at 59754.140.13 ↩