Skip to content

Updated aliases#2326

Open
bettio wants to merge 62 commits into
atomvm:release-0.7from
bettio:updated-aliases
Open

Updated aliases#2326
bettio wants to merge 62 commits into
atomvm:release-0.7from
bettio:updated-aliases

Conversation

@bettio

@bettio bettio commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

This revamps PR #2027

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

@bettio bettio changed the base branch from main to release-0.7 June 2, 2026 12:10
@petermm

This comment was marked as outdated.

@petermm

This comment was marked as outdated.

Comment thread src/libAtomVM/nifs.c Fixed
Comment thread tests/test-enif.c Fixed
@petermm

This comment was marked as outdated.

@bettio bettio force-pushed the updated-aliases branch from ce79b10 to f6af023 Compare June 7, 2026 19:53
@petermm

petermm commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Fresh review of everything, usual caveats, assume we need sapiens from here, but very close to ready, issues seems very minor and nitpicky:

PR Review — Process aliases (OTP ≥ 26 semantics)

Scope: last 27 commits (dceedea65..f6af02332), adding erlang:alias/0,
erlang:unalias/1, erlang:monitor/3 with {alias, Mode}, spawn_opt
{monitor, MonitorOpts}, sending to an alias reference, and a new
process reference boxed term shape.

Reviewed: full diff plus mailbox.c, context.c, term.c, term.h,
external_term.c, nifs.c, globalcontext.c, scheduler.c,
opcodesswitch.h, jit.c. Findings cross-checked against the actual tree and
against the pre-PR baseline (HEAD~27).

Overall this is a careful, well-commented implementation with good test
coverage. The reference-term representation, 32-bit padding, static asserts,
and the alias-via-signal design (validating the alias in the owner's own
context, never from the sender) are sound. The issues below are worth
addressing; only #1 and #2 affect correctness in normal use.

Severity summary

# Severity Area Status
1 Major spawn_opt silently ignores {monitor, BadTerm} New in PR
2 Major Alias message vs. earlier DOWN signal ordering New in PR
3 Minor ETF process-ref accepts owner pid 0 (INVALID_PROCESS_ID) New in PR
4 Minor term_compare word order for process refs differs from ETF/external New in PR
5 Minor Alias→normal conversion drops message on OOM (reply_demonitor non-transactional) New in PR
6 Nit context_add_monitor frees alias via embedded struct Monitor * New in PR
7 Pre-existing monitor/3 installs caller-side monitor after unlocking target (DOWN race) Pre-existing
8 Pre-existing mailbox_send_monitor_signal OOM is unchecked Pre-existing
Positive monitor/3 kind-mismatch now unlocks before raising Fixed in PR

1. Major — spawn_opt silently ignores a malformed {monitor, BadTerm}

src/libAtomVM/nifs.c#L1572-L1586

if (monitor_term == TRUE_ATOM) {
    monitor_term = term_nil();
}
if (term_is_list(monitor_term)) {
    is_spawn_monitor = true;
    ...
}

monitor_term is interop_proplist_get_value_default(..., MONITOR_ATOM, term_invalid_term()).
For spawn_opt(F, [{monitor, foo}]) or [{monitor, 123}], monitor_term
becomes foo/123 — not a list and not TRUE_ATOM — so it falls through and
the process is spawned with no monitor and no error. OTP raises badarg.
A typo therefore silently disables monitoring.

Suggested fix:

-    if (monitor_term == TRUE_ATOM) {
-        monitor_term = term_nil();
-    }
-    if (term_is_list(monitor_term)) {
-        is_spawn_monitor = true;
-
+    if (!term_is_invalid_term(monitor_term)) {
+        if (monitor_term == TRUE_ATOM) {
+            monitor_term = term_nil();
+        } else if (UNLIKELY(!term_is_list(monitor_term))) {
+            monitor_destroy(new_link);
+            monitor_destroy(self_link);
+            context_destroy(new_ctx);
+            RAISE_ERROR(BADARG_ATOM);
+        }
+        is_spawn_monitor = true;
+
         if (UNLIKELY(term_is_invalid_term(parse_monitor_opts(ctx, monitor_term, &is_alias, &alias_type)))) {

(new_link/self_link are NULL unless link was requested; monitor_destroy(NULL)
is a no-op, so the cleanup is safe.) A test like
{'EXIT', {badarg, _}} = (catch spawn_opt(fun() -> ok end, [{monitor, foo}]))
should accompany it.


2. Major — alias message can be delivered after an earlier DOWN that should have deactivated the alias

src/libAtomVM/mailbox.c#L399-L433,
interacting with PROCESS_SIGNAL_MESSAGES in
src/libAtomVM/opcodesswitch.h#L791
and src/libAtomVM/jit.c#L901.

process_outer_list resolves all AliasMessageSignals (converting/
dropping them, and running reply_demonitor side effects) while splitting the
outer list, but other signals (MonitorDownSignal, DemonitorSignal, …) are
only collected and returned for the signal loop to apply afterwards.

So within one drain the alias state is read before older monitor-down
signals are applied. For {alias, demonitor} / {alias, reply_demonitor} the
DOWN deactivates the alias
(context.c#L443-L470).
Trigger sequence in one batch, in received order:

  1. monitored process dies → MonitorDownSignal queued;
  2. a third party sends to the (now-doomed) alias → AliasMessageSignal queued.

OTP applies the DOWN first, deactivates the alias, then drops the alias
message. AtomVM converts and delivers the alias message because the DOWN has
not been processed yet. The PR's own comments stress that alias sends keep
order against plain sends; that holds, but the ordering guarantee does not
extend across the alias-vs-other-signal boundary.

This is a narrow interleaving (requires a DOWN and an alias send in the same
drain, for a non-explicit_unalias alias), so it is unlikely in practice but
is a genuine semantic gap. The clean fix is to handle AliasMessageSignal in
the same received-order stream as the other signals (in
PROCESS_SIGNAL_MESSAGES) rather than resolving aliases up-front in
process_outer_list. That is a non-trivial refactor; at minimum, document the
limitation and add a regression test for the interleaving.


3. Minor — ETF process-reference parse accepts owner pid 0

src/libAtomVM/external_term.c#L1011-L1022

uint32_t process_id = data[2];
// Reject a malformed process id (the value comes from untrusted input).
if (process_id > TERM_MAX_LOCAL_PROCESS_ID) {
    return term_invalid_term();
}
return term_make_process_reference(process_id, ticks, heap);

0 is INVALID_PROCESS_ID (term.h), the sentinel used everywhere to mean
"short ref / no owner". A deserialized process reference owned by pid 0 can
never name a real alias owner and blurs the short-ref vs process-ref
distinction. Reject it:

-                    if (process_id > TERM_MAX_LOCAL_PROCESS_ID) {
+                    if (process_id == INVALID_PROCESS_ID || process_id > TERM_MAX_LOCAL_PROCESS_ID) {
                         return term_invalid_term();
                     }

4. Minor — term_compare word order for process refs differs from ETF / external refs

src/libAtomVM/term.c#L707-L716
and src/libAtomVM/term.c#L793-L812

term_compare lays out a local process ref as
[process_id, ticks_hi, ticks_lo]:

} else if (len == 3) {
    data[0] = term_process_ref_to_process_id(t);
    int64_t t_ticks = term_to_ref_ticks(t);
    data[1] = t_ticks >> 32;
    data[2] = (uint32_t) t_ticks;
    ...

but serialization stores the words as [ticks_hi, ticks_lo, process_id]
(external_term.c#L547-L556),
and that same wire order is what term_get_external_reference_words() returns.

Local-vs-local comparison is internally self-consistent (the test_refs_ordering
expectations encode that order), so this is not a sorting bug for ordinary
local refs. The inconsistency only shows up when a local process ref is
compared against an external 3-word ref in the mixed branch around
term.c:793local_data is pid-first while the external words are
ticks-first, giving an order that disagrees with the local-only path. Given
AtomVM's limited distribution support this is low-impact, but aligning the two
layouts (use [ticks_hi, ticks_lo, process_id] in term_compare too) removes
a latent inconsistency and matches the existing len==2 / len==4 conventions
that key on ticks first.


5. Minor — alias→normal conversion drops the message on OOM (reply_demonitor becomes non-transactional)

src/libAtomVM/mailbox.c#L410-L433

term message = context_process_alias_message_signal(ctx, CONTAINER_OF(current, struct TermSignal, base));
if (!term_is_invalid_term(message)) {
    MailboxMessage *converted = mailbox_message_create_from_term(NormalMessage, message);
    // Best effort: on out of memory the message is dropped. ...
    if (converted != NULL) { ... }
}
mailbox_message_dispose(current, &ctx->heap);

The code already acknowledges this: for reply_demonitor the alias has been
deactivated and a DemonitorSignal already sent inside
context_process_alias_message_signal, so an allocation failure here loses the
message that those side effects were paying for. The new
mailbox_message_create_from_term allocation is avoidable: the incoming
struct TermSignal already owns a heap-copied term tree, and Message and
struct TermSignal share the { MailboxMessage base; HeapFragment; term }
prefix layout, so the signal can be re-typed as a NormalMessage in place
instead of copying. That both removes the OOM drop and saves an allocation/
free per alias send. (Larger change; flagging the trade-off rather than
prescribing a diff.)


6. Nit — context_add_monitor frees an alias monitor through the embedded pointer

src/libAtomVM/context.c#L1042-L1051

case CONTEXT_MONITOR_ALIAS: {
    ...
    if (UNLIKELY(... )) {
        free(new_monitor);   // <- frees via struct Monitor*
        return false;
    }

Every other branch frees the container (free(new_local_monitor), …). This
works only because struct Monitor is the first member of struct MonitorAlias.
Use the dedicated helper for consistency and robustness against future layout
changes:

-                        free(new_monitor);
+                        monitor_destroy(new_monitor);
                         return false;

monitor_destroy recovers the container with CONTAINER_OF, matching the
intent the PR introduced elsewhere.


7. Pre-existing — monitor/3 installs the caller-side monitor after unlocking the target (DOWN race)

src/libAtomVM/nifs.c#L5171-L5176

mailbox_send_monitor_signal(target, MonitorSignal, other_monitor);
globalcontext_get_process_unlock(ctx->global, target);

context_add_monitor(ctx, self_monitor);
if (is_alias) {
    context_add_monitor(ctx, alias_monitor);
}

Under SMP the target can receive the MonitorSignal, exit, and post its
MonitorDownSignal before the caller runs context_add_monitor(ctx, self_monitor).
The caller then processes a DOWN for which it has no monitor and drops it,
returning a ref that never yields a DOWN. This exact shape exists at
HEAD~27 (caller-side context_add_monitor was already after the unlock), so
it is not introduced by this PR — but the new alias path inherits it, and
fixing it would mean keeping the target lock held until the caller-side
monitor/alias entries are installed. Out of scope for a strict review of this
PR, but worth a follow-up.


8. Pre-existing — mailbox_send_monitor_signal OOM is silently ignored

src/libAtomVM/mailbox.c, called at
src/libAtomVM/nifs.c#L5171

mailbox_send_monitor_signal returns void; on malloc failure it logs and
returns, leaking other_monitor and leaving the caller believing the monitor
was installed. Pre-existing (unchanged by this PR), but the alias monitor/3
path now relies on it too. A follow-up could make it return bool and let the
caller clean up + raise OUT_OF_MEMORY.


Positive notes

  • Lock leak fixed: monitor/3 now calls globalcontext_get_process_unlock
    before RAISE_ERROR(BADARG_ATOM) on the process/port kind mismatch
    (nifs.c#L5122-L5124);
    the baseline leaked the target lock there.
  • spawn_opt atomicity: all fallible steps (option parsing, monitor/alias
    allocation, result-space reservation) now run before any monitor/link is
    published, with monitor_destroy on every error path — this correctly
    prevents the spurious {'EXIT', Pid, normal} the test
    test_spawn_opt_link_monitor_badarg_is_atomic guards against.
  • Reference shapes: TERM_BOXED_REFERENCE_SHORT/PROCESS/RESOURCE_SIZE are
    size-distinguishable on both 32- and 64-bit, the _Static_asserts enforce it,
    and the 32-bit resource padding word is initialized in term_from_resource
    so GC never copies uninitialized memory.
  • Alias-via-signal design: validating the alias in the owner's own context
    (never from the sender) is the right race-free model; context_destroy uses
    the plain mailbox_process_outer_list so it never takes the process-table
    lock from the alias path — keep that invariant.
  • scheduler.c queue reset: resetting processes_list_head after
    list_remove so context_destroy's dequeue is a no-op is a clean fix for the
    never-scheduled-process case.

Suggested follow-up tests

  • spawn_opt(F, [{monitor, foo}]) raises badarg (finding 1).
  • DOWN queued before an alias send in one drain → alias message dropped for
    {alias, demonitor} / {alias, reply_demonitor} (finding 2).
  • Round-trip a process ref through term_to_binary/binary_to_term and compare
    ordering against the equivalent local ref (finding 4).
  • Malformed ETF process ref with owner pid 0 is rejected (finding 3).

mat-hek and others added 23 commits June 10, 2026 10:33
Implement Erlang process aliases together with the reference-representation
rework they require, rebased onto release-0.7 and squashed into one commit.

Public API (libs/estdlib/src/erlang.erl):
- alias/0, unalias/1
- monitor/3 with {alias, explicit_unalias | demonitor | reply_demonitor}
- spawn_opt {monitor, [monitor_option()]}

Runtime:
- New process-reference term carrying ref_ticks + owner process id (RefData),
  alongside short/local references; BEAM-compliant reference ordering and
  binary_to_term/term_to_binary round-tripping (term.{c,h}, external_term.c).
- Monitors store RefData; alias monitors (CONTEXT_MONITOR_ALIAS) with the three
  alias modes and their lifecycle (context.{c,h}).
- send/2, the send opcode and the JIT send path resolve an alias to its target
  process and drop reply_demonitor aliases after first use (nifs.c,
  opcodesswitch.h, jit.c).
- Reference size macros: REF_SIZE -> TERM_BOXED_REFERENCE_SHORT_SIZE, plus
  TERM_BOXED_REFERENCE_PROCESS_SIZE and TERM_BOXED_REFERENCE_RESOURCE_SIZE.

Tests: alias coverage added to test_monitor, test_refs_ordering and
test_binary_to_term.
Sending to a process reference resolved and deactivated the alias from the
sending process: it walked and mutated the target's monitor list
(context_find_alias / context_unalias) while holding only the processes_table
read lock. That lock only guards against the target being freed, not against
the target mutating its own monitor list (via alias/0, unalias/1, demonitor or
process teardown), so a send racing with a monitor change could corrupt the
list or free a node used by another scheduler -- a data race / use-after-free
on SMP.

Route alias sends through the mailbox instead, the same way monitor and
demonitor operations are already delivered across processes. The send paths now
post an AliasMessageSignal carrying {Ref, Message} via the shared helper
globalcontext_send_message_to_alias; the owner validates the alias against its
own monitor list when it drains signals (context_process_alias_message_signal)
and either delivers the message or drops it for an inactive alias. All
monitor-list access now happens in the owning process, and the three send paths
(erlang:send/2, the send opcode and the JIT send) share one helper.

Signed-off-by: Davide Bettio <davide@uninstall.it>
A monitor created with {alias, reply_demonitor} previously only deactivated the
alias on the first reply through it; the underlying monitor stayed active, so
the owner could still receive a stale 'DOWN'. Match OTP: on the first message
received through the alias, also remove the monitor and signal the monitored
process to drop its side, as erlang:demonitor/1 would.

Add a regression test asserting that, after a reply through a reply_demonitor
alias, no 'DOWN' is delivered when the monitored process later exits.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Aliases require OTP 24, so the alias tests guarded their alias-specific cases
behind a runtime OTP-version check (plus a make_ref-based fallback for older
BEAM). AtomVM no longer supports OTP < 26, where aliases are always available,
so those checks are dead code: run the alias cases unconditionally and drop the
is_atomvm_or_otp_version_at_least/1 helper and the mock-alias fallback.

Signed-off-by: Davide Bettio <davide@uninstall.it>
For {alias, demonitor} and {alias, reply_demonitor}, the alias is deactivated
when the monitor is removed, including the automatic removal when a 'DOWN' is
delivered (as erlang:demonitor/1 already does for an explicit demonitor).
context_process_monitor_down_signal removed the monitor but left the alias
active, so a third party could still reach the owner through the alias after
the monitored process had exited.

Deactivate the alias (unless it is explicit_unalias) in both the local and
registered-name 'DOWN' paths, and add a regression test.

Signed-off-by: Davide Bettio <davide@uninstall.it>
nif_erlang_monitor locks the target process, then raises badarg when the object
type does not match the target (monitor(process, Port, ...) or vice versa). That
error path returned without releasing the processes-table lock, unlike the three
sibling out-of-memory error paths in the same function, leaking the lock and
deadlocking the next writer on SMP builds.

Unlock the target before raising. (A behavioural test would deadlock rather than
fail; verified by inspection against the sibling error paths.)

Signed-off-by: Davide Bettio <davide@uninstall.it>
Both nif_erlang_monitor and do_spawn published the monitor/alias state (queuing a
MonitorSignal to the target and adding monitors to the contexts) and only then
reserved heap for the returned reference. An out-of-memory at that last step
raised, leaving the monitor half-installed: the target could later send a 'DOWN'
for a reference the caller never received.

Reserve the result space before publishing, freeing the monitor structs on
failure. GC at the new point is safe -- the monitor structs hold only an
immediate pid and a plain RefData, none reachable from the heap. Unwinding the
separate spawn_opt link side effect on a later OOM is left out of scope.

Signed-off-by: Davide Bettio <davide@uninstall.it>
An alias send is delivered via an AliasMessageSignal so the owner validates the
alias in its own context (race-free). But mailbox_process_outer_list moved normal
messages straight to the inner mailbox while signals were processed afterwards and
re-posted, so `Alias ! m1, Pid ! m2` to the same receiver could deliver m2 before
m1 -- violating the single-sender message ordering guarantee.

Convert alias messages to normal messages in place during the outer-list reverse
(new mailbox_process_outer_list_with_aliases), prepending them into the normal
stream so they keep send order relative to plain messages from the same sender.
context_process_alias_message_signal now returns the payload instead of re-posting
it. Teardown and the port scheduler keep the plain mailbox_process_outer_list
(which returns alias signals to drop): they must not run the reply_demonitor path,
which takes the processes-table lock that context_destroy already holds.

Add a regression test (alias send then pid send keep order).

Signed-off-by: Davide Bettio <davide@uninstall.it>
term_compare sizes a process reference as 3 words (process_id + ref_ticks) in the
local-vs-local branch, but the mixed local-vs-external branch still sized it as 2
words, dropping process_id. A process reference was therefore ordered differently
depending on whether it was compared against a local or an external reference,
which can make the term order inconsistent. (The reviewer's "two process refs with
equal ref_ticks" case cannot actually occur: ref_ticks come from a global
monotonic counter, so they are unique.)

Size process references as 3 words in the mixed branch too and add the matching
extraction, mirroring the local-vs-local logic. Exercising this path requires an
external reference (distribution), so it is verified by inspection.

Signed-off-by: Davide Bettio <davide@uninstall.it>
binary_to_term of a NEWER_REFERENCE_EXT with len == 3 turned the third word
straight into a process reference id without bounding it. Note this was not a
memory-safety issue as the reviewer suggested: the id flows into
globalcontext_get_process_lock, which performs a linear search of the process
table (no out-of-bounds indexing), so an out-of-range value simply matches no
process. Still, validate it like a local pid (TERM_MAX_LOCAL_PROCESS_ID) to
reject clearly malformed input early.

Signed-off-by: Davide Bettio <davide@uninstall.it>
monitor(process, X, [{alias, ...}]) returned a plain short reference (no alias)
when X was self() or an already-dead process, so the returned reference could not
be used to send to the process or be passed to unalias/1. OTP returns a usable
alias in both cases.

Install the alias and return a process reference on both early-return paths. For
self the alias stays active (self never sends a DOWN). On the noproc path the
monitor is immediately removed by the noproc DOWN, so only an explicit_unalias
alias stays active (demonitor / reply_demonitor are deactivated right away, as at
a real DOWN). Add regression tests for both.

Signed-off-by: Davide Bettio <davide@uninstall.it>
A send to a reference that is not an active local process-reference alias (a
plain/short ref, a resource ref, or an external ref) is silently dropped, which
matches OTP for local references. Distributed aliases (external references) are
not supported, so a message addressed to a remote alias is lost rather than
routed over distribution. Add comments at the three send sites (erlang:send/2,
the send opcode and the JIT send); no behaviour change.

Signed-off-by: Davide Bettio <davide@uninstall.it>
On 32-bit, TERM_BOXED_REFERENCE_RESOURCE_SIZE is 5 (one more than header + refc +
mso cons) so reference shapes stay size-distinguishable, but term_from_resource
only wrote 4 words, leaving the trailing padding word uninitialized. It is not
read by comparison or serialization, but it is copied during GC, which is an
uninitialized-memory read (sanitizer / valgrind). Initialize it to nil. No effect
on 64-bit, where the resource size is exactly 4 words.

Signed-off-by: Davide Bettio <davide@uninstall.it>
…onitor

The CONTEXT_MONITOR_RESOURCE case fell through to the following monitor cases
when the ref_ticks did not match. It is benign today (the next cases only break),
but the alias commit added CONTEXT_MONITOR_ALIAS to the same switch, so tighten
the fall-through. No behaviour change.

Signed-off-by: Davide Bettio <davide@uninstall.it>
globalcontext_init_process appends every new context to the scheduler waiting queue, but
context_destroy never removed it. Destroying a process that was never scheduled -- as the
spawn NIFs do on option errors, or port drivers on initialization failures -- left a
dangling queue entry pointing into freed memory, corrupting the queue on the next
scheduler operation. Pre-existing upstream; exposed by the spawn_opt atomicity test added
in the next commit, the first to exercise such a path.

Dequeue in context_destroy under the processes spinlock, and reset the queue item at the
scheduler's own dequeue sites so the destroy-time dequeue is a no-op there. A concurrent
re-queue cannot happen: scheduler_make_ready refuses Killed and Spawning processes under
the same spinlock, and one of these flags is set on every destroy path.

Signed-off-by: Davide Bettio <davide@uninstall.it>
do_spawn published the link before the monitor phase, which can still fail parsing the
monitor options or on out-of-memory, including the result-space reservation. The error
paths destroyed the new context but left the caller's link entry installed, so the
destroyed context's teardown sent the caller a spurious {'EXIT', Pid, normal} -- delivered
as a message when trapping exits -- and process_info(links) reported a link to a dead pid,
for a spawn_opt call that raised. OTP's spawn_opt is atomic: either it returns and the
link exists, or it raises without side effects.

Run every fallible step -- option parsing, link and monitor allocations, and the result
space reservation -- before publishing anything; publishing itself cannot fail. This
closes the unwind residual left out of scope by the result-space commit. Add a regression
test checking that spawn_opt(F, [link, {monitor, [Bad]}]) raises badarg without leaving a
link behind or delivering an 'EXIT' message.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The BEAM conformance CI run (test-erlang -b) showed that on OTP,
monitor(process, self(), [{alias, _}]) is a no-op apart from returning a fresh reference:
no monitor is installed and no alias either -- sends to the returned reference are
dropped, and unalias/1 and demonitor(Ref, [info]) return false (verified on OTP 29).

This reverts the self half of "Return a real alias from monitor/3 on the self and noproc
paths", whose premise was wrong for self; the noproc half is conformant (its test passes
on the BEAM). Return a plain short reference and install nothing, matching the existing
monitor/2 self path, and update the test to assert the OTP behaviour.

Signed-off-by: Davide Bettio <davide@uninstall.it>
On the BEAM the test process is the erl_eval process, which is linked to init, so the
{links, []} assertion failed when running the tests with test-erlang -b. Snapshot the
links before the spawn_opt call and assert they are unchanged after; this still catches
a leaked link on AtomVM and is independent of the harness process's pre-existing links.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The monitor_*_new constructors return a pointer to the struct Monitor
embedded in their container struct, and the nifs.c error paths free()d
that pointer directly. That is only correct because the monitor happens
to be the first member of every container struct; recover the container
with CONTAINER_OF instead, through a new monitor_destroy that switches
on the monitor type, so a layout change cannot silently corrupt the
heap. No functional change.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The RefData arguments of the monitor constructors and term_from_ref_data
are only read, so take them as const pointers per the coding style.
Document monitor_alias_new, term_from_ref_data and
term_process_ref_to_process_id like their neighbours, and use the
until-now unused REFERENCE_PROCESS_PID_OFFSET for the process id word of
a process reference, which also drops a TERM_BYTES conditional. No
functional change.

Signed-off-by: Davide Bettio <davide@uninstall.it>
erlang:send/2 accepts process alias references since the aliases were
introduced, but send_destination() only allowed pids, ports and atoms.
Add reference() like OTP does, document the alias destination (and the
silent drop of non-alias references) on send/2, and mention the
{monitor, MonitorOpts} option and the {Pid, MonitorRef} result in the
spawn_opt docs.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Signed-off-by: Davide Bettio <davide@uninstall.it>
Nothing ever referenced it: reference flavors are distinguished by their
boxed size (see the comment above TERM_BOXED_REFERENCE_SHORT_SIZE), so
the enum was dead code. It also didn't follow the coding style, which
wants enumerations typedef'd with a lower_snake_case _t name.

Signed-off-by: Davide Bettio <davide@uninstall.it>
bettio added 19 commits June 10, 2026 10:35
The alias feature commit accidentally dropped the space in "// TODO" while
sweeping REF_SIZE through network_driver.c. Restore the original comment; the
line is unrelated to process aliases and clang-format flags the new form.

Signed-off-by: Davide Bettio <davide@uninstall.it>
unalias/1 validated its argument with term_is_local_reference, raising badarg
for a reference from another node. OTP 29 returns false instead (an external
reference can never be an alias of the calling process); only non-references
raise badarg. Sends to external references are already silently dropped, like
OTP when the node is not alive; the new test pins both behaviours, BEAM-validated.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The console driver rebuilt the io_reply reference from the request
reference ticks. That erases the flavor of a process reference: an alias
passed as ReplyAs came back as a short reference with the same ticks, which
the requester's receive does not match, so an io request through an alias
hung. It also crashed on a non-reference ReplyAs (the io protocol allows any
term) by reading ref ticks from an arbitrary term.

Put the incoming ReplyAs term in the reply instead: it lives in the request
message storage until the handler returns and port_send_message copies it.
This also drops a reference allocation from every console reply.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The select machinery (socket nowait handles, enif_select refs used by the
posix nifs) stores only the reference ticks and rebuilds the handle as a
plain reference in the select notification. An alias passed as a handle
would therefore come back as a reference the alias does not match, and the
caller would hang waiting for the notification.

On the BEAM an alias is a valid select handle and is echoed back verbatim.
Supporting that would mean carrying the owner process id through the select
machinery for a corner case with no practical use, so raise badarg instead
and keep the code simple; the divergence is called out in comments at both
validation sites.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The alias tests used a single 500 ms recv_one window both for positive
receives (the file convention elsewhere is 5000 ms) and for asserting that a
message does NOT arrive. Under valgrind or a loaded CI runner the positive
uses are the most likely false failure, and the two negative uses were
detection windows a reintroduced bug could outlast.

Raise recv_one to 5000 ms and assert the negatives behind deterministic
fences instead: a later monitor whose DOWN fires after the buggy one
(monitors fire in installation order), or a fence reply through the same
echo process (sends from one process keep their order). Bound the one
unbounded receive, and document that the same-batch DOWN test exercises the
same-batch path reliably only on SMP builds.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Quit the echo workers the alias tests spawn (the file convention; harmless
today only because each module runs in a fresh VM), rewrap a comment that
exceeded the 100-column print width, and restore the blank line between two
test functions in test_binary_to_term.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The helper shadowed the auto-imported spawn_monitor/2 BIF and made erlc emit
seven "ambiguous call of overridden auto-imported BIF" warnings on every
build of the test sources. Resolution picked the local function, so behavior
was correct, but the warnings were noise this branch introduced.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The demonitor mode also deactivates on DOWN delivery (not only on
demonitor/1), and reply_demonitor removes the monitor as well as the alias
when the first message sent via the alias is delivered (verified against
OTP 29). Also note that the OTP {tag, Term} option raises unsupported.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Two conformance cells the alias tests did not pin down yet: an alias used as
a map and ets key (ordinary reference term handling, value comparison), and
demonitor(Mon, [flush]) on an {alias, demonitor} monitor, where the alias
died with the monitor at DOWN delivery and flush removes the queued DOWN.
Both BEAM-validated.

Signed-off-by: Davide Bettio <davide@uninstall.it>
context_find_alias walks the whole monitor list and runs for every DOWN
delivery (twice: once in the outer-list split for same-batch ordering, once
in the down handler) and for every demonitor -- also for the vast majority
of processes that never create an alias. In a DOWN-heavy benchmark (50
batches of 1000 monitored process exits) this cost 25% over the pre-alias
baseline.

Track the number of active aliases in a 16-bit slice of the existing Context
bitfield word (no struct growth, owner-written only like the surrounding
bits) and short-circuit context_find_alias when it is zero. The benchmark
returns to the baseline noise envelope; processes that do use aliases keep
the previous behavior.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Pin two more behaviors verified against OTP 29: when {alias, _} appears twice
in the monitor options the last occurrence wins, and monitoring the caller
through its own registered name installs nothing, exactly like monitoring
self() directly.

Signed-off-by: Davide Bettio <davide@uninstall.it>
alias/0 becomes alias([]) with explicit_unalias as the default mode. The
reply option needs no new machinery: it maps onto the reply_demonitor alias
type, whose conversion path deactivates the alias when the first message
sent via it is delivered -- with no monitor to remove that is exactly the
OTP reply behavior (verified against OTP 29, including that a second
same-batch message is dropped). Bad options and a non-list argument raise
badarg before any side effect.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The JIT send sets x0 to the message in every delivering branch, but the
fall-through that silently drops a send to a non-alias reference left x0
holding the recipient, so Ref ! Msg returned the reference instead of Msg
on the JIT flavor (the interpreter assigns x0 after the dispatch and was
correct). Caught by the new non-local-refs send test, which only ran on the
interpreter flavor until now.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Reference kinds are distinguished by their boxed size, and on 32-bit the
natural process reference size (header + 2 ticks words + pid = 4) collides
with the resource reference. Resolving the collision by growing the resource
reference put the extra word on the wrong side: resource references are used
all over the embedded targets (sockets, files, NIF resources), while process
references only exist when aliases are used. Move the padding word to the
process reference, so 32-bit resource references return to 4 words and only
alias users pay the extra word. The padding word is initialized to nil so GC
copies a defined value. The size static asserts now check pairwise
distinctness, since the sizes are no longer monotonically ordered (32-bit:
short 3, resource 4, process 5; 64-bit: short 2, process 3, resource 4).

The wire format is unchanged (the padding word is not serialized), and all
allocation sites size through TERM_BOXED_REFERENCE_PROCESS_SIZE.

Signed-off-by: Davide Bettio <davide@uninstall.it>
…of RefData

Storing a RefData (ref ticks + owner pid) in every local monitor grew each
MonitorLocalMonitor, MonitorLocalRegisteredNameMonitor and MonitorAlias by 8
bytes (uint64_t alignment), paid by every plain monitor/2 -- i.e. every
gen_server:call -- on the embedded targets. The pid was redundant everywhere:
on the monitored side it duplicates monitor_obj (the monitoring process, which
is exactly the alias owner), on the monitoring, registered-name and alias
entries it was never read.

Keep a plain uint64_t ref_ticks in the monitor structs (their baseline layout)
and carry the only new bit of information -- "the 'DOWN' reference must be
rebuilt alias-shaped" -- as a monitor type, CONTEXT_MONITOR_MONITORED_LOCAL_ALIAS.
The 'DOWN' path derives the process reference pid from monitor_obj. monitor_new
now takes the monitor type directly, replacing the is_monitoring bool as well.

RefData stays as the NIF-local value type used to build result references
(term_from_ref_data) in monitor/3, spawn_opt and alias/1.

Behavior is unchanged and pinned by the existing tests (test_monitor_down_alias
asserts the 'DOWN' carries the very alias reference; process_info(monitored_by)
keeps listing alias-shaped monitors).

Signed-off-by: Davide Bettio <davide@uninstall.it>
The 16-bit active_alias_count was incremented and decremented unchecked: a
process accumulating 65536 simultaneously active aliases would wrap the
counter to 0, after which context_find_alias short-circuits and every alias
of that process silently stops matching -- messages dropped, unalias/1
returning false, reply_demonitor monitors no longer removed. Silent message
loss is the worst failure mode for aliases.

Make the counter saturate stickily: at 0xFFFF a new bit, alias_count_saturated,
is set and the count is pinned (decrements are skipped). Since the pinned count
is never 0, the context_find_alias fast path needs no change -- a saturated
process simply always walks its monitor list, degrading the optimization
instead of correctness. The bit packs into the existing bitfield word, so
sizeof(Context) is unchanged.

Not covered by a test: creating 65536 live aliases is infeasible in CI
(context_add_monitor's duplicate scan makes it quadratic); the three touched
sites (add, unalias, terminate) were reviewed for symmetry instead.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Aliases are serialized as 3-word NEWER_REFERENCE_EXT, so a BEAM peer can
legitimately hold an AtomVM alias and send to it, emitting DOP_ALIAS_SEND (33)
or DOP_ALIAS_SEND_TT (34). Those operations hit the default arm of
nif_erlang_dist_ctrl_put_data, which raises badarg in the distribution
connection process: a single remote alias send could take the whole connection
down instead of, at worst, losing one message.

Handle both operations by routing through the existing alias delivery path:
decoding our own alias yields a process reference (node and creation match),
so the owner pid is in the reference itself and the message is posted with
globalcontext_send_message_to_alias, where the owner validates the alias as
for a local send. A reference that is not a local process reference (e.g.
minted by a previous incarnation of this node) drops the message, exactly like
a local send to a non-alias reference. Unlike a pid, the reference is a boxed
term, so the control message is kept rooted across the payload decode.

The trace token of DOP_ALIAS_SEND_TT is ignored (sequential tracing is not
supported). The outbound direction is unchanged: sending to a remote alias
from AtomVM is not routed over distribution.

test_net_kernel gains test_alias_send_from_beam, which has a real OTP node
send through an AtomVM alias (delivery and reference round-trip), then send
again after unalias/1 (message dropped, connection survives). The test also
passes unchanged on BEAM, pinning the OTP semantics it asserts.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Behavior differences go to the differences-with-beam page: sending to a
remote alias is not routed over distribution (the inbound direction works),
and aliases are rejected as select handles where BEAM accepts them. The
unsupported erlang:monitor/3 {tag, Term} option is already stated in the
function documentation, and the builtin port-driver request/reply protocols
are AtomVM-specific, so neither belongs on the differences page.

The alias/1 documentation now states that the OTP 28 priority option is not
supported and raises badarg, and the CHANGELOG records the REF_SIZE C macro
deprecation with its replacements.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The received-order split added for process aliases reverses the LIFO outer
list and then walks it a second time, so that alias side effects run in
received order (deactivating the alias, and resolving several sends to the
same alias in one batch oldest-first). That second traversal only matters
when the owner actually holds an active alias, yet it ran for every signal
drain of every process -- including the alias-free processes that are the
common case on the embedded targets, and the ctx == NULL scheduler and
termination callers that can never own an alias.

Gate it on active_alias_count: when it is zero (or there is no owner
context), split the outer list into the normal and signal sublists in a
single prepend pass, exactly as before aliases existed. active_alias_count
is written only by the owner and this runs in the owner's own context, so it
is stable for the whole batch and the choice is race-free. An
AliasMessageSignal reached on this path necessarily targets an inactive
alias: drop it when we own a context (it must never reach the signal loop,
which treats it as unreachable), or leave it for the ctx == NULL caller to
ignore. The full received-order two-pass is kept unchanged for a process
that holds an active alias.

No behavior change: the fast path produces the same received-order sublists
the two-pass produced for an alias-free batch.

Signed-off-by: Davide Bettio <davide@uninstall.it>
@bettio bettio force-pushed the updated-aliases branch from 64e7fb6 to 73bb408 Compare June 11, 2026 13:43
The alias support reduced the console handler's heap reservation from a
conservative 12 terms to TUPLE_SIZE(3), matching the io_reply 3-tuple once
the verbatim ReplyAs echo removed the separate ref allocation. But the
gen_call paths build their error reply as an error 2-tuple nested inside the
reply 2-tuple (port_create_error_tuple + port_send_reply), i.e.
TUPLE_SIZE(2) + TUPLE_SIZE(2) = 6 terms, allocated on ctx's heap by helpers
that do not run their own memory_ensure_free. Reserving only 4 terms
under-allocates by 2: the writes are masked by heap slack today, but violate
the ensure_free contract and corrupt the heap whenever it is tight (e.g.
after a MEMORY_CAN_SHRINK shrink to the requested size).

Reserve 2 * TUPLE_SIZE(2), the largest reply the handler builds (more than
the io_reply 3-tuple and the close 2-tuple). Flagged by CodeQL's
allocation-exceeding-ensure_free query.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Comment thread src/libAtomVM/context.c

term context_process_alias_message_signal(Context *ctx, struct TermSignal *signal)
{
// signal->signal_term is a 2-tuple {Ref, Message}. The alias lookup runs here, in the owner's own

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want comments that verbose? The drawback is that eventually they end up being unsync with code, and LLM or humans tend to read comments instead of code.

Comment thread src/libAtomVM/context.c
// Target cannot be NULL as we processed Demonitor signals
assert(target != NULL);
int required_terms = REF_SIZE + TUPLE_SIZE(5);
int required_terms = TERM_BOXED_REFERENCE_PROCESS_SIZE + TUPLE_SIZE(5);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This size could be conditional of monitor->monitor_type == CONTEXT_MONITOR_MONITORED_LOCAL_ALIAS, couldn't it?

Comment thread src/libAtomVM/context.h
// lookups are no longer skipped for this process, trading the optimization for
// correctness instead of wrapping to 0 and silently deactivating every alias.
unsigned int active_alias_count : 16;
unsigned int alias_count_saturated : 1;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this optimization is worth the 17 bits it costs. We could have only one bit and check if there is any alias left when removing an alias. Alternatively, we could reduce it to 8 bits and drop the optimization when the number of aliases reaches (forever) 255.

{
Context *p = globalcontext_get_process_lock(glb, process_id);
if (p) {
// Post the alias message as a signal carrying {Ref, Message}. The owner validates the alias

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment adds any value

*
* @details Posts an AliasMessageSignal carrying {Ref, Message} to the owner process. The owner
* validates the alias against its own monitors when it drains signals and either delivers Message
* as a normal message or drops it. This avoids touching the owner's monitor list from the sender.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last sentence sounds obvious here.

Comment thread src/libAtomVM/jit.c
set_error(ctx, jit_state, 0, BADARG_ATOM);
return false;
} else {
// A reference that is not a local process reference (short/resource/external ref): the

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A TODO would be welcome for distributed aliases.
Also I would put the term_is_reference first.

Comment thread src/libAtomVM/jit.c
}

int local_process_id;
if (term_is_local_pid_or_port(recipient_term)) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test still make sense?

bettio added 7 commits June 12, 2026 19:10
The alias work renamed REF_SIZE to TERM_BOXED_REFERENCE_SHORT_SIZE and left
REF_SIZE as a deprecation shim that emits a GCC diagnostic on every use. The
libAtomVM tree was fully migrated, but the three USB CDC drivers were not:
ten uses remained, so every esp32, stm32 and rp2 build gained deprecation
warnings, turning into hard errors when AVM_WARNINGS_ARE_ERRORS is enabled
(the drivers inherit libAtomVM's compile options). The generic_unix builds
never compile these drivers, which is how the leftovers survived every
native test pass.

All ten uses size reply allocations whose reference is built with
term_from_ref_ticks, i.e. a plain short reference, so
TERM_BOXED_REFERENCE_SHORT_SIZE is the exact replacement. The shim expands
to that same macro, so the generated code is unchanged and the deprecation
is left with zero in-tree users.

Flagged by the pre-merge review, which confirmed with a standalone probe
that the pragma fires in exactly these expression contexts and becomes a
hard error under -Werror.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The differences-with-beam page documents core divergences from the BEAM,
and the process alias section was not one: its remaining substance is a
distribution limitation. Move it to a new Known Issues & Limitations
section of the distributed Erlang page, which now carries the only
user-facing note: an outbound send to a remote alias is dropped, while
inbound delivery from a BEAM peer works.

The page-closing invitation to file issues or pull requests moves below
the new section, so it keeps closing the page and now reads on both the
partial feature list and the known issues.

Dropped in the move: the select-handle rejection (an exotic corner that is
not worth user-facing documentation -- it remains commented at both
validation sites and covered by tests) and the OTP >= 26 semantics claim.
The differences-with-beam page is back to its release-0.7 content.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Five tests closing the review's coverage gaps (AR-10):

- test_alias_send_after_owner_died: a send through the alias of a dead
  process is dropped, still returns the message, and surfaces nowhere; the
  spawn churn probes would report a misdelivery. AtomVM assigns process ids
  monotonically so the dead owner's id is not reused, and globally-unique
  ref ticks are the invariant that keeps a stale alias unmatchable even by
  an id reuser.

- test_alias_multi_sender_unalias: four senders hammer one alias while the
  owner unaliases mid-stream. All synchronization is per-sender send order
  and explicit acks -- no sleeps and no timing windows -- so the test cannot
  flake on slow or loaded hosts. Phase 1 pins the exact delivered count
  behind per-sender fences while the alias is active; phase 2 races
  unalias/1 against the senders and pins no crash, a bounded count and an
  observably dead alias afterwards.

- test_alias_duplicate_options: alias/1 duplicate options are last-wins in
  both orders, like OTP 29 (the monitor/3 duplicate was already covered).

- test_unalias_non_reference_badarg: unalias(42) raises badarg, like OTP.

- test_binary_to_term_invalid_process_ref: a wire-format alias with its
  owner pid word patched to 0 or above the 28-bit maximum must fail to
  decode -- the untrusted-input validation this branch added to the decoder,
  previously untested. The BEAM does not read reference words as a pid and
  decodes plain references there; the test asserts that instead.

All five pass on both flavors and on the BEAM (OTP 29) via -b; the module
was looped 30x on each flavor (0 failures) and runs under valgrind with 0
errors.

Signed-off-by: Davide Bettio <davide@uninstall.it>
Process ids are int32_t throughout (Context.process_id, RefData.process_id,
and every assignment of this accessor's result), but
term_process_ref_to_process_id returned uint32_t. Return int32_t, making the
six int32_t call sites exact; the remaining sites store the always-positive,
28-bit-bounded value into uint32_t wire words, where the conversion is
well-defined and unchanged.

Also from the review's style pass: drop the stray blank line in the
REFERENCE_PROCESS_PID_OFFSET conditional (the neighboring blocks are
compact), fix "an unique" in the term_make_process_reference doc, and state
the load-bearing layout invariant there: the ticks occupy the same words as
in a short reference, so term_to_ref_ticks works on both shapes.

The remaining nits from the same review pass were deliberately not taken:
explicit casts in term_make_process_reference would decorate conversions
that are well-defined and always in range; initializing the
parse_monitor_opts alias_type out-param adds a line for a latent-only risk;
the context_dump %lu ticks truncation matches the surrounding pre-existing
lines and is best fixed together with them.

Signed-off-by: Davide Bettio <davide@uninstall.it>
A message sent to an inactive alias is rejected at the owner's mailbox
drain, but both drop sites disposed it with mailbox_message_dispose, which
appends the signal storage to the receiver's heap as a fragment: the dead
{Ref, Message} tree -- including any refc binaries it pins -- stayed on the
victim's heap until its next GC. OTP drops such messages with no receiver
heap impact, so a flood of sends to a deactivated alias produced
AtomVM-only GC pressure, with the receiver paying for messages it never
accepted.

The deferred reclamation is only needed for delivered messages, whose terms
may still be referenced after a receive. A dropped alias message was never
delivered, so nothing can reference its term: free it immediately with
mailbox_message_dispose_unsent (which sweeps the mso_list, releasing refc
binaries), exactly the treatment unsent messages already get. TermSignal
and Message share the layout already asserted for the in-place conversion,
and both drop sites run in the owner's own context (from_task = false).

The reply-mode and duplicate-option tests now also push a >64-byte binary
through each drop path, so the refc sweep is exercised under valgrind and
ASAN.

Flagged by the pre-merge review (round 2, issue 4).

Signed-off-by: Davide Bettio <davide@uninstall.it>
The reply_demonitor branch of context_process_alias_message_signal takes
the processes_table read lock from inside the mailbox drain, which is why
callers of the with-aliases drain must not hold that lock and why
context_destroy stays on the alias-blind variant. The constraint was
documented at the mailbox.h API; nothing at the acquisition site itself
said so, and the pre-merge review (round 2, issue 5) asked for a guard
against a future caller draining with the lock held.

A guard already exists, stronger than the suggested debug assertion: on
generic_unix smp_rwlock_rdlock aborts on any pthread error, and glibc
returns EDEADLK when the calling thread already owns the write lock
(probe-verified), so the misuse dies fast at the exact lock call in every
build type instead of deadlocking silently. A literal current-thread
ownership assertion would need thread-local tracking maintained at all 22
processes_table lock sites across the four platform smp implementations --
out of proportion for a fragility note.

State the invariant and its enforcement in a comment at the acquisition
site, so the coupling is visible where a future reader would break it.

Signed-off-by: Davide Bettio <davide@uninstall.it>
The owner in test_monitor_alias_down_before_send_same_batch busy-waits on
whereis by design: receiving would drain its mailbox and defeat the
same-batch construction. Under valgrind all threads are serialized with an
unfair scheduler, so the spinning owner can starve the relay; the
5M-iteration bound expired once in a full-suite valgrind run (the test
passes standalone, on the other flavor, and on rerun, with zero memory
errors every time). Raise the bound to 50M -- ten times the scheduling
opportunities for the relay -- and say why it is large. The fast path is
unchanged: the loop exits as soon as the relay registers.

Signed-off-by: Davide Bettio <davide@uninstall.it>
@petermm

petermm commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

AMP:

PR Review — Process aliases for AtomVM

Range reviewed: f90e0ee9a..HEAD (62 commits), branched from
9402220d7 (Merge pull request #2331 … valgrind-spinlock-sched).

Diffstat: ~40 files, +1975 / −173. The series adds OTP process aliases:
erlang:alias/0,1, erlang:unalias/1, erlang:monitor/3 with the
{alias, Mode} option, spawn_opt {monitor, MonitorOpts}, sending to an
alias reference (local + inbound distribution), a new boxed process
reference
term, and per-process saturating alias counters.

Verdict

Strong, ship-quality work. The design is sound: aliases are validated in
the owner's own context against the owner's own monitor list (the same
single-owner discipline as alias/0/unalias/1/demonitor), so there is no
cross-process race on alias state. Error paths in do_spawn/monitor are
carefully "allocate-everything-then-publish", the 32-bit reference padding and
size-distinctness are guarded by _Static_assert, and the commit history shows
the tricky cases (same-batch ordering, saturation, wire ordering in
term_compare, REF_SIZE deprecation) were each addressed deliberately. Test
coverage is substantial (tests/erlang_tests/test_monitor.erl +672).

The findings below are mostly pre-existing OOM-robustness gaps that this
feature widens slightly, plus one portability note. None block merge; I'd fix
#1 in this series since the alias code adds to the half-published surface.


Findings

1. (Medium) OOM in mailbox_send_monitor_signal leaves a half-published monitor/alias

nifs.c:5187,
helper at mailbox.c:338.

mailbox_send_monitor_signal() returns void and silently drops the signal on
malloc failure (the FIXME at mailbox.c:340 acknowledges this). In
nif_erlang_monitor the call is the publish step for the target's half of the
monitor. If it fails:

  • other_monitor is leaked,
  • self_monitor and (new in this PR) alias_monitor are still added right
    after at nifs.c:5190-5192,
  • the caller still receives a reference,
  • the target never gets the monitor.

So the monitoring process believes it monitors / has an alias to a target that
knows nothing about it. The alias addition makes the inconsistency worse (a
live alias with no monitored half).

Suggested fix — make the helper report failure and clean up at the call site:

--- a/src/libAtomVM/mailbox.h
+++ b/src/libAtomVM/mailbox.h
-void mailbox_send_monitor_signal(Context *c, enum MessageType type, struct Monitor *monitor);
+bool mailbox_send_monitor_signal(Context *c, enum MessageType type, struct Monitor *monitor);
--- a/src/libAtomVM/mailbox.c
+++ b/src/libAtomVM/mailbox.c
-void mailbox_send_monitor_signal(Context *c, enum MessageType type, struct Monitor *monitor)
+bool mailbox_send_monitor_signal(Context *c, enum MessageType type, struct Monitor *monitor)
 {
     struct MonitorPointerSignal *monitor_signal = malloc(sizeof(struct MonitorPointerSignal));
     if (IS_NULL_PTR(monitor_signal)) {
-        // FIXME (pre-existing): this function returns void, so an out-of-memory here is silently
-        // dropped -- the monitor is leaked and the caller believes it was installed. It should return
-        // bool so the caller can free the monitor and raise out_of_memory.
         fprintf(stderr, "Failed to allocate memory: %s:%i.\n", __FILE__, __LINE__);
-        return;
+        return false;
     }
     ...
     mailbox_post_message(c, &monitor_signal->base);
+    return true;
 }
--- a/src/libAtomVM/nifs.c
+++ b/src/libAtomVM/nifs.c
-    mailbox_send_monitor_signal(target, MonitorSignal, other_monitor);
-    globalcontext_get_process_unlock(ctx->global, target);
+    if (UNLIKELY(!mailbox_send_monitor_signal(target, MonitorSignal, other_monitor))) {
+        monitor_destroy(alias_monitor);
+        monitor_destroy(self_monitor);
+        monitor_destroy(other_monitor);
+        globalcontext_get_process_unlock(ctx->global, target);
+        RAISE_ERROR(OUT_OF_MEMORY_ATOM);
+    }
+    globalcontext_get_process_unlock(ctx->global, target);

The other callers (nifs.c:5275 link, dist_nifs.c, resources.c) then need
to either check the return or explicitly ignore it with cleanup. If a full
audit is out of scope for this PR, at minimum guard the new alias-bearing path
above so the alias is not left dangling.

2. (Low) Same OOM class on mailbox_send / mailbox_send_ref_signal

  • noproc monitor branch: nifs.c:5133 installs the
    explicit_unalias alias before mailbox_send(ctx, down_message_tuple); if
    that send OOMs the DOWN is lost but the alias remains (tolerable, since an
    explicit-unalias alias lives until unalias/1).
  • reply_demonitor: context.c:527
    mailbox_send_ref_signal(target, DemonitorSignal, …) can drop the remote
    demonitor on OOM, leaving the monitored side installed until target exit.

Same root cause as #1 (void mailbox-send helpers). Worth a tracking note rather
than a blocker.

3. (Low, pre-existing) Registered-name DOWN writes a temp-heap tuple into the retained signal

context.c:470-476:

BEGIN_WITH_STACK_HEAP(TUPLE_SIZE(2), temp_heap)
term name_tuple = term_alloc_tuple(2, &temp_heap);
...
term_put_tuple_element(signal->signal_term, 3, name_tuple);  // temp-heap term into the signal
mailbox_send(ctx, signal->signal_term);
END_WITH_STACK_HEAP(temp_heap, ctx->global);

mailbox_send copies the term out before the temp heap is torn down, so the
delivered message is fine. But signal->signal_term now holds a pointer into
freed stack-heap storage, and the original signal fragment is later appended to
ctx->heap by mailbox_message_dispose, where a future GC could scan the
dangling slot.

This code is unchanged by this PR (it predates f90e0ee9a; the series only
renamed ref_data.ref_ticksref_ticks here), so it is out of scope — noted
for a follow-up. Fix is to build a fresh 5-tuple on the temp heap and send that
instead of mutating signal->signal_term.

4. (Low / portability) In-place re-type of TermSignalMessage

mailbox.c:458-459
(and the CONTAINER_OF(current, Message, base) disposal casts at lines 403,
472) re-type an AliasMessageSignal in place. The _Static_asserts above the
function prove identical layout, so this works on every current toolchain, but
identical layout does not by itself satisfy C effective-type / strict-aliasing
rules. Low risk; the clean long-term fix is to unify Message and TermSignal
into one heap-backed mailbox term with a union { term message; term signal_term; }. Not needed for merge.


Things verified as correct (not bugs)

  • Unlocked active_alias_count read in process_outer_list
    (mailbox.c:388):
    owner-written only, runs in owner context, stable for the batch. Fast path
    (no alias) preserves the original single-pass reversal; alias path reverses
    to received order first. Sound.
  • Same-batch ordering: pre-deactivating an alias on a MonitorDownSignal
    seen earlier in the batch (mailbox.c:483)
    so a later same-batch alias send is dropped; idempotent with the later
    context_process_monitor_down_signal. Matches OTP.
  • Saturating counter (context.h): pins at 0xFFFF and stops decrementing,
    so context_find_alias never wrongly short-circuits — correctness over the
    optimization. Sound.
  • Process-reference sizing (term.h): 64-bit = SHORT+1, 32-bit = SHORT+2
    with the trailing pad word initialized to nil; all reference kinds stay
    size-distinguishable, guarded by _Static_assert. Sound.
  • reply_demonitor lock discipline (context.c): monitor_pid captured as
    an immediate before context_demonitor frees the alias; target taken under
    the process-table lock; context_destroy deliberately uses the alias-blind
    drain to avoid taking that lock while holding the write lock. No UAF/deadlock.
  • Inbound distributed alias send (dist_nifs.c, external_term.c): control
    tuple kept rooted across payload-decode GC; decode rejects process_id == 0
    (the INVALID_PROCESS_ID short-ref sentinel) and > TERM_MAX_LOCAL_PROCESS_ID;
    delivery only for node+creation match.
  • term_compare wire ordering: local process refs laid out as
    [ticks_hi, ticks_lo, process_id] to agree with the serialized form and the
    external-ref path; the mixed local/external len==3 branch never has both
    operands local (guarded by the outer is_external check), so the shared
    local_data scratch buffer is not a hazard.
  • do_spawn: all parsing/allocation/result reservation before any publish,
    monitor_destroy on every error path, monitor_destroy(NULL) no-op. No leak
    or half-published state found.
  • REF_SIZE deprecation: no remaining REF_SIZE uses in the tree, so the
    _Pragma("GCC warning …") will not fire (safe even under -Werror).
    CHANGELOG documents both the feature and the deprecation.

Suggested follow-up tests

  • Same-batch: alias-send-before-DOWN (delivered) vs DOWN-before-alias-send
    (dropped); multiple sends to one reply/reply_demonitor alias in a batch
    (first delivered, rest dropped).
  • OOM fault injection on mailbox_send_monitor_signal at nifs.c:5187 (if a
    malloc hook is available) to lock in the bifs_hash.h not found #1 fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants