Skip to content

feat: internal-only network access#6

Draft
think-in-universe wants to merge 1 commit into
mainfrom
feat/restrict-external-requests
Draft

feat: internal-only network access#6
think-in-universe wants to merge 1 commit into
mainfrom
feat/restrict-external-requests

Conversation

@think-in-universe

@think-in-universe think-in-universe commented Jan 22, 2026

Copy link
Copy Markdown

Background

As we're building the RAG service which will rely on a large number of libraries, it's more likely to suffer from supply chain attack. We'd like to restrict the RAG service to send requests to external network and can only send internal requests inside the VPC.

Implementation: Add Internal-Only Network Access Support via Headscale ACL Policies

Overview

This PR implements support for restricting VPC nodes to internal-only network access, preventing them from making external internet requests while still allowing communication within the VPC. This is achieved through Headscale ACL (Access Control List) policies that control routing at the VPN level.

Problem

Previously, all VPC nodes could potentially make external internet requests if the underlying host/container had internet access. There was no mechanism to restrict certain nodes to only communicate within the VPC subnet.

Solution

Implemented Headscale ACL policies that tag nodes as either internal-only or external-allowed and enforce routing restrictions at the VPN level:

  • Internal-only nodes: Can only communicate within the VPC subnet (100.128.0.0/10)
  • External-allowed nodes: Can access both VPC and external networks (default behavior)

Changes Made

1. Headscale Configuration (configs/headscale_config.yaml)

  • Added ACL policies to restrict routing based on node tags
  • tag:internal-only nodes are restricted to VPC subnet only
  • tag:external-allowed nodes can access both VPC and external networks
  • Maintains backwards compatibility with default rule for all VPC nodes

2. VPC API Server (vpc-api-server/main.go)

  • Added ACLTags field to PreAuthKeyRequest struct
  • Modified generatePreAuthKey() to accept internalOnly parameter and set appropriate tags
  • Updated /api/register endpoint to accept optional internal_only query parameter
  • Nodes are automatically tagged based on the internal_only flag

3. Node Setup Script (scripts/vpc-node-setup.sh)

  • Added support for VPC_INTERNAL_ONLY environment variable
  • Passes internal_only=true parameter to registration API when enabled
  • Logs internal-only status for debugging

4. Compose Generation (scripts/generate-compose.sh)

  • Added VPC_INTERNAL_ONLY environment variable to vpc-node-setup container
  • Defaults to false if not set (maintains backwards compatibility)

5. Documentation (README.md)

  • Added comprehensive "Internal-Only Network Access" section
  • Updated API reference with internal_only parameter documentation
  • Added configuration examples and use cases
  • Included testing procedures

Usage

Enable Internal-Only Access for a Node

Set the VPC_INTERNAL_ONLY environment variable to true when deploying a node:

docker run -d \
  -e VPC_NODE_NAME=internal-node \
  -e VPC_SERVER_APP_ID=server-app-id \
  -e VPC_INTERNAL_ONLY=true \
  -e MESH_BACKEND=127.0.0.1:27017 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --privileged \
  nearaidev/dstack-service

API Usage

The registration endpoint now accepts an optional internal_only parameter:

GET /api/register?instance_id={uuid}&node_name={name}&internal_only=true
Headers: x-dstack-app-id: {app_id}

Benefits

  1. Security: Prevents sensitive nodes from making external requests
  2. Compliance: Supports air-gapped or isolated workload requirements
  3. Policy-Based: Centralized ACL management in Headscale config
  4. Backwards Compatible: Existing nodes continue to work without changes
  5. No Privileged Containers Required: Works at VPN routing level, not container level

Use Cases

  • Database nodes that should only accept connections from application nodes
  • Sensitive services that must not leak data externally
  • Compliance requirements for isolated workloads
  • Cost reduction by preventing unnecessary external API calls

Testing

Verify Internal-Only Access Works

# On an internal-only node, these should work:
curl http://other-node.dstack.internal:27017

# These should fail (blocked by ACL):
curl https://google.com
curl https://8.8.8.8

Verify External-Allowed Nodes Still Work

# On a normal node (VPC_INTERNAL_ONLY not set or false):
curl http://other-node.dstack.internal:27017  # Should work
curl https://google.com  # Should work (if host has internet)

Implementation Notes

  • Uses Headscale's native ACL policy system (recommended approach)
  • Tags are applied during pre-auth key creation
  • ACL policies are evaluated at the VPN routing level
  • No changes required to existing nodes (backwards compatible)

Potential Future Enhancements

  • Add iptables rules as a secondary layer of defense (defense in depth)
  • Support for per-node ACL customization
  • Monitoring/logging of blocked external connection attempts

Breaking Changes

None. This is a backwards-compatible feature addition. Existing nodes without VPC_INTERNAL_ONLY set will continue to work as before.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements internal-only network access control for VPC nodes, allowing certain nodes to be restricted to internal VPC communication while blocking external internet access through Headscale ACL policies.

Changes:

  • Added ACL policies in Headscale configuration to enforce network access restrictions based on node tags
  • Modified VPC API server to accept internal_only parameter and tag pre-auth keys accordingly
  • Updated node setup and compose generation scripts to pass internal-only flag through the deployment pipeline

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
configs/headscale_config.yaml Adds ACL policies with tag-based routing restrictions for internal-only and external-allowed nodes
vpc-api-server/main.go Adds ACLTags field to pre-auth key requests and internal_only parameter handling in registration endpoint
scripts/vpc-node-setup.sh Adds VPC_INTERNAL_ONLY environment variable support and passes it to registration API
scripts/generate-compose.sh Configures VPC_INTERNAL_ONLY environment variable in vpc-node-setup container
README.md Documents the internal-only network access feature with configuration examples and testing procedures

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread vpc-api-server/main.go
Comment on lines +425 to +428
// Parse internal_only parameter (defaults to false)
internalOnly := false
if internalOnlyStr == "true" || internalOnlyStr == "1" {
internalOnly = true

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parsing of the internal_only parameter doesn't handle invalid values explicitly. If a user passes internal_only=false or internal_only=0 or any other string, it will be treated as false (the default). While this may be acceptable for a boolean flag, it could be confusing if users expect explicit rejection of invalid values. Consider adding validation to only accept "true", "false", "1", or "0", and return an error for other values to provide clearer feedback.

Suggested change
// Parse internal_only parameter (defaults to false)
internalOnly := false
if internalOnlyStr == "true" || internalOnlyStr == "1" {
internalOnly = true
// Parse internal_only parameter (defaults to false) with explicit validation
internalOnly := false
switch internalOnlyStr {
case "", "false", "0":
// keep default false
case "true", "1":
internalOnly = true
default:
c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid value for internal_only; must be one of: true, false, 1, 0"})
return

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +72
acls:
# Allow internal-only nodes to only reach VPC subnet
- action: accept
src:
- "tag:internal-only"
dst:
- "100.128.0.0/10:*"

# Allow external-allowed nodes to reach VPC subnet
- action: accept
src:
- "tag:external-allowed"
dst:
- "100.128.0.0/10:*"

# Allow external-allowed nodes to reach external networks (via exit nodes if configured)
# Note: This allows routing to external IPs, but actual external access depends on exit node configuration
- action: accept
src:
- "tag:external-allowed"
dst:
- "*:*"

# Default: Allow all nodes to communicate within VPC (for backwards compatibility)
- action: accept
src:
- "100.128.0.0/10"
dst:
- "100.128.0.0/10:*"

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ACL configuration is missing a tagOwners section which is required by Headscale to define who can assign tags to nodes. Without this, the ACL tags (tag:internal-only and tag:external-allowed) referenced in the policies may not work as expected. You should add a tagOwners section before the acls section that defines which users can assign these tags. For example:

tagOwners:
  tag:internal-only:
    - default
  tag:external-allowed:
    - default

This defines that the "default" user (which is used in the API calls) has permission to assign these tags.

Copilot uses AI. Check for mistakes.
Comment thread vpc-api-server/main.go
Comment on lines +252 to +253
// Note: Headscale API format may vary. If aclTags field doesn't work,
// we may need to set tags on the node after it joins instead.

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 252-253 indicates uncertainty about whether the aclTags field will work with the Headscale API. This suggests the implementation may not have been tested. If the field doesn't work as expected, the entire internal-only network access feature will fail silently - nodes will be created without the appropriate tags and the ACL restrictions won't be applied. Consider removing this comment if the implementation has been tested and verified to work, or implementing a fallback mechanism to set tags after node creation if this approach is uncertain.

Suggested change
// Note: Headscale API format may vary. If aclTags field doesn't work,
// we may need to set tags on the node after it joins instead.

Copilot uses AI. Check for mistakes.
Comment on lines +66 to +72

# Default: Allow all nodes to communicate within VPC (for backwards compatibility)
- action: accept
src:
- "100.128.0.0/10"
dst:
- "100.128.0.0/10:*"

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ACL rule on lines 67-72 allows all nodes in the VPC subnet (100.128.0.0/10) to communicate with each other by default, which could potentially bypass the internal-only restrictions. If a node doesn't get tagged properly (due to API issues or other failures), it will fall back to this rule and have full VPC access without any external restrictions being enforced. Consider whether this default rule should be more restrictive or if there should be explicit denial rules for untagged nodes to fail closed rather than fail open.

Suggested change
# Default: Allow all nodes to communicate within VPC (for backwards compatibility)
- action: accept
src:
- "100.128.0.0/10"
dst:
- "100.128.0.0/10:*"

Copilot uses AI. Check for mistakes.
Comment thread vpc-api-server/main.go
Comment on lines +259 to +260
aclTags = []string{"tag:external-allowed"}
log.Printf("Creating pre-auth key with external-allowed tag")

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When internalOnly is false, nodes are explicitly tagged with tag:external-allowed. However, this means that all existing nodes that were created before this feature was implemented will not have either tag, potentially causing them to rely on the default ACL rule only. Consider whether existing nodes should be retroactively tagged, or if the default behavior should explicitly tag nodes as external-allowed even when the parameter is not provided to maintain consistent behavior.

Suggested change
aclTags = []string{"tag:external-allowed"}
log.Printf("Creating pre-auth key with external-allowed tag")
// For non-internal nodes, rely on the default ACL behavior instead of
// explicitly tagging them as external-allowed. This keeps behavior
// consistent with nodes created before ACL tagging was introduced.
log.Printf("Creating pre-auth key with default (non-internal) ACL behavior")

Copilot uses AI. Check for mistakes.
Comment thread README.md
Comment on lines +410 to +416

# External-allowed nodes can reach both VPC and external
- action: accept
src:
- "tag:external-allowed"
dst:
- "*:*"

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ACL policy documentation in the README is incomplete and doesn't match the actual implementation in configs/headscale_config.yaml. The documentation only shows two rules (internal-only and external-allowed), but the actual config has four rules including additional rules for external-allowed nodes to reach VPC and a default rule for all VPC nodes. The documentation should accurately reflect all ACL rules to help users understand the complete access control behavior.

Suggested change
# External-allowed nodes can reach both VPC and external
- action: accept
src:
- "tag:external-allowed"
dst:
- "*:*"
# External-allowed nodes can reach VPC (for talking to services inside the VPC)
- action: accept
src:
- "tag:external-allowed"
dst:
- "100.128.0.0/10:*"
# External-allowed nodes can also reach the public internet and other external services
- action: accept
src:
- "tag:external-allowed"
dst:
- "*:*"
# Default: all nodes inside the VPC CIDR can talk to each other
- action: accept
src:
- "100.128.0.0/10:*"
dst:
- "100.128.0.0/10:*"

Copilot uses AI. Check for mistakes.
Comment thread README.md

# These should fail (blocked by ACL):
curl https://google.com
curl https://8.8.8.8

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testing example uses curl https://8.8.8.8 which is attempting HTTPS on an IP address (Google's DNS server) that doesn't serve HTTPS. This command will fail even on nodes with external access, making it a poor test case. Consider using curl http://8.8.8.8 or curl https://1.1.1.1 (Cloudflare DNS which does support HTTPS), or better yet, ping 8.8.8.8 to test network connectivity.

Suggested change
curl https://8.8.8.8
curl https://1.1.1.1

Copilot uses AI. Check for mistakes.
Comment thread vpc-api-server/main.go
Comment on lines +425 to +429
// Parse internal_only parameter (defaults to false)
internalOnly := false
if internalOnlyStr == "true" || internalOnlyStr == "1" {
internalOnly = true
}

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internal_only parameter accepts both "true" and "1" as truthy values, but the documentation only mentions "true". While this flexibility might be intentional, it's not documented. Consider either documenting that both values are accepted, or restricting to only "true" for consistency with the documentation and to avoid confusion.

Copilot uses AI. Check for mistakes.
Comment on lines +51 to +66

# Allow external-allowed nodes to reach VPC subnet
- action: accept
src:
- "tag:external-allowed"
dst:
- "100.128.0.0/10:*"

# Allow external-allowed nodes to reach external networks (via exit nodes if configured)
# Note: This allows routing to external IPs, but actual external access depends on exit node configuration
- action: accept
src:
- "tag:external-allowed"
dst:
- "*:*"

Copilot AI Jan 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ACL rules allow tag:external-allowed nodes to access ":" (all destinations including external networks), but this depends on exit node configuration as noted in the comment. However, there's no verification or logging to ensure that nodes are actually restricted as intended. If the exit node configuration or ACL enforcement fails, nodes marked as internal-only could still access external networks. Consider adding monitoring or verification mechanisms to ensure the ACL policies are being enforced correctly.

Suggested change
# Allow external-allowed nodes to reach VPC subnet
- action: accept
src:
- "tag:external-allowed"
dst:
- "100.128.0.0/10:*"
# Allow external-allowed nodes to reach external networks (via exit nodes if configured)
# Note: This allows routing to external IPs, but actual external access depends on exit node configuration
- action: accept
src:
- "tag:external-allowed"
dst:
- "*:*"
# Explicitly deny internal-only nodes from reaching external (internet) ranges
# This ensures internal-only nodes remain restricted even if exit node or routing is misconfigured.
- action: deny
src:
- "tag:internal-only"
dst:
- "0.0.0.0/0:*"
# Allow external-allowed nodes to reach VPC subnet
- action: accept
src:
- "tag:external-allowed"
dst:
- "100.128.0.0/10:*"
# Allow external-allowed nodes to reach external networks (via exit nodes if configured)
# Note: This allows routing to external IPs; actual external access still depends on exit node configuration.
# Using an explicit 0.0.0.0/0 CIDR instead of "*:*" makes the intent clearer and avoids overbroad wildcard usage.
- action: accept
src:
- "tag:external-allowed"
dst:
- "0.0.0.0/0:*"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants