Skip to content

Propagate enable_profiling through ProcessedContainerSettings#483

Open
jhiemstrawisc wants to merge 3 commits into
Reed-CompBio:mainfrom
jhiemstrawisc:fix-profiling
Open

Propagate enable_profiling through ProcessedContainerSettings#483
jhiemstrawisc wants to merge 3 commits into
Reed-CompBio:mainfrom
jhiemstrawisc:fix-profiling

Conversation

@jhiemstrawisc

Copy link
Copy Markdown
Collaborator

ProcessedContainerSettings.from_container_settings() rebuilds the processed container settings from the raw ContainerSettings, but it never copied the enable_profiling field, so the processed settings always used the dataclass default of False. run_container_singularity() reads the flag off the processed settings (config.enable_profiling), so the profiling branch never ran: no peer cgroup was created, cgroup_wrapper.sh never executed, and no usage-profile.tsv was written, regardless of the user's containers.enable_profiling config value.

This regressed in #390, which moved enable_profiling from a top-level config field to the nested container settings and added the field to both ContainerSettings and ProcessedContainerSettings, but missed wiring it through the conversion in from_container_settings().

Add the missing assignment, plus a regression test asserting that enable_profiling survives the full config-parsing path.

ProcessedContainerSettings.from_container_settings() rebuilds the processed
container settings from the raw ContainerSettings, but it never copied the
enable_profiling field, so the processed settings always used the dataclass
default of False. run_container_singularity() reads the flag off the processed
settings (config.enable_profiling), so the profiling branch never ran: no peer
cgroup was created, cgroup_wrapper.sh never executed, and no usage-profile.tsv
was written, regardless of the user's containers.enable_profiling config value.

This regressed in Reed-CompBio#390, which moved enable_profiling from a top-level config
field to the nested container settings and added the field to both
ContainerSettings and ProcessedContainerSettings, but missed wiring it through
the conversion in from_container_settings().

Add the missing assignment, plus a regression test in test_config.py asserting
that enable_profiling survives the full config-parsing path.
@jhiemstrawisc jhiemstrawisc requested review from agitter and ntalluri June 8, 2026 21:46

@agitter agitter left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes look good. I have not tested it yet, which one of us may want to do before merging. I tried testing in CHTC but hit other problems described in #459 (review)

@ntalluri

Copy link
Copy Markdown
Collaborator

I updated my code on my current CHTC set up with these changes. I am currently waiting for my jobs to get scheduled to test if this fixed the bug.

ntalluri
ntalluri previously approved these changes Jun 22, 2026
@ntalluri

Copy link
Copy Markdown
Collaborator

I accidentally approved this. I still need to test this on CHTC, I realized that the SPRAS Docker image I was using wasn't including this code change so it still looked like a bug.

@jhiemstrawisc

Copy link
Copy Markdown
Collaborator Author

I'm testing this now, and I've found a few more subtle issues -- don't merge yet!

subprocess.run rejects capture_output=True together with an explicit
stderr argument (capture_output is shorthand for stdout=PIPE, stderr=PIPE).
In the profiling/cgroup branch of run_container_singularity this raised
"ValueError: stdout and stderr arguments may not be used with
capture_output." on every Apptainer run with enable_profiling=true.

Set stdout=PIPE and stderr=STDOUT explicitly so the container's stderr is
merged into the captured stdout, preserving the original intent of
capturing combined output via proc.stdout.
The profiling file is written next to raw-pathway.txt inside the reconstruct
rule, but it was never declared as a Snakemake output. With a shared
filesystem this is harmless, but under shared-fs-usage: none (e.g. HTCondor)
Snakemake only transfers declared outputs back from execute nodes, so the
profiling data was created on the EP and then discarded.

Declare usage-profile.tsv as an additional reconstruct output, gated on
enable_profiling and the singularity/apptainer framework so the output is
only required when it is actually produced (run_container_singularity),
matching the existing warn-only behavior for profiling under docker.
@jhiemstrawisc

Copy link
Copy Markdown
Collaborator Author

The most recent commits address two other issues I ran into while testing. The first fixes a runtime error that I think I introduced at some point related to the way I used subprocess to invoke apptainer on the EP. The second declares the profiling file as an output so it's transferred back to the AP. This second point was previously hidden by the fact that the executor used to transfer the EP's entire output director rather than individual files (which was itself a bug because it caused Snakemake to re-run completed jobs).

At this point I've tested profiling end to end and this works for me.

@ntalluri

Copy link
Copy Markdown
Collaborator

I tested this update and was able to get the profiling files. However, every value inside the files is N/A:

egfr_large-strwr-params-CYXCVH5/usage-profile.tsv 
peak_memory_bytes	cpu_usage_usec	cpu_user_usec	cpu_system_usec
N/A	N/A	N/A	N/A

egfr_large-strwr-params-KZG4KAJ/usage-profile.tsv 
peak_memory_bytes	cpu_usage_usec	cpu_user_usec	cpu_system_usec
N/A	N/A	N/A	N/A

@ntalluri

ntalluri commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Looking at one of the reconstruct stdout outputs I get:

Container image override (local .sif): apptainer_images/strwr.sif
singularity build --sandbox unpacked/strwr apptainer_images/strwr.sif
WARNING: 'nodev' mount option set on /srv, it could be a source of failure during build process
INFO:    Starting build...
INFO:    Verifying bootstrap image apptainer_images/strwr.sif
INFO:    Extracting local image...
INFO:    Creating sandbox directory...
INFO:    Build complete: unpacked/strwr
WARNING: 'nodev' mount option set on /srv, it could be a source of failure during build process
INFO:    Starting build...
INFO:    Verifying bootstrap image apptainer_images/strwr.sif
INFO:    Extracting local image...
INFO:    Creating sandbox directory...
INFO:    Build complete: unpacked/strwr
Resolved singularity image to sandbox: unpacked/strwr
Failed to create cgroup: [Errno 30] Read-only file system: '/sys/fs/cgroup/spras-peer-19'
Reading memory and CPU stats from cgroup
Failed to read memory usage from cgroup: [Errno 2] No such file or directory: '/sys/fs/cgroup/spras-peer-19/memory.peak'
Failed to read cpu.stat from cgroup: [Errno 2] No such file or directory: '/sys/fs/cgroup/spras-peer-19/cpu.stat'
    My cgroup path is: /sys/fs/cgroup/spras-peer-19
    /usr/local/lib/python3.11/site-packages/spras/cgroup_wrapper.sh: line 15: /sys/fs/cgroup/spras-peer-19/cgroup.procs: No such file or directory
    Executing command: apptainer exec --bind /srv/scratch/output/egfr/prepared/egfr_large-strwr-inputs:/spras/AOZVQUF --bind /srv/scratch/output/egfr/prepared/egfr_large-strwr-inputs:/spras/4IWLUPR --bind /srv/scratch/output/egfr/prepared/egfr_large-strwr-inputs:/spras/ZBD3RLG --bind /srv/scratch/output/egfr/egfr_large-strwr-params-CYXCVH5:/spras/GN4QBOE/egfr_large-strwr-params-CYXCVH5 --cleanenv --containall --pwd /spras --env SPRAS=True unpacked/strwr python /ST_RWR/ST_RWR.py --network /spras/ZBD3RLG/network.txt --sources /spras/AOZVQUF/sources.txt --targets /spras/4IWLUPR/targets.txt --output /spras/GN4QBOE/egfr_large-strwr-params-CYXCVH5/output.txt --alpha 0.05

@ntalluri ntalluri dismissed their stale review June 26, 2026 15:25

the outputs for the profiling are all NA, seems like the cgroup files do not exist

@ntalluri ntalluri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is broken where the profiling files all have N/A values.

@agitter

agitter commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

The error Failed to create cgroup: [Errno 30] Read-only file system: '/sys/fs/cgroup/spras-peer-19' seems to point to

spras/spras/profiling.py

Lines 23 to 32 in 11091cd

mycgroup = os.path.join("/sys/fs/cgroup", cgroup_rel.lstrip("/"))
peer_cgroup = os.path.join(os.path.dirname(mycgroup), f"spras-peer-{os.getpid()}")
# Create the peer cgroup directory
try:
os.makedirs(peer_cgroup, exist_ok=True)
except Exception as e:
print(f"Failed to create cgroup: {e}")
return peer_cgroup

Did that work previously and no longer does? Should we return None or fail differently in that block of code if we can't create the directory?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants