Use cmdstan config and decouple stanfit objects from RunSet #851

amas0 wants to merge 17 commits into stan-dev:develop from
Conversation
WardBrian
left a comment
A few comments/questions, but overall I like the look of the code!
Responding to your comments in the PR:
- Runset being left out of the result entirely makes sense. We will have to see how this feels when we tackle something with more moving parts like the MCMC class, but for now I agree.
- Repr changes make sense.
- Stdout file is good. We may want to also do the same thing with diagnostic files and profiling files, which gets us closer to just including the whole Runset again... maybe we have a small dataclass that is like a cut-down runset and stores the relevant files, can be re-used across all of them?
- I agree, it makes sense to replace `csv` with `output` across the board, both in the save functions and the `from_csv` function. This would make sense going forward anyway, c.f. stan-dev/design-docs#58
- Yep, totally agree that InferenceMetadata might want to take a hike when all is said and done. The header could be stored in each class, or in the cut-down-Runset idea ("FileInfo"?)
```python
model_name: str
csv_file: str
config: PathfinderConfig
config_file: str | None = None
```
Is this really optional? Doesn't look like it in from_files
It's optional in the sense that if you pass a config in directly, you don't need to reference the file itself. Outside of simplifying some testing, I'm not sure if it will come up much? The intended entrypoint in practice is always going to be the from_files.
```python
metadata=metadata,
csv_file=str(csv_file),
model_name=stan_config.model_name,
config=stan_config.method_config,
```
It's obviously nice that we don't have any downstream isinstance checks, but it is a bit of a bummer that we end up throwing away the extra information that is present in the StanConfig (e.g. model name, etc). Is there some trick to keep it all around, while maintaining that this is a Pathfinder-specific config?
Good point. We could just store references to both stan_config and config? The top level StanConfig would still contain the inference config within it, but it would just be the same object. Storing both would keep the top level info and still let us have config be validated that the method is correct. Keeps the type checker happy and doesn't drop any info. Thoughts?
That seems fine. I was wondering if pydantic had a more elegant way, but it's not like doing that actually costs any extra memory or anything
We could do something fancy like make StanConfig a generic, which pydantic has good support for:

```python
from typing import Annotated, Generic, TypeVar

from pydantic import BaseModel, ConfigDict, Discriminator

AnyMethodConfig = Annotated[
    SampleConfig | OptimizeConfig | PathfinderConfig
    | LaplaceConfig | VariationalConfig | GeneratedQuantitiesConfig,
    Discriminator("method"),
]

# note: the default= parameter needs Python 3.13+ (or typing_extensions.TypeVar)
MethodT = TypeVar("MethodT", bound=BaseModel, default=AnyMethodConfig)


class StanConfig(BaseModel, Generic[MethodT]):
    model_config = ConfigDict(extra="allow")
    model_name: str
    stan_major_version: str
    stan_minor_version: str
    stan_patch_version: str
    method_config: MethodT
```

Then for example we could have:

```python
@dataclass
class CmdStanPathfinder:
    # other fields ...
    config: StanConfig[PathfinderConfig]
```

And parsing like `StanConfig[PathfinderConfig].model_validate(jsons)` would validate it against the Pathfinder instance of the config. This would result in only one reference to the config, but the type checker would know that it has the pathfinder configuration fields within `method_config`.
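To make the idea concrete, here is a self-contained, runnable sketch of that generic-model pattern with pydantic v2. The `PathfinderConfig` fields here are placeholders, not CmdStan's actual pathfinder arguments, and `protected_namespaces=()` is only needed to silence pydantic's warning about a field named `model_name`:

```python
from typing import Generic, TypeVar

from pydantic import BaseModel, ConfigDict

MethodT = TypeVar("MethodT")


class PathfinderConfig(BaseModel):
    method: str
    num_paths: int  # illustrative field, not the real schema


class StanConfig(BaseModel, Generic[MethodT]):
    # protected_namespaces=() silences the warning about model_name
    # shadowing pydantic's "model_" namespace
    model_config = ConfigDict(extra="allow", protected_namespaces=())
    model_name: str
    method_config: MethodT


# validating against the parametrized model gives a typed method_config
cfg = StanConfig[PathfinderConfig].model_validate(
    {"model_name": "bernoulli",
     "method_config": {"method": "pathfinder", "num_paths": 8}}
)
```

After validation, `cfg.method_config` is a `PathfinderConfig` instance, so both the runtime and the type checker agree on which method-specific fields exist.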
That seems like a good representation of what's actually happening under the hood
```python
    f'found {len(csvfiles)}'
)
csv_file = csvfiles[0]
config_file = os.path.splitext(csv_file)[0] + '_config.json'
```
I assume this is temporary but it's worth putting a comment saying so, since this is kind of a nasty assumption
Yeah definitely temporary. I was planning on doing away with this function entirely once the full changes are in.
@WardBrian Working through some of the other methods. For I see we could maybe parse that info out of the

Hm, that is indeed tricky. I think we should add the piece of information that currently ends up in stdout. Looking at the code, it looks like the only place we actually check the return codes is in model.py, before we call the CmdStanMLE constructor, so maybe it is fine?

Yeah. In my local implementation, I take It does seem like the kind of thing that should be available in the CSV? Like a

This is true for the existing
Yeah good point, this is the snippet from

```python
for i in range(len(runset._retcodes)):
    runset._set_retcode(i, 0)
return CmdStanMLE(runset)
```

Looks like in recreating the runset, we just manually set all return codes to 0, which is just assuming convergence, I believe.
Yeah, so we won't be doing any worse, at least!
@WardBrian Is there an equivalent to saving the metric JSON for the variational method? Looking at CmdStanVB now: we parse the variational eta value from the Stan CSV comments, but AFAIK we don't have an alternative way to source it for now.
There's no preconditioning in our two variational inference implementations, ADVI and Pathfinder, so there's no equivalent of a metric/mass matrix/preconditioner.
@bob-carpenter I believe @amas0 was asking just whether we write the metadata from the algorithm as a JSON, similar to the metric in sampling, not whether there is actually a preconditioner. Unfortunately, to my knowledge, the answer to this is no. It's a shame to have to keep all the comment-reading machinery around for something as rarely needed as the final eta value from ADVI...
Sorry for the confusion. Keeping things around for

A better approach for
Yeah, in particular, I was referring to this section that appears in the output CSV if you run the variational method:

```
lp__,log_p__,log_g__,theta
# Stepsize adaptation complete.
# eta = 0.1
0,0,0,0.5838444
```

I just meant that we output the algorithm stepsize info to the CSV in the same way that we write the sampler adaptation output (as comment lines after the column headers). But in the sample case we can output equivalent info to a metric JSON file, and that doesn't seem to be the case for the variational method here. So, I guess the question is what we want to do with this in the near term. I think it makes sense to have it be an additional column, as Bob suggested? I suppose the question is whether it is important enough to maintain the comment parsing in the near term, or whether we should drop keeping track of it until an alternative source is available.
I'm not sure it makes sense as a column, since the ADVI output is only the one row of the variational parameters, in which eta has already been adapted, and then forward sampling iterations. GitHub code search doesn't seem to find any public code which actually calls

Maybe we preserve a dumber version of comment parsing just for that, i.e. a function that just does

```python
for line in file:
    if line.startswith("# eta ="):
        return float(line.split('=')[1])
```

Rather than all the present machinery? We can also try to add a structured/JSON output like for the metric, but that would take longer to thread through the cmdstan code and we'd still need something in the meantime.
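A complete version of that "dumber" parser might look like the sketch below; the function name and the choice to accept any iterable of lines are my assumptions:

```python
import io


def parse_eta(lines):
    """Scan variational-method Stan CSV comment lines for the adapted
    stepsize; return None if no '# eta =' comment is present.
    Minimal sketch of the idea above; the name is hypothetical."""
    for line in lines:
        if line.startswith("# eta ="):
            return float(line.split("=")[1])
    return None


# the CSV section quoted earlier, as an in-memory file
csv_text = (
    "lp__,log_p__,log_g__,theta\n"
    "# Stepsize adaptation complete.\n"
    "# eta = 0.1\n"
    "0,0,0,0.5838444\n"
)
eta = parse_eta(io.StringIO(csv_text))  # eta == 0.1
```

Accepting an iterable rather than a path keeps it trivially testable and reusable against an already-open file handle.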
Honestly, the code surface for just being able to parse eta out is relatively small under the current structure. It's just one call that partitions the CSV into comment lines and draw lines, and then a second call to pull out the eta value from the comment lines. Even keeping those, we'll be able to drop a good chunk of the Stan CSV parsing logic that goes through the comments.
Draft PR for initial progress on config parsing for #785 and implementing ideas discussed in #848.
@WardBrian I'd like you to take a look at this initial draft, which implements pydantic-based parsing of the config JSON files and uses that to refactor `CmdStanPathfinder` to be built entirely from output files (and decouples it from the runset).

The changes I'd like you to look at are in `stanfit/pathfinder.py` and `stanfit/metadata.py` -- the other files are mostly just temporary patches to keep the library in a functional state.

The `CmdStanPathfinder` class has been rewritten as a dataclass that defines a `from_files(...)` method as a constructor to build it from the output files. I also prefer the clarity that this gives to what attributes the class has.

Some areas where I'd appreciate thoughts:

- `repr`: any reference to the method_args, building directly from files made it seem that was no longer desirable.
- `stdout_file` as an optional attribute. With no reference to the runset, it still seems like something that might be nice to have? But the stdout info is more like a process diagnostic than an essential component of the stanfit.
- `save_csvfiles` probably needs to be called something like `save_output_files`, but I haven't changed that yet.
- `InferenceMetadata`: consumers of the object more or less only scan the header line for Stan variables, I think. So it may be worth a think as to what that should look like.

If the shape of this looks to be in the right direction, I'll proceed with doing this for the other stanfit objects. Following that, we should be able to really clean up what we have going on in `stanfit/__init__.py` and should be able to drop all the code that we have for parsing out the headers of the Stan CSVs.
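As a rough illustration of the `from_files(...)` constructor pattern described in this PR, a stripped-down sketch could look like the following. The class name `PathfinderFit`, its fields, and the config-file schema are all stand-ins, not CmdStanPy's actual API:

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class PathfinderFit:
    """Illustrative stand-in for the rewritten CmdStanPathfinder."""
    model_name: str
    csv_file: str
    config: dict  # parsed method config; a pydantic model in the real PR

    @classmethod
    def from_files(cls, csv_file, config_file):
        # build the fit entirely from on-disk outputs -- no RunSet involved
        with open(config_file) as f:
            cfg = json.load(f)
        return cls(model_name=cfg["model_name"], csv_file=csv_file,
                   config=cfg)


# demo with a throwaway config file
tmpdir = tempfile.mkdtemp()
config_path = os.path.join(tmpdir, "run_config.json")
with open(config_path, "w") as f:
    json.dump({"model_name": "bernoulli", "method": "pathfinder"}, f)
fit = PathfinderFit.from_files(os.path.join(tmpdir, "run.csv"), config_path)
```

The dataclass makes the attribute set explicit, and the classmethod constructor keeps file parsing out of `__init__`, which matches the clarity argument made in the description above.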