
Use sharding for sqlite cache (16 shards) #21292

Open

JukkaL wants to merge 17 commits into master from sqlite-sharding

Conversation

JukkaL (Collaborator) commented Apr 22, 2026

SQLite writes can become a major bottleneck for parallel runs, since only one write can be active at any time. Sharding helps a lot and was pretty easy to implement and reason about.

We need to be a bit careful not to have transactions that span multiple shards, as that could cause deadlocks. Sharding is based on the path name without file name extension(s), so cache data for a single module always goes to the same shard.

Use a predictable string hash function that is tuned for mypyc. It's much faster than, say, SHA-1 (though hashing probably isn't a huge bottleneck).
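A minimal sketch of the shard-selection scheme described above. The function name and the FNV-1a-style hash are illustrative stand-ins, not the PR's actual mypyc-tuned hash; the point is that extensions are stripped from the final path component before hashing, so all cache files for one module land in the same shard.

```python
def shard_for_path(path: str, num_shards: int = 16) -> int:
    """Pick a cache shard for a module path (illustrative sketch only)."""
    # Find where the extension(s) start in the final path component,
    # so "pkg/mod.py" and "pkg/mod.meta.json" share a prefix.
    end = len(path)
    for i in range(len(path) - 1, -1, -1):
        c = path[i]
        if c == "/" or c == "\\":
            break
        if c == ".":
            end = i
    # FNV-1a over the extension-less prefix (stand-in for the real hash).
    h = 2166136261
    for ch in path[:end]:
        h = ((h ^ ord(ch)) * 16777619) & 0xFFFFFFFF
    return h % num_shards
```

Because the hash is a pure function of the extension-less path, shard assignment is stable across runs, which is what makes per-shard transactions safe to reason about.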

A version of this with 8 shards was on the order of 20% faster in some cases when using 8 workers. The impact was bigger on macOS, but Linux also improved (at least on a cloud VM). Before merging, I'll run some benchmarks to validate that 16 shards don't regress anything.

Used coding agent assistance here, but worked in small, reviewed increments.

Also updated the cache conversion and diff scripts (tested manually using coding agent).

Related to #21215.

@JukkaL JukkaL requested a review from ilevkivskyi April 22, 2026 17:01
github-actions (Contributor)

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

ilevkivskyi (Member) left a comment


LG, thanks! Couple of minor things. Btw, did you try this on the Dropbox codebase, or are the perf measurements from our "micro-benchmark"?

Comment thread mypy/build.py
# If there are no errors, only write the cache, don't send anything back
# to the caller (as a micro-optimization).
write_cache_meta_ex(meta_file, meta_ex, manager)
manager.commit_module(meta_file)
ilevkivskyi (Member) commented Apr 22, 2026


If we write after literally every file, then I guess the .commit() calls in worker.py are no-ops, right? Do you think it is safer to keep them?

JukkaL (Collaborator, Author)


I think we need to commit after every file to avoid transactions that span multiple shards. Committing multiple times per file could be redundant, though.

ilevkivskyi (Member)


> I think we need to commit after every file to avoid transactions that span multiple shards

TBH I am not sure why exactly it is a problem. We don't create any kind of shared client-side transaction (apart from those that may be created by individual connections under the hood).

Anyway, I am thinking that if we need to commit after each write, then we should simply use isolation_level=None and delete all the commit() calls altogether, because what we are doing now is literally re-implementing isolation_level=None. IIUC, with this setting each statement becomes its own transaction: https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.isolation_level

@hauntsaninja Please correct me if I am wrong.

ilevkivskyi (Member)


(To be clear, it is fine to do this in a separate PR)

Collaborator


That matches my understanding
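The isolation_level=None suggestion above can be sketched as follows. This is illustrative only: the table name and schema are made up, not mypy's actual cache schema. With isolation_level=None the sqlite3 module opens no implicit transactions, so each statement is committed as soon as it runs and no explicit commit() calls are needed.

```python
import sqlite3

# Autocommit mode: no implicit BEGIN, every statement commits itself.
conn = sqlite3.connect(":memory:", isolation_level=None)
# Made-up example schema, not mypy's real cache tables.
conn.execute("CREATE TABLE cache (path TEXT PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO cache VALUES (?, ?)", ("pkg/mod", b"meta"))
# No conn.commit() needed and no transaction is left open, so a
# write can never accidentally span shards via a lingering transaction.
assert not conn.in_transaction
row = conn.execute(
    "SELECT data FROM cache WHERE path = ?", ("pkg/mod",)
).fetchone()
```

The key property for the sharding scheme is the assert: after each statement the connection holds no open transaction, which is exactly what committing after every write was re-implementing by hand.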

Comment thread mypy/util.py
c: i64 = ord(s[i])
if c == ord("/") or c == ord("\\"):
break
if c == ord("."):
Member


Can the compiler infer the results of these calls statically? If not, maybe use DOT: Final[i64] = 46 etc.?

Collaborator Author


These are constant-folded by mypyc (ord(<literal>)). See the testOrd mypyc irbuild test, for example.
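For reference, a minimal sketch of the pattern under discussion (the helper below is made up for illustration, not mypy code). Plain Python still evaluates ord() at runtime, but mypyc folds ord(<one-character literal>) to an integer constant at compile time, so no named Final constants are needed.

```python
def classify(ch: str) -> str:
    """Classify a path character (illustrative helper, not mypy code)."""
    c = ord(ch)
    if c == ord("/") or c == ord("\\"):  # folded to 47 / 92 by mypyc
        return "separator"
    if c == ord("."):  # folded to 46 by mypyc
        return "dot"
    return "other"
```

The comparisons compile down to plain integer comparisons, so keeping the readable ord("...") spelling costs nothing in the compiled extension.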

Member


Wow! This is nice.

JukkaL (Collaborator, Author) commented Apr 22, 2026

I used the 8-shard variant with the Dropbox codebase, both on macOS and Linux, and it helped significantly when using 8 workers. I'll run some additional measurements with 16 shards before merging.

JukkaL changed the title from "Use sharding for sqlites (16 shards)" to "Use sharding for sqlite cache (16 shards)" on Apr 22, 2026

3 participants