⚡ Optimize file reading in idstack-learnings-delete#35
Conversation
This commit modifies `bin/idstack-learnings-delete` to stream through the learning files rather than loading the entire file into memory using `readlines()`. The memory overhead is reduced significantly without impacting performance by doing an O(N) stream to find the index and another stream to write to a temporary file while skipping the removed line, before utilizing an atomic `os.replace` to replace the old file. Memory usage dropped by approximately 87% in tests with 1,000,000 rows. Co-authored-by: savvides <1580637+savvides@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
There was a problem hiding this comment.
Code Review
This pull request refactors the bin/idstack-learnings-delete script to improve memory efficiency and file handling safety by processing the learnings file line-by-line and using a temporary file for atomic updates. A review comment suggests replacing the bare except clause with specific exception handling for json.JSONDecodeError and adding a type check to ensure the parsed JSON is a dictionary, which prevents catching system signals and improves robustness.
| try: | ||
| d = json.loads(line) | ||
| if d.get('key') == key: | ||
| found_idx = i | ||
| except: | ||
| pass |
There was a problem hiding this comment.
Avoid using a bare except: clause as it catches all exceptions, including SystemExit and KeyboardInterrupt. This can make the script difficult to terminate via Ctrl+C and may hide unexpected bugs. It is better to catch the specific exception expected from json.loads and use an explicit type check to ensure the line is a dictionary before accessing its keys.
try:
d = json.loads(line)
if isinstance(d, dict) and d.get('key') == key:
found_idx = i
except json.JSONDecodeError:
pass
References
- PEP 8 recommends avoiding bare except: clauses because they catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program and potentially disguising other problems. (link)
💡 What: Replaced
readlines()inbin/idstack-learnings-deletewith streaming file iteration and a temporary file write via atomicos.replace.🎯 Why: The previous approach loaded the entire contents of
learnings.jsonlinto memory. For large numbers of learnings, this caused a significant O(N) memory overhead and is prone to data loss if it crashed while writing.📊 Measured Improvement: Generating 1 million rows to create a ~45 MB learnings file, max memory usage (
Maximum resident set size) decreased by roughly 87%, falling from 113.28 MB to 14.78 MB. Execution time was virtually equivalent (around 4.08 seconds for both, though user/sys time distributions shifted towards user space). The atomic temporary file approach prevents file truncation if a crash happens during the write pass.PR created automatically by Jules for task 3938025851398790547 started by @savvides