Skip to content

Reduce cognitive complexity in five scikit-learn modules through helper method extraction#642

Open
sonarqube-agent[bot] wants to merge 1 commit into
mainfrom
remediate-main-20260629-010125-ee8ba763
Open

Reduce cognitive complexity in five scikit-learn modules through helper method extraction#642
sonarqube-agent[bot] wants to merge 1 commit into
mainfrom
remediate-main-20260629-010125-ee8ba763

Conversation

@sonarqube-agent

Copy link
Copy Markdown

This PR was automatically created by the Remediation Agent's Scheduled backlog remediation feature.

Why these issues? All five issues are CRITICAL violations of python:S3776 with complexity scores exceeding the threshold by 9–16 points. These violations span key scikit-learn modules and are well-suited to automated remediation through helper method extraction. Addressing them collectively delivers substantial code-quality improvements across the library's most complex functions.

This change addresses five CRITICAL SonarQube violations of the Cognitive Complexity rule (python:S3776) by extracting complex nested logic into dedicated helper methods. By decomposing deeply nested conditionals in the discriminant analysis, SVM light format I/O, dictionary learning, histogram gradient boosting, and LDA modules, the refactoring improves code maintainability and readability while bringing these functions into compliance with complexity thresholds.

View Project in SonarCloud


Fixed Issues

python:S3776 - Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed. • CRITICALView issue

Location: sklearn/discriminant_analysis.py:1054

Why is this an issue?

Cognitive Complexity is a measure of how hard it is to understand the control flow of a unit of code. Code with high cognitive complexity is hard to read, understand, test, and modify.

What changed

This hunk extracts the rank-checking and error-raising logic from the fit method into a new helper method _check_rank. By moving this deeply nested code (which contained multiple conditionals including if rank < n_features, if self.solver == 'svd' and n_samples_class <= n_features, and the ternary if self.solver == 'eigen' else 'reg_param') out of the fit method, it reduces the cognitive complexity of fit. The nested conditionals that were contributing significantly to the complexity score (flows involving +2 and +3 for nesting, plus +1 for mixed conditions) are now in a separate method where they start at a lower nesting level, thus reducing the overall cognitive complexity of the fit function from 24 toward the allowed threshold of 15.

--- a/sklearn/discriminant_analysis.py
+++ b/sklearn/discriminant_analysis.py
@@ -1052,0 +1053,24 @@ class QuadraticDiscriminantAnalysis(
+    def _check_rank(self, scaling_class, n_features, x_class, class_label):
+        """Raise an error if the covariance matrix is not full rank."""
+        rank = np.sum(scaling_class > self.tol)
+        if rank < n_features:
+            n_samples_class = x_class.shape[0]
+            if self.solver == "svd" and n_samples_class <= n_features:
+                raise linalg.LinAlgError(
+                    f"The covariance matrix of class {class_label} is not full "
+                    f"rank. When using `solver='svd'` the number of samples in "
+                    f"each class should be more than the number of features, but "
+                    f"class {class_label} has {n_samples_class} samples and "
+                    f"{n_features} features. Try using `solver='eigen'` and "
+                    f"setting the parameter `shrinkage` for regularization."
+                )
+            else:
+                msg_param = (
+                    "shrinkage" if self.solver == "eigen" else "reg_param"
+                )
+                raise linalg.LinAlgError(
+                    f"The covariance matrix of class {class_label} is not full "
+                    f"rank. Increase the value of `{msg_param}` to reduce the "
+                    f"collinearity.",
+                )
+
python:S3776 - Refactor this function to reduce its Cognitive Complexity from 31 to the 15 allowed. • CRITICALView issue

Location: sklearn/datasets/_svmlight_format_io.py:463

Why is this an issue?

Cognitive Complexity is a measure of how hard it is to understand the control flow of a unit of code. Code with high cognitive complexity is hard to read, understand, test, and modify.

What changed

Defines two helper functions, _validate_comment and _sort_sparse_indices, that extract complex logic out of the dump_svmlight_file function. The comment validation logic (isinstance check, bytes decoding, NUL byte check) and the sparse index sorting logic (with nested conditionals and hasattr checks) are moved into these smaller, focused functions. This reduces the cognitive complexity of dump_svmlight_file by removing multiple nested conditionals and branching paths from the main function body.

--- a/sklearn/datasets/_svmlight_format_io.py
+++ b/sklearn/datasets/_svmlight_format_io.py
@@ -450,0 +451,25 @@ def _dump_svmlight(X, y, f, multilabel, one_based, comment, query_id):
+def _validate_comment(comment):
+    """Validate and encode the comment for SVMlight format."""
+    if comment is None:
+        return comment
+    # Convert comment string to list of lines in UTF-8.
+    # If a byte string is passed, then check whether it's ASCII;
+    # if a user wants to get fancy, they'll have to decode themselves.
+    if isinstance(comment, bytes):
+        comment.decode("ascii")  # just for the exception
+    else:
+        comment = comment.encode("utf-8")
+    if b"\0" in comment:
+        raise ValueError("comment string contains NUL byte")
+    return comment
+
+
+def _sort_sparse_indices(val, original):
+    """Sort sparse matrix indices, returning the sorted result."""
+    if val is original and hasattr(val, "sorted_indices"):
+        return val.sorted_indices()
+    if hasattr(val, "sort_indices"):
+        val.sort_indices()
+    return val
+
+
python:S3776 - Refactor this function to reduce its Cognitive Complexity from 25 to the 15 allowed. • CRITICALView issue

Location: sklearn/decomposition/_dict_learning.py:548

Why is this an issue?

Cognitive Complexity is a measure of how hard it is to understand the control flow of a unit of code. Code with high cognitive complexity is hard to read, understand, test, and modify.

What changed

This hunk extracts the initialization logic (for code and dictionary) from the _dict_learning function into a new helper function _initialize_code_and_dictionary. This reduces the cognitive complexity of _dict_learning by moving conditional branches (the if code_init and dict_init is not None / else and if n_components / else blocks) out of the main function, helping bring the cognitive complexity score from 25 down toward the allowed threshold of 15.

--- a/sklearn/decomposition/_dict_learning.py
+++ b/sklearn/decomposition/_dict_learning.py
@@ -548,22 +548,2 @@ def _update_dict(
-def _dict_learning(
-    X,
-    n_components,
-    *,
-    alpha,
-    max_iter,
-    tol,
-    method_params,
-    n_jobs,
-    init,
-    callback,
-    verbose,
-    random_state,
-    return_n_iter,
-    positive,
-):
-    """Main dictionary learning algorithm"""
-    positive_dict, positive_code = positive
-    dict_init, code_init = init
-    method, method_max_iter = method_params
-
-    t0 = time.time()
+def _initialize_code_and_dictionary(X, n_components, code_init, dict_init):
+    """Initialize code and dictionary for dictionary learning."""
python:S3776 - Refactor this function to reduce its Cognitive Complexity from 30 to the 15 allowed. • CRITICALView issue

Location: sklearn/ensemble/_hist_gradient_boosting/grower.py:527

Why is this an issue?

Cognitive Complexity is a measure of how hard it is to understand the control flow of a unit of code. Code with high cognitive complexity is hard to read, understand, test, and modify.

What changed

This hunk replaces a deeply nested block of monotonic constraint logic (with multiple if/else branches) in the split_next method with a single method call to _apply_monotonic_constraints. By extracting this logic into a separate method, it reduces the cognitive complexity of split_next by removing several nested conditionals (the NO_CST check, the POS check, and the NEG else branch), which directly reduces the function's cognitive complexity score toward the allowed threshold of 15.

--- a/sklearn/ensemble/_hist_gradient_boosting/grower.py
+++ b/sklearn/ensemble/_hist_gradient_boosting/grower.py
@@ -615,21 +615,3 @@ class TreeGrower:
-            # Set value bounds for respecting monotonic constraints
-            # See test_nodes_values() for details
-            if (
-                self.monotonic_cst[node.split_info.feature_idx]
-                == MonotonicConstraint.NO_CST
-            ):
-                lower_left = lower_right = node.children_lower_bound
-                upper_left = upper_right = node.children_upper_bound
-            else:
-                mid = (left_child_node.value + right_child_node.value) / 2
-                if (
-                    self.monotonic_cst[node.split_info.feature_idx]
-                    == MonotonicConstraint.POS
-                ):
-                    lower_left, upper_left = node.children_lower_bound, mid
-                    lower_right, upper_right = mid, node.children_upper_bound
-                else:  # NEG
-                    lower_left, upper_left = mid, node.children_upper_bound
-                    lower_right, upper_right = node.children_lower_bound, mid
-            left_child_node.set_children_bounds(lower_left, upper_left)
-            right_child_node.set_children_bounds(lower_right, upper_right)
+            self._apply_monotonic_constraints(
+                node, left_child_node, right_child_node
+            )
python:S3776 - Refactor this function to reduce its Cognitive Complexity from 18 to the 15 allowed. • CRITICALView issue

Location: sklearn/decomposition/_lda.py:637

Why is this an issue?

Cognitive Complexity is a measure of how hard it is to understand the control flow of a unit of code. Code with high cognitive complexity is hard to read, understand, test, and modify.

What changed

This hunk extracts the perplexity evaluation and convergence checking logic from the fit method into a new helper method _check_perplexity_and_convergence. This reduces the cognitive complexity of the fit method by moving deeply nested conditional logic (the if evaluate_every, if self.verbose, if last_bound and abs(...), and elif self.verbose blocks) out of the main loop body and into a separate, well-named function. This directly addresses the code smell warning about the fit method having a cognitive complexity of 18, which exceeds the allowed threshold of 15.

--- a/sklearn/decomposition/_lda.py
+++ b/sklearn/decomposition/_lda.py
@@ -635,0 +636,28 @@ class LatentDirichletAllocation(
+    def _check_perplexity_and_convergence(
+        self, X, i, evaluate_every, max_iter, last_bound, parallel
+    ):
+        """Evaluate perplexity and check for convergence during fitting.
+
+        Returns (last_bound, should_break).
+        """
+        if evaluate_every > 0 and (i + 1) % evaluate_every == 0:
+            doc_topics_distr, _ = self._e_step(
+                X, cal_sstats=False, random_init=False, parallel=parallel
+            )
+            bound = self._perplexity_precomp_distr(
+                X, doc_topics_distr, sub_sampling=False
+            )
+            if self.verbose:
+                print(
+                    "iteration: %d of max_iter: %d, perplexity: %.4f"
+                    % (i + 1, max_iter, bound)
+                )
+
+            if last_bound and abs(last_bound - bound) < self.perp_tol:
+                return bound, True
+            return bound, False
+
+        elif self.verbose:
+            print("iteration: %d of max_iter: %d" % (i + 1, max_iter))
+        return last_bound, False
+

Have a suggestion or found an issue? Share your feedback here.


SonarQube Remediation Agent uses AI. Check for mistakes.

Fixed issues:
- AZ45CyQmRXnEWm2Rf5b- for python:S3776 rule
- AZ45CvRaRXnEWm2Rf4zO for python:S3776 rule
- AZ45CvTuRXnEWm2Rf4zc for python:S3776 rule
- AZ45C0KNRXnEWm2Rf6GB for python:S3776 rule
- AZ45CuuSRXnEWm2Rf4p7 for python:S3776 rule

Generated by SonarQube Agent (task: 972fb7d6-c81c-4574-a99e-7b77474d2ab8)
@sonarqube-agent

Copy link
Copy Markdown
Author

⚠️ This repository does not have a CODEOWNERS file. The PR has been created but has not been automatically assigned to any reviewer. To ensure PRs are reviewed promptly, consider adding a CODEOWNERS file to your repository.

@sonarqubecloud

Copy link
Copy Markdown

SonarQube reviewer guide

Review in SonarQube

Summary: Refactor multiple modules to improve code maintainability by extracting repeated logic into helper methods.

Review Focus:

  • Verify helper methods correctly encapsulate original logic without functional changes
  • Pay special attention to _sort_sparse_indices in svmlight—ensure the conditional logic preserving user's original object is correct
  • Check _check_convergence and _check_perplexity_and_convergence return values are handled properly by callers
  • Validate _compute_histograms_and_split preserves timing measurements and memory management

Start review at: sklearn/datasets/_svmlight_format_io.py. This file contains the most logic-critical extraction (_sort_sparse_indices) where incorrect handling of the val is original comparison could introduce subtle bugs in sparse matrix operations.

💬 Please send your feedback

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues
0 New dependency risks

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant