Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AuxAlignment for parsing secondary/supplementary alignments in SAM tags #209

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

clintval
Copy link
Member

@clintval clintval commented Jan 10, 2025

Prior drafts are here and here.

  • Add a few more unit tests to get back up to 100% coverage

Sorry, something went wrong.

@clintval clintval requested review from nh13 and tfenne as code owners January 10, 2025 16:56
Copy link
Contributor

coderabbitai bot commented Jan 10, 2025

Walkthrough

The pull request introduces significant changes to the SAM module by replacing the SupplementaryAlignment class with the new AuxAlignment class, which manages auxiliary alignments from SAM tags such as SA, XA, and XB. The AuxAlignment class includes attributes for various alignment metrics and validation checks. The deprecated SupplementaryAlignment class has had its parsing functionality transferred to AuxAlignment, and related examples have been updated.

New methods for creating auxiliary alignments from primary alignments have been added, along with a constant to indicate missing query bases. Comprehensive unit tests for the AuxAlignment class focus on input validation and parsing of auxiliary alignment tags. Additionally, tests for the deprecated SupplementaryAlignment class have been added, enhancing overall test coverage.

Finishing Touches

  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Sorry, something went wrong.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/fgpyo/sam/test_supplementary_alignments.py (1)

Line range hint 8-24: Evaluate the necessity of tests for deprecated SupplementaryAlignment.

Since SupplementaryAlignment is deprecated, consider focusing on testing the new AuxAlignment class. Maintaining tests for a deprecated class might not be essential unless required for backward compatibility.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0b4227 and e0700de.

📒 Files selected for processing (3)
  • fgpyo/sam/__init__.py (5 hunks)
  • tests/fgpyo/sam/test_aux_alignment.py (1 hunks)
  • tests/fgpyo/sam/test_supplementary_alignments.py (4 hunks)
🔇 Additional comments (2)
fgpyo/sam/__init__.py (1)

1255-1528: Implementation of AuxAlignment is solid.

The AuxAlignment class is well-designed, effectively replacing SupplementaryAlignment with enhanced functionality for parsing auxiliary alignments from SAM tags. The parsing methods and validation checks are thorough.

tests/fgpyo/sam/test_aux_alignment.py (1)

1-612: Comprehensive test coverage for AuxAlignment.

The test suite thoroughly validates input parsing, exception handling, and key functionalities of AuxAlignment, ensuring robustness against various edge cases.

fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jan 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.54%. Comparing base (b0b4227) to head (e271e6c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #209      +/-   ##
==========================================
+ Coverage   91.06%   91.54%   +0.47%     
==========================================
  Files          18       18              
  Lines        2283     2400     +117     
  Branches      337      354      +17     
==========================================
+ Hits         2079     2197     +118     
  Misses        133      133              
+ Partials       71       70       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this direction!

fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after @tfenne's comments look good and you're happy with it

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/fgpyo/sam/test_template_iterator.py (1)

225-241: Test looks good but could use more edge cases.

Test follows AAA pattern and verifies core functionality.

Consider adding test cases for:

  • Empty/malformed tag values
  • Multiple auxiliary alignments
  • Hard-clipped alignments
fgpyo/sam/__init__.py (2)

142-164: Add more comprehensive examples.

Current examples show basic usage. Consider adding:

  • Multiple auxiliary alignment parsing
  • Error handling examples
  • Template integration examples

1482-1486: Improve hard-clipping sequence handling.

Current implementation sets sequence to "*" for any hard-clipping. Consider implementing the TODO to properly pad/clip sequences.

Would you like me to propose an implementation for proper sequence padding/clipping?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0700de and f8d56d0.

📒 Files selected for processing (3)
  • fgpyo/sam/__init__.py (6 hunks)
  • tests/fgpyo/sam/test_aux_alignment.py (1 hunks)
  • tests/fgpyo/sam/test_template_iterator.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/fgpyo/sam/test_aux_alignment.py
🧰 Additional context used
🪛 GitHub Actions: Python package
fgpyo/sam/__init__.py

[error] 1381-1381: f-string: expecting '}' - Syntax error in f-string formatting

🔇 Additional comments (3)
fgpyo/sam/__init__.py (3)

Line range hint 172-214: Constants and imports look good.


1261-1537: Well-structured class implementation.

The AuxAlignment class is well-documented and implements comprehensive validation. Good separation of concerns between parsing and SAM record generation.

🧰 Tools
🪛 GitHub Actions: Python package

[error] 1381-1381: f-string: expecting '}' - Syntax error in f-string formatting


1348-1350: Resolve alignment score validation TODO.

The commented validation for alignment scores should be resolved.

Let's check BWA's behavior with negative alignment scores:

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
fgpyo/sam/__init__.py (5)

142-164: Add more comprehensive examples.

Add examples showing:

  • Error cases and exception handling
  • Multiple auxiliary alignments in a single tag
  • Usage with Template class

1225-1230: Add usage examples in docstring.

Add examples showing:

  • How the method affects the template structure
  • Common use cases

1339-1353: Add more validations in post_init.

Add validation for:

  • reference_name format (non-empty, valid characters)
  • cigar string format

1482-1487: Implement TODO for handling hard clips.

The current implementation sets bases/quals to "*" when hard clips exist. Consider implementing the suggested padding/clipping logic.

Would you like me to help implement the sequence padding/clipping logic?


1519-1523: Consider making RG tag mandatory.

The Read Group (RG) tag is optional but highly recommended. Consider requiring it for consistency with SAM spec.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8d56d0 and e7f2ae2.

📒 Files selected for processing (1)
  • fgpyo/sam/__init__.py (6 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/sam/__init__.py

[warning] 1376-1376: fgpyo/sam/init.py#L1376
Added line #L1376 was not covered by tests


[warning] 1387-1387: fgpyo/sam/init.py#L1387
Added line #L1387 was not covered by tests


[warning] 1437-1437: fgpyo/sam/init.py#L1437
Added line #L1437 was not covered by tests


[warning] 1469-1469: fgpyo/sam/init.py#L1469
Added line #L1469 was not covered by tests

🔇 Additional comments (2)
fgpyo/sam/__init__.py (2)

213-214: LGTM!

The constant follows the module's naming convention and is well-documented.


1348-1350: Resolve TODO comment about alignment scores.

The comment about negative alignment scores needs resolution. Either:

  • Add validation
  • Document why negative scores are allowed

fgpyo/sam/__init__.py Show resolved Hide resolved
@clintval clintval requested a review from tfenne January 10, 2025 20:48
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
fgpyo/sam/__init__.py (1)

142-164: Add error case examples to documentation.

Add examples showing:

  • Invalid tag values
  • Multi-value parsing
  • Error handling
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7f2ae2 and da497e7.

📒 Files selected for processing (3)
  • fgpyo/sam/__init__.py (6 hunks)
  • tests/fgpyo/sam/test_aux_alignment.py (1 hunks)
  • tests/fgpyo/sam/test_template_iterator.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/fgpyo/sam/test_template_iterator.py
  • tests/fgpyo/sam/test_aux_alignment.py
🔇 Additional comments (4)
fgpyo/sam/__init__.py (4)

213-215: LGTM!

Well-documented constant following established patterns.


1225-1230: LGTM!

Clean implementation with proper null handling and efficient iterator usage.


1348-1350: Resolve TODO comment about alignment score validation.

The comment about alignment score validation needs resolution. If BWA allows negative scores, document this. If not, implement the validation.


1433-1438: Add test coverage for error paths.

Missing test coverage for:

  • Empty tag values
  • Invalid tag values
  • Multi-value string parsing errors
✅ Verification successful

Error paths are well tested.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check test coverage for error paths
fd "test_.*\.py" tests/fgpyo/sam/ -x rg -l "def test.*error" {}

Length of output: 135


Script:

#!/bin/bash
# Check specific error test cases
echo "=== Test cases for aux alignment errors ==="
rg "def test.*error" -A 5 tests/fgpyo/sam/test_aux_alignment.py

echo -e "\n=== Implementation of from_tag_value ==="
ast-grep --pattern 'def from_tag_value($$$) {$$$}'

echo -e "\n=== Test cases with empty or invalid values ==="
rg "empty|invalid|malformed" tests/fgpyo/sam/test_aux_alignment.py

Length of output: 1685

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (7)
tests/fgpyo/sam/test_aux_alignment.py (4)

13-58: Add test cases for alignment_score validation and None handling.

Add test cases for:

  • Negative alignment_score validation
  • None values in optional fields (mapping_quality, edit_distance, alignment_score)

60-164: Add test cases for cigar string validation.

Add test cases for:

  • Empty cigar string ("")
  • Invalid cigar string ("50X")

166-215: Add test cases for invalid field formats.

Add test cases for:

  • Invalid mapping quality format ("abc")
  • Invalid edit distance format ("xyz")

589-603: Track hard-clipping sequence handling enhancement.

The TODO about padding/clipping sequence with hard clips should be tracked in an issue.

Want me to create a GitHub issue to track this enhancement?

fgpyo/sam/__init__.py (3)

1261-1350: Add validation for alignment_score.

Add validation to ensure alignment_score is not negative when provided.

 def __post_init__(self) -> None:
     """Perform post-initialization validation on this dataclass."""
     errors: list[str] = []
     if self.reference_start < 0:
         errors.append(f"Reference start cannot be less than 0! Found: {self.reference_start}")
     if self.mapping_quality is not None and self.mapping_quality < 0:
         errors.append(f"Mapping quality cannot be less than 0! Found: {self.mapping_quality}")
     if self.edit_distance is not None and self.edit_distance < 0:
         errors.append(f"Edit distance cannot be less than 0! Found: {self.edit_distance}")
+    if self.alignment_score is not None and self.alignment_score < 0:
+        errors.append(f"Alignment score cannot be less than 0! Found: {self.alignment_score}")
     if len(errors) > 0:
         raise ValueError("\n".join(errors))

1356-1422: Refactor tag parsing into separate methods.

Split parsing logic into tag-specific methods for better maintainability:

  • _parse_sa_tag
  • _parse_xa_tag
  • _parse_xb_tag

1423-1438: Add type hint for values list.

Add explicit type hint for clarity:

-            values: list[str] = cast(str, rec.get_tag(tag)).rstrip(";").split(";")
+            values: list[str] = cast(str, rec.get_tag(tag)).rstrip(";").split(";") 
+            aux_values: list[str] = [v for v in values if v]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da497e7 and 7055cb5.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (6 hunks)
  • tests/fgpyo/sam/test_aux_alignment.py (1 hunks)
🔇 Additional comments (1)
fgpyo/sam/__init__.py (1)

1225-1230: LGTM!

Clean implementation of template reconstruction with auxiliary alignments.

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (8)
tests/fgpyo/sam/test_aux_alignment.py (4)

12-57: Add test cases for optional parameters.

Add test cases for:

  • None values in optional parameters
  • alignment_score validation
@pytest.mark.parametrize(
    ["kwargs", "error"],
    [
        (
            {
                "reference_name": "chr9",
                "reference_start": 123232,
                "is_forward": False,
                "cigar": Cigar.from_cigarstring("49M"),
                "edit_distance": None,
                "alignment_score": -1,
                "mapping_quality": None,
            },
            "Alignment score cannot be less than 0! Found: -1",
        ),
    ],
)

59-164: Add edge cases for CIGAR string parsing.

Add test cases for:

  • Empty CIGAR string
  • Invalid CIGAR string
[
    # Test invalid CIGAR string in SA tag
    "SA",
    "chr9,104599381,-,49K,60,4",
    pytest.raises(CigarParsingException),
],
[
    # Test empty CIGAR string in SA tag
    "SA", 
    "chr9,104599381,-,,60,4",
    pytest.raises(ValueError),
],

165-171: Improve test specificity for field count validation.

Add separate test cases for each tag type with their specific field count requirements.

@pytest.mark.parametrize(
    ["tag", "value", "expected_fields"],
    [
        ("SA", "chr9,104599381", 6),
        ("XA", "chr9,104599381", 4),
        ("XB", "chr9,104599381", 6),
    ],
)
def test_from_tag_value_invalid_field_count(tag: str, value: str, expected_fields: int) -> None:
    with pytest.raises(ValueError, match=f"{tag} tag value has the incorrect number of fields.*"):
        AuxAlignment.from_tag_value(tag, value)

588-603: Track enhancement for hard-clipping sequence handling.

Current implementation sets sequence to None when hard clips exist. Consider implementing the suggested padding/clipping logic.

Would you like me to open a GitHub issue to track this enhancement?

fgpyo/sam/__init__.py (4)

1266-1330: Add type hints to docstring examples.

Add type hints to make examples more informative:

 ```python
->>> supp = AuxAlignment.from_tag_value("SA", "chr1,123,+,50S100M,60,0")
+>>> supp: AuxAlignment = AuxAlignment.from_tag_value("SA", "chr1,123,+,50S100M,60,0")
 >>> supp.reference_name
 'chr1'

1361-1427: Simplify tag parsing logic.

Extract tag-specific parsing into separate methods to reduce complexity:

@classmethod
def _parse_sa_tag(cls, fields: list[str]) -> Self:
    """Parse SA tag fields."""
    reference_name, reference_start, strand, cigar, mapq, edit_distance = fields
    if strand not in ("+", "-"):
        raise ValueError(f"The strand field is not either '+' or '-': {strand}")
    return cls(...)

@classmethod
def _parse_xa_xb_tag(cls, fields: list[str]) -> Self:
    """Parse XA/XB tag fields."""
    if len(fields) == 4:
        reference_name, stranded_start, cigar, edit_distance = fields
        mapq = alignment_score = None
    else:
        reference_name, stranded_start, cigar, edit_distance, alignment_score, mapq = fields
    return cls(...)

1444-1527: Split record creation into smaller methods.

Extract record creation logic into separate methods:

def _create_base_record(self, rec: AlignedSegment, hit: AuxAlignment) -> AlignedSegment:
    """Create base record with query information."""
    aux = AlignedSegment(header=rec.header)
    aux.query_name = rec.query_name
    self._set_query_sequence(aux, rec, hit)
    return aux

def _set_query_sequence(self, aux: AlignedSegment, rec: AlignedSegment, hit: AuxAlignment) -> None:
    """Set query sequence and qualities."""
    if "H" in rec.cigarstring:
        aux.query_sequence = None
        aux.query_qualities = None
    elif rec.is_forward and not hit.is_forward:
        aux.query_sequence = reverse_complement(rec.query_sequence)
        aux.query_qualities = rec.query_qualities[::-1]
    else:
        aux.query_sequence = rec.query_sequence
        aux.query_qualities = rec.query_qualities

1230-1234: Add validation for auxiliary alignments.

Verify that auxiliary alignments belong to the template:

 def with_aux_alignments(self) -> "Template":
     """Rebuild this template by adding auxiliary alignments from the primary alignment tags."""
     r1_aux = iter([]) if self.r1 is None else AuxAlignment.many_pysam_from_rec(self.r1)
     r2_aux = iter([]) if self.r2 is None else AuxAlignment.many_pysam_from_rec(self.r2)
+    # Verify auxiliary alignments belong to this template
+    for aux in chain(r1_aux, r2_aux):
+        if aux.query_name != self.name:
+            raise ValueError(f"Auxiliary alignment {aux.query_name} does not match template {self.name}")
     return self.build(recs=chain(self.all_recs(), r1_aux, r2_aux))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7055cb5 and e271e6c.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (6 hunks)
  • tests/fgpyo/sam/test_aux_alignment.py (1 hunks)

@@ -1203,6 +1227,12 @@ def set_tag(
for rec in self.all_recs():
rec.set_tag(tag, value)

def with_aux_alignments(self) -> "Template":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def with_aux_alignments(self) -> "Template":
def with_aux_alignments(self, include_secondary: bool = True, include_supplementary: bool = False) -> "Template":

Are these sensible defaults?

By default, bwa aln will produce a single record with all secondary alignments in the XA tag, and bwa mem will produce multiple records, each with all other supplementary alignments in the SA tag

Alternatively (or in addition), we could confirm that rehydrating does not create any duplicate records on the template

Copy link
Member Author

@clintval clintval Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This refactor may actually help me with another design decision I didn't love but couldn't solve easily at the time:

@clintval clintval assigned clintval and unassigned tfenne Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants