Bonus Performance Optimization: Cache Blocking Technique #10

Open
wants to merge 2 commits into
base: main

Conversation

joachimasare

I implemented a cache blocking technique to optimize matrix multiplication and measured a speedup of about 1.5x on the first run and 1.6x on the second (approximately 49.2% and 59.5% faster, respectively) compared to the reference implementation.

Optimization Details:

  • Technique: Divided large matrices into smaller submatrices or blocks. These blocks were sized to align with my CPU’s cache hierarchy, reducing cache misses and improving data reuse.

  • Block Sizes:

    • Rows (BM): 32
    • Columns (BN): 32
    • Reduction dimension (BK): 32
  • CPU cache details considered when choosing the block size:

    • L2 Cache Size: 18432 KB
    • L3 Cache Size: 24576 KB
      The block size (32x32x32) was chosen to fit within the available L2/L3 cache for efficient data storage and retrieval during computation.

1st run result
Screenshot 2024-11-18 215908

2nd run result
Screenshot 2024-11-18 215950
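
For readers following along, the blocking scheme described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the name `matmul_blocked`, the row-major `float` layout, and the loop order are hypothetical; only the 32x32x32 tile sizes come from the description.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tile sizes quoted in the PR description.
constexpr std::size_t BM = 32;  // rows of C per tile
constexpr std::size_t BN = 32;  // columns of C per tile
constexpr std::size_t BK = 32;  // slice of the reduction dimension per tile

// Bytes touched by one tile step (one A tile, one B tile, one C tile).
// Assumption: float elements; this is 12 KB when float is 4 bytes.
constexpr std::size_t kTileWorkingSetBytes =
    (BM * BK + BK * BN + BM * BN) * sizeof(float);

// Hypothetical helper: C (M x N) += A (M x K) * B (K x N), all row-major.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C,
                    std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i0 = 0; i0 < M; i0 += BM) {
        const std::size_t i1 = std::min(i0 + BM, M);  // clamp partial edge tiles
        for (std::size_t k0 = 0; k0 < K; k0 += BK) {
            const std::size_t k1 = std::min(k0 + BK, K);
            for (std::size_t j0 = 0; j0 < N; j0 += BN) {
                const std::size_t j1 = std::min(j0 + BN, N);
                // Work inside one tile: the A, B, and C tiles stay hot in cache.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

With these sizes, one tile step touches 12 KB of data, which is far below the L2 figure quoted above and also below the 32 KB L1 data cache found on many desktop CPUs. The best tile size still depends on the machine, so it is worth re-measuring rather than assuming 32 is optimal.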

@619135593

Hi, I have run ./evaluate.sh reference, but when I run ./chat, the response is gibberish too. What can I do?

… to the bonus optimization with ARM fallback and compiler optimizations
@joachimasare
Author

joachimasare commented Dec 15, 2024

Update to PR:

Hi @sxtyzhangzk Zhekai Zhang,
Following up on the feedback on Canvas, I have now combined the cache blocking optimization with all other techniques as requested, and tested the performance with compiler optimization enabled (-Ofast). I recently pushed the commits to my branch. The rerun results are below:


Updated Results Table:

| Implementation | Total Time (ms) | Average Time (ms) | Count | GOPs |
|---|---|---|---|---|
| Reference | 522.049011 | 52.204002 | 10 | 5.021444 |
| All Techniques (Without Cache Blocking) | 40.069000 | 4.006000 | 10 | 65.423145 |
| All Techniques (With Cache Blocking) | 37.713001 | 3.771000 | 10 | 69.510250 |

Performance Improvement:

  1. All Techniques without Cache Blocking:

    • GOPs: 65.42 (13.0x improvement over reference).
  2. All Techniques with Cache Blocking:

    • GOPs: 69.51 (13.8x improvement over reference).
    • This represents a 6.25% improvement over the previous "all techniques" implementation.
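
As a quick sanity check, the ratios can be recomputed from the GOPs column of the results table (values rounded to two decimals; the average-time column gives essentially the same ratios):

```latex
\begin{aligned}
\text{All Techniques vs. Reference:} \quad & \frac{65.423145}{5.021444} \approx 13.03 \\
\text{With Cache Blocking vs. Reference:} \quad & \frac{69.510250}{5.021444} \approx 13.84 \\
\text{With vs. Without Cache Blocking:} \quad & \frac{69.510250}{65.423145} \approx 1.0625 \quad (\text{about } 6.25\%)
\end{aligned}
```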

Screenshots:

  • Reference Implementation:
    ![Reference](Screenshot 2024-12-15 031028)
  • All Techniques (Without Cache Blocking):
    ![All Techniques](Screenshot 2024-12-15 031036)
  • All Techniques (With Cache Blocking):
    ![All Techniques + Cache Blocking](Screenshot 2024-12-15 031133)

Commit Details

  • Added cache blocking to all_techniques.cc with fallback for ARM architecture.
  • Enabled compiler optimizations (-Ofast) in the Makefile for better performance evaluation.


Summary:

After integrating cache blocking with all techniques, I measured a 6.25% performance improvement over the previous "All Techniques" implementation.

"All Techniques" achieved a ~13x improvement over the reference implementation.
"All Techniques + Cache Blocking" further improved performance to ~13.8x compared to the reference.

@sxtyzhangzk
Collaborator

Got a 6% speedup over our reference solution on my desktop. Great job!

@joachimasare
Author

great! thanks.
