SLE-900: Speed up indexing by excluding unrelated files #729

Merged
merged 9 commits into from
Sep 2, 2024

@thahnen thahnen commented Aug 29, 2024

Summary

Currently, when a project is imported or a workspace is opened, all its files are indexed (with very light filtering) and sent to SLCORE for housekeeping (SonarLintEclipseHeadlessRpcClient#listFiles() and FileSystemSynchronizer#getSonarLintJsonFiles()).
The same happens per project when one or more projects are analyzed, in order to collect all relevant files to send to SLCORE for analysis (SelectionUtils#collectFiles()).

We also listen for files being added, changed, or removed, and compute the delta for every file in the project regardless of whether it is relevant (FileSystemSynchronizer#resourceChanged(), which invokes FileSystemSynchronizer#visitDeltaPostChange()).

This is slow and inefficient, in terms of both run time and memory consumption. It is especially costly for hierarchical projects: in Eclipse, a parent project contains the content of all its sub-modules/-projects as well, although hidden, even if those sub-modules/-projects are also present in the workspace.

Compilation output (e.g. bytecode)

We want to narrow down the set of files that are "present" and available to SonarLint even though they are not relevant. Compiled code is one example: the Java analysis needs the bytecode to yield better results, but the bytecode cannot be analyzed itself, so its files don't need to be indexed. SonarLint only needs to know where the bytecode is stored in order to populate the analysis properties correctly.

For this, the JDT / Maven and Gradle sub-plugins are enhanced with a new extension point to aggregate the "output" directories, which are then excluded from the indexing.
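The idea of excluding aggregated output directories can be sketched in plain Java as below. This is only an illustration under assumptions: `OutputDirectoryFilter` and its methods are hypothetical names, it works on `java.nio.file.Path` instead of Eclipse resources, and it is not the extension point added in this PR.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper, not the actual SonarLint for Eclipse extension point. */
final class OutputDirectoryFilter {
  private OutputDirectoryFilter() {
  }

  /** Returns true when the file lies inside one of the given output directories. */
  static boolean isInsideOutput(Path file, Set<Path> outputDirs) {
    Path normalized = file.normalize();
    // Path#startsWith compares whole path segments, so "target-notes" is not inside "target".
    return outputDirs.stream().anyMatch(dir -> normalized.startsWith(dir.normalize()));
  }

  /** Keeps only the files that should still be indexed. */
  static List<Path> filesToIndex(List<Path> allFiles, Set<Path> outputDirs) {
    return allFiles.stream()
      .filter(file -> !isInsideOutput(file, outputDirs))
      .collect(Collectors.toList());
  }
}
```

With the output directories `target` and `bin`, a file such as `target/classes/A.class` is dropped while `src/main/java/A.java` is kept. The directories themselves stay known to the caller, so they could still be passed on as analysis properties.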

Sub-modules/-projects content in the parent project

We also want to narrow down the focus for projects that have multiple sub-modules/-projects: the files of a sub-module/-project are already indexed for that sub-module/-project and shouldn't be indexed and taken into account again in the parent project.

For this, the Maven and Gradle sub-plugins are enhanced by excluding all the files that are not actually part of their project scope.
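One way to decide whether a file is "actually part of a project's scope" is to assign every file to the deepest project root that contains it. The sketch below assumes that approach; `ProjectScope` and its methods are hypothetical and do not reflect how the Maven and Gradle sub-plugins implement the check.

```java
import java.nio.file.Path;
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

/** Hypothetical helper illustrating "deepest enclosing project wins". */
final class ProjectScope {
  private ProjectScope() {
  }

  /** Returns the deepest project root that contains the file, if any. */
  static Optional<Path> owningProject(Path file, Collection<Path> projectRoots) {
    Path normalized = file.normalize();
    return projectRoots.stream()
      .map(Path::normalize)
      .filter(root -> normalized.startsWith(root))
      // The root with the most path segments is the most specific one.
      .max(Comparator.comparingInt(Path::getNameCount));
  }

  /** Returns true only when the given project is the most specific owner of the file. */
  static boolean belongsTo(Path project, Path file, Collection<Path> projectRoots) {
    Path normalizedProject = project.normalize();
    return owningProject(file, projectRoots)
      .map(owner -> owner.equals(normalizedProject))
      .orElse(false);
  }
}
```

For the roots `parent` and `parent/module-a`, the file `parent/module-a/src/A.java` belongs to `parent/module-a` only, so it would be skipped when indexing `parent`.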

Additional, unrelated content

In addition, we want to exclude files that belong to a build tool or version control system. For a VCS such as Git, for example, we don't want to index the .git folder. Python, on the other hand, comes with libraries and tools that store a virtual Python environment directly in the project, which we shouldn't analyze.

For this, the Python sub-plugin is enhanced by excluding common locations in a project where virtual Python environments are saved.
For version control systems we check for the most common systems like Git, Mercurial, and Subversion and ignore their "special" directories. This is done in SonarLintUtils#insideVCSFolder(...).
For Node.js-related files there is no sub-plugin, in contrast to the other changes, because there is no Eclipse plug-in to link against (there is WTP, but adding another optional dependency for this would be overkill). This is done in SonarLintUtils#inNodeJsRelated(...).
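A check based on path segments is one straightforward way to detect such directories. The sketch below is an assumption about the general shape of the check, not the code in SonarLintUtils: the class name is hypothetical, `.git`, `.hg`, and `.svn` are the usual metadata directory names of the three systems, and `node_modules` is assumed here as the Node.js-related location because the description does not name one.

```java
import java.nio.file.Path;
import java.util.Set;

/** Hypothetical helper, not the SonarLintUtils implementation. */
final class UnrelatedContent {
  // Metadata directories of Git, Mercurial, and Subversion.
  private static final Set<String> VCS_DIRS = Set.of(".git", ".hg", ".svn");
  // Assumption: dependencies installed by npm and similar tools.
  private static final Set<String> NODE_DIRS = Set.of("node_modules");

  private UnrelatedContent() {
  }

  static boolean insideVcsFolder(Path file) {
    return hasSegment(file, VCS_DIRS);
  }

  static boolean insideNodeModules(Path file) {
    return hasSegment(file, NODE_DIRS);
  }

  /** Returns true when any segment of the path exactly matches one of the names. */
  private static boolean hasSegment(Path path, Set<String> names) {
    for (Path segment : path) {
      if (names.contains(segment.toString())) {
        return true;
      }
    }
    return false;
  }
}
```

Matching whole segments keeps files such as `src/.gitignore` or `src/git/Foo.java` in the index, while anything below `.git/` or `node_modules/` is skipped.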

Testing

Example results for hierarchical Maven projects, measured in the SonarLintEclipseRpcClient#listFiles() method that is called when a project is indexed. Both the time taken and the number of files that are indexed and kept in memory are reported.

SonarLint CORE w. 45.700 LOC / 28 modules

When the project is clean (git clean -dfx), it is first imported (1), and then the workspace is re-opened (2).

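| Step | Files before | Files after | Time before | Time after |
|------|--------------|-------------|-------------|------------|
| 1    | 519          | 519         | 494         | 465        |
| 2    | 1660         | 1660        | 6701        | 4088       |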

When the project is dirty (mvn clean verify -DskipTests), it is first imported (1), and then the workspace is re-opened (2).

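| Step | Files before | Files after | Time before | Time after |
|------|--------------|-------------|-------------|------------|
| 1    | 7052         | 3781        | 1277        | 465        |
| 2    | 1663         | 1660        | 9881        | 904        |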

Orchestrator w. 6.800 LOC / 4 modules

When the project is clean (git clean -dfx), it is first imported (1), and then the workspace is re-opened (2).

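| Step | Files before | Files after | Time before | Time after |
|------|--------------|-------------|-------------|------------|
| 1    | 245          | 245         | 534         | 379        |
| 2    | 245          | 245         | 982         | 812        |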

When the project is dirty (mvn clean verify -DskipTests), it is first imported (1), and then the workspace is re-opened (2).

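| Step | Files before | Files after | Time before | Time after |
|------|--------------|-------------|-------------|------------|
| 1    | 246          | 245         | 520         | 376        |
| 2    | 246          | 245         | 1072        | 949        |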

An extension point is implemented that is used for narrowing down the focus of files that are actually part of a project and relevant to SonarLint.

This will speed up SonarLint and lower the memory footprint inside the IDE and for SLCORE.

To speed up operations on the IDE side, especially for importing projects and/or opening a workspace, caches were put in place for both the new extension point and the files of a project, since computing either is very costly.
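A per-project cache with explicit invalidation could look roughly like the sketch below. It is a minimal illustration under assumptions: `ProjectFileCache` is a hypothetical name, projects are identified by a plain string, and the expensive indexing step is passed in as a function. The caches in this PR may be structured differently.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache of the indexed files per project. */
final class ProjectFileCache {
  private final Map<String, List<Path>> filesByProject = new ConcurrentHashMap<>();
  private final Function<String, List<Path>> indexer;

  ProjectFileCache(Function<String, List<Path>> indexer) {
    this.indexer = indexer;
  }

  /** Runs the costly indexer only on the first request for a project. */
  List<Path> filesOf(String projectName) {
    return filesByProject.computeIfAbsent(projectName, name -> List.copyOf(indexer.apply(name)));
  }

  /** To be called when files of the project were added, changed, or removed. */
  void invalidate(String projectName) {
    filesByProject.remove(projectName);
  }
}
```

Repeated calls to `filesOf` for the same project reuse the stored list until `invalidate` is called, after which the next call re-runs the indexer. Invalidating on real changes is what keeps a later manual analysis or project import from seeing stale results.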

This also includes excluding VCS files and narrowing the focus for Node.js-related files, as it wouldn't make sense to put this into sub-plugins.

All JVM-related projects touch JDT (even Maven / Gradle ones).

The focus is narrowed here by removing the output folders of all source entries in the classpath and the default one that is always present for Eclipse.

All Python-related projects can have virtual environments; most of them use the PyDev plug-in.

The focus is narrowed here by removing the possible virtual environments that are created via Python or other tools.
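Besides checking common directory names, a virtual environment created by Python's `venv` module can be recognized by the `pyvenv.cfg` file at its root. The sketch below uses that marker as an alternative heuristic. It is not what the Python sub-plugin does, which according to the description excludes common locations, and `VirtualEnvironments` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that detects virtual environments by their marker file. */
final class VirtualEnvironments {
  private VirtualEnvironments() {
  }

  /** Environments created by the venv module contain a pyvenv.cfg file at their root. */
  static boolean isVirtualEnvironment(Path directory) {
    return Files.isRegularFile(directory.resolve("pyvenv.cfg"));
  }

  /** Lists the direct child directories of the project root that look like virtual environments. */
  static List<Path> findIn(Path projectRoot) throws IOException {
    try (Stream<Path> children = Files.list(projectRoot)) {
      return children
        .filter(Files::isDirectory)
        .filter(VirtualEnvironments::isVirtualEnvironment)
        .sorted()
        .collect(Collectors.toList());
    }
  }
}
```

The marker file is not guaranteed for environments created by other or older tools, so a name-based check as described above remains a useful complement.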

As Maven projects are hierarchical in nature but flat in Eclipse, we have to exclude all the content of sub-modules from being indexed in a parent project.

The focus is narrowed down here as well for the output directories that are coming from Maven directly and not JDT!

As Gradle projects are hierarchical in nature but flat in Eclipse, we have to exclude all the content of sub-projects from being indexed in a parent project.

The focus is narrowed down here as well for the Gradle wrapper storage and the output directories coming from Gradle itself and not JDT!

Correctly react to changes done on importing a project by not stopping at the "root" of the resources.

Also invalidate the cache when there are actual changes in order to not yield potentially incorrect results on a manual analysis or project import.

@thahnen thahnen marked this pull request as ready for review August 30, 2024 09:34

The Maven integration into Eclipse (m2e) changed the signatures of methods we use. Therefore we have to use reflection.

In FileSystemSynchronizer, fix a possible array-index-out-of-bounds error that can occur when only files are removed.
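Regarding the reflection mentioned for m2e: looking a method up by name at run time avoids linking against one specific signature at compile time. The sketch below assumes that the return type is what changed, which the commit message does not state. `OldApi`, `NewApi`, `getOutputLocation`, and `ReflectiveCall` are hypothetical stand-ins and not m2e types.

```java
import java.lang.reflect.Method;
import java.nio.file.Path;

/** Hypothetical stand-in for an older API version that returns a String. */
final class OldApi {
  public String getOutputLocation() {
    return "target/classes";
  }
}

/** Hypothetical stand-in for a newer API version that returns a Path instead. */
final class NewApi {
  public Path getOutputLocation() {
    return Path.of("target", "classes");
  }
}

final class ReflectiveCall {
  private ReflectiveCall() {
  }

  /** Calls getOutputLocation() by name, so the declared return type does not matter. */
  static String outputLocationOf(Object facade) {
    try {
      Method method = facade.getClass().getMethod("getOutputLocation");
      Object result = method.invoke(facade);
      return String.valueOf(result);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Unsupported API version: " + facade.getClass().getName(), e);
    }
  }
}
```

Both `ReflectiveCall.outputLocationOf(new OldApi())` and `ReflectiveCall.outputLocationOf(new NewApi())` succeed, whereas code compiled directly against one of the two signatures would fail to link against the other.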

@eray-felek-sonarsource eray-felek-sonarsource left a comment

"that is their own "fault"": we can remove this comment or reword it to sound friendlier.

Based on PR feedback, the comment was overhauled.

The flaky ITs were overhauled: the cause was a timing issue between the cache being cleared and being accessed when new files are added during a project import.
@thahnen thahnen merged commit 0839af0 into master Sep 2, 2024
19 checks passed
@thahnen thahnen deleted the fix/tha/SLE-900_Perf branch September 2, 2024 11:27