Caching and the ModuleManager #2783

arporter · 2024-11-15T13:04:39Z

arporter
Nov 15, 2024
Maintainer

There has been a lot of discussion in Teams about the caching of PSyIR/parse tree/module information. The main problem that we have is that PSyclone is invoked once per source file and therefore there can be an awful lot of repeated parsing/processing for a large code-base.

arporter · 2024-11-15T13:50:12Z

arporter
Nov 15, 2024
Maintainer Author

One option that seems sensible is for PSyclone (or even fparser?) to have its own 'module' files like a Fortran compiler does. @sergisiso suggested that this could just be some sort of serialisation of the SymbolTable for a particular file.

Tagging @schreiberx and @hiker as they've been involved in this discussion too.

12 replies

sergisiso Nov 15, 2024
Maintainer

I'm currently in favor of having a single file where everything is transparently cached from everyone instead of cluttering the file system.

The reason for a single file is that this is what build systems understand and therefore provide you all their feature set for free: dependency orders, hash invalidations, parallel builds (as @arporter pointed out). I fear we will need to build all this complexity into the ModuleManager if we use a single-file cache.

sergisiso Nov 15, 2024
Maintainer

I fear we will need to build all this complexity into the ModuleManager if we use a single-file cache.

And by the way, if you accept handling all this complexity, what is stopping you from doing it all in a single execution, poseidon <list_of_all_files> (which I thought was what you were asking for with the FileContainer detach/copy on Teams)? This wouldn't require serialisation/caching/repeated file parsing.

schreiberx Nov 15, 2024
Collaborator

I'm currently in favor of having a single file where everything is transparently cached from everyone instead of cluttering the file system.

The reason for a single file is that this is what build systems understand and therefore provide you all their feature set for free: dependency orders, hash invalidations, parallel builds (as @arporter pointed out). I fear we will need to build all this complexity into the ModuleManager if we use a single-file cache.

I thought about this a little bit further. It makes (now) sense to me to have a hierarchy of cache files:

cache files individually for each .f90 file
Another (single) cache file is used to store information related to all files.

hiker Nov 17, 2024
Maintainer

I'm currently planning to have two different implementations of a ModuleManager:

ModuleManagerAutoSearch: The one from @hiker which will automatically load files.

ModuleManagerFiles: A new one where one provides all files to be analyzed rather than loading them automatically.

While I understand your use case, @schreiberx , I am still not convinced that we need to support reading all files in PSyclone itself. I believe we had discussions about this in the past (advantage of processing more than one file), and I believe we agreed not to go in that direction(?) Also, it is quite simple to implement this in an application - it should be trivial to add a small interface to the module manager so you can feed the files/psyirs to it. And then everything else should work.

But it adds complexity to our source code for a single, non-standard application. I believe our API is pretty good, it's just a few lines of code to get the psyir, and to invoke a script.

And it feels like we are discussing two (or three) different things here - caching for fparser tree, psyir, and reading all source files at once.

schreiberx Nov 17, 2024
Collaborator

I'm currently planning to have two different implementations of a ModuleManager:

ModuleManagerAutoSearch: The one from @hiker which will automatically load files.

ModuleManagerFiles: A new one where one provides all files to be analyzed rather than loading them automatically.

While I understand your use case, @schreiberx , I am still not convinced that we need to support reading all files in PSyclone itself. I believe we had discussions about this in the past (advantage of processing more than one file), and I believe we agreed not to go in that direction(?) Also, it is quite simple to implement this in an application - it should be trivial to add a small interface to the module manager so you can feed the files/psyirs to it. And then everything else should work.

But it adds complexity to our source code for a single, non-standard application. I believe our API is pretty good, it's just a few lines of code to get the psyir, and to invoke a script.

And it feels like we are discussing two (or three) different things here - caching for fparser tree, psyir, and reading all source files at once.

I think there are 2 flavors we both have:

(A) You like to read files in automatically and step-by-step and I don't plan to remove anything of this.
(B) I like to first parse all files where I know they will be relevant (this could be after the 1st step of determining all dependencies).

Regarding (A), it is useful for determining, e.g., build dependencies. But once I want to optimize a code, I don't want to search for the right files / modules first; I want them loaded directly, hence (B). Both approaches are justified.

It won't add complexity; rather, it cleanly separates the automatic search you developed from the module management and the further analysis of whatever was loaded. See, e.g., the ModuleManagerBase class:
https://github.com/stfc/PSyclone/blob/martin_nemo_call_trace/src/psyclone/parse/module_manager_base.py

Yes, we discuss many things here since they all belong together (e.g., inlining a subroutine from another module, looking up where symbols are defined for this, etc.).

I should emphasize that this is not just about module or build dependencies, but also about inlining, symbol lookup, etc., and about speeding up this entire process.

arporter · 2024-11-15T15:33:51Z

arporter
Nov 15, 2024
Maintainer Author

OK, so the difference is that the AutoSearch variant doesn't use PSyIR? That's not what we do at the moment. Also, if we're going to cache things anyway, I don't think that's a major concern?
However, I must be misunderstanding something I think: you say they serve different purposes but I realise I haven't grasped what they are? :-)

1 reply

schreiberx Nov 15, 2024
Collaborator

Maybe "purpose" was the wrong word.
It's about how to fill the ModuleManager with the information on which files to load.

AutoSearch uses regular expressions to scan the files.
The second variant will just be handed a list of files.

The purpose in both cases is to get information about the modules, which is then reused by inheriting from ModuleManager.

sergisiso · 2024-11-15T15:36:36Z

sergisiso
Nov 15, 2024
Maintainer

I know this won't convince some of you to not do caching but note that parsing for "import dependencies" can be made MUCH faster because most of the time fparser is doing the recursive descent inside routines but for dependencies only the top-level public symbols are needed, and these are pretty fast to parse. We could have a mode that stops the recursion at routine declarations. We can also design a gradual parsing in fparser, for the "now also get me the tree of this particular routine that I want to inline".
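To illustrate why collecting only top-level information is cheap, here is a minimal sketch that records module names and `use` statements and skips everything else. It is a plain regular-expression stand-in, not fparser and not the gradual-parsing mode proposed above; the function name `scan_dependencies` and both patterns are hypothetical, and they ignore continuation lines, fixed-form source and `!` characters inside strings.

```python
import re

# Hypothetical patterns for illustration only (free-form Fortran, no continuation lines).
_MODULE_RE = re.compile(
    r"^\s*module\s+(?!(?:procedure|subroutine|function)\b)(\w+)", re.IGNORECASE)
_USE_RE = re.compile(
    r"^\s*use(?:\s*,\s*\w+\s*::\s*|\s*::\s*|\s+)(\w+)", re.IGNORECASE)


def scan_dependencies(source):
    """Return (modules defined, external modules used) for a Fortran source string."""
    defined, used = set(), set()
    for line in source.splitlines():
        # Drop trailing comments (naive: does not handle '!' inside strings).
        code = line.split("!", 1)[0]
        match = _MODULE_RE.match(code)
        if match:
            defined.add(match.group(1).lower())
            continue
        match = _USE_RE.match(code)
        if match:
            used.add(match.group(1).lower())
    # Modules defined in the same file are not external dependencies.
    return defined, used - defined
```

A scan like this never descends into routine bodies, which is where most of the parsing time goes according to the comment above. It cannot replace a real parse when the tree of a routine is needed, e.g. for inlining.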

1 reply

schreiberx Nov 15, 2024
Collaborator

These are the timings for parsing all nemo/*.f90 files:

load source code: 0.008646726608276367 secs
load fparser tree: 81.20603656768799 secs
load psyir_node: 33.25982213020325 secs

Seems like there can be substantial savings for this. I'm now planning to introduce a cache file for each .f90 file.
Even if we make fparser faster, there's one thing we need: support for the serialization of psyir.
I'm writing it so that it's easily extensible to also support caching of psyir.

schreiberx · 2024-11-15T19:12:45Z

schreiberx
Nov 15, 2024
Collaborator

Some status update: I worked on restructuring the code.
The FileInfo will be used to

store the source code
cache the fparser tree into a cache file individual to each source file (e.g., for "asdf.f90" the cache file will be "asdf.psycache")
I'm preparing this so that it is also ready to cache psyir once that is supported.
With this it "naturally" makes sense to have per-file individual caches.

The other "global" cache file will then be updated if the checksum of all other cached files changes. Quite easy :-)
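A minimal sketch of that invalidation rule, assuming SHA-256 as the checksum (the thread does not say which hash is used). The function names are hypothetical and not taken from the branch.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of one file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def combined_digest(paths):
    """One digest over all per-file digests; sorting makes it order-independent."""
    outer = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        outer.update(path.encode("utf-8"))
        outer.update(file_digest(path).encode("ascii"))
    return outer.hexdigest()


def global_cache_is_stale(paths, stored_digest):
    """True if any per-file cache changed since `stored_digest` was recorded."""
    return combined_digest(paths) != stored_digest
```

The global cache would store the combined digest next to its payload and be rebuilt whenever `global_cache_is_stale` returns True.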

1 reply

schreiberx Nov 15, 2024
Collaborator

The caching of fparser trees is now working in my branch, but an MR for fparser is required (since there was a problem during unpickling).

I created a new issue for the serialization of PSyIR nodes:
#2786

hiker · 2024-11-17T12:16:29Z

hiker
Nov 17, 2024
Maintainer

First of all, it might be worth considering #2681 - Fab already writes dependency files, and while we certainly do not want to have a dependency on Fab, using a compatible file format (json based atm) would have some benefits (reduce need to parse file in fab again). Of course, atm it only contains information useful for building. But if we could write our information to include the information used by fab, that might be good.

One problem that I didn't see addressed - if we write one 'cache' file per source file ... how do we find the cache file? Do we use one directory for all cached files? Do we then need to think about name duplication (same file name in two directories)? Or do we need to provide 'search directories' (like for a compiler)?

I am also a bit confused what is cached where. I understand that atm it's the fparser tree, but it's done in PSyclone (or??) Wouldn't it be useful to do this in fparser? This way, other applications (like Fab) could get some benefit?

1 reply

schreiberx Nov 17, 2024
Collaborator

Thanks. I answered in #2681; I don't think this should be handled externally. This is all required for optimizations (requiring the knowledge of other modules, routines in other modules, etc.), hence, should be part of psyclone.

The caching I have implemented so far is solely based on FileInfo:
https://github.com/stfc/PSyclone/blob/martin_nemo_call_trace/src/psyclone/parse/file_info.py#L66
It has getters

get_source_code()
get_fparser_tree()
get_psyir_node()
which load the information if it is not cached and write out the cache once it is updated.
The cache file is based on a replacement of the extension, e.g.,
asdf/asdf.f90 => asdf/asdf.psycache
but we can also make this configurable.
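The load-if-not-cached behaviour could look roughly like the sketch below. This is not the code in file_info.py: the class name `CachedFileInfo`, the `parse` callable, the `get_tree()` name and the use of SHA-256 plus `pickle` are all assumptions made for illustration.

```python
import hashlib
import pickle
from pathlib import Path


class CachedFileInfo:
    """Hypothetical stand-in for FileInfo; `parse` is any callable str -> tree."""

    def __init__(self, filename, parse, use_cache=True):
        self._path = Path(filename)
        self._parse = parse
        self._use_cache = use_cache
        self._source = None
        self._tree = None

    @property
    def cache_path(self):
        # asdf/asdf.f90 => asdf/asdf.psycache
        return self._path.with_suffix(".psycache")

    def get_source_code(self):
        if self._source is None:
            self._source = self._path.read_text(encoding="utf-8")
        return self._source

    def get_tree(self):
        if self._tree is not None:
            return self._tree
        source = self.get_source_code()
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if self._use_cache and self.cache_path.exists():
            try:
                with open(self.cache_path, "rb") as handle:
                    cached = pickle.load(handle)
                # Only trust the cache if it was built from identical source.
                if cached.get("digest") == digest:
                    self._tree = cached["tree"]
                    return self._tree
            except (pickle.UnpicklingError, EOFError, AttributeError, KeyError):
                pass  # unreadable cache: fall through and re-parse
        self._tree = self._parse(source)
        if self._use_cache:
            with open(self.cache_path, "wb") as handle:
                pickle.dump({"digest": digest, "tree": self._tree}, handle)
        return self._tree
```

Storing the source digest inside the cache file means a stale cache is detected without relying on file timestamps. Note that `pickle.load` should only ever be used on cache files from a trusted location.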

(Thinking about this in a more utopian way, we could also have a database at $HOME/.psycache to provide a global cache for all files ever parsed with fparser/psyclone - this could be a nice future issue).

The new ModuleManagerBase class simply makes use of this modified FileInfo and caching happens transparently in the background - independent to the ModuleManagerBase class.

Similar caching should then be part of the ModuleManagerBase class, but it's not clear to me where to write the cache file. For simplicity, I write it to the current folder, but this could also be changed.

schreiberx · 2024-11-18T08:28:05Z

schreiberx
Nov 18, 2024
Collaborator

I also wanted to add one piece of information that I think is very relevant (and whose absence might have caused misunderstandings):
I'm not using the psyclone command; we use the psyir library and other tools.
My perspective is that psyir is a library that can also be used in other developments. In our case, it's Poseidon.

1 reply

arporter Nov 18, 2024
Maintainer Author

Thanks Martin, that's a helpful clarification :-)

arporter · 2024-11-18T09:49:38Z

arporter
Nov 18, 2024
Maintainer Author

I agree with Joerg on the fparser caching => fparser should do it. That way other people will benefit, not just PSyclone and Poseidon users.

1 reply

schreiberx Nov 18, 2024
Collaborator

If it were just fparser, I'd agree, but hopefully it will also cover caching of psyir and other information in the future.

If it should be moved to fparser, should we then have a (separate) caching implemented for both - also for psyir?

How should this work?
Should the decision to regenerate the psyir be based on the hash of the fparser tree or on the hash of the source code? If it is the fparser tree, we first have to unpickle the fparser cache. If it is the source code, the source has to be loaded separately and hashed once to decide whether the cached psyir can be used, and then loaded again for fparser.

Some (module) management class for this is IMHO the better option where the caching is centralized (or we could say that the management also cares about the caching). Please keep in mind that this is not just about the caching of the fparser tree, but could in the future also include a hashing of other information (symbols, lookup tables, etc.).

To summarize, if only a caching of fparser trees should be done, I'd agree, but I'd hope that the caching will also go beyond fparser trees in the future.

arporter · 2024-11-18T16:20:10Z

arporter
Nov 18, 2024
Maintainer Author

It should be as transparent as possible. If the Module Manager decides it needs the parse tree then it should ask fparser. I would say that whether fparser uses an existing, pickled parse tree is up to it. As your timings helpfully show, fparser is not the fastest thing in the world so anyone using it will benefit from well-designed caching (or at least, the ability to use such caching). We need to give some thought to how this will be controlled (probably an option to the parser constructor?) and where any such files are to be put. It will probably be helpful to follow normal compiler behaviour for this and think of the pickled parse trees as fparser 'mod' files. Following this logic, by default such files would then be written to wherever a compiler would put them (gfortran supports -J to control this) and fparser would look for them in directories specified by -I.

2 replies

schreiberx Nov 18, 2024
Collaborator

I definitely agree with having a feature that specifies the folder where cache files should be stored.
What should be the default path? "$HOME/.cache"?

I'd propose a solution with two different cases:

If a path is provided, the cache file is placed in this folder and named after the source code's hash sum (shortened to about 40 characters) plus the basename of the source file.
If nothing is provided, the extension of the source file should be replaced with .psycache.
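A minimal sketch of these two cases. The function name, the use of SHA-256 and the exact `<hash>_<basename>.psycache` layout are assumptions made for illustration; the proposal only fixes the hash length and that the basename is part of the name.

```python
import hashlib
from pathlib import Path


def cache_file_path(source_path, source_code, cache_dir=None, hash_length=40):
    """Return where the cache for `source_path` would live (hypothetical helper)."""
    source_path = Path(source_path)
    if cache_dir is None:
        # Case 2: no folder given, replace the extension next to the source file.
        return source_path.with_suffix(".psycache")
    # Case 1: shared folder, so prefix the basename with a shortened content hash.
    digest = hashlib.sha256(source_code.encode("utf-8")).hexdigest()[:hash_length]
    return Path(cache_dir) / f"{digest}_{source_path.name}.psycache"
```

With a shared folder, two files that have the same name in different directories only map to the same cache file if their contents are identical, which addresses the name-duplication question raised earlier in the thread.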

schreiberx Nov 19, 2024
Collaborator

The code following this proposal:
https://github.com/stfc/PSyclone/blob/martin_nemo_call_trace/src/psyclone/parse/file_info.py#L131

schreiberx · 2024-11-18T21:36:12Z

schreiberx
Nov 18, 2024
Collaborator

@hiker It looks like you wrote the ModuleManager to handle files before they are preprocessed, since you also check for a .f90 file ending, with a comment related to preprocessed files.

Is this right? Is there a reason why this is in psyclone? I'd see this more as part of a build system (Fab?).

2 replies

hiker Nov 29, 2024
Maintainer

Because our test files included in PSyclone contain both versions of the infrastructure files. The idea is that at some stage we will be able to verify different precision for symbols (in LFRic in particular). But we also need the preprocessed files. In other cases, F90 files need to be supported (since this is just the default ending used in a project, without the need to preprocess them, or the static dependency analysis finds files that are otherwise not handled by PSyclone).
Note that we have a ticket open to migrate the infrastructure files to use the LFRic one (instead of a cut-down and by now outdated version).

schreiberx Dec 1, 2024
Collaborator

Thanks. I understand the reasons behind this better and better.

schreiberx · 2024-11-19T11:21:52Z

schreiberx
Nov 19, 2024
Collaborator

I worked on a multiplexer solution to choose the ModuleManager via a ModuleManagerMultiplexer:
https://github.com/stfc/PSyclone/blob/martin_nemo_call_trace/src/psyclone/parse/module_manager_multiplexer.py#L85
Its only purpose is to instantiate or return a particular ModuleManager, which is configurable via this class.

The main points are:

This allows you to swap the module manager for whatever module manager someone wants to use - also outside of Poseidon.
Whether caching is used or not is then decided by the particular module manager.
IMHO, caching should never be done automatically, but should be explicitly requested (e.g., because of race conditions of updating caches, I/O issues on network-based file systems, etc.)

@hiker If you're worried about such a restructuring of the module manager, this could also be rewritten so that the original ModuleManager class stays as it is (plus just a few tweaks). In addition, there would be a new set_module_manager() method that specifies which class is instantiated and returned by the get_singleton() method. Maybe that's a good compromise. In this way, you can use whatever you developed, and I can hook in my own ModuleManager from Poseidon - or even leave it in Poseidon, which will avoid a lot of coverage tests.
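A rough sketch of how that compromise could work. Only the method names set_module_manager() and get_singleton() come from the proposal above; the class bodies, the `FileListModuleManager` subclass and its `add_files()` method are hypothetical and do not reflect the existing ModuleManager implementation.

```python
class ModuleManager:
    """Stand-in for the existing class; only the singleton plumbing is shown."""

    _instance = None
    _manager_class = None

    @classmethod
    def set_module_manager(cls, manager_class):
        """Choose which class get_singleton() instantiates (hypothetical API)."""
        if not issubclass(manager_class, ModuleManager):
            raise TypeError("manager_class must derive from ModuleManager")
        ModuleManager._manager_class = manager_class
        ModuleManager._instance = None  # drop any previously created instance

    @classmethod
    def get_singleton(cls):
        """Create the configured manager on first use and return it afterwards."""
        if ModuleManager._instance is None:
            chosen = ModuleManager._manager_class or ModuleManager
            ModuleManager._instance = chosen()
        return ModuleManager._instance


class FileListModuleManager(ModuleManager):
    """Hypothetical external manager that is handed an explicit list of files."""

    def __init__(self):
        self.files = []

    def add_files(self, filenames):
        self.files.extend(filenames)
```

An external tool would call `ModuleManager.set_module_manager(FileListModuleManager)` once at start-up; code inside the library keeps calling `ModuleManager.get_singleton()` and does not need to know which variant it receives.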

0 replies

schreiberx · 2024-11-20T07:58:41Z

schreiberx
Nov 20, 2024
Collaborator

@arporter @sergisiso @hiker

Is there any agreement on how to handle multiple modules if processed by psyclone (e.g., finding the right routine matching a call)? I see these two options:

a) All containers have to be part of the same FileContainer. Consequently, they can be all accessed by going to the root node and traversing down.

b) All containers can be stored in more than one FileContainer - e.g. by loading .f90 files individually. Consequently, this requires some management behind it which would be, e.g., the ModuleManager.

If going for (b) (which looks like the best solution), this would require a ModuleManager to, e.g., walk over the different containers.

Is there an agreement that the ModuleManager should be used for accessing other modules?

2 replies

arporter Nov 20, 2024
Maintainer Author

I think we already do (b) don't we in the ModuleManager? And yes, the ModuleManager should be (and is?) used for accessing other modules. It does this as required though, rather than all in one go as you favour. However, it probably wouldn't be a big change to add something to its API to allow all source files to be specified in one go.

schreiberx Nov 20, 2024
Collaborator

OK. I understand better and better how things work in psyclone :-)

So concatenating all source files and loading them together isn't the recommended way to deal with multiple files; instead, they should be loaded through the ModuleManager, which manages the access across modules.

Any objections to storing a reference to the ModuleManager in the root of each psyir tree if it's of type Container? I think that's a better solution than using a singleton.

schreiberx · 2024-11-30T09:26:01Z

schreiberx
Nov 30, 2024
Collaborator

Caching is now available with this PR in FileInfo with further modifications in ModuleInfo to move the psyir / fparser generation into FileInfo:
#2810
This makes the caching centralized to FileInfo.

This doesn't change anything in the ModuleManager except for an option to activate caching.
Note that caching won't be used unless psyir/fparser is accessed through FileInfo.
So if someone wants to use caching, files have to be loaded via the ModuleManager.

0 replies
