
Which hardware/software specifications should we support? #62

Open · 4 of 6 tasks
choisant opened this issue Aug 29, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

@choisant (Collaborator) commented Aug 29, 2024

Right now the code has really only been tested on a Linux workstation with a decent amount of memory and CPU cores. What kind of computing setups do potential users have, and what should we support?

Types of users + environments

Do we know if inferno can adapt to these cases?

  • #63
  • High budget, low tech user: do we have users with powerful machines who might not even have Linux installed?
  • Low budget, low tech user: ability to at least test the software on a standard laptop running Windows.
  • Low budget, high tech user: getting the most compute per dollar. This is basically our current development environment.

Potential use cases

  • Few variates, many datapoints
  • Many variates, few datapoints
  • Few variates, few datapoints: a small laptop should suffice
  • BIG DATA: the more CPU cores the better, but will we run out of memory?

Wishlist for improvements

Check each case and create an issue for it if it is deemed worth spending time on at some point.

  • Ability to continue MCMC calculations if the calculations are interrupted; see the checkpointing sketch after this list. Low priority for now.
  • Support Windows (this might already be the case).
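
A minimal sketch of the checkpointing idea, assuming a hypothetical run_mcmc_chunk() that stands in for whatever actually advances the sampler (inferno/Nimble internals may look quite different):

    ## Run the MCMC in chunks and checkpoint the state to disk after each
    ## chunk, so an interrupted run can resume from the last checkpoint.
    run_with_checkpoints <- function(n_chunks, chunk_iters, ckpt = "mcmc_state.rds") {
        state <- if (file.exists(ckpt)) readRDS(ckpt) else NULL  # resume if a checkpoint exists
        for (i in seq_len(n_chunks)) {
            state <- run_mcmc_chunk(state, n_iter = chunk_iters)  # hypothetical sampler step
            saveRDS(state, ckpt)  # checkpoint after every chunk
        }
        state
    }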
choisant added the enhancement (New feature or request) label Aug 29, 2024
@pglpm (Owner) commented Aug 29, 2024

Personally I don't see the software as implying one or another use case. The use cases depend more on the size of the problem – the number of datapoints and variates – than on the software itself. A problem with 30 datapoints and 10 variates can be solved on a laptop; one with 5000 datapoints and 100 variates needs a workstation. It's a matter of how much RAM the user has, and how many and how fast their CPUs are.

One counter-argument to what I just wrote is that with many datapoints it might be necessary, for example, to switch to approaches that work from disk rather than RAM. However, since the software relies on the Nimble package, we have to load all the data into RAM anyway.
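
Since everything has to sit in RAM, a quick base-R check of the data's in-memory footprint can tell a user up front whether their machine is adequate; object.size() is base R, and the headroom advice below is a rough rule of thumb, not a measured figure:

    ## Rough RAM check before learning: 5000 datapoints x 100 variates.
    dat <- data.frame(matrix(rnorm(5000 * 100), nrow = 5000))
    print(object.size(dat), units = "MB")
    ## MCMC output is typically several times larger than the input data
    ## (per chain), so leave generous headroom beyond this figure.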

@choisant (Collaborator, Author)

The software can be made to support some or all of the possible use cases. The modifiable parallelisation is an example of a software feature in inferno that supports powerful/high-budget hardware. If a run produces a huge mcoutput, or the process runs out of memory, we might want to replace the native .rds files with a better storage solution for the output. If the output is small, we might not even want to generate all the output files automatically, but instead enable a faster in-memory hand-off between learn() and Pr(), for instance; see the sketch below. These are the kinds of things I'm trying to gather information about here.

This is all about making the software work smoothly for many different people's projects, which would increase its value to the scientific community. In HEP we have to think about this all the time, as we are at the extreme end of use cases.
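
A sketch of how the in-memory hand-off could look; learn() and Pr() are inferno's functions, but the argument names and the object-passing shown here are assumptions about the proposed behaviour, not a description of the current API:

    ## Proposed in-memory flow: keep the learnt object in the session.
    fit <- learn(data = dat)                       # hypothetical: return the result directly
    probs <- Pr(Y = y, X = x, learnt = fit)        # no .rds round-trip

    ## Current on-disk flow, for comparison (argument names assumed):
    learn(data = dat, outputdir = "results")       # writes learnt.rds etc. to disk
    probs <- Pr(Y = y, X = x, learnt = "results")  # reads it back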

@pglpm (Owner) commented Aug 29, 2024

I think right now the software already supports all those use cases. I know because I've used it both on a small Windows laptop and on Sigma2's unix HPC centre, and the way the functions are used is exactly the same. Things may change in the future, of course.

I looked into the storage question. It's good to save the main object, 'learnt.rds', which the inferences depend on, to disk, because the user may need to close an R session and continue with new inferences in a later one. Regarding the file format, I checked other possibilities such as Parquet, Arrow, NetCDF, and similar. Some of them are not appropriate because they only work well with tabular data (and the learnt object is not tabular), and the rds format turned out to give very good compression. The fact that it's R-specific is not really a problem, because the file is read by R functions in any case.
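
For reference, base R's saveRDS() exposes the compression choice directly, so the compression claim can be verified on a real learnt object; the object below is just a stand-in:

    ## Compare on-disk sizes of the same object under the built-in compressions.
    obj <- replicate(50, rnorm(1e4), simplify = FALSE)  # stand-in for a learnt object
    for (comp in list(FALSE, "gzip", "bzip2", "xz")) {
        f <- tempfile(fileext = ".rds")
        saveRDS(obj, f, compress = comp)
        cat(format(comp), ":", file.size(f), "bytes\n")
    }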

Of course the user can export plots and numerical results in any way they please. Do you mean we should provide some sort of graphical or numerical export functions? Isn't it enough if the user checks the basic R commands for this?

@choisant (Collaborator, Author)

Many people are unfamiliar with R. To many of them, an export_to_csv function would be very attractive, so they could open their numbers in Excel or with Python. The plotting function should definitely have an export-to-pdf/png option. I can't predict all possible cases; that's why we want to know which software/hardware environments people are already working in when we talk to them.

@pglpm (Owner) commented Aug 29, 2024

Sure. In many cases we can also simply refer to base R functions (no need to reinvent the wheel); for example there's write.csv(). We can add a filetype argument or similar to tplot() and related functions, so that the user can directly save the plot as pdf/svg/png etc.
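
A thin convenience layer over base R could cover both requests. The names below (export_to_csv, save_plot) are just this thread's proposal sketched out, not existing inferno functions, and the tplot() call in the usage note assumes its usual plotting role:

    ## Proposed convenience exports, built on base R only.
    export_to_csv <- function(x, file) {
        write.csv(as.data.frame(x), file = file, row.names = FALSE)
    }

    ## Save any plotting call to pdf/png/svg via the matching graphics device.
    save_plot <- function(plotfun, file, width = 7, height = 5) {
        ext <- tolower(tools::file_ext(file))
        switch(ext,
               pdf = pdf(file, width = width, height = height),
               png = png(file, width = width, height = height, units = "in", res = 300),
               svg = svg(file, width = width, height = height),
               stop("unsupported filetype: ", ext))
        on.exit(dev.off())  # close the device even if plotting fails
        plotfun()
    }

    ## e.g. save_plot(function() tplot(...), "result.pdf")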
