Unified error calculation #560
Conversation
Hi @nerkulec, thank you for the PR! All around, it looks pretty great, and I think unifying the error calculations is a great idea.
I have some questions about the code (namely one perhaps obscure error metric I was using occasionally, and the validation graphs) that I left as comments.
Two general things:
- Since this introduces the tqdm package, shouldn't we include it in requirements.txt? Or is it reliably shipped with another package? This may also affect the cpu_environments.yml.
- The docs are currently failing with this PR, because tqdm is not yet included in the autodoc_mock_imports list in sphinx's conf.py. Sphinx tries to import tqdm while generating the automated API documentation (which is of course unnecessary) but fails, since tqdm is not installed alongside the other docs packages. Adding it to autodoc_mock_imports stops sphinx from attempting the import; a sketch of the change follows this list.
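A minimal sketch of what that conf.py change might look like. Only the tqdm line is what this comment asks for; the file path and the other entries are assumptions, not the repository's actual list:

    # docs/source/conf.py (path and pre-existing entries are assumptions)
    autodoc_mock_imports = [
        "torch",   # assumed pre-existing entry
        "mpi4py",  # assumed pre-existing entry
        "tqdm",    # new: keeps sphinx from importing tqdm at docs build time
    ]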
mala/network/runner.py (Outdated)

            errors[energy_type] = be_error
        except ValueError:
            errors[energy_type] = float("inf")
    elif energy_type == "band_energy_dft_fe":
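Read as a self-contained sketch, the pattern in this diff is: if the band-energy calculation raises a ValueError, the metric is recorded as infinitely wrong instead of aborting the whole evaluation. Everything below is a hypothetical illustration (compute_energy_error and the metric names stand in for the PR's actual code):

    def compute_energy_error(energy_type):
        # Stand-in for the actual band-energy error calculation in the PR;
        # assumed to raise ValueError when the underlying calculation fails.
        raise ValueError("calculation failed")

    errors = {}
    for energy_type in ["band_energy", "band_energy_dft_fe"]:
        try:
            errors[energy_type] = compute_energy_error(energy_type)
        except ValueError:
            # A failed calculation is recorded as an infinite error
            # rather than crashing the evaluation loop.
            errors[energy_type] = float("inf")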
Is there still a use case for calculating energies with the DFT Fermi energy? I haven't done it myself in a long time, because I think it does not make that much sense conceptually. If we are overhauling the entire workflow in this respect anyway, I'd argue for getting rid of it entirely - but if there is still a good use for it, we can of course keep it!
You're right, I removed it.
s.wait_stream(
    torch.cuda.current_stream(self.parameters._configuration["device"])
)
# Warmup for graphs
with torch.cuda.stream(s):
    for _ in range(20):
        with torch.cuda.amp.autocast(
            enabled=self.parameters.use_mixed_precision
        ):
            prediction = network(x)
            if self.parameters_full.use_ddp:
                loss = network.module.calculate_loss(prediction, y)
            else:
                loss = network.calculate_loss(prediction, y)
torch.cuda.current_stream(
    self.parameters._configuration["device"]
).wait_stream(s)

# Create static entry point tensors to graph
self.static_input_validation = torch.empty_like(x)
self.static_target_validation = torch.empty_like(y)

# Capture graph
self.validation_graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(self.validation_graph):
    with torch.cuda.amp.autocast(
        enabled=self.parameters.use_mixed_precision
    ):
        self.static_prediction_validation = network(
            self.static_input_validation
        )
        if self.parameters_full.use_ddp:
            self.static_loss_validation = network.module.calculate_loss(
                self.static_prediction_validation,
                self.static_target_validation,
            )
        else:
            self.static_loss_validation = network.calculate_loss(
                self.static_prediction_validation,
                self.static_target_validation,
            )

if self.validation_graph:
    self.static_input_validation.copy_(x)
    self.static_target_validation.copy_(y)
    self.validation_graph.replay()
    validation_loss_sum += self.static_loss_validation
else:
    with torch.cuda.amp.autocast(
        enabled=self.parameters.use_mixed_precision
    ):
        prediction = network(x)
        if self.parameters_full.use_ddp:
            loss = network.module.calculate_loss(prediction, y)
        else:
            loss = network.calculate_loss(prediction, y)
        validation_loss_sum += loss
if batchid != 0 and (batchid + 1) % report_freq == 0:
    torch.cuda.synchronize(self.parameters._configuration["device"])
    sample_time = time.time() - tsample
    avg_sample_time = sample_time / report_freq
    avg_sample_tput = report_freq * x.shape[0] / sample_time
    printout(
        f"batch {batchid + 1}, "  # /{total_samples}, "
        f"validation avg time: {avg_sample_time} "
        f"validation avg throughput: {avg_sample_tput}",
        min_verbosity=2,
    )
    tsample = time.time()
batchid += 1
torch.cuda.synchronize(self.parameters._configuration["device"])
I may be missing something, but I cannot find this portion of the code in the new __validate_network - is there a reason why we would want to get rid of the validation graphs? I thought they were working nicely.
The computation graphs in __validate_network only feed the inputs forward through the network and later invoke calculate_loss on the model, which is just the mean squared error; no gradients are accumulated and no weights are updated. I could potentially leave the computation graphs here just for the LDOS metric, but then all the other metrics' calculations would still run outside the graph, since they happen outside of torch. I could reuse the network predictions from the graph, but I doubt there is any significant speed improvement compared to evaluation in eager mode, and I doubt there is any noticeable improvement even with the MSE on top. That's why I removed them altogether here. If you strongly feel that this hurts performance when using just the LDOS validation metric, or have run a benchmark that shows a performance difference, I can put them back.
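For reference, a minimal sketch of the eager-mode validation this leaves behind, assuming a network with the calculate_loss method and the use_ddp flag from the snippet above (validate_eager and loader are illustrative names, not code from the PR):

    import torch

    def validate_eager(network, loader, use_ddp=False):
        # Plain forward passes plus the model's MSE loss; no graph
        # capture or replay, and no gradients are needed.
        model = network.module if use_ddp else network
        loss_sum = 0.0
        with torch.no_grad():
            for x, y in loader:
                prediction = network(x)
                loss_sum += model.calculate_loss(prediction, y).item()
        return loss_sum / len(loader)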
Alright, got it. I think I didn't look into the actual functionality deeply enough; I just noticed that code authored by Josh had been deleted and wanted to know whether that was accidental or intentional. Your explanation makes sense, and I am OK with deleting this part.
Thanks for incorporating all the changes and feedback; this looks great to me now and can be merged from my side!
This PR includes a logging overhaul and makes the energy error calculations uniform (in units of meV/atom).
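A hedged sketch of the meV/atom convention described above; the function name and signature are illustrative, not the PR's actual API:

    import numpy as np

    def energy_error_mev_per_atom(predicted_ev, reference_ev, n_atoms):
        # Absolute energy error in meV per atom, given total energies in eV.
        return np.abs(predicted_ev - reference_ev) * 1000.0 / n_atoms

For example, a 0.05 eV error on a 128-atom cell corresponds to roughly 0.39 meV/atom.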