Tracking simple metrics in remote server is really slow #3191

Open
diogo-sr opened this issue Jul 19, 2024 · 4 comments
Labels
help wanted Extra attention is needed type / bug Issue type: something isn't working

Comments

@diogo-sr

🐛 Bug

I have been using AIM to track item detection experiments. We run a back end on one of our remote servers and use it to track our training and evaluation data. The data consists of either float or image data (mostly numpy.ndarray[numpy.uint8]). I have observed massive performance differences between tracking data to a remote AIM server and to a local AIM server running on my laptop.
For instance, tracking a JSON file with 3000 lines (see the attachment in the To reproduce section) takes more than 15 minutes to push to the remote server, while the exact same job takes less than 10 seconds locally(!).

I have tried to debug this by pushing batches of data instead of making one call per metric, but nothing seems to make a difference. To add more unexpected information to the picture, tracking 95 images (each approx. 4 MB) to the exact same server took only one minute. I think this means that the delay is not related to the size of the data being tracked (the images total almost 400 MB while the raw JSON data is 4.6 MB) 🤷‍♂️
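The comparison above can be made concrete with a little arithmetic. The sketch below only uses the figures quoted in this report (sizes are approximate, and 15 minutes is a lower bound for the JSON case), so the results are rough estimates rather than measurements.

```python
# Figures quoted in the report; sizes are approximate and
# 15 minutes is a lower bound for the JSON upload.
image_count = 95
image_size_mb = 4.0
image_seconds = 60.0

json_lines = 3000
json_size_mb = 4.6
json_seconds = 15 * 60.0

# Effective throughput in MB/s for each workload.
image_throughput = image_count * image_size_mb / image_seconds
json_throughput = json_size_mb / json_seconds

# Average wall-clock cost per tracked item.
seconds_per_image = image_seconds / image_count
seconds_per_json_line = json_seconds / json_lines

print(f"images: {image_throughput:.2f} MB/s, {seconds_per_image:.2f} s per image")
print(f"json:   {json_throughput * 1000:.1f} kB/s, {seconds_per_json_line:.2f} s per line")
```

With these numbers the images move at roughly 6.3 MB/s while the JSON metrics move at about 5 kB/s, yet the cost per tracked item is of the same order (about 0.63 s per image versus at least 0.30 s per JSON line). That pattern would be consistent with a fixed per-call cost dominating, although the report alone does not prove it.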

I would really appreciate it if someone could shed some light on this: is this difference in performance expected, and are there any optimizations (in terms of tracking, hardware, ...) we could use to speed it up? As it works now, it is really not usable.

To reproduce

  • Start remote AIM server
  • Load metrics.json and track each metric
  • Code snippet used to reproduce:
import os
import json
import numpy
from aim import Run

repo="aim://my_aim_server"
path_to_metrics_json="abs_path_metrics.json"

# Start run
logger = Run(experiment="back-end-test", repo=repo)

# Wrap the stream of JSON objects into a single JSON array using jq
new_metrics_path = "/tmp/new_metrics.json"
if os.path.exists(new_metrics_path):
    os.remove(new_metrics_path)
os.system(f"cat {path_to_metrics_json} | jq -s '.[0:]' >> {new_metrics_path}")

with open(new_metrics_path) as json_data:
    metrics_data = json.load(json_data)

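# Log train metrics: one track() call per key in every record
for metrics_dict in metrics_data:
    for k, v in metrics_dict.items():
        # Replace empty/falsy values with NaN (note that this also matches 0)
        if not v:
            v = numpy.nan

        logger.track(float(v), k, step=metrics_dict["iteration"])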

Expected behavior

Pushing the metrics should not take more than 15 minutes.
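To check whether the time goes into per-call overhead rather than payload transfer, each `track` call in the reproduction script could be timed individually. The following is a minimal sketch, not part of the original report: `timed_track` and `fake_track` are hypothetical helpers, and `fake_track` is only a stand-in so the example runs without a server. In a real test, `logger.track` from the script above would be passed instead.

```python
import time
from statistics import mean


def timed_track(track, records, step_key="iteration"):
    """Call track(value, name, step=...) for every key of every record
    and return the duration of each call in seconds."""
    durations = []
    for record in records:
        step = record[step_key]
        for name, value in record.items():
            start = time.perf_counter()
            track(float(value), name, step=step)
            durations.append(time.perf_counter() - start)
    return durations


def fake_track(value, name, step=None):
    """Hypothetical stand-in for Run.track that simulates a 1 ms call."""
    time.sleep(0.001)


# Small synthetic data set shaped like the records in the script above.
records = [{"iteration": i, "loss": 1.0 / (i + 1)} for i in range(5)]
durations = timed_track(fake_track, records)

print(f"{len(durations)} calls, "
      f"mean {mean(durations) * 1000:.2f} ms, "
      f"total {sum(durations):.3f} s")
```

If the mean call time against the remote server is close to the network round-trip time, the total is roughly the number of calls times that latency, regardless of how small each value is.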

Environment

  • Aim Version 3.20.1
  • Python version 3.9.18
  • OS Ubuntu 22 LTS
@diogo-sr diogo-sr added help wanted Extra attention is needed type / bug Issue type: something isn't working labels Jul 19, 2024
@alberttorosyan
Member

@diogo-sr thanks for raising this issue. Performance in general and of the tracking server is a priority for the team.
@mihran113, could you please take a look? Could you please share the results we got after re-implementing the tracking server?

@diogo-sr
Author

Thank you for picking it up! Looking forward to hearing back from you.

@peter-sk
Contributor

peter-sk commented Aug 3, 2024

Slightly related to PR #3203. Copying a tracked sequence of 1M steps, only a few megabytes' worth of data, is really slow. The PR updates aim so that the remote tree can be updated in chunks.

I am not sure how easy this is to integrate into direct tracking to a remote repository. But we are now tracking to a local repository and syncing the runs to a remote repository in close to real time using custom sync code, which we will be happy to contribute to aim once the aim backend has support for chunk updates.
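The general idea of sending a long sequence in chunks rather than one element at a time can be sketched as follows. This is an illustration only: it is neither the code from PR #3203 nor the custom sync code mentioned above, and `iter_chunks`, `sync_in_chunks` and `send_chunk` are hypothetical names that do not exist in aim.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Point = Tuple[int, float]  # (step, value)


def iter_chunks(points: Sequence[Point], chunk_size: int) -> Iterator[List[Point]]:
    """Yield consecutive slices of at most chunk_size points."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(points), chunk_size):
        yield list(points[start:start + chunk_size])


def sync_in_chunks(points: Sequence[Point],
                   send_chunk: Callable[[List[Point]], None],
                   chunk_size: int = 1000) -> int:
    """Send all points through send_chunk, one call per chunk.

    Returns the number of calls made, i.e. the number of round trips
    a remote transport would need."""
    calls = 0
    for chunk in iter_chunks(points, chunk_size):
        send_chunk(chunk)
        calls += 1
    return calls


# Stand-in transport: collect the chunks in a list instead of sending them.
received: List[List[Point]] = []
points = [(step, float(step)) for step in range(2500)]
calls = sync_in_chunks(points, received.append, chunk_size=1000)

print(calls, [len(chunk) for chunk in received])
```

Here 2500 points need 3 calls instead of 2500, which is the kind of saving that matters when each call pays a network round trip.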

@diogo-sr
Author

Hi @peter-sk, I have just re-tested my script using the latest version of aim, v3.24.0, and the speed is the same as with previous versions.
