
Large experiment. #692

Closed
wants to merge 14 commits into from

Conversation

oliverchang
Collaborator

@oliverchang commented Nov 6, 2024

With all oracles and 6 targets each per project.

@oliverchang
Collaborator Author

FYI @DavidKorczynski

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 --large

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 --large

@oliverchang
Collaborator Author

Base automatically changed from cloud-cached to main November 6, 2024 10:15
@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 -ns 4 --large

1 similar comment
@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 -ns 4 --large

@oliverchang
Collaborator Author

Going to have to re-run this. Looks like the experiment is stuck somehow.

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 --large

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 -ns 4 --large

@oliverchang
Collaborator Author

New report: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-11-08-692-oc-20241106-large-generated-20241106/index.html

It's still not clear to me why the last one got stuck. Hopefully #705 will help with debugging this.

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 -ns 4 --large

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241106 -b large-generated-20241106 -ns 4 --large

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241108 -i -b large-generated-20241106 -ns 4 --large

@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241108 -i -b large-generated-20241106 -ns 4 --large

oliverchang added a commit that referenced this pull request Nov 8, 2024
Per
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.apply_async,
callbacks should return immediately; otherwise they block the entire
Pool from making progress.

For large experiments, this is likely causing our throughput to
decrease as the experiment runs.

From debugging with GDB on
#692, it looks like a large
number of worker processes are stuck waiting to report results:

```
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.11/multiprocessing/queues.py", line 376, in put
    with self._wlock:
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
```

This partially reverts #566.
We instead create a new sub-process that periodically runs this in the
background, so nothing is blocked.
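
To make the failure mode concrete, here is a minimal standalone sketch (not code from this repo) of how a slow apply_async callback stalls a multiprocessing.Pool: the callback runs in the Pool's single result-handler thread, so while it is busy, completed results are not drained and workers back up trying to put results on the queue, as in the py-bt trace above.

```
# Minimal sketch (not from this repo): a slow apply_async callback blocks the
# Pool's result-handler thread, so finished tasks pile up behind it.
import multiprocessing
import time


def work(i):
    return i * i


def slow_callback(result):
    # Stand-in for an expensive per-result step (e.g. regenerating a report).
    # While this runs, no other results are processed by the Pool.
    time.sleep(5)
    print('processed', result)


if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        for i in range(8):
            pool.apply_async(work, (i,), callback=slow_callback)
        pool.close()
        pool.join()
```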
oliverchang added a commit that referenced this pull request Nov 8, 2024
…#709)

Per
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.apply_async,
callbacks should return immediately; otherwise they block the entire
Pool from making progress.

For large experiments, this likely causes our throughput to slow to a
crawl as the experiment runs, since every benchmark experiment that
finishes triggers this expensive calculation.

From debugging with GDB on
#692, it looks like a large
number of worker processes are stuck waiting to report results:

```
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.11/multiprocessing/queues.py", line 376, in put
    with self._wlock:
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
```

This partially reverts #566.
We instead create a new sub-process that periodically runs this in the
background, so nothing is blocked.
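
And a hedged sketch of the workaround described in that commit message: move the expensive step out of the callback into a separate background process that runs it on a timer, so Pool callbacks return immediately. `generate_report` is a hypothetical stand-in for whatever the callback previously computed, not this repo's actual function.

```
# Hedged sketch of the workaround: run the expensive step in its own process
# on a timer instead of inside the apply_async callback.
# `generate_report` is a hypothetical placeholder, not the repo's actual code.
import multiprocessing
import time


def generate_report():
    # Placeholder for the expensive aggregation/reporting step.
    time.sleep(5)


def report_loop(interval_seconds=60):
    while True:
        generate_report()
        time.sleep(interval_seconds)


if __name__ == '__main__':
    reporter = multiprocessing.Process(target=report_loop, daemon=True)
    reporter.start()
    # ... run the experiment Pool as usual; callbacks can now return
    # immediately instead of doing the expensive work themselves.
```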
@oliverchang
Collaborator Author

/gcbrun exp -n oc-20241108-fixed -b large-generated-20241106 -ns 4 --large
