-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plugin Architecture #670
Plugin Architecture #670
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsNo conflicts as of last run. |
46f59db
to
a95bddf
Compare
What's this in the test?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still reviewing but wanted to post my comments so far
resources/plugins/simln/simln.py
Outdated
@click.argument("pod", type=str) | ||
@click.argument("method", type=str) | ||
@click.argument("params", type=str, nargs=-1) # this will capture all remaining arguments | ||
def rpc(pod: str, method: str, params: tuple[str, ...]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simln doesn't have an RPC interface. It just loads a config file and starts running, so I don't think it needs anything besides a chart and whatever code is needed to create the config file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to sh
so that users can ./simln.py sh [podname] ls
or similar. Not married to this, but seems helpful to be able to quickly do that.
7530c3b
to
3ef92bc
Compare
Running a command against a pod that doesn't yet exist. I fixed with a |
ee58e0a
to
b91fe73
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just gave a try on how plugins might work and seen some errors running ln_init
I'm using OSX and python 3.10.9
Steps to reproduce:
warnet new mplsnet
- changed
network.yaml
insidemplsnet/networks/mplsnet/network.yaml
network.yaml
caddy:
enabled: true
fork_observer:
configQueryInterval: 20
enabled: true
nodes:
- addnode:
- tank-0001
image:
tag: "27.0"
name: tank-0000
ln:
lnd: true
- addnode:
- tank-0000
image:
tag: "27.0"
name: tank-0001
ln:
lnd: true
- name: tank-0002
addnode:
- tank-0000
ln:
lnd: true
lnd:
config: |
bitcoin.timelockdelta=33
channels:
- id:
block: 300
index: 1
target: tank-0004-ln
capacity: 100000
push_amt: 50000
- name: tank-0004
addnode:
- tank-0000
ln:
lnd: true
lnd:
channels:
- id:
block: 300
index: 2
target: tank-0005-ln
capacity: 50000
push_amt: 25000
- name: tank-0005
addnode:
- tank-0000
ln:
lnd: true
plugins:
postDeploy:
# Take note: the path to the plugin file is relative to the `network.yaml` file. The location of your `plugin.py` file and `network.yaml` file may differ than what is shown below.
- '../../../resources/plugins/simln/plugin.py launch-activity ''[{"source": "tank-0003-ln", "destination": "tank-0005-ln", "interval_secs": 1, "amount_msat": 2000}]'''
warnet deploy <path>
and found the following errors
Traceback (most recent call last):
File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/mpch/repo/opensource/warnet/src/warnet/deploy.py", line 328, in deploy_network
if any([queue.get() for _ in range(queue.qsize())]):
File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/multiprocessing/queues.py", line 126, in qsize
return self._maxsize - self._sem._semlock._get_value()
NotImplementedError
warnet status
seems finewarnet logs tank-0000
seems finewarnet run resources/scenarios/ln_init.py
and thenwarnet logs -f <commander>
Exception in thread Thread-34 (matching_graph):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
Exception in thread Thread-33 (matching_graph):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Thread-35 (matching_graph):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
self.run()
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Thread-36 (matching_graph):
File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
Exception in thread Thread-37 (matching_graph):
self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/shared/archive.pyz/ln_init.py", line 374, in matching_graph
KeyError: 'source_policy'
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
KeyError: 'source_policy'
self.run()
self._target(*self._args, **self._kwargs)
File "/shared/archive.pyz/ln_framework/ln.py", line 37, in from_lnd_describegraph
File "/usr/local/lib/python3.12/threading.py", line 1012, in run
File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
AttributeError: 'NoneType' object has no attribute 'get'
KeyError: 'source_policy'
self._target(*self._args, **self._kwargs)
File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
KeyError: 'source_policy'
LNInit All LN nodes have matching graph!
seems like threads are failing iterating over the channels? or we might want to raise a failure to the test when some thread are failing 🤔 might be configuration issue too or just from the LN config not related to this PR specifics but It might fail to test the plugin itself if this is not expected
@a-mpch Thanks for taking the time to experiment with this! Thank you for calling attention to this:
I noticed that cropped up in some of my runs, and we'll need to investigate that. However, if you would like to experiment with running the 'activity' file and downloading the results, I think changing a couple things in your setup will get you there despite that error message. First, you'll need to change the relative path of the plugin.py file. Second, make sure to update the node names listed in the command. It should probably look like this:
This makes me think that perhaps I should update the docs so that it mirrors the Warnet user folder rather than the Warnet test framework. If you have a moment, please give that a shot and let me know what it comes back with. Also, do note it will take a minute for the SimLN program to generate results for you to download. |
I've not looked at the code but I've given plugins a spin as a user. It runs!! I hit an issue with I had a question about the difference between pre deploy and pre network and I now realise that pre network is basically post logging stack. In the ln test should there be something that checks the output of the hello plugin? Can't see anything in the log output of I don't think plugin should read the structure of the yaml file. At least not as common practice as this creates a hard tie between the structure of the file and all plugins. I think there should be an intermediate data structure (perhaps just a dict). |
@m3dwards Thanks for checking out the plugin architecture! I've addressed your feedback in the following way:
|
a9498f9
to
10b8fb1
Compare
This includes a *hello* network on the networks folder
Don't bother the user with messages from the plugin system if there are no plugin processes running.
10b8fb1
to
27ac1f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is going very well. Most importantly I manually deployed a network with simln defined as a plugin and everything just worked. I have a bunch of nits and questions about the implementation but since the functionality seems to be in place, I also think its ok to merge and then just move forward with followups
Once `deploy` completes, view the pods of the *hello* network by invoking `kubectl get all -A`. | ||
|
||
To view the various "Hello World!" messages, run `kubectl logs pod/POD_NAME` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we trying to avoid kubectl commands in the docs? Lets just reccomend warnet logs -f <pod name>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warnet logs -f <pod name>
runs into an issue -- it can't determine the primary container.
We could get around this by choosing to treat the first container as the primary container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is some logic for this already. maybe we can address in a followup:
Lines 411 to 425 in 5039591
try: | |
pod = get_pod(pod_name, namespace=namespace) | |
eligible_container_names = [BITCOINCORE_CONTAINER, COMMANDER_CONTAINER] | |
available_container_names = [container.name for container in pod.spec.containers] | |
container_name = next( | |
( | |
container_name | |
for container_name in available_container_names | |
if container_name in eligible_container_names | |
), | |
None, | |
) | |
if not container_name: | |
print("Could not determine primary container.") | |
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If continue using the existing logic, we would need to dictate that plugins set a PLUGIN_CONTAINER as their primary container. I'm not totally opposed to that.
@hello.command() | ||
@click.argument("plugin_content", type=str) | ||
@click.argument("warnet_content", type=str) | ||
@click.pass_context | ||
def entrypoint(ctx, plugin_content: str, warnet_content: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the purpose of this command? Looks like the syntax would be warnet hello entrypoint
? Wouldnt _entrypoint()
be called by deploy already?
cmd = ( | ||
f"{network_file_path.parent / entrypoint_path / Path('plugin.py')} entrypoint " | ||
f"'{json.dumps(plugin_content)}' '{json.dumps(warnet_content)}'" | ||
) | ||
print( | ||
f"Queuing {hook_value.value} plugin command: {plugin_name} with {plugin_content}" | ||
) | ||
|
||
process = Process(target=run_command, args=(cmd,)) | ||
processes.append(process) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so the mechanism is to run the arbitrary python script in a subcommand with a hard-coded argument (entrypoint
). Would it make any more sense to import the script directly and, for example, look in the module for a function called entrypoint()
? ( or main()
? etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending json to the plugin via the command line felt more inspectable and extensible and also less hidden. Having warnet slurp up the script and then start calling specific functions seems like it would be just as hard coded while also hiding entrypoint
which is an important part of the plugin's api.
if not quiet: | ||
print(f"Timeout waiting for initContainer in {pod_name} ({namespace}) to be ready.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like quiet
is only set True in simln/plugin.py, why swallow this one log message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want that message sent to terminal because it interfered with piping and other terminal functions.
@@ -0,0 +1,87 @@ | |||
nodes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to put this network here instead of /test/data
? This location will make it a "default" network that gets copied into all new warnets. Maybe the idea is to have a plugin template as a default? Either way I prefer networks that are run by tests to live in /test/data
.
Especially because this network.yaml generates tons of LN activity between two specific nodes, not like general network activity. It seems less default-y
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having the 'hello' network as as something that users get when they init
a new warnet user directory because I think it's helpful to have an "all in" example for users to look at.
That said, I can understand creating a copy of the hello network in the testing directory so that all the test stuff is isolated.
|
||
|
||
# Each Warnet plugin must have an entrypoint function which takes two JSON objects: plugin_content | ||
# and warnet_content. We have seen the PluginContent enum above. Warnet also has a WarnetContent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the user need to provide warnet content? seems like the biggest advantage of the plugin architecture is direct access to the cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make the warnet content optional. Would that solve this?
def _get_example_activity() -> list[dict]: | ||
pods = get_mission(LIGHTNING_MISSION) | ||
try: | ||
pod_a = pods[1].metadata.name | ||
pod_b = pods[2].metadata.name | ||
except Exception as err: | ||
raise PluginError( | ||
"Could not access the lightning nodes needed for the example.\n Try deploying some." | ||
) from err | ||
return [{"source": pod_a, "destination": pod_b, "interval_secs": 1, "amount_msat": 2000}] | ||
|
||
|
||
@simln.command() | ||
def get_example_activity(): | ||
"""Get an activity representing node 2 sending msat to node 3""" | ||
print(json.dumps(_get_example_activity())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems unecessary. simln can generate its own random activity. I also dont love how its used in the test: create a json file from the plugin then pass it back to the plugin? I feel like it should be more self contained, like simln/plugin.py entrypoint --with-activity-pattern <some options>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this for users to quickly see how to create an activity and what an activity looks like.
Users can run this to get an example activity based on their cluster:
./simln/plugin.py get-example-activity
And it also allows this kind of thing:
./plugins/simln/plugin.py launch-activity "$(./plugins/simln/plugin.py get-example-activity)"
I can see how this might seem more limited than omitting the activity and using the sim-ln random paymnet feature.
I also think the |
simln test is failing for me locally. simln logs:
and test result is a timeout:
|
Hm this error didnt come up when i changed the simln image back to yours 🤷 |
Ok the error might not have to do with the image at all but a possible race condition? Although i dunno why simln would be starting until after ln_init funds and confirms everyhting...
(from Carla) |
Hm but deploy() doesn't return until ln init exits edit to add: right? |
We could get around this issue of waiting for ln_init by having Or, I imagine Max would say we should just make the simln plugin know about the state of the lightning network and make it do the waiting on its own accord regardless of Warnet. |
I see. Yes, that's correct. Looks like this has to do with the random activity feature? Incidentally, running simln like this |
Not sure of that, the simln logs I captured think the nodes have 0 channel funds. Seems more like a race to me. I think the simln plugin needs to check the status of the ln_init scenario. Right now ln init is run in debug mode which means it self-destructs when its done. One solution would be to NOT delete the ln init commander pod when its done, leaving it there in the cluster with a "succeeded" status. |
Agreed, but there's a complicating factor about debug mode (below).
debug mode performs a self-destruct, and it also blocks the thread allowing ln_init to finish which is a side-effect that we want to make sure we recreate. Today, the debug flag causes the _run process to log the stdout of ln_init, and this has the side effect of forcing warnet wait for ln_init to finish. This means if we get rid of the debug flag, the ln_init process will return early—which would be fine, except I'm not sure what other systems rely on ln_init not returning early. Also, the _run command handles our scenarios. So, whatever we do needs to play nice with them. |
What if we _run() ln_init with debug=False and then separately call _logs() to stream the commander output while blocking ? |
The debug=False move you specified works on its own, but its not clear to me how a race condition could exist when simln generates random activity yet it does not exist when we specify an activity. When you specify an activity locally, does the race condition crop show up? If not, then maybe the issue exists at a level deeper than pod status. |
Working great at 7311bc2 tested with 10 node / 13 ch LN with simln, no issues. I'm going to run a 50-node network next on a remote cluster and check that case then I think we can merge. |
Worked great on remote cluster with 100 nodes and 150 channels! Really enjoyed seeing one single |
The Plugin Architecture
(relies on #664)
This new click architecture allows users to launch plugins from within their
network.yaml
files. I have included a SimLN plugin as an example with commentary. I have also crafted some tests around it.To get started, checkout the branch and use
warnet init
to get access to the plugins folder. Make sure to review the README in the plugin folder and the documentation I've created at docs/plugins.md.