
Plugin Architecture #670

Merged
merged 35 commits on Jan 7, 2025


Collaborator

@mplsgrant mplsgrant commented Nov 27, 2024

The Plugin Architecture

(relies on #664)

This new click architecture allows users to launch plugins from within their network.yaml files. I have included a SimLN plugin as an example with commentary. I have also crafted some tests around it.

To get started, checkout the branch and use warnet init to get access to the plugins folder. Make sure to review the README in the plugin folder and the documentation I've created at docs/plugins.md.

@mplsgrant mplsgrant requested a review from pinheadmz November 27, 2024 09:45
@bdp-DrahtBot
Collaborator

bdp-DrahtBot commented Nov 27, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

@pinheadmz
Contributor

What's this in the test?

https://github.com/bitcoin-dev-project/warnet/actions/runs/12124694288/job/33802978126?pr=670#step:9:159

Reason: Handshake status 500 Internal Server Error -+-+- {'content-length': '29', 'content-type': 'text/plain; charset=utf-8', 'date': 'Mon, 02 Dec 2024 17:32:31 GMT'} -+-+- b'container not found ("simln")'

Contributor

@pinheadmz pinheadmz left a comment

I'm still reviewing but wanted to post my comments so far

resources/plugins/simln/charts/simln/files/sim.json (outdated, resolved)
resources/plugins/simln/simln.py (outdated, resolved)
@click.argument("pod", type=str)
@click.argument("method", type=str)
@click.argument("params", type=str, nargs=-1) # this will capture all remaining arguments
def rpc(pod: str, method: str, params: tuple[str, ...]):
Contributor

simln doesn't have an RPC interface. It just loads a config file and starts running, so I don't think it needs anything besides a chart and whatever code is needed to create the config file

Collaborator Author

Changed this to sh so that users can ./simln.py sh [podname] ls or similar. Not married to this, but seems helpful to be able to quickly do that.

resources/plugins/simln/simln.py (outdated, resolved)
resources/plugins/simln/simln.py (outdated, resolved)
src/warnet/plugins.py (outdated, resolved)
src/warnet/plugins.py (outdated, resolved)
@mplsgrant mplsgrant force-pushed the 2024-11-plugins branch 3 times, most recently from 7530c3b to 3ef92bc Compare December 6, 2024 17:40
@mplsgrant
Collaborator Author

What's this in the test?

https://github.com/bitcoin-dev-project/warnet/actions/runs/12124694288/job/33802978126?pr=670#step:9:159

Reason: Handshake status 500 Internal Server Error -+-+- {'content-length': '29', 'content-type': 'text/plain; charset=utf-8', 'date': 'Mon, 02 Dec 2024 17:32:31 GMT'} -+-+- b'container not found ("simln")'

The test ran a command against a pod that didn't exist yet. I fixed it with a wait_for_pod.
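The fix described above (wait until the pod exists before running a command against it) can be sketched as a generic polling helper. This is not Warnet's actual wait_for_pod; the name wait_for and its parameters are assumptions made for illustration, and the predicate would be whatever check tells you the pod is there.

```python
import time
from typing import Callable


def wait_for(predicate: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds have elapsed.

    Hypothetical helper: returns True on success and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would pass something like a "does the simln pod exist yet?" check as the predicate and only run the command once wait_for returns True.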

@mplsgrant mplsgrant force-pushed the 2024-11-plugins branch 3 times, most recently from ee58e0a to b91fe73 Compare December 6, 2024 21:45
@a-mpch a-mpch left a comment

I just tried out how plugins might work and saw some errors running ln_init.

I'm using macOS and Python 3.10.9.

Steps to reproduce:

  1. warnet new mplsnet
  2. changed network.yaml inside mplsnet/networks/mplsnet/network.yaml
network.yaml
caddy:
  enabled: true
fork_observer:
  configQueryInterval: 20
  enabled: true
nodes:
  - addnode:
      - tank-0001
    image:
      tag: "27.0"
    name: tank-0000
    ln:
      lnd: true
  - addnode:
      - tank-0000
    image:
      tag: "27.0"
    name: tank-0001
    ln:
      lnd: true
  - name: tank-0002
    addnode:
      - tank-0000
    ln:
      lnd: true
    lnd:
      config: |
        bitcoin.timelockdelta=33
      channels:
        - id:
            block: 300
            index: 1
          target: tank-0004-ln
          capacity: 100000
          push_amt: 50000
  - name: tank-0004
    addnode:
      - tank-0000
    ln:
      lnd: true
    lnd:
      channels:
        - id:
            block: 300
            index: 2
          target: tank-0005-ln
          capacity: 50000
          push_amt: 25000
  - name: tank-0005
    addnode:
      - tank-0000
    ln:
      lnd: true
plugins:
  postDeploy:
    # Take note: the path to the plugin file is relative to the `network.yaml` file. The location of your `plugin.py` file and `network.yaml` file may differ than what is shown below.
    - '../../../resources/plugins/simln/plugin.py launch-activity ''[{"source": "tank-0003-ln", "destination": "tank-0005-ln", "interval_secs": 1, "amount_msat": 2000}]'''
  3. warnet deploy <path>
    and found the following errors:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mpch/repo/opensource/warnet/src/warnet/deploy.py", line 328, in deploy_network
    if any([queue.get() for _ in range(queue.qsize())]):
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()
NotImplementedError
  4. warnet status seems fine
  5. warnet logs tank-0000 seems fine
  6. warnet run resources/scenarios/ln_init.py and then warnet logs -f <commander>
Exception in thread Thread-34 (matching_graph):
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
Exception in thread Thread-33 (matching_graph):
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Thread-35 (matching_graph):
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
    self.run()
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Thread-36 (matching_graph):
  File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
Exception in thread Thread-37 (matching_graph):
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
  File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/shared/archive.pyz/ln_init.py", line 374, in matching_graph
KeyError: 'source_policy'
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
KeyError: 'source_policy'
    self.run()
    self._target(*self._args, **self._kwargs)
  File "/shared/archive.pyz/ln_framework/ln.py", line 37, in from_lnd_describegraph
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
  File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
AttributeError: 'NoneType' object has no attribute 'get'
KeyError: 'source_policy'
    self._target(*self._args, **self._kwargs)
  File "/shared/archive.pyz/ln_init.py", line 375, in matching_graph
KeyError: 'source_policy'
LNInit   All LN nodes have matching graph!

It seems like the threads are failing while iterating over the channels. Maybe we want the test to fail when some threads fail 🤔. It might also be a configuration issue, or come from the LN config and be unrelated to this PR, but if this is not expected it could prevent the plugin itself from being tested.
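On the idea of failing when a worker thread fails: an exception raised inside a plain threading.Thread target is reported but does not propagate to the thread that started it. One way to surface such failures, sketched here with the standard library rather than taken from ln_init.py, is to submit the work to a ThreadPoolExecutor and call result() on every future. The name run_all is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_all(tasks: List[Callable[[], object]]) -> list:
    """Run zero-argument callables in threads and re-raise the first failure."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Future.result() re-raises any exception raised inside the worker,
        # so a failing task fails the caller instead of being silently logged.
        return [future.result() for future in futures]
```

With this shape, a KeyError in one worker would stop the caller before it could report success.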

@mplsgrant
Collaborator Author

@a-mpch Thanks for taking the time to experiment with this!

Thank you for calling attention to this:

Exception in thread Thread-34 (matching_graph):
Traceback (most recent call last):
File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner

I noticed that cropped up in some of my runs, and we'll need to investigate that.

However, if you would like to experiment with running the 'activity' file and downloading the results, I think changing a couple things in your setup will get you there despite that error message.

First, you'll need to change the relative path of the plugin.py file. Second, make sure to update the node names listed in the command. It should probably look like this:

- '../../plugins/simln/plugin.py launch-activity ''[{"source": "tank-0004-ln", "destination": "tank-0005-ln", "interval_secs": 1, "amount_msat": 2000}]'''

This makes me think that perhaps I should update the docs so that they mirror the Warnet user folder rather than the Warnet test framework.

If you have a moment, please give that a shot and let me know what it comes back with. Also, do note it will take a minute for the SimLN program to generate results for you to download.
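The relative-path rule discussed above (the plugin path in network.yaml is interpreted relative to the network.yaml file itself) can be sketched with pathlib. The function name and the example directory layout are assumptions for illustration, not Warnet's code.

```python
import os
from pathlib import Path


def resolve_plugin_path(network_file: Path, relative_plugin: str) -> Path:
    """Resolve a plugin path written relative to the directory of network.yaml."""
    combined = network_file.parent / relative_plugin
    # normpath collapses the ".." segments without touching the filesystem
    return Path(os.path.normpath(combined))
```

For a hypothetical user folder at /home/user/mplsnet, the file networks/mplsnet/network.yaml combined with ../../plugins/simln/plugin.py resolves to /home/user/mplsnet/plugins/simln/plugin.py.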

@m3dwards
Collaborator

I've not looked at the code but I've given plugins a spin as a user. It runs!!

I hit an issue with multiprocessing's Queue.qsize() on macOS, which I think you have already fixed.

I had a question about the difference between pre deploy and pre network and I now realise that pre network is basically post logging stack.

In the ln test should there be something that checks the output of the hello plugin?

Can't see anything in the log output of warnet deploy that plugins are running or being started. I could see ln_init running. Not sure if just saying "running hello plugin before tank-0000 deploy" is enough or if we want the std out like ln_init.

I don't think plugins should read the structure of the yaml file, at least not as common practice, as this creates a hard tie between the structure of the file and all plugins. I think there should be an intermediate data structure (perhaps just a dict).
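The intermediate-structure idea above might look something like the following sketch. The function name, the keys of the returned dict, and the "-ln" pod-name suffix convention are assumptions drawn from the node names seen in this thread, not Warnet's actual interface.

```python
def build_plugin_context(parsed_network: dict) -> dict:
    """Reduce a parsed network file to a small dict that plugins can rely on.

    Hypothetical sketch: plugins depend on these keys only, not on the
    layout of network.yaml.
    """
    nodes = parsed_network.get("nodes", [])
    return {
        "node_names": [node["name"] for node in nodes],
        # Assumption: a node with `ln: {lnd: true}` gets a pod named "<name>-ln"
        "ln_node_names": [
            f'{node["name"]}-ln' for node in nodes if (node.get("ln") or {}).get("lnd")
        ],
    }
```

If the yaml layout changes later, only this one function needs to change.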

@mplsgrant
Collaborator Author

@m3dwards Thanks for checking out the plugin architecture!

I've addressed your feedback in the following way:

  • Fix multithreading on macos
  • Create a separate network file to test the plugin system
  • Create useful messages in the console when doing plugin operations
  • Avoid reading the yaml file; instead pass along only those values needed
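For context on the macOS item: the Python documentation notes that multiprocessing.Queue.qsize() may raise NotImplementedError on platforms such as macOS where sem_getvalue() is not implemented, which matches the traceback reported earlier in this thread. A portable alternative, sketched below and not necessarily the fix applied in this PR, is to read a known number of results instead of asking the queue for its size. The name collect_results is hypothetical.

```python
import multiprocessing
import queue as queue_module


def collect_results(result_queue, expected: int, timeout: float = 5.0) -> list:
    """Read up to `expected` items from a queue without calling qsize()."""
    results = []
    for _ in range(expected):
        try:
            results.append(result_queue.get(timeout=timeout))
        except queue_module.Empty:
            # A producer did not report in time; return what we have
            break
    return results
```

Here `expected` would be the number of worker processes, each of which puts exactly one value on the queue.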

@mplsgrant mplsgrant force-pushed the 2024-11-plugins branch 2 times, most recently from a9498f9 to 10b8fb1 Compare December 23, 2024 07:15
This includes a *hello* network on the networks folder
Don't bother the user with messages from the plugin system if there are no plugin processes running.
Contributor

@pinheadmz pinheadmz left a comment

I think this is going very well. Most importantly I manually deployed a network with simln defined as a plugin and everything just worked. I have a bunch of nits and questions about the implementation, but since the functionality seems to be in place, I also think it's OK to merge and then just move forward with follow-ups.

Comment on lines +21 to +23
Once `deploy` completes, view the pods of the *hello* network by invoking `kubectl get all -A`.

To view the various "Hello World!" messages, run `kubectl logs pod/POD_NAME`
Contributor

Aren't we trying to avoid kubectl commands in the docs? Let's just recommend warnet logs -f <pod name>?

Collaborator Author

warnet logs -f <pod name> runs into an issue -- it can't determine the primary container.

We could get around this by choosing to treat the first container as the primary container.

Contributor

There is some logic for this already. Maybe we can address it in a follow-up:

try:
    pod = get_pod(pod_name, namespace=namespace)
    eligible_container_names = [BITCOINCORE_CONTAINER, COMMANDER_CONTAINER]
    available_container_names = [container.name for container in pod.spec.containers]
    container_name = next(
        (
            container_name
            for container_name in available_container_names
            if container_name in eligible_container_names
        ),
        None,
    )
    if not container_name:
        print("Could not determine primary container.")
        return

Collaborator Author

If we continue using the existing logic, we would need to dictate that plugins set a PLUGIN_CONTAINER as their primary container. I'm not totally opposed to that.
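The two options raised in this exchange (match against a list of known container names, or fall back to the first container) can be combined in one small function. This is a sketch under assumptions: the function name is hypothetical, and the container name strings in the usage note are placeholders, not the real values of BITCOINCORE_CONTAINER or COMMANDER_CONTAINER.

```python
from typing import List, Optional


def pick_primary_container(available: List[str], eligible: List[str]) -> Optional[str]:
    """Prefer a known container name; otherwise treat the first container as primary."""
    for name in available:
        if name in eligible:
            return name
    # Fallback discussed above: use the first container in the pod, if any
    return available[0] if available else None
```

For example, a pod whose only container is named "simln" would yield "simln" even though it is not in the eligible list, while a pod with containers ["sidecar", "commander"] and an eligible list containing "commander" would yield "commander".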

resources/plugins/hello/plugin.py (outdated, resolved)
Comment on lines +64 to +68
@hello.command()
@click.argument("plugin_content", type=str)
@click.argument("warnet_content", type=str)
@click.pass_context
def entrypoint(ctx, plugin_content: str, warnet_content: str):
Contributor

What's the purpose of this command? It looks like the syntax would be warnet hello entrypoint? Wouldn't _entrypoint() be called by deploy already?

Comment on lines +167 to +176
cmd = (
    f"{network_file_path.parent / entrypoint_path / Path('plugin.py')} entrypoint "
    f"'{json.dumps(plugin_content)}' '{json.dumps(warnet_content)}'"
)
print(
    f"Queuing {hook_value.value} plugin command: {plugin_name} with {plugin_content}"
)

process = Process(target=run_command, args=(cmd,))
processes.append(process)
Contributor

OK, so the mechanism is to run the arbitrary Python script in a subcommand with a hard-coded argument (entrypoint). Would it make any more sense to import the script directly and, for example, look in the module for a function called entrypoint() (or main(), etc.)?

Collaborator Author

@mplsgrant mplsgrant Jan 2, 2025

Sending json to the plugin via the command line felt more inspectable and extensible and also less hidden. Having warnet slurp up the script and then start calling specific functions seems like it would be just as hard coded while also hiding entrypoint which is an important part of the plugin's api.
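The "JSON over the command line" approach described above can be sketched with the standard library alone. This is an illustration, not the PR's code: both function names are hypothetical, and shlex is used for quoting here (instead of wrapping the JSON in single quotes by hand) so that values containing quotes or spaces survive the round trip.

```python
import json
import shlex


def build_plugin_command(plugin_path: str, plugin_content: dict, warnet_content: dict) -> str:
    """Build an inspectable shell command that passes two JSON documents to a plugin."""
    parts = [plugin_path, "entrypoint", json.dumps(plugin_content), json.dumps(warnet_content)]
    return shlex.join(parts)


def parse_plugin_command(command: str):
    """Inverse of build_plugin_command: roughly what the plugin side would decode."""
    path, subcommand, plugin_json, warnet_json = shlex.split(command)
    return path, subcommand, json.loads(plugin_json), json.loads(warnet_json)
```

Because the command is an ordinary string, it can be printed, copied into a terminal, and rerun by hand, which is the inspectability argument made in the comment above.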

resources/plugins/simln/README.md (outdated, resolved)
src/warnet/k8s.py (outdated, resolved)
Comment on lines +339 to +340
if not quiet:
    print(f"Timeout waiting for initContainer in {pod_name} ({namespace}) to be ready.")
Contributor

Looks like quiet is only set to True in simln/plugin.py. Why swallow this one log message?

Collaborator Author

I didn't want that message sent to terminal because it interfered with piping and other terminal functions.
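One common alternative to a quiet flag, offered here only as a sketch and not as what this PR does, is to send diagnostics to stderr so that stdout stays clean for piping. The helper name warn is hypothetical.

```python
import sys


def warn(message: str) -> None:
    """Write a diagnostic to stderr so that piped stdout carries only real output."""
    print(message, file=sys.stderr)
```

With this approach a user piping the plugin's stdout into another program would still see the timeout message on the terminal, but it would not end up in the piped data.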

src/warnet/process.py (outdated, resolved)
.github/workflows/test.yml (outdated, resolved)
@@ -0,0 +1,87 @@
nodes:
Contributor

Any reason to put this network here instead of /test/data ? This location will make it a "default" network that gets copied into all new warnets. Maybe the idea is to have a plugin template as a default? Either way I prefer networks that are run by tests to live in /test/data.

Especially because this network.yaml generates tons of LN activity between two specific nodes, not like general network activity. It seems less default-y

Collaborator Author

I like having the 'hello' network as something that users get when they init a new warnet user directory because I think it's helpful to have an "all in" example for users to look at.

That said, I can understand creating a copy of the hello network in the testing directory so that all the test stuff is isolated.



# Each Warnet plugin must have an entrypoint function which takes two JSON objects: plugin_content
# and warnet_content. We have seen the PluginContent enum above. Warnet also has a WarnetContent
Contributor

Does the user need to provide warnet content? It seems like the biggest advantage of the plugin architecture is direct access to the cluster.

Collaborator Author

We could make the warnet content optional. Would that solve this?

Comment on lines +103 to +118
def _get_example_activity() -> list[dict]:
    pods = get_mission(LIGHTNING_MISSION)
    try:
        pod_a = pods[1].metadata.name
        pod_b = pods[2].metadata.name
    except Exception as err:
        raise PluginError(
            "Could not access the lightning nodes needed for the example.\n Try deploying some."
        ) from err
    return [{"source": pod_a, "destination": pod_b, "interval_secs": 1, "amount_msat": 2000}]


@simln.command()
def get_example_activity():
    """Get an activity representing node 2 sending msat to node 3"""
    print(json.dumps(_get_example_activity()))
Contributor

This seems unnecessary. simln can generate its own random activity. I also don't love how it's used in the test: create a JSON file from the plugin, then pass it back to the plugin? I feel like it should be more self-contained, like simln/plugin.py entrypoint --with-activity-pattern <some options>

Collaborator Author

I made this for users to quickly see how to create an activity and what an activity looks like.

Users can run this to get an example activity based on their cluster:

./simln/plugin.py get-example-activity

And it also allows this kind of thing:

./plugins/simln/plugin.py launch-activity "$(./plugins/simln/plugin.py get-example-activity)"

I can see how this might seem more limited than omitting the activity and using the sim-ln random payment feature.
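For readers who want to write an activity by hand instead of calling get-example-activity, the shape used throughout this thread (source, destination, interval_secs, amount_msat) can be produced with a few lines of Python. The function name make_activity and its validation rules are assumptions for illustration; they are not part of the plugin.

```python
import json


def make_activity(source: str, destination: str, interval_secs: int = 1, amount_msat: int = 2000) -> list:
    """Build a one-entry activity list in the shape shown in this thread."""
    # Assumed sanity checks, not rules taken from the plugin or from sim-ln
    if source == destination:
        raise ValueError("source and destination must differ")
    if interval_secs <= 0 or amount_msat <= 0:
        raise ValueError("interval_secs and amount_msat must be positive")
    return [
        {
            "source": source,
            "destination": destination,
            "interval_secs": interval_secs,
            "amount_msat": amount_msat,
        }
    ]
```

Passing json.dumps(make_activity("tank-0004-ln", "tank-0005-ln")) as the argument to launch-activity would then mirror the example command given earlier.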

@pinheadmz
Contributor

I also think the preNode hook could be better integrated (or a new hook added) for something like circuit breaker, which will ideally be deployed inside each LN pod. That may require a special approach because, instead of just calling run_plugins(...), we'll (potentially) need to apply additional container definitions to the pod being deployed in deploy_single_node(). This can be follow-up work, but we'll need it for LN jamming.

@pinheadmz
Contributor

simln test is failing for me locally.

simln logs:

[sim_cli] Connected to tank-0000-ln - Node ID: 03cf919aa51162e2c5cc0acc35b80090291f0c51dfae033abf9fc4252d1057a2ae.
[sim_cli] Connected to tank-0001-ln - Node ID: 030e9a3a436ca10558c73774fa85c1d1746d39810c0adca4921cf4d56aad3f151f.
[sim_cli] Connected to tank-0002-ln - Node ID: 025d8a4811095d5c382f274be81c0ea7d17bbc91a9b9e94569ce6b4e8afdcf5df0.
[sim_cli] Connected to tank-0003-ln - Node ID: 02345e58df811b7980b466e4c90ca5c0e06e542aaee1de06ab08e7e777cb67cf14.
[sim_cli] Connected to tank-0004-ln - Node ID: 03da59baecdbb30d6dfe08ecf85a666812da53ac136750bc85dcd84d6d9d45b40e.
[sim_cli] Connected to tank-0005-ln - Node ID: 0343980bc91094d868fc26aff18a9510f8027719798f6ca7e8f01e8626a8e88781.
[sim_lib] Running the simulation forever.
[sim_lib] Simulation is running on regtest.
[sim_lib] Simulating 0 activity on 6 nodes.
[sim_lib] Summary of results will be reported every 60s.
[sim_lib] Node: 03cf919aa51162e2c5cc0acc35b80090291f0c51dfae033abf9fc4252d1057a2ae not eligible for activity generation: InsufficientCapacity: node needs at least 7600000 capacity (has: 0) to process expected payment amount: 3800000.
[sim_lib] Node: 030e9a3a436ca10558c73774fa85c1d1746d39810c0adca4921cf4d56aad3f151f not eligible for activity generation: InsufficientCapacity: node needs at least 7600000 capacity (has: 0) to process expected payment amount: 3800000.
[sim_lib] Node: 025d8a4811095d5c382f274be81c0ea7d17bbc91a9b9e94569ce6b4e8afdcf5df0 not eligible for activity generation: InsufficientCapacity: node needs at least 7600000 capacity (has: 0) to process expected payment amount: 3800000.
[sim_lib] Created network generator: network graph view with: 3 channels.
[sim_lib] Starting activity producer for tank-0004-ln(03da59...45b40e): activity generator for capacity: 75000000 with multiplier 2: 39.473684210526315 payments per month (0.05482456140350877 per hour).
[sim_lib] Starting activity producer for tank-0003-ln(02345e...67cf14): activity generator for capacity: 50000000 with multiplier 2: 26.31578947368421 payments per month (0.03654970760233918 per hour).
[sim_lib] Starting activity producer for tank-0005-ln(034398...e88781): activity generator for capacity: 25000000 with multiplier 2: 13.157894736842104 payments per month (0.01827485380116959 per hour).
[sim_lib] Processed 0 payments sending 0 msat total with NaN% success rate.
[sim_lib] Processed 0 payments sending 0 msat total with NaN% success rate.
[sim_lib] Processed 0 payments sending 0 msat total with NaN% success rate.
[sim_lib] Processed 0 payments sending 0 msat total with NaN% success rate.
[sim_lib] Processed 0 payments sending 0 msat total with NaN% success rate.
[sim_lib] Processed 0 payments sending 0 msat total with NaN% success rate.

and test result is a timeout:


2025-01-03 13:48:57 | INFO    | test     | Checking for results file in simln-1735929785
2025-01-03 13:48:57 | INFO    | test     | Results file: simulation_1735929789.835217053s.csv
2025-01-03 13:48:57 | INFO    | test     |
2025-01-03 13:49:02 | INFO    | test     | Stopping network
2025-01-03 13:49:02 | DEBUG   | test     | Executing warnet command: down --force
2025-01-03 13:49:04 | DEBUG   | test     | Waiting for predicate with timeout 60s and interval 1s
Traceback (most recent call last):
  File "/Users/matthewzipkin/Desktop/work/warnet/test/simln_test.py", line 111, in <module>
    test.run_test()
    ~~~~~~~~~~~~~^^
  File "/Users/matthewzipkin/Desktop/work/warnet/test/simln_test.py", line 29, in run_test
    self.copy_results()
    ~~~~~~~~~~~~~~~~~^^
  File "/Users/matthewzipkin/Desktop/work/warnet/test/simln_test.py", line 50, in copy_results
    self.wait_for_predicate(partial_func)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/Users/matthewzipkin/Desktop/work/warnet/test/test_base.py", line 95, in wait_for_predicate
    f"Timed out waiting for Truth from predicate: {inspect.getsource(predicate).strip()}"
                                                   ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/inspect.py", line 1256, in getsource
    lines, lnum = getsourcelines(object)
                  ~~~~~~~~~~~~~~^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/inspect.py", line 1238, in getsourcelines
    lines, lnum = findsource(object)
                  ~~~~~~~~~~^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/inspect.py", line 1060, in findsource
    file = getsourcefile(object)
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/inspect.py", line 963, in getsourcefile
    filename = getfile(object)
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/inspect.py", line 943, in getfile
    raise TypeError('module, class, method, function, traceback, frame, or '
                    'code object was expected, got {}'.format(
                    type(object).__name__))
TypeError: module, class, method, function, traceback, frame, or code object was expected, got partial
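A side note on the traceback above: the TypeError is raised while building the timeout message, because inspect.getsource() does not accept a functools.partial object, so it hides the underlying timeout. A defensive way to describe a predicate is sketched below; describe_predicate is a hypothetical helper, not a function in test_base.py.

```python
import functools
import inspect


def describe_predicate(predicate) -> str:
    """Return readable text for a predicate, including functools.partial objects."""
    target = predicate
    # Unwrap (possibly nested) partials to reach the underlying function
    while isinstance(target, functools.partial):
        target = target.func
    try:
        return inspect.getsource(target).strip()
    except (OSError, TypeError):
        # Source is unavailable (builtins, interactive code): fall back to repr
        return repr(predicate)
```

Using such a helper in the error message would let the real "timed out waiting for predicate" failure surface instead of the TypeError.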

@pinheadmz
Contributor

Hm, this error didn't come up when I changed the simln image back to yours 🤷

@pinheadmz
Contributor

OK, the error might not have to do with the image at all but with a possible race condition? Although I dunno why simln would be starting until after ln_init funds and confirms everything...

No that is expected, we can’t do really small capacity channels in the random generation cause of some math stuff. If you make the expected payment amount smaller the capacity threshold goes down.

(from Carla)

@mplsgrant
Collaborator Author

Although I dunno why simln would be starting until after ln_init funds and confirms everything...

I think _run kicks the lightning init file into the cluster, and then the network threads can join because their job is done. This would allow a race condition.

@pinheadmz
Contributor

pinheadmz commented Jan 3, 2025

Hm but deploy() doesn't return until ln init exits

edit to add: right?

@mplsgrant
Collaborator Author

We could get around this issue of waiting for ln_init by having deploy_network check in with the cluster before moving on, or we could maybe allow plugins to inject some blocking code that could do the same.

Or, I imagine Max would say we should just make the simln plugin know about the state of the lightning network and make it do the waiting on its own accord regardless of Warnet.

@mplsgrant
Collaborator Author

mplsgrant commented Jan 3, 2025

Hm but deploy() doesn't return until ln init exits, right?

I see. Yes, that's correct.

Looks like this has to do with the random activity feature? Incidentally, running simln with sim-cli -p 1 seems to solve the issue. It could be that the default log batching value of 500 is too high for our test.

@pinheadmz
Contributor

Looks like this has to do with the random activity feature?

Not sure of that; the simln logs I captured think the nodes have 0 channel funds. Seems more like a race to me. I think the simln plugin needs to check the status of the ln_init scenario. Right now ln init is run in debug mode, which means it self-destructs when it's done. One solution would be to NOT delete the ln init commander pod when it's done, leaving it there in the cluster with a "succeeded" status.

@mplsgrant
Collaborator Author

I think the simln plugin needs to check the status of the ln_init scenario.

Agreed, but there's a complicating factor about debug mode (below).

Right now ln init is run in debug mode, which means it self-destructs when it's done. One solution would be to NOT delete the ln init commander pod when it's done, leaving it there in the cluster with a "succeeded" status.

debug mode performs a self-destruct, and it also blocks the thread, allowing ln_init to finish, which is a side effect that we want to make sure we recreate. Today, the debug flag causes the _run process to log the stdout of ln_init, and this has the side effect of forcing warnet to wait for ln_init to finish.

This means if we get rid of the debug flag, the ln_init process will return early—which would be fine, except I'm not sure what other systems rely on ln_init not returning early. Also, the _run command handles our scenarios. So, whatever we do needs to play nice with them.

@pinheadmz
Contributor

What if we _run() ln_init with debug=False and then separately call _logs() to stream the commander output while blocking ?
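The pattern proposed above (start the job without blocking, then block by streaming its output until it exits) can be illustrated with a local subprocess standing in for the commander pod. This is only an analogy under that assumption: run_and_stream is a hypothetical name, and Warnet's _run() and _logs() are not used here.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_stream(args: List[str]) -> Tuple[int, List[str]]:
    """Start a process without waiting, then block while streaming its output."""
    process = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    assert process.stdout is not None
    # Iterating over stdout blocks until the child closes it, i.e. until it is done
    for line in process.stdout:
        print(line, end="")
        lines.append(line.rstrip("\n"))
    return process.wait(), lines
```

The caller only continues, for example to start a plugin, once the streamed job has exited, and it also gets the exit code to decide whether continuing is safe.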

@mplsgrant
Collaborator Author

What if we _run() ln_init with debug=False and then separately call _logs() to stream the commander output while blocking ?

The debug=False move you specified works on its own, but it's not clear to me how a race condition could exist when simln generates random activity yet does not exist when we specify an activity. When you specify an activity locally, does the race condition crop up? If not, then maybe the issue exists at a level deeper than pod status.

@pinheadmz
Contributor

Working great at 7311bc2

tested with 10 node / 13 ch LN with simln, no issues. I'm going to run a 50-node network next on a remote cluster and check that case then I think we can merge.

@pinheadmz
Contributor

Worked great on remote cluster with 100 nodes and 150 channels! Really enjoyed seeing one single deploy command do so much work ;-)
ACK

@pinheadmz pinheadmz merged commit 80dade1 into bitcoin-dev-project:main Jan 7, 2025
14 checks passed
5 participants