the question about the Learning-based algorithm #89

jialianchen · 2021-07-12T09:21:46Z

Hello, I am trying to integrate our learning-based algorithms into ccp, but I have run into a problem.
This is the AIMD demo; the only difference from the original demo is line 20 (self.actor = model.Network([5,8], 6, 1e-4)).
I can run the original demo, but I cannot run the version that adds line 20.

import sys
import portus
import time
import numpy as np
import datagram_pb2
import model
from os import path


class AIMDFlow():
    INIT_CWND = 10

    def __init__(self, datapath, datapath_info):
        self.datapath = datapath
        self.datapath_info = datapath_info
        self.init_cwnd = float(self.datapath_info.mss * AIMDFlow.INIT_CWND)
        self.cwnd = self.init_cwnd
        self.datapath.set_program("default", [("Cwnd", int(self.cwnd))])
        # the only difference from the original demo
        self.actor = model.Network([5,8], 6, 1e-4)

    def on_report(self, r):
        if r.loss > 0 or r.sacked > 0:
            self.cwnd /= 2
        else:
            self.cwnd += (self.datapath_info.mss * (r.acked / self.cwnd))
        self.cwnd = max(self.cwnd, self.init_cwnd)
        self.datapath.update_field("Cwnd", int(self.cwnd))


class AIMD(portus.AlgBase):

    def datapath_programs(self):
        return {
                "default" : """\
                (def (Report
                    (volatile acked 0) 
                    (volatile sacked 0) 
                    (volatile loss 0) 
                    (volatile timeout false)
                    (volatile rtt 0)
                    (volatile inflight 0)
                ))
                (when true
                    (:= Report.inflight Flow.packets_in_flight)
                    (:= Report.rtt Flow.rtt_sample_us)
                    (:= Report.acked (+ Report.acked Ack.bytes_acked))
                    (:= Report.sacked (+ Report.sacked Ack.packets_misordered))
                    (:= Report.loss Ack.lost_pkts_sample)
                    (:= Report.timeout Flow.was_timeout)
                    (fallthrough)
                )
                (when (|| Report.timeout (> Report.loss 0))
                    (report)
                    (:= Micros 0)
                )
                (when (> Micros Flow.rtt_sample_us)
                    (report)
                    (:= Micros 0)
                )
            """
        }

    def new_flow(self, datapath, datapath_info):
        return AIMDFlow(datapath, datapath_info)

alg = AIMD()

portus.start("netlink", alg, debug=True)

And this is my model.py. Every time execution reaches the line self.pi = self.CreateNetwork(inputs=self.inputs), it reports a segmentation fault. (I'm sure this part runs successfully outside of ccp.)

import tensorflow as tf


class Network():
    def CreateNetwork(self, inputs):
        with tf.variable_scope('actor'):
            pi = tf.contrib.layers.fully_connected(inputs, self.a_dim)
        return pi
    def __init__(self, state_dim, action_dim, learning_rate):
        self.s_dim = state_dim
        self.a_dim = action_dim
        self.lr_rate = learning_rate

        self.sess = tf.Session()

        self.inputs = tf.placeholder(tf.float32, [None, self.s_dim[0], self.s_dim[1]])
      
        self.pi = self.CreateNetwork(inputs=self.inputs)
        self.real_out = tf.clip_by_value(self.pi, ACTION_EPS, 1. - ACTION_EPS)
        
        self.sess.run(tf.global_variables_initializer())

So I wonder whether ccp does not support learning-based algorithms.
What should I do if I want to integrate the algorithm?

I would be very grateful for your suggestions.

akshayknarayan · 2021-07-12T16:49:25Z

Hi,
We have tested learning-based algorithms with ccp before (here: https://github.com/park-project/park/tree/master/park/envs/congestion_control). The high-level approach was to have the learning agent run in a separate process and send it RPCs. This may end up being easier for you than debugging the segfault.

That said, there is in theory nothing saying that calling a machine learning library directly from a python ccp algorithm implementation won't work. Where in the underlying code (presumably somewhere down the call stack of the fully_connected function) does your segfault occur?
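To make the separate-process idea above concrete, here is a minimal sketch. It is not how the park environment is implemented; the `AgentClient` class, the JSON-lines message format, and the placeholder policy are all assumptions made for illustration. The agent runs in its own Python process (where a real version would import TensorFlow), and the ccp side only exchanges small messages with it.

```python
import json
import subprocess
import sys

# Agent side. Normally this would live in its own file and load the model;
# it is inlined as a string here so the sketch is self-contained.
AGENT_SOURCE = r'''
import json, sys
for line in sys.stdin:
    obs = json.loads(line)
    # Placeholder policy (an assumption): a real agent would run inference here.
    if obs["loss"] > 0:
        cwnd = obs["cwnd"] / 2
    else:
        cwnd = obs["cwnd"] + obs["mss"]
    sys.stdout.write(json.dumps({"cwnd": cwnd}) + "\n")
    sys.stdout.flush()
'''


class AgentClient:
    """Hypothetical helper: one JSON object per line, request then reply."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", AGENT_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def act(self, cwnd, loss, mss):
        # Send one observation and block until the agent answers.
        request = {"cwnd": cwnd, "loss": loss, "mss": mss}
        self.proc.stdin.write(json.dumps(request) + "\n")
        self.proc.stdin.flush()
        reply = json.loads(self.proc.stdout.readline())
        return reply["cwnd"]

    def close(self):
        # Closing stdin gives the agent EOF, which ends its read loop.
        self.proc.stdin.close()
        self.proc.wait(timeout=5)
        self.proc.stdout.close()
```

With something like this, on_report could call client.act(self.cwnd, r.loss, self.datapath_info.mss) and pass the result to update_field, so no machine learning code is loaded inside the portus process.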

fcangialosi · 2021-07-12T22:39:04Z

Hi,

If the AIMD demo works for you without the self.actor = ... line, then I would guess that model.Network(...) is actually raising an exception, but the portus library is squelching the traceback and segfaulting. In other words, there's some issue with the model line and you're just not getting the error message. I think this is an error with the portus python library that it's not showing you the error message properly -- I will try to look into this sometime soon. In the meantime, there's one thing you can try. Basically you want to try creating an instance of the AIMDFlow class without using portus. So something like this:

1. Comment out the last line (portus.start)
2. Comment out everything other than your self.actor line within __init__
3. Add this line at the bottom instead of portus.start: AIMDFlow(None, None) (we don't care about the arguments here since we're not actually using them)

Hopefully this will reveal an error with the self.actor line, which you can then fix and revert the rest back to normal. If that works fine then we'll need to do some more investigation.
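The check described above can be sketched as a standalone script. To keep it runnable without TensorFlow or portus, `BrokenNetwork` is a hypothetical stand-in for model.Network that always fails, and the `network_cls` parameter and `construct_without_portus` helper are names invented for this example; in the real file you would keep the original self.actor = model.Network([5,8], 6, 1e-4) line.

```python
import traceback


class BrokenNetwork:
    """Hypothetical stand-in for model.Network that fails on construction."""

    def __init__(self, state_dim, action_dim, learning_rate):
        raise ValueError("example failure while building the network")


class AIMDFlow:
    def __init__(self, datapath, datapath_info, network_cls=BrokenNetwork):
        # Everything except the actor line is commented out for this check,
        # so the None arguments are never touched.
        self.actor = network_cls([5, 8], 6, 1e-4)


def construct_without_portus(network_cls=BrokenNetwork):
    """Return None on success, or the formatted traceback on failure."""
    try:
        AIMDFlow(None, None, network_cls)
    except Exception:
        return traceback.format_exc()
    return None


if __name__ == "__main__":
    report = construct_without_portus()
    print(report or "constructed OK")
```

If the constructor raises an ordinary Python exception, the full traceback is printed here instead of being lost. A crash inside native code would still kill the process, so a clean run of this script does not rule out a problem that only appears once portus has started.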

Would you be able to share your repo with us so we can try to reproduce?

Thanks!

jialianchen · 2021-07-13T03:54:53Z

Thanks for your help!
Here is my repo: https://github.com/jialianchen/cpp_test. My TensorFlow version is 1.14.
I tried your suggestion of creating an instance of the AIMDFlow class without using portus, and it works fine.

fcangialosi · 2021-07-13T14:49:23Z

Thanks for the link to the repo. I just checked out the code and was able to reproduce your problem: with the actor line it does segfault, and with the actor line but without portus it works ok. I'm not quite sure why it's segfaulting yet, but I have at least narrowed down the problem a bit. It seems to be an issue with calling that tensorflow code once portus has started. Moving the self.actor line into AIMD.__init__ (instead of AIMDFlow.__init__) and then passing the actor as another argument to AIMDFlow works for me without segfaulting. I'm not sure what the interface for actually using self.actor is, so I haven't been able to try actually using the actor variable to see if that code will work within portus or not. Can you please give that a try when you get a chance and let us know if it works? Also if you update the code please update the repo as well so we can try to reproduce.

Just for a bit of context, portus itself is actually written in rust. The portus python package basically runs the rust code and then calls into the python interpreter. It's using an early-stage package to interface between python and rust and sometimes there are some strange memory issues to debug. There's no fundamental reason why this should not work, it's just that the tf library is a complicated case and not something we've tested yet. I believe with some testing this should be solvable.
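The workaround described above (building the actor in AIMD.__init__ and passing it to AIMDFlow) could look roughly like the sketch below. This is not the exact code that was tested: `AlgBase` and `StubActor` are stand-ins for portus.AlgBase and model.Network so that the example runs on its own, and datapath_programs and on_report are omitted for brevity.

```python
class AlgBase:
    """Stand-in for portus.AlgBase so the sketch runs without portus."""


class StubActor:
    """Hypothetical stand-in for model.Network([5, 8], 6, 1e-4)."""

    def __init__(self, state_dim, action_dim, learning_rate):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.learning_rate = learning_rate


class AIMDFlow:
    def __init__(self, datapath, datapath_info, actor):
        self.datapath = datapath
        self.datapath_info = datapath_info
        # The actor is received ready-made; no TensorFlow graph is built here.
        self.actor = actor


class AIMD(AlgBase):
    def __init__(self):
        # Built once, when the algorithm object is created, which in the
        # original script happens before portus.start() is called.
        self.actor = StubActor([5, 8], 6, 1e-4)

    def new_flow(self, datapath, datapath_info):
        # Every flow gets a reference to the same actor instance.
        return AIMDFlow(datapath, datapath_info, self.actor)


alg = AIMD()
# In the real script this would be followed by:
# portus.start("netlink", alg, debug=True)
```

One consequence of this layout is that all flows share a single actor. That may be what you want for a shared policy, but any per-flow state would need to be kept on the AIMDFlow object rather than inside the actor.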

akshayknarayan transferred this issue from ccp-project/ccp-kernel Jul 12, 2021
