WIP: Native STOMP #9141

Open. Wants to merge 6 commits into main.
Conversation

@ikavgo (Contributor) commented Aug 18, 2023

NOTE: Bazel tests pass locally, though I'm not happy with the code yet.
TODO: Double-check that the Management UI correctly shows connection information.

Proposed Changes

Don't use AMQP 0-9-1 as backend for STOMP.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

Documentation update: rabbitmq/rabbitmq-website#1713

@ikavgo (Contributor Author) commented Sep 4, 2023

Regarding connection closure, UTF-8, and queue utilization: the idea is to merge this first, then tweak standards support and resources.


%% {ok, BasicMessage} = rabbit_basic:message(ExchangeName, RoutingKey, Content),

%% Delivery = #delivery{
Member: Delete any commented code.

%% when SendFun :: fun((atom(), binary()) -> term()),
%% AdapterInfo :: #amqp_adapter_info{},
%% SSLLoginName :: atom() | binary(),
%% PeerAddr :: inet:ip_address().
Member: Uncomment the type specs and make sure dialyzer passes.

@ansd (Member) commented Sep 5, 2023

Thank you @ikvmw. Overall this PR goes in the right direction. There is some more work to do to make it production-ready before merging into main. See my in-line comments and the following comments:

  1. Do we still need the runtime dependency on AMQP 0.9.1 client in the STOMP plugin? If not, please remove that dependency from the Makefile and Bazel file. That's what was done for Native MQTT as well.
  2. Native MQTT removes the separate heartbeat processes and instead uses Erlang timers in the connection process itself. Should we do the same for native STOMP? You could make the stomp_reader implement the ranch_protocol behaviour. This would further reduce the number of Erlang processes per STOMP connection.
  3. Field Connected At in the Management UI Connections Tab shows up as NaN:NaN:NaN. Needs to be fixed.
  4. In the Management UI, when clicking on the details for a STOMP connection, I can't see Reductions (in pane Runtime Metrics) with this PR, but I can see on main.
  5. In the Management UI, when clicking on the details for a STOMP connection, I can't Force Close the connection. However, it does work on main. Please fix and write a test case as done for Native MQTT in:

    %% Test that MQTT connection can be listed and closed via the rabbitmq_management plugin.
    management_plugin_connection(Config) ->
        KeepaliveSecs = 99,
        ClientId = atom_to_binary(?FUNCTION_NAME),
        Node = atom_to_binary(get_node_config(Config, 0, nodename)),
        C = connect(ClientId, Config, [{keepalive, KeepaliveSecs}]),
        eventually(?_assertEqual(1, length(http_get(Config, "/connections"))), 1000, 10),
        [#{client_properties := #{client_id := ClientId},
           timeout := KeepaliveSecs,
           node := Node,
           name := ConnectionName}] = http_get(Config, "/connections"),
        process_flag(trap_exit, true),
        http_delete(Config,
                    "/connections/" ++ binary_to_list(uri_string:quote(ConnectionName)),
                    ?NO_CONTENT),
        await_exit(C),
        ?assertEqual([], http_get(Config, "/connections")),
        eventually(?_assertEqual([], all_connection_pids(Config)), 500, 3).
  6. Handle force_event_refresh messages in the STOMP connection, s.t. the management plugin can be enabled dynamically, and write a test similar to the one for Native MQTT:

    management_plugin_enable(Config) ->
        ?assertEqual(0, length(http_get(Config, "/connections"))),
        ok = rabbit_ct_broker_helpers:disable_plugin(Config, 0, rabbitmq_management),
        ok = rabbit_ct_broker_helpers:disable_plugin(Config, 0, rabbitmq_management_agent),
        %% If the (web) MQTT connection is established **before** the management plugin
        %% is enabled, the management plugin should still list the (web) MQTT connection.
        C = connect(?FUNCTION_NAME, Config),
        ok = rabbit_ct_broker_helpers:enable_plugin(Config, 0, rabbitmq_management_agent),
        ok = rabbit_ct_broker_helpers:enable_plugin(Config, 0, rabbitmq_management),
        eventually(?_assertEqual(1, length(http_get(Config, "/connections"))), 1000, 10),
        ok = emqtt:disconnect(C).
  7. Enforce connection limits via rabbit_vhost_limit:is_over_connection_limit/1
  8. I added support for code coverage via Support code coverage #6394. Please check that your new code is tested. We don't need 100% code coverage, but we need more tests than this PR currently provides since this is production code. See my in-line comments.
  9. The STOMP spec mandates support for UTF-8. Unfortunately, this is not supported by this plugin. See:
    %% The STOMP spec mandates headers to be encoded as UTF-8, but unfortunately the RabbitMQ
    %% STOMP implementation (as of 3.13) does not adhere and therefore does not provide UTF-8 support.
    However, this is not related to this PR, so it can be addressed independently.
  10. Add function specs for your newly added exported functions.
  11. Support for rabbit_trace gets lost with this PR. Add it back and write a test. (See Native MQTT.)
  12. stomp_reader currently also blocks STOMP connections that are only consuming during a memory / disk alarm. This is bad, given that we want messages to be consumed and queues to empty so that the resource alarm can be cleared. Can be fixed in a follow-up PR.
  13. Creating quorum queues when sending a message seems to be broken:
SEND
destination:/queue/test
x-queue-type:quorum

hey^@

creates a quorum queue on main, but a classic queue with this PR.
14. Creating quorum queues when subscribing seems to be broken as well:

SUBSCRIBE
id:0
destination:/queue/foo
x-queue-type:quorum

^@

creates a quorum queue on main, but a classic queue with this PR.

There is even a test file called x_queue_type_quorum. Amend the test to check that a quorum queue really got created. Otherwise the test doesn't test what it is supposed to and runs green without exercising any quorum queues.

Same applies to streams.
15. Have you done any performance comparisons of main vs this PR? Our expectation is that there will be less resource usage and better performance for STOMP in terms of throughput and latency. But please show some numbers that demonstrate the impact of this PR.
16. We should add some form of consumer timeout. As it stands right now with this PR, if a STOMP consumer doesn't ack a message consumed from a quorum queue for a very long time, RabbitMQ will eventually run out of disk space.
17. Synchronous deletion of exclusive queues is broken with this PR because rabbit_queue_collector isn't used anymore.
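
Point 2 above (replacing the heartbeat processes with timers inside the connection process) could look roughly like the following sketch. This is not code from the PR; the record and message names are hypothetical, and it only illustrates the erlang:send_after/3 pattern Native MQTT uses:

```erlang
%% Hypothetical sketch (not from this PR): heartbeats driven by
%% erlang:send_after/3 inside the connection process itself, instead of
%% dedicated rabbit_heartbeat sender/receiver processes.
-record(hb_state, {socket                :: gen_tcp:socket(),
                   send_interval         :: pos_integer(),  %% milliseconds
                   recv_interval         :: pos_integer(),  %% milliseconds
                   data_received = false :: boolean()}).

start_heartbeat_timers(State = #hb_state{send_interval = SendI,
                                         recv_interval = RecvI}) ->
    erlang:send_after(SendI, self(), send_heartbeat),
    erlang:send_after(RecvI, self(), check_heartbeat),
    State.

%% In the connection process's handle_info/2:
handle_info(send_heartbeat, State = #hb_state{socket = Sock,
                                              send_interval = SendI}) ->
    ok = gen_tcp:send(Sock, <<$\n>>),  %% a STOMP heart-beat is a single EOL
    erlang:send_after(SendI, self(), send_heartbeat),
    {noreply, State};
handle_info(check_heartbeat, State = #hb_state{data_received = false}) ->
    %% nothing arrived since the last check: treat the peer as dead
    {stop, {shutdown, heartbeat_timeout}, State};
handle_info(check_heartbeat, State = #hb_state{recv_interval = RecvI}) ->
    erlang:send_after(RecvI, self(), check_heartbeat),
    {noreply, State#hb_state{data_received = false}}.
```

With this shape, the reader sets data_received = true whenever bytes arrive on the socket, and no extra heartbeat processes are spawned per connection.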

@@ -4,6 +4,7 @@ load("@rules_erlang//:dialyze.bzl", "dialyze", "plt")
load(
"//:rabbitmq.bzl",
"BROKER_VERSION_REQUIREMENTS_ANY",
"ENABLE_FEATURE_MAYBE_EXPR",
Member: We will remove enabling the maybe feature at runtime for 3.13, as 3.13 requires OTP 26. However, it's needed right now.

@@ -28,6 +28,7 @@
start(normal, []) ->
Config = parse_configuration(),
Listeners = parse_listener_configuration(),
rabbit_global_counters:init([{protocol, stomp}]),
Member: I'd vote to differentiate by STOMP version, i.e. 1.0, 1.1, 1.2, since that's what we currently do for MQTT (3.1, 3.1.1, 5.0).

Member: Where are the global counters per protocol and queue type initialised? Are they missing?
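
If counters were split per version as suggested, the init call shown in the diff could simply be repeated per version. A sketch, using the rabbit_global_counters:init/1 signature from the diff; the stomp10/stomp11/stomp12 atoms are hypothetical, mirroring Native MQTT's per-version protocol atoms:

```erlang
%% Hypothetical sketch: one set of global counters per supported STOMP
%% version, instead of a single {protocol, stomp} set.
init_global_counters() ->
    [rabbit_global_counters:init([{protocol, Proto}])
     || Proto <- [stomp10, stomp11, stomp12]],
    ok.
```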

@@ -140,21 +164,32 @@ initial_state(Configuration,
%% to override this value?
Member: Delete the adapter info. It's not needed anymore.

Member: The management UI shows that there is 1 AMQP 0.9.1 channel for the STOMP connection, which is wrong.

Comment on lines 957 to 1069
    SMsgSeqNos = lists:usort(MsgSeqNos),
    UnconfirmedCutoff = case rabbit_confirms:is_empty(UC) of
                            true  -> lists:last(SMsgSeqNos) + 1;
                            false -> rabbit_confirms:smallest(UC)
                        end,
    Cutoff = lists:min([UnconfirmedCutoff | NegativeMsgSeqNos]),
    {Ms, Ss} = lists:splitwith(fun(X) -> X < Cutoff end, SMsgSeqNos),
    State1 = case Ms of
                 [] -> State;
                 _  -> MkMsgFun(lists:last(Ms), true, State)
             end,
    lists:foldl(fun(SeqNo, StateN) ->
                        MkMsgFun(SeqNo, false, StateN)
                end, State1, Ss).
Member: Where is this code tested? If it's not tested, please add tests.

Comment on lines 1679 to 1688
    {eol, Actions} ->
        State1 = handle_queue_actions(Actions, State),
        State2 = handle_consuming_queue_down_or_eol(QRef, State1),
        {ConfirmMXs, UC1} =
            rabbit_confirms:remove_queue(QRef, State1#proc_state.unconfirmed),
        %% Deleted queue is a special case.
        %% Do not nack the "rejected" messages.
        State3 = record_confirms(ConfirmMXs,
                                 State2#proc_state{unconfirmed = UC1}),
        {ok, State3#proc_state{queue_states = rabbit_queue_type:remove(QRef, QStates0)}};
Member: Where is this code tested? If it's not tested, please add tests.

Comment on lines 1702 to 1787
    ({rejected, _QRef, MsgSeqNos}, S0) ->
        {U, Rej} =
            lists:foldr(
              fun(SeqNo, {U1, Acc}) ->
                      case rabbit_confirms:reject(SeqNo, U1) of
                          {ok, MX, U2} ->
                              {U2, [MX | Acc]};
                          {error, not_found} ->
                              {U1, Acc}
                      end
              end, {S0#proc_state.unconfirmed, []}, MsgSeqNos),
        S = S0#proc_state{unconfirmed = U},
        %% Don't send anything, no nacks in STOMP
        record_rejects(Rej, S);
    ({queue_down, QRef}, S0) ->
        handle_consuming_queue_down_or_eol(QRef, S0);
    %% TODO: I have no idea about the scope of credit_flow
    ({block, QName}, S0) ->
        credit_flow:block(QName),
        S0;
    ({unblock, QName}, S0) ->
        credit_flow:unblock(QName),
        S0;
    %% TODO: rabbit_channel has code for handling
    %% send_drained and send_credit_reply.
    %% Catch-all here so we don't crash?
    (_, S0) ->
        S0
Member: Where is this code tested? If it's not tested, please add tests.

Comment on lines +1734 to +1795
parse_endpoint(undefined) ->
parse_endpoint("/queue");
parse_endpoint(Destination) when is_binary(Destination) ->
parse_endpoint(unicode:characters_to_list(Destination));
Member: Where is this code tested? If it's not tested, please add tests.

Contributor Author: parse_endpoint is called on each SEND and SUBSCRIBE, so there should be many dozens of hits across our test suites.

Member: I observe 0 hits for these 2 clauses ☹️ (coverage screenshot, 2023-09-06 at 08:52:17)

Comment on lines 2010 to 2016
    rabbit_core_metrics:channel_queue_down({self(), QName}),
    erase({queue_stats, QName}),
    [begin
         rabbit_core_metrics:channel_queue_exchange_down({self(), QX}),
         erase({queue_exchange_stats, QX})
     end || {{queue_exchange_stats, QX = {QName0, _}}, _} <- get(),
            QName0 =:= QName].
Member: Why are these stats erased? It looks like they aren't written in the first place?


maybe
{ok, User} ?= rabbit_access_control:check_user_login(Username, AuthProps),
{ok, AuthzCtx} ?= check_vhost_access(VHost, User, Addr),
Member: Should we additionally check whether the vhost exists and whether the vhost is alive? (See Native MQTT.)

end,

maybe
{ok, User} ?= rabbit_access_control:check_user_login(Username, AuthProps),
Member: rabbit_core_metrics:auth_attempt_failed() is missing if user login is refused.

Comment on lines 41 to 54
-record(subscription, {dest_hdr, ack_mode, multi_ack, description, queue_name}).

-record(pending_ack, {
%% delivery identifier used by clients
%% to acknowledge and reject deliveries
delivery_tag,
%% consumer tag
tag,
delivered_at,
%% queue name
queue,
%% message ID used by queue and message store implementations
msg_id
}).
Member: Add type specs.
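
For the #pending_ack{} record above, adding specs might look like the following sketch. The field types here are plausible guesses, not taken from the PR:

```erlang
%% Sketch with guessed field types; verify against the actual values
%% stored by the STOMP processor before adopting.
-record(pending_ack, {
          %% delivery identifier used by clients
          %% to acknowledge and reject deliveries
          delivery_tag :: non_neg_integer(),
          %% consumer tag
          tag :: binary(),
          %% e.g. erlang:monotonic_time(millisecond)
          delivered_at :: integer(),
          %% queue name (a #resource{} record)
          queue :: rabbit_amqqueue:name(),
          %% message ID used by queue and message store implementations
          msg_id :: term()
         }).
```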

Comment on lines +1091 to +1194
    Delivery = #'basic.deliver'{consumer_tag = ConsumerTag,
                                delivery_tag = DeliveryTag,
                                redelivered = Redelivered,
                                exchange = ExchangeNameBin,
                                routing_key = RoutingKey},
Member: Creating an AMQP 0.9.1 #'basic.deliver'{} is unnecessary. The STOMP protocol, not AMQP 0.9.1, is used to send the message to the client. I think you should restructure the code; specifically, rabbit_stomp_util:headers_extra/4 should not accept a #'basic.deliver'{} anymore.

end;
{ExchangeNameList, RoutingKeyList} = parse_routing(Destination, DfltTopicEx),
%% io:format("Parse_routing: ~p~n", [{ExchangeNameList, RoutingKeyList}]),
RoutingKey = list_to_binary(RoutingKeyList),
Member: The STOMP plugin operates too much on lists instead of binaries, which is inefficient. At some point we should improve this; however, that should be a follow-up PR.


-record(subscription, {dest_hdr, ack_mode, multi_ack, description}).
-record(proc_state, {session_id, subscriptions,
version, start_heartbeat_fun, pending_receipts,
Member: My editor shows that start_heartbeat_fun is an unused field. Where is this field used? Remove any unused fields.

queue,
%% message ID used by queue and message store implementations
msg_id
}).

-define(FLUSH_TIMEOUT, 60000).
Member: Remove the unused macro.

@@ -20,10 +20,6 @@
-include("rabbit_stomp_frame.hrl").
-include_lib("amqp_client/include/amqp_client.hrl").
Member: Remove.

{error, queue_limit_exceeded}
end.

routing_init_state() -> sets:new().
Member: Use v2.

Contributor Author: What is v2?

Member: https://www.erlang.org/doc/man/sets.html#description

Erlang/OTP 24.0 introduced a new internal representation for sets which is more performant. Developers can use this new representation by passing the {version, 2} flag to new/1 and from_list/2, such as sets:new([{version, 2}]). This new representation will become the default in future Erlang/OTP versions. Functions that work on two sets, such as union/2 and similar, will work with sets of different versions. In such cases, there is no guarantee about the version of the returned set. Explicit conversion from the old version to the new one can be done with sets:from_list(sets:to_list(Old), [{version, 2}]).
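
Applied to the routing_init_state/0 function under review, switching to the v2 representation is a one-line change:

```erlang
%% Use the faster OTP 24+ internal set representation.
routing_init_state() -> sets:new([{version, 2}]).
```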

#'queue.delete'{queue = list_to_binary(QName),
nowait = false}),
QRes = rabbit_misc:r(VHost, queue, list_to_binary(QName)),
io:format("Durable QRes: ~p~n", [QRes]),
Member: Delete.

case rabbit_amqqueue:with(
QRes,
fun (Q) ->
io:format("Delete queue ~p~n", [rabbit_queue_type:delete(Q, false, false, Username)])
Member: Use the logger if you want to log something.

Member: Missing access control check for queue deletion.

    Binding = #binding{source = rabbit_misc:r(VHost, exchange, list_to_binary(Exchange)),
                       destination = QName,
                       key = list_to_binary(RoutingKey)},
    case rabbit_binding:add(Binding, Username) of
Member: Missing access control checks.

credit_flow:block(QName),
S0;
({unblock, QName}, S0) ->
credit_flow:unblock(QName),
Member: It probably makes more sense to do what was done for Native MQTT, i.e. not use credit_flow.

Member: Yes, not using credit flow has worked well so far for MQTT, so let's just drop it.

Contributor Author: That's great! Flow was on my "mysteries" list.


Destination = binary_to_list(Queue),
Spec = #{no_ack => true,
prefetch_count => application:get_env(rabbit, default_consumer_prefetch),
Member: Wrong prefetch count. It leads to crashes later on. For example:

SEND
destination:/queue/reply-test
reply-to:/temp-queue/foo

Hello World!^@

then click on the reply-to-queue in the Management UI. You get a 500 error and RabbitMQ logs:

2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>   crasher:
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>     initial call: cowboy_stream_h:request_process/3
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>     pid: <0.1278.0>
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>     registered_name: []
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>     exception error: no function clause matching
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>                      thoas_encode:value({ok,{false,0}},
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>                                         #Fun<thoas_encode.0.30747453>) (src/thoas_encode.erl, line 1720)
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>       in function  thoas_encode:map_naive_loop/2 (src/thoas_encode.erl, line 1712)
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>       in call from thoas_encode:map_naive_loop/2 (src/thoas_encode.erl, line 1713)
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>       in call from thoas_encode:map_naive/2 (src/thoas_encode.erl, line 1704)
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>       in call from thoas_encode:list/2 (src/thoas_encode.erl, line 1692)
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>       in call from thoas_encode:map_naive/2 (src/thoas_encode.erl, line 1703)
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>       in call from thoas:encode/2 (src/thoas.erl, line 92)
2023-09-06 14:10:00.942131+00:00 [error] <0.1278.0>       in call from rabbit_mgmt_util:reply0/3 (rabbit_mgmt_util.erl, line 252)
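
The {ok,{false,0}} term in the stack trace suggests the {ok, Value} tuple returned by application:get_env/2 was used directly as the prefetch count. A sketch of the likely shape of the fix; the consume_spec/0 wrapper is hypothetical:

```erlang
%% Hypothetical sketch: application:get_env/2 returns {ok, Value}; per the
%% crash above the value itself is a {GlobalFlag, Count} tuple, so the
%% count is extracted before it goes into the consume spec.
consume_spec() ->
    {ok, {_Global, Prefetch}} =
        application:get_env(rabbit, default_consumer_prefetch),
    #{no_ack => true,
      prefetch_count => Prefetch}.
```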

websocket_info(connection_created, State) ->
Infos = infos(?INFO_ITEMS ++ ?OTHER_METRICS, State),

?LOG_INFO("Connection created infos ~p", [Infos]),
Member: Logging the connection infos at info level for every Web STOMP connection pollutes the logs too much.

@@ -233,21 +204,37 @@ websocket_info({start_heartbeats, {SendTimeout, ReceiveTimeout}},
ReceiveFun = fun() -> Self ! client_timeout end,
Heartbeat = rabbit_heartbeat:start(SupPid, Sock, SendTimeout,
SendFun, ReceiveTimeout, ReceiveFun),
{ok, State#state{heartbeat = Heartbeat}};
{ok, State#state{heartbeat = Heartbeat,
timeout_sec = {0, 0}}};
Member: {0, 0} is wrong?

%% a map of queue names to consumer tag lists
queue_consumers, unacked_message_q, vhost,
user, queue_states, delivery_tag = 0, msg_seq_no, delivery_flow,
default_topic_exchange, default_nack_requeue}).
Member: Let's put rarely changing fields into their own sub-record for memory efficiency, e.g. as done with the cfg field in -record(ch, {cfg :: #conf{}, ...}).


Message0 = mc_amqpl:message(ExchangeName, RoutingKey, Content0),
Member: I think it would be great to store STOMP messages directly as part of this PR and to implement an mc_stomp module that does the translations. It's unnecessary to convert from / to the AMQP 0.9.1 message format. But it can also be done as a separate PR.

@mkuratczyk (Contributor) commented:

Using a simple test app, it seems like publishing is significantly slower than on main.

main:
> time ./stomp-go -count 100000 -publishOnly -queue foo
sender finished

________________________________________________________
Executed in    2.11 secs      fish           external

This branch:

> time ./stomp-go -count 100000 -publishOnly -queue foo
sender finished

________________________________________________________
Executed in   22.91 secs      fish           external

(note there's a 1 second sleep in the app, to avoid premature app termination, before all messages were sent)

@mkuratczyk (Contributor) commented:

Messages published over STOMP can't be consumed over AMQP 0.9.1. Using a simple test app for publishing and then perf-test for consuming:

# publish 1 message
./stomp-go -count 1 -publishOnly -queue stomp-to-amqp
# attempt to consume the message
perf-test -x 0 -ad false -f persistent -u stomp-to-amqp

logs:

   crasher:
     initial call: rabbit_writer:enter_mainloop/2
     pid: <0.1334.0>
     registered_name: []
     exception error: bad argument
       in function  size/1
          called as size([<<"Message #1">>])
          *** argument 1: not tuple or binary
       in call from rabbit_binary_generator:build_content_frames/7 (rabbit_binary_generator.erl, line 89)
       in call from rabbit_binary_generator:build_simple_content_frames/4 (rabbit_binary_generator.erl, line 61)
       in call from rabbit_writer:assemble_frames/5 (rabbit_writer.erl, line 334)
       in call from rabbit_writer:internal_send_command_async/3 (rabbit_writer.erl, line 365)
       in call from rabbit_writer:handle_message/2 (rabbit_writer.erl, line 265)
       in call from rabbit_writer:handle_message/3 (rabbit_writer.erl, line 232)
       in call from rabbit_writer:mainloop1/2 (rabbit_writer.erl, line 223)
     ancestors: [<0.1333.0>,<0.1323.0>,<0.1318.0>,<0.1317.0>,<0.1137.0>,
                   <0.1136.0>,<0.1135.0>,<0.1133.0>,<0.1132.0>,rabbit_sup,
                   <0.243.0>]
     message_queue_len: 0
     messages: []
     links: [<0.1333.0>]
     dictionary: [{process_name,
                       {rabbit_writer,{<<"[::1]:57240 -> [::1]:5672">>,1}}}]
     trap_exit: false
     status: running
     heap_size: 610
     stack_size: 28
     reductions: 419
   neighbours:
 
     supervisor: {<0.1333.0>,rabbit_channel_sup}
     errorContext: child_terminated
     reason: {badarg,
                 [{erlang,size,
                      [[<<"Message #1">>]],
                      [{error_info,#{module => erl_erts_errors}}]},
                  {rabbit_binary_generator,build_content_frames,7,
                      [{file,"rabbit_binary_generator.erl"},{line,89}]},
                  {rabbit_binary_generator,build_simple_content_frames,4,
                      [{file,"rabbit_binary_generator.erl"},{line,61}]},
                  {rabbit_writer,assemble_frames,5,
                      [{file,"rabbit_writer.erl"},{line,334}]},
                  {rabbit_writer,internal_send_command_async,3,
                      [{file,"rabbit_writer.erl"},{line,365}]},
                  {rabbit_writer,handle_message,2,
                      [{file,"rabbit_writer.erl"},{line,265}]},
                  {rabbit_writer,handle_message,3,
                      [{file,"rabbit_writer.erl"},{line,232}]},
                  {rabbit_writer,mainloop1,2,
                      [{file,"rabbit_writer.erl"},{line,223}]}]}
     offender: [{pid,<0.1334.0>},
                {id,writer},
                {mfargs,{rabbit_writer,start_link,
                                       [#Port<0.168>,1,131072,
                                        rabbit_framing_amqp_0_9_1,<0.1319.0>,
                                        {<<"[::1]:57240 -> [::1]:5672">>,1},
                                        true]}},
                {restart_type,transient},
                {significant,true},
                {shutdown,70000},
                {child_type,worker}]
 
     supervisor: {<0.1333.0>,rabbit_channel_sup}
     errorContext: shutdown
     reason: reached_max_restart_intensity
     offender: [{pid,<0.1334.0>},
                {id,writer},
                {mfargs,{rabbit_writer,start_link,
                                       [#Port<0.168>,1,131072,
                                        rabbit_framing_amqp_0_9_1,<0.1319.0>,
                                        {<<"[::1]:57240 -> [::1]:5672">>,1},
                                        true]}},
                {restart_type,transient},
                {significant,true},
                {shutdown,70000},
                {child_type,worker}]

@mkuratczyk (Contributor) commented:

There's a heartbeat-related crash when publishing more than a couple of messages.

./stomp-go -count 10000 -publishOnly

leads to

   crasher:
     initial call: rabbit_heartbeat:'-heartbeater/2-fun-0-'/0
     pid: <0.2263.0>
     registered_name: []
     exception exit: {unexpected_message,resume}
       in function  rabbit_heartbeat:heartbeater/3 (rabbit_heartbeat.erl, line 138)
     ancestors: [<0.2260.0>,<0.2259.0>,<0.910.0>,<0.909.0>,<0.908.0>,
                   <0.906.0>,<0.905.0>,rabbit_stomp_sup,<0.903.0>]
     message_queue_len: 0
     messages: []
     links: [<0.2260.0>]
     dictionary: [{process_name,{heartbeat_receiver,unknown}}]
     trap_exit: false
     status: running
     heap_size: 376
     stack_size: 28
     reductions: 138
   neighbours:
 
     supervisor: {<0.2260.0>,rabbit_connection_helper_sup}
     errorContext: child_terminated
     reason: {unexpected_message,resume}
     offender: [{pid,<0.2263.0>},
                {id,heartbeat_receiver},
                {mfargs,
                    {rabbit_heartbeat,start_heartbeat_receiver,
                        [#Port<0.461>,60,#Fun<rabbit_stomp_reader.2.4391843>,
                         {heartbeat_receiver,unknown}]}},
                {restart_type,transient},
                {significant,false},
                {shutdown,300000},
                {child_type,worker}]

If I change 10000 to 10, there's no stacktrace.

NOTE: the same happens on main, so feel free to treat it as a separate issue.

@mkuratczyk (Contributor) commented:

Some performance observations at this point.

100 publishers -> 100 queues -> 100 consumers (note: there's no flow control here):
./stomp-go -count 100000 -queue stomp-to-stomp -publisherCount 100 -consumerCount 100 -separateQueues
main: 75s
PR: 47s

Memory usage with 1000 consumers connected (./stomp-go -consumeOnly -consumerCount 1000):
main: 600MB (~17500 Erlang processes)
PR: 390MB (~5500 Erlang processes)

However, single connection/queue consumption performance seems degraded.
Publish with ./stomp-go -count 1000000 -publishOnly -queue one -timestampBody takes 10-11s for both main and this PR. However, consuming this backlog with ./stomp-go -count 1000000 -consumeOnly -queue one -timestampBody takes 23-24s on main and 30s with this PR.

@ansd ansd added this to the 4.0.0 milestone Dec 4, 2023
@MirahImage MirahImage removed this from the 4.0.0 milestone Feb 29, 2024
6 participants