The future of data collection #1944

quaquel · 2024-01-07T19:52:47Z

quaquel
Jan 7, 2024
Maintainer

There has been quite some discussion in various places about changing data collection. This is my attempt to think this through in some more detail. It is heavily inspired by a suggestion by @Corvince at some point.

In the general case, data collection is taking an object and extracting from this one or more attributes, and optionally applying a callable to this. It might involve only an object and a callable applied to this object in specific cases. This object can be the model, an agent, an agentset, a space, or some user-defined class.

So, it seems sensible to create a separate Collector class that implements this basic logic. Because the behavior of AgentSet is a bit different from other objects (i.e., AgentSet.get instead of relying on getattr), I believe it makes sense to have 2 Collector classes: BaseCollector and AgentSetCollector (PEP 20, flat is better than nested). Rather than burden the user with this distinction, it is possible to use a factory function (e.g., collect(obj, attrs, func=None)) to create the appropriate Collector instance.
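A minimal sketch of the two collector classes and the factory function described above. `FakeAgentSet` is a hypothetical stand-in for mesa's AgentSet so that the example is self-contained, and the method names (`_extract`, `collect`) are assumptions rather than an agreed design.

```python
class FakeAgentSet:
    """Stand-in for mesa's AgentSet, only here to keep the sketch self-contained."""

    def __init__(self, agents):
        self._agents = list(agents)

    def get(self, attr_name):
        return [getattr(agent, attr_name) for agent in self._agents]


class BaseCollector:
    """Collects attributes from a plain object via getattr."""

    def __init__(self, obj, attrs=None, func=None):
        self.obj = obj
        # Normalise a single attribute name to a list of names.
        self.attrs = [attrs] if isinstance(attrs, str) else list(attrs or [])
        self.func = func
        self.data = []

    def _extract(self):
        if not self.attrs:
            return self.obj  # no attributes: the callable is applied to the object itself
        values = [getattr(self.obj, name) for name in self.attrs]
        return values[0] if len(values) == 1 else values

    def collect(self):
        value = self._extract()
        if self.func is not None:
            value = self.func(value)
        self.data.append(value)
        return value


class AgentSetCollector(BaseCollector):
    """Same logic, but attributes are retrieved through AgentSet.get."""

    def _extract(self):
        if not self.attrs:
            return self.obj
        values = [self.obj.get(name) for name in self.attrs]
        return values[0] if len(values) == 1 else values


def collect(obj, attrs=None, func=None):
    """Factory: pick the collector class based on the type of obj."""
    cls = AgentSetCollector if isinstance(obj, FakeAgentSet) else BaseCollector
    return cls(obj, attrs, func)


class Agent:
    def __init__(self, wealth):
        self.wealth = wealth


agents = FakeAgentSet([Agent(1), Agent(2), Agent(3)])
total_wealth = collect(agents, "wealth", func=sum)
total_wealth.collect()  # appends 6 to total_wealth.data
```

The user only ever calls `collect(...)`; which class is instantiated stays an internal detail.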

Ideally, data should only be extracted once. So, in the case of the Boltzmann wealth model, the data collector should be smart enough to extract the wealth attribute only once from the agentset. This can relatively easily be realized by maintaining an internal mapping of all objects and the attributes to be retrieved from them. Moreover, it might be possible to extract all relevant attributes from a given object in one go, avoiding unnecessary iteration. This would, however, require a minor update to AgentSet.get so that attr_name takes a string or list of strings.

I believe it is possible to design and implement this new-style DataCollector so that the current one can be implemented on top of it for backward compatibility.

Like with the current DataCollector, data collection should happen whenever data_collector.collect is called. However, I believe it is paramount that the data collector also always extracts the current simulation time. Only by having the simulation time for each call to collect can you produce a clean and complete time series of the dynamics of the model over time. In fact, these time stamps could become part of the index/column labels of the DataFrames when turning the retrieved data into a DataFrame.
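To illustrate why the time stamp matters, here is a small sketch of a collector that stores the simulation time next to each value. The attribute name `time`, the `ToyModel` class, and the `to_records` method are assumptions made for this example; they are not part of the current DataCollector.

```python
class TimedCollector:
    """Stores (time, value) pairs so irregular collect calls stay interpretable."""

    def __init__(self, model, attr, time_attr="time"):
        self.model = model
        self.attr = attr
        self.time_attr = time_attr  # assumed name of the model's clock attribute
        self.rows = []

    def collect(self):
        now = getattr(self.model, self.time_attr)
        self.rows.append((now, getattr(self.model, self.attr)))

    def to_records(self):
        # A list of dicts can be handed to pandas.DataFrame(...).set_index("time")
        return [{"time": t, self.attr: v} for t, v in self.rows]


class ToyModel:
    def __init__(self):
        self.time = 0.0
        self.total_wealth = 100

    def advance(self, dt):
        self.time += dt
        self.total_wealth += 1


model = ToyModel()
collector = TimedCollector(model, "total_wealth")
for dt in (1.0, 0.5, 2.5):  # uneven intervals between collect calls
    model.advance(dt)
    collector.collect()
records = collector.to_records()
```

Without the stored time, the three rows would look evenly spaced even though the collect calls happened at t = 1.0, 1.5, and 4.0.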

Like with the current DataCollector, it should be easy to turn any retrieved data into a DataFrame. This can easily be done through a to_dataframe method on the Collector class.

So, what could the resulting API look like?

# Boltzmann wealth model
datacollector = DataCollector(model, collectors={
	"gini": collect(model.agents, "wealth", func=calculate_gini),
	"wealth": collect(model.agents, "wealth"),
	})

# attribute like access to each collected datafield
datacollector.gini.to_dataframe()
datacollector.wealth.to_dataframe()

# for the Epstein civil violence example
datacollector = DataCollector(model, collectors={
	"n_quiescent": collect(model.get_agents_of_type(Citizen), "condition",
		func=lambda x: sum(1 for entry in x if entry == 'Quiescent')),
	"n_active": collect(model.get_agents_of_type(Citizen),
		func=lambda agentset: len(agentset.select(lambda agent: agent.condition == 'Active'))), # apply a callable to the object directly
	"n_jailed": collect(model.get_agents_of_type(Citizen), "jail_sentence",
		func=lambda x: sum(1 for entry in x if entry > 0)),
	"pos": collect(model.agents, ["x", "y"]), # retrieve multiple attributes
	})
datacollector.add_collector("n_cops", collect(model, func=lambda model: len(model.get_agents_of_type(Cop))))

So, does the basic idea of object, retrieval of one or more attributes, and/or applying a callable make sense? Have I missed a key concern? Is there something obviously wrong or missing in the sketch of the API?

rht · 2024-01-08T09:01:23Z

rht
Jan 8, 2024

Here is an illustration of the API overlap with AgentSet

"gini": collect(model.agents, "wealth", func=calculate_gini)

with AgentSet

"gini": lambda model: calculate_gini(model.agents.get("wealth"))

"n_quiescent": collect(model.get_agents_of_type(Citizen), "condition", func=lambda x: sum(1 for entry in x if entry == 'Quiescent'))

with AgentSet

"n_quiescent": lambda model: len(model.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"))

9 replies

rht Jan 8, 2024

I am less sure about the tiered API. Flat is better than nested.

I think this may refer specifically to nested conditionals. A chain of method calls, in contrast, is closer to the UNIX pipe paradigm, as found in Bash pipes, Golang interfaces, and DataFrame method chaining.

rht Jan 8, 2024

DataCollector(model, collectors={"wealth": collect(model.agents, "wealth"),
                                 "gini": collect("wealth", func=calculate_gini)})

Has a problem in that "wealth"'s definition depends on the order of evaluation of the dictionary, where "wealth" has to be evaluated first before "gini". A Python dictionary is unordered, and just happens to preserve key-value insertion order.

quaquel Jan 8, 2024
Maintainer Author

Has a problem in that "wealth"'s definition depends on the order of evaluation of the dictionary, where "wealth" has to be evaluated first before "gini". A Python dictionary is unordered, and just happens to preserve key-value insertion order.

Exactly, but that can be left to the internal logic of the DataCollector class to sort out. For example, you build up an internal defaultdict(list) with the object from which to retrieve attributes as keys and a list of attributes to retrieve as values. You first execute this. Next, you could have a second internal dict that maps each object+attribute pair back to the associated Collector (there are other ways as well, but those would require some smartness in the Collector). This would not require any specification by the user because DataCollector is smart enough to figure this out itself. I hope this is clear; otherwise, I am happy to code up a quick example of what I am trying to say.
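A rough sketch of the two internal mappings described above. `Spec` is a hypothetical stand-in for a Collector, and keying on `id(obj)` is an assumption made so that unhashable objects also work; none of this is taken from an actual implementation.

```python
from collections import defaultdict


class Spec:
    """Hypothetical stand-in for a Collector: object, attribute name, optional callable."""

    def __init__(self, obj, attr, func=None):
        self.obj, self.attr, self.func = obj, attr, func


class DataCollector:
    def __init__(self, collectors):
        self.collectors = collectors
        self.data = {name: [] for name in collectors}
        self._objects = {}                   # id(obj) -> obj
        self._attrs = defaultdict(list)      # id(obj) -> attribute names to retrieve
        self._consumers = defaultdict(list)  # (id(obj), attr) -> collector names
        for name, spec in collectors.items():
            key = id(spec.obj)
            self._objects[key] = spec.obj
            if spec.attr not in self._attrs[key]:
                self._attrs[key].append(spec.attr)
            self._consumers[(key, spec.attr)].append(name)

    def collect(self):
        for key, attrs in self._attrs.items():
            obj = self._objects[key]
            for attr in attrs:
                raw = getattr(obj, attr)  # each (object, attribute) pair is read once
                for name in self._consumers[(key, attr)]:
                    func = self.collectors[name].func
                    self.data[name].append(raw if func is None else func(raw))


class Population:
    """Toy object that counts how often its wealth attribute is read."""

    def __init__(self, wealth):
        self._wealth = wealth
        self.reads = 0

    @property
    def wealth(self):
        self.reads += 1
        return list(self._wealth)


population = Population([1, 2, 3])
datacollector = DataCollector({
    "total": Spec(population, "wealth", func=sum),  # listed first on purpose
    "wealth": Spec(population, "wealth"),
})
datacollector.collect()
```

Because both entries share the same object and attribute, `population.reads` is 1 after the call, and the result does not depend on which entry comes first in the dictionary.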

rht Jan 8, 2024

Even if it can work under the hood, even if the behavior can be documented, I still find it problematic to deviate from the known behavior of Python dict, with the dict elements becoming ordered, and the next elements that may depend on the previous element being defined first. The dict elements should have all been defined simultaneously.

quaquel Jan 9, 2024
Maintainer Author

I am not sure what you are trying to say. I was trying to describe the internal logic of the DataCollector class and how it is pretty easy to reduce the number of times specific data is retrieved.

It seems that you are saying that the behavior of a class (DataCollector in this case) should be determined by the arguments used to instantiate it. This to me, however, is nonsensical. Is Agent an int just because one of the arguments (i.e., unique_id) is an int?

Corvince · 2024-01-09T07:40:56Z

Corvince
Jan 9, 2024
Maintainer

A Python dictionary is unordered, and just happens to preserve key-value insertion order.

Just for reference, this information is outdated. Python dictionaries used to be unordered. In Python 3.6, insertion order became an implementation detail of CPython (the reference implementation of Python). But since Python 3.7, insertion order is guaranteed, so it is perfectly fine to rely on it.
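A short illustration of both halves of this point on Python 3.7 or later: iteration follows insertion order, while equality between two dictionaries ignores it.

```python
first = {"wealth": 1, "gini": 2}
second = {"gini": 2, "wealth": 1}

# Iteration follows insertion order (guaranteed since Python 3.7).
order_first = list(first)    # ["wealth", "gini"]
order_second = list(second)  # ["gini", "wealth"]

# Equality compares keys and values only, not their order.
same_content = first == second  # True
```

So the language guarantees the order, but two dictionaries with the same items in a different order still compare as equal.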

That said, the mental model for dictionaries is still set-oriented (which I think is the right model). So I agree that it would be confusing if this works

DataCollector(model, collectors={"wealth": collect(model.agents, "wealth"),
                                 "gini": collect("wealth", func=calculate_gini)})

but this doesn't

DataCollector(model, collectors={"gini": collect("wealth", func=calculate_gini),
                                 "wealth": collect(model.agents, "wealth")})

So we still would have to work around this problem internally which complicates the code.

But I don't think we need tiered data collectors at all. I think they are a bit hard to understand and provide little benefit. At least as I understand it, they are basically a performance optimization, so you don't need to loop over all agents more than once. For small to medium models, I don't think it's a problem at all. For larger models, or if you really do lots of simulation runs, yes, it can matter. But then a better solution would be to calculate your derived variables afterwards anyway. That is, you just collect the wealth attribute, turn your data collection into a pandas dataframe, and calculate the gini coefficient from the dataframe. That is probably even faster, because pandas can vectorize the calculations across all rows.

This way you also don't mix any logic into your data collector. I think it is actually bad practice to calculate things in the data collector. It should basically be an observer. If you have the gini coefficient in your model definition, feel free to collect it. Otherwise, calculate it as part of the data analysis. So for me, the callable should only be used to filter your objects (e.g., only a certain type, or based on a condition).
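A sketch of the "collect raw, derive later" workflow suggested here. It uses plain Python instead of pandas to stay dependency-free; the `gini` helper and the made-up `wealth_by_step` data are assumptions for illustration, and the same calculation could be applied per step to a DataFrame.

```python
def gini(values):
    """Gini coefficient from the sorted-rank formula; 0.0 for empty or all-zero input."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Raw data as a collector might have stored it: one list of wealth values per step.
wealth_by_step = {
    0: [1, 1, 1, 1],  # perfectly equal
    1: [0, 0, 0, 4],  # one agent holds everything
}

# Derived variable computed during analysis, not during the run.
gini_by_step = {step: gini(wealth) for step, wealth in wealth_by_step.items()}
```

The model run only stores `wealth`; the Gini coefficient is computed afterwards, so the data collector stays a pure observer.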

3 replies

quaquel Jan 9, 2024
Maintainer Author

So we still would have to work around this problem internally which complicates the code.

I haven't tried a quick implementation, but it can be done in just five lines of code, so it is not that complicated from a code base point of view. The understandability of the API to the user is a more significant concern.

This way, you don't mix any logic into your data collector. I think it is actually bad practice to calculate things in the data collector. It should basically be an observer.

I agree with this in principle. However, from practical experience, I know it can be convenient to support it (for example, to reduce the amount of data that needs to be stored, not needing to touch the source code itself if you want to quickly track something on the fly). It also would be a backward incompatible change.

rht Jan 9, 2024

I think users shouldn't be imposed on whether they should or should not process the raw data they have collected. With a small change in the API

(
    model.datacollector.collect({"wealth": lambda model: model.agents.get("wealth")})
    .process({"gini": lambda data: calculate_gini(data["wealth"])})
    .process({"wealth**gini": lambda data: data["wealth"] ** data["gini"]})
    .store(["gini", "wealth**gini"])  # optional, if the user needs to store only a subset of the data collected
)

This is no longer a tiered data collection, but rather a series of processing over the first collection, and so is easier to explain.
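One way such a chain could be backed is sketched below. The class name `CollectionPipeline` and its internals are hypothetical; the only idea taken from the proposal is that `collect`, `process`, and `store` each return the object itself so the calls can be chained.

```python
from types import SimpleNamespace


class CollectionPipeline:
    def __init__(self, model):
        self.model = model
        self.data = {}      # everything collected or derived in this call chain
        self.stored = None  # subset kept after store(); None means nothing selected yet

    def collect(self, collectors):
        # Each callable receives the model and returns raw data.
        for name, func in collectors.items():
            self.data[name] = func(self.model)
        return self

    def process(self, processors):
        # Each callable receives everything gathered so far.
        for name, func in processors.items():
            self.data[name] = func(self.data)
        return self

    def store(self, names):
        self.stored = {name: self.data[name] for name in names}
        return self


model = SimpleNamespace(wealth=[1, 2, 3, 4])  # toy model with a list of wealth values
result = (
    CollectionPipeline(model)
    .collect({"wealth": lambda m: list(m.wealth)})
    .process({"total": lambda data: sum(data["wealth"])})
    .process({"mean": lambda data: data["total"] / len(data["wealth"])})
    .store(["total", "mean"])
)
```

Each `process` step sees the results of the steps before it because the calls run in sequence, which is what makes the ordering explicit rather than implied by dictionary order.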

quaquel Jan 9, 2024
Maintainer Author

An API along these lines also occurred to me. What I like is that it is declarative and clear. The question for me is whether the user should specify what to collect and process when initializing the DataCollector, in which case this is executed whenever datacollector.collect() is called. This is how the current DataCollector is designed. Or, does the user give the full specification of what to collect and process whenever datacollector.collect() is called, as in the example given by @rht?

Or do we want to be even more radical and fully rerig datacollection on top of the observer design pattern?

rht · 2024-02-05T00:52:23Z

rht
Feb 5, 2024

This is my summary of the problems in the current data collector. I made a summary for the rest of @projectmesa/maintainers. Needs your opinion so that this can happen in time just before the 3.0 release. I think this should not be a GSoC 2024 project.

Data collection problems:

1. Can't collect based on agent type. Multi-Agent data collection #348, DataCollector requires all agents to have the same Attributes #976
2. Can't collect based on subset of agents via conditionals, which may change over time. Multi-Agent data collection #348 (comment)
3. Needs to separate the notion of "measure" (single use data) from data_over_time (timeseries). Measure may be used in visualization, without having to activate data_over_time, because the latter bogs down RAM. Discussion happened on Matrix.org.
4. Needs to be performant for large scale data collection (e.g. https://github.com/SC3-TUD/PNAS-Uncertainty-in-Boundedly-Rational-Climate-Adaptation/blob/d17fbe180384e7e79cf2adde153d1adfb9a401ba/data_collection.py#L686-L695)
5. Needs to be serializable to various formats: dataframe (i.e., csv/excel), relational db's. Discussion happened on Matrix.org
6. Needs to have live data processing over collected data, as derived data, see The future of data collection #1944 (reply in thread)
7. Backend implementation: needs to be based on observer pattern, and pub/sub. Proposal: adding some form of event sourcing to MESA #1947 Event tracking and analysis #1930 Add system state tracking #1933
8. Flat (name only, e.g. {"data1": ..., "data2": ...}) dictionary of data collected vs 2-level (group, name, e.g. {group1: {"data1": ...}, group2: {"data5": ...}}) dictionary of data collected, the latter for easy groupby(group). This is a generalization of get_model_vars_dataframe and get_agent_vars_dataframe of the current data collector.
9. ~~Needs various convenient statistics functions for quick exploration of data~~ needs an easy way to allow user to apply statistical operations (min, median, max, etc) on the measures Aggegrated agent metric in DataCollection, graph in ChartModule #1145

4 replies

quaquel Feb 5, 2024
Maintainer Author

Thanks for this useful summary. I broadly agree. Some additional thoughts below

On 5. Yes, it is important that the datacollection mechanism is open for extension so users can easily replace the default behavior with their own custom (database) solution. The question is whether to do this at the level of each individual collect call, at the level of a complete run, or both.
On 7. I personally think this is indeed the way to go. It will be quite a bit more performant. For example, when testing this on a few models over Christmas, it was about 20%-30% faster than the current data collection. This speed up is mainly due to a shift from "pull" based data collection to "push" based data collection. So, e.g., instead of querying each agent for each collect call, you have a Measure that reflects the underlying data to which agents push their state updates. Another reason why I am in favor of this is that it enables a much cleaner separation of concerns (as evidenced by the discussion #1994 on tight vs. loose coupling between cells and the DiscreteSpace to which they belong). But, as indicated by @Corvince in #1947, adding pub/sub requires a very careful design and will add to the code base that needs to be maintained.
On 8. Can you elaborate on this. I am not sure I fully understand what you mean.
On 9. Not sure about this. If Collectors and Measures take a callable, the basic mechanism is in place for this. Should MESA maintain a list of default operations?
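To make the "pull" versus "push" distinction in the reply on point 7 concrete, here is a minimal sketch of a push-based measure. `PushMeasure`, the `update` method, and the property-based `wealth` setter are all assumptions for illustration; this is not the implementation that was benchmarked.

```python
class PushMeasure:
    """Holds the latest value per agent; agents push updates into it."""

    def __init__(self):
        self._values = {}

    def update(self, key, value):
        self._values[key] = value

    @property
    def value(self):
        # No iteration over agents is needed at collection time.
        return list(self._values.values())


class Agent:
    def __init__(self, unique_id, wealth, measure):
        self.unique_id = unique_id
        self._measure = measure
        self.wealth = wealth  # goes through the setter below and pushes the initial value

    @property
    def wealth(self):
        return self._wealth

    @wealth.setter
    def wealth(self, value):
        self._wealth = value
        self._measure.update(self.unique_id, value)  # push the state change


wealth_measure = PushMeasure()
agents = [Agent(i, 1, wealth_measure) for i in range(3)]
agents[0].wealth = 5  # the measure is updated as a side effect of the assignment
```

With pull-based collection, every collect call would query all agents; here the measure is already current, at the cost of extra work on every state change.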

rht Feb 6, 2024

On 8. Can you elaborate on this. I am not sure I fully understand what you mean.

Users sometimes may want to get all the data collected of a particular group. With the current data collector, it would be get_model_vars_dataframe and get_agent_vars_dataframe. This needs to be generalized to groups. I have updated the summary with this info

On 9. Not sure about this. If Collectors and Measures take a callable, the basic mechanism is in place for this. Should MESA maintain a list of default operations?

I think @EwoutH wants an easy (minimal ceremony) way to apply statistics on the measures. The "object" in the current data collector is too indirect to be manipulated. I think your idea (and in @Corvince's wish list) of making it a concrete object solves this issue, because it allows one to do np.mean(model.wealth) (or np.mean(model.wealth.value) depending on the implementation detail), etc.

Optional comment: This seems to stem from the fact that Python's dict-key and Python object-attribute are not fully equivalent, unlike in JavaScript where I can specify an object as a dictionary

const quiescents = {wealth: ..., x: ..., y: ...}

And I'd be able to access wealth via quiescents.wealth, as an object attribute of quiescents.

rht Feb 6, 2024

Should MESA maintain a list of default operations?

Only as a last resort. But the idea of measure as a concrete object/attribute of a model, seems to render this unnecessary.

rht Feb 6, 2024

If Collectors and Measures take a callable, the basic mechanism is in place for this.

You actually already wrote this down before I did. And I somehow missed it. So point 9 already has a proposed solution.

quaquel · 2024-02-10T07:45:53Z

quaquel
Feb 10, 2024
Maintainer Author

I suggest we try to contain the discussion on DataCollection here rather than having it spread over multiple locations. I am getting confused trying to find all the useful ideas and discussions. So rather than respond in #1933, I'll respond here. In #1933, @rht wrote:

I suppose measure may allow multiple attributes functions, for the case when the measures can be grouped into 1 DF.
Based on #1944 code example

I am not entirely sure about this. Dataframes, for me, are associated with analyzing the results of a run. So, in my branch, to_dataframe is part of the BaseCollector and its subclasses. Moreover, once one allows for multi-indexing, virtually anything can be gathered into a dataframe.

My understanding of Measures is as follows:

Conceptually, I understand a Measure as some observable model state at a particular time instant that is a function of internal model objects (e.g., agents, space, etc.). see here

So is State a single thing, or can it be multiple things? For example, an agent's position is clearly part of the agent (and by extension) model state. However, most of the time, position will be some tuple. So, somewhere, we have to translate the position into its elements. Do we want to do this in Measure, which would imply having multiple "fields" in a measure, or do we handle this downstream wherever Measure is being used?

I personally am inclined to handle this further downstream. To continue the position example, in data collection, we might want to split position into x, y, (and z). For visualization, however, this splitting might not be required.

So, I am unsure if we need multiple attributes/functions on a Measure. Instead, in my current thinking Measure always reflects a single state variable.

3 replies

rht Feb 10, 2024

I should have provided a problem that I want to solve, which is problem 8 in #1944 (comment). I think measure-containing-multiple-attributes/functions may solve problem 8, because it could be considered a preemptive organization method, instead of downstream, for the user to group-by attributes/functions for 1 group in 1 measure. The problem with the user doing group-by at downstream, is that while they may be aware which measures are from the same group, the Mesa framework isn't, because it is not specified explicitly.

Moreover, once one allows for multi-indexing, virtually anything can be gathered into a dataframe.

How does multi-indexing work? An example to see it in action would help.

Between the 2 choices, I prefer whichever has an overall simpler structure, both user facing (easy to conceptualize) and the underlying code.

rht Feb 10, 2024

The problem with the user doing group-by at downstream, is that while they may be aware which measures are from the same group, the Mesa framework isn't, because it is not specified explicitly.

This is not a problem. I suppose the framework does have an awareness of each measure's group. If I implement a function

combine_into_one_dataframe(measure1, measure2, ...)

in #2024's implementation, I can always check each measure's group attribute to see if there is a group mismatch.

quaquel Feb 10, 2024
Maintainer Author

How does multi-indexing work?

I meant pandas multi indexing

This is not a problem. I suppose the framework does have an awareness of each measure's group. If I implement a function

That is basically in line with how I have approached it in my branch. Each Collector has a to_dataframe method. If the user wants to combine multiple dataframes into a bigger dataframe, for example, because each dataframe contains data about the same group, the user can easily do this via pd.concat. The benefit of this to me was that it keeps the MESA side of things relatively simple and easy to explain, while opening up more sophisticated use cases for the user to implement themselves on top of this by leveraging e.g., pandas.
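Since an example of multi-indexing was requested earlier in the thread, here is a small pandas sketch. The frames stand in for what a per-collector `to_dataframe` might return; the index level names (`time`, `agent_id`, `group`) and the values are made up for illustration.

```python
import pandas as pd

# Two frames for the same group, indexed by (time, agent_id).
index = pd.MultiIndex.from_product([[0, 1], ["a1", "a2"]], names=["time", "agent_id"])
wealth = pd.DataFrame({"wealth": [1, 2, 3, 4]}, index=index)
position = pd.DataFrame({"x": [0, 5, 1, 5], "y": [0, 5, 0, 4]}, index=index)

# Same group, same index: place the columns side by side.
citizens = pd.concat([wealth, position], axis=1)

# Different groups: stack the rows and add the group as an extra index level.
cop_index = pd.MultiIndex.from_product([[0, 1], ["c1"]], names=["time", "agent_id"])
cops = pd.DataFrame({"wealth": [9, 9]}, index=cop_index)
by_group = pd.concat({"citizen": wealth, "cop": cops}, names=["group"])

# The extra level makes per-group statistics a one-liner.
mean_wealth = by_group.groupby(level="group")["wealth"].mean()
```

The framework only needs to hand out one DataFrame per collector; the grouping is done by the user with `pd.concat`.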

quaquel · 2024-02-11T09:49:54Z

quaquel
Feb 11, 2024
Maintainer Author

A quick update from my side. I have been trying to figure out a way to make it possible to access the value of Measure as an attribute. So the basic idea is that the following code works.

class Measure:
    def __init__(self, group, attribute, function):
        self.group = group
        self.attribute = attribute
        self.function = function

    def get_value(self):
        # retrieve the attribute from the group, then apply the callable
        return self.function(self.group.get(self.attribute))


class MyModel(Model):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # some initialization code goes here

        self.gini = Measure(self.agents, "wealth", calculate_gini)


if __name__ == '__main__':
    model = MyModel()
    print(model.gini) # should actually do model.gini.get_value()

This turns out to be not trivial because in this example self.gini by default will return the Measure instance instead of Measure.get_value. So, we somehow need to hook into Python's mechanisms for setting and getting attributes. One possible way to do this, which works, is given below. Note that this is a barebones example, only focused on making the get_value work. None of the other pieces of Measure are included in this example.

class Measure:

    def __init__(self, model, identifier, *args, **kwargs):
        self.model = model
        self.identifier = identifier

    def get_value(self):
        return getattr(self.model, self.identifier)


class MeasureDescriptor:
    def __set_name__(self, owner, name):
        self.public_name = name
        self.private_name = "_" + name

    def __get__(self, obj, owner):
        return getattr(obj, self.private_name).get_value()

    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)


class Model:

    def __setattr__(self, name, value):
        if isinstance(value, Measure) and not name.startswith("_"):
            klass = type(self)
            descr = MeasureDescriptor()
            descr.__set_name__(klass, name)
            setattr(klass, name, descr)
            descr.__set__(self, value)
        else:
            super().__setattr__(name, value)

    def __init__(self, identifier, *args, **kwargs):
        self.gini = Measure(self, "identifier")
        self.identifier = identifier


if __name__ == '__main__':
    model1 = Model(1)
    model2 = Model(2)
    print(model1.gini)
    print(model2.gini)

To make model.gini return model.gini.get_value(), we must invoke some Python magic. First, calling a method as part of attribute lookup can be done using properties/descriptors. So whenever we assign a Measure instance to an attribute, we should add the appropriate property/descriptor that handles the get_value call. We can intercept attribute assignments through the __setattr__ dunder method. Since we assign the Measure instance to '_{name_of_measure}' and __set__ invokes __setattr__ again, we need a bit of care when creating and assigning the property.

I hope this explanation is clear enough. I admit it is a bit convoluted. It is also one of the only ways I have been able to come up with so far that makes it possible for Measures to behave as if they are normal attributes. Please let me know what you think of this direction for implementing Measure or whether the complexity is not worth it, and we forego the idea of having Measure behave as if it is an attribute that returns a simple value (e.g., int, float, string).

6 replies

rht Feb 13, 2024

I'd say the ReactJS-state API of the self.gini_coeff Measure above would require the __get__ method that would retrieve the latest value as of the most recent self.set_gini_coeff. The difference between Solara's API and ReactJS's API is that the former combines var, set_var into 1 variable.

quaquel Feb 13, 2024
Maintainer Author

I thought that using the model.measure1 to retrieve the measure1 object, while model.measure1.value to retrieve the value has a parallel with Solara's use_reactive, which is Solara's alternative API to ReactJS's useState:

Ok, this is slightly different from what I was trying to achieve and, indeed, easier. The consequence of going this route is that at the level of Collectors, you need to differentiate between collectors that collect attributes and collectors that collect measures. The former can do obj.attr, while the latter needs to do obj.attr.value. This can be handled either within the collector class (by checking the return of each obj.attr) or by having separate Collector subclasses.
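A sketch of the first option, where the collector checks what each attribute lookup returns. The `Measure` class with a `value` property and the helper `read` are hypothetical names used only for this illustration.

```python
from types import SimpleNamespace


class Measure:
    """Wraps a zero-argument callable; the current value is computed on access."""

    def __init__(self, func):
        self._func = func

    @property
    def value(self):
        return self._func()


def read(obj, name):
    """Return obj.name, unwrapping Measure instances via .value."""
    attr = getattr(obj, name)
    return attr.value if isinstance(attr, Measure) else attr


model = SimpleNamespace(wealth=[1, 2, 3])
model.n_agents = 3                                       # plain attribute
model.total_wealth = Measure(lambda: sum(model.wealth))  # measure

plain = read(model, "n_agents")
measured = read(model, "total_wealth")
```

The `isinstance` check costs one extra test per lookup; separate Collector subclasses would avoid it but require the user, or a factory, to pick the right class.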

I am not familiar with either Solara's use_reactive or ReactJS's useState. I'll try to find time to read up on this.

At the moment, I am trying to develop a conceptual design and working implementation for data collection that does not rely on pub/sub. Ideally, this design is such that the public API can stay in place even if the implementation is switched to pub/sub. Once we all agree on the conceptual design, I'll try to demonstrate how the implementation becomes much easier if you use pub/sub (see @EwoutS comment in #1947).

rht Feb 13, 2024

@Corvince what is your take on the ReactJS angle on the Measure class?

I asked ChatGPT about the similarity/difference between ReactJS state management and pub/sub pattern.
Similarity:

While React's state management and the pub/sub pattern serve different purposes and operate within different scopes, they share underlying principles related to event-driven data flow and decoupling of components or actors. Understanding these similarities can help developers design more effective and efficient React applications, particularly when integrating with or implementing event-driven architectures.

Difference:

ReactJS state management is specifically tailored for handling UI state within the React framework, facilitating a component-based and unidirectional data flow that is closely tied to the React component lifecycle for rendering dynamic interfaces. In contrast, the Publish/Subscribe (pub/sub) pattern is a general-purpose, loosely coupled messaging paradigm designed for decoupled communication across different parts of an application or between different applications, supporting a dynamic and multi-directional flow of information. While React state is integral to component reactivity and structure within a UI context, pub/sub offers broader application-wide communication capabilities, enabling flexible and scalable architectures without being tied to any specific UI or framework.

Corvince Feb 13, 2024
Maintainer

I think the ChatGPT answer regarding the differences is pretty good. But first on useState and reactive. React's useState and Solara's reactive are fundamentally different strategies to solve the same problem. The latter is not the former combined in a single object. Most importantly, useState can only be used inside a component, or, in Solara terms, within a rendering context (which @solara.component provides). It returns a plain Python object with no magic attached. What triggers state updates are the side effects of the set_* function, which "inform" the rendering context of the state update.

On the other hand, Solara's reactive can be used anywhere and is not bound to any component, providing global state (there is an equivalent use_reactive for local state). But again, something needs to track these changes. So if you access a reactive variable inside a @solara.component, the component actually subscribes to the reactive object, which publishes its changes.

So in summary, yes, Solara's reactive and pub/sub are strongly related. useState, not so much. You can read more about reactive in the Vue docs.

Regarding the API, after seeing the implementation of @quaquel, I think the solution provided by @rht in his PR is probably suitable. In particular, it could indeed be made reactive (or pub/sub) with the same API. Then again, maybe Measure and MeasureDescriptor can indeed be merged and we get the best of both worlds?

quaquel Feb 13, 2024
Maintainer Author

what is the drawback of combining Measure and MeasureDescriptor?

Then again, maybe Measure and MeasureDescriptor can indeed be merged and we get the best of both worlds?

To the best of my knowledge, it is technically impossible to combine Measure and MeasureDescriptor. Descriptors are defined on classes, not on instances. So, in my example with 2 model instances, you have two different Measure instances (one per Model instance) but only 1 MeasureDescriptor instance (on the Model class). This is also one reason for my code example's rather complicated __setattr__ implementation (the other has to do with how the Descriptor protocol interacts with __new__). Of course, the descriptor and __setattr__ complicatedness would be hidden from the user if we decide to go down this route. They would become part of the MESA framework, and users would only need to work with the Measure class.
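The class-versus-instance point can be checked with a few lines. `Doubler`, `OnClass`, and `OnInstance` are throwaway names for this demonstration only.

```python
class Doubler:
    """Minimal descriptor: returns twice the owner's raw attribute."""

    def __get__(self, obj, owner=None):
        if obj is None:
            return self  # accessed on the class itself
        return obj.raw * 2


class OnClass:
    doubled = Doubler()  # stored on the class: __get__ is invoked on lookup

    def __init__(self, raw):
        self.raw = raw


class OnInstance:
    def __init__(self, raw):
        self.raw = raw
        self.doubled = Doubler()  # stored on the instance: __get__ is NOT invoked


from_class = OnClass(2).doubled        # 4
from_instance = OnInstance(2).doubled  # the Doubler object itself
```

Python only runs the descriptor protocol for objects found on the type, which is why the earlier example installs the descriptor on the class from within `__setattr__`.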

EwoutH · 2024-02-12T09:55:58Z

EwoutH
Feb 12, 2024
Maintainer

Thanks a lot for this. I think we're on the right track, I would only change the abstraction level on which the collect function works.

There are basically the following problems:

We want to dynamically retrieve the members of an object (like an AgentSet). We might want to select a subset of that collection, conditionally.
From either that object or all members of that object, we want to collect some attributes.
We might want to aggregate one or more of those attributes before saving them.

Then there is the complication that you sometimes have an object with members with attributes (like an AgentSet) and sometimes just have an object with attributes directly (like a Model).

So basically there are three levels that need to be defined:

Which object or set of objects do I collect from?
For that object or set of objects, which attributes do I want to collect? And do I need to calculate them with some function?
(if set of objects) for each attribute, do I want to save all of the values of all objects, or do I want to aggregate them to one or more values?

You can already see how complicated this can possibly get. I will try to think about some possible abstractions, but feel free to build on this in the meantime.

5 replies

quaquel Feb 12, 2024
Maintainer Author

This is a fair summary of the state of the conversation.

At present, I guess we have identified four classes: DataCollector/CollectorContainer, Collector, Measure, and Group. Each of these might have several subclasses (e.g., AgentSetCollector). Let me try to summarize:

Group is a collection of objects, typically agents, and can be static or dynamic. Dynamic groups are groups where membership changes over the course of the simulation. This can be because agents are added or removed from the model, or because membership is based on the agent being in some state.

Measure reflects a state variable of the model at a given time instant. Ideally, it would be possible to access the value of a measure through attribute lookup.

Collector collects measures and attributes and stores them. This allows one to capture the change of them over the course of the simulation.

DataCollector/CollectorContainer, in my thinking, is more of a convenience class. It contains all collectors, and a single collect call on it invokes collect on all collectors.
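To show how the four classes could fit together, here is a compact sketch. All constructor signatures are assumptions made for this illustration, and the model's clock is assumed to be an attribute called `time`.

```python
from types import SimpleNamespace


class Group:
    """Dynamic group: membership is re-evaluated on every iteration."""

    def __init__(self, source, condition=None):
        self._source = source  # a re-iterable collection of agents, e.g. a list
        self._condition = condition

    def __iter__(self):
        return (a for a in self._source
                if self._condition is None or self._condition(a))

    def get(self, attr):
        return [getattr(agent, attr) for agent in self]


class Measure:
    """A state variable of the model at the current time instant."""

    def __init__(self, group, attr, func):
        self.group, self.attr, self.func = group, attr, func

    @property
    def value(self):
        return self.func(self.group.get(self.attr))


class Collector:
    """Tracks a measure over time."""

    def __init__(self, model, measure):
        self.model, self.measure = model, measure
        self.data = []

    def collect(self):
        self.data.append((self.model.time, self.measure.value))


class CollectorContainer:
    """Convenience class: one collect call fans out to all collectors."""

    def __init__(self, **collectors):
        self.collectors = collectors

    def collect(self):
        for collector in self.collectors.values():
            collector.collect()


agents = [SimpleNamespace(condition="Quiescent", wealth=1),
          SimpleNamespace(condition="Quiescent", wealth=2),
          SimpleNamespace(condition="Active", wealth=3)]
model = SimpleNamespace(time=0)
quiescents = Group(agents, lambda a: a.condition == "Quiescent")
n_quiescent = Collector(model, Measure(quiescents, "wealth", len))
container = CollectorContainer(n_quiescent=n_quiescent)

container.collect()
agents[0].condition = "Active"  # membership of the dynamic group changes
model.time = 1
container.collect()
```

After the two calls, `n_quiescent.data` holds one `(time, count)` pair per collect call, and the count drops from 2 to 1 because the group is re-evaluated each time.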

quaquel Feb 13, 2024
Maintainer Author

Quick update: I discussed this conceptual design with the scientific programmer working on several large MESA-based ABMs. Based on our discussion, this design seems to cover virtually all of their use cases. She particularly liked the ideas of dynamic/conditional Groups/AgentSets, the separation between Measure (part of the state of the Model at a given time instant) and Collector (tracking state over time), and the idea of making Collectors easily extendible on the storage side of things.

EwoutH Feb 15, 2024
Maintainer

An interesting thought we had: It might be useful to let an "aggregate" collector collect another "group" collector. So for example, a group collector collects wealth, and then the aggregate collector aggregates to a single value.

rht Feb 15, 2024

An interesting thought we had: It might be useful to let an "aggregate" collector collect another "group" collector. So for example, a group collector collects wealth, and then the aggregate collector aggregates to a single value.

That's point no. 6 in the summary: #1944 (comment)

rht Feb 15, 2024

Quick update: I discussed this conceptual design with the scientific programmer working on several large MESA-based ABMs. Based on our discussion, this design seems to cover virtually all of their use cases. She particularly liked the ideas of dynamic/conditional Groups/AgentSets, the separation between Measure (part of the state of the Model at a given time instant) and Collector (tracking state over time), and the idea of making Collectors easily extendible on the storage side of things.

It's great to have user validation of the idea, in that it covers complex use cases. We just need to make sure the DynamicAgentSet is performant. Does she have an additional wish list regarding data collection?

rht · 2024-02-12T14:59:47Z

rht
Feb 12, 2024

On Group:

I can see how groups can be used outside of the measure and data collection use case. They may be reused to organize agents' step execution as well, e.g. if I want only the quiescent citizens in the Epstein civil violence model to take certain actions.

def step(self):  # of a model
    # Instead of
    self.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent").do("rest")
    # we do
    self.quiescents.do("rest")

What about doing addition on the groups?

# The drawback being this is not cacheable
(self.quiescents + self.injured_cops).do("rest")
# Needs to be
self.needs_rest = Group(self.quiescents + self.injured_cops)
self.needs_rest.do("rest")

17 replies

quaquel Feb 21, 2024
Maintainer Author

What are the other contenders for the DynamicAgentSet backend implementation?

The class would need to have a base set of agents (or default to model._agents if no base set is provided). Next, select, do, get, shuffle, and sort would all take this base set, run it through the condition, and then do their normal thing. So, effectively, they all would first call select on the base set of agents. So it's not that complicated.

You could expand on this, just as we discussed with Measure, by having some kind of updated flag that is reset each time tick, while having a force_update keyword to override the flag if necessary. This can be a performance saver if the DynamicAgentSet/ConditionalAgentSet is used several times without any intervening state changes to any of the agents within the set.
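
A rough sketch of this naive approach, including the updated flag and the force_update keyword, might look as follows. The class below is a standalone illustration and not the real AgentSet API: the reset hook, the _members helper, and the method signatures are assumptions.

```python
from typing import Any, Callable, Iterable


class ConditionalAgentSet:
    """Naive sketch: filter a base collection by a condition, cached per tick."""

    def __init__(self, base: Iterable[Any], condition: Callable[[Any], bool]):
        self._base = base
        self._condition = condition
        self._cache: list = []
        self._updated = False  # True once membership was computed this tick

    def reset(self) -> None:
        """Hypothetical hook, meant to be called once per time tick."""
        self._updated = False

    def _members(self, force_update: bool = False) -> list:
        # Re-apply the condition only if the cache is stale or an update is forced.
        if force_update or not self._updated:
            self._cache = [a for a in self._base if self._condition(a)]
            self._updated = True
        return self._cache

    def get(self, attr_name: str, force_update: bool = False) -> list:
        return [getattr(a, attr_name) for a in self._members(force_update)]

    def do(self, method_name: str, *args, force_update: bool = False, **kwargs):
        for agent in self._members(force_update):
            getattr(agent, method_name)(*args, **kwargs)
        return self


class Citizen:
    def __init__(self, condition: str):
        self.condition = condition
        self.rested = False

    def rest(self) -> None:
        self.rested = True


citizens = [Citizen("Quiescent"), Citizen("Active")]
quiescents = ConditionalAgentSet(citizens, lambda a: a.condition == "Quiescent")
quiescents.do("rest")                # only the first citizen rests
citizens[1].condition = "Quiescent"  # state change within the same tick
stale = quiescents.get("condition")                     # cached membership
fresh = quiescents.get("condition", force_update=True)  # recomputed membership
```

The cached result is only correct if no agent changes state between uses within a tick, which is exactly the case where the flag saves work; otherwise force_update is needed.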

I don't see other viable backend implementation other than message passing, where the model holds a global message inbox object (this is apparently how hash.ai does it). But this sounds similar to the flow library, which @quaquel described the shortcomings of

I don't know enough about hash.ai to comment, but in principle pub/sub or the Observer design pattern is a software design idea, and I would not conflate it with what I called a flow library, which is a discrete simulation concept.

rht Feb 21, 2024

I don't see other viable backend implementation other than message passing,

I wrote this in a way that could be easily misunderstood. I did not mean that message passing the way hash.ai does it is the only viable way instead of pub/sub. I wanted to say that it is one major contender to pub/sub as the backend implementation, so we would need to compare and contrast the 2 implementations. I have not decided on either.

rht Feb 23, 2024

hash.ai's agent messages are specifically for communication among agents that are not neighbors. They are not used for communicating between agents and non-agent objects, so their design shouldn't necessarily be taken into consideration as a backend for a dynamic AgentSet. The only other alternative is to always compute condition-based membership from scratch. This has a simple, naive implementation, but it is not performant.
As such, I am on board with pub/sub as the backend for the dynamic AgentSet, which should be implemented and merged before the new data collection construct.
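
For readers unfamiliar with the idea, a pub/sub backend could be sketched roughly as below: an observable attribute notifies subscribers on every assignment, and the set updates its membership incrementally instead of recomputing it from scratch. This is only an illustration under assumed names (Observable, SubscribedAgentSet, _subscribers); it is not the implementation in the datacollection branch.

```python
from typing import Any, Callable, Iterable


class Observable:
    """Descriptor that notifies the instance's subscribers on every assignment."""

    def __set_name__(self, owner, name):
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        setattr(instance, self.private_name, value)
        for callback in instance._subscribers:
            callback(instance)  # publish the change


class Citizen:
    condition = Observable()

    def __init__(self, condition: str):
        self._subscribers: list = []  # must exist before the first assignment
        self.condition = condition


class SubscribedAgentSet:
    """Membership is kept up to date by subscribing to agent state changes."""

    def __init__(self, agents: Iterable[Any], condition: Callable[[Any], bool]):
        self._condition = condition
        self._members: set = set()
        for agent in agents:
            agent._subscribers.append(self._on_change)
            self._on_change(agent)  # establish initial membership

    def _on_change(self, agent: Any) -> None:
        if self._condition(agent):
            self._members.add(agent)
        else:
            self._members.discard(agent)

    def __len__(self) -> int:
        return len(self._members)

    def __contains__(self, agent: Any) -> bool:
        return agent in self._members


citizens = [Citizen("Quiescent"), Citizen("Active")]
quiescents = SubscribedAgentSet(citizens, lambda a: a.condition == "Quiescent")
n_before = len(quiescents)
citizens[1].condition = "Quiescent"  # triggers the subscription callback
n_after = len(quiescents)
```

The trade-off is that every assignment to the observed attribute pays a small notification cost, while reading the set becomes cheap.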

quaquel Feb 23, 2024
Maintainer Author

I have done some more work on pub/sub and datacollection in my datacollection branch. I hope to find time over the weekend to do an actual performance comparison between the 2 implementations for ConditionalAgentSet:

1. pub/sub based, this is already there and working nicely. Initial performance assessments on the Epstein example suggest that it adds little overhead (0.3 seconds on a 4 second run for all data collection). And the entire pub/sub structure adds less than 0.1 second to the run if no datacollection is used.
2. A more traditional/naive implementation where the ConditionalAgentSet has a base set and for do, shuffle, get, select, and sort, you first apply the condition to the base set. This can be quite straightforwardly implemented because applying the condition is like an annotation to each of these methods, so this can be handled in the constructor (i.e., __new__). I am curious to see what the performance overhead on the full data collection will be of this implementation.

Note that I prefer ConditionalAgentSet over DynamicAgentSet as a name.

quaquel Feb 23, 2024
Maintainer Author

I committed an alternative implementation along the lines of point 2. Initial testing doesn't show a massive performance difference between pub/sub and the naive implementation, at least for the Epstein model.

quaquel · 2024-02-16T09:54:49Z

quaquel
Feb 16, 2024
Maintainer Author

An interesting thought we had: It might be useful to let an "aggregate" collector collect another "group" collector. So for example, a group collector collects wealth, and then the aggregate collector aggregates to a single value.

That's point no. 6 in the summary: #1944 (comment)

The problem is an extension/detailing of 6. Let me try to explain in a bit more detail one of the issues I am currently stuck on.

The basic idea of a Collector is that it retrieves one or more attributes from an object or collection of objects, and optionally applies a callable to it. The issue now is that there is no way to specify the return of this optional callable in the current design. This return matters because it affects how data is stored in the collector and how it will be turned into a dataframe in to_dataframe. Knowing this return is also essential in case users extend the collectors for storing data in e.g., a database.

So, for example, we retrieve wealth from a collection of agents and apply calculate_gini to it. Here, we go from a list of values to a single number. In contrast, we might retrieve attributes a, b, and c from a collection of agents and next apply a post-processing function to them, which operates on each agent and so returns another list.

One idea I had after the conversation with @EwoutH is that the entire problem is analogous to e.g., pandas.DataFrame.apply. In case of collecting data from a collection of objects and next applying a callable to it, the user should specify the "axis" over which this function will operate. If you operate over the "columns", you are aggregating the information across all objects, while if you operate over "rows", the function is applied to the collected data for each object separately.
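
A small pure-Python sketch of this axis idea is given below. The apply helper and its "columns"/"rows" labels are hypothetical and follow the wording used here; note that in pandas itself, DataFrame.apply with axis=0 (or "index") passes each column to the function, and axis=1 (or "columns") passes each row, so the labels should not be read as pandas arguments.

```python
from statistics import mean
from typing import Any, Callable


def apply(table: dict, func: Callable, axis: str) -> dict:
    """Apply func to collected data; table maps object id -> {attribute: value}.

    axis="columns": func receives all values of one attribute, so the result
                    aggregates across objects (one value per attribute).
    axis="rows":    func receives the attribute dict of a single object
                    (one value per object).
    """
    if axis == "columns":
        attributes = next(iter(table.values())).keys()
        return {attr: func([row[attr] for row in table.values()])
                for attr in attributes}
    if axis == "rows":
        return {obj_id: func(row) for obj_id, row in table.items()}
    raise ValueError(f"unknown axis: {axis!r}")


# Collected data for two agents with attributes a and b.
collected = {
    1: {"a": 1.0, "b": 2.0},
    2: {"a": 3.0, "b": 4.0},
}

per_attribute = apply(collected, mean, axis="columns")
per_agent = apply(collected, lambda row: row["a"] + row["b"], axis="rows")
```

Knowing the axis tells the collector up front whether the callable returns one value per attribute or one value per object, which is the information needed for storage and for to_dataframe.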

I hope this helps to clarify the issue.

2 replies

EwoutH Mar 10, 2024
Maintainer

If you operate over the "columns", you are aggregating the information across all objects, while if you operate over "rows", the function is applied to the collected data for each object separately.

Yes, I like this. You could even do both, in some order.

I feel we're close! What's needed to get this done?

quaquel Mar 10, 2024
Maintainer Author

Mainly time from my side, I guess, but with my talk last week done, I hope to get back to this and devs.

EwoutH · 2024-09-05T10:04:03Z

EwoutH
Sep 5, 2024
Maintainer

Played around a bit a few days ago. Now that we have our very powerful AgentSet, the API seems to be able to get simpler:

datacollector = DataCollector(
    collectors=[
        c(target=Model, attributes=["n_agents"], methods=calculate_energy),
        c(target=Wolf, attributes=["sheep_eaten"]),
        c(target=Sheep, attributes=["age"], methods=calculate_energy),
        c(target=model.agents, attributes=["energy"], agg={"energy": np.mean}),
    ]
)

Few notes:

The output is a dict for each object-variable combination. So

        c(target=Model, attributes=["n_agents"], methods=calculate_energy),

gives

{
    f"{Model.__name__}_n_agents": {...},
    f"{Model.__name__}_{calculate_energy.__name__}": {...},
}

The dicts are structured as:

The model.step as first key, and the agent.unique_id as second key for AgentSet objects
The model.step as key for all other objects

If one or more agg parameters are defined, a new dict will be created for each, with the model.step as key.
Collecting over different combinations of AgentSets can just be done by passing a selection as the target.
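
To illustrate the structure described in these notes, here is a small sketch that fills such an output dict for one step. The record helper, its parameters, and the naming of the aggregate dict (attribute name plus function name) are assumptions, not part of the proposal.

```python
from collections import defaultdict
from statistics import mean


class Sheep:
    def __init__(self, unique_id: int, energy: float):
        self.unique_id = unique_id
        self.energy = energy


def record(output, name, step, agents, attribute, agg=None):
    """Store per-agent values under step -> unique_id, plus one dict per aggregate."""
    values = {agent.unique_id: getattr(agent, attribute) for agent in agents}
    output[f"{name}_{attribute}"][step] = values
    # agg maps attribute name -> aggregation function, as in the sketch above
    func = (agg or {}).get(attribute)
    if func is not None:
        # hypothetical key format for the aggregated dict
        output[f"{name}_{attribute}_{func.__name__}"][step] = func(list(values.values()))


output = defaultdict(dict)
flock = [Sheep(1, 4.0), Sheep(2, 6.0)]
record(output, "Sheep", step=0, agents=flock, attribute="energy", agg={"energy": mean})
```

Here output["Sheep_energy"] is keyed by step and then by unique_id, while the aggregated dict is keyed by step only.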

Just one approach. Don't know if it's the best.

3 replies

quaquel Sep 5, 2024
Maintainer Author

It is indeed becoming simpler, but we are not there yet.

1. It is still not trivial to refer to the current agents of a given type. So your Wolf or Sheep example API does not currently work. Likewise, we don't yet have an easy solution for dynamically changing agent sets.
2. In my vision, a given collector c can take as an object another collector. This makes it possible to avoid double collection of data:

from typing import Any, Callable

# partial and incomplete sketch of collect func. signature
def collect(name: str, target: Any, attributes: str | list[str] | None = None, func: Callable | None = None):
    ...  # logic goes here

wealth = collect("wealth", model.agents, "wealth")
gini = collect("gini", wealth, func=calculate_gini)  # reuses the wealth collector as target
avg_energy = collect("average energy", model.agents, "energy", np.mean)  # no need to use agg in agentset here.

3. I don't fully follow your agg idea.
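
To make point 2 a bit more tangible, below is a minimal sketch of a collector that accepts another collector as its target, so the underlying attribute is read from the agents only once per step. The Collector class, its data dict keyed by step, and the single-attribute restriction are simplifying assumptions.

```python
from typing import Any, Callable, Optional


class Collector:
    """Collects an attribute from objects, or reuses another collector's data."""

    def __init__(self, name: str, target: Any, attribute: Optional[str] = None,
                 func: Optional[Callable] = None):
        self.name = name
        self.target = target
        self.attribute = attribute
        self.func = func
        self.data: dict = {}  # step -> collected value

    def collect(self, step: int) -> Any:
        if step in self.data:  # already collected this step, do not repeat
            return self.data[step]
        if isinstance(self.target, Collector):
            raw = self.target.collect(step)  # reuse upstream data
        else:
            raw = [getattr(obj, self.attribute) for obj in self.target]
        self.data[step] = self.func(raw) if self.func else raw
        return self.data[step]


class Agent:
    def __init__(self, wealth: int):
        self.wealth = wealth


agents = [Agent(1), Agent(2), Agent(3)]
wealth = Collector("wealth", agents, "wealth")
total = Collector("total wealth", wealth, func=sum)  # target is another collector
richest = Collector("richest", wealth, func=max)

total.collect(step=0)    # triggers wealth.collect(0) once
richest.collect(step=0)  # reuses the stored wealth values
```

Both derived collectors work from the same stored list, so the agents are iterated a single time per step regardless of how many downstream collectors exist.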

Corvince Sep 5, 2024
Maintainer

@EwoutH I like this. I think I like this a lot!

Let me think a bit more about this, but I think this is very close!

Not sure I share the concerns of @quaquel, but let me think about this a bit more.

quaquel Sep 5, 2024
Maintainer Author

These remarks are not so much concerns, but more puzzle pieces that are still missing. I think the main one is point 1. How can we make it easy to say that you want to collect data for a given Agent class or a subset of agents that is dynamically changing over time?
