The future of data collection #1944
Replies: 9 comments 52 replies
-
Here is an illustration of the API overlap with AgentSet. The collector

    "gini": collect(model.agents, "wealth", func=calculate_gini)

can be written with AgentSet as

    "gini": lambda model: calculate_gini(model.agents.get("wealth"))

and the collector

    "n_quiescent": collect(model.get_agents_of_type(Citizen), "condition", func=lambda x: len([entry for entry in x if entry == "Quiescent"]))

can be written with AgentSet as

    "n_quiescent": lambda model: len(model.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"))
-
Just for reference, this information is outdated. Python dictionaries used to be unordered. In Python 3.6, insertion order became an implementation detail of CPython (the reference implementation of Python). Since Python 3.7, insertion order is guaranteed, so it is perfectly fine to rely on it. That said, the mental model for dictionaries is still set-oriented (which I think is the right model). So I agree that it would be confusing if this works

    DataCollector(model, collectors={"wealth": collect(model.agents, "wealth"),
                                     "gini": collect("wealth", func=calculate_gini)})

but this doesn't

    DataCollector(model, collectors={"gini": collect("wealth", func=calculate_gini),
                                     "wealth": collect(model.agents, "wealth")})

So we would still have to work around this problem internally, which complicates the code.

But I don't think we need tiered data collectors at all. I think they are a bit hard to understand and provide little benefit. As I understand it, they are basically a performance optimization, so that you don't need to loop over all agents more than once. For small to medium models I don't think it's a problem at all. For larger models, or if you really do lots of simulation runs, yes, it can matter. But then a better solution would anyway be to calculate your derived variables afterwards. That is, you just collect the wealth attribute, turn your data collection into a pandas dataframe, and calculate the Gini coefficient from the dataframe. That is probably even faster, because pandas can vectorize the calculations across all rows.

This way you also don't mix any logic into your data collector. I think it is actually bad practice to calculate things in the data collector. It should basically be an observer. If you have the Gini coefficient in your model definition, feel free to collect it. Otherwise, calculate it as part of the data analysis. So for me the callable should only be used to filter your objects (e.g. only a certain type, or based on a condition).
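To make the "collect raw values, derive afterwards" idea concrete, here is a minimal sketch. It uses plain Python instead of pandas to stay dependency-free, and the record layout (`step`, `agent_id`, `wealth`) is an assumption made for illustration, not the actual output format of the data collector. With a dataframe, the same computation would be a group-by on the step column.

```python
from collections import defaultdict


def calculate_gini(values):
    """Gini coefficient of a list of non-negative numbers."""
    x = sorted(values)
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * xi for i, xi in enumerate(x, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical raw collector output: one record per agent per step.
records = [
    {"step": 0, "agent_id": 1, "wealth": 1},
    {"step": 0, "agent_id": 2, "wealth": 1},
    {"step": 0, "agent_id": 3, "wealth": 1},
    {"step": 0, "agent_id": 4, "wealth": 1},
    {"step": 1, "agent_id": 1, "wealth": 0},
    {"step": 1, "agent_id": 2, "wealth": 0},
    {"step": 1, "agent_id": 3, "wealth": 0},
    {"step": 1, "agent_id": 4, "wealth": 4},
]


def gini_per_step(records):
    """Derive the Gini coefficient per step after the run has finished."""
    wealth_by_step = defaultdict(list)
    for record in records:
        wealth_by_step[record["step"]].append(record["wealth"])
    return {step: calculate_gini(w) for step, w in sorted(wealth_by_step.items())}


gini_series = gini_per_step(records)
```

The collector only observes `wealth`; the Gini logic lives entirely in the analysis step.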
-
This is my summary of the problems in the current data collector, written for the rest of @projectmesa/maintainers. I need your opinions so that this can land in time, just before the 3.0 release. I think this should not be a GSoC 2024 project. Data collection problems:
-
I suggest we try to contain the discussion on DataCollection here rather than having it spread over multiple locations. I am getting confused trying to find all the useful ideas and discussions. So rather than respond in #1933, I'll respond here. In #1933, @rht wrote
I am not entirely sure about this. Dataframes, for me, are associated with analyzing the results of a run. So, in my branch, Measures in my understanding are
So is State a single thing, or can it be multiple things? For example, an agent's position is clearly part of the agent state (and, by extension, the model state). However, most of the time, position will be some tuple. So, somewhere, we have to translate the position into its elements. Do we want to do this in Measure, which would imply having multiple "fields" in a measure, or do we handle this downstream, wherever Measure is being used? I personally am inclined to handle this further downstream. To continue the position example: in data collection, we might want to split position into x, y, (and z). For visualization, however, this splitting might not be required. So, I am unsure whether we need multiple attributes/functions on a Measure. Instead, in my current thinking, a Measure always reflects a single state variable.
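A minimal sketch of that split of responsibilities, under stated assumptions: the `Measure` class below is a hypothetical stand-in that wraps exactly one state variable, and `to_record` is an invented downstream helper that splits a tuple into named fields only when data collection needs it.

```python
class Agent:
    def __init__(self, pos):
        self.pos = pos


class Measure:
    """Hypothetical: reflects exactly one state variable of one object."""

    def __init__(self, obj, attribute):
        self.obj = obj
        self.attribute = attribute

    def get_value(self):
        return getattr(self.obj, self.attribute)


def to_record(measure, field_names=("x", "y", "z")):
    """Downstream step (e.g. data collection): split tuple values into fields."""
    value = measure.get_value()
    if isinstance(value, tuple):
        # zip stops at the shorter input, so 2D positions yield only x and y.
        return dict(zip(field_names, value))
    return {measure.attribute: value}


agent = Agent(pos=(3, 7))
position = Measure(agent, "pos")
```

A visualization component could call `position.get_value()` and use the tuple directly, while data collection calls `to_record(position)`.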
-
A quick update from my side. I have been trying to figure out a way to make it possible to access the value of a Measure as an attribute. So the basic idea is that the following code works.

    class Measure:
        def __init__(self, group, attribute, function):
            self.group = group
            self.attribute = attribute
            self.function = function

        def get_value(self):
            return self.function(self.group.get(self.attribute))


    class MyModel(Model):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # some initialization code goes here
            self.gini = Measure(self.agents, "wealth", calculate_gini)


    if __name__ == '__main__':
        model = MyModel()
        print(model.gini)  # should actually do model.gini.get_value()

This turns out to be nontrivial because, in this example, gini is assigned on the instance inside __init__, whereas Python only invokes descriptors that are defined on the class. One way around this is the following:

    class Measure:
        def __init__(self, model, identifier, *args, **kwargs):
            self.model = model
            self.identifier = identifier

        def get_value(self):
            return getattr(self.model, self.identifier)


    class MeasureDescriptor:
        def __set_name__(self, owner, name):
            self.public_name = name
            self.private_name = "_" + name

        def __get__(self, obj, owner):
            # look up the Measure stored on the instance and return its value
            return getattr(obj, self.private_name).get_value()

        def __set__(self, obj, value):
            setattr(obj, self.private_name, value)


    class Model:
        def __setattr__(self, name, value):
            if isinstance(value, Measure) and not name.startswith("_"):
                # install the descriptor on the class, store the Measure on the instance
                klass = type(self)
                descr = MeasureDescriptor()
                descr.__set_name__(klass, name)
                setattr(klass, name, descr)
                descr.__set__(self, value)
            else:
                super().__setattr__(name, value)

        def __init__(self, identifier, *args, **kwargs):
            self.gini = Measure(self, "identifier")
            self.identifier = identifier


    if __name__ == '__main__':
        model1 = Model(1)
        model2 = Model(2)
        print(model1.gini)
        print(model2.gini)

I hope this explanation is clear enough. I admit it is a bit convoluted. It is also one of the few ways I have been able to come up with so far that makes it possible for Measures to behave as if they are normal attributes. Please let me know what you think of this direction for implementing Measure, or whether the complexity is not worth it and we should forego the idea of having Measure behave as if it is an attribute that returns a simple value (e.g., int, float, string).
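For readers less familiar with descriptors, the following small sketch shows the underlying Python behavior that makes class-level machinery necessary: the descriptor protocol is only triggered for objects found on the class, not for objects stored in the instance `__dict__`. The class names here are invented purely for illustration.

```python
class Constant:
    """A minimal descriptor that always returns 10."""

    def __get__(self, obj, owner=None):
        return 10


class Example:
    # Found on the class: attribute access invokes Constant.__get__.
    on_class = Constant()

    def __init__(self):
        # Stored in the instance __dict__: returned as-is, __get__ is not called.
        self.on_instance = Constant()


example = Example()
value_from_class = example.on_class
value_from_instance = example.on_instance
```

Here `value_from_class` is the number 10, while `value_from_instance` is the `Constant` object itself, which is why a Measure assigned in `__init__` does not automatically behave like a plain value.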
-
Thanks a lot for this. I think we're on the right track; I would only change the abstraction level on which the There are basically the following problems:
Then there is the complication that you sometimes have an object whose members have the attributes (like an AgentSet) and sometimes just have an object with the attributes directly (like a Model). So basically there are three levels that need to be defined:
You can already see how complicated this can possibly get. I will try to think about some possible abstractions, but feel free to build on this in the meantime.
-
On Group: I can see how groups can be used outside of the measure and data collection use case. They may be reused to organize agents' step execution as well, e.g. if I want only the quiescent citizens in the Epstein civil violence model to take certain actions.

    def step(self):  # of a model
        # Instead of
        self.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent").do("rest")
        # we do
        self.quiescents.do("rest")

What about doing addition on the groups?

    # The drawback being this is not cacheable
    (self.quiescents + self.injured_cops).do("rest")
    # Needs to be
    self.needs_rest = Group(self.quiescents + self.injured_cops)
    self.needs_rest.do("rest")
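As a rough illustration of the idea, here is a self-contained sketch of a `Group` that supports `do` and `+`. Everything in it is an assumption: `Group` is not an existing Mesa class, the constructor signature is invented, and this version re-evaluates its filter on every call instead of caching the membership.

```python
class Citizen:
    def __init__(self, condition):
        self.condition = condition
        self.rests = 0

    def rest(self):
        self.rests += 1


class Group:
    """Hypothetical named selection of agents, re-evaluated on each use."""

    def __init__(self, source, filter_func=None):
        self._source = source            # callable returning the candidate agents
        self._filter_func = filter_func  # optional predicate on a single agent

    def members(self):
        agents = self._source()
        if self._filter_func is None:
            return list(agents)
        return [a for a in agents if self._filter_func(a)]

    def do(self, method_name):
        for agent in self.members():
            getattr(agent, method_name)()

    def __add__(self, other):
        def union():
            # Deduplicate by identity so an agent in both groups acts once.
            unique = {id(a): a for a in self.members() + other.members()}
            return list(unique.values())

        return Group(union)


agents = [Citizen("Quiescent"), Citizen("Active"), Citizen("Injured")]
quiescents = Group(lambda: agents, lambda a: a.condition == "Quiescent")
injured = Group(lambda: agents, lambda a: a.condition == "Injured")

needs_rest = quiescents + injured
needs_rest.do("rest")
```

Because membership is recomputed on each call, an agent whose `condition` changes later is picked up automatically; a cached variant would need some invalidation rule, which is the trade-off raised above.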
-
The problem is an extension/detailing of 6. Let me try to explain in a bit more detail one of the points I am currently stuck on.

The basic idea of a Collector is that it retrieves one or more attributes from an object or collection of objects, and optionally applies a callable to them. The issue is that the current design offers no way to specify what this optional callable returns. The return matters because it affects how data is stored in the collector and how it will be turned into a dataframe in So, for example, we are retrieving

One idea I had after the conversation with @EwoutH is that the entire problem is analogous to, e.g., pandas.DataFrame.apply. When collecting data from a collection of objects and then applying a callable to it, the user should specify the "axis" over which this function will operate. If you operate over the "columns", you are aggregating the information across all objects, while if you operate over "rows", the function is applied to the collected data for each object separately. I hope this helps to clarify the issue.
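The axis idea can be sketched in a few lines of plain Python. The function name `apply_collected` and the `axis` values are assumptions that follow the wording of this comment ("columns" aggregates across objects, "rows" works per object); they are not meant to mirror the exact axis convention of pandas.

```python
def apply_collected(rows, func, axis):
    """Apply func to collected data.

    rows: list of dicts, one dict of attribute values per object.
    axis="columns": aggregate each attribute across all objects.
    axis="rows": apply func to the values of each object separately.
    """
    if axis == "columns":
        names = rows[0].keys() if rows else []
        return {name: func([row[name] for row in rows]) for name in names}
    if axis == "rows":
        return [func(list(row.values())) for row in rows]
    raise ValueError(f"unknown axis: {axis!r}")


# Hypothetical collected data for two agents.
collected = [
    {"wealth": 1, "age": 10},
    {"wealth": 3, "age": 20},
]

per_attribute = apply_collected(collected, sum, axis="columns")
per_object = apply_collected(collected, sum, axis="rows")
```

The two results have different shapes (one value per attribute versus one value per object), which is exactly why the collector needs to know the axis before it can decide how to store the data.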
-
I played around a bit a few days ago. Now that we have our very powerful AgentSet, the API seems like it can get simpler:

    datacollector = DataCollector(
        collectors=[
            c(target=Model, attributes=["n_agents"], methods=calculate_energy),
            c(target=Wolf, attributes=["sheep_eaten"]),
            c(target=Sheep, attributes=["age"], methods=calculate_energy),
            c(target=model.agents, attributes=["energy"], agg={"energy": np.mean}),
        ]
    )

A few notes:

    c(target=Model, attributes=["n_agents"], methods=calculate_energy),

gives

    {
        f"{Model.__name__}_n_agents": {...},
        f"{Model.__name__}_{calculate_energy.__name__}": {...},
    }

Just one approach. Don't know if it's the best.
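To illustrate the naming scheme only, here is a small sketch of how the result keys for one collector could be derived. The helper `collector_keys` is hypothetical, and the `Model` class and `calculate_energy` function are empty stand-ins so the snippet runs on its own.

```python
def collector_keys(target, attributes=(), methods=()):
    """Hypothetical helper: names of the result entries for one collector."""
    if callable(methods):
        # Allow a single function as well as a list of functions.
        methods = [methods]
    prefix = target.__name__
    keys = [f"{prefix}_{attribute}" for attribute in attributes]
    keys += [f"{prefix}_{method.__name__}" for method in methods]
    return keys


class Model:
    """Stand-in for the real model class."""


def calculate_energy(obj):
    """Stand-in for a user-defined method."""
    return 0


keys = collector_keys(Model, attributes=["n_agents"], methods=calculate_energy)
```

Prefixing with the target's class name keeps entries from different targets (for example `Wolf` and `Sheep`) from colliding when they share an attribute name.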
-
There has been quite some discussion in various places about changing data collection. This is my attempt to think this through in some more detail. It is heavily inspired by a suggestion by @Corvince at some point.

In the general case, data collection means taking an object, extracting one or more attributes from it, and optionally applying a callable to the result. In specific cases, it might involve only an object and a callable applied to this object. The object can be the model, an agent, an agentset, a space, or some user-defined class.

So, it seems sensible to create a separate Collector class that implements this basic logic. Because the behavior of AgentSet is a bit different from other objects (i.e., AgentSet.get instead of relying on getattr), I believe it makes sense to have 2 Collector classes: BaseCollector and AgentSetCollector (PEP 20, flat is better than nested). Rather than burden the user with this distinction, it is possible to use a factory function (e.g., collect(obj, attrs, func=None)) to create the appropriate Collector instance.

Ideally, data should only be extracted once. So, in the case of the Boltzmann wealth model, the data collector should be smart enough to extract the wealth attribute only once from the agentset. This can relatively easily be realized by maintaining an internal mapping of all objects and the attributes to be retrieved from them. Moreover, it might be possible to extract all relevant attributes from a given object in one go, to avoid unnecessary iteration. This would, however, require a minor update to AgentSet.get so that attr_name takes a string or list of strings.

I believe it is possible to design and implement this new-style DataCollector so that the current one can be implemented on top of it for backward compatibility.

Like with the current DataCollector, data collection should happen whenever data_collector.collect is called. However, I believe it is paramount that the data collector also always extracts the current simulation time. Only by having the simulation time for each call to collect can you produce a clean and complete time series of the dynamics of the model. In fact, these time stamps could become part of the index/column labels of the DataFrames when turning the retrieved data into a DataFrame.

Like with the current DataCollector, it should be easy to turn any retrieved data into a DataFrame. This can easily be done through a to_dataframe method on the Collector class.

So, what could the resulting API look like?

So, does the basic idea of an object, retrieval of one or more attributes, and/or applying a callable make sense? Have I missed a key concern? Is there something obviously wrong or missing in the sketch of the API?
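The object / attributes / callable idea described in this post can be sketched roughly as follows. This is a hypothetical illustration, not the API sketch from the original post: a plain list stands in for AgentSet, the factory dispatches on container type, and each record stores the simulation time passed to `collect`.

```python
from types import SimpleNamespace


class BaseCollector:
    """Extracts attributes from a single object and optionally applies func."""

    def __init__(self, obj, attrs, func=None):
        self.obj = obj
        self.attrs = [attrs] if isinstance(attrs, str) else list(attrs)
        self.func = func
        self.records = []

    def _extract(self):
        return {name: getattr(self.obj, name) for name in self.attrs}

    def collect(self, time):
        data = self._extract()
        value = self.func(data) if self.func is not None else data
        # Always store the simulation time alongside the value.
        self.records.append({"time": time, "value": value})


class AgentSetCollector(BaseCollector):
    """Extracts attributes from every member of a collection."""

    def _extract(self):
        return {name: [getattr(a, name) for a in self.obj] for name in self.attrs}


def collect(obj, attrs, func=None):
    """Factory: pick the collector type so the user does not have to."""
    is_collection = isinstance(obj, (list, tuple, set))  # stand-in for AgentSet
    cls = AgentSetCollector if is_collection else BaseCollector
    return cls(obj, attrs, func)


agents = [SimpleNamespace(wealth=1), SimpleNamespace(wealth=3)]
model = SimpleNamespace(time=0, n_agents=len(agents))

wealth = collect(agents, "wealth")
total = collect(agents, "wealth", func=lambda data: sum(data["wealth"]))
size = collect(model, "n_agents")

for collector in (wealth, total, size):
    collector.collect(model.time)
```

This sketch does not address extracting each attribute only once across collectors; that would need the shared internal mapping of objects to attributes mentioned above.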