Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat]: Separate the Workspace name from its data directory fields #582

Open
Arcitec opened this issue Nov 22, 2024 · 10 comments
Open

[Feat]: Separate the Workspace name from its data directory fields #582

Arcitec opened this issue Nov 22, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@Arcitec
Copy link
Contributor

Arcitec commented Nov 22, 2024

Describe your use-case.

  • This helps with the RunPod cloud feature by decoupling the local machine workspace path and name, so that we can use a hardcoded workspace path on the server (to help with auto-cleanup features). It also means that the user's local-machine data path can be anything they want locally and it won't hurt the server.
  • It also helps local users by making it easier and less error-prone to change workspace names. No more "oops I deleted too many characters of the path" or "oops I only changed one field".
  • It encourages people to actually use more workspaces.

What would you like to see as a solution?

workspace_name

  • A new field for Workspace Name.
  • The Workspace Directory and Cache Directory fields understanding {name} as a placeholder which expands to the Workspace Name value.
  • New code in the config parser which automatically applies this path expansion as early as possible in OneTrainer's internals, so that the project code only needs minimal changes for this feature.
  • Since it's just one variable, the path expansion can be done easily via workspace_dir = config.workspace_dir.replace("{name}", config.workspace_name), which could be wrapped in a module.utils.expand_path(config, config.workspace_dir) utility function taking the config (to grab the name from and possible future variables) and the string that we need to expand.
  • Creating a new utility function also makes it easier to implement other things such as ~/ prefix expanding to Path.home() (which I think works on Windows too). In fact, there's expanduser which explicitly expands such paths. See example implementation below.
  • New tooltip text for the Workspace Directory and Cache Directory fields which explains that {name} can be used.
  • One concern would be what to write to the workspace's config-backup (keeping the unexpanded {name} placeholders in the backup too would be ideal, to make the backup config portable).
  • Migration code: Split the old path values on the final / component, take the last component of workspace_dir as name, use the rest as directory: <the dir>/{name} and cache: <the cache dir>/{name}. This splitting while migrating could be achieved via x = pathlib.Path(oldvalue). Then to get the dir, we use str(x.parent) and to get the name we use x.name (name is always a string already).

Here is an example cross-platform implementation of path expansion:

def expand_path(config: dict, p: str) -> Path:
    p = p.replace("{name}", config["workspace_name"])
    return Path(p).expanduser()

Results:

>>> expand_path({"workspace_name": "foo"}, "workspace/{name}")
PosixPath('workspace/foo')
>>> expand_path({"workspace_name": "foo"}, "~/workspace/{name}")
PosixPath('/home/johnny/workspace/foo')
@Arcitec Arcitec added the enhancement New feature or request label Nov 22, 2024
@Arcitec
Copy link
Contributor Author

Arcitec commented Nov 22, 2024

The Cloud upload code by DXQ could then read the name property and always hardcode the server data destination:

  • Server: workspace/{name}. This means we can easily apply automated cleanups of old data in workspace/ on the server to reduce RunPod server storage costs (as requested by a user). Because we'll know exactly where the server workspaces and caches are, and we won't have to do any hacky parsing of the local system's paths to get an appropriate name for those per-project server paths.
  • Client: User now has a local, decoupled Workspace Directory/Cache Directory, which means that the Cloud code can cleanly and effortlessly upload from/download to a local path which is totally different than the server's storage path. So the user might store all their data in C:\AI Projects\Workspaces\{name} and the server would use workspace/{name} for example.

@Arcitec
Copy link
Contributor Author

Arcitec commented Nov 22, 2024

I had a quick look. I see that the project currently uses config.workspace_dir and config.cache_dir and treats them as strings. So a bit of refactoring would be necessary, to have a unified location where these are expanded and parsed a single time per loaded config (probably directly via the fields in TrainConfig.py).

At the same time, that would be a good time to remove os.path.join() stuff and use Pathlib natively instead, where you just do new_path = old_path / "backups/foo/bar" to add things onto the path, for example. This also involves changing all parameters to Path where they used to take str.

Simultaneously, making sure that when TrainConfig serializes itself back to JSON, it writes the original non-expanded strings. Could be done by remembering those exact values as strings, while storing the expanded (used by OneTrainer at runtime) paths in other TrainConfig fields instead. So the TrainConfig would hold both the str version and the expanded Path version of each dir. Possibly via config.workspace_dir: Path and config._workspace_dir: str, signifying that the latter is some internal data which should not be used by people accessing the config. object.


Having looked at this, I can do it, but don't have the time at the moment. And don't know when I'll have time. Could be days, a week or much more. It will depend on having free time. So if someone wants to implement it, let me know.

@dxqbYD
Copy link
Collaborator

dxqbYD commented Nov 22, 2024

Good idea, but I'd apply it differently. You might want different cache dirs between runs, but usually you want the cache dir shared.

What you usually want changed between runs, for example to test multiple parameters, is:

  • Workspace directory
  • Model Output Destination
  • Save Filename Prefix
  • [a prefix for Samples, in case you use the same workspace directory]

Currently I usually change these 3 things, and often forget one.

@Arcitec
Copy link
Contributor Author

Arcitec commented Nov 22, 2024

but usually you want the cache dir shared.

In that case, you'd just set the workspace cache to a static path without {name} variable. So that's supported with the current plan.

Model Output Destination

Save Filename Prefix

Really good idea to support {name} variable there too!

[a prefix for Samples, in case you use the same workspace directory]

Yeah we could make the samples automatically prefix themselves with the name too. And that could be hardcoded since it's not something interesting to tweak.


I had a bit more thinking about where to place the "variable expander" code. And I think it belongs as a method on the TrainConfig class. Because literally everything that needs config values passes around the config object already.

So that class could have a config.expand_vars(s: str) -> Path method which expands {name} (at the moment that's it) and the ~/ prefix via the expand_path code shown in the first post.

It returns a Path object, because this means we can concatenate the return value with other Path objects to easily create the full paths. The Path concatenation logic also automatically gives us support for path overriding in individual setting fields, because if the concatenated part is an absolute path, it takes over:

>>> from pathlib import Path
>>> a = Path("/foo/bar")
>>> b = Path("/hello/world")
>>> a
PosixPath('/foo/bar')
>>> b
PosixPath('/hello/world')
>>> a / b
PosixPath('/hello/world')
>>> b = Path("relative/path")
>>> a / b
PosixPath('/foo/bar/relative/path')

Here's the general outline of what needs to be done, whether I do it or someone else beats me to it:

  • Add the config.expand_vars() method to TrainConfig.
  • Review every code instance of workspace_dir and cache_dir and change them to use a Path object rather than a str. This includes their type-hints and usage, including checking wherever the variables get passed onto further, deeper functions.
  • Search the code for os.path.join() and similar oldschool path techniques and replace them with Path. This again needs careful review to make sure variable typehints and function parameters match the new behavior.
  • When a config is loaded from disk/changed in the GUI: Keep the raw, original string values in config._<var name> and the expanded Path variants in config.<var name>.
  • When writing a config to disk as JSON, write the config._<var name> string variants of those raw strings. This preserves the unexpanded text that the user entered.
  • Note: A potential alternative solution which I'm leaning more towards, is to create a new SmartPath class, which takes a config object and a str in its constructor, and then sets two fields: The raw string it was created from, and the Path object it expanded into. Then just make a JSON serialize method on it which only serializes the string object. This has many upsides, the best of which are that JSON serialization is automatic and the TrainConfig fields are very clean (no "internal" fields). The only downsides are that it needs an extra . dot to get the actual path field of those config fields, such as config.workspace_dir.path (since config.workspace_dir would be the SmartPath class object), and that type-hints when passing those paths around will be a bit special (unless we pass the internal .path value to those functions instead), but the small drawbacks seem worth it. Especially since most usage seems to read the path from the config. object rather than passing the raw field data around, so the impact for function parameter signatures is minimal.
  • Add the config migration algorithm described in the first post, which extracts the appropriate workspace name from old configs.
  • After this foundation is done, expand more config fields as suggested by dxq.

@Nerogar
Copy link
Owner

Nerogar commented Nov 22, 2024

I think adding a user configurable placeholder name makes sense. But I don't really understand how this would help cloud training in any way. Can you explain that a bit more? The user would still be able to configure the workspace directory.

Adding the expand functionality into the TrainConfig class is the right decision, but I wouldn't use anything more complicated than a simple string replacement:

  • I hate inconsistent code. Paths are passed around as strings everywhere at the moment. I don't want to change this unless it's done everywhere
  • It's a pretty big change with a very limited benefit. It would need a lot of testing.

Also, I'm a bit confused what you mean by expanding ~/. That's just normal path syntax. There isn't really anything that needs to be expanded unless you want to get an absolute path. Which isn't very useful in most cases.

One thing I'm not really sure about: How do we design the user interface so it isn't too confusing? I don't want to change the current default directories for the reasons dxqbYD already mentioned. You probably don't want to change the cache directory for every run. And adding a new name field that doesn't do anything by default could be pretty confusing. Also, adding this field for example on the general tab would be really confusing if it also impacts data on other tabs (like the save prefix).

@dxqbYD
Copy link
Collaborator

dxqbYD commented Nov 22, 2024

I think any relation to cloud training is a misunderstanding. It doesn't need any placeholder, I understood this as a separate feature.

You can find the path adjustments done for cloud training here: https://github.com/dxqbYD/OneTrainer/blob/e42585e8314cf71562764ac3f64222f1b494a4bd/modules/trainer/CloudTrainer.py#L165

remote_dir is config.cloud.remote_dir, which is for example /workspace on the cloud - but it's not the same directory as the OneTrainer workspace directory. RunPod just named it that way too, but it's not the same directory.

@Arcitec
Copy link
Contributor Author

Arcitec commented Nov 22, 2024

But I don't really understand how this would help cloud training in any way. Can you explain that a bit more?

Yeah, the cloud training can then use the workspace name property for its own server directory. To easily control both the local directory and the server directory by naming each project via a single OneTrainer field. Thus avoiding directory clashes/tedious manual editing in multiple parts of the GUI.


You can find the path adjustments done for cloud training here: https://github.com/dxqbYD/OneTrainer/blob/e42585e8314cf71562764ac3f64222f1b494a4bd/modules/trainer/CloudTrainer.py#L165

This is a good example of a cloud issue that can be solved with this. Currently, it hacks the config path to try to convert it to a cloud-usable path. The cloud code could instead use workspace/{name} as its fixed path on the server, making sure we always know where the workspace is stored, and never having issues with invalid server paths.

Ensuring clean separation between local user paths and the RunPod server's paths.


Paths are passed around as strings everywhere at the moment. I don't want to change this unless it's done everywhere.

Yeah if I have time to do this (which is likely but not until I have less other projects on my plate), I will change it everywhere by looking for all path str usage and os.path calls.

Pathlib was added to Python because str paths are such a pain to work with. Pathlib makes every path into first-class objects where we can analyze and modify anything we want via simple fields and calls, like .name, .suffix, .with_suffix(".png"), .parent, etc.

It's a great help when writing code. Full IDE autocompletion, all the calls we need, calls can be chained (like p.parent.with_suffix(".png") easily, etc. No need to import and mess around with legacy os.path code.

Another benefit of passing them around as Paths is that concatenation is easy, as mentioned: We can just do p / "sub/dir.txt" to add to a path, or p / otherpathobj to concatenate two paths, etc.

The third benefit is that it has brilliant methods for walking directory trees, globbing files matching patterns, etc.

The fourth benefit is that pathlib has built-in methods for cross-platform path manipulation, which is a huge help when paths are being forwarded from Windows machines to Unix-based servers, for example. On Windows, the Path will be stored as a WindowsPath internally, but we call .as_posix() whenever we need it as a POSIX-style path.


It would need a lot of testing.

Yeah it would, and should not be merged until that's fully verified. I'll be testing meticulously as usual.


I'm a bit confused what you mean by expanding ~/. That's just normal path syntax.

If we don't expand it, the files will be written into a relative directory under the current working dir, named ~, rather than writing into the home directory. For example, C:\OneTrainer\~\foo. :) Rather than what the user wanted (which is C:\Users\johnny\foo).

That's why expansion of ~ is necessary. Python then converts that prefix to an absolute path to the user's home directory, making it do what the user expected. There's built-in cross-platform methods in pathlib for doing that ~/ prefix expansion, using the call I showed at the bottom of the first post.


How do we design the user interface so it isn't too confusing? Adding this field for example on the general tab would be really confusing if it also impacts data on other tabs (like the save prefix).

I think it makes sense and will be easy to understand when applied as follows:

  • General is the general settings for the project.
  • The "Workspace Name" on the general tab is the name for the current project. We could even call it "Project Name" if that's clearer since that's less mixing of existing terms.
  • Change the default workspace dir to workspace/{name}. That way people only need to change a single field, "Project Name", when they start different training attempts with different settings. And they no longer need to be careful to avoid accidentally removing the workspace/ prefix since name is finally a separate field.
  • Keep the default cache dir as workspace-cache/run, since sharing a cache seems like a common use-case (I wasn't aware). Makes sense. But maybe default it to workspace-cache/shared for brand-new configs?
  • The suggested migration code for old configs can automatically extract their current workspace names. Alternatively, we can leave the old path values as-is when people load old configs, and instead just default the "Project Name" in an unused way for old configs (where it has a value, but not being used by any of the path fields). That makes sense too. Then people are in full control of adopting it for old configs if they want. But... migration of their previous name in a way that becomes universally reliable is easy to do, so we can still try it and see how it feels.
  • Automatically apply it as prefix to "samples" filenames, as dxq suggested. No config GUI needed for that. And his idea was that this helps when sharing samples dirs between multiple projects.
  • Note: TrainConfig should reject empty names. This can best be achieved via .strip() followed by checking if the result is empty, and if it's empty, immediately default it to whatever our default project name will be. So that the value can never be empty. This is important if we're going to use it as prefix and as variable in various places.
  • Tooltips: Add tooltip info to any fields where {name} / variable expansion is supported. We'd just need a good, consistent, brief tooltip wording to explain it, perhaps by adding this at the bottom of their tooltips: This field supports variable expansions, such as "{name}".? The wiki can then list the available expansions. At the moment, {name} and ~/. Later, it would be easy to add more to the expansion function if people have a true need for any other variables.
  • Phase 1: Implement it for workspace_dir and cache_dir to keep the scope small, and merge that. This includes implementing JSON serialization (a SmartPath class makes that easiest, but will look around at the code when the time comes, to see how the current TrainConfig class serialization works).
  • Later phase after the core feature has been implemented and merged: Support it in some other, useful fields. But do not apply it by default anywhere else (except if it makes more sense to do that). Dxq listed some fields where it would be useful.

@Arcitec
Copy link
Contributor Author

Arcitec commented Nov 22, 2024

Actually, I will keep the string type for now, because it reduces the amount of work.

My mind always defaults to Path since it's vastly superior, but that's not necessary for implementing this particular feature. And when it's needed (for other features such as cloud and rsync paths), we can always str(Path(config.workspace_dir).as_posix()) to use its transformations while staying as strings.

So then the work will be:

  • Store the raw, unexpanded input value/JSON config string in config._workspace_dir.
  • Store the expanded string in config.workspace_dir. This makes it a drop-in change.
  • Look through the code to make sure nothing writes to the config variable after the user GUI/JSON loader has set its value. If so, make some decisions about how to update the raw field too.
  • When serializing the config, use the raw strings (not the expanded strings). Hopefully that's not too complicated, since I know that OneTrainer already has its own, custom serialization method. But I really need to look into how it can handle having an internal config class field whose runtime value must be separate from the raw, unexpanded value that's written to/read from the JSON file.
  • Decide on a default value for "Project Name". What should the project be called when they haven't named it? "default" sounds too much like it's a dynamic placeholder (which it isn't). "project1" could be great and is a good indication to the user that "hey, you can and should manually set this to your own names if you have multiple projects".
  • Decide on the default value for "Workspace Cache". Probably the static string "workspace-cache/shared", to be a clear indication that people can create per-project caches by changing its value. The tooltip will also mention that variable expansion is available for use.

This greatly cuts down the amount of hard work to implement this feature. 👍

@Nerogar
Copy link
Owner

Nerogar commented Nov 22, 2024

Yeah, the cloud training can then use the workspace name property for its own server directory

That would change the behavior between cloud training and local training. Not something a user would or should expect. It should be very clear and intuitive what each setting does.

Change the default workspace dir to workspace/{name}. That way people only need to change a single field, "Project Name", when they start different training attempts with different settings. And they no longer need to be careful to avoid accidentally removing the workspace/ prefix since name is finally a separate field.

There is no requirement that the workspace needs to be saved under workspace/. That's just the default location. But it could be anywhere on the users file system.

The suggested migration code for old configs can automatically extract their current workspace names

No, it can't. Because the workspace could be anywhere.

Note: TrainConfig should reject empty names

That's even more confusing. If I want to set the workspace directory to this/very/specific/path, why should I need to set a name if it's not even used anywhere?

-Store the raw, unexpanded input value/JSON config string in config._workspace_dir.
-Store the expanded string in config.workspace_dir. This makes it a drop-in change.

Properties of the TrainConfig class are directly saved in the json file with their names. Using private (underscore prefixed) names doesn't makes sense for this class. Also, actually storing the expanded path isn't really possible, because you never know when the unexpanded string changes. The structure of the config class is intentionally kept simple to serve as a single contact point between ui/cli code and trainer code. There shouldn't be any assumptions about the use of the properties beyond setting and getting values.

The better approach would be a new function that returns the expanded path. Similar things are already done for other values in that class.


Honestly, the more I think about this idea, the more I believe it just complicates things without providing any benefit.

  • Changing the workspace directory: already possible in a single input field
  • Automatically changing the samples/saves names: not very useful if you actually set different work spaces for every training run (as you should do anyway)
  • Automatically changing the cache directory: shouldn't be done anyway, as discussed earlier
  • Using the name for a cloud training workspace name: This changes the already established behavior for local training, which would confuse a lot of people.

@Arcitec
Copy link
Contributor Author

Arcitec commented Nov 23, 2024

That would change the behavior between cloud training and local training. Not something a user would or should expect. It should be very clear and intuitive what each setting does.

That is necessary for this scenario:

  • User wants the finished models to be written to their OneTrainer GUI's workspace_dir path, which is their local machine path.
  • Training happens on the server in workspace/<name> instead, in a reliable server location which is totally unrelated to the local machine.
  • Right now, the local and cloud paths are identical, handled via a single workspace_dir GUI field, and THAT is a freaking confusing mess (if the user wants to store the workspace in "C:\foo\bar\baz" then the current RunPod code has to convert that to a cloud-compatible path in a messy way). So all the confusion comes from only having a single workspace_dir field to represent both local and remote paths. Which obviously doesn't work optimally! At least if we have a project name, we can generate a unique server-side path dynamically, totally independently from whatever local dir that the user will be uploading/downloading data from.

There is no requirement that the workspace needs to be saved under workspace/. That's just the default location. But it could be anywhere on the users file system.

Yes, and people can change the default to anything they want, of course. A default is just a default.

The suggested migration code for old configs can automatically extract their current workspace names
No, it can't. Because the workspace could be anywhere.

It can. I described the algorithm in the first post.

If the old config uses /media/AI/OneTrainer Workspaces/foo, migration would set it to:

  • Project Name: foo
  • Workspace Directory: /media/AI/OneTrainer Workspaces/{name}

Result: Exactly the same path is being used after automatic migration.

But as I said, automatic migration of their hardcoded path from existing configs is not necessary and can be skipped.

That's even more confusing. If I want to set the workspace directory to this/very/specific/path, why should I need to set a name if it's not even used anywhere?

You wouldn't need to set a name. You would just ignore the name field and let its default "project1" value stay like that. Like when people ignore the 19382 other fields that are visible but not used in OneTrainer unless very specific options are clicked or specific values are entered. This is no different.

The better approach would be a new function that returns the expanded path. Similar things are already done for other values in that class.

Ah okay. Well I wanted to avoid having to call "expand it" in every single consumer of the config.workspace_dir field. By instead only expanding it when the config is loaded and whenever the user changes the value in the GUI.

Normally this would be solved by having a TrainConfig class where workspace_dir is a class instance (such as SmartPath, as mentioned earlier). A SmartPath class could be implemented with one internal field for the "raw path" and one for the "expanded path".

Then SmartPath then implements a setter and getter, and automatic JSON serialization. So when the TrainConfig.workspace_dir field is serialized, it saves the raw (unexpanded) string to the config JSON file.

That can be achieved in python via __setattr__ and __getattr__ on TrainConfig to have it act transparently: When something reads config.workspace_dir it receives a string (by automatically reading from the str field of the SmartPath object), and when something sets config.workspace_dir = "foo" the SmartPath object writes to its internal, raw, unexpanded string and updates the expanded value, thus updating both.

This means there's a single point which handles reading/writing conversion and serialization transparently.

This can be implemented even with the existing architecture of TrainConfig.

(There's also another technique, with methods marked @property and @download_dir.setter to make methods execute during set/get, but then we'd need to use a hidden attribute with self._download_dir to store the raw value, which seems like a no-go for this class.)


Now whether all of this is worth doing:

  • It helps in RunPod by having a clear "this is the project name" field which the cloud algorithm can use for storing isolated per-project workspaces on the server without any clashes with the local machine's paths. Whereas the user's local data will be stored in the workspace dir written in the OneTrainer GUI.
  • It makes it easier to change the workspace name because the user doesn't have to carefully select and backspace until they reach workspace/ without accidentally deleting until worksp (happens a lot unless I am super careful).
  • I thought it was worth having different caches per-project so I always changed both fields. Now I know that that's not a good idea. So that benefit is gone. So Workspace Cache would use a static, shared path by default.
  • But it's still super useful for other fields such as Model Output filename: models/{name}.safetensors for example. Beautiful, dynamic filenames for the result of the project's completion. Helps avoid accidentally overwriting models by forgetting to change this field. Dynamically uses the project name for greater ease of use (by default; but people can of course rewrite it to a static path in their config anytime).
  • For these reasons, I think it's worth doing. But seeing the architecture problems of OneTrainer makes me much less motivated to attempt to add the feature, to be honest. The current code makes too many assumptions and is too interconnected and inflexible (passing around string paths, possibly modifying and writing back a different path to the TrainConfig object), and the inflexible serialization/deserialization makes it clunky to add this feature. We might as well drop it. It's just gonna make things a bit more convenient for users. But is a hassle to implement in OneTrainer. Feel free to close the ticket if you don't see any value vs the effort it would take to implement it. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants