
Slow requests for static Media files when a lot of them are requested at the same time under Azure App Services (Web Apps) #14859

Closed
Piedone opened this issue Dec 7, 2023 · 33 comments
Milestone

Comments

@Piedone
Member

Piedone commented Dec 7, 2023

Describe the bug

We're seeing strangely slow requests for static Media files. This can't be explained by slow storage, and while both sites I've seen these on use Azure Blob Storage, I ruled out that being slow (and the requests are also slow if the files are already cached on the server's file system).

Perhaps related: #1634.

To Reproduce

I don't have a 100% repro, but this seems to be the rough playbook when the issue happens:

  1. Visit a page that has a lot of images from the Media Library, e.g. https://ikwileentaart.nl/broodjes or https://ikwileentaart.nl/brood.
  2. Sometimes observe that images load in 1-2 s or frequently even 10 s, despite their <100 KB size not warranting this. It's not just the client-side load time that's slow; the actual server request is served slowly too, as indicated by Application Insights.

Note that while the linked pages use image resizing, I've seen exactly the same issue on another site that loads images without any resizing.

The issue seems to correlate with increased CPU usage, but well below the server's capacity (<10%). Other metrics are either uncorrelated/normal or show the effect (like increased client receive time), not a probable cause. So, it doesn't seem to be simply that too many requests are coming in and the server is too low-spec to handle them.

I ruled out recent shell restarts with tracing. I added tracing to measure the runtime of the bodies of IMediaFileStore.GetFileInfoAsync(string path), GetFileStreamAsync(string path), GetFileStreamAsync(IFileStoreEntry fileStoreEntry), and IMediaFileStoreCacheFileProvider.IsCachedAsync(string path), SetCacheAsync(Stream stream, IFileStoreEntry fileStoreEntry, CancellationToken cancellationToken). Nothing was slow enough (at most ~200ms sometimes, but nothing in the order of magnitude of seconds, let alone 10s of seconds).

Something seems to throttle requests. Perhaps the lazy workers?

Expected behavior

Static Media files are served within milliseconds of the underlying storage's latency, so for locally cached files <1ms.

@jtkech
Member

jtkech commented Dec 8, 2023

ConcurrentDictionary + Lazy<Task> is a good pattern to ensure that a worker is executed only once. Hmm, but I'm not fully sure about what may happen when the worker removes itself from the cache on execution.

finally
{
    _workers.TryRemove(subPathValue, out var writeTask);
}

It might be good to keep it in the cache for a minimum amount of time, at least to be sure the cached file is recognized as existing (I saw that in SetCacheAsync() the file is first deleted in case of a partial download).

Otherwise, I would suggest using a regular async semaphore with a regular double check.
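A regular async semaphore with a double check could look roughly like this. This is a minimal, self-contained sketch only: the `_cached` dictionary and the `Task.Delay` stand in for the real `IsCachedAsync()`/`SetCacheAsync()` file operations, and `DoubleCheckedCacheWriter` is a hypothetical name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch of a SemaphoreSlim with a double check, as an alternative
// to the ConcurrentDictionary + Lazy<Task> worker scheme discussed above.
public class DoubleCheckedCacheWriter
{
    private readonly SemaphoreSlim _lock = new(1, 1);
    private readonly ConcurrentDictionary<string, bool> _cached = new();
    public int Writes; // counts actual cache writes, to show the worker runs once

    public async Task EnsureCachedAsync(string subPath)
    {
        // First (lock-free) check: avoid taking the lock for already-cached files.
        if (_cached.ContainsKey(subPath)) return;

        await _lock.WaitAsync();
        try
        {
            // Second check: another request may have cached it while we waited.
            if (!_cached.ContainsKey(subPath))
            {
                Interlocked.Increment(ref Writes);
                await Task.Delay(10); // simulate the actual file download/write
                _cached[subPath] = true;
            }
        }
        finally
        {
            _lock.Release();
        }
    }
}
```

A per-path semaphore (one per key, like the worker dictionary) would reduce contention further; the single lock above just keeps the sketch short.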

@Piedone
Member Author

Piedone commented Dec 8, 2023

Hmm, I'm uncertain about the whole worker code here, specifically LazyThreadSafetyMode.ExecutionAndPublication. Its documentation says:

    Locks are used to ensure that only a single thread can initialize a System.Lazy`1 instance in a thread-safe manner. Effectively, the initialization method is executed in a thread-safe manner (referred to as Execution in the field name). Publication of the initialized value is also thread-safe in the sense that only one value may be published and used by all threads. If the initialization method (or the parameterless constructor, if there is no initialization method) uses locks internally, deadlocks can occur.

Will this still work if the supplied delegate closes over different variables in different requests? It seems to me that it's not guaranteed here that it'll recognize both as being the same initialization method.

Do you think I can get some more data with further tracing here before using a profiler in prod?

@jtkech
Member

jtkech commented Dec 9, 2023

Good catch about the closure, I think, as I can see there is at least one error.

await _workers.GetOrAdd(subPathValue, x => new Lazy<Task>(async () =>

Should be replaced by

await _workers.GetOrAdd(subPathValue, subPathValue => new Lazy<Task>(async () =>

It's worth just trying this first; I will submit a PR.
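For reference, the single-execution guarantee of this pattern can be sketched as follows (illustrative only; the `Task.Delay` stands in for the real cache-write work, and `LazyWorkerDemo` is a hypothetical name). GetOrAdd may invoke the value factory more than once under contention, but only one Lazy<Task> is published, and ExecutionAndPublication ensures its delegate body runs once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class LazyWorkerDemo
{
    private static readonly ConcurrentDictionary<string, Lazy<Task>> _workers = new();
    public static int Executions; // how many times the worker body actually ran

    public static Task RunOnceAsync(string subPathValue)
    {
        return _workers.GetOrAdd(
            subPathValue,
            // Use the factory's key parameter rather than closing over a local,
            // as suggested above.
            key => new Lazy<Task>(
                async () =>
                {
                    Interlocked.Increment(ref Executions);
                    await Task.Delay(10); // simulate caching work for 'key'
                },
                LazyThreadSafetyMode.ExecutionAndPublication)).Value;
    }
}
```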

@Piedone
Member Author

Piedone commented Dec 14, 2023

Adding this here because it's not related to the PR now.

I did some testing with the code under #14869 and added some tracing. It turns out that what's slow isn't happening in MediaFileStoreResolverMiddleware. Rather, this line sometimes takes seconds (like 5 s, even if nothing else is really happening on the server). So, another middleware is slow.

I fired up the Application Insights profiler, and this was the hot path:

[screenshot: Application Insights profiler trace of the hot path]

The whole thread time here was 10 s, and a lot of time is added all around the place to contribute to that, but in the end, it comes down to AWAIT_TIME (see the docs) in microsoft.aspnetcore.staticfiles. This looks like file IO to me.

Drilling into this further with framework dependencies enabled we can see this:

[screenshot: profiler trace with framework dependencies enabled, showing NtTraceEvent]

NtTraceEvent? That's the Windows Event Log. However, there's not much activity in the event log; I don't see many new entries in it, for example.

Note that FileStream is slow too:

[screenshot: profiler trace showing slow FileStream calls]

Hmm, can it be that Media Cache is actually making things slower, since the file I/O of an Azure App Service is slow? I also checked out the App Service's IO metrics, but there is nothing really, with it maxing out at <6 MB/s.

@jtkech
Member

jtkech commented Dec 14, 2023

I haven't used Azure App Service for a while, but as I remember, depending on the plan (dedicated VM or not), it uses a network file system whose access times may depend on the usage of other apps and may not be so good, particularly when writing a file, like we do for caching.

@Piedone
Member Author

Piedone commented Dec 14, 2023

That's a great point, and yes, that's indeed the case with App Services. I'm looking into using a local drive for such caches to see if it makes a difference.

@Piedone
Member Author

Piedone commented Dec 14, 2023

It seems that we'd need to put the whole wwwroot folder of the app (i.e. what's under 0:/site/wwwroot/wwwroot/ in Azure) onto a fast drive, since that's where all such caches (is-cache for ImageSharp, ms-cache for Media files, sm-cache for Sitemaps) are. Perhaps it'd be worth putting App_Data on a local drive too if you don't store anything on the webserver (but e.g. Media in Blob Storage and shell settings in the DB), which you should. At that point, perhaps it'd be worth putting the whole of 0:/site/wwwroot, i.e. the entire application folder, on a local drive too, to aid performance, because in such use cases having it on a shared drive doesn't bring any benefits.

@Piedone
Member Author

Piedone commented Dec 14, 2023

So there are multiple approaches to this, perhaps depending on whether you run Windows or Linux (because why would it be simple?):

  • App Cache. This needs to be set for both the staging and prod slots.
  • Local Cache.
    • Only available under Windows. However, apparently, such caching shouldn't be much of an issue under Linux.
    • Max size is 2GB.
  • There's a C:\local folder at least on a Windows App Service Plan.
    • This is the VM's temporary storage.
    • This is ephemeral with files going away even on a restart.
    • %TEMP% is under it (better to use environment variables than hard-coded paths, since those can change).
    • Max size depends on the tier, but for paid tiers it starts at 11 GB, with around 9.5 GB free (you can check this in Kudu under Environment -> "C:\local usage"). So, this is much more usable space than the 2 GB of Local Cache.
  • Dynamic Cache
    • Looks like an automatic cache for the shared file system, just what we need.
    • Supposedly brings some nice performance improvements, though it's unclear how this translates to write performance (that's most probably what's slow for us, and it would only help if local writes happen first, with the files asynchronously transferred to the shared storage).
    • Enabled by default. Even if you don't have a corresponding configuration explicitly, in Kudu under Environment you can check WEBSITE_DYNAMIC_CACHE.

Further useful docs:

  • An overview of all the settings of Local Cache and Dynamic Cache is available here.
  • App Service file system details.
  • Bring Your Own Storage: "Do not enable Failed Request Tracing or Detailed error logging on the production slot for long durations." This seems like one of the culprits, though the requests in question didn't fail. I disabled it just in case.

Next, I'll try each of the caching options though I think 2 GB of Local Cache will be too low for anything useful.

@sebastienros
Member

Very good discussion, thanks. We might need some options to put the caches in specific locations on App Service. We do recommend putting App_Data in a persistent folder when running in Docker. The cached files should be placed on local (ephemeral) physical disks.

@sebastienros sebastienros added this to the 1.x milestone Dec 14, 2023
@Piedone
Member Author

Piedone commented Dec 14, 2023

It seems that we'd only need to set IWebHostEnvironment.WebRootPath to a local folder (like %TMP%) and that should solve some of the issues. I'm looking into this now.
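For the record, overriding the webroot would look something like this in Program.cs. This is a sketch: WebApplicationOptions.WebRootPath is the real ASP.NET Core API, but whether a temp folder is an appropriate target on App Services is exactly what's being tested here.

```csharp
using System.IO;
using Microsoft.AspNetCore.Builder;

// Point the webroot at a folder on the VM's local (ephemeral) drive.
var localWebRoot = Path.Combine(Path.GetTempPath(), "wwwroot");
Directory.CreateDirectory(localWebRoot); // the folder must exist before the host starts

var builder = WebApplication.CreateBuilder(new WebApplicationOptions
{
    Args = args,
    WebRootPath = localWebRoot,
});
```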

I wanted to check if our app would fit into the 2 GB Local Cache limit, but every attempt of mine to get the size of wwwroot failed (or takes hours from the Kudu PowerShell console), so I started to just download everything and check locally. It's still in progress (there's an immense amount of cache files and folders), but it seems that even if it'd fit now, it wouldn't be comfortable for the future: it's more than 1 GB already, and most of the affected tenants are still not downloaded (given that all cached media files, i.e. all versions of all media files, as well as the app binaries, should fit into this, it seems hopeless).

I've done some more profiling, and charts like this consistently come up as the hot path:

[screenshot: profiler hot path chart]

BTW sometimes I also see JIT being on the hot path, but that's something I'd expect:

[screenshot: profiler trace with JIT on the hot path]

Sometimes events-related things again:

[screenshot: profiler trace with event-related calls on the hot path]

And:

[screenshot: another profiler trace suggesting IO]

This again suggests IO issues to me.

@Piedone
Member Author

Piedone commented Dec 14, 2023

First I tried out App Cache by setting the App Service configuration WEBSITES_ENABLE_APP_CACHE = true and doing a new deployment. It's unclear to me whether App Cache is supposed to work under Windows; the docs talk about containers (which is a Linux thing under App Services), they're signed by the App Service Linux team, and the Local Cache docs warn about App Cache being a Linux substitute for the Windows-only Local Cache. In any case, it didn't seem to help, and I still got image responses that took 1-2 or even 10-40 s. I've opened an issue about this: Azure-App-Service/KuduLite#269

So, I tried setting IWebHostEnvironment.WebRootPath to %TEMP%\wwwroot, pre-creating the wwwroot folder there. The whole issue got a lot worse! With this, I got image requests that started at 1 s but averaged around 20 s... So, I suppose the culprit is found; we just need to do the opposite to fix it :D.

IWebHostEnvironment.WebRootPath has some curious behavior depending on the environment BTW, see the docs.

I'm not sure what to try next, since apparently, the local drive is slower than the network one, which makes little sense to me (it doesn't make sense that writing a couple dozen 200 KB images would cause an IO bottleneck either). In the meantime, it turned out that 2 GB will definitely be a lot less than what we need, so Local Cache is out of the question.

I have the following ideas next:

  • Disable local caching of files. Since we don't have options for this, this can only happen with some messy overrides. I'm also not sure about ImageSharp: it needs to cache resized files somehow, so some storage is still needed.
  • Map a Premium storage account to 0:/site/wwwroot/wwwroot/ as explained here and see if it'll be faster. If it is, it can serve as a temporary solution, but I'd argue that this issue shouldn't happen to begin with.

@sebastienros
Member

Disable local caching of files.

It has been designed this way; we can't stream the content of the files we serve.

One repro, though, could be to set up a site with static files and a static file provider for each different location, then run a test on each of these endpoints. I would expect it to behave the same way, but I worry that it would not show up as slow as you are describing, or more people would have mentioned it.

@Piedone
Member Author

Piedone commented Dec 14, 2023

I see, so we always need local files.

Do you mean, to test static file performance as raw as possible in multiple Azure regions? I'm not sure I have that kind of motivation in me :).

@sebastienros
Member

@Piedone no, I meant testing the different disks/folders we can use locally with just static files, no Orchard middleware. In the same region/subscription you are already using.

Just to repro the problem:
(a) if it repros, they can't tell us it's a problem in Orchard;
(b) if it doesn't repro, then we can investigate more and try to find what we did wrong.

@Piedone
Member Author

Piedone commented Dec 15, 2023

I'll see what I can do. For now, I did some testing with file mounts, following the docs. All of the tests below are with the same App Service as before, without any traffic apart from mine.

  • With a standard Hot tier file share the performance was even worse than with the local drive, going up to 1 minute for certain requests.
  • With a Premium tier file share (which is supposed to be a high perf SSD) it was sometimes worse than the default shared storage of the App Service (requests up to 30-40s), sometimes about the same (2-4s) but definitely not faster.

So, this is not a solution. However, I continued testing the Premium tier because (since you're closer to renting a piece of hardware than with the default App Service storage) you ought to get more consistent performance.

Opening a page with 26 resized images (https://ikwileentaart.nl/gebakjes) after a cache purge causes, according to the metrics of the file share, almost 2k transactions (and ~3 MB egress/ingress), while a page with 7 resized images (https://ikwileentaart.nl/mini-gebakjes) causes 735 transactions (1 MB egress/ingress). Opening the same pages once the caches are warm causes 286 transactions (900 KB egress, 55 KB ingress) and 155 transactions (675 KB egress, 35 KB ingress), respectively.

This seems like excessive storage usage in terms of transactions, which as I understand, are basically file operations:

[screenshot: file share transaction metrics]

I also checked simply opening the Media Library admin (also after a cache purge, but having visited it before so not JIT compilation), to rule out anything custom. Note that the images there are thumbnails and thus also resized. Loading the 10 images on a page there each has a latency of around 1s usually, with up to 5s. This causes 450-600 transactions per page (with 0.5-1 MB egress and ingress).

Opening the second page of Media Library (with empty browser cache but with a warm Asset cache), just to be sure that the only thing that happens is loading those 10 images, causes 113 transactions (with 140 KB egress that corresponds to the file size of the images loaded, and 20 KB ingress for some reason). The third page with 8 images caused 91 transactions.

The server-side latencies (previously I talked about client-side latency, i.e. what you see in a browser) of these requests were around 50-150 ms (averaging at 97 ms). While this is not huge, we're talking about 10-25 KB images that should be served a lot faster by an idle server/storage IMO.

Finally, I also loaded a single, unresized but at that point uncached 10 KB image (https://ikwileentaart.nl/media/VADERDAGCAKE.jpg), which took (the server-side latency) 86 ms with 17 storage transactions. https://ikwileentaart.nl/media/Unknown-21.jpeg took 40 ms and 8 transactions BTW. Doing the same again for the now cached images took 10 transactions and 56 ms, and 7 transactions and 63 ms, respectively. Keep in mind that the only thing this Web App did was serve these images.

So, it seems that OC is doing something excessive, because a handful of small files causes a large number of transactions. I can't comment much on the image resizing that ImageSharp does, but for the simple asset caching I'd expect perhaps 3 operations per file (one for the existence check, one to write it to the cache, and one to load it), but the metrics show at least 8. For reads, I'd expect at most 2 (existence check, load), but I see around 7. Now, I'm not sure if the number of transactions is the bottleneck here, but I can only see that and egress/ingress as metrics, and the latter two certainly don't seem like something that should even register.

Most possibly the storage transaction latency dominates (which is supposed to be single-digit ms for Premium storage). In this sense, storage here is more like database access: it's less of an issue what you do in a transaction, and more how many transactions you have, because the latency of each transaction will add up. Since Azure App Services use shared network drives, it's not just that you're accessing shared slow HDDs, but those are also across a network (and if you use Premium storage with SSDs, those are still across a network).

Even with local HDDs (which is not unusual for a simple webserver) you can expect ~5 ms latencies at least, so these can quickly add up.

@jtkech
Member

jtkech commented Dec 15, 2023

For info: for image resizing, related to the is-cache (which is also used for non-remote files, unlike the ms-cache, which is only used to cache remote source files), it also creates/manages an ImageCacheMetadata *.meta file.

https://github.com/SixLabors/ImageSharp.Web/blob/609670442be603bf1c2e82a12302c3c02f581ed5/src/ImageSharp.Web/ImageCacheMetadata.cs#L29-L41

For example, to check if the cached image has expired and needs to be re-processed.

https://github.com/SixLabors/ImageSharp.Web/blob/609670442be603bf1c2e82a12302c3c02f581ed5/src/ImageSharp.Web/Middleware/ImageSharpMiddleware.cs#L477-L486

But when there are no commands (e.g. resizing commands), the is-cache middleware is normally short-circuited.

https://github.com/SixLabors/ImageSharp.Web/blob/609670442be603bf1c2e82a12302c3c02f581ed5/src/ImageSharp.Web/Middleware/ImageSharpMiddleware.cs#L239-L244

Note that our middleware managing the ms-cache runs before the IS is-cache middleware (whether that one is short-circuited or not), and that at the end the file is served by the regular .NET static file middleware, which runs later in the pipeline.

@Piedone
Member Author

Piedone commented Dec 15, 2023

I'm not sure what to try next. It seems that without diving into how ImageSharp manages file caching, and how ASP.NET serves static files, we can't really optimize file access.

While I'm quite confident, I'm not 100% sure that file IO is indeed our problem, or whether it's simply raw single-file caching/loading performance. The latter doesn't look good, but if it were the only issue, then we'd see consistent ~100 ms response times for file requests (though I'd want no more than 10 ms). A large part of the time we do see that, but often it's rather 1-2 or 10-20 s, so there should be something else too.

My hunch is that there's some storage usage burst throttling going on behind the scenes. I didn't find any info about the storage of an App Service, but for file shares, which they kind of use, this is a documented feature (see "burst credits" for Premium storage, and "Max IOPS" for standard storage). Then this would cause these huge latencies on pages when a bunch of images are loaded all at once.

This would kind of explain the issue, since e.g. the above-mentioned 2000 storage operations for a cold page view would go over a standard file share's max IOPS for 100 ms (which is "1,000 or 100 requests per 100 ms", whatever this "or" means). So, in the worst case we'd be delayed 2 s for such a page view, which is in line with what we see (and keep in mind that this is a single page view; a lot else is happening on the web server storage-wise, also for other requests).
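The worst-case math above, spelled out:

```csharp
using System;

// Back-of-the-envelope check of the throttling hypothesis: ~2,000 storage
// operations for a cold page view, against a standard file share capped at
// 1,000 IOPS, would be spread over about 2 seconds in the worst case.
int coldPageViewOperations = 2000; // measured above for /gebakjes
int standardShareMaxIops = 1000;   // "1,000 or 100 requests per 100 ms"

double worstCaseDelaySeconds = (double)coldPageViewOperations / standardShareMaxIops;
Console.WriteLine(worstCaseDelaySeconds); // prints 2
```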

Why is local storage even slower? I didn't find info about what kind of VM an App Service runs on, but for Standard App Services they most probably use HDDs (and Premium definitely uses SSDs). I have no idea which HDD tier would be used, but looking at the size table, we'd most probably get one with around 500 max IOPS. This being shared among everything that happens on that VM (running Windows, IIS, and everything else apart from serving that one request) can exhaust it immediately.

The other app where we see this issue runs on the Premium P3v3 plan, and thus its backing SSD might have the same performance as a 64, or at most 128 GB Premium SSD (since the free local storage on such App Services starts at around 62 GB). This would offer at most 500 IOPS as a baseline with 3500 IOPS bursting. This is much lower than premium file shares, but due to a lower latency can still be interesting. Next, I'll check out whether using a local webroot will help on a Premium App Service.

@Piedone
Member Author

Piedone commented Dec 15, 2023

BTW we consistently see low performance (with 4-5 s server response times) for /Admin/Media/Upload and /Admin/Media/GetMediaItems too, which seems like the same issue.

The stats of the slowest requests for us are almost all static files.

@jtkech
Member

jtkech commented Dec 15, 2023

Sorry for not being able to help more, except by sharing some hypothetical thoughts.

Yes, all the middlewares use the file system: our middleware, the IS middleware (as we configure it), and the static file middleware. This is because we assume that the file system is faster than Blob Storage, without taking Azure App Service into account, it being considered just one specific case among others.

But my feeling is that OC should be well adapted to Azure App Service because it is often used there. So maybe we should provide the ability to override the cache configuration, to use for example AzureBlobStorageImageResolver and AzureBlobStorageCacheResolver instead.

https://github.com/SixLabors/ImageSharp.Web/blob/609670442be603bf1c2e82a12302c3c02f581ed5/src/ImageSharp.Web.Providers.Azure/Resolvers/AzureBlobStorageImageResolver.cs#L13

https://github.com/SixLabors/ImageSharp.Web/blob/609670442be603bf1c2e82a12302c3c02f581ed5/src/ImageSharp.Web.Providers.Azure/Resolvers/AzureBlobStorageCacheResolver.cs#L13
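If we went that way, the wiring would presumably look something like this (a sketch based on the ImageSharp.Web.Providers.Azure package; the container name and connection string are placeholders, and how this would plug into OC's existing ImageSharp setup is an open question):

```csharp
using SixLabors.ImageSharp.Web.Caching.Azure;
using SixLabors.ImageSharp.Web.DependencyInjection;

// Cache resized images in Blob Storage instead of the local file system.
services.AddImageSharp()
    .Configure<AzureBlobStorageCacheOptions>(options =>
    {
        options.ConnectionString = "<storage-connection-string>"; // placeholder
        options.ContainerName = "is-cache";                       // placeholder
    })
    .SetCache<AzureBlobStorageCache>();
```

This trades local file IO for Blob Storage round-trips per cached image, so whether it's a net win on App Services is exactly what would need measuring.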

@Piedone
Member Author

Piedone commented Dec 15, 2023

Those could definitely help with the resizing use case and we could try them. We still need to do something with the simple static file access use case. Perhaps a similar Blob Storage-specific implementation would help there.

@Piedone Piedone changed the title Slow requests for static Media files when a lot of them are requested at the same time Slow requests for static Media files when a lot of them are requested at the same time under Azure App Services (Web Apps) Dec 17, 2023
@Piedone
Member Author

Piedone commented Dec 17, 2023

Did some further testing:

Higher file share IOPS: I thought about increasing the size of the premium file share from 100 GB, since this would also increase the max IOPS from 3100 and the burst IOPS from 10000. However, I didn't do this because even at 1000 GB the burst IOPS would stay the same, and the base max IOPS would only increase to 4000.

Linux: I did some perf testing with Linux App Services before, in 2021, and found that Linux on the same tier served requests with 25% higher latency on average (in a production scenario for DotNest, running it for about a week). Nevertheless, I wanted to see if it makes any difference now. While the file share type used by Linux and Windows App Services seems to be the same kind (though I'm not sure about the file system), the locally used file system is surely different (EXT4 vs NTFS) even if the hardware is the same, so I figured there might be better performance (since EXT4 is really good with a lot of small files).

Here are the results. I used the Code publish model, i.e. not Docker, S1, the same as for the Windows testing.

Premium file share (see Windows results here): It's roughly the same but a bit slower (1-6s responses), and the reason is that interestingly, it issues a lot more storage transactions. The previously mentioned https://ikwileentaart.nl/gebakjes page with 26 resized images causes 5910 transactions instead of ~2k under Windows, while https://ikwileentaart.nl/mini-gebakjes produces ~700 ms responses with 1.7k storage transactions (vs 735 under Windows). The increased storage usage is very curious. Egress/ingress is in the same ballpark.

I also repeated the Media Library test: The first page seemed slightly faster, with images loading in 1-2s (1k transactions vs <600 on Windows). The second page was slower with 1-5s responses and 2k transactions...

Local temp folder (see Windows results here): This doesn't seem to be as well supported as under Windows, since Kudu doesn't display storage usage for the local folder, nor do the docs go into any details (as opposed to Windows). Anyway, I used a subfolder of Path.GetTempPath().

This, just as with Windows, was much slower, with resized files loading in 1-55s, mostly closer to the latter (and even for small pages 1-6s). The whole process was visibly throttled.

I also tried App Cache without any special further config (i.e. OC used the default webroot). This seems to be the same perf as the Windows default.

@jtkech
Member

jtkech commented Dec 19, 2023

Sorry, I didn't have much time, but let me make sure I'm following you.

So, as I understand it, the biggest issue is when many images are resized, but there is still an issue even if the images have no resizing commands and even if they are already cached. It would be interesting to know how non-media static files are served, but maybe you already tried that.

I'm not sure at all, but I may have found something. First, though: are you sure that the ASPNETCORE_ENVIRONMENT variable is not set to Development?

What I saw is that in ShellFileVersionProvider, there is AddFileVersionToPath(), which is called if you are using appendVersion with AssetUrl, AnchorTagHelper, MediaAnchorTag, ImageTagHelper and so on. If that's the case, fileProvider.Watch() is called on line 106, knowing that IMediaFileProvider is an IVirtualPathBaseProvider whose VirtualPathBase value equals /media by default.

// Perform check against VirtualPathBase.
if (!fileInfo.Exists &&
    fileProvider is IVirtualPathBaseProvider virtualPathBaseProvider &&
    virtualPathBaseProvider.VirtualPathBase.HasValue &&
    resolvedPath.StartsWith(virtualPathBaseProvider.VirtualPathBase.Value, StringComparison.OrdinalIgnoreCase))
{
    resolvedPath = resolvedPath[virtualPathBaseProvider.VirtualPathBase.Value.Length..];
    cacheEntryOptions.AddExpirationToken(fileProvider.Watch(resolvedPath));
    fileInfo = fileProvider.GetFileInfo(resolvedPath);
}

So maybe all or part of the problem is that you have many file watchers; if that's the case, we could find a way to not use them, or at least change the way these files are watched.

By default, FileSystemWatcher is used to listen to file change events for Watch(string). FileSystemWatcher is ineffective in some scenarios such as mounted drives. Polling is required to effectively watch for file changes. The default value of this property is determined by the value of the environment variable named DOTNET_USE_POLLING_FILE_WATCHER.

https://github.com/dotnet/runtime/blob/c282395b63c1757d4f4c1dc2e236680cfe2e7f96/src/libraries/Microsoft.Extensions.FileProviders.Physical/src/PhysicalFileProvider.cs#L73-L87
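If watchers turn out to be the problem, polling can be forced either via that environment variable or per provider instance. A sketch (the path is a placeholder; whether this helps on App Services is untested):

```csharp
using Microsoft.Extensions.FileProviders;

// Force polling instead of FileSystemWatcher, which the docs above say is
// ineffective on mounted drives. Equivalent in effect to setting
// DOTNET_USE_POLLING_FILE_WATCHER=1 for providers that honor the variable.
var provider = new PhysicalFileProvider("/path/to/wwwroot") // placeholder path
{
    UsePollingFileWatcher = true,
    UseActivePolling = true, // actively raise change tokens instead of waiting on consumers
};
```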

@Piedone
Member Author

Piedone commented Dec 20, 2023

No worries, thank you for helping, JT.

Yes, the biggest issue is when many (>10) images are resized simultaneously, like when loaded on one page (which is the case for any gallery/brochure-like page with thumbnails). A smaller, but still similarly pronounced issue is when such images are simply loaded at the same time, when their backing storage is Azure Blob Storage.

While I didn't specifically test non-media static files, no such files are there in the top 100 slowest URLs for DotNest.

In the tests I've elaborated above, ASPNETCORE_ENVIRONMENT was Staging. However, in the apps we see the perf issues in production, they're Production.

Hmm, interesting idea about the file watchers. While I can definitely see these becoming a problem when the app runs for a while, note that in my tests the performance issue is immediately apparent when opening a page with many images after a new app start.

I wanted to see what the actual storage operations are. So, under the storage account's diagnostics settings I added one config that enabled all the logging.

Here's what accessing an uncached image (/media/VADERDAGCAKE.jpg, without a cache busting parameter, and directly by opening the URL) without resizing produces:

[screenshot: storage log of operations for the uncached image]

This is after it was cached:

[screenshot: storage log after the image was cached]

Not much time is spent, but it's 13 transactions, roughly what I've seen earlier. If this were a single request, no problem. However, if dozens like this are issued simultaneously, the app will get throttled.

A single resized image is requested for the first time directly via its URL (/media/producten/belegde-broodjes/broodje-boerenham.jpg?width=400&height=300&token=XI8Rt826N3MTVmVEpMxuahkqvETYJfV%2FwbLfGSB3p8c%3D):

[screenshot: storage log for the first request of the resized image]

After it was cached:

[screenshot: storage log after the resized image was cached]

I then also clicked around on Media Library admin, opening its first page, then the other two too, with 28 resized images being shown and freshly resized. This, as before, caused some requests to take seconds.

[screenshot: storage log for the Media Library admin pages]

So, all in all, a lot of storage transactions. Note that these are not just file reads/writes.

@jtkech
Member

jtkech commented Dec 21, 2023

Yes, many operations; it's interesting to see all the operations, for example on the *.meta files.

I tried to analyze the simplest use case of non resized and already cached media files.

In the end, all static file providers are involved until one finds the file. We can see this: there are operations on .../ms-cache/... but also an operation on .../media/..., done by the regular provider, I think.

The Create operation seems to be used before anything, without meaning that a file is created; for example, the regular provider only checks if the file exists, which results in a Create operation.

So for me: file exists or directory exists => Create operation.

Leaving aside the IoCtl and Tree protocol operations, it seems that Create is done before anything, and then a Close operation is done before and after any concrete file operation.

So checking if a file exists and then reading it => Create, Close, Create, Read, Close.

So under wwwroot, checking .../media/..., then checking .../ms-cache/... and reading the file results in 6 operations; these + 7 protocol operations = 13 operations.

In fact, I think that one transaction may be related to multiple operations. Anyway, there are many operations, the weirdest being the Create operation where in fact nothing should be created. Is it an issue? I don't think so, but it might be worth confirming with the Azure people.

@Piedone
Member Author

Piedone commented Dec 21, 2023

Interesting... We do need to do something about this, though, since it's quite a peculiarity that we can serve complex content pages at a high r/s within tens of ms, but serving a 10 KB image routinely takes seconds (an issue that can only be alleviated if you use a CDN, which is a good practice in any case, but still).

@sebastienros would you be able to kindly ask some App Services people to chime in here?

@Piedone
Copy link
Member Author

Piedone commented Dec 25, 2023

Another workaround is to use Response Caching in a reverse proxy, or Output Caching within the app, to cache /media requests (taking care not to cache authenticated requests, while still caching the ones that set cookies). This would essentially add an in-memory layer of caching to Media files. A better approach would be to do that directly within the middleware instead.
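For the Output Caching variant, a rough sketch, assuming .NET 7+'s built-in output caching (the path check and duration are illustrative assumptions, not what OC ships):

```csharp
// Hedged sketch: blanket output caching for /media requests in ASP.NET Core.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOutputCache(options =>
    options.AddBasePolicy(policy => policy
        // Only consider static Media requests for caching. Note that the
        // default policy also skips authenticated requests and responses that
        // set cookies; per the comment above, the latter would need a custom
        // policy to still be cached.
        .With(context => context.HttpContext.Request.Path.StartsWithSegments("/media"))
        .Expire(TimeSpan.FromMinutes(10))));

var app = builder.Build();

app.UseOutputCache();

app.Run();
```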

@Piedone
Copy link
Member Author

Piedone commented Dec 26, 2023

We could have something like a CachedWebRootFileProvider that live.asp.net used.

@Piedone
Copy link
Member Author

Piedone commented Dec 27, 2023

That can help with reads (i.e. when unresized Media, or an already cached resized image, is accessed) but not with writes (including accessing resized images for the first time). Since IFileProvider is only for read-only file access, we'd need to roll our own abstracted read-write webroot storage provider and change consumers to use that (which is not that daunting, BTW; see below for the current usages).

[screenshot]

Since there are existing in-memory IFileProviders (like this and this one), we could use them, either directly or as an inspiration at least.

Or alternatively, at a smaller scale, we could replace IMediaFileStoreCacheFileProvider with an in-memory implementation (this can also be done from a module). This would work for a single server node; for multi-node, we'd need distributed caching. RemoteMediaCacheBackgroundTask needs to be disabled in this case. I looked into this, but it seems quite complex with watchers and everything; making it distributed would be even more so (at which point the value of such caching, vs. a locally appearing but shared file system, probably diminishes too).

The ImageSharp.Web cache needs to be changed too. We could use the blob one, as JT mentioned above; most probably that'd be faster than a mounted File Share, since despite similarly using an Azure Storage account, the AzureBlobStorageImageCache can be more efficient with transactions.

Hmm, blanket output caching for /media might just be easier.

@Piedone
Copy link
Member Author

Piedone commented Dec 27, 2023

Hmm, perhaps a cached webroot IFileProvider would be the easiest to tackle. It wouldn't really need anything more than what I linked above:

  • We won't need priming (files can be cached as they're requested) so we don't have a bunch of cache entries for sleeping tenants or files that nobody accesses.
  • In the end, unresized Media files and resized images by ImageSharp are all served from the webroot, so that's the lowest common denominator. Sitemaps wouldn't be covered since those go through SitemapController but we can live with that.
  • File watchers would still be opened for the underlying files, so cache invalidation would work the same.
  • While this would speed up the most important read scenario (accessing already cached files), freshly caching files (including by ImageSharp) would still be slow.

I'll check this out.
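A read-through caching IFileProvider along those lines could look roughly like this (a hedged sketch inspired by the live.asp.net CachedWebRootFileProvider; the class name is illustrative):

```csharp
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.FileProviders;
using Microsoft.Extensions.Primitives;

// Hedged sketch of a read-through caching IFileProvider; all names here are
// illustrative, not part of Orchard Core.
public class CachingWebRootFileProvider : IFileProvider
{
    private readonly IFileProvider _inner;
    private readonly IMemoryCache _cache;

    public CachingWebRootFileProvider(IFileProvider inner, IMemoryCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public IFileInfo GetFileInfo(string subpath) =>
        _cache.GetOrCreate(subpath, entry =>
        {
            // A file watcher on the underlying provider evicts the entry when
            // the file changes, so cache invalidation works the same as before.
            entry.AddExpirationToken(_inner.Watch(subpath));
            // A production version would also buffer the file's contents into
            // memory here (as live.asp.net's implementation did), not just the
            // IFileInfo metadata.
            return _inner.GetFileInfo(subpath);
        });

    public IDirectoryContents GetDirectoryContents(string subpath) =>
        _inner.GetDirectoryContents(subpath);

    public IChangeToken Watch(string filter) => _inner.Watch(filter);
}
```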

@Piedone
Copy link
Member Author

Piedone commented Dec 28, 2023

Well, no... I think I'm only now starting to really grasp how the whole thing works, with its myriad of providers and middlewares.

When it comes to standard unresized Media files, the IFileProvider that'll in the end be used to serve them by ASP.NET Core's StaticFileMiddleware is DefaultMediaFileStoreCacheFileProvider.

For resized images, MediaResizingFileProvider will use (in the end) DefaultMediaFileStoreCacheFileProvider to load the original image, ImageSharp will resize and save it somehow, and then serve it via ImageSharpMiddleware, reading it via PhysicalFileSystemCache and in the end PhysicalFileSystemCacheResolver. So, this is not touching the webroot IFileProvider at all, and StaticFileMiddleware is not serving the images.

So... For some in-memory caching, we'd need the following:

  • Wrap DefaultMediaFileStoreCacheFileProvider into a caching implementation, or provide a fully in-memory one. Wrapping MediaOptions.StaticFileOptions.FileProvider is not enough, since we also need IFileStoreCache.IsCachedAsync() to come from the cache instead of a file existence check. However, we could perhaps add the same IMediaFileStoreCacheFileProvider-implementing singleton as a decorator for IMediaFileProvider and IMediaFileStoreCache (these are the services that DefaultMediaFileStoreCacheFileProvider replaces; see from here). This can be a file-watching read-only cache, so cache invalidation is given; or it can be a fully in-memory implementation, which is a lot more complex to implement.
  • Implement an in-memory IImageCache. This can be fully in-memory or, perhaps more prudently, also a read-only cache (so files are still written to the file system, but served from memory the second time). This can be a decorator on PhysicalFileSystemCache.

An alternative is still (in-memory) output caching.
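A decorator along the lines of the second bullet might look roughly like this (a hedged sketch against ImageSharp.Web's IImageCache; the class name and the cache duration are illustrative assumptions):

```csharp
using Microsoft.Extensions.Caching.Memory;
using SixLabors.ImageSharp.Web.Caching;
using SixLabors.ImageSharp.Web.Resolvers;

// Hedged sketch: a read-through decorator over PhysicalFileSystemCache that
// serves repeat hits from memory while still persisting images to disk.
public class MemoryBackedImageCache : IImageCache
{
    private readonly PhysicalFileSystemCache _inner;
    private readonly IMemoryCache _memoryCache;

    public MemoryBackedImageCache(PhysicalFileSystemCache inner, IMemoryCache memoryCache)
    {
        _inner = inner;
        _memoryCache = memoryCache;
    }

    public async Task<IImageCacheResolver> GetAsync(string key)
    {
        // Serve from memory if this image has been resolved once already.
        if (_memoryCache.TryGetValue(key, out IImageCacheResolver resolver))
            return resolver;

        resolver = await _inner.GetAsync(key);
        if (resolver is not null)
        {
            // A production version would likely buffer the image bytes, not
            // just the resolver, to fully avoid disk IO on repeat reads.
            _memoryCache.Set(key, resolver, TimeSpan.FromMinutes(30));
        }

        return resolver;
    }

    public Task SetAsync(string key, Stream stream, ImageCacheMetadata metadata)
        // Writes still go to the file system; the next read populates memory.
        => _inner.SetAsync(key, stream, metadata);
}
```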

@Piedone
Copy link
Member Author

Piedone commented Jan 8, 2024

I'm testing the Azure Blob cache of ImageSharp. It seems quite useful, so we could have a feature for it: #15016.

@Piedone
Copy link
Member Author

Piedone commented Mar 12, 2024

I did some longer experiments with DotNest, with various IS cache approaches. I tried combinations of S1 and P0V3-tier Azure App Services; storing the wwwroot (including the ImageSharp cache) either locally in the standard way (which for App Services means using the Standard-tier file share it uses under the hood) or in separately mounted Standard or Premium (SSD) file shares; and keeping the ImageSharp cache in separate Standard/Premium Blob Storage accounts (see #15016).

Results:

Metrics for OC GitHub.xlsx

You get the best performance by putting the wwwroot folder, including the ImageSharp cache under it, onto a Premium file share mounted under that folder path in the App Service. Slow requests did still happen, but only occasionally, and those seem to be due to the webserver's CPU being a bottleneck (when resizing images), or to the Blob Storage account used for Media storage. Moving the latter to a Premium tier would most probably eliminate the IO issues completely, without significantly impacting the costs.

@Piedone
Copy link
Member Author

Piedone commented Jun 18, 2024

Not much to do in OC here, in the end.

You can do the following to make Media requests faster, all of which basically amounts to throwing money at the problem; I don't see any glaring opportunity for optimization:

  • Use Premium Blob Storage to store Media.
  • Use a Premium Azure File share mounted as the wwwroot of your App Service. That way, copying from Blob Storage to the "local" folder, and accessing that, will be much faster.
  • Use an App Service that's faster than the single-core S1.
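For the second item, mounting a share can be done with the Azure CLI, something along these lines (all resource names are placeholders, and the exact mount path depends on your App Service's OS and setup):

```shell
# Hedged example: mount a Premium Azure File share into an App Service.
# Replace the resource names and the mount path with your own values.
az webapp config storage-account add \
  --resource-group my-rg \
  --name my-app \
  --custom-id wwwroot-share \
  --storage-type AzureFiles \
  --account-name mypremiumstorage \
  --share-name wwwroot \
  --access-key "$STORAGE_KEY" \
  --mount-path /home/site/wwwroot
```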

@Piedone Piedone closed this as not planned Won't fix, can't repro, duplicate, stale Jun 18, 2024
@MikeAlhayek MikeAlhayek modified the milestones: 2.x, 2.0 Sep 7, 2024