EES-4932 Bulk import: allow users to import multiple data sets from a single zip file #5030
Conversation
General comment on this PR:
I think it's probably too late to change now, but it seems like we have a lot of extra complexity for handling zip files and bulk zip files, such as the way we record the zip file as a File entry in the database along with their dedicated FileType, link it with DataImport, retain the zip file in blob storage, ensure we don't delete a bulk zip unless all of its data sets have been deleted, etc, etc.
The validation which is spread across the Admin and Data Processor is also pretty complicated.
I think the problem stems from the Admin using a mix of the database and the blob storage as part of its communication with the Data Processor to trigger imports, in addition to the queue message.
It feels like zip files should just be a simple transport mechanism for large files, and that after extraction we should be able to trigger the creation and processing in the same way as uncompressed file uploads.
I appreciate it's not simple for bulk uploads because you need to check for duplicate files and have a way of providing the data set titles.
If I was starting this work again I would look at whether it would be feasible to change the Admin to proxy the upload request to the Data Processor and let it do all the creation and handling. If the request is not valid, nothing would exist in the database or in blob storage. At most the Admin might peek at the stream to check it's zip encoded before passing on the call to the Data Processor.
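Purely as an illustrative aside (nothing like this exists in the PR or the codebase - the helper name and everything else here is hypothetical), that "peek at the stream" check might be sketched as:

// Hypothetical sketch: check for the ZIP local file header signature (PK\x03\x04)
// before proxying the request body on to the Data Processor.
private static async Task<bool> LooksLikeZip(Stream body)
{
    var header = new byte[4];
    var bytesRead = await body.ReadAsync(header.AsMemory(0, 4));

    // Assumes the body has been buffered (e.g. via Request.EnableBuffering())
    // so it can be rewound before being forwarded on.
    body.Position = 0;

    return bytesRead == 4
           && header[0] == 0x50 && header[1] == 0x4B
           && header[2] == 0x03 && header[3] == 0x04;
}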
The Data Processor would unpack the zip file, identify pairs of data/metadata files, check for duplicates, and check the manifest contains titles if there's more than one pair.
It would not be concerned with the name of the zip file itself because this would never be stored in the database.
If this extraction fails it would neither create anything in the database nor upload anything to blob storage and errors would be returned back up to the user via the HTTP request chain.
If it succeeds it would create the set of release data and metadata files in the database and then queue processing as with a regular upload.
It might be a bit optimistic to say we could get away without needing to stage the file anywhere, and it would probably involve some clever handling for large files, but it would get away from using the database to communicate the 'zip' part of the request, which we don't care about after the import.
src/GovUk.Education.ExploreEducationStatistics.Admin/Controllers/Api/ReleasesController.cs (outdated comment - resolved)
[Produces("application/json")]
[ProducesResponseType(200)]
[ProducesResponseType(404)]
[RequestSizeLimit(int.MaxValue)]
Is the intention to limit the request size to 2GB here?
Just wondering because int.MaxValue would be 2147483647 bytes, which is 2GB, but it's not clear why int.MaxValue has been chosen. The value type is long, so a value higher than 2147483647 can be set.
I think the annotation [DisableRequestSizeLimit] might be a better option here.
Note that the Admin runs in the in-process model and is fronted by IIS when running in Azure environments (at least until we make a change to containerise the service), so the IIS request limit of 2GB that's set in web.config will be the limit in non-local environments.
Also just wondering if you have done any testing of large zip file uploads for this new bulk upload feature? In the testing for EES-3503 @N-moh found that uploading a 2GB file was taking 9 mins or more, and that would be just for the upload, excluding the processing.
Is the intention to limit the request size to 2GB here?
Frankly, I was just copying the other endpoints. I could switch to DisableRequestSizeLimit as you suggest, but the current attribute has been in place for a long time and I'm wary of changing something that could potentially break things without spending the appropriate time investigating it.
My inclination is to create a ticket to do this separately. But if you're confident in the change, I'm happy to do it - to be honest I'm probably worrying over nothing here.
Also just wondering if you have done any testing of large zip file uploads for this new bulk upload feature?
I haven't - I'm inclined to leave that to the testers. If this goes to the dev branch and we do have a problem, a small migration will be required to revert the change, but otherwise it should be fine.
I think it would be pretty quick to test, just so we can check our understanding of what's going on here. If you take away the annotation completely, I think you should hit an error if you try to upload a file bigger than 28.6MB, which is the default Kestrel request limit (we only use Kestrel when running locally).
We haven't overridden serverOptions.Limits.MaxRequestBodySize to increase that global 28.6MB limit within our existing Kestrel configuration. See:
webBuilder.ConfigureKestrel(serverOptions =>
{
serverOptions.AddServerHeader = false;
});
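(Purely as a hedged sketch, not something this PR changes: if we ever did want to raise that global limit explicitly, it would look something like the following.)

webBuilder.ConfigureKestrel(serverOptions =>
{
    serverOptions.AddServerHeader = false;
    // Default is ~28.6MB (30,000,000 bytes); setting this to null removes Kestrel's limit entirely.
    serverOptions.Limits.MaxRequestBodySize = 2_147_483_648L; // explicit ~2GB cap
});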
Once you've verified that 28.6MB limit, if adding the [DisableRequestSizeLimit] annotation allows you to upload bigger files, then I think it should be good to swap to it instead of using [RequestSizeLimit(int.MaxValue)].
Alternatively, to enforce a 2GB limit, I think it would be better to make that explicit by using [RequestSizeLimit(2147483648)].
I've switched to using [DisableRequestSizeLimit], and made the same change across the whole codebase.
I'll leave a note on the ticket to test with large files (although I'm sure they would anyway).
src/GovUk.Education.ExploreEducationStatistics.Data.Processor/Services/ValidatorService.cs (outdated comment - resolved)
src/GovUk.Education.ExploreEducationStatistics.Data.Processor/Services/DataArchiveService.cs (outdated comment - resolved)
-using var dataFileReader = new StreamReader(await datafileStreamProvider.Invoke());
-using var csvReader = new CsvReader(dataFileReader, CultureInfo.InvariantCulture);
+await using var stream = await datafileStreamProvider.Invoke();
+using var streamReader = new StreamReader(stream);
+using var csvReader = new CsvReader(streamReader, CultureInfo.InvariantCulture);
I'm wondering what the benefit of changing this is as it seemed ok before.
This was more for clarity of knowing, at a quick glance, that all streams will be closed due to being in a using block.
I think you said in a previous PR that you had to check whether certain streams were getting closed? My thought is this makes it crystal clear.
It seems like calling close on csvReader and dataFileReader was enough, but for completeness I'd prefer it to be clear that .Close() gets called on the datafileStreamProvider.Invoke() stream too. Even if it is unnecessary, it means nobody who sees this in the future has to check if they're unsure.
src/GovUk.Education.ExploreEducationStatistics.Admin/Services/DataArchiveValidationService.cs (4 outdated comments - resolved)
src/GovUk.Education.ExploreEducationStatistics.Admin/Services/FileUploadsValidatorService.cs (outdated comment - resolved)
Replying to Ben's general comment above
I did mostly use the same pattern for BulkZips that the existing DataZips used. If we hadn't done that, it would have involved refactoring the existing DataZip functionality before starting the BulkZip work. I agree that if we instead uploaded files to the Data Processor and it handled all the validation and importing, that would be simpler, but in terms of getting from where we were initially to having bulk imports, that extra refactoring adds a lot of work - and I mean a lot of work, in my opinion - and I don't think it's justified.

If there's one way in which I'm inclined to think you may have a point, it's that refactoring now might make implementing some future requirement much easier in a way that is difficult to foresee. The reason I discount that is that it's a what-if, while the time saved in implementation is a concrete saving we gain right now. Not that I'm saying you couldn't be right - you have much more experience than me and I'd be a fool to totally discount your opinion :)

I don't think the bulk zip work puts a refactor like the one you want totally out of the question - but I agree it puts it further out of reach. That said, how much extra work would it be to do later compared to now? I think refactoring now would have been more efficient than refactoring later, but I'm not sure it's a huge saving.
If this ends up with the Admin dealing with all the validation - say with the screener added - and the Processor with importing, I'm inclined to think that is okay. It doesn't feel to me like the complexity will get out of hand - of the stuff I touched, I found it fairly easy to follow. So while I am keen to minimise complexity everywhere, if we had extra time to spend there are many other places where I think the complexity is getting out of control - something like untangling the publication of methodologies and releases (which in turn made redirects a nightmare) would be a much better place to spend that time, I think.
public class ArchiveDataSetFile(
    string title,
    string dataFilename,
    string metaFilename,
    long dataFileSize = 1048576,
    long metaFileSize = 1024)
{
    public string Title { get; } = title;

    public string DataFilename { get; } = dataFilename;

    public string MetaFilename { get; } = metaFilename;

    public long DataFileSize { get; } = dataFileSize;

    public long MetaFileSize { get; } = metaFileSize;
}
As an alternative I think it would be safe to use the following instead
public record ArchiveDataSetFile(
    string Title,
    string DataFilename,
    string MetaFilename,
    long DataFileSize = 1048576,
    long MetaFileSize = 1024);
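(A hedged aside: a positional record like this gives get/init properties plus value equality rather than reference equality, and constructor call sites of the form new ArchiveDataSetFile("Some title", "data.csv", "data.meta.csv") - illustrative arguments only - would keep compiling unchanged, including the two default file sizes.)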
@@ -8,12 +8,12 @@ namespace GovUk.Education.ExploreEducationStatistics.Common.Services.Interfaces
{
    public interface IFileTypeService
    {
        Task<string> GetMimeType(IFormFile file);
I was going to suggest this method in FileTypeService could be made private now, but I see it's still being used by unit tests, and I'm guessing you already considered this and decided to keep them?
I didn't - but now you point it out I'd be inclined to keep the unit tests, even if it's really testing dependencies for the most part.
return mimeType == "application/octet-stream"
    || mimeType == "application/zip";
I noticed that the comment // i.e. is zip file looks like it applies to the "application/octet-stream" side of the condition as well, but I think that's for a different type of file, not zip, which is also best checked using mime detective?
Not a big deal, just an observation while reading the code.
After this has been moved from Common to Admin, it needs #nullable enable as we still don't have that set globally in the Admin project.
using System.Threading.Tasks;
using CsvHelper;
I think there are a few unused imports in this class, including this one.
}

-    private string GuessMagicInfo(Stream fileStream, MagicOpenFlags flag)
+    private string GuessMagicInfo(Stream fileStream, MagicOpenFlags flag)
That's okay, but I regularly format all my changed files and spend time wondering whether to pollute my own changes with fixes for pre-existing formatting issues or undo them. It also doesn't make it easy to work out why/when a change was made if it's attributed to another issue later on.
This PR adds the ability to upload multiple data sets at once when importing new data sets.
Bulk zip files contain a dataset_names.csv which includes the data set title for each data set, and a file base name. The base name is used to find the actual data files also included in the zip, which it expects to be named [basename].csv and [basename].meta.csv. Examples of valid and invalid bulk zip files can be found in src/GovUk....Admin.Tests/Resources.
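Purely for illustration (these file names and the manifest layout are made up - the real format is defined by the examples in the test resources mentioned above), a bulk zip might look like:

bulk-import.zip
  dataset_names.csv         <- manifest: one row per data set, pairing a base name with a data set title
  absence_by_term.csv
  absence_by_term.meta.csv
  exclusions.csv
  exclusions.meta.csv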
The new ReleaseController#UploadAsBulkZip method can return multiple errors rather than one error at a time as we've done in the past. This is because bulk zips are liable to be large files and it's easy to make a mistake when creating them, so this work was done on the backend as a convenience for the user, who might otherwise have to upload-fix-upload-fix the same file multiple times.

Even though this work to support multiple errors has been done in the backend, the frontend still returns only one error at a time - mostly just so we can get this feature out quicker. A ticket will be created to add support for multiple errors to the frontend.
This work also includes a refactor of ReleaseDataFileService's existing Upload and UploadAsZip methods, mostly around how they're validated. UploadAsZip previously wasn't validating all the same cases as Upload. All upload methods now return multiple errors - which was necessary to create reusable validation methods for all types of upload - FileUploadsValidatorService#ValidateDataSetFilesForUpload is called by them all and handles most validation for a specific data set. DataImportService#Import now supports all types of upload. The Data Processor basically treats DataZips and BulkDataZips the same, treating each "import" as an individual data set. BulkDataZips call the Data Processor multiple times, once per data set.

Other changes
ReleaseDataFileService#Delete has been updated to not remove the source zip file if other data sets still exist with that zip as a source.
has been changed a lot. This was started as I wantedIsValidCsv
to accept a stream rather than a stream provider, and then basically cascaded into a major refactoring.ErrorViewModel
file from Admin - we're using the one in Common.