EES-4932 Bulk import: allow users to import multiple data sets from a single zip file #5030

Merged
merged 32 commits into dev from EES-4932 on Jul 18, 2024

Conversation

Collaborator

@mmyoungman mmyoungman commented Jul 4, 2024

This PR adds the ability to upload multiple data sets at once when importing new data sets.


Bulk zip files contain a dataset_names.csv which lists, for each data set, the data set title and a file base name. The base name is used to find the actual data files also included in the zip, which are expected to be named [basename].csv and [basename].meta.csv.
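
For illustration, a bulk zip containing two data sets might be laid out like this (the file names are made up for the example):

    bulk-upload.zip
    ├── dataset_names.csv
    ├── absence_rates.csv
    ├── absence_rates.meta.csv
    ├── exclusion_rates.csv
    └── exclusion_rates.meta.csv

with dataset_names.csv mapping each base name to a data set title along these lines (the exact column headers here are an assumption rather than taken from the validation code):

    file_name,dataset_name
    absence_rates,Absence rates by characteristic
    exclusion_rates,Exclusion rates by geographic level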

Examples of valid and invalid bulk zip files can be found in src/GovUk....Admin.Tests/Resources.

The new ReleaseController#UploadAsBulkZip method can return multiple errors rather than one error at a time as we've done in the past. This is because bulk zips are liable to be large files and it's easy to make a mistake when creating them, so this work was done on the backend as a convenience for the user who might otherwise have to upload-fix-upload-fix the same file multiple times.

Even though this work to support multiple errors has been done in the backend, the frontend still returns only one error at a time - mostly just so we can get this feature out quicker. A ticket will be created to add support for multiple errors to the frontend.

This work also includes a refactor of ReleaseDataFileService's existing Upload and UploadAsZip methods, mostly around how they validate their input. UploadAsZip previously wasn't validating all the same cases as Upload.

All upload methods now return multiple errors, which was necessary to create reusable validation methods for all types of upload: FileUploadsValidatorService#ValidateDataSetFilesForUpload is called by all of them and handles most of the validation for a specific data set.
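
As a rough sketch of the accumulate-errors pattern (the type and checks below are simplified stand-ins, not the real FileUploadsValidatorService code):

    using System.Collections.Generic;

    // Stand-in for the error view model returned to the frontend.
    public record UploadError(string Message);

    public static class DataSetFileValidation
    {
        public static List<UploadError> ValidateDataSetFile(
            string dataFilename, long dataFileSize, bool isValidCsv)
        {
            var errors = new List<UploadError>();

            // Each check appends to the list instead of returning early...
            if (!isValidCsv)
            {
                errors.Add(new UploadError($"'{dataFilename}' must be a valid CSV file"));
            }

            if (dataFileSize == 0)
            {
                errors.Add(new UploadError($"'{dataFilename}' must not be empty"));
            }

            // ...so the caller receives every failure for the data set in one response.
            return errors;
        }
    }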

DataImportService#Import now supports all types of upload. The Data Processor treats DataZips and BulkDataZips essentially the same, with each "import" being an individual data set; a BulkDataZip simply results in the Data Processor being called multiple times, once per data set.
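
Conceptually, the bulk path just fans out into one import per data set found in the zip, something like the sketch below (the interface and method names are hypothetical, not the real service API):

    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Hypothetical stand-in for whatever triggers a single Data Processor import.
    public interface IDataSetImporter
    {
        Task ImportDataSet(string title, string dataFileName, string metaFileName);
    }

    public static class BulkZipImport
    {
        // A bulk zip fans out into one import per data set it contains, so the
        // Data Processor sees each one exactly like an ordinary single import.
        public static async Task ImportAll(
            IDataSetImporter importer,
            IEnumerable<(string Title, string BaseName)> dataSets)
        {
            foreach (var (title, baseName) in dataSets)
            {
                await importer.ImportDataSet(title, $"{baseName}.csv", $"{baseName}.meta.csv");
            }
        }
    }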

Other changes

  • ReleaseDataFileService#Delete has been updated to not remove the source zip file if other data sets still exist with that zip as a source (see the sketch after this list).
  • FileTypeService has changed a lot. This started because I wanted IsValidCsv to accept a stream rather than a stream provider, and it then cascaded into a major refactoring.
  • Removed an unused ErrorViewModel file from Admin - we're using the one in Common.
  • I've done a lot of renaming - mostly subject/dataSetName to dataSetTitle.
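
Regarding the first bullet above, the guard in Delete is roughly of this shape, assuming the File table records the originating zip via something like a SourceId column and that EF Core's AnyAsync is available (a sketch only, not the actual implementation):

    // Only remove the source zip's blob and File row once no other data set
    // file still references that zip as its source. "SourceId" and the helper
    // below are assumptions for illustration.
    var zipStillInUse = await contentDbContext.Files
        .AnyAsync(f => f.SourceId == zipFile.Id && f.Id != deletedFile.Id);

    if (!zipStillInUse)
    {
        await DeleteBlobForFile(zipFile); // hypothetical helper
        contentDbContext.Files.Remove(zipFile);
        await contentDbContext.SaveChangesAsync();
    }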

@mmyoungman mmyoungman force-pushed the EES-4932 branch 2 times, most recently from c256f63 to 3d4c0fe on July 5, 2024 13:30
Collaborator

@benoutram benoutram left a comment


General comment on this PR:

I think it's probably too late to change now, but it seems like we have a lot of extra complexity for handling zip files and bulk zip files, such as the way we record the zip file as a File entry in the database along with its dedicated FileType, link it with DataImport, retain the zip file in blob storage, ensure we don't delete a bulk zip unless all of its data sets have been deleted, etc.

The validation which is spread across the Admin and Data Processor is also pretty complicated.

I think the problem stems from the Admin using a mix of the database and the blob storage as part of its communication with the Data Processor to trigger imports, in addition to the queue message.

It feels like zip files should just be a transport mechanism for large files, and that after extraction we should be able to trigger creation and processing in the same way as for uncompressed file uploads.

I appreciate it's not simple for bulk uploads because you need to check for duplicate files and have a way of providing the data set titles.

If I was starting this work again I would look at whether it would be feasible to change the Admin to proxy the upload request to the Data Processor and let it do all the creation and handling. If the request is not valid, nothing would exist in the database or in blob storage. At most the Admin might peek at the stream to check it's zip encoded before passing on the call to the Data Processor.

The Data Processor would unpack the zip file, identify pairs of data/metadata files, check for duplicates, and check the manifest contains titles if there's more than one pair.

It would not be concerned with the name of the zip file itself because this would never be stored in the database.

If this extraction fails it would neither create anything in the database nor upload anything to blob storage, and errors would be returned to the user via the HTTP request chain.

If it succeeds it would create the set of release data and metadata files in the database and then queue processing as with a regular upload.

It might be a bit optimistic to say we could get away without needing to stage the file anywhere, and it would probably involve some clever handling for large files, but it would avoid using the database to communicate the 'zip' part of the request, which we don't care about after the import.

[Produces("application/json")]
[ProducesResponseType(200)]
[ProducesResponseType(404)]
[RequestSizeLimit(int.MaxValue)]
Collaborator


Is the intention to limit the request size to 2Gb here?

Just wondering because int.MaxValue would be 2147483647 bytes which is 2Gb, but it's not clear why int.MaxValue has been chosen. The value type is long so a value higher than 2147483647 can be set.

I think the annotation [DisableRequestSizeLimit] might be a better option here.

Note that the Admin runs in the in-process model and is fronted by IIS when running in Azure environments (at least until we make a change to containerise the service), so the IIS request limit of 2Gb that's set in web.config will be the limit in non-local environments.
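
For illustration, with the suggested swap the annotation stack above would read:

    [Produces("application/json")]
    [ProducesResponseType(200)]
    [ProducesResponseType(404)]
    [DisableRequestSizeLimit] // no ASP.NET Core body size cap; the IIS 2Gb limit still applies in Azure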

Collaborator


Also just wondering if you have done any testing of large zip file uploads for this new bulk upload feature? In the testing for EES-3503 @N-moh found that uploading a 2gb file was taking 9 mins or more, and that would be just for the upload, excluding the processing.

Collaborator Author

@mmyoungman mmyoungman Jul 11, 2024


Is the intention to limit the request size to 2Gb here?

Frankly, I was just copying the other endpoints. I could switch to DisableRequestSizeLimit as you suggest, but the current attribute has been in place for a long time and I'm wary of changing something that could potentially break, without spending the appropriate time investigating it.

My inclination is to create a ticket to do this separately. But if you're confident in the change, I'm happy to do it - to be honest I'm probably worrying over nothing here.

Also just wondering if you have done any testing of large zip file uploads for this new bulk upload feature?

I haven't - I'm inclined to leave that to the testers. If this goes to the dev branch and we do have a problem, a small migration will be required to revert the change, but otherwise it should be fine.

Collaborator


I think it would be pretty quick to test, just so we can check our understanding of what's going on here. If you take away the annotation completely, I think you should hit an error if you try to upload a file bigger than 28.6Mb, which is the default Kestrel request limit (we only use Kestrel when running locally).

We haven't overridden serverOptions.Limits.MaxRequestBodySize to increase that global 28.6Mb limit within our existing Kestrel configuration. See

                    webBuilder.ConfigureKestrel(serverOptions =>
                    {
                        serverOptions.AddServerHeader = false;
                    });

Once you've verified that 28.6Mb limit, if adding the [DisableRequestSizeLimit] annotation allows you to upload bigger files, then I think it should be good to swap to it instead of using [RequestSizeLimit(int.MaxValue)].

Alternatively, to enforce a 2Gb limit, I think it would be better to make that explicit by using [RequestSizeLimit(2147483648)].
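
For reference, a minimal sketch of how that global Kestrel limit could be raised in the ConfigureKestrel block quoted above, if we ever preferred that over per-endpoint attributes (not something this PR changes):

    webBuilder.ConfigureKestrel(serverOptions =>
    {
        serverOptions.AddServerHeader = false;

        // Raise the ~28.6Mb default body limit for local Kestrel runs;
        // setting this to null removes the Kestrel limit entirely.
        serverOptions.Limits.MaxRequestBodySize = 2_147_483_648; // 2Gb
    });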

Collaborator Author


I've switched to using [DisableRequestSizeLimit], and made the same change across the whole codebase.

I'll leave a note to test for large files on the ticket (although I'm sure they would anyway).

Comment on lines -286 to +287
using var dataFileReader = new StreamReader(await datafileStreamProvider.Invoke());
using var csvReader = new CsvReader(dataFileReader, CultureInfo.InvariantCulture);
await using var stream = await datafileStreamProvider.Invoke();
using var streamReader = new StreamReader(stream);
using var csvReader = new CsvReader(streamReader, CultureInfo.InvariantCulture);
Collaborator


I'm wondering what the benefit of changing this is as it seemed ok before.

Collaborator Author

@mmyoungman mmyoungman Jul 11, 2024


This was more so that it's clear, at a quick glance, that all streams will be closed because they're in using blocks.

I think you said in a previous PR that you had to check whether certain streams were getting closed? My thought is this makes it crystal clear.

It seems like calling close on csvReader and dataFileReader was enough, but for completeness I'd prefer it to be clear that .Close() gets called on the datafileStreamProvider.Invoke() stream too. Even if it's unnecessary, it means nobody has to check in the future if they see this and are unsure.

Collaborator Author

mmyoungman commented Jul 9, 2024

Replying to Ben's general comment above

I think it's probably too late to change now but it seems like we have a lot of extra complexity for handling zip files and bulk zip files [...]

I did mostly use the same pattern for BulkZips that the existing DataZips used. If we hadn't done that, it would have involved refactoring the existing DataZip functionality before starting the BulkZip work.

I agree that if we instead uploaded files to the Data Processor and it handled all validation and importing, that would be simpler. But in terms of getting from where we were initially to having bulk imports, that extra refactoring adds a lot of work - and I mean a lot of work, in my opinion - and I don't think it's justified.

If there is one way in which you may have a point, it's that refactoring now might make implementing some future requirement much easier in a way that is difficult to foresee. The reason I discount that is that it's a what-if, while the time saved in implementation is a concrete saving we gain right now. That's not to say you couldn't be right - you have much more experience than me and I'd be a fool to totally discount your opinion :)

I don't think the bulk zip work puts a refactor like the one you want totally out of the question - but I agree it puts it further out of reach. That said, how much extra work would it be to do later compared to now? I think refactoring now would have been more efficient than refactoring later, but I'm not sure it's a huge saving.

The validation which is spread across the Admin and Data Processor is also pretty complicated.

If this ends up with the Admin dealing with all the validation - say with the screener added - and the Processor with importing, I'm inclined to think that is okay. It doesn't feel to me like the complexity will get out of hand - of the stuff I touched, I found it fairly easy to follow.

So while I am keen to minimise complexity everywhere, if we had extra time to spend, there are many other places where I think the complexity is getting out of control - something like untangling the publication of methodologies and releases (which in turn made redirects a nightmare) would be a much better place to spend extra time, I think.

@mmyoungman mmyoungman force-pushed the EES-4932 branch 3 times, most recently from 3e4d82f to dcca428 on July 12, 2024 09:06
Comment on lines 3 to 19
public class ArchiveDataSetFile(
string title,
string dataFilename,
string metaFilename,
long dataFileSize = 1048576,
long metaFileSize = 1024)
{
public string Title { get; } = title;

public string DataFilename { get; } = dataFilename;

public string MetaFilename { get; } = metaFilename;

public long DataFileSize { get; } = dataFileSize;

public long MetaFileSize { get; } = metaFileSize;
}
Collaborator


As an alternative, I think it would be safe to use the following instead:

public record ArchiveDataSetFile(
    string Title,
    string DataFilename,
    string MetaFilename,
    long DataFileSize = 1048576,
    long MetaFileSize = 1024);
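
For what it's worth, callers construct a positional record in the same way, and records also bring value equality, which can be handy in test assertions (the values below are made up):

    var dataSet = new ArchiveDataSetFile(
        Title: "Absence rates",
        DataFilename: "absence-rates.csv",
        MetaFilename: "absence-rates.meta.csv");

    // Records compare by value, so an identical copy is equal.
    var copy = dataSet with { };
    Console.WriteLine(dataSet == copy); // True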

@@ -8,12 +8,12 @@ namespace GovUk.Education.ExploreEducationStatistics.Common.Services.Interfaces
{
public interface IFileTypeService
{
Task<string> GetMimeType(IFormFile file);
Collaborator


I was going to suggest this method in FileTypeService could be made private now but I see it's still being used by unit tests and I'm guessing you already considered this and decided to keep them?

Collaborator Author


I didn't - but now that you point it out, I'd be inclined to keep the unit tests, even if they're really testing dependencies for the most part.

[Produces("application/json")]
[ProducesResponseType(200)]
[ProducesResponseType(404)]
[RequestSizeLimit(int.MaxValue)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be pretty quick to test just so we can check our understanding of what's going off here. If you take away the annotation completely I think you should hit an error if you try to upload a file bigger than 28.6Mb which is the default Kestrel request limit (we only use Kestrel when running locally).

We haven't overriden serverOptions.Limits.MaxRequestBodySize to increase that global 28.6Mb limit within our existing Kestrel configuration. See

                    webBuilder.ConfigureKestrel(serverOptions =>
                    {
                        serverOptions.AddServerHeader = false;
                    });

Once you've verified that 28.6Mb limit, if adding in the [DisableRequestSizeLimit] annotation allows you to upload bigger files then I think it should be good to swap instead of using [RequestSizeLimit(int.MaxValue)].

Alternatively to enforce a 2Gb limit I think it would be better to make that explicit by using [RequestSizeLimit(2147483648)]

Comment on lines 104 to 105
return mimeType == "application/octet-stream"
|| mimeType == "application/zip";
Collaborator


I noticed that the comment // i.e. is zip file looks like it applies to the "application/octet-stream" side of the condition as well, but I think that side is for a different type of file, not zip, and is also best checked using mime detective?

Not a big deal, just an observation while reading the code.

Collaborator


After this has been moved from Common to Admin it needs #nullable enable as we still don't have that set globally in the Admin project.

using System.Threading.Tasks;
using CsvHelper;
Collaborator


I think there are a few unused imports in this class, including this one.

}

private string GuessMagicInfo(Stream fileStream, MagicOpenFlags flag)
private string GuessMagicInfo(Stream fileStream, MagicOpenFlags flag)
Collaborator


That's okay, but I regularly format all my changed files and then spend time wondering whether to pollute my own changes by fixing pre-existing formatting issues or to undo them. It also doesn't make it easy to work out why/when a change was made if it's attributed to another issue later on.

@mmyoungman mmyoungman added the "do not merge" label Jul 16, 2024
@mmyoungman mmyoungman merged commit 63ffabf into dev Jul 18, 2024
7 checks passed
@mmyoungman mmyoungman deleted the EES-4932 branch July 18, 2024 12:22
@mmyoungman mmyoungman removed the "do not merge" label Jul 18, 2024