EES-4932 Bulk import: allow users to import multiple data sets from a single zip file #5030

Merged
merged 32 commits into dev from EES-4932 on Jul 18, 2024

Conversation

Collaborator

@mmyoungman mmyoungman commented Jul 4, 2024

This PR adds the ability to upload multiple data sets at once when importing new data sets.


Bulk zip files contain a dataset_names.csv which lists, for each data set, the data set title and a file base name. The base name is used to find the actual data files also included in the zip, which are expected to be named [basename].csv and [basename].meta.csv.
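
For illustration, a bulk zip containing two data sets might be laid out like this (the file names are made up for the example):

    bulk-upload.zip
    ├── dataset_names.csv
    ├── absence_rates.csv
    ├── absence_rates.meta.csv
    ├── exclusion_rates.csv
    └── exclusion_rates.meta.csv

with dataset_names.csv mapping each base name to a data set title along these lines (the exact column headers here are an assumption rather than taken from the validation code):

    file_name,dataset_name
    absence_rates,Absence rates by characteristic
    exclusion_rates,Exclusion rates by geographic level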

Examples of valid and invalid bulk zip files can be found in src/GovUk....Admin.Tests/Resources.

The new ReleaseController#UploadAsBulkZip method can return multiple errors rather than one error at a time as we've done in the past. This is because bulk zips are liable to be large files and it's easy to make a mistake when creating them, so this work was done on the backend as a convenience for the user who might otherwise have to upload-fix-upload-fix the same file multiple times.

Even though this work to support multiple errors has been done in the backend, the frontend still returns only one error at a time - mostly just so we can get this feature out quicker. A ticket will be created to add support for multiple errors to the frontend.

This work also includes a refactor of ReleaseDataFileService's existing Upload and UploadAsZip methods, mostly around how they validate their input. UploadAsZip previously wasn't validating all the same cases as Upload.

All upload methods now return multiple errors, which was necessary to create reusable validation methods for all types of upload: FileUploadsValidatorService#ValidateDataSetFilesForUpload is called by all of them and handles most of the validation for a specific data set.
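
As a rough sketch of the accumulate-errors pattern (the type and checks below are simplified stand-ins, not the real FileUploadsValidatorService code):

    using System.Collections.Generic;

    // Stand-in for the error view model returned to the frontend.
    public record UploadError(string Message);

    public static class DataSetFileValidation
    {
        public static List<UploadError> ValidateDataSetFile(
            string dataFilename, long dataFileSize, bool isValidCsv)
        {
            var errors = new List<UploadError>();

            // Each check appends to the list instead of returning early...
            if (!isValidCsv)
            {
                errors.Add(new UploadError($"'{dataFilename}' must be a valid CSV file"));
            }

            if (dataFileSize == 0)
            {
                errors.Add(new UploadError($"'{dataFilename}' must not be empty"));
            }

            // ...so the caller receives every failure for the data set in one response.
            return errors;
        }
    }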

DataImportService#Import now supports all types of upload. The Data Processor treats DataZips and BulkDataZips essentially the same, with each "import" being an individual data set; a BulkDataZip simply results in the Data Processor being called multiple times, once per data set.
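
Conceptually, the bulk path just fans out into one import per data set found in the zip, something like the sketch below (the interface and method names are hypothetical, not the real service API):

    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Hypothetical stand-in for whatever triggers a single Data Processor import.
    public interface IDataSetImporter
    {
        Task ImportDataSet(string title, string dataFileName, string metaFileName);
    }

    public static class BulkZipImport
    {
        // A bulk zip fans out into one import per data set it contains, so the
        // Data Processor sees each one exactly like an ordinary single import.
        public static async Task ImportAll(
            IDataSetImporter importer,
            IEnumerable<(string Title, string BaseName)> dataSets)
        {
            foreach (var (title, baseName) in dataSets)
            {
                await importer.ImportDataSet(title, $"{baseName}.csv", $"{baseName}.meta.csv");
            }
        }
    }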

Other changes

  • ReleaseDataFileService#Delete has been updated to not remove the source zip file if other data sets still exist with that zip as a source (see the sketch after this list).
  • FileTypeService has changed a lot. This started because I wanted IsValidCsv to accept a stream rather than a stream provider, and it then cascaded into a major refactoring.
  • Removed an unused ErrorViewModel file from Admin - we're using the one in Common.
  • I've done a lot of renaming - mostly subject/dataSetName to dataSetTitle.
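
Regarding the first bullet above, the guard in Delete is roughly of this shape, assuming the File table records the originating zip via something like a SourceId column and that EF Core's AnyAsync is available (a sketch only, not the actual implementation):

    // Only remove the source zip's blob and File row once no other data set
    // file still references that zip as its source. "SourceId" and the helper
    // below are assumptions for illustration.
    var zipStillInUse = await contentDbContext.Files
        .AnyAsync(f => f.SourceId == zipFile.Id && f.Id != deletedFile.Id);

    if (!zipStillInUse)
    {
        await DeleteBlobForFile(zipFile); // hypothetical helper
        contentDbContext.Files.Remove(zipFile);
        await contentDbContext.SaveChangesAsync();
    }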

@mmyoungman mmyoungman force-pushed the EES-4932 branch 2 times, most recently from c256f63 to 3d4c0fe on July 5, 2024 13:30
Collaborator

@benoutram benoutram left a comment


General comment on this PR:

I think it's probably too late to change now, but it seems like we have a lot of extra complexity for handling zip files and bulk zip files, such as the way we record the zip file as a File entry in the database along with its dedicated FileType, link it with DataImport, retain the zip file in blob storage, ensure we don't delete a bulk zip unless all of its data sets have been deleted, etc.

The validation which is spread across the Admin and Data Processor is also pretty complicated.

I think the problem stems from the Admin using a mix of the database and the blob storage as part of its communication with the Data Processor to trigger imports, in addition to the queue message.

It feels like zip files should just be a transport mechanism for large files, and that after extraction we should be able to trigger creation and processing in the same way as for uncompressed file uploads.

I appreciate it's not simple for bulk uploads because you need to check for duplicate files and have a way of providing the data set titles.

If I was starting this work again I would look at whether it would be feasible to change the Admin to proxy the upload request to the Data Processor and let it do all the creation and handling. If the request is not valid, nothing would exist in the database or in blob storage. At most the Admin might peek at the stream to check it's zip encoded before passing on the call to the Data Processor.

The Data Processor would unpack the zip file, identify pairs of data/metadata files, check for duplicates, and check the manifest contains titles if there's more than one pair.

It would not be concerned with the name of the zip file itself because this would never be stored in the database.

If this extraction fails it would neither create anything in the database nor upload anything to blob storage, and errors would be returned to the user via the HTTP request chain.

If it succeeds it would create the set of release data and metadata files in the database and then queue processing as with a regular upload.

It might be a bit optimistic to say we could get away without needing to stage the file anywhere, and it would probably involve some clever handling for large files, but it would avoid using the database to communicate the 'zip' part of the request, which we don't care about after the import.

[Produces("application/json")]
[ProducesResponseType(200)]
[ProducesResponseType(404)]
[RequestSizeLimit(int.MaxValue)]
Collaborator


Is the intention to limit the request size to 2Gb here?

Just wondering because int.MaxValue would be 2147483647 bytes which is 2Gb, but it's not clear why int.MaxValue has been chosen. The value type is long so a value higher than 2147483647 can be set.

I think the annotation [DisableRequestSizeLimit] might be a better option here.

Note that the Admin runs in the in-process model and is fronted by IIS when running in Azure environments (at least until we make a change to containerise the service), so the IIS request limit of 2Gb that's set in web.config will be the limit in non-local environments.
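
For illustration, with the suggested swap the annotation stack above would read:

    [Produces("application/json")]
    [ProducesResponseType(200)]
    [ProducesResponseType(404)]
    [DisableRequestSizeLimit] // no ASP.NET Core body size cap; the IIS 2Gb limit still applies in Azure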

Collaborator


Also just wondering if you have done any testing of large zip file uploads for this new bulk upload feature? In the testing for EES-3503 @N-moh found that uploading a 2gb file was taking 9 mins or more, and that would be just for the upload, excluding the processing.

Collaborator Author

@mmyoungman mmyoungman Jul 11, 2024


Is the intention to limit the request size to 2Gb here?

Frankly, I was just copying the other endpoints. I could switch to DisableRequestSizeLimit as you suggest, but the current attribute has been in place for a long time and I'm wary of changing something that could potentially break, without spending the appropriate time investigating it.

My inclination is to create a ticket to do this separately. But if you're confident in the change, I'm happy to do it - to be honest I'm probably worrying over nothing here.

Also just wondering if you have done any testing of large zip file uploads for this new bulk upload feature?

I haven't - I'm inclined to leave that to the testers. If this goes to the dev branch and we do have a problem, a small migration will be required to revert the change, but otherwise it should be fine.

Collaborator


I think it would be pretty quick to test, just so we can check our understanding of what's going on here. If you take away the annotation completely, I think you should hit an error if you try to upload a file bigger than 28.6Mb, which is the default Kestrel request limit (we only use Kestrel when running locally).

We haven't overridden serverOptions.Limits.MaxRequestBodySize to increase that global 28.6Mb limit within our existing Kestrel configuration. See

                    webBuilder.ConfigureKestrel(serverOptions =>
                    {
                        serverOptions.AddServerHeader = false;
                    });

Once you've verified that 28.6Mb limit, if adding the [DisableRequestSizeLimit] annotation allows you to upload bigger files, then I think it should be good to swap to it instead of using [RequestSizeLimit(int.MaxValue)].

Alternatively, to enforce a 2Gb limit, I think it would be better to make that explicit by using [RequestSizeLimit(2147483648)].
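
For reference, a minimal sketch of how that global Kestrel limit could be raised in the ConfigureKestrel block quoted above, if we ever preferred that over per-endpoint attributes (not something this PR changes):

    webBuilder.ConfigureKestrel(serverOptions =>
    {
        serverOptions.AddServerHeader = false;

        // Raise the ~28.6Mb default body limit for local Kestrel runs;
        // setting this to null removes the Kestrel limit entirely.
        serverOptions.Limits.MaxRequestBodySize = 2_147_483_648; // 2Gb
    });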

Collaborator Author


I've switched to using [DisableRequestSizeLimit], and made the same change across the whole codebase.

I'll leave a note to test for large files on the ticket (although I'm sure they would anyway).

Comment on lines -286 to +287
using var dataFileReader = new StreamReader(await datafileStreamProvider.Invoke());
using var csvReader = new CsvReader(dataFileReader, CultureInfo.InvariantCulture);
await using var stream = await datafileStreamProvider.Invoke();
using var streamReader = new StreamReader(stream);
using var csvReader = new CsvReader(streamReader, CultureInfo.InvariantCulture);
Collaborator


I'm wondering what the benefit of changing this is as it seemed ok before.

Collaborator Author

@mmyoungman mmyoungman Jul 11, 2024


This was more so that it's clear, at a quick glance, that all streams will be closed because they're in using blocks.

I think you said in a previous PR that you had to check whether certain streams were getting closed? My thought is this makes it crystal clear.

It seems like calling close on csvReader and dataFileReader was enough, but for completeness I'd prefer it to be clear that .Close() gets called on the datafileStreamProvider.Invoke() stream too. Even if it's unnecessary, it means nobody has to check in the future if they see this and are unsure.

Collaborator Author

mmyoungman commented Jul 9, 2024

Replying to Ben's general comment above

I think it's probably too late to change now but it seems like we have a lot of extra complexity for handling zip files and bulk zip files [...]

I did mostly use the same pattern for BulkZips that the existing DataZips used. If we hadn't done that, it would have involved refactoring the existing DataZip functionality before starting the BulkZip work.

I agree that if we instead uploaded files to the Data Processor and it handled all validation and importing, that would be simpler. But in terms of getting from where we were initially to having bulk imports, that extra refactoring adds a lot of work - and I mean a lot of work, in my opinion - and I don't think it's justified.

If there is one way in which you may have a point, it's that refactoring now might make implementing some future requirement much easier in a way that is difficult to foresee. The reason I discount that is that it's a what-if, while the time saved in implementation is a concrete saving we gain right now. That's not to say you couldn't be right - you have much more experience than me and I'd be a fool to totally discount your opinion :)

I don't think the bulk zip work puts a refactor like the one you want totally out of the question - but I agree it puts it further out of reach. That said, how much extra work would it be to do later compared to now? I think refactoring now would have been more efficient than refactoring later, but I'm not sure it's a huge saving.

The validation which is spread across the Admin and Data Processor is also pretty complicated.

If this ends up with the Admin dealing with all the validation - say with the screener added - and the Processor with importing, I'm inclined to think that is okay. It doesn't feel to me like the complexity will get out of hand - of the stuff I touched, I found it fairly easy to follow.

So while I am keen to minimise complexity everywhere, if we had extra time to spend, there are many other places where I think the complexity is getting out of control - something like untangling the publication of methodologies and releases (which in turn made redirects a nightmare) would be a much better place to spend extra time, I think.

@mmyoungman mmyoungman force-pushed the EES-4932 branch 3 times, most recently from 3e4d82f to dcca428 on July 12, 2024 09:06
Comment on lines 3 to 19
public class ArchiveDataSetFile(
string title,
string dataFilename,
string metaFilename,
long dataFileSize = 1048576,
long metaFileSize = 1024)
{
public string Title { get; } = title;

public string DataFilename { get; } = dataFilename;

public string MetaFilename { get; } = metaFilename;

public long DataFileSize { get; } = dataFileSize;

public long MetaFileSize { get; } = metaFileSize;
}
Collaborator


As an alternative, I think it would be safe to use the following instead:

public record ArchiveDataSetFile(
    string Title,
    string DataFilename,
    string MetaFilename,
    long DataFileSize = 1048576,
    long MetaFileSize = 1024);
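
For what it's worth, callers construct a positional record in the same way, and records also bring value equality, which can be handy in test assertions (the values below are made up):

    var dataSet = new ArchiveDataSetFile(
        Title: "Absence rates",
        DataFilename: "absence-rates.csv",
        MetaFilename: "absence-rates.meta.csv");

    // Records compare by value, so an identical copy is equal.
    var copy = dataSet with { };
    Console.WriteLine(dataSet == copy); // True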

@@ -8,12 +8,12 @@ namespace GovUk.Education.ExploreEducationStatistics.Common.Services.Interfaces
{
public interface IFileTypeService
{
Task<string> GetMimeType(IFormFile file);
Collaborator


I was going to suggest this method in FileTypeService could be made private now but I see it's still being used by unit tests and I'm guessing you already considered this and decided to keep them?

Collaborator Author


I didn't - but now that you point it out, I'd be inclined to keep the unit tests, even if they're really testing dependencies for the most part.

[Produces("application/json")]
[ProducesResponseType(200)]
[ProducesResponseType(404)]
[RequestSizeLimit(int.MaxValue)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be pretty quick to test just so we can check our understanding of what's going off here. If you take away the annotation completely I think you should hit an error if you try to upload a file bigger than 28.6Mb which is the default Kestrel request limit (we only use Kestrel when running locally).

We haven't overriden serverOptions.Limits.MaxRequestBodySize to increase that global 28.6Mb limit within our existing Kestrel configuration. See

                    webBuilder.ConfigureKestrel(serverOptions =>
                    {
                        serverOptions.AddServerHeader = false;
                    });

Once you've verified that 28.6Mb limit, if adding in the [DisableRequestSizeLimit] annotation allows you to upload bigger files then I think it should be good to swap instead of using [RequestSizeLimit(int.MaxValue)].

Alternatively to enforce a 2Gb limit I think it would be better to make that explicit by using [RequestSizeLimit(2147483648)]

Comment on lines 104 to 105
return mimeType == "application/octet-stream"
|| mimeType == "application/zip";
Collaborator


I noticed that the comment // i.e. is zip file looks like it applies to the "application/octet-stream" side of the condition as well, but I think that side is for a different type of file, not zip, and is also best checked using mime detective?

Not a big deal, just an observation while reading the code.

Collaborator


After this has been moved from Common to Admin it needs #nullable enable as we still don't have that set globally in the Admin project.

using System.Threading.Tasks;
using CsvHelper;
Collaborator


I think there are a few unused imports in this class, including this one.

}

private string GuessMagicInfo(Stream fileStream, MagicOpenFlags flag)
private string GuessMagicInfo(Stream fileStream, MagicOpenFlags flag)
Collaborator


That's okay, but I regularly format all my changed files and then spend time wondering whether to pollute my own changes by fixing pre-existing formatting issues or to undo them. It also doesn't make it easy to work out why/when a change was made if it's attributed to another issue later on.

@mmyoungman mmyoungman added the "do not merge" label Jul 16, 2024
@mmyoungman mmyoungman merged commit 63ffabf into dev Jul 18, 2024
7 checks passed
@mmyoungman mmyoungman deleted the EES-4932 branch July 18, 2024 12:22
@mmyoungman mmyoungman removed the "do not merge" label Jul 18, 2024