Skip to content

Commit

Permalink
feat: Handle partial uploads and add file exclusion patterns
Browse files Browse the repository at this point in the history
This commit introduces two enhancements:

Partial Upload Handling: Addresses issue #251 by verifying the uploaded file size against the object metadata size. This prevents processing incomplete uploads.

File Exclusion Patterns: Adds a fileExclusionPatterns array to the configuration (including the template file) allowing specific files to be ignored during processing. This improves efficiency and avoids unnecessary scans.
  • Loading branch information
steve-mays committed Nov 13, 2024
1 parent 5b19267 commit 6890d22
Show file tree
Hide file tree
Showing 6 changed files with 99 additions and 28 deletions.
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,8 @@ CONFIG_JSON: |
"quarantined": "quarantined-bucket-name"
}
],
"ClamCvdMirrorBucket": "cvd-mirror-bucket-name"
"ClamCvdMirrorBucket": "cvd-mirror-bucket-name",
"fileExclusionPatterns": []
}
```
Expand Down Expand Up @@ -111,7 +112,8 @@ resource "google_cloud_run_v2_service" "malware-scanner" {
quarantined = "quarantined-bucket-name"
}
]
ClamCvdMirrorBucket = "cvd-mirror-bucket-name"
ClamCvdMirrorBucket = "cvd-mirror-bucket-name",
fileExclusionPatterns = []
})
}
}
Expand Down
17 changes: 16 additions & 1 deletion cloudrun-malware-scanner/config-env.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,20 @@
# and can be shared across multiple deployments with the appropriate
# permissions.
#
# "fileExclusionPatterns" is a list of regular expressions. Files matching any
# of these patterns will be skipped during scanning. NOTE: These files will remain
# in the "unscanned" bucket and will need to be tidied and/or managed separately.
#
# Example:
#
# "fileExclusionPatterns": [
# "\\.filepart$" (Ignore files ending in ".filepart")
# "^ignore_me.*\\.txt$" (Ignore files starting with "ignore_me" and ending with ".txt")
# ]
#
# Cheat sheet for regular expressions:
# https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Cheatsheet
#
# Shell environmental variable substitution is supported in this file.
# At runtime, JSON will be written to the file /etc/malware-scanner-config.json.
#
Expand All @@ -34,5 +48,6 @@ CONFIG_JSON: |
"quarantined": "quarantined-${PROJECT_ID}"
}
],
"ClamCvdMirrorBucket": "cvd-mirror-${PROJECT_ID}"
"ClamCvdMirrorBucket": "cvd-mirror-${PROJECT_ID}",
"fileExclusionPatterns": []
}
1 change: 1 addition & 0 deletions cloudrun-malware-scanner/config.js
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ const BucketTypes = Object.freeze({
* @typedef {{
* buckets: Array<BucketDefs>,
* ClamCvdMirrorBucket: string,
* fileExclusionPatterns: Array<string>,
* comments?: string
* }} Config
*/
Expand Down
18 changes: 15 additions & 3 deletions cloudrun-malware-scanner/config.json.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,19 @@
"'ClamCvdMirrorBucket' is a GCS bucket used to mirror the clamav database definition files to prevent overloading the Clam servers",
"and being rate limited/blacklisted. Its contents are maintained by the updateCvdMirror.sh script",
"",
"Shell environmental variable substitution is supported in this file.",
"At runtime, it will be copied to /etc",
"'fileExclusionPatterns' is a list of regular expressions. Files matching any",
"of these patterns will be skipped during scanning. NOTE: These files will remain",
"in the 'unscanned' bucket and will need to be tidied and/or managed separately.",
"",
"Example:",
"",
" 'fileExclusionPatterns: [",
" '\\.filepart$' (Ignore files ending in '.filepart')",
" '^ignore_me.*\\.txt$' (Ignore files starting with 'ignore_me' and ending with '.txt')",
" ]",
"",
"Cheat sheet for regular expressions:",
"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Cheatsheet",
"",
"As an alternative to including this file in the container the contents can be passed as an enviroment variable CONFIG_JSON on",
"Cloud Run startup",
Expand All @@ -24,5 +35,6 @@
"quarantined": "quarantined-bucket-name"
}
],
"ClamCvdMirrorBucket": "cvd-mirror-bucket-name"
"ClamCvdMirrorBucket": "cvd-mirror-bucket-name",
"fileExclusionPatterns": []
}
69 changes: 55 additions & 14 deletions cloudrun-malware-scanner/server.js
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ const MAX_FILE_SIZE = 500000000; // 500MiB
const BUCKET_CONFIG = {
buckets: [],
ClamCvdMirrorBucket: '',
fileExclusionPatterns: [],
};

// Create Clients.
Expand Down Expand Up @@ -109,6 +110,7 @@ app.post('/', async (req, res) => {
*/
async function handleGcsObject(req, res) {
const file = req.body;

try {
if (!file?.name) {
handleErrorResponse(res, 200, `file name not specified in ${file}`);
Expand All @@ -118,16 +120,7 @@ async function handleGcsObject(req, res) {
handleErrorResponse(res, 200, `bucket name not specified in ${file}`);
return;
}
const fileSize = parseInt(file.size);
if (fileSize > MAX_FILE_SIZE) {
handleErrorResponse(
res,
200,
`file gs://${file.bucket}/${file.name} too large for scanning at ${fileSize} bytes`,
file.bucket,
);
return;
}

const config = BUCKET_CONFIG.buckets.filter(
(c) => c.unscanned === file.bucket,
)[0];
Expand All @@ -141,6 +134,7 @@ async function handleGcsObject(req, res) {
}

const gcsFile = storage.bucket(file.bucket).file(file.name);

// File.exists() returns a FileExistsResponse, which is a list with a
// single value.
if (!(await gcsFile.exists())[0]) {
Expand All @@ -152,6 +146,47 @@ async function handleGcsObject(req, res) {
return;
}

const [metadata] = await gcsFile.getMetadata();

// Parse the file size from the request body ('file.size') and from the file metadata ('metadata.size').
// Compare the parsed file sizes.
// If the sizes don't match, log an informational message indicating a potential incomplete file upload and return a "ignored" status to the client.
// This check helps avoid scanning partially uploaded files, which might lead to inaccurate scan results.
const fileSize = parseInt(String(file.size));
const metadataSize = parseInt(String(metadata.size));
if (fileSize !== metadataSize) {
logger.info(
`Ignoring file gs://${file.bucket}/${file.name}. File size mismatch (reported: ${fileSize}, metadata: ${metadataSize}). File upload may not be complete.`,
);
res.json({status: 'ignored', reason: 'file_size_mismatch'});
return;
}

// Check if the file is too big to process
if (fileSize > MAX_FILE_SIZE) {
handleErrorResponse(
res,
200,
`file gs://${file.bucket}/${file.name} too large for scanning at ${fileSize} bytes`,
file.bucket,
);
return;
}

// Iterate through the configured file exclusion patterns.
// If the file name matches any of the exclusion patterns, log an informational message and return an "ignored" status to the client.
// This allows specific files to be skipped from the scanning process based on their names.
for (const pattern of BUCKET_CONFIG.fileExclusionPatterns) {
const regex = new RegExp(pattern);
if (regex.test(file.name)) {
logger.info(
`Ignoring file gs://${file.bucket}/${file.name} based on regex: ${pattern}`,
);
res.json({status: 'ignored', reason: 'regex_match'});
return;
}
}

const clamdVersion = await getClamVersion();
logger.info(
`Scan request for gs://${file.bucket}/${file.name}, (${fileSize} bytes) scanning with clam ${clamdVersion}`,
Expand Down Expand Up @@ -336,12 +371,15 @@ async function getClamVersion() {
* @param {!import('./config.js').BucketDefs} config
*/
async function moveProcessedFile(filename, isClean, config) {
const srcfile = storage.bucket(config.unscanned).file(filename);
const destinationBucketName = isClean
? `gs://${config.clean}`
: `gs://${config.quarantined}`;
const srcBucketName = config.unscanned;
const srcfile = storage.bucket(srcBucketName).file(filename);
const destinationBucketName = isClean ? config.clean : config.quarantined;
const destinationBucket = storage.bucket(destinationBucketName);

await srcfile.move(destinationBucket);
logger.info(
`Successfully moved file gs://${srcBucketName}/${filename} to gs://${destinationBucketName}/${filename}`,
);
}

/**
Expand Down Expand Up @@ -387,6 +425,9 @@ async function run() {
}
const config = await readAndVerifyConfig(configFile);

// Ensure ignoreFilespecs is an array (even if empty)
config.fileExclusionPatterns = config.fileExclusionPatterns || [];

Object.assign(BUCKET_CONFIG, config);

await waitForClamD();
Expand Down
16 changes: 8 additions & 8 deletions terraform/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,12 @@ malware-scanner service on cloud run.

The deployment is split into 4 stages:

1. Set up the google cloud project environment and service configuration.
1. Use Terraform to set up the required service accounts and deploy required
infrastructure.
1. Launch cloud build to build the Docker image for the malware-scanner
service.
1. Use Terraform to deploy the malware-scanner service to cloud run, and
connect the service to the infrastructure created in stage 2.
1. Set up the google cloud project environment and service configuration.
1. Use Terraform to set up the required service accounts and deploy required
infrastructure.
1. Launch cloud build to build the Docker image for the malware-scanner service.
1. Use Terraform to deploy the malware-scanner service to cloud run, and connect
the service to the infrastructure created in stage 2.

Follow the instructions below to use Terraform to deploy the malware scanner
service in a demo project.
Expand Down Expand Up @@ -53,7 +52,8 @@ TF_VAR_config_json=$(cat <<EOF
"quarantined": "quarantined-${PROJECT_ID}"
}
],
"ClamCvdMirrorBucket": "cvd-mirror-${PROJECT_ID}"
"ClamCvdMirrorBucket": "cvd-mirror-${PROJECT_ID}",
"fileExclusionPatterns": []
}
EOF
)
Expand Down

0 comments on commit 6890d22

Please sign in to comment.