Architecture

Jonas Almeida edited this page Aug 27, 2024 · 6 revisions

Q. Why preserve this material as a self-contained GitHub repository?

A. No nominal costs!

The original resource at covid19serohub.nih.gov was conceived as a conventional client-server architecture, with a database and an API as its back end. This resource was deemed unsustainable and scheduled for disablement over a period of a few weeks in August 2024. Preserving datasets as applications with this architecture is fraught with code base and backend persistence issues and, ultimately, high costs. By contrast, rescuing the reference dataset in a GitHub repository, and then using GitHub Pages to rebuild tooling over time, incurs no recurring nominal costs. Reaching that goal without loss of programmatic interoperability, while adding the advantage of versioning, comes in two steps:

1) Compressed storage.

At present (Aug 27, 2024), the raw data file contains 39,636 individual records for 60 variables: 1) age; 2) algorithm_test_comments; 3) analysis_strategy; 4) antibody_isotypes; 5) antigen_target; 6) catchment_area; 7) collection_end; 8) collection_frequency; 9) collection_midpoint; 10) collection_start; 11) collection_state; 12) comments; 13) confidence_interval; 14) corresponding_author_email; 15) ethnicity; 16) eua; 17) funding_agencies; 18) geographic_location; 19) grant_numbers; 20) keywords; 21) lead_author; 22) lead_author_affiliation; 23) manufacturer; 24) manufacturer_standards; 25) methodology; 26) number_of_participants; 27) other; 28) other_antigen_target; 29) overall_study; 30) primary_design; 31) publication_date; 32) race; 33) reference; 34) report_type; 35) round; 36) row; 37) sample_volume; 38) sampling_methodology_description; 39) sensitivity_ci_lower; 40) sensitivity_ci_upper; 41) sensitivity_confidence_level; 42) sensitivity_value; 43) seroprevalence; 44) sex; 45) specificity_ci_lower; 46) specificity_ci_upper; 47) specificity_confidence_level; 48) specificity_value; 49) spike_subsets; 50) status; 51) study_and_report_titles; 52) study_identifier; 53) study_objectives; 54) study_population; 55) target; 56) test_comments; 57) test_name; 58) test_sample_types; 59) test_type; 60) trial_network.

This data was provided as an object array, compressed (zip) as a 2.9 MB file, at https://episphere.github.io/serohub/seroprevalence.json.zip . From that compressed format a JSON object array can be decompressed in memory as a 159 MB volume, and from that a 78 MB tab-delimited tabular file can be assembled (alternatively, the compressed storage could have targeted the tabular format, a 2.1 MB compressed volume). The important feature to note is that an SDK is also provided to decompress those two textual serializations in memory. These are time- and memory-efficient processes, typically under a second and, most importantly, with zero nominal costs. This can be verified by clicking on the JSON or TSV links on the landing page. More to the point of client-side execution, these can be verified by operating on the 2.9 MB compressed volume using seroHub's Software Development Kit (SDK), provided with and developed for this GitHub repository.
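The tabular serialization described above can be assembled from the decompressed JSON object array with a few lines of JavaScript. The sketch below illustrates the idea with a tiny stand-in array; the SDK's own conversion method may differ in details such as column ordering and escaping:

```javascript
// Assemble a tab-delimited (TSV) string from an array of record objects,
// using the first record's keys as the column header.
const toTSV = (records) => {
  const cols = Object.keys(records[0])
  const header = cols.join('\t')
  const rows = records.map(rec =>
    cols.map(c => rec[c] == null ? '' : String(rec[c])).join('\t')
  )
  return [header, ...rows].join('\n')
}

// Tiny stand-in for the 39,636-record seroprevalence array
// (field names taken from the variable list above)
const sample = [
  { age: '18-29', sex: 'F', seroprevalence: 0.12 },
  { age: '30-49', sex: 'M', seroprevalence: 0.09 }
]
console.log(toTSV(sample))
// age	sex	seroprevalence
// 18-29	F	0.12
// 30-49	M	0.09
```

Because both serializations are derived in memory from the same 2.9 MB compressed volume, the tabular form never needs to be stored or served separately.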

2) Interoperability

In order to preserve the reference data in its compressed state, while allowing for its use by other applications programmatically, a JavaScript SDK was developed. This can be put to the test by loading it in different application domains:

```javascript
seroHub = (await import("https://episphere.github.io/serohub/serohub.mjs")).seroHub
```

... including those of notebooks.
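The dynamic `import()` pattern above works in any ES-module context: a browser console, an Observable notebook cell, or Node.js. As a self-contained illustration that does not require network access, the same pattern can be exercised against a module inlined as a `data:` URL (the module body here is a hypothetical stand-in, not the real SDK):

```javascript
// Dynamic import of an ES module, the same mechanism used to load serohub.mjs.
// The module source is inlined as a data: URL so this example is self-contained;
// with the real SDK, the specifier would be the serohub.mjs URL instead.
(async () => {
  const src = 'export const seroHub = { version: "demo" }'
  const mod = await import('data:text/javascript,' + encodeURIComponent(src))
  const seroHub = mod.seroHub
  console.log(seroHub.version) // "demo"
})()
```

Because `import()` resolves at run time, no build step or package installation is needed: any environment that speaks ES modules can consume the SDK, and the repository's versioning applies to the tooling just as it does to the data.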