[FDS-2386] Synapse entity tracking and code concurrency updates #1505

BryanFauble · 2024-09-17T21:02:51Z

I will be adding comments to specific areas of the code that should be looked at and carefully considered. Below covers the overall gist of what is going on here:

Problem:

When an entity is retrieved from the synapse python client it is not stored for further use within the context of a request hitting schematic. The entity is re-retrieved several times throughout the lifecycle of the request causing significant bottlenecks when retrieving data out of Synapse.
When downloading manifests a temporary directory is being used when running on AWS, but not locally, or for integration tests. As a result concurrent requests in those contexts were causing manifest files to be overwritten for new tests and causing previous tests to fail.
A few concepts for great expectations (suites and checkpoints) were constantly being created and torn down, even during requests where the checkpoint/suite was not created. This leads to more concurrency type issues where a checkpoint/suite could be torn down in the middle of a request, or at a point in time it isn't expected.

Solution:

Creating a simple "pull through cache" that can be added via composition to the classes that interact with Synapse. This is acting as a small (albeit hacky) shim layer between schematic and SYNPY. The usage of this entity tracker allows the code to function largely as it has been without the need to more in-depth refactoring. This means that an entity for a user session is pulled out of memory before it is pulled from Synapse again.
Writing manifests to a temporary directory in all cases except for when being used as a CLI tool.
Updating the logic for tearing down the great expectation suite and checkpoints to only consider tearing down the checkpoints that itself created.

Testing:

I have been stepping through the code with the integration tests that have been written to verify functionality.
Will need to have a verbose testing plan in place to verify that schematic continues to function as expected. @linglp @andrewelamb Are you two available to help with the concepts of a plan for helping to test all of these changes?

pyproject.toml

BryanFauble · 2024-09-17T21:04:09Z

pyproject.toml

@@ -75,13 +76,14 @@ Flask-Cors = {version = "^3.0.10", optional = true}
 uWSGI = {version = "^2.0.21", optional = true}
 Jinja2 = {version = ">2.11.3", optional = true}
 asyncio = "^3.4.3"
-jaeger-client = {version = "^4.8.0", optional = true}
-flask-opentracing = {version="^2.0.0", optional = true}


flask/jaeger are deprecated. Swapping over to the otel-api/sdk libraries instead.

BryanFauble · 2024-09-17T21:04:35Z

pyproject.toml

@@ -135,41 +131,3 @@ testpaths = [
 filterwarnings = [
    "ignore::DeprecationWarning"
 ]
-markers = [


These were moved to pytest.ini to stop a bunch of warning from being emitted.

BryanFauble · 2024-09-17T21:05:55Z

schematic/__init__.py

+        FlaskInstrumentor().instrument(
+            request_hook=request_hook, response_hook=response_hook
+        )


Most all the code in this script was just moved except for this FlaskInstruement and the request/response hook. This is coming from this library: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-flask - Which allows auto instrumentation of the flask server that is running here

BryanFauble · 2024-09-17T21:06:32Z

schematic/manifest/commands.py

+        datasetId=dataset_id,
+        downloadFile=True,
+        newManifestName=new_manifest_name,
+        use_temporary_folder=False,


When running in CLI mode we will not be downloading manifest files to a temporary folder, instead continue to use the defined manifest folder from the config.

BryanFauble · 2024-09-17T21:06:50Z

schematic/models/GE_Helpers.py

+        anonymous_usage_statistics = AnonymizedUsageStatisticsConfig(enabled=False)
+


This removes extra calls from leaving schematic.

Should this be a comment?

BryanFauble · 2024-09-17T21:09:14Z

schematic/models/GE_Helpers.py

+        self.expectation_suite_name = f"Manifest_test_suite_{uuid.uuid4()}"
+        expectation_suite = self.context.add_expectation_suite(
            expectation_suite_name=self.expectation_suite_name,
        )
+        self.suite = expectation_suite


I found issues when tests were running concurrently where great expectation test suites were stepping over eachother, causing tests to break in Un-expected places. With this change the JSON file that is created for each test suite run is unique, and cleaned up after the test suite has ran.

No more will another test clean these up.

The same notes about concurrency also apply when this code is running as an API server. If 2 users were sending in a request at the same time this is likely to have caused issues if they lined up just perfectly.

BryanFauble · 2024-09-17T21:10:10Z

schematic/models/GE_Helpers.py

@@ -421,7 +409,7 @@ def build_checkpoint(self):
            adds checkpoint to self
        """
        # create manifest checkpoint
-        self.checkpoint_name = "manifest_checkpoint"
+        self.checkpoint_name = f"manifest_checkpoint_{uuid.uuid4()}"


Using a unique checkpoint name to prevent name collisions with other GE runs.

BryanFauble · 2024-09-17T21:10:57Z

schematic/models/GE_Helpers.py

+        # TODO: Do we need to build data docs here during normal operation?
        self.context.build_data_docs(resource_identifiers=[suite_identifier])


@andrewelamb @linglp @milen-sage - Is this code needed? It builds a bunch of HTML files, but I couldn't understand if they were actually used anywhere?

BryanFauble · 2024-09-17T21:11:26Z

schematic/models/metadata.py

-        # check if suite has been created. If so, delete it
-        if os.path.exists("great_expectations/expectations/Manifest_test_suite.json"):
-            os.remove("great_expectations/expectations/Manifest_test_suite.json")
-


No longer cleaning up the test suite run here. Instead relying on the cleanup to happen in the finally block after the expectation runs.

BryanFauble · 2024-09-17T21:12:04Z

schematic/models/validate_manifest.py

+            if logger.isEnabledFor(logging.DEBUG):
+                t_GE = perf_counter()


Some extra code to help guard against extra things from being tracked in memory (It's super minor)

BryanFauble · 2024-09-17T21:12:18Z

schematic/models/validate_manifest.py

+                        "batch_identifiers": {
+                            "default_identifier_name": f"manifestID_{uuid.uuid4()}"
+                        },


Unique identifier to prevent name collisions

BryanFauble · 2024-09-17T21:12:46Z

schematic/models/validate_manifest.py

-            for error in sorted(v.iter_errors(annotation), key=exceptions.relevance):
+            for sorted_error in sorted(
+                v.iter_errors(annotation), key=exceptions.relevance
+            ):
                errorRow = str(i + 2)
-                errorCol = error.path[-1] if len(error.path) > 0 else "Wrong schema"
-                errorColName = error.path[0] if len(error.path) > 0 else "Wrong schema"
-                errorMsg = error.message[0:500]
-                errorVal = error.instance if len(error.path) > 0 else "Wrong schema"
+                errorColName = (
+                    sorted_error.path[0]
+                    if len(sorted_error.path) > 0
+                    else "Wrong schema"
+                )
+                errorMsg = sorted_error.message[0:500]
+                errorVal = (
+                    sorted_error.instance
+                    if len(sorted_error.path) > 0
+                    else "Wrong schema"
+                )


This is just automatically re-formatted code. I did not make any changes here.

BryanFauble · 2024-09-17T21:13:51Z

schematic/store/synapse_tracker.py

@@ -0,0 +1,103 @@
+"""This script is responsible for creating a 'pull through cache' class that can be


This file is the main star of all these changes. Please read carefully.

BryanFauble · 2024-09-17T21:15:14Z

schematic/store/synapse.py

+            cleanup_temporary_storage(
+                temporary_manifest_storage, time_delta_seconds=3600
+            )


Instead of blanket removing previously downloaded manifest files instead this will have a 1 hour delay to clean them up. This was implemented to prevent a manifest from getting cleaned up that might be actively being used, or used for a successful cache pull (ie: Not downloading the manifest file again).

BryanFauble · 2024-09-17T21:21:04Z

schematic/store/synapse.py

+            download_location = create_temp_folder(
+                path=temporary_manifest_storage,
+                prefix=f"{self.manifest_id}-{time.time()}-",
+            )
        else:
-            download_location = CONFIG.manifest_folder
-        manifest_data = self.syn.get(
-            self.manifest_id,
-            downloadLocation=download_location,
-            ifcollision="overwrite.local",
+            if use_temporary_folder:
+                download_location = create_temp_folder(
+                    path=CONFIG.manifest_folder,
+                    prefix=f"{self.manifest_id}-{time.time()}-",
+                )
+            else:
+                download_location = CONFIG.manifest_folder
+
+        manifest_data = self.synapse_entity_tracker.get(
+            synapse_id=self.manifest_id,
+            syn=self.syn,
+            download_file=True,
+            retrieve_if_not_present=True,
+            download_location=download_location,
        )
+
+        # This is doing a rename of the downloaded file. The reason this is important
+        # is that if we are re-using a file that was previously downloaded, but the
+        # file had been renamed. The file downloaded from the Synapse client is just
+        # a direct copy of that renamed file. This code will set the name of the file
+        # to the original name that was used to download the file. Note: An MD5 checksum
+        # of the file will still be performed so if the file has changed, it will be
+        # downloaded again.
+        filename = manifest_data._file_handle.fileName
+        if filename != os.path.basename(manifest_data.path):
+            parent_folder = os.path.dirname(manifest_data.path)
+            manifest_original_name_and_path = os.path.join(parent_folder, filename)
+
+            self.syn.cache.remove(
+                file_handle_id=manifest_data.dataFileHandleId, path=manifest_data.path
+            )
+            os.rename(manifest_data.path, manifest_original_name_and_path)
+            manifest_data.path = manifest_original_name_and_path
+            self.syn.cache.add(
+                file_handle_id=manifest_data.dataFileHandleId,
+                path=manifest_original_name_and_path,
+                md5=manifest_data._file_handle.contentMd5,
+            )
+
        return manifest_data


This is a potentially impactful change here @andrewelamb @linglp @milen-sage that could use a good look. Basically what this is solving:

(When NOT running on AWS)

When manifest were downloaded the downloadAs name, or the name of the actual file, could conflict with a file that was already in the location being downloaded to. This would overwrite the file and potentially cause the test, or whatever kick off that code to fail.

When running on either AWS or not

When downloading an entity where we already have a local copy of it (Think of a file that exists within the synapse cachemap) it will copy that file over to the requested location. The issue however is that the file that is copied retains whatever name it has been renamed to (Some tests do this, and code in schematic does this). Because of this I have the os.rename logic in place to set the name of the copied file back to the expected file name so that any subsequent code behaves as expected.

BryanFauble · 2024-09-17T21:24:30Z

schematic/store/synapse.py

-            projectName = self.syn.get(projectId, downloadFile=False).name
-            projects.append((projectId, projectName))
+            project_name_from_project_header = project_id_to_name_dict.get(projectId)
+            projects.append((projectId, project_name_from_project_header))


There are a few things to note here on these changes:

I have added a change to SYNPY ([FDS-2386] Add owner id to the credentials to be used later on synapsePythonClient#1130) that stored the ID of the user profile that logged into Synapse. This removes the need to call to getUserProfile() and saves an extra HTTP call.

I am storing the result of getting the paginated list of project headers into the synapse_entity_tracker. It is really slow to crawl this information and it occurs several times. By doing this the content is re-used in the context of the request.

I removed the need to call syn.get() for each of the projects as the paginated list of project headers contained all the data we needed.

BryanFauble · 2024-09-17T21:25:43Z

schematic/store/synapse.py

+        table_parent_id = self.getDatasetProject(datasetId=datasetId)
+        existing_table_id = self.syn.findEntityId(
+            name=table_name, parent=table_parent_id
        )


Swapping to findEntityId prevents the need to walk the project to find the applicable table ID. Instead we can utilize the Synapse Rest API to find the ID for us.

This will return None if not found

BryanFauble · 2024-09-17T21:28:02Z

schematic/store/synapse.py

+            table_entity = self.synapse_entity_tracker.get(
+                synapse_id=existing_table_id, syn=self.syn, download_file=False
+            )
+            annos = OldAnnotations(
+                id=table_entity.id,
+                etag=table_entity.etag,
+                values=table_entity.annotations,
+            )
            annos["primary_key"] = table_manifest["Component"][0] + "_id"
            annos = self.syn.set_annotations(annos)
+            table_entity.etag = annos.etag
+            table_entity.annotations = annos


In this file 2 sets of Annotations from SYNPY are being used:

from synapseclient.models.annotations import Annotations # This is the new set of Annotations that has Async support from synapseclient import Annotations as OldAnnotations # This is the old set of Annotations that works with the current Entity objects

Since all of the Entities returned from syn.get() still use OldAnnotations I am using it in this context.

BryanFauble · 2024-09-17T21:28:51Z

schematic/store/synapse.py

+        try:
+            # Rename the file to file_name_new then revert
+            # This is to maintain the original file name in-case other code is
+            # expecting that the file exists with the original name
+            original_file_path = metadataManifestPath
+            new_file_path = os.path.join(
+                os.path.dirname(metadataManifestPath), file_name_new
+            )
+            os.rename(original_file_path, new_file_path)
+
+            manifest_synapse_file = self._store_file_for_manifest_upload(
+                new_file_path=new_file_path,
+                dataset_id=datasetId,
+                existing_file_name=file_name_full,
+                file_name_new=file_name_new,
+                restrict_manifest=restrict_manifest,
+            )
+            manifest_synapse_file_id = manifest_synapse_file.id
+
+        finally:
+            # Revert the file name back to the original
+            os.rename(new_file_path, original_file_path)
+
+            if manifest_synapse_file:
+                manifest_synapse_file.path = original_file_path
+
+        return manifest_synapse_file_id


Doing this prevents the need to call changeFileMetaData as the file will be uploaded with the expected name. That function was/is super slow so this is a big gain.

BryanFauble · 2024-09-17T21:30:43Z

schematic/store/synapse.py

+                    annos["annotations"]["annotations"][anno_k] = ""
                elif (
                    isinstance(anno_v, str)
                    and re.fullmatch(csv_list_regex, anno_v)
                    and rule_in_rule_list(
                        "list", dmge.get_node_validation_rules(anno_k)
                    )
                ):
-                    annos[anno_k] = anno_v.split(",")
+                    annos["annotations"]["annotations"][anno_k] = anno_v.split(",")
                else:
-                    annos[anno_k] = anno_v
+                    annos["annotations"]["annotations"][anno_k] = anno_v


I pulled in some changes from: #1485 so I could make sure everything was working. I will handle any merge conflicts if that code goes first.

BryanFauble · 2024-09-17T21:31:47Z

schematic/store/synapse.py

-            config = self.synStore.syn.getConfigFile(CONFIG.synapse_configuration_path)
+            config = get_config_file(CONFIG.synapse_configuration_path)


Moved to the non-deprecated function (Same behavior)

BryanFauble · 2024-09-17T21:33:19Z

schematic_api/api/routes.py

-def trace_function_params():
-    """capture all the parameters of API requests"""
-
-    def decorator(func):
-        """create a decorator"""
-
-        @wraps(func)
-        def wrapper(*args, **kwargs):
-            """create a wrapper function. Any number of positional arguments and keyword arguments can be passed here."""
-            tracer = trace.get_tracer(__name__)
-            # Start a new span with the function's name
-            with tracer.start_as_current_span(func.__name__) as span:
-                # Set values of parameters as tags
-                for i, arg in enumerate(args):
-                    span.set_attribute(f"arg{i}", arg)
-
-                for name, value in kwargs.items():
-                    span.set_attribute(name, value)
-                # Call the actual function
-                result = func(*args, **kwargs)
-                return result
-
-        return wrapper
-
-    return decorator
-
-


This is no longer needed with the request hooks that I added for the FlaskInstrumentor

BryanFauble · 2024-09-17T21:34:10Z

schematic_api/api/routes.py

@@ -182,7 +103,7 @@ def convert_df_to_csv(self, df, file_name):
        """

        # convert dataframe to a temporary csv file
-        temp_dir = tempfile.gettempdir()
+        temp_dir = create_temp_folder(path=tempfile.gettempdir())


This was failing with high concurrency as tempfile.gettempdir() only returned back /tmp, and for tests where the name of the uploaded file was the same was causing errors. By doing this we are guaranteed to get back a unique temporary directory.

I wonder if this impacts deployment, not sure if the /tmp directory is mounted in any way (probably not)

…s-2386-synapse-refactoring

BryanFauble · 2024-09-19T20:17:36Z

.github/workflows/api_test.yml

-      #----------------------------------------------
-      #       load cached venv if cache exists
-      #----------------------------------------------
-      - name: Load cached venv
-        id: cached-poetry-dependencies
-        uses: actions/cache@v4
-        with:
-          path: .venv
-          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}-${{ matrix.python-version }}
-
+


I pulled the changes from #1507 - As my build started to fail. I will pull in develop when this is merged in.

thomasyu888 · 2024-09-20T22:16:28Z

schematic/store/synapse_tracker.py

+        if project_headers:
+            return project_headers
+
+        all_results = syn.restGET(


Nit: There is a syn._GET_Paginated if that makes it easier

thomasyu888

I did a quick initial scan but I'll leave it to @BWMac / @jaymedina / @linglp / @andrewelamb / @GiaJordan for further comments before I go deeper into it.

I found your self review very helpful!

sonarcloud · 2024-09-21T00:23:22Z

Quality Gate passed

Issues
16 New issues
0 Accepted issues

Measures
0 Security Hotspots
82.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

BryanFauble added 2 commits September 17, 2024 13:22

Adding changes for synapse entity tracking

b369ef4

Create unique test suites for each run

30def7e

BryanFauble requested review from jaymedina and BWMac September 17, 2024 21:03

BryanFauble commented Sep 17, 2024

View reviewed changes

pyproject.toml Outdated Show resolved Hide resolved

BryanFauble commented Sep 17, 2024

View reviewed changes

Remove extra try-catch

093e179

BryanFauble commented Sep 17, 2024

View reviewed changes

Remove todo

05e936c

BryanFauble commented Sep 17, 2024

View reviewed changes

linglp and others added 13 commits September 19, 2024 12:02

list out config

881fe6d

enforce to use python 3.10

e75b41c

update workflow

8339da7

update path

7fee916

run again

90671c2

remove poetry shell

242037b

remove caching dependencies

75520dd

remove caching dependencies

d586da8

remove caching dependencies

a19c933

remove caching

41b5ebe

Merge branch 'develop' into fds-2386-synapse-refactoring

6b77396

Patching for issues

906a298

Merge remote-tracking branch 'origin/develop-fix-gh-workflow' into fd…

a7fab56

…s-2386-synapse-refactoring

BryanFauble commented Sep 19, 2024

View reviewed changes

BryanFauble added 8 commits September 19, 2024 14:33

Bug patching

fe9b27e

Merge branch 'develop' into fds-2386-synapse-refactoring

b445853

Adding tests for io_utils changes

f10640e

Log tests to CLI (Temp)

38eb3ec

Create dir if not exsists

51bc8fc

Merge branch 'develop' into fds-2386-synapse-refactoring

058eed4

Don't log the cli

14fdc96

Update to released version of synapse python client

cd60d3c

BryanFauble marked this pull request as ready for review September 20, 2024 20:27

BryanFauble requested review from andrewelamb, GiaJordan and linglp as code owners September 20, 2024 20:27

Add docstrings

662461d

thomasyu888 reviewed Sep 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FDS-2386] Synapse entity tracking and code concurrency updates #1505

[FDS-2386] Synapse entity tracking and code concurrency updates #1505

BryanFauble commented Sep 17, 2024 •

edited

Loading

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

thomasyu888 Sep 20, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

BryanFauble Sep 17, 2024

thomasyu888 Sep 20, 2024

BryanFauble Sep 19, 2024

thomasyu888 Sep 20, 2024

thomasyu888 left a comment

sonarcloud bot commented Sep 21, 2024

		anonymous_usage_statistics = AnonymizedUsageStatisticsConfig(enabled=False)

		# TODO: Do we need to build data docs here during normal operation?
		self.context.build_data_docs(resource_identifiers=[suite_identifier])

		@@ -0,0 +1,103 @@
		"""This script is responsible for creating a 'pull through cache' class that can be

		config = self.synStore.syn.getConfigFile(CONFIG.synapse_configuration_path)
		config = get_config_file(CONFIG.synapse_configuration_path)

[FDS-2386] Synapse entity tracking and code concurrency updates #1505

Are you sure you want to change the base?

[FDS-2386] Synapse entity tracking and code concurrency updates #1505

Conversation

BryanFauble commented Sep 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thomasyu888 left a comment

Choose a reason for hiding this comment

sonarcloud bot commented Sep 21, 2024

Quality Gate passed

BryanFauble commented Sep 17, 2024 •

edited

Loading