All API calls described below are intended to run on the client side. In other words, a separate client process must send requests to the OIAD HTTP service running in Docker.

Offline Processing (Corpus Seeding, Initial Clustering)

Before the clustering service can return meaningful cluster suggestions, a domain-relevant corpus must be provided to seed the clustering algorithm. Seeding can be time-consuming: in our experiments, seeding 256 analytical articles took 4.83 hours on an 8-core machine using 16 threads.

Seed Job Submission

To submit a new corpus for seeding, send an HTTP POST request to /api/v1/concept_discovery/offline_processing/submit with JSON content in the request body. An example curl call is:

curl --request POST \
  --url http://HTTPENDPOINT/api/v1/concept_discovery/offline_processing/submit \
  --header 'Content-Type: application/json' \
  --data @./offline_processing_5doc.json

The file offline_processing_5doc.json should contain:

{
    "relevant_doc_uuids":[  strings of doc uuids ],
    "allowed_words":[ strings of words from saliency list ],
    "ontology_metadata": "string of ontology_metadata file"
}

The service will return:

{
  "job_id": "0c70a96f-b34b-40ab-b588-7a38f7d4b8f3",
  "message": "Job submitted",
  "success": true
}

Note: the job_id will vary between submissions. The value shown above is reused in the polling examples below.
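The same submission can be scripted. The following is a minimal sketch using only the Python standard library. The helper name build_seed_request and its base_url argument are illustrative assumptions, not part of the OIAD API. The helper only constructs the request; passing its result to urllib.request.urlopen would send it.

```python
import json
import urllib.request

SUBMIT_PATH = "/api/v1/concept_discovery/offline_processing/submit"


def build_seed_request(base_url, doc_uuids, allowed_words, ontology_metadata):
    """Build (but do not send) the POST request that submits a seeding job.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    payload = {
        "relevant_doc_uuids": list(doc_uuids),    # strings of doc uuids
        "allowed_words": list(allowed_words),     # words from the saliency list
        "ontology_metadata": ontology_metadata,   # ontology metadata as a string
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + SUBMIT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The job_id can then be read from the parsed JSON response and kept for the status requests.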

Polling Seed Job Status

To poll for seeding job status, send an HTTP GET request to /api/v1/concept_discovery/offline_processing/job with the query parameter job_id=0c70a96f-b34b-40ab-b588-7a38f7d4b8f3:

curl --request GET \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/offline_processing/job?job_id=0c70a96f-b34b-40ab-b588-7a38f7d4b8f3'

Depending on the status of the seeding process, there are several potential responses:

The job is still running

{"success":true, "message":"Your session is not ready yet.","job_id":"job_id string","is_finished":false,"without_error":false,"job_log":"some debugging string for you to trace the progress"}

In short, is_finished: false means the job is still running on the backend and the result is not yet ready.

The job_log field contains the combined stdout and stderr output of the backend and is intended for developers. It is refreshed each time you call the endpoint and can help estimate the remaining run time of the job. The corpus seeding process runs multiple batch sequences on the backend, and execution time varies between steps.

The job is no longer running

There are three possibilities here:

The job exited without error
The job exited with error
The job was killed (for example, if the docker container unexpectedly exited), or you provided an invalid job_id

The job exited without error

You'll see

{"success":true, "message":"Your session is ready.","job_id":job_id,"is_finished":true,"without_error":true,"job_log":"some debugging string for you to trace the progress"}

The combination is_finished: true, without_error: true indicates success. At this point, you can start making online clustering service requests.

The job exited with error

{"success":true, "message":"Your session cannot be prepared.","job_id":job_id,"is_finished":true,"without_error":false,"job_log":"some debugging string for you to trace the progress"}

The combination is_finished: true, without_error: false indicates an error state.

The job was killed or you provided an invalid job_id

{"success":false, "message":"We lost track the job status, it may be you use the wrong job_id, or our backend has restarted with previous run killed. ","job_id":job_id,"is_finished":false,"without_error":false,"job_log":"some debugging string for you to trace the progress"}

This response is returned with HTTP error code 404. In practice, the success: false, is_finished: false combination should rarely occur.
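The four responses above can be told apart from three boolean fields. The sketch below shows one way a client might do this on an already parsed JSON body. The function name and the returned labels are illustrative assumptions. Because the last case arrives with HTTP 404, a client built on urllib.request would see it as a urllib.error.HTTPError and would need to read the body from the exception before classifying it.

```python
def classify_seed_status(status):
    """Map a parsed seed-job status response to a short label.

    Hypothetical helper: the labels are illustrative, not defined by the API.
    """
    if not status.get("success", False):
        # Job killed, backend restarted, or an invalid job_id was supplied.
        return "lost"
    if not status.get("is_finished", False):
        return "running"
    # Finished: without_error distinguishes success from failure.
    return "succeeded" if status.get("without_error", False) else "failed"
```

A polling loop would typically keep requesting the status while the label is "running" and stop on any of the other three.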

To Kill a Corpus Seeding Job in Progress

We do not enforce a singleton pattern on corpus seeding jobs, so in theory multiple jobs can run concurrently. However, concurrent jobs will likely exhaust CPU or memory and cause undefined behavior.

You can attempt to kill a previously scheduled job with a DELETE request:

curl --request DELETE \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/offline_processing/job?job_id=0c70a96f-b34b-40ab-b588-7a38f7d4b8f3'

It will return

{"success":true, "message":"We killed your job.","job_id":job_id}

Or

{"success":false, "message":"We cannot kill your job, your job_id is valid but the process is not in running state.","job_id":job_id}

This second response is returned with HTTP error code 500.
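For clients that do not shell out to curl, the DELETE request can be built as in the following sketch. The helper name build_kill_request and the base_url argument are assumptions made for illustration. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

JOB_PATH = "/api/v1/concept_discovery/offline_processing/job"


def build_kill_request(base_url, job_id):
    """Build (but do not send) the DELETE request that kills a seeding job.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    query = urllib.parse.urlencode({"job_id": job_id})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{JOB_PATH}?{query}",
        method="DELETE",
    )
```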

Online Processing (Clustering On Demand)

Clustering Result Format

Please find an example output at example_oiad_clusters.json.

To Fetch Initial Clustering Result (after offline processing/corpus seeding)

We use a special uuid (00000000-0000-0000-0000-000000000000) for retrieving the clustering results from corpus seeding.

The results can be fetched with this call:

curl --request GET \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/clustering/result?job_id=00000000-0000-0000-0000-000000000000'
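Since the initial result and later reclustering results are fetched from the same endpoint and differ only in job_id, a client might wrap the URL construction in one small helper. This is a sketch: the function and constant names are illustrative assumptions.

```python
import urllib.parse

RESULT_PATH = "/api/v1/concept_discovery/clustering/result"
# Special uuid reserved for the clustering result produced by corpus seeding.
SEED_RESULT_JOB_ID = "00000000-0000-0000-0000-000000000000"


def clustering_result_url(base_url, job_id=SEED_RESULT_JOB_ID):
    """Return the result URL; defaults to the corpus-seeding result."""
    query = urllib.parse.urlencode({"job_id": job_id})
    return f"{base_url.rstrip('/')}{RESULT_PATH}?{query}"
```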

Reclustering Process

We use an asynchronous API for reclustering because the clustering algorithm does not return instantaneously.

Reclustering job submission

curl --request POST \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/clustering/submit' \
  --header 'Content-Type: application/json' \
  --data '{"allowed_words":[],"ontology_metadata":""}'

allowed_words are the words (or phrases joined by _) that you want the clustering algorithm to consider. ontology_metadata is the ontology metadata in string form.

It will return

{
  "job_id": "1620326564",
  "message": "Job submitted",
  "success": true
}

For simplicity, we continue to use the same job_id (1620326564) for all of the following example requests.

Reclustering job status

curl --request GET \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/clustering/job?job_id=1620326564'

There will be two possible return values:

{
  "is_ready": false,
  "job_id": "1620326564",
  "message": "Your job is still pending.",
  "success": true
}

This status means that either the job is still running or it finished with an error. Because we don't explicitly track the status of the subprocess, the client should implement a timeout that treats the reclustering as failed if the pending status persists for more than 10 minutes. A typical clustering request finishes in 4 minutes.

{
  "is_ready": true,
  "job_id": "1620326564",
  "message": "Your result is ready",
  "success": true
}

This means the clustering algorithm has returned with valid results.
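Because a pending status cannot distinguish a running job from a failed one, the client-side timeout described above might look like the following sketch. The function name, the 10-second polling interval, and the injectable fetch_status, clock, and sleep parameters are assumptions made for illustration; fetch_status stands for any caller-supplied function that returns the parsed JSON of the job status endpoint.

```python
import time


def wait_for_reclustering(fetch_status, timeout_s=600.0, interval_s=10.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until is_ready is true; return False if timeout_s elapses first.

    Hypothetical helper. The 600-second default mirrors the 10-minute
    limit suggested for treating a pending job as failed.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_status().get("is_ready", False):
            return True   # result can now be fetched
        if clock() >= deadline:
            return False  # still pending after the timeout: assume failure
        sleep(interval_s)
```

Passing clock and sleep as parameters keeps the loop testable without real waiting; in normal use the defaults suffice.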

Get reclustering result

curl --request GET \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/clustering/result?job_id=1620326564'

Rescoring Process

As with reclustering, rescoring uses an asynchronous API. The service rescores and reranks the existing clustering results. API calls and responses are similar to those of the reclustering endpoints.

Rescoring job submission

curl --request POST \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/rescoring/submit' \
  --header 'Content-Type: application/json' \
  --data '{"ontology_metadata":""}'

Unlike reclustering, the rescoring endpoint does not take a set of allowed words, because rescoring only reorders the existing clusters and does not otherwise change the set returned.

Rescoring job status

curl --request GET \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/rescoring/job?job_id=d605864b-6ee4-46d2-9bc1-3c117b91f554'

Get rescoring result

curl --request GET \
  --url 'http://HTTPENDPOINT/api/v1/concept_discovery/rescoring/result?job_id=d605864b-6ee4-46d2-9bc1-3c117b91f554'
