USERMANUAL.md

  • Different TaskRoles compose a Framework
  • Same Tasks compose a TaskRole
  • A User Service is executed by all Tasks in its corresponding TaskRole
  • A YARN Application is an execution attempt of a Framework
  • A YARN Container is an execution attempt of a Task
  1. Prepare Framework

    1. Upload Framework Executable to HDFS

      Upload the Example Framework Executable to HDFS:

       hadoop fs -mkdir -p /ExampleFramework/
       hadoop fs -put -f ExampleFramework.sh /ExampleFramework/
      
    2. Write Framework Description File

      Just use the Example Framework Description File.

       Example Framework Description Explanation:
       • The Example Framework with Version 1 contains 1 TaskRole named LRSMaster.
       • LRSMaster contains 2 Tasks and they will be executed for LRSMaster's TaskService.
       • LRSMaster's TaskService with Version 1 is defined by its EntryPoint, SourceLocations and Resource.
       • The EntryPoint and SourceLocations define the Service's corresponding Executable, which needs to be run inside Containers.
       • The Resource defines the Container Resource Guarantee / Limitation.
      
  2. Launch Framework

    The Launcher Service needs to be started before launching a Framework.

    See README to Start Launcher Service

    See Root URI to get {LauncherAddress}

    HTTP PUT the Framework Description File as json to:

     http://{LauncherAddress}/v1/Frameworks/ExampleFramework
    

    For example, with curl, you can run the command below:

     curl -X PUT http://{LauncherAddress}/v1/Frameworks/ExampleFramework -d @ExampleFramework.json --header "Content-Type: application/json"
    
  3. Monitor Framework

    Check all the Requested Frameworks by:

     http://{LauncherAddress}/v1/Frameworks
    

    Check ExampleFramework by:

     http://{LauncherAddress}/v1/Frameworks/ExampleFramework
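
    The monitoring URLs above can also be queried from a script. The following is a minimal sketch, assuming curl is installed and LAUNCHER_ADDRESS holds your {LauncherAddress}; the helper names framework_url and get_framework are hypothetical and not part of the Launcher. The response body is printed as-is, since its schema is defined by the Launcher Data Model.

```shell
# Hypothetical helpers (not part of the Launcher).
# LAUNCHER_ADDRESS must be set to your {LauncherAddress}.

# Print the URL of one Framework, or of the Framework list if no name is given.
framework_url() {
    if [ -n "${1:-}" ]; then
        printf 'http://%s/v1/Frameworks/%s\n' "$LAUNCHER_ADDRESS" "$1"
    else
        printf 'http://%s/v1/Frameworks\n' "$LAUNCHER_ADDRESS"
    fi
}

# GET the resource and print the response body.
# -s silences progress output, -f makes curl exit non-zero on HTTP errors.
get_framework() {
    curl -sf "$(framework_url "${1:-}")"
}
```

    After setting LAUNCHER_ADDRESS, `get_framework ExampleFramework` prints the information of the Example Framework, and `get_framework` without arguments lists all Requested Frameworks.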
    

Architecture

LauncherInterfaces:

  • RestAPI
  • Submit Framework Description

LauncherService:

  • One Central Service
  • Manages all Frameworks for the whole Cluster.

LauncherAM:

  • Per-Framework Service
  • Manages Tasks for a single Framework according to customized feature requirements

Pipeline

Launcher Service can be configured by LauncherConfiguration. You can check the Type, Specification and FeatureUsage inside it.

A default configuration is also provided for reference: Default LauncherConfiguration File.

  • All APIs are IDEMPOTENT and STATELESS, to allow trivial Work Preserving Client Restart. In other words, the User does not need to worry about calling one API multiple times from different Client instances (such as after a Client Restart).
  • All APIs are DISTRIBUTED THREAD SAFE, to allow multiple distributed Client instances to access them. In other words, the User does not need to worry about calling them at the same time from multiple Threads/Processes/Nodes.
  • LauncherService can only handle a finite, limited request volume. The User should try to minimize the overall request frequency and payload, so that the LauncherService is not overloaded. To achieve this, the User can centralize requests, space out requests, filter responses and so on. Moreover, to get information in a more scalable way than RestAPI, see HDFS Published Informations.
  • Completed Frameworks are ONLY retained for the most recent FrameworkCompletedRetainSec, in case the Client fails to delete the Framework after FrameworkCompleted. One exception is a Framework Launched by DataDeployment: it is retained until the corresponding FrameworkDescriptionFile is deleted in the DataDeployment. To avoid missing the CompletedFrameworkStatus, the polling interval of the Client should be less than FrameworkCompletedRetainSec. Check the frameworkCompletedRetainSec by GET LauncherStatus.

Configure it as webServerAddress inside LauncherConfiguration File.

Headers Description
UserName Specifies which User sends the Request. It is effective iff webServerAclEnable is true, see Framework ACL.

Request

PUT /v1/Frameworks/{FrameworkName}

Type: application/json

Body: FrameworkDescriptor

Description

Add a NOT Requested Framework or Update a Requested Framework.

  1. Add a NOT Requested Framework: Framework will be Added and Launched (Now it is Requested).
  2. Update a Requested Framework:
    1. If FrameworkVersion unchanged:
      1. Framework will be Updated to the FrameworkDescription on the fly (i.e. Work Preserving).
      2. To Update the Framework on the fly, it is better to use the corresponding PartialUpdate (such as PUT TaskNumber) than to PUT the entire FrameworkDescription here, because partially updating the FrameworkDescription avoids the Race Condition (or Transaction Conflict) between two PUT Requests. Besides, the behaviour is undefined when changing parameters in the FrameworkDescription which are not supported by PartialUpdate.
    2. Else:
      1. Framework will be NonRolling Upgraded to new FrameworkVersion. (i.e. Not Work Preserving).
      2. NonRolling Upgrade can be used to change parameters in the FrameworkDescription which are not supported by PartialUpdate (such as Framework Queue).
      3. NonRolling Upgrade should be triggered by changing the FrameworkVersion, instead of DELETE then PUT with the same FrameworkVersion.
  3. User is responsible and free to specify the FrameworkName of the Framework, however, the FrameworkName should respect the Framework ACL.
  4. After Accepted Response, its corresponding Status (such as FrameworkStatus and AggregatedFrameworkStatus) exists immediately, too. However, the Status may not be updated according to the Request (FrameworkDescriptor) immediately. So, to check whether it has been updated, Client still needs to poll the GET Status APIs.

Response

HttpStatusCode Body Description
Accepted(202) NULL The Request has been recorded for the backend to process. It does not mean that the processing of the Request has been completed.
BadRequest(400) ExceptionMessage The Request validation failed. So, the Client is expected not to retry for this non-transient failure, and to correct the Request instead.
Forbidden(403) ExceptionMessage The Request authorization failed. So, the Client is expected not to retry for this non-transient failure, and to correct the Request or ask the Administrator to grant the Request privilege instead. This Response may happen only if webServerAclEnable is true, see Framework ACL.
TooManyRequests(429) ExceptionMessage The Request is rejected because the New Total TaskNumber would exceed the Max Total TaskNumber if the backend accepted it. So, the Client is expected to retry for this transient failure or migrate the whole Framework to another Cluster.
ServiceUnavailable(503) ExceptionMessage The Request cannot be recorded for the backend to process. In our system, this only happens when the target Cluster's Zookeeper is down for a long time. So, the Client is expected to retry for this transient failure or migrate the whole Framework to another Cluster.
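
A Client can turn the status codes above into a retry decision. The following is a minimal sketch; the function name client_action and the action labels it prints are hypothetical, chosen only for illustration. The 404 case follows the NotFound description given for GET FrameworkStatus.

```shell
# Map an HTTP status code returned by the Launcher to a suggested Client action.
# Prints one of: done, fix-request, put-first, retry, unknown (illustrative labels).
client_action() {
    case "$1" in
        200|202) echo done ;;         # OK / Accepted
        400|403) echo fix-request ;;  # non-transient: correct the Request or its privilege
        404)     echo put-first ;;    # non-transient: the Framework has not been Requested yet
        429|503) echo retry ;;        # transient: retry, or migrate to another Cluster
        *)       echo unknown ;;      # not described in this manual
    esac
}
```

With curl, the status code alone can be captured by adding `-s -o /dev/null -w '%{http_code}'` to the request and passing the printed value to `client_action`.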

Request

DELETE /v1/Frameworks/{FrameworkName}

Description

Delete a Framework, whether or not it is Requested.

Notes:

  1. Framework will be Stopped and Deleted (Now it is NOT Requested).
  2. After the Accepted Response, its corresponding Status immediately ceases to exist, too.
  3. Only recently completed Frameworks are kept, in case the Client fails to DELETE the Framework after FrameworkCompleted. One exception is a Framework Launched by DataDeployment: it is kept until the corresponding FrameworkDescriptionFile is deleted in the DataDeployment.

Response

HttpStatusCode Body Description
Accepted(202) NULL Same as PUT Framework
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/Frameworks/{FrameworkName}/FrameworkStatus

Description

Get the FrameworkStatus of a Requested Framework

Recipes:

  1. User Level RetryPolicy (Based on FrameworkState, ApplicationExitCode, ApplicationExitDiagnostics, ApplicationExitType)
  2. Directly Monitor Underlay YARN Application by YARN CLI or RestAPI (Based on ApplicationId or ApplicationTrackingUrl)

Response

HttpStatusCode Body Description
OK(200) FrameworkStatus
NotFound(404) ExceptionMessage Specified Framework has not been Requested yet. So, Client is expected to not retry for this non-transient failure and then PUT the corresponding Framework first.
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

PUT /v1/Frameworks/{FrameworkName}/TaskRoles/{TaskRoleName}/TaskNumber

Type: application/json

Body: UpdateTaskNumberRequest

Description

Update TaskNumber for a Requested Framework

Response

HttpStatusCode Body Description
Accepted(202) NULL Same as PUT Framework
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
TooManyRequests(429) ExceptionMessage Same as PUT Framework
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

PUT /v1/Frameworks/{FrameworkName}/ExecutionType

Type: application/json

Body: UpdateExecutionTypeRequest

Description

Update ExecutionType for a Requested Framework

Notes:

  1. A change of the Framework ExecutionType from STOP to START is ignored if the previous STOP has already caused the Framework of the current FrameworkVersion to enter FINAL_STATES, i.e. FRAMEWORK_COMPLETED. So, to ensure the Framework starts again after a STOP, change the FrameworkVersion instead.

Response

HttpStatusCode Body Description
Accepted(202) NULL Same as PUT Framework
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

PUT /v1/Frameworks/{FrameworkName}/MigrateTasks/{ContainerId}

Type: application/json

Body: MigrateTaskRequest

Description

Migrate a Task from its current Container to another Container for a Requested Framework. The new Container and the old Container will satisfy the AntiAffinityLevel constraint.

Notes:

  1. User is responsible for implementing Health/Perf Measurement of the Service based on monitoring TaskStatuses or self-contained communication. If Health/Perf degradation is found, User can migrate the Task by calling this API with the corresponding ContainerId as parameter.
  2. Currently, only the Any AntiAffinityLevel is supported.

Response

HttpStatusCode Body Description
Accepted(202) NULL Same as PUT Framework
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

PUT /v1/Frameworks/{FrameworkName}/ApplicationProgress

Type: application/json

Body: OverrideApplicationProgressRequest

Description

Update ApplicationProgress for a Requested Framework

Notes:

  1. If User does not call this API, the default ApplicationProgress is used, which is calculated as CompletedTaskCount / TotalTaskCount.
  2. User is responsible for implementing Progress Measurement of the Service based on monitoring Task logs or self-contained communication, and then feeding back the Progress by calling this API to Override the default ApplicationProgress.

Response

HttpStatusCode Body Description
Accepted(202) NULL Same as PUT Framework
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/Frameworks/{FrameworkName}

Description

Get the FrameworkInfo of a Requested Framework

FrameworkInfo = SummarizedFrameworkInfo + AggregatedFrameworkRequest + AggregatedFrameworkStatus

Response

HttpStatusCode Body Description
OK(200) FrameworkInfo
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/Frameworks
QueryParameter Description
UserName Filter the result to only return Frameworks whose UserName equals the given value.

Description

Get the SummarizedFrameworkInfos of all Requested Frameworks

A Framework's SummarizedFrameworkInfo consists of selected fields from its Status and Request

Response

HttpStatusCode Body Description
OK(200) SummarizedFrameworkInfos
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/Frameworks/{FrameworkName}/AggregatedFrameworkStatus

Description

Get the AggregatedFrameworkStatus of a Requested Framework

AggregatedFrameworkStatus = FrameworkStatus + all TaskRoles' (TaskRoleStatus + TaskStatuses)

TaskStatuses Recipes:

  1. ServiceDiscovery (Based on TaskRoleName, ContainerHostName, ContainerIPAddress, ServiceId)
  2. TaskLogForwarding (Based on ContainerLogHttpAddress)
  3. MasterSlave and MigrateTask (Based on ContainerId)
  4. DataPartition (Based on TaskIndex) (Note: TaskIndex does not change after a Task is Restarted, Migrated or Upgraded)

Response

HttpStatusCode Body Description
OK(200) AggregatedFrameworkStatus
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/Frameworks/{FrameworkName}/FrameworkRequest

Description

Get the FrameworkRequest of a Requested Framework

Current FrameworkDescriptor for the Framework is included in FrameworkRequest and it can reflect latest updates.

Response

HttpStatusCode Body Description
OK(200) FrameworkRequest
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/Frameworks/{FrameworkName}/AggregatedFrameworkRequest

Description

Get the AggregatedFrameworkRequest of a Requested Framework

AggregatedFrameworkRequest = FrameworkRequest + all other feedback Request

Response

HttpStatusCode Body Description
OK(200) AggregatedFrameworkRequest
NotFound(404) ExceptionMessage Same as GET FrameworkStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/LauncherRequest

Description

Get the LauncherRequest

Response

HttpStatusCode Body Description
OK(200) LauncherRequest
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/LauncherStatus

Description

Get the LauncherStatus

Current LauncherConfiguration is included in LauncherStatus and it can reflect latest updates.

Response

HttpStatusCode Body Description
OK(200) LauncherStatus
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

PUT /v1/LauncherRequest/ClusterConfiguration

Type: application/json

Body: ClusterConfiguration

Description

Update the ClusterConfiguration for all Frameworks on the fly

Besides the cluster information provided by YARN, Administrator can use this API to provide external information about the current cluster configuration, which helps Launcher to schedule Tasks. The features below depend on it:

  1. taskGpuType

Response

HttpStatusCode Body Description
Accepted(202) NULL Same as PUT Framework
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/LauncherRequest/ClusterConfiguration

Description

Get the ClusterConfiguration

Response

HttpStatusCode Body Description
OK(200) ClusterConfiguration
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

PUT /v1/LauncherRequest/AclConfiguration

Type: application/json

Body: AclConfiguration

Description

Update the AclConfiguration

It takes effect immediately after the Response iff webServerAclEnable is true, see Framework ACL.

Response

HttpStatusCode Body Description
OK(200) NULL
BadRequest(400) ExceptionMessage Same as PUT Framework
Forbidden(403) ExceptionMessage Same as PUT Framework
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

Request

GET /v1/LauncherRequest/AclConfiguration

Description

Get the AclConfiguration

Response

HttpStatusCode Body Description
OK(200) AclConfiguration
ServiceUnavailable(503) ExceptionMessage Same as PUT Framework

You can check the Type, Specification and FeatureUsage inside Launcher Data Model.

For example:

A Framework is Defined and Requested by the FrameworkDescriptor data structure. To find the FeatureUsage inside FrameworkDescriptor, you can refer to the comments inside FrameworkDescriptor.

Launcher sets up the EnvironmentVariables below for each User Service to use:

  1. Used to locate itself during the whole Framework life cycle. So, they will not be changed after Migration or Restart.
EnvironmentVariable Description
LAUNCHER_ADDRESS
FRAMEWORK_NAME
FRAMEWORK_VERSION
TASKROLE_NAME
TASK_INDEX
SERVICE_NAME
SERVICE_VERSION
  2. Used to locate itself during a specific execution attempt. So, they will be changed to different values after Migration or Restart.
EnvironmentVariable Description
APP_ID Framework's current associated Application ID.
CONTAINER_ID Task's current associated Container ID.
CONTAINER_IP Task's current associated Container IP address.
  3. Used to access the Launcher managed public HDFS store.
EnvironmentVariable Description
HDFS_USER_STORE_ROOT_DIR For User to store data under a directory which is garbage-collected automatically after the Framework is Deleted.
HDFS_FRAMEWORK_INFO_FILE See HDFS Published Informations: FrameworkInfo File.
  4. Used to get the assigned Resource, only valid when the corresponding feature is enabled.
EnvironmentVariable Description
CONTAINER_GPUS Only valid when gpuNumber is greater than 0. It is a number in which each bit represents a GPU; for example, 3 represents gpu0 and gpu1.
CONTAINER_PORTS Only valid when portDefinitions is not empty. It is a string, a format example is: "portLabel1:port1,port2;portLabel2:port3,port4;".
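
The two Resource variables can be decoded inside a User Service with plain shell. The following is a minimal sketch based only on the formats described above; the function names gpu_indices and ports_for_label are hypothetical. Note that functions in POSIX sh share global variables, so the variable names used here may need adjusting in a larger script.

```shell
# Print the indices of the GPUs encoded in a CONTAINER_GPUS bitmask, space separated.
gpu_indices() {
    mask=$1
    index=0
    out=""
    while [ "$mask" -gt 0 ]; do
        # The lowest bit tells whether the GPU at the current index is assigned.
        if [ $((mask % 2)) -eq 1 ]; then
            out="${out:+$out }$index"
        fi
        mask=$((mask / 2))
        index=$((index + 1))
    done
    echo "$out"
}

# Print the comma-separated ports of one label from a CONTAINER_PORTS string.
ports_for_label() {
    label=$1
    # Split the entries on ';', then split each entry into "label" and "ports" on ':'.
    printf '%s\n' "$2" | tr ';' '\n' | while IFS=: read -r name ports; do
        if [ "$name" = "$label" ]; then
            echo "$ports"
        fi
    done
}
```

For example, `gpu_indices "$CONTAINER_GPUS"` prints `0 1` when CONTAINER_GPUS is 3, and `ports_for_label portLabel2 "$CONTAINER_PORTS"` prints `port3,port4` for the format example above.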

Launcher publishes information to public HDFS files for User to fetch in a more scalable way than RestAPI.

Notes:

  1. HDFS Published Informations are subsets of RestAPI.
  2. HDFS Published Informations are mainly used to improve the scalability of fetching information. For example, fetching from a per-Framework HDFS FrameworkInfo File is more scalable than fetching from the global single-instance LauncherService by RestAPI.
  3. HDFS Published Informations are not as stable and up-to-date as RestAPI. So, the Client is suggested to first fetch the HDFS Published Informations and, if that fails after several retries, fall back to RestAPI.
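
The suggested "HDFS first, then RestAPI" pattern can be sketched as below. This is an illustration, not a Launcher-provided tool: the function names with_fallback, fetch_from_hdfs and fetch_from_rest and the variable RETRY_DELAY_SEC are hypothetical, and the sketch assumes that LAUNCHER_ADDRESS has the form host:port without a scheme.

```shell
# Run PRIMARY up to MAX_TRIES times; if it never succeeds, run FALLBACK.
# Usage: with_fallback MAX_TRIES PRIMARY FALLBACK  (both are function or command names)
with_fallback() {
    max_tries=$1
    primary=$2
    fallback=$3
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        # Capture the output so that a failed attempt prints nothing.
        if output=$("$primary"); then
            printf '%s\n' "$output"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY_SEC:-1}"   # hypothetical knob: delay between attempts
    done
    "$fallback"
}

# Read the per-Framework FrameworkInfo File published on HDFS.
fetch_from_hdfs() {
    hadoop fs -cat "$HDFS_FRAMEWORK_INFO_FILE"
}

# Fall back to GET Framework on the RestAPI (assumes LAUNCHER_ADDRESS is host:port).
fetch_from_rest() {
    curl -sf "http://$LAUNCHER_ADDRESS/v1/Frameworks/$FRAMEWORK_NAME"
}
```

Inside a Container, `with_fallback 3 fetch_from_hdfs fetch_from_rest` prints the FrameworkInfo from HDFS if any of the three attempts succeeds, and only then contacts the LauncherService.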

Check the hdfsRootDir by GET LauncherStatus.

HDFS Path

/{LauncherHDFSRootDir}/{FrameworkName}/FrameworkInfo.json

And it is also set by below EnvironmentVariable, see EnvironmentVariables:

{HDFS_FRAMEWORK_INFO_FILE}

Description

FrameworkInfo File contains the FrameworkInfo of a Requested Framework in JSON format, the same as returned by GET Framework.

Notes:

  1. The FrameworkInfo File may not be up-to-date if the Framework is not in APPLICATION_RUNNING state. So, Client is suggested to only fetch this file inside the Containers of the Framework.

You can check all the defined ExitStatus by: ExitType, ExitDiagnostics.

Recipes:

  1. Your LauncherClient can depend on the ExitStatus Convention
  2. If your Service failed, it can optionally return the ExitCode USER_APP_TRANSIENT_ERROR or USER_APP_NON_TRANSIENT_ERROR to help FancyRetryPolicy identify your Service's TRANSIENT_NORMAL or NON_TRANSIENT ExitType. If neither ExitCode is returned, the Service is considered to have exited with the UNKNOWN ExitType.

RetryPolicy can be configured for the whole Framework and each TaskRole to control:

  1. Framework RetryPolicy:
    The conditions to retry the whole Framework after the Framework's current associated Application completed.
    It can also be considered as Framework CompletionPolicy, i.e. the conditions to complete the whole Framework.

  2. Task RetryPolicy:
    The conditions to retry a single Task in the TaskRole after the Task's current associated Container completed.
    It can also be considered as Task CompletionPolicy, i.e. the conditions to complete a single Task in the TaskRole.

For details, please check: RetryPolicyDescriptor, RetryPolicyState.

Notes:

  1. Italic Conditions can be inherited from the DEFAULT RetryPolicy, so no need to specify them explicitly.
  2. For the definition of each ExitType, such as transient failure, see ExitStatus Convention.
| FrameworkType | Framework RetryPolicy | TaskRole | Task RetryPolicy | Description |
| --- | --- | --- | --- | --- |
| DEFAULT | FancyRetryPolicy = false; MaxRetryCount = 0 | TaskRole1 | FancyRetryPolicy = false; MaxRetryCount = 0 | The default RetryPolicy: Never Retry for any failure or success. |
| | | TaskRole2 | FancyRetryPolicy = false; MaxRetryCount = 0 | |
| Service | FancyRetryPolicy = false; MaxRetryCount = -2 | TaskRole1 | FancyRetryPolicy = false; MaxRetryCount = -2 | Always Retry for any failure or success. |
| Blind Batch Job | FancyRetryPolicy = false; MaxRetryCount = -1 | TaskRole1 | FancyRetryPolicy = false; MaxRetryCount = -1 | Always Retry for any failure. Never Retry for success. |
| Batch Job with Task Fault Tolerance | FancyRetryPolicy = true; MaxRetryCount = 3 | TaskRole1 | FancyRetryPolicy = true; MaxRetryCount = 3 | Always Retry for transient failure. Never Retry for non-transient failure or success. Retry up to 3 times for unknown failure. |
| Batch Job without Task Fault Tolerance | FancyRetryPolicy = true; MaxRetryCount = 3 | TaskRole1 | FancyRetryPolicy = false; MaxRetryCount = 0 | For Framework RetryPolicy, same as "Batch Job with Task Fault Tolerance". For Task RetryPolicy, because the Task cannot tolerate any failed Container (for example, it cannot recover from a previously failed Container), Never Retry the Task for any failure or success. |
| Debug Mode | FancyRetryPolicy = true; MaxRetryCount = 0 | TaskRole1 | FancyRetryPolicy = true; MaxRetryCount = 0 | Always Retry for transient failure. Never Retry for non-transient failure or unknown failure or success. This can help to capture the unexpected exit of the User Service itself. |
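
The rows of the RetryPolicy examples can be read as one decision procedure. The following is a sketch that reproduces the listed rows, not the Launcher's implementation: the function name retry_decision and the ExitType labels it accepts are hypothetical, and the behaviour for combinations that the examples do not list (such as FancyRetryPolicy = false with a positive MaxRetryCount, assumed here to retry any failure up to that count) is an assumption.

```shell
# Decide whether to retry, following the RetryPolicy examples.
# Arguments: FANCY (true|false)  MAX_RETRY_COUNT  EXIT_TYPE  RETRIED_COUNT
# EXIT_TYPE is one of: success, transient, non_transient, unknown (illustrative labels).
# Prints "retry" or "complete".
retry_decision() {
    fancy=$1
    max=$2
    exit_type=$3
    retried=$4
    # MaxRetryCount = -2: always retry, even after success.
    if [ "$max" -eq -2 ]; then echo retry; return; fi
    # Any other policy never retries after success.
    if [ "$exit_type" = success ]; then echo complete; return; fi
    # FancyRetryPolicy settles transient and non-transient failures by ExitType alone.
    if [ "$fancy" = true ]; then
        case "$exit_type" in
            transient)     echo retry; return ;;
            non_transient) echo complete; return ;;
        esac
    fi
    # Remaining failures are limited by MaxRetryCount (-1 means unlimited).
    if [ "$max" -eq -1 ] || [ "$retried" -lt "$max" ]; then
        echo retry
    else
        echo complete
    fi
}
```

For example, `retry_decision true 3 unknown 2` prints `retry`, while `retry_decision true 3 unknown 3` prints `complete`, matching "Retry up to 3 times for unknown failure".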

ApplicationCompletionPolicy can be configured for each TaskRole to control:

  1. The conditions to complete the Application.
  2. The ExitStatus of the completed Application.

For details, please check: TaskRoleApplicationCompletionPolicyDescriptor.

Notes:

  1. Italic Conditions can be inherited from the DEFAULT ApplicationCompletionPolicy, so no need to specify them explicitly.
| FrameworkType | TaskRole | ApplicationCompletionPolicy | Description |
| --- | --- | --- | --- |
| DEFAULT | TaskRole1 | MinFailedTaskCount = 1; MinSucceededTaskCount = null | The default ApplicationCompletionPolicy: Fail the Application immediately if any Task failed. Succeed the Application only after all Tasks have succeeded. |
| | TaskRole2 | MinFailedTaskCount = 1; MinSucceededTaskCount = null | |
| Service | TaskRole1 | MinFailedTaskCount = 1; MinSucceededTaskCount = null | Actually, any ApplicationCompletionPolicy is fine, since the Service's Tasks will never complete, i.e. its Task's MaxRetryCount is -2, see RetryPolicy Examples. |
| MapReduce | Map | MinFailedTaskCount = {Map.TaskNumber} * {mapreduce.map.failures.maxpercent} + 1; MinSucceededTaskCount = null | A few failed Tasks are acceptable, but always wait for all Tasks to succeed: Fail the Application immediately if the failed Tasks exceed the limit. Succeed the Application only after all Tasks have completed and the number of failed Tasks is within the limit. |
| | Reduce | MinFailedTaskCount = {Reduce.TaskNumber} * {mapreduce.reduce.failures.maxpercent} + 1; MinSucceededTaskCount = null | |
| TensorFlow | ParameterServer | MinFailedTaskCount = 1; MinSucceededTaskCount = null | It is enough for a certain TaskRole to succeed, without waiting for all Tasks to succeed: Fail the Application immediately if any Task failed. Succeed the Application immediately if all of Worker's Tasks succeeded. |
| | Worker | MinFailedTaskCount = 1; MinSucceededTaskCount = {Worker.TaskNumber} | |
| Arbitrator Dominated | Arbitrator | MinFailedTaskCount = 1; MinSucceededTaskCount = 1 | The ApplicationCompletionPolicy is fully delegated to the Application's single instance arbitrator: Fail the Application immediately if the arbitrator failed. Succeed the Application immediately if the arbitrator succeeded. |
| | TaskRole1 | MinFailedTaskCount = null; MinSucceededTaskCount = null | |
| | TaskRole2 | MinFailedTaskCount = null; MinSucceededTaskCount = null | |
| First Completed Task Dominated | TaskRole1 | MinFailedTaskCount = 1; MinSucceededTaskCount = 1 | The ApplicationCompletionPolicy is fully delegated to the Application's first completed Task: Fail the Application immediately if any Task failed. Succeed the Application immediately if any Task succeeded. |
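
To make the two thresholds concrete, the following sketch evaluates the policy for a Framework with a single TaskRole. It only mirrors the descriptions in the examples and is not the Launcher's implementation: the function name completion_decision is hypothetical, the literal string null stands for an unset threshold, and the rule that the Application succeeds once all Tasks have completed without reaching MinFailedTaskCount is inferred from the DEFAULT and MapReduce rows. Frameworks with several TaskRoles need the counts of all TaskRoles and are out of scope here.

```shell
# Evaluate the ApplicationCompletionPolicy of a single-TaskRole Framework.
# Arguments: MIN_FAILED  MIN_SUCCEEDED  FAILED_COUNT  SUCCEEDED_COUNT  TASK_NUMBER
# Pass the literal string "null" for an unset threshold.
# Prints "failed", "succeeded" or "running".
completion_decision() {
    min_failed=$1
    min_succeeded=$2
    failed=$3
    succeeded=$4
    total=$5
    if [ "$min_failed" != null ] && [ "$failed" -ge "$min_failed" ]; then
        echo failed       # enough Tasks failed
    elif [ "$min_succeeded" != null ] && [ "$succeeded" -ge "$min_succeeded" ]; then
        echo succeeded    # enough Tasks succeeded
    elif [ $((failed + succeeded)) -ge "$total" ]; then
        echo succeeded    # all Tasks completed without reaching MinFailedTaskCount
    else
        echo running
    fi
}
```

For example, with the DEFAULT policy and two Tasks, `completion_decision 1 null 0 1 2` prints `running`, and `completion_decision 1 null 1 0 2` prints `failed`.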

Framework ACL specifies which Users/Groups are able to access a specific FrameworkName, whether or not the FrameworkName exists. So, essentially, it is the ACL for Users/Groups against the FrameworkName's Namespace.

Framework ACL helps to:

  1. Prevent one User/Group from occupying a FrameworkName (by Add Framework) reserved for another User/Group.
  2. Prevent one User/Group from modifying a Framework (by Update Framework) launched by another User/Group.

Assumption:

  1. No naming conflict among UserNames and GroupNames.
  2. UserNames and GroupNames satisfy regex ^[A-Za-z0-9\-._]{1,254}$.

Framework ACL is enabled iff the webServerAclEnable is true, check it by GET LauncherStatus.

  1. If it is disabled, any User/Group can Read and Write the whole namespace. We assume it is enabled for simplicity below.
  2. Administrator can always Read and Write the whole namespace, this fact will be omitted for simplicity below.

Namespace Privilege:

Namespace Write Privilege: Add or Update Framework in the Namespace

Namespace Read Privilege: Get Framework in the Namespace

Namespace Mechanism:

{Namespace}~(AnyName)
  1. It is pre-created, so no need to create the namespace in advance.
  2. All Users/Groups can Read the namespace.
  3. Initially, only the User named {Namespace} can Write the namespace. To grant the Write Privilege to more Users, see PUT AclConfiguration.
  4. Based on the Assumption and Namespace Mechanism, the suggested Usage Pattern for Users/Groups can be derived below.
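
A Client can check names locally before calling PUT Framework. The following is a minimal sketch of the naming rule and the initial Write rule described above; the function names are hypothetical, and the check deliberately ignores Administrator access and any grants made through PUT AclConfiguration. It also assumes that the namespace is the part of the FrameworkName before the first "~".

```shell
# Succeed if a UserName or GroupName matches ^[A-Za-z0-9\-._]{1,254}$.
is_valid_name() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9._-]{1,254}$'
}

# Print the namespace of a FrameworkName, i.e. the text before the first "~".
namespace_of() {
    printf '%s\n' "$1" | cut -d '~' -f 1
}

# Succeed if USER is the initial writer of the namespace of FRAMEWORK_NAME.
# Usage: can_write_initially USER FRAMEWORK_NAME
can_write_initially() {
    [ "$(namespace_of "$2")" = "$1" ]
}
```

For example, `can_write_initially UA 'UA~job1'` succeeds, while `can_write_initially UB 'UA~job1'` fails until UB is granted the Write Privilege.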

User Usage Pattern:

The private namespace for the User named {UserName} is:

{UserName}~(AnyName)
  1. It is pre-created, so no need to create the namespace in advance.
  2. Only this User can Write the namespace. However, all other Users/Groups can Read the namespace.

For example, User UA can fully control the namespace:

UA~(AnyName)

Group Usage Pattern:

The private namespace for the Group named {GroupName} is:

The shared namespace for the Users belonging to {GroupName} is:

{GroupName}~(AnyName)
  1. It is pre-created, so no need to create the namespace in advance.
  2. Initially, no User can Write the namespace. Administrator needs to add the UserNames belonging to the {GroupName} to the namespace {GroupName}, see PUT AclConfiguration. Only then can these Users Write the namespace. However, all other Users/Groups can Read the namespace.

For example, Administrator adds Users UA and UB, who belong to Group GA, to the namespace GA, and then UA and UB can work together to fully control the namespace:

GA~(AnyName)

Launcher does not dictate how Users/Groups further partition their private namespace or how they avoid naming conflicts within it. Different Users/Groups might choose to use their private namespace differently, depending on their exact requirements, scenarios and assumptions.

So, here, we just provide the best practices for two scenarios:

Batch Framework:

A Batch Framework User tends to Add a new Framework each time instead of Updating the existing one.

So, the User should ensure that the name is not reused within the private namespace each time PUT Framework is called.

Service Framework:

A Service Framework User tends to frequently Update the existing Framework instead of Adding a new one, and tends to specify a well-known name which may be exposed to the Users of the Service itself.

So, the User should ensure that the name is reused within a small set of well-known Services each time PUT Framework is called.

If the User really wants to Add a new one, the User needs to check the small set of well-known Services and then pick a new name.

  1. The Initial Working Directory of your EntryPoint is the root directory of the EntryPoint. Your Service can read data anywhere, however it can ONLY write data under the Initial Working Directory with the Service Directory excluded. And if the Source is a ZIP file, it will be uncompressed before starting your Service. For example:

     EntryPoint=HbaseRS.zip/start.bat
     SourceLocations=hdfs:///HbaseRS.zip, hdfs:///HbaseCom <- HbaseRS.zip is a ZIP file
    

    The two Sources HbaseRS.zip and HbaseCom will be downloaded (and uncompressed) to local node as below structure:

     ./   <- The Initial Working Directory
     ├─HbaseRS.zip <- Service Directory <- HbaseRS.zip is a directory uncompressed from original ZIP file
     └─HbaseCom <- Service Directory
    
  2. Launcher may restart the exited Service on another node even if the original node is free and healthy. So, if you want to always restart the exited Service on the same node, you need to wrap the original EntryPoint in another script, such as:

     while true; do
         # Call the original EntryPoint here; ./original_entrypoint.sh is a placeholder name
         ./original_entrypoint.sh
     done
    
  3. Increase the replication number of your data and binaries on the target HDFS (a higher ReplicationNumber means faster downloading, higher availability and higher durability).

     hadoop fs -setrep -w <ReplicationNumber> <HDFS Path>
    
  4. Do not modify your data and binary on target HDFS. To use new data and binary, upload them to a different HDFS Path and then change the FrameworkVersion and SourceLocations.