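When you develop a long-running service on YARN, you need to pay attention to fault tolerance. This post looks at one parameter that affects graceful restart of the AppMaster, and at how to set it so that the AppMaster can restart smoothly.

yarn-site.xml has the following parameter: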

/**
   * The maximum number of application attempts.
   * It's a global setting for all application masters.
   */
yarn.resourcemanager.am.max-attempts

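This is a global limit on the number of AppMaster attempts. When submitting an application to YARN, you can also set a maximum number of attempts for that individual application: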

/**
   * Set the number of max attempts of the application to be submitted. WARNING:
   * it should be no larger than the global number of max attempts in the Yarn
   * configuration.
   * @param maxAppAttempts the number of max attempts of the application
   * to be submitted.
   */
  @Public
  @Stable
  public abstract void setMaxAppAttempts(int maxAppAttempts);

When an attempt fails and keepContainersAcrossAppAttempts has been set on the submission context, the ResourceManager decides whether the containers of the previous attempt are retained:

boolean keepContainersAcrossAppAttempts = false;
switch (finalAttemptState) {
  case FINISHED:
  {
    appEvent = new RMAppFinishedAttemptEvent(applicationId,
        appAttempt.getDiagnostics());
  }
  break;
  case KILLED:
  {
    // don't leave the tracking URL pointing to a non-existent AM
    appAttempt.setTrackingUrlToRMAppPage();
    appAttempt.invalidateAMHostAndPort();
    appEvent =
        new RMAppFailedAttemptEvent(applicationId,
            RMAppEventType.ATTEMPT_KILLED,
            "Application killed by user.", false);
  }
  break;
  case FAILED:
  {
    // don't leave the tracking URL pointing to a non-existent AM
    appAttempt.setTrackingUrlToRMAppPage();
    appAttempt.invalidateAMHostAndPort();

    if (appAttempt.submissionContext
      .getKeepContainersAcrossApplicationAttempts()
        && !appAttempt.submissionContext.getUnmanagedAM()) {
      // See if we should retain containers for non-unmanaged applications
      if (!appAttempt.shouldCountTowardsMaxAttemptRetry()) {
        // Preemption, hardware failures, NM resync doesn't count towards
        // app-failures and so we should retain containers.
        keepContainersAcrossAppAttempts = true;
      } else if (!appAttempt.maybeLastAttempt) {
        // Not preemption, hardware failures or NM resync.
        // Not last-attempt too - keep containers.
        keepContainersAcrossAppAttempts = true;
      }
    }
    appEvent =
        new RMAppFailedAttemptEvent(applicationId,
          RMAppEventType.ATTEMPT_FAILED, appAttempt.getDiagnostics(),
          keepContainersAcrossAppAttempts);

  }
}
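
The FAILED branch above can be condensed into a single boolean expression. The following is a minimal sketch, not the actual YARN code: the class name `KeepContainersDecision` and the method `keepContainers` are hypothetical, and the four parameters stand in for the values that the excerpt reads from the submission context and from the attempt.

```java
// Simplified, hypothetical model of the FAILED branch; not a YARN class.
final class KeepContainersDecision {

    /**
     * @param keepRequested    stands in for getKeepContainersAcrossApplicationAttempts()
     * @param unmanagedAm      stands in for getUnmanagedAM()
     * @param countsAsFailure  stands in for shouldCountTowardsMaxAttemptRetry()
     * @param maybeLastAttempt stands in for appAttempt.maybeLastAttempt
     * @return whether the containers of the failed attempt would be retained
     */
    static boolean keepContainers(boolean keepRequested, boolean unmanagedAm,
                                  boolean countsAsFailure, boolean maybeLastAttempt) {
        // Retention only applies to managed AMs that asked for it.
        if (!keepRequested || unmanagedAm) {
            return false;
        }
        // Failures that do not count (preemption, hardware failure, NM resync)
        // always keep containers; counted failures keep them only when the
        // attempt was not flagged as possibly the last one.
        return !countsAsFailure || !maybeLastAttempt;
    }

    public static void main(String[] args) {
        // Counted failure, not the last attempt: containers are kept.
        System.out.println(keepContainers(true, false, true, false));
        // Counted failure on a possibly-last attempt: containers are dropped.
        System.out.println(keepContainers(true, false, true, true));
    }
}
```

The case that matters for a long-running service is the second one: once an attempt is flagged as possibly the last, a counted failure no longer retains its containers.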

Note the variable appAttempt.maybeLastAttempt. How does the RM determine whether the current attempt is the last one?

private void createNewAttempt() {
    ApplicationAttemptId appAttemptId =
        ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1);
    RMAppAttempt attempt =
        new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
          submissionContext, conf,
          // The newly created attempt maybe last attempt if (number of
          // previously failed attempts(which should not include Preempted,
          // hardware error and NM resync) + 1) equal to the max-attempt
          // limit.
          maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq);
    attempts.put(appAttemptId, attempt);
    currentAttempt = attempt;
  }

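Each time a new attempt is constructed, the expression maxAppAttempts == (getNumFailedAppAttempts() + 1) checks whether the number of attempts that have already failed, plus one, has reached the maxAppAttempts limit.

The maxAppAttempts value is in turn derived from the global and the per-application settings. In effect it is the smaller of the two: if the per-application value is not positive or exceeds the global value, the global value is used instead.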
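
To make the check concrete, here is a small standalone sketch of the same comparison. The class `LastAttemptCheck` and its method are hypothetical names used only for illustration; the numbers are made-up inputs, not values taken from a real cluster.

```java
// Hypothetical helper that mirrors the comparison passed to RMAppAttemptImpl.
final class LastAttemptCheck {

    /** True when the attempt about to be created would exhaust the limit. */
    static boolean maybeLastAttempt(int maxAppAttempts, int numFailedAppAttempts) {
        return maxAppAttempts == numFailedAppAttempts + 1;
    }

    public static void main(String[] args) {
        // With a limit of 2, the first attempt is not the last one ...
        System.out.println(maybeLastAttempt(2, 0));
        // ... but the attempt created after one counted failure is.
        System.out.println(maybeLastAttempt(2, 1));
        // With a large limit, an attempt after three failures is still far from the last.
        System.out.println(maybeLastAttempt(10000, 3));
    }
}
```

With a small limit, an attempt is flagged as possibly the last after very few failures, and from that point a counted failure stops retaining containers.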

int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS);
    int individualMaxAppAttempts = submissionContext.getMaxAppAttempts();
    if (individualMaxAppAttempts <= 0 ||
        individualMaxAppAttempts > globalMaxAppAttempts) {
      this.maxAppAttempts = globalMaxAppAttempts;
      LOG.warn("The specific max attempts: " + individualMaxAppAttempts
          + " for application: " + applicationId.getId()
          + " is invalid, because it is out of the range [1, "
          + globalMaxAppAttempts + "]. Use the global max attempts instead.");
    } else {
      this.maxAppAttempts = individualMaxAppAttempts;
    }
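
The selection rule above can be isolated into a pure function, which makes it easy to see what happens for different combinations. This is a sketch under the assumption that only the two integers matter; `MaxAttemptsResolver` and `resolve` are hypothetical names, and the logging from the excerpt is left out.

```java
// Hypothetical standalone version of the max-attempts selection rule.
final class MaxAttemptsResolver {

    /**
     * @param globalMax     value of yarn.resourcemanager.am.max-attempts
     * @param individualMax value set through setMaxAppAttempts on the submission context
     * @return the limit that would actually apply to the application
     */
    static int resolve(int globalMax, int individualMax) {
        // Out-of-range per-application values fall back to the global limit.
        if (individualMax <= 0 || individualMax > globalMax) {
            return globalMax;
        }
        return individualMax;
    }

    public static void main(String[] args) {
        // A large per-application value is capped by a small global limit.
        System.out.println(resolve(2, 10000));
        // A valid per-application value below the global limit is used as is.
        System.out.println(resolve(10000, 5000));
        // An unset (non-positive) per-application value falls back to the global limit.
        System.out.println(resolve(10000, 0));
    }
}
```

The first call shows why raising only the per-application value is not enough: the global limit always wins when it is smaller.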

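Summary:

If you want the AppMaster to be able to restart again and again and take over the containers of its previous attempt, set yarn.resourcemanager.am.max-attempts as high as is practical, for example 10000. When submitting the app, also set the maximum number of attempts on the submission context, together with the window after which the failure count is refreshed (the attempt failures validity interval). With these settings in place, the requirements of a long-running service on YARN are largely met.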
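
The refresh window is not shown in the code excerpts above, so the following is only a sketch of how such a window could work, under the assumption that a failure stops counting once it is older than the configured interval and that a non-positive interval disables the window. The class `FailureWindow`, the method `countFailures`, and the timestamps are all hypothetical. On the submission side, the interval is presumably set through ApplicationSubmissionContext.setAttemptFailuresValidityInterval.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a failure-count window; not the YARN implementation.
final class FailureWindow {

    /**
     * Counts the failures that still fall inside the validity window.
     *
     * @param failureFinishTimesMs finish times of counted failed attempts, in ms
     * @param nowMs                current time, in ms
     * @param validityIntervalMs   window length in ms; non-positive means no window
     */
    static int countFailures(List<Long> failureFinishTimesMs, long nowMs,
                             long validityIntervalMs) {
        int count = 0;
        for (long finishTime : failureFinishTimesMs) {
            // Without a window every failure counts; with a window only recent ones do.
            if (validityIntervalMs <= 0 || finishTime > nowMs - validityIntervalMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Long> failures = Arrays.asList(1000L, 5000L, 9000L);
        // No window: all failures are counted.
        System.out.println(countFailures(failures, 10000L, 0L));
        // A 3000 ms window: only the most recent failure is still counted.
        System.out.println(countFailures(failures, 10000L, 3000L));
    }
}
```

Under this assumption, a service that fails only occasionally never accumulates enough counted failures to reach the limit, so its attempts are not flagged as possibly the last.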