Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up logs for batch segment allocation #14727

Closed
wants to merge 5 commits into from

Conversation

kfaraz
Copy link
Contributor

@kfaraz kfaraz commented Aug 1, 2023

Motivation

Batch segment allocation is often difficult to debug due to the actual error messages being hidden in the logs.

Changes

  • Update signature of method IndexerMetadataStorageCoordinator.allocatePendingSegments to return List<SegmentAllocateResult> instead of just the allocated SegmentIdWithShardSpec
  • Propagate the encountered error all the way back to the SegmentAllocationQueue where it is logged when abandoning a request.
  • Simplify the logs and error messages
  • Reduce log level and remove logs where information is not useful for debugging
  • Rename some methods for clarity
  • No functional change

Old logs in case of a conflicting segment

2023-08-01T14:09:15,209 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource [wiki], interval [2022-01-01T00:00:00.000Z/2022-01-01T00:30:00.000Z]
2023-08-01T14:09:15,290 WARN [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Cannot allocate new segment for dataSource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:30:00.000Z], existingVersion[2023-08-01T14:09:15.046Z]: conflicting segment[wiki_2022-01-01T00:00:00.000Z_2022-01-01T01:00:00.000Z_2023-08-01T14:09:15.046Z].
2023-08-01T14:09:15,290 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:09:15,291 INFO [main] org.apache.druid.indexing.common.actions.SegmentAllocationQueue - Allocation failed on attempt [1] due to error [Storage coordinator could not allocate segment.]. Can still retry action [SegmentAllocateAction{dataSource='wiki', timestamp=2022-01-01T00:00:00.000Z, queryGranularity={type=period, period=PT1S, timeZone=UTC, origin=null}, preferredSegmentGranularity={type=period, period=PT30M, timeZone=UTC, origin=null}, sequenceName='noop_2023-08-01T14:09:15.017Z_7d445db1-640c-4943-af64-3ed055493eae', previousSegmentId='null', skipSegmentLineageCheck=false, partialShardSpec=org.apache.druid.timeline.partition.NumberedPartialShardSpec@557b6a37, lockGranularity=TIME_CHUNK}].
2023-08-01T14:09:15,291 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource [wiki], interval [2022-01-01T00:00:00.000Z/2022-01-01T00:15:00.000Z]
2023-08-01T14:09:15,295 WARN [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Cannot allocate new segment for dataSource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:15:00.000Z], existingVersion[2023-08-01T14:09:15.046Z]: conflicting segment[wiki_2022-01-01T00:00:00.000Z_2022-01-01T01:00:00.000Z_2023-08-01T14:09:15.046Z].
2023-08-01T14:09:15,295 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:09:15,296 INFO [main] org.apache.druid.indexing.common.actions.SegmentAllocationQueue - Allocation failed on attempt [2] due to error [Storage coordinator could not allocate segment.]. Can still retry action [SegmentAllocateAction{dataSource='wiki', timestamp=2022-01-01T00:00:00.000Z, queryGranularity={type=period, period=PT1S, timeZone=UTC, origin=null}, preferredSegmentGranularity={type=period, period=PT30M, timeZone=UTC, origin=null}, sequenceName='noop_2023-08-01T14:09:15.017Z_7d445db1-640c-4943-af64-3ed055493eae', previousSegmentId='null', skipSegmentLineageCheck=false, partialShardSpec=org.apache.druid.timeline.partition.NumberedPartialShardSpec@557b6a37, lockGranularity=TIME_CHUNK}].
2023-08-01T14:09:15,296 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource [wiki], interval [2022-01-01T00:00:00.000Z/2022-01-01T00:10:00.000Z]
2023-08-01T14:09:15,299 WARN [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Cannot allocate new segment for dataSource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:10:00.000Z], existingVersion[2023-08-01T14:09:15.046Z]: conflicting segment[wiki_2022-01-01T00:00:00.000Z_2022-01-01T01:00:00.000Z_2023-08-01T14:09:15.046Z].
2023-08-01T14:09:15,299 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:09:15,300 INFO [main] org.apache.druid.indexing.common.actions.SegmentAllocationQueue - Allocation failed on attempt [3] due to error [Storage coordinator could not allocate segment.]. Can still retry action [SegmentAllocateAction{dataSource='wiki', timestamp=2022-01-01T00:00:00.000Z, queryGranularity={type=period, period=PT1S, timeZone=UTC, origin=null}, preferredSegmentGranularity={type=period, period=PT30M, timeZone=UTC, origin=null}, sequenceName='noop_2023-08-01T14:09:15.017Z_7d445db1-640c-4943-af64-3ed055493eae', previousSegmentId='null', skipSegmentLineageCheck=false, partialShardSpec=org.apache.druid.timeline.partition.NumberedPartialShardSpec@557b6a37, lockGranularity=TIME_CHUNK}].
2023-08-01T14:09:15,300 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource [wiki], interval [2022-01-01T00:00:00.000Z/2022-01-01T00:05:00.000Z]
2023-08-01T14:09:15,304 WARN [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Cannot allocate new segment for dataSource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:05:00.000Z], existingVersion[2023-08-01T14:09:15.046Z]: conflicting segment[wiki_2022-01-01T00:00:00.000Z_2022-01-01T01:00:00.000Z_2023-08-01T14:09:15.046Z].
2023-08-01T14:09:15,305 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:09:15,305 ERROR [main] org.apache.druid.indexing.common.actions.SegmentAllocationQueue - Exhausted max attempts [4] for allocation with latest error [Storage coordinator could not allocate segment.]. Completing action [SegmentAllocateAction{dataSource='wiki', timestamp=2022-01-01T00:00:00.000Z, queryGranularity={type=period, period=PT1S, timeZone=UTC, origin=null}, preferredSegmentGranularity={type=period, period=PT30M, timeZone=UTC, origin=null}, sequenceName='noop_2023-08-01T14:09:15.017Z_7d445db1-640c-4943-af64-3ed055493eae', previousSegmentId='null', skipSegmentLineageCheck=false, partialShardSpec=org.apache.druid.timeline.partition.NumberedPartialShardSpec@557b6a37, lockGranularity=TIME_CHUNK}] with a null value.
2023-08-01T14:09:15,305 INFO [main] org.apache.druid.indexing.common.actions.SegmentAllocationQueue - Successfully processed [0 / 1] requests in batch [{datasource='wiki', groupId='group_1', batchId=0, lock=TIME_CHUNK, allocInterval=2022-01-01T00:00:00.000Z/2022-01-01T00:30:00.000Z, skipLineageCheck=false}].

New logs in case of a conflicting segment

2023-08-01T14:06:23,709 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:30:00.000Z].
2023-08-01T14:06:23,774 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:06:23,775 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:15:00.000Z].
2023-08-01T14:06:23,781 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:06:23,781 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:10:00.000Z].
2023-08-01T14:06:23,787 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:06:23,787 INFO [main] org.apache.druid.indexing.overlord.TaskLockbox - Allocating [1] segments for datasource[wiki], interval[2022-01-01T00:00:00.000Z/2022-01-01T00:05:00.000Z].
2023-08-01T14:06:23,793 INFO [main] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Created [0] new segments for [1] allocate requests.
2023-08-01T14:06:23,793 ERROR [main] org.apache.druid.indexing.common.actions.SegmentAllocationQueue - Failed allocation attempt[4/4] for task[noop_2023-08-01T14:06:23.541Z_647e2788-8667-43cc-b762-6a8aa3b74697], batch[{datasource='wiki', groupId='group_1', batchId=0, lock=TIME_CHUNK, allocInterval=2022-01-01T00:00:00.000Z/2022-01-01T00:30:00.000Z, skipLineageCheck=false}]: Segment[wiki_2022-01-01T00:00:00.000Z_2022-01-01T01:00:00.000Z_2023-08-01T14:06:23.577Z] in conflicting interval[2022-01-01T00:00:00.000Z/2022-01-01T01:00:00.000Z] has higher version[2023-08-01T14:06:23.577Z].
2023-08-01T14:06:23,794 INFO [main] org.apache.druid.indexing.common.actions.SegmentAllocationQueue - Successfully processed [0/1] requests in batch[{datasource='wiki', groupId='group_1', batchId=0, lock=TIME_CHUNK, allocInterval=2022-01-01T00:00:00.000Z/2022-01-01T00:30:00.000Z, skipLineageCheck=false}].

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

existingVersion,
overallMaxId
return SegmentAllocateResult.failure(
"Segment[%s] in conflicting interval[%s] has higher version[%s].",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message is inaccurate. It's also possible that the interval corresponding to overallMaxId isn't the same as the request's interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, I will break it up.

* <p>
* This class does not implement equals() or hashCode() so that each request is
* treated as unique even if all the parameters are the same.
* TODO: why??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this pending or a leftover comment?

);
return null;
} else if (committedMaxId != null
&& committedMaxId.getShardSpec().getNumCorePartitions()
== SingleDimensionShardSpec.UNKNOWN_NUM_CORE_PARTITIONS) {
log.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this log needed if the SegmentAllocationResult's failure logs the message as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

Copy link

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the [email protected] list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 16, 2024
@kfaraz kfaraz removed the stale label Feb 23, 2024
Copy link

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the [email protected] list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 24, 2024
@kfaraz kfaraz removed the stale label Apr 26, 2024
@kfaraz kfaraz marked this pull request as draft April 26, 2024 15:02
Copy link

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the [email protected] list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 26, 2024
@kfaraz kfaraz closed this Jun 27, 2024
@kfaraz kfaraz reopened this Jun 27, 2024
@kfaraz kfaraz removed the stale label Jun 27, 2024
@kfaraz kfaraz closed this Jun 27, 2024
@kfaraz
Copy link
Contributor Author

kfaraz commented Jun 27, 2024

Closed this PR as there are several merge conflicts now.

@kfaraz kfaraz deleted the fix_seg_alloc_message branch June 27, 2024 05:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants