contrib.training.FailureTolerator
class tf.contrib.training.FailureTolerator
Defined in tensorflow/contrib/training/python/training/failure_tolerator.py.
Helper for tolerating certain exceptions.
When a handled exception is raised inside tolerator.forgive(), it is suppressed (but logged). A subsequent call to tolerator.forgive() will sleep for a period of time before continuing, with exponential backoff on multiple exceptions. (The delay avoids retrying too quickly: a subsequent attempt will often succeed only after a transient failure has resolved itself.)
If more than limit exceptions have been encountered, the error will not be suppressed.
Exceptions occurring more than forgive_after_seconds ago (excluding time spent waiting between retries) are forgiven and no longer count towards the limit.
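The forgiveness window can be illustrated with a small pure-Python sketch. The helper name `unforgiven` and the plain list of timestamps are assumptions made for illustration only; this is not how the TensorFlow class stores its history. The sketch also assumes the timestamps come from a clock that does not advance while waiting between retries, which is how it stands in for the "excluding time spent waiting" rule.

```python
def unforgiven(failure_times, now, forgive_after_seconds=6000):
    """Return the failure timestamps that still count towards the limit.

    Hypothetical helper: a failure is forgiven once it is more than
    `forgive_after_seconds` old relative to `now`.
    """
    return [t for t in failure_times if now - t <= forgive_after_seconds]


# Three failures recorded at these (assumed) clock readings, in seconds.
failures = [0.0, 2000.0, 6500.0]

# At t=7000 the first failure is 7000 s old, so it is forgiven.
still_counted = unforgiven(failures, now=7000.0)
print(still_counted)  # [2000.0, 6500.0]
```

With the default limit of 5, only the two remaining failures would count towards that limit at this point.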
An example loop using FailureTolerator to retry until a successful session.run(...) would look like:
    failure_tolerator = FailureTolerator()

    while True:
        with failure_tolerator.forgive():
            session = make_session_somehow()
            while not should_stop():
                session.run(...)
            break  # session.run was successful
By using FailureTolerator, failures are logged, there are delays between retries, and there's a ceiling on the maximum number of retries available. (In the case of persistent errors, the task won't just loop forever!)
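To make the suppress, delay, and limit behavior concrete without requiring TensorFlow, here is a minimal stand-in written in plain Python. `SketchTolerator`, its injected `sleep` argument, the `RuntimeError` default, and `flaky_step` are all hypothetical names chosen for this sketch; it mimics the behavior described above but is not the library's implementation, and it omits the forgive_after_seconds window.

```python
import contextlib
import logging


class SketchTolerator:
    """Hypothetical stand-in mimicking the documented behavior (not the TF class)."""

    def __init__(self, limit=5, init_delay=5.0, backoff_factor=2.0,
                 handled_exceptions=(RuntimeError,),
                 sleep=lambda seconds: None):
        self.limit = limit
        self.init_delay = init_delay
        self.backoff_factor = backoff_factor
        self.handled_exceptions = handled_exceptions  # assumption: RuntimeError
        self.sleep = sleep  # injected so the example does not really wait
        self.failures = 0

    @contextlib.contextmanager
    def forgive(self):
        # On re-entry after a failure, pause with exponential backoff.
        if self.failures:
            self.sleep(self.init_delay
                       * self.backoff_factor ** (self.failures - 1))
        try:
            yield
        except self.handled_exceptions as exc:
            self.failures += 1
            if self.failures > self.limit:
                raise  # more than `limit` failures: stop suppressing
            logging.warning("Suppressed failure %d: %r", self.failures, exc)


pauses = []    # records requested pauses instead of sleeping
attempts = []  # one entry per call to flaky_step()


def flaky_step():
    """Hypothetical workload: fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "done"


tolerator = SketchTolerator(sleep=pauses.append)
while True:
    with tolerator.forgive():
        result = flaky_step()
        break  # reached only if flaky_step() succeeded

print(result, pauses)  # done [5.0, 10.0]
```

Injecting `sleep` keeps the sketch fast to run; a real retry loop would pass `time.sleep` or rely on the library's own delay.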
Methods
__init__
__init__(limit=5, init_delay=5.0, backoff_factor=2.0, forgive_after_seconds=6000, handled_exceptions=None)
Creates a FailureTolerator.
The result will pause for init_delay * (backoff_factor^(failure_count-1)) when re-entering forgive() after a failure.
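As a worked example of the pause formula, the following sketch evaluates it for the default arguments. The function name `retry_delay` is hypothetical, and returning zero when there have been no failures is an assumption of this sketch rather than something the documentation states.

```python
def retry_delay(failure_count, init_delay=5.0, backoff_factor=2.0):
    """Pause (seconds) before re-entering forgive() after `failure_count` failures."""
    if failure_count < 1:
        return 0.0  # assumption: nothing to wait for before any failure
    return init_delay * backoff_factor ** (failure_count - 1)


# Pauses after the 1st through 5th consecutive unforgiven failure.
delays = [retry_delay(n) for n in range(1, 6)]
print(delays)  # [5.0, 10.0, 20.0, 40.0, 80.0]
```

With the defaults, the pauses before the limit of 5 is exhausted add up to 155 seconds.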
Args:
- limit: The maximum number of suppressed, unforgiven, failures.
- init_delay: How long to pause once the first failure is encountered. Defaults to five seconds.
- backoff_factor: Each subsequent failure grows the pause by this factor.
- forgive_after_seconds: Failures older than this are forgiven.
- handled_exceptions: The exceptions to forgive. Defaults to (errors.AbortedError,).
forgive
forgive(*args, **kwds)
© 2017 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/api_docs/python/tf/contrib/training/FailureTolerator