The most complex endpoint does the following:
- Performs a query for a series of records within a date range - currently, ~35 records are returned.
- Loops through them to see if they meet a few simple conditions that couldn’t be expressed in the query.
- Uses a UnitOfWork to bulkCreate a slimmed-down version of each record into another table and to add a relation from each new record back to its original, in 100-record chunks (although right now there are only a few records).
- Writes a master record that summarizes the new records to a different table. No relation is used in this step, just a simple write.
- Reads a counter field from yet another table, then increments it.
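To be concrete about step 3, the chunked write looks something like this. (`bulkCreateSlim` and `relateToOriginals` are stand-ins for my actual UnitOfWork calls, not real API; the slimming logic here is invented for illustration.)

```javascript
// Split an array into fixed-size chunks (the 100-record batches above).
function chunk(records, size) {
  const out = [];
  for (let i = 0; i < records.length; i += size) {
    out.push(records.slice(i, i + size));
  }
  return out;
}

// Hypothetical stand-ins for the real UnitOfWork operations.
async function bulkCreateSlim(batch) {
  // In the real code: uow.bulkCreate(...) with only the fields I keep.
  return batch.map(r => ({ id: r.id, total: r.total }));
}
async function relateToOriginals(slimBatch, batch) {
  // In the real code: uow.addToRelation(...) from new records to originals.
}

// Walk the filtered records in 100-record chunks.
async function writeSlimRecords(records) {
  for (const batch of chunk(records, 100)) {
    const slim = await bulkCreateSlim(batch);
    await relateToOriginals(slim, batch);
  }
}
```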
This operation is performed ONCE per day; it's not repetitive. Yesterday's invocation failed with a timeout error, although it apparently made it past step 4. The end-user tried to re-initiate the operation, but that is impossible because the partially written records collide with the unique indexes on re-insert.
The other endpoint that failed yesterday, as reported by users, does this:
- Read one record.
- Read another record from a different table.
- If conditions apply, write an update to the first record.
- Increment a field in another table.
- Write a log entry to another table.
- One time in a hundred, maybe send a notification via the Twilio service (none were sent yesterday.)
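For scale, the whole endpoint is barely any code. A rough sketch, with in-memory objects standing in for the four tables (names and field shapes are invented for illustration):

```javascript
// In-memory stand-ins for the tables involved (hypothetical shapes).
const db = {
  first:    { id: 1, status: 'pending' },      // the record we may update
  second:   { id: 2, threshold: 10, value: 42 }, // the record we check
  counters: { hits: 0 },                        // the field we increment
  log:      [],                                 // the log table
};

async function endpoint() {
  const a = db.first;                            // 1. read one record
  const b = db.second;                           // 2. read a second record
  if (b.value > b.threshold) {                   // 3. conditional update
    a.status = 'done';
  }
  db.counters.hits += 1;                         // 4. increment a field
  db.log.push({ at: Date.now(), id: a.id });     // 5. write a log entry
  // 6. ~1 time in 100: send a Twilio notification (omitted here)
}
```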
I understand that I’m sharing bandwidth and can sometimes expect slower performance, and frankly, I don’t even care how long these services take; a few seconds is acceptable.
The real problem is that an out-of-band execution timeout error is impossible to handle. Operations that you absolutely depend on being atomic can be interrupted at any time, leaving you with no idea of the resulting state. There is no way to restart the operations in the examples above, for instance, because you have no idea which step was interrupted. You’d have to save state along the way - but anything you add to mitigate the problem adds more execution time and makes the problem worse!
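To make that trade-off concrete: the only mitigation I can see is persisting a checkpoint marker after each step, which is itself an extra write per step. A minimal sketch of what I mean (the Map stands in for a real table; the names are mine, not yours):

```javascript
// Hypothetical checkpoint pattern: record which step completed so a
// re-run can resume at the first unfinished step instead of failing
// on unique indexes.
const checkpointStore = new Map(); // stands in for a real checkpoint table

async function runWithCheckpoints(jobId, steps) {
  const done = checkpointStore.get(jobId) ?? 0;  // resume point
  for (let i = done; i < steps.length; i++) {
    await steps[i]();                   // each step must be atomic on its own
    checkpointStore.set(jobId, i + 1);  // the extra write that costs runtime
  }
  checkpointStore.delete(jobId);        // finished: clear the marker
}
```

If a timeout kills the run mid-step, re-invoking with the same `jobId` skips the steps that already committed - but every step now pays for one more write.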
On more than one occasion I have had to manually inspect the database and patch up corrupt tables by hand.
I’m looking for suggestions on how to program around the possibility of a random execution timeout. I would love to make all of my endpoints re-entrant and break them up into small, atomic, bomb-proof blocks. The one thing that is missing is a programmable asynchronous callback. A Cache timeout event that passes the cached value would be precisely PERFECT. I’ve seen other requests in the support database to add a Cache timeout event to the current implementation of the Cache, but there doesn’t seem to be much excitement on your end. It would sure add a lot of value. The Cache currently does everything necessary, except call an endpoint when a cached item times out. If it did, I could break up my second example above into FOUR re-entrant steps, with the state baked into the Cache entry. If it is not possible to change the Cache behavior, I would really appreciate advice on how to write my own similar async, serializable callback system WITHOUT a continuous timer loop.
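In case it helps to see what I’m picturing: my best DIY idea so far, absent a real Cache timeout event, is to store each pending callback with a due time and sweep the overdue ones at the top of every ordinary endpoint invocation - piggybacking on normal traffic instead of running a timer loop. A sketch, with an in-memory array standing in for a table (all names here are hypothetical):

```javascript
// DIY deferred-callback table: no continuous timer, just a lazy sweep.
const pending = []; // stands in for a table of { due, state, handler }

// Register a callback with serialized state and a delay.
function schedule(delayMs, state, handler) {
  pending.push({ due: Date.now() + delayMs, state, handler });
}

// Called at the start of every endpoint: run whatever is past due.
async function sweepDue(now = Date.now()) {
  const ready = pending.filter(j => j.due <= now);
  for (const job of ready) {
    pending.splice(pending.indexOf(job), 1); // claim before running
    await job.handler(job.state);            // state plays the "cached value" role
  }
}
```

The obvious drawback is that callbacks only fire when traffic arrives, which is exactly why a real Cache timeout event would be so much better.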
Thank you, Oleg, and sorry for the length - it’s a character flaw.