Service invocations failing with timeouts

Hello, I have users reporting timeout errors today, and my log file is showing about 10 service timeouts today. I already have the Enhanced Script/Code Execution Time (20 sec) add-on from the marketplace, what else can I do, please?

Kelly O
app id 8149FCCE-07DA-F988-FFBB-6154872C9500

Hi.
Because obvious problems are not visible, I’ve created the inner task for deeper investigation.
[BKNDLSS-29967]

Could you say please, what is the most complex DB request you make? I mean the requests you perform in your custom service.

Hi Oleg,

The most complex endpoint does the following:

  1. Performs a query for a series of records within a date range - currently, ~35 records are returned.
  2. Loops through them to see if they meet a few simple conditions that couldn’t be expressed in the query.
  3. Uses a UnitOfWork to bulkCreate a slimmed-down version of the records into another table, and add a relation from the new records to the original records, in 100-record chunks (although right now there are only a few records.)
  4. Writes a master record to a different table, that summarizes the new records. No relation used in this step, just a simple write.
  5. Reads a counter field from yet another table, then increments it.

This operation is performed ONCE per day, it’s not repetitive. Yesterday’s invocation failed with a timeout error yesterday, although it apparently made it past step 4. The end-user tried to re-initiate the operation, but that is impossible because the partially written records written use unique indexes.

The other endpoint that failed yesterday, as reported by users, does this:

  1. Read one record.
  2. Read another record from a different table.
  3. If conditions apply, write an update to the first record.
  4. Increment a field in another table.
  5. Write a log entry to another table.
  6. One time in a hundred, maybe send a notification via the Twilio service (none were sent yesterday.)

I understand that I’m sharing bandwidth and can sometimes expect slower performance, and frankly, I don’t even care how long these services take, a few seconds is acceptable.

The real problems is that an out-of-band execution timeout error is impossible to handle. Operations that you absolutely depend on being atomic can be interrupted at any time, leaving you with no idea of state. There is no way to restart the operations in the examples above, for example, because you have no idea what step was interrupted. You’d have to save state along the way - but anything you add to mitigate the problem adds more execution time and makes the problem worse!

On more than one occasion I have had to manually inspect the database and patch up corrupt tables by hand.

I’m looking for suggestions on how to program around the possibility of a random execution timeout. I would love to make all of my endpoints re-entrant and break them up into small, atomic, bomb-proof blocks. The one thing that is missing is a programmable asynchronous callback. A Cache timeout event, that passes the cached value, would be precisely PERFECT. I’ve seen other requests in the support database to add a Cache timeout event to the current implementation of the Cache, but there doesn’t seem to be much excitement on your end. It would sure add a lot of value. The Cache currently does everything necessary, except call an endpoint when a cached item times out. If it did, I could break up my second example above into FOUR re-entrant steps, and the state would be baked into the Cache entry. If it is not possible to change the Cache behavior, I would really appreciate advise on how to write my own similar async, serializable, callback system WITHOUT a continuous timer loop.

Thank you, Oleg, and sorry for length, it’s a character flaw.

Kelly

Hi @Kelly_Oglesby

Thank you for these details, but we need to ask a few more questions:

  1. what lang do you use in the CloudCode? Codeless/JS/Java?
  2. can you please add measurements for each operation to see where exactly it takes too long?

Regards, Vlad

I’m having the same error in lots of different places, randomly. Every aspect of the app that touches the cloud code fails with timeouts, then works, then fails. Right now, I can’t make it fail.

I instrumented one function as requested, and found that the simple query it contains finishes repeatedly in between 3.5 and 6 seconds. But between the hours of 3pm MST and 4pm MST it was failing consistently with the error:

Custom business logic execution has been terminated because it did not complete in permitted time - 35 seconds

This is a simple query that uses QueryBuilder to do a find operation. I don’t think the problem is my code. My services are all nodejs, by the way.

Thank you,
Kelly

PS - Another weirdness - even with all of these failures, my daily log file is empty.

Hello @Kelly_Oglesby,

Thanks for clarifying. Our backend dev will investigate this issue as soon as possible.

Regards,
Olha

Thank you, Olha! The question remains, however, what sort of error handling can one add to a service to prevent corruption when the service can be forcibly removed from execution, since this situation is obviously going to occur from time to time?

I can advise you to buy the additional execution time or change the logic from one big task to several smaller.
Unfortunately I cannot give any advise about handling timeouts as everything is individual for each case.

I already bought the additional execution time, and the task that is timing out immediately above does one database read, it can’t get any smaller. I think I have probably entirely misunderstood what the use cases are for services, and will re-think.

One final question, do you think there is any chance a Cache timeout event could be added to backendless’s Cache API? A call to a service when Cache data expires passing the cache data would make your recommendation to “change the logic from one big task to several smaller” very simple. Something like:

Backendless.Cache.put(key, obj, timeToLive, [serviceName, functionName])

Thank you!

Hi @Kelly_Oglesby ,

the task that is timing out immediately above does one database read, it can’t get any smaller.

Could you give me example of your DB query?

do you think there is any chance a Cache timeout event could be added to backendless’s Cache API?

Unfortunately it is not possible to implement such event due to the internal implementation of the cache. We use Redis “under hood” with keys TTL and since Redis does not send any events on key expiration we do not know when cache entry expires.

Regards, Andriy

Hi Andriy,

Here is the service function that I added timers to. It usually finishes in 3 to 6 seconds, and currently returns 10 records out of the 31 in the table.

async getComplianceCounts() {
    let counts = [0, 0, 0]

    let records = await this.Helpers.getAllRecords(
        "complianceItems", ["canSkip", "target"], null, 1, ["order asc"],
        "enabled = true and target = '0' and dataType != 'help' and dataType != 'schema'"
    )
    records.forEach((ci) => {
        if (!ci.canSkip) counts[parseInt(ci.target)]++
    })
    return counts
}

Here is the function that does the real work.

class Helpers {
    async getAllRecords (table, cols, rels, depth, sort, where, groupby) {
        let lastSize = 0
        let coll = []
        let query = Backendless.DataQueryBuilder.create()
        let count = 0

        query.setPageSize(100)

        if (depth) query.setRelationsDepth(depth)
        if (sort) query.setSortBy(sort)
        if (cols && cols.length) query.addProperties(cols.join(","))
        if (rels && rels.length) query.setRelated(rels)
        if (where) query.setWhereClause(where)
        if (groupby) query.setGroupBy(groupby)
        
        do {
            const items = await Backendless.Data.of(table).find(query)
            lastSize = items.length
            coll = coll.concat(items)
            if (lastSize == 100) {
                query.prepareNextPage()
            }
        } while (lastSize >= 100)
        return coll
    }
}
module.exports = Backendless.ServerCode.addType(Helpers);

Hello @Kelly_Oglesby,

the code looks good (btw you can put and canSkip != true to your where clause for not retrieving unwanted records).

I just tried to invoke this method a few times, and response was very fast.
A few days ago we had a small issue with resources on our server that might have caused an issue like this.
Could you please let us know if you still experience this problem or not?

Regards,
Stanislaw