Using Logistic Regression to predict/prevent server failure

Discussion in 'Bukkit Discussion' started by Timatooth, May 7, 2013.

Thread Status:
Not open for further replies.
  1. Offline

    Timatooth

    Was I the only one who always woke up in the morning to find a non-responsive server after it hit its peak while I was asleep and then 'froze' for who knows what reason? Was it a bug? An infinite loop? A thread deadlock? The Java virtual machine running out of memory? A malicious network attack? Too many players?

    As we know, there are many possible 'predictor' variables behind this. So I've had the idea to collect some server performance data and build a statistical model that uses logistic regression to predict the probability, under given conditions, that a server is about to crash.

    Possible influential variables off the top of my head
    • Player count
    • Java memory
    • Minecraft Java socket tx/rx rates, and the current connection count
    • Current scheduled tasks
    • TPS, or more precisely the variance of the TPS: more jitter might indicate an issue.
    • Thread count, and the state of the threads, i.e. waiting, blocked, runnable, new, etc.
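
    A minimal sketch of how the TPS jitter variable from the list above could be measured, assuming TPS is sampled at a fixed interval. The class name `TpsJitter` and the window size are hypothetical choices for illustration, not part of any existing plugin.

```python
from collections import deque
from statistics import pvariance


class TpsJitter:
    """Rolling variance of the most recent TPS samples (hypothetical helper)."""

    def __init__(self, window=60):
        # Only the newest `window` samples are kept; older ones fall off.
        self.samples = deque(maxlen=window)

    def add(self, tps):
        self.samples.append(float(tps))

    def variance(self):
        # Population variance of the current window; 0.0 until two samples exist.
        if len(self.samples) < 2:
            return 0.0
        return pvariance(self.samples)


# Made-up samples: a steady server versus one whose tick rate keeps dipping.
steady = TpsJitter(window=5)
for tps in [20, 20, 20, 20, 20]:
    steady.add(tps)

jittery = TpsJitter(window=5)
for tps in [20, 12, 19, 8, 20]:
    jittery.add(tps)

print(steady.variance(), jittery.variance())
```

    The variance would then be logged alongside the other variables as one more candidate predictor.
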
    There could be many more variables that matter, and the ones I have listed above may NOT be influential at all. By collecting data and analyzing it with a statistical program such as R, I might be able to figure out which variables affect server failure the most.
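
    To make the analysis step concrete, here is a rough sketch of fitting a logistic regression by plain gradient descent. The post mentions R for this; the sketch uses Python only to stay self-contained. The rows are invented for illustration (memory used and player load, both scaled to 0..1, with 1 meaning the server crashed soon after), and the learning rate and epoch count are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    # weights[0] is the intercept; the rest pair up with the features in x.
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return sigmoid(z)


def fit_logistic(rows, labels, lr=0.5, epochs=5000):
    """Fit an intercept plus one weight per feature by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y  # gradient of the log-loss w.r.t. z
            grad[0] += err
            for i, v in enumerate(x):
                grad[i + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Invented samples: (jvm_memory_fraction, player_load_fraction)
rows = [
    (0.35, 0.10), (0.40, 0.30), (0.55, 0.20), (0.60, 0.50), (0.65, 0.60),
    (0.70, 0.40), (0.85, 0.70), (0.90, 0.55), (0.95, 0.80), (0.98, 0.90),
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

weights = fit_logistic(rows, labels)
print("intercept and weights:", weights)
print("P(crash | heavy load):", predict(weights, (0.97, 0.90)))
print("P(crash | light load):", predict(weights, (0.30, 0.10)))
```

    With features on the same scale, the size of each fitted weight gives a first hint of which variable matters most; a statistics package would also report standard errors, which this sketch does not.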

    I already have a plugin that collects some of this data (but I haven't implemented a storage database yet). I plan to write a JavaScript library for my mineload web interface (already on GitHub) that uses the model to produce the odds, or equivalently a probability, that the server will fail given the conditions at one point in time.

    How it sort of works
    If the outcome 1 is defined to be [server failure]:
    Pr([server failure]) given {tps: 9, jvm_memory_used: 98%, players: 120, ..., ... } = 95% failure likelihood. The values get substituted into the model, which produces that probability.
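
    The substitution described above can be sketched as follows. The coefficients are made-up placeholders (real ones would come from fitting the model to collected data), and `jvm_memory_used` is assumed to be expressed as a fraction rather than a percentage.

```python
import math

# Hypothetical coefficients, chosen only to illustrate the calculation.
COEFFICIENTS = {
    "intercept": -6.0,
    "tps": -0.25,             # lower TPS pushes the failure probability up
    "jvm_memory_used": 8.0,   # fraction of the heap in use, 0..1
    "players": 0.03,
}


def failure_probability(sample, coef=COEFFICIENTS):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    z = coef["intercept"] + sum(coef[name] * value for name, value in sample.items())
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds corresponding to a probability p: p / (1 - p)."""
    return p / (1.0 - p)


stressed = {"tps": 9, "jvm_memory_used": 0.98, "players": 120}
healthy = {"tps": 20, "jvm_memory_used": 0.40, "players": 10}

p_stressed = failure_probability(stressed)
p_healthy = failure_probability(healthy)
print(p_stressed, odds(p_stressed))
print(p_healthy, odds(p_healthy))
```

    An alert would then be a simple threshold check on the returned probability, with the threshold itself something to tune against real crash data.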

    ...and alert the server admin (or a program) in advance that it could all be about to hit the fan.

    I don't know how well this would work, but it starts with getting good data to create the model. Would be interesting to see if it works or not.
     
  2. Offline

    Milkywayz

    For my server, 90% of the crashes are fully unpredictable, e.g. a map saving bug that corrupts a chunk. Once that crash has been exposed it's predictable, but it happened quite literally out of the blue and crashed the server.

    I'd say just have a standalone program running on the machine that 'pings' a plugin of yours every so often, and if there's no response it restarts the server.
     
  3. Offline

    Timatooth

    Pinging the server works OK but is quite opaque. A missed ping could just be a dropped network packet. I think it would still be cool to predict the odds of server failure from live runtime variables.

    Does Bukkit collect server data on crashes, profiling, performance, etc.? It could be interesting to run a few models through it.
     
  4. Offline

    evilmidget38

    Timatooth There are far more variables involved in running the server than the ones you've listed. Additionally, like Milkywayz said, a server crash is rather unpredictable. If you were able to accurately predict the circumstances that cause a crash, then you'd most likely just prevent those circumstances from happening.

    Regarding stat collection, I don't think that any data is gathered by Bukkit. The only web connection in CB (not including the snooper) that I'm aware of is CB's update-checker.
     
  5. Offline

    Milkywayz

    The program I mentioned would run server side, and if it does not receive a 'ping-back' it could resend another ping 5 seconds later to verify. There's a proper way to handle this so that it accurately determines when a server is hung.
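
    A rough sketch of the ping-and-verify idea above. It assumes the check is a plain TCP connect, which only shows that the port accepts connections; a server with a frozen main thread may still accept them, so a real watchdog would exchange a message with the plugin instead. The function names, the two-attempt default, and the 5 second delay are illustrative assumptions.

```python
import socket
import time


def tcp_ping(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def looks_hung(ping, attempts=2, delay=5.0, sleep=time.sleep):
    """Treat the server as hung only if every ping attempt fails.

    `ping` is any zero-argument callable returning True on a response.
    `sleep` is injectable so the retry logic can be tested without waiting.
    """
    for attempt in range(attempts):
        if ping():
            return False
        if attempt < attempts - 1:
            sleep(delay)  # wait before verifying with another ping
    return True


# Example wiring (not executed here); host and port are placeholders:
# if looks_hung(lambda: tcp_ping("127.0.0.1", 25565)):
#     restart_server()  # hypothetical restart hook
```

    Requiring several consecutive failures is what separates a single dropped packet from a server that has actually stopped responding.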

    Edit: Could also implement a few remote checks, all pinging to ensure the server is up and OK.
     