Saturday, February 5, 2011

Fault tolerant GGS

Now we have some fault tolerance in GGS. Currently the server 'crashes' when a client exits, and this causes a new socket to be opened, which is good, because the old one was closed by the client.

The {ReferenceID, JSVM} tuple stored in the ggs_server state was previously lost when the ggs_server crashes, but this is now stored in ggs_backup as well, and this is where the fault tolerance takes place.

ggs_server may now crash at pretty much any time, and resume its state from ggs_backup. This works according to the following:
  1. ggs_sup starts
  2. mnesia_ctrl starts
  3. ggs_server_sup starts
  4. ggs_backup starts
  5. ggs_server starts
In 5. when ggs_server starts, it always asks ggs_backup for any previous state, even if this is the initial start of the whole supervisor hierarchy. When the application is first run, there is no state, and therefore ggs_backup returns an atom signaling that it has no state to back up from, and that ggs_server should initialize a new state and back it up.

If ggs_backup is to crash, the ggs_server_sup will restart it, just as it restarts ggs_server, and the state is reloaded from ggs_server, unless both ggs_server and ggs_backup crash at the exact same time.

For some reason though, the JSVMs stored in the ggs_server state do not live through the migration to ggs_backup and back, most likely due to them being spawned from inside ggs_server and not being supervised properly. This problem should be easily solved by supervising erlang_js on a higher level.

UPDATE: Moving erlang_js to a higher hierachy is not the solution to the JSVM death issue. The JSVMs disappearing when ggs_server crashes is a design choice coming down to the new/3 function in js_driver.erl in erlang_js.

new/3 creates a connection to the erlang_js driver using an open_port command, and this command registers the calling process (ggs_server) as its port owner. Looking at the documentation for ports we can see that when the port owner dies, the port is supposed to go down as well. I see two possible solutions to this:
  1. Change the owner of the port to ggs_backup when ggs_server crashes, and then back again
  2. Set the owner of the port to a process which should not ever go down (thanks)
Now, according to the SO link above, option 2 is hacky, so option 1 is to be preferred. This means we start yet another gen_server process somewhere in the hierachy, which acts somewhat like a wrapper for erlang_js. This is probably the future of js_runner.

Update 2: A different approach is to incorporate the "wrapper" inside erlang_js so we don't need to see it, we just do our js:call/? calls and those are handed of to a "persistent" process inside erlang_js which registers its JSVMs to that process.


Time report
Today: 5 hours
Updating day: 1 hour

No comments:

Post a Comment