Job Details
Resiliency:
Systems, nodes, and data centers can fail at any time; the show must still go on.
We have a tool similar to Chaos Monkey (https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey), and it is under active development to cover newer failure modes.
Examples of our usual work:
We consume CPU and memory on servers
We block Couchbase/RabbitMQ ports to test how the system behaves when those services are unreachable
We crash processes and kill JVMs
We test and automate all of the above
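To give a flavor of the CPU/memory item above, here is a minimal Python sketch of a resource hog. It is an illustration only, not the tool described in this posting; the function names (burn_cpu, hog_memory, eat_cpu), the worker count, and the sizes are all assumptions.

```python
import time
from multiprocessing import Process


def burn_cpu(seconds):
    """Spin in a tight loop until the deadline; return the iteration count."""
    deadline = time.monotonic() + seconds
    spins = 0
    while time.monotonic() < deadline:
        spins += 1
    return spins


def hog_memory(megabytes):
    """Allocate a zero-filled buffer of the given size and return it.

    The memory stays allocated for as long as the caller keeps a reference.
    """
    return bytearray(megabytes * 1024 * 1024)


def eat_cpu(workers, seconds):
    """Run one busy-loop process per worker so several cores are loaded."""
    procs = [Process(target=burn_cpu, args=(seconds,)) for _ in range(workers)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()


if __name__ == "__main__":
    # Hypothetical values: hold 16 MB and load two cores for one second.
    block = hog_memory(16)
    eat_cpu(workers=2, seconds=1.0)
```

A real failure-injection run would also record when the load started and stopped, so the effect can be lined up with the monitoring dashboards afterwards.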
Performance Engineering:
We simulate customer scenarios on various test environments (and at times against the production environment)
Load the system-under-test with tens of thousands of audio-video participants using various protocols (SIP, H.323, WebRTC, HLS)
Load test other in-meeting and non-in-meeting features (SockJS, HTTP)
Monitor various systems, analyze metrics dashboards, debug stability issues, and profile and benchmark performance
Recommend future production sizing based on test results
Examples of our usual work:
We load the geo-distributed system-under-test from geo-distributed clients worldwide
We manage and use environments in AWS and Google Compute Engine
We monitor various Couchbase/RabbitMQ/ZooKeeper/Cassandra metrics under various load conditions to predict upcoming issues
Tooling:
A lot of tooling: load generation, failure simulation, monitoring
Post-execution analysis: collect metrics, analyze, report
Some examples:
Bring AWS/Rackspace test environments up and down on an as-needed basis
Gather various metrics from different nodes in the system-under-test, such as Couchbase/ZooKeeper ops per second
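As an illustration of the ops-per-second item above, the following Python sketch turns periodic readings of a cumulative operation counter into per-interval rates. It assumes the readings have already been collected as (timestamp in seconds, counter value) pairs; no Couchbase or ZooKeeper API is called, and the function name and sample values are hypothetical.

```python
def ops_per_second(samples):
    """Convert cumulative counter samples into per-interval rates.

    samples: list of (timestamp_seconds, cumulative_ops) pairs, sorted by time.
    Returns one rate for each consecutive pair of samples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            # Skip duplicate or out-of-order readings instead of dividing by zero.
            continue
        rates.append((c1 - c0) / elapsed)
    return rates


# Hypothetical readings taken every 10 seconds from a single node.
readings = [(0, 0), (10, 500), (20, 1500)]
print(ops_per_second(readings))  # [50.0, 100.0]
```

Comparing these rates across nodes and load levels is one way to spot a node that is falling behind before it becomes a stability issue.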
Key Skills
SYSTEM Software Testing H323 Systems CAN Engineering BS OPS Databases Production Computer Science Software Drive Testing BASIS WHO