Improve Performance with C# Job System and Burst Compiler in Unity
This article was originally written in Chinese under my work account here.
When developing games, it’s common to scale things down to get better performance: the number of objects in the scene, the draw distance, the frequency of simulation, etc.
Here I’ll demonstrate how to modify a boids (flocking behavior) script by applying the Job System and the Burst compiler to it. The result handles more boids in the scene while requiring far less CPU processing time.
Let’s start with the boids script. Games today often use the rules Craig Reynolds designed in his Boids program (1987) to mimic the flocking behavior of birds and schooling fish. There are different implementations online. In this blog post I’ll modify Keijiro’s Boids script (2014) to support the Job System and the Burst compiler.
The rules applied in the simplest Boids world are as follows:
Separation: steer to avoid crowding local flockmates (a short-distance repulsion force)
Alignment: steer towards the average heading of local flockmates
Cohesion: steer to move towards the average position (center of mass) of local flockmates (a long-distance attraction force)
Keijiro implements these rules in the BoidBehaviour script.
At runtime there will be N objects with a BoidBehaviour component. Here I simply refer to those objects as agents.
This is how the demo scene looks:
That seems fine for 30 agents: we can still hold more than 150 FPS. Let’s turn the agent count up to 100.
After raising the agent count to 100, the CPU processing time gets longer and the frame rate drops to around 90 FPS. Of course we can convince ourselves that 90 FPS is good enough, for now. But what if our scene requires many schools of fish or flocks of birds at the same time? What if the number of agents goes up to 300?
With 300 agents, CPU processing time is now more than 30ms. That means the frame rate will never reach 30 FPS in practice, since we will probably have more objects on screen besides the boid agents (FX, characters, UI, buildings, etc.). Right now there is no way to have 300 agents on screen while maintaining a good enough frame rate.
Optimization
If you are somewhat familiar with Unity, or have read the 10000 Update() calls blog, you might think of merging all the Update calls of BoidBehaviour together: use a manager class that holds references to all agents, move the calculations that were inside BoidBehaviour.Update into the manager’s Update, and then update every agent’s state once per frame from there.
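A minimal sketch of that manager pattern could look like the following. BoidManager, BoidAgent, Register and Tick are hypothetical names for illustration, not taken from the original project, and the movement inside Tick is a placeholder for the real boid logic. It needs the Unity engine to run, and each MonoBehaviour would normally live in its own file.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical agent: it has no Update() of its own, so Unity does not call into it every frame.
public class BoidAgent : MonoBehaviour
{
    public float speed = 2f; // illustrative value

    // Called by the manager instead of Unity's Update message.
    public void Tick(float deltaTime)
    {
        // Placeholder movement; the real separation/alignment/cohesion logic would go here.
        transform.position += transform.forward * (speed * deltaTime);
    }
}

// Hypothetical manager: the only MonoBehaviour that receives Update().
public class BoidManager : MonoBehaviour
{
    private readonly List<BoidAgent> m_agents = new List<BoidAgent>();

    public void Register(BoidAgent agent) { m_agents.Add(agent); }
    public void Unregister(BoidAgent agent) { m_agents.Remove(agent); }

    private void Update()
    {
        float dt = Time.deltaTime; // read once per frame instead of once per agent
        for (int i = 0; i < m_agents.Count; i++)
        {
            m_agents[i].Tick(dt);
        }
    }
}
```

With this layout Unity dispatches a single Update per frame regardless of the agent count, which is where the saving comes from.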
That definitely helps, especially when you have thousands of agents. But it’s not enough. The CPU bottleneck occurs long before I get to thousands of agents in the scene. Even my MacBook Pro cannot handle as few as 500 agents.
Let’s find the bottleneck with the help of the profiler. It turns out the most costly part seems to be the physics-related calls. After commenting out L71-L81, the processing time goes down from 20ms to 1.36ms!
It’s safe to say the bottleneck is right here. Let’s take a look at those lines.
var nearbyBoids = Physics.OverlapSphere(.....)
foreach (var boid in nearbyBoids)
{
    if (boid.gameObject == gameObject) continue;
    var t = boid.transform;
    separation += GetSeparationVector(t);
    alignment += t.forward;
    cohesion += t.position;
}
To get the vectors for the movement calculation (separation, alignment, cohesion), we need each neighboring agent’s world position and forward direction. The program relies on the Physics.OverlapSphere API to decide whether an agent counts as a neighbor.
Physics.OverlapSphere(): returns a newly allocated array on every call, so GC allocations grow along with the number of colliders in the scene.
transform.position and boid.transform: each access has CPU overhead as well, which becomes very costly once the number of transforms rises.
Now that we have identified the bottleneck, let’s modify the program to use the C# Job System by creating a TransformJob that implements IJobParallelForTransform.
Job System?
The Unity C# Job System lets you write simple and safe multithreaded code that interacts with the Unity Engine for enhanced game performance. The highlights are its safety system and its ability to schedule dependencies between jobs.
One thing to note is that you can only use value types inside a job, and you must manipulate data through native containers (which live in native memory). Native containers produce no GC allocations, but you need to release their memory manually.
For details, you can check the official manual.
IJobParallelFor?
The job system tries to execute a job that implements IJobParallelFor as much in parallel as possible. Suppose we have a parallel job with length 100 and batch size 10. The job system splits the work into index ranges 0–9, 10–19, 20–29, etc., schedules them on different worker threads and executes them in parallel. This is very useful when we want to run the same calculation on every element of a fixed-length native container.
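As a minimal sketch of that length-100, batch-size-10 case: DoubleValuesJob and DoubleValuesExample below are hypothetical names made up for illustration, and the snippet assumes it runs inside a Unity project. It also shows the manual Dispose that native containers require.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical job: doubles every element of a NativeArray in parallel.
public struct DoubleValuesJob : IJobParallelFor
{
    public NativeArray<float> Values;

    // Called once per index; worker threads receive whole batches of indices.
    public void Execute(int index)
    {
        Values[index] = Values[index] * 2f;
    }
}

public class DoubleValuesExample : MonoBehaviour
{
    private void Start()
    {
        // Native memory is not tracked by the GC, so it has to be disposed manually.
        var values = new NativeArray<float>(100, Allocator.TempJob);
        for (int i = 0; i < values.Length; i++) values[i] = i;

        var job = new DoubleValuesJob { Values = values };

        // Length 100, batch size 10: work is handed out as 0-9, 10-19, 20-29, ...
        JobHandle handle = job.Schedule(values.Length, 10);
        handle.Complete(); // block until every batch has finished

        Debug.Log(values[99]);
        values.Dispose();
    }
}
```

Each Execute call only touches its own index, which is what lets the safety system accept the job without further attributes.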
A job can execute on any thread, so all Unity APIs that can only be invoked from the main thread are prohibited inside it: things like Time.deltaTime, Physics.XXX, etc. Reference types are also blocked, because the safety system rules out every possibility of creating a race condition. So code like BoidBehaviour L67 and L68:
var alignment = controller.transform.forward;
var cohesion = controller.transform.position;
cannot be used. To work around this, we need to copy all the required data before we create the job.
Instead of passing an agent reference into the calculation, we extract only the data the calculation needs into arrays and pass those to the job, which is a more data-oriented approach.
For example:
private void Update()
{
    .....
    .....
    m_transJob = new TransformJob()
    {
        BoidVelocities = m_boidVelocities,
        BoidPositions = m_boidPositions,
        BoidRotations = m_boidRotations,
        ControllerFoward = transform.forward,
        ControllerPosition = transform.position,
        RotationCoeff = rotationCoeff,
        DeltaTime = Time.deltaTime,
        NeighborDist = neighborDist,
        Speed = velocity,
    };
    .....
    .....
}

public struct TransformJob : IJobParallelForTransform
{
    // Assigned in Update above but missing from the original listing; the element type is an assumption.
    public NativeArray<Vector3> BoidVelocities;
    [ReadOnly]
    public NativeArray<Vector3> BoidPositions;
    [ReadOnly]
    public NativeArray<Quaternion> BoidRotations;
    [ReadOnly]
    public Vector3 ControllerFoward;
    [ReadOnly]
    public Vector3 ControllerPosition;
    [ReadOnly]
    public float RotationCoeff;
    [ReadOnly]
    public float DeltaTime;
    [ReadOnly]
    public float NeighborDist;
    [ReadOnly]
    public float Speed;

    // Called once per transform; index identifies the agent in the arrays above.
    public void Execute(int index, TransformAccess trans)
    {
        ..........
        ..........
        ..........
    }
}
You may notice that although we no longer have a reference to each agent, we can now use index N to retrieve the data of agent N.
Finally, move BoidBehaviour’s movement logic into the job’s Execute:
We can’t use Physics.OverlapSphere inside the job. To find neighboring agents, I use a very brute-force approach: calculate the distance between the current agent and every other agent, and mark an agent as a neighbor if the distance is smaller than a certain length.
This is extremely inefficient (N agents means N*N distance calculations). But with the help of the Burst compiler, it becomes a workable approach.
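The brute-force search described above can be sketched in plain C# as follows. This is not the author's Execute body: the names are hypothetical, the separation weighting is an assumption, and System.Numerics.Vector3 with managed arrays stands in for Unity's Vector3 and NativeArray so that the snippet runs outside Unity.

```csharp
using System;
using System.Numerics;

public static class BruteForceNeighbors
{
    // Sums the three steering inputs for one agent by testing it against every other agent.
    // Calling this for all N agents performs N * N distance checks in total.
    public static (Vector3 separation, Vector3 alignment, Vector3 cohesion, int count) Accumulate(
        int self, Vector3[] positions, Vector3[] forwards, float neighborDist)
    {
        Vector3 separation = Vector3.Zero;
        Vector3 alignment = Vector3.Zero;
        Vector3 cohesion = Vector3.Zero;
        int count = 0;

        for (int other = 0; other < positions.Length; other++)
        {
            if (other == self) continue; // an agent is not its own neighbor

            Vector3 diff = positions[self] - positions[other];
            float dist = diff.Length();

            // Skip agents that are too far away, or exactly overlapping (avoids dividing by zero).
            if (dist >= neighborDist || dist <= 0f) continue;

            // Assumed weighting: push away harder the closer the neighbor is.
            separation += diff * ((1f - dist / neighborDist) / dist);
            alignment += forwards[other];
            cohesion += positions[other];
            count++;
        }

        return (separation, alignment, cohesion, count);
    }
}
```

Dividing the alignment and cohesion sums by count afterwards gives the average heading and the center of mass that the rules refer to.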
Burst Compiler?
Burst is a compiler that translates IL/.NET bytecode into highly optimized native code using LLVM. Every job decorated with the [BurstCompile] attribute will be compiled by it.
Note that BoidPositions and BoidRotations are both marked as [ReadOnly].
Since different indexes execute on different worker threads, a race condition might occur if we tried to change the elements of those arrays (Unity will show a warning, though). To write the latest agent state back into these two arrays, we need to schedule another job that updates them.
[BurstCompile]
public struct UpdateArrayJob : IJobParallelForTransform
{
    [WriteOnly]
    public NativeArray<Vector3> BoidPositions;
    [WriteOnly]
    public NativeArray<Quaternion> BoidRotations;

    // Copy the current transform state of agent `index` back into the arrays.
    public void Execute(int index, TransformAccess trans)
    {
        BoidPositions[index] = trans.position;
        BoidRotations[index] = trans.rotation;
    }
}
After setting up all the jobs, schedule them and wait for them to complete.
void Update()
{
    .....
    m_transJob = new TransformJob()
    {
        BoidVelocities = m_boidVelocities,
        BoidPositions = m_boidPositions,
        BoidRotations = m_boidRotations,
        ControllerFoward = transform.forward,
        ControllerPosition = transform.position,
        RotationCoeff = rotationCoeff,
        DeltaTime = Time.deltaTime,
        NeighborDist = neighborDist,
        Speed = velocity,
    };
    .....
    .....
    // The job system schedules/executes a job once we invoke Schedule.
    m_updateArrayHandle = m_updateArrayJob.Schedule(m_boidsTransformArray);
    m_JobHandle = m_transJob.Schedule(m_boidsTransformArray, m_updateArrayHandle);

    // JobHandle.Complete() makes sure that every job this handle depends on has completed.
    m_JobHandle.Complete();

    // Usually we call JobHandle.Complete() inside LateUpdate. I only put
    // it here because it's easier to observe the changes in the profiler.
}
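For reference, the "schedule in Update, complete in LateUpdate" arrangement mentioned in the comment might look like the sketch below. MoveForwardJob and MoveForwardRunner are hypothetical stand-ins rather than the jobs from this project, and the snippet assumes a Unity project with the agents array assigned in the Inspector.

```csharp
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical job: moves every transform forward along its own facing direction.
public struct MoveForwardJob : IJobParallelForTransform
{
    public float Speed;
    public float DeltaTime;

    public void Execute(int index, TransformAccess transform)
    {
        Vector3 forward = transform.rotation * Vector3.forward;
        transform.position += forward * (Speed * DeltaTime);
    }
}

public class MoveForwardRunner : MonoBehaviour
{
    public Transform[] agents; // assumed: assigned in the Inspector
    public float speed = 2f;   // illustrative value

    private TransformAccessArray m_transforms;
    private JobHandle m_handle;

    private void Start()
    {
        m_transforms = new TransformAccessArray(agents);
    }

    private void Update()
    {
        var job = new MoveForwardJob { Speed = speed, DeltaTime = Time.deltaTime };
        // Schedule returns immediately; worker threads run the job while the main thread carries on.
        m_handle = job.Schedule(m_transforms);
    }

    private void LateUpdate()
    {
        // Give the workers the rest of the Update phase, then wait before the frame is rendered.
        m_handle.Complete();
    }

    private void OnDestroy()
    {
        m_handle.Complete(); // never dispose while a job may still be running
        if (m_transforms.isCreated) m_transforms.Dispose(); // native memory is released manually
    }
}
```

The gap between Schedule and Complete is main-thread time that no longer blocks on the job, which is why Complete is usually deferred.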
The profiling result for the modified boid script:
Compared to the original script without jobs, we now save 97% of the CPU time.
It’s even faster than when we commented out BoidBehaviour L71-L81 (well, getting rid of 300 Update calls certainly helps a little). The program also has zero GC allocations thanks to the native containers. It could be even faster if we removed the colliders from the agent transforms, since we no longer use Physics.OverlapSphere.
Profiling Results:
300 agents:
IJobParallelForTransform (With Burst) -> 0.47ms
IJobParallelForTransform (No Burst) -> 8.5ms
300 Update (BoidBehaviour) -> 18ms
1000 agents:
IJobParallelForTransform (With Burst) -> 4.55ms
IJobParallelForTransform (No Burst) -> 71ms
1000 Update (BoidBehaviour) -> 67.9ms
Without Burst, the N*N distance calculations are still too slow. Maybe I could use some sort of spatial partitioning or grouping, or use the world position as a key and check nearby keys in a hash map. But for now, 500 agents is enough for my scene.
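As a rough illustration of the hash map idea, and not something the project implements, here is a plain C# sketch of a uniform grid keyed by cell coordinates. SpatialHashGrid and its members are hypothetical names. It uses managed collections and System.Numerics, so it would need to be ported to native containers before it could run inside a Burst job.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public sealed class SpatialHashGrid
{
    private readonly float m_cellSize;
    private readonly Dictionary<(int, int, int), List<int>> m_cells =
        new Dictionary<(int, int, int), List<int>>();

    // cellSize must be at least the query radius so the 3x3x3 block of cells covers it.
    public SpatialHashGrid(float cellSize) { m_cellSize = cellSize; }

    // Maps a world position to the integer coordinates of the cell containing it.
    private (int, int, int) CellOf(Vector3 p)
    {
        return ((int)Math.Floor(p.X / m_cellSize),
                (int)Math.Floor(p.Y / m_cellSize),
                (int)Math.Floor(p.Z / m_cellSize));
    }

    // Rebuilds the grid from scratch; agents move every frame, so this would run once per frame.
    public void Build(Vector3[] positions)
    {
        m_cells.Clear();
        for (int i = 0; i < positions.Length; i++)
        {
            var key = CellOf(positions[i]);
            if (!m_cells.TryGetValue(key, out var list))
            {
                list = new List<int>();
                m_cells[key] = list;
            }
            list.Add(i);
        }
    }

    // Returns indices of agents closer than radius to agent `self`,
    // checking only the 27 cells around it instead of all N agents.
    public List<int> Query(int self, Vector3[] positions, float radius)
    {
        var result = new List<int>();
        var (cx, cy, cz) = CellOf(positions[self]);
        float radiusSq = radius * radius;

        for (int x = cx - 1; x <= cx + 1; x++)
        for (int y = cy - 1; y <= cy + 1; y++)
        for (int z = cz - 1; z <= cz + 1; z++)
        {
            if (!m_cells.TryGetValue((x, y, z), out var list)) continue;
            foreach (int other in list)
            {
                if (other == self) continue;
                if (Vector3.DistanceSquared(positions[self], positions[other]) < radiusSq)
                    result.Add(other);
            }
        }
        return result;
    }
}
```

When agents are spread out, each query touches only a handful of candidates, so the per-frame cost grows roughly with N rather than N*N. A tightly packed flock that fits in a few cells would still degrade toward the brute-force case.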
Conclusion
With the release of the Job System and the Burst compiler, Unity users now have more options for writing performant code. And since the learning curve is relatively gentle (thanks to the safety system), applications made with Unity can become more efficient and larger in scale.
And for people like me who started learning game development with Unity, it’s a great chance to study things like multithreading, C# managed types, data-oriented design, etc. There are lots of posts online with deeper discussions of the Job System that really opened my eyes.
After Thoughts
Simply moving the script to jobs is not enough. Here I list three features we could add to it.
Obstacle Avoidance
One really important concept for boids is obstacle avoidance. We can use the RaycastCommand and SpherecastCommand APIs to perform raycasts and spherecasts that determine the heading direction with and without obstacles.
Both commands are GC free, and since they run as parallel jobs they perform better than regular raycasts.
Optimize more with Mathematics
I still use Unity’s Vector3, Quaternion and the old Mathf inside the job.
Unity has released a package called Mathematics. It is a C# math library providing vector types and math functions with a shader-like syntax, and the Burst compiler uses it to compile C#/IL to highly efficient native code. According to another blog, switching the variable types in a job to the ones Mathematics provides can yield a 5~10% performance increase.
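To give a feel for the shader-like syntax, here is a small sketch assuming the Mathematics package is installed. BoidMath and both helpers are hypothetical, and the separation weighting is an assumption rather than the project's exact formula.

```csharp
using Unity.Mathematics;

public static class BoidMath
{
    // Hypothetical helper: a separation term written with Unity.Mathematics types.
    public static float3 SeparationVector(float3 self, float3 other, float neighborDist)
    {
        float3 diff = self - other;
        float dist = math.length(diff);

        // Assumed weighting: strongest when the neighbor is close, zero at neighborDist.
        float scaler = math.saturate(1f - dist / neighborDist);

        // normalizesafe returns a zero vector instead of NaN when diff has zero length.
        return math.normalizesafe(diff) * scaler;
    }

    // Comparing squared distances avoids a square root per neighbor test.
    public static bool IsNeighbor(float3 self, float3 other, float neighborDist)
    {
        return math.distancesq(self, other) < neighborDist * neighborDist;
    }
}
```

Inside a job, the position and velocity arrays would then be declared as NativeArray of float3 instead of Vector3.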
Optimize more with GPU Instancing
In fact, we no longer need to instantiate a GameObject for each agent.
We have already extracted all the data we need into arrays. Instead of instantiating a GameObject per agent, we can use DrawMeshInstanced or DrawMeshInstancedIndirect to draw the agents by passing position, rotation and mesh data directly to the GPU. This saves a lot of CPU time that would otherwise be spent processing GameObjects and vertex data on the CPU side.
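A minimal sketch of the DrawMeshInstanced route, assuming a material with GPU instancing enabled: BoidInstancedRenderer and its fields are hypothetical names, and the managed arrays passed to Draw stand in for whatever per-agent buffers the jobs fill.

```csharp
using UnityEngine;

// Hypothetical renderer: draws every agent from position/rotation arrays, with no GameObject per agent.
public class BoidInstancedRenderer : MonoBehaviour
{
    public Mesh mesh;         // assumed: assigned in the Inspector
    public Material material; // assumed: "Enable GPU Instancing" is ticked on this material

    // Graphics.DrawMeshInstanced accepts at most 1023 instances per call.
    private const int MaxBatch = 1023;
    private readonly Matrix4x4[] m_matrices = new Matrix4x4[MaxBatch];

    // Call once per frame (for example from Update) after the jobs have completed.
    public void Draw(Vector3[] positions, Quaternion[] rotations)
    {
        for (int start = 0; start < positions.Length; start += MaxBatch)
        {
            int count = Mathf.Min(MaxBatch, positions.Length - start);
            for (int i = 0; i < count; i++)
            {
                // Build one object-to-world matrix per agent in this batch.
                m_matrices[i] = Matrix4x4.TRS(positions[start + i], rotations[start + i], Vector3.one);
            }
            Graphics.DrawMeshInstanced(mesh, 0, material, m_matrices, count);
        }
    }
}
```

The draw call only lasts for the current frame, so Draw has to be issued every frame. DrawMeshInstancedIndirect avoids the per-call instance limit but requires filling GPU buffers, which this sketch does not cover.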
References
Unity Official
Job System Manual
Burst User Guide
On DOTS: C++ & C#
Unity at GDC — C# to Machine Code
Unite Europe 2017 — C# job system & compiler
JacksonDunstan.com
How to Write Faster Code Than 90% of Programmers
Job System Tutorial
C# Tasks vs. Unity Jobs
Free Performance with Unity.Mathematics
I recommend you watch both [Unite Europe 2017 — C# job system & compiler] and [On DOTS: C++ & C#].
If you are interested in the project, you can download the Unity package here.