Background
Object tracking in aerial imagery remains a largely unsolved problem. In the AI overmatch future powered by Generative AI and platforms like Scale Donovan, expert algorithms such as automatic target recognition (ATR) and chain-of-custody play a crucial role as the tools that let these models expand beyond a purely text and structured-data domain.
Although there is extensive research on object tracking in the academic realm, those solutions largely target accurate assignment of track IDs on top of fixed object detector outputs. Those outputs come from a nearly perfect detector running on common commercial footage, typically captured by a front-facing, ground-mounted camera in a fixed position with fixed zoom levels.
However, the needs of many Federal customers do not align with these cookie-cutter solutions. They often require aerial imagery and full motion video capabilities, which present additional challenges on top of the accurate assignment of track IDs required in the commercial realm:
- The base object detectors are less performant, meaning there is significant room for the tracker to use temporal information to reduce false positives and false negatives.
- The camera moves quite a bit in terms of position (i.e., panning) and zoom levels.
- Long-term re-identification of targets is non-trivial and mission-critical.
Scale’s Solution: A Suite of Tracker Post-Processors
To address these problems, Scale offers a suite of tracker post-processors, each of which can be turned on or off. We optionally package these on top of our own custom base tracker, which achieves a state-of-the-art balance between precision and recall when tracking in the presence of camera movement.
These post-processors can be used alone to enhance an existing tracker, or as an end-to-end tracking package on top of an existing detector. The package is called in one line of code per frame: after each object detector inference (for users of the full package) or after each base tracker inference (for users of just the post-processors).
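As a rough sketch of that calling pattern, the snippet below shows the per-frame loop. The `scale_tracker` module, the `ScaleTracker` class, its toggle flags, and the `update()` signature are all hypothetical names used for illustration, not the actual package interface.

```python
# All names here are hypothetical -- this only illustrates the one-call-per-frame pattern.
from scale_tracker import ScaleTracker  # assumed module and class name


def track_video(frames, detector):
    """Run detection plus tracking with one tracker call per frame."""
    tracker = ScaleTracker(
        use_base_tracker=True,   # full package: post-processors on top of the base tracker
        flicker_reduction=True,  # each post-processor can be toggled on or off
        local_reid=True,
        global_reid=True,
    )
    results = []
    for frame in frames:
        detections = detector(frame)                        # existing detector inference
        results.append(tracker.update(frame, detections))   # the single per-frame call
    return results
```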
Sample Demo MP4 Results
Below, we compare our Scale Tracker packaged with our custom base tracker (25 FPS, left) against two of the best publicly available trackers: BYTETrack (30 FPS, middle), which is faster but has comparatively poor track ID permanence, and SmileTrack (10 FPS, right), which is slower and has slightly better track ID permanence than BYTETrack, though still not as good as Scale Tracker's:
The Scale tracker:
- Flickers less, creating more visually persistent tracks and a better user experience
- Maintains track IDs through periods of erroneous track dropping, producing tracks that exhibit more permanence
Empirical Results
Academic results
Note: The metrics for the other methods in the chart above were taken directly from the StrongSORT paper.
Even though our tracker is aimed at Federal problems, it also improves upon the performance of trackers in the commercial realm, achieving a state-of-the-art trade-off between speed and performance when paired with BYTETrack, as evaluated by IDF1 (a measure that combines identification precision and recall) on the common commercial tracking dataset MOT17.
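For reference, IDF1 is the identification F1 score from the multi-object tracking literature. It can be written in terms of identity true positives (IDTP), false positives (IDFP), and false negatives (IDFN), or equivalently as the harmonic mean of identification precision (IDP) and recall (IDR):

```latex
\mathrm{IDF1} \;=\; \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}} \;=\; \frac{2\,\mathrm{IDP}\cdot\mathrm{IDR}}{\mathrm{IDP} + \mathrm{IDR}}
```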
The chart above demonstrates that ScaleTrack (packaged as our post-processors paired with BYTETrack) achieves a better speed/performance trade-off than all other methods. BYTETrack and OC-SORT are slightly faster, but significantly less performant than our method.
Why the emphasis on the comparison to BYTETrack (and not StrongSORT or SmileTrack)? Any tracker that cannot run at near real time is of little use for Full Motion Video (FMV) workflows, because most FMV users want algorithms that can run as they review mission videos. Most cameras collect at around 20 FPS, meaning only the algorithms to the right of the red line can be of high value to FMV workflows.
Federal results
The graph above comes from evaluation on an academic dataset. On mission-specific datasets, our margin of improvement is even larger. For visual evidence of our advantage over the state of the art on Federal datasets, please see the MP4 sample results above.
How Did We Achieve These Results? (Technical Deep Dive)
Reduction of flickering
Detectors essentially never achieve 100% recall in the Federal domain; they often miss objects. When this happens, most trackers simply give up and kill the track. Certain trackers instead keep the object around in a half dead/half alive state for about 5 frames, giving it the option of being revived if the object appears again.
We went with a different approach. When a track isn't propagated to the next frame, we become skeptical of the detector and try to persist the track by querying the image directly through a two-step process. This process applies bounding box location prediction functions for the given track. The first prediction function is based on Kalman filter information.
However, Kalman filters sometimes fail. Scale therefore conducted a thorough, detailed study of these failure cases and created a second, custom prediction function in bounding box search space. Given these bounding box hypotheses (from the Kalman filter prediction function and our custom prediction function), we use a highly specialized visual feature function (which is not customer-locked or security-locked and therefore can be used by any customer at any security level) to select the bounding box that is most visually consistent with that track's visual history. Based on the level of visual consistency, we either persist the track to the next frame at the selected bounding box's location, or enter it into a state that is not yet dead but not entirely alive (i.e., a bank of tracks from which ReID can draw).
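A schematic of that two-step persistence logic is sketched below in Python. The track attributes, the `custom_search_fn` proposal function, the `crop_fn` helper, and the threshold value are all illustrative assumptions rather than the production implementation.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def try_persist_track(track, frame, appearance_model, custom_search_fn, crop_fn,
                      consistency_threshold=0.6):
    """Two-step persistence for a track the detector missed (illustrative only).

    Assumed interfaces: `track` exposes a Kalman-predicted box, a history of
    appearance embeddings, and persist/bank methods; `custom_search_fn` proposes
    additional candidate boxes; `crop_fn` cuts a box out of the frame.
    """
    # Step 1: candidate boxes from the Kalman prediction plus the custom
    # prediction function in bounding box search space.
    candidates = [track.kalman_predicted_box()] + list(custom_search_fn(track, frame))

    # Step 2: score each candidate's visual consistency with the track's history.
    history = np.mean(track.appearance_history, axis=0)
    scores = [cosine_sim(appearance_model(crop_fn(frame, box)), history)
              for box in candidates]

    best = int(np.argmax(scores))
    if scores[best] >= consistency_threshold:
        track.persist_at(candidates[best])   # keep the track alive at the chosen box
    else:
        track.move_to_reid_bank()            # not dead, not fully alive: ReID can draw from it
```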
Ability to maintain track IDs through short periods of erroneous track dropping
Sometimes the flickering reduction mechanisms fail. But it is important that when the track shows up again, it maintains the same ID. So, for a set number of frames after a track is last seen, we use its Kalman filter to project the track forward over every frame. If a new track appears that is visually consistent with the dead track, and that new track is in a location that is highly likely under the dead track's motion prediction distribution, we assign the dead track's ID to the new track. This "revives" the old track in the new track's location, allowing us to maintain the track ID even though the object was missing for a short while. We refer to this as "Local ReID."
Outside of handling algorithmic false negatives, this also addresses short-term occlusions (e.g. a vehicle driving underneath an occluding tree).
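The sketch below captures the Local ReID rule described above, reusing the `cosine_sim` helper from the earlier sketch; the track attributes, the motion-likelihood accessor, and the threshold values are assumptions for illustration.

```python
def local_reid_step(recently_dead_tracks, new_tracks, visual_thresh=0.6,
                    motion_thresh=0.05, max_frames_missing=30):
    """One frame of Local ReID (attribute names, helpers, and thresholds are assumed).

    Each dead track keeps projecting itself forward with its Kalman filter and
    exposes `motion_likelihood(point)`, the likelihood of a point under its
    predicted motion distribution.
    """
    for dead in recently_dead_tracks:
        if dead.frames_since_seen > max_frames_missing:
            continue                          # outside the short-term window
        dead.kalman_predict()                 # project the dead track into this frame
        for new in new_tracks:
            if new.id_assigned_by_reid:
                continue
            looks_the_same = cosine_sim(new.appearance_embedding,
                                        dead.appearance_embedding) >= visual_thresh
            plausibly_there = dead.motion_likelihood(new.box_center) >= motion_thresh
            if looks_the_same and plausibly_there:
                new.track_id = dead.track_id   # revive the old ID on the new track
                new.id_assigned_by_reid = True
                break
```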
Ability to maintain track IDs through long periods in which the track wasn't present
This addresses the following use cases:
- Objects coming in and out of the frame, potentially at an entry location significantly different from the exit location
- Objects that are occluded for long periods of time (e.g., a person coming in and out of a building)
- Camera zoom-ins, zoom-outs, and pans
In these cases, traditional motion prediction breaks down, meaning motion-based prediction algorithms cannot be used to link new tracks with dead tracks. However, objects rarely change their appearance significantly within the span of a video. So we introduce "Global ReID," which attempts to link dead tracks to new tracks strictly through a harsh visual filter.
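A minimal sketch of appearance-only Global ReID follows, again with hypothetical track attributes, the `cosine_sim` helper from the earlier sketches, and an assumed similarity threshold standing in for the actual "harsh" filter.

```python
def global_reid(dead_tracks, new_tracks, strict_visual_thresh=0.85):
    """Appearance-only linking of new tracks to dead tracks (illustrative sketch).

    No motion term is used: a new track inherits a dead track's ID only if their
    appearance embeddings pass a deliberately harsh visual threshold.
    """
    for new in new_tracks:
        best_match, best_sim = None, strict_visual_thresh
        for dead in dead_tracks:
            sim = cosine_sim(new.appearance_embedding, dead.appearance_embedding)
            if sim >= best_sim:
                best_match, best_sim = dead, sim
        if best_match is not None:
            new.track_id = best_match.track_id
            dead_tracks.remove(best_match)    # each dead ID can be revived at most once
```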
Ability to maintain track IDs through medium-length periods in which the track wasn't present
We further address the first two use cases in the section above using a ReID function based on custom motion functions mixed with the above-mentioned visual filter, with the time elapsed between the dead and new tracks as a salient input to the mixture function. The larger the gap between these tracks, the less the linking function relies on spatial information and the more it relies on visual information, since uncertainty in motion increases as the time gap grows.
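One simple way to realize such a time-dependent mixture is sketched below; the exponential decay form and the `tau` constant are illustrative assumptions, not Scale's actual mixture function.

```python
import math


def medium_term_link_score(visual_similarity, spatial_likelihood, frames_gap, tau=60.0):
    """Blend spatial and visual evidence when linking a dead track to a new one.

    The weight on spatial (motion) evidence decays as the time gap grows, so long
    gaps are judged almost entirely on appearance; `tau` controls how quickly that
    shift happens. The exponential form is an assumed choice, for illustration only.
    """
    w_spatial = math.exp(-frames_gap / tau)
    return w_spatial * spatial_likelihood + (1.0 - w_spatial) * visual_similarity
```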