In addition, Hubble writes trace points to an in-memory
ring buffer that is only flushed when a problem is detected.
This allows it to run continuously and capture information
just-in-time leading up to a failure. Thanks to a concise,
variable-length encoding in which most trace points occupy
eight bytes, a small (32 MB) ring buffer captures sufficient
debugging information.
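This kind of compact encoding can be sketched as follows; the field widths, flag bit, and function names here are our own illustration, not Hubble's actual format:

```python
import struct

SHORT = 1 << 63  # hypothetical flag bit marking an 8-byte record

def encode(method_id, ts_delta):
    # Common case: a 24-bit method ID and a 39-bit timestamp delta
    # fit, together with the flag bit, in a single 8-byte word.
    if method_id < (1 << 24) and ts_delta < (1 << 39):
        return struct.pack("<Q", SHORT | (method_id << 39) | ts_delta)
    # Rare case: a field overflows the compact widths; emit 16 bytes.
    return struct.pack("<QQ", method_id, ts_delta)

def decode_all(buf):
    # Walk the buffer, using the flag bit to tell record sizes apart.
    out, i = [], 0
    while i < len(buf):
        (word,) = struct.unpack_from("<Q", buf, i)
        if word & SHORT:
            out.append(((word >> 39) & 0xFFFFFF, word & ((1 << 39) - 1)))
            i += 8
        else:
            (delta,) = struct.unpack_from("<Q", buf, i + 8)
            out.append((word, delta))
            i += 16
    return out
```

Under these illustrative widths, roughly 16 million distinct methods and inter-event deltas of up to 2^39 ns (about nine minutes) stay on the 8-byte fast path.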
Third, Hubble’s performance-sensitive instrumentation
logic is written in assembly. This ensures that performance
is optimal even on a device’s low-power (little) cores, which
either cannot perform out-of-order execution or have only
small instruction-reordering buffers. In addition, this decouples
Hubble from the Android compiler’s compilation flow, which
prevents the compiler from affecting the correctness of the
tracing logic and eases maintainability.
Finally, Hubble avoids using expensive synchronization
primitives [14] in two ways: threads write trace points to
thread-local buffers, avoiding inter-thread synchronization;
and Hubble communicates with these threads using a
purpose-built lock-free synchronization protocol.
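The thread-local design can be illustrated with the following sketch; Hubble's real implementation is in assembly and uses its own lock-free protocol, so the class and function names below are hypothetical:

```python
import threading

# Illustrative sketch (not Hubble's actual protocol): each thread owns a
# fixed-size ring buffer and a monotonically increasing write cursor, so
# appending a trace point requires no inter-thread locking.

class ThreadLocalRing:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.cursor = 0  # only ever advanced by the owning thread

    def append(self, trace_point):
        self.slots[self.cursor % self.capacity] = trace_point
        self.cursor += 1  # advance only after the slot is written

    def snapshot(self):
        # Collector-side read: oldest-first view of the most recent
        # `capacity` entries, taken without blocking the writer.
        end = self.cursor
        start = max(0, end - self.capacity)
        return [self.slots[i % self.capacity] for i in range(start, end)]

_local = threading.local()

def trace(method_id, capacity=4):
    # Lazily create the calling thread's private ring buffer.
    ring = getattr(_local, "ring", None)
    if ring is None:
        ring = _local.ring = ThreadLocalRing(capacity)
    ring.append(method_id)
    return ring
```

Because each buffer has a single writer, the only cross-thread coordination needed is for the collector to observe the cursor, which is where a purpose-built lock-free protocol (with appropriate memory ordering in a real implementation) comes in.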
The end result is a highly efficient method-tracing system
sufficient for debugging intermittent performance bugs. In
our microbenchmarks, each trace point costs less than one
nanosecond for nearly empty methods, and tracing overheads
are quickly amortized when methods perform meaningful
operations. Hubble’s tracing overhead is also unnoticeable in
Huawei’s continuous-integration performance testing infras-
tructure, which includes a variety of workloads and devices.
Hubble’s memory overhead is approximately 64 MB by de-
fault, accounting for two 32 MB ring buffers. As of 2021,
Huawei’s lower-end smartphones have at least 4 GB of RAM,
while higher-end ones can have up to 12 GB. Therefore, Hub-
ble’s memory overhead is less than 2%.
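The overhead figure follows from simple arithmetic over the numbers above:

```python
# Two 32 MB ring buffers against the smallest RAM configuration mentioned.
overhead_mb = 2 * 32            # default Hubble footprint: 64 MB
low_end_ram_mb = 4 * 1024       # lower-end 2021 Huawei smartphone: 4 GB
overhead_pct = 100 * overhead_mb / low_end_ram_mb  # 1.5625%, under 2%
```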
Hubble also strives to protect user privacy. Similar to ex-
isting error reporting systems such as WER [21], MacOS [2]
and Mozilla [34] crash reports, Hubble’s traces are only col-
lected with user consent. However, these other systems collect
a minidump of the memory image, whereas Hubble’s traces
are far less sensitive: they only consist of method names and
timestamps and do not contain any variable values.
Hubble has been integrated into Huawei’s core Android
OS codebase and deployed across a wide range of smartphone
and tablet product lines since August 2020. Older devices
may receive Hubble’s functionalities via an over-the-air OS
update. Since deployment, Hubble has significantly eased
the debugging of intermittent performance problems. In fact,
engineers were able to quickly resolve many performance
problems that remained unresolved for months.
This paper makes the following contributions:
• The design and implementation of Hubble, a highly efficient
method-tracing subsystem for Android that satisfies a set
of unique, practical constraints, some of which are rarely
mentioned in the existing literature.
• Integration of Hubble’s traces with existing debugging
tools, like Perfetto [40], which can show call charts. This
significantly improved the traces’ utility: developers can
cross-examine Hubble traces against other runtime data.
• Case studies showing how Hubble diagnoses real-world
performance bugs that could not be resolved without it.
Hubble also has the following limitations. First, it can
only embed tracing logic into executions that go through
the Android compiler or interpreter (from bytecode); Hub-
ble cannot trace native libraries like those invoked through
the Java Native Interface (JNI). In addition, Hubble’s trace
buffer could pollute the CPU cache and slow down cache-
optimized workloads (e.g., loop tiling [8]). However, while
cache-optimization is commonplace in server workloads, it
is uncommon on smartphones, especially in the interactive
UI-thread. Nonetheless, we evaluate this effect in §8.
2 Related Work
Record and replay (R&R) tools [10,15,16,29,30,35,36,38,50,
57] work by recording a user’s input and all non-deterministic
events (e.g., scheduling), so that the execution can be faith-
fully replayed. R&R tools do not meet our requirements for
a few fundamental reasons. The first is overhead. Among
all R&R tools, Reverb [35] reported the best performance,
yet its overhead is still 5.5% on average (the worst-case is
not reported). It works only on JavaScript web applications,
where threads communicate using a message-passing inter-
face. When threads share memory, R&R incurs even higher
overhead. For instance, DoublePlay [57] reported a worst-case
overhead of 11% for network-bound workloads (Apache web-
server), 19% for disk-bound workloads (MySQL), and 278%
for CPU-bound workloads (SPLASH-2 ocean). To achieve
low overhead, some tools [33,38,45] do not record all non-
determinism, which prevents accurate replay. Second, since
intermittent performance bugs may take days to occur, R&R
traces will grow untenably large. While checkpointing could
allow replay from a partial trace, the checkpointing operation
itself is expensive [50]. Compared to a call chart, an R&R
trace also raises far greater privacy concerns. Finally,
R&R tools require deep integration with the Android runtime
and compiler. For instance, applying DoublePlay’s approach
to Android would require the runtime to run a parallel execu-
tion of the application, checkpoint and compare state between
the two processes, and so on. Hence, R&R tools would be
difficult to maintain within Android.
An attractive alternative is to use hardware support, like
Intel PT or ARM ETM, to record branch-level traces [12,28,60].
These tools have a worst-case runtime overhead of 1–2%.
However, there are two challenges on ARM devices. First,
the semantic gap in Android’s runtime complicates decoding
the branch-level trace, which captures the runtime’s execution
rather than the application’s. Second,
USENIX Association 16th USENIX Symposium on Operating Systems Design and Implementation 789