Hubble: Performance Debugging with
In-Production, Just-In-Time Method Tracing on Android
Yu Luo, University of Toronto
Kirk Rodrigues, University of Toronto
Cuiqin Li, Huawei Technologies Co., Ltd.
Feng Zhang, Huawei Technologies Co., Ltd.
Lijin Jiang, Huawei Technologies Co., Ltd.
Bing Xia, Huawei Technologies Co., Ltd.
David Lion, University of Toronto
Ding Yuan, University of Toronto
Abstract
Hubble is a method-tracing system shipped on all supported
and upcoming Android devices manufactured by Huawei, in
order to aid in debugging performance problems. Hubble in-
struments every non-inlined bytecode method’s entry and exit
to record the method’s name and a timestamp. Instead of per-
sisting all data, trace points are recorded into an in-memory
ring buffer where older data is constantly overwritten. This
data is only persisted when a performance problem is detected,
giving engineers access to invaluable, detailed runtime data
Just-In-Time before the detected anomaly. Hubble is highly
efficient, with its tracing inducing negligible overhead in real-
world usage and each trace point taking less than one nanosec-
ond in our microbenchmark. Hubble significantly eases the
debugging of user-experienced performance problems and
has enabled engineers to quickly resolve many bug tickets
that were open for months before Hubble was available.
1 Introduction
Today, Android devices are pervasive and tightly integrated
into people’s daily lives, yet users still experience perfor-
mance problems when using these devices. Unlike Apple’s
iOS and iPhone, the Android platform is far from a tightly-
coupled monolithic ecosystem—the hardware (manufactured
by OEMs), infrastructure system software (maintained by
Google and customized by OEMs), and applications are pro-
vided by different parties, and all layers are released in a rapid
yet uncoordinated development cycle. This open platform
makes testing enough combinations of hardware, systems soft-
ware, and applications particularly challenging. Thus, many of
the performance bugs that escape current testing practices are
intermittent, manifesting across multiple components main-
tained by different entities.
When end users experience an issue, it is often systems
vendors that shoulder the blame, before the root cause is ex-
posed [49]. This is particularly true for Android given its huge
user base, many of whom are not tech-savvy. When such users
experience an intermittent performance problem, they quickly
assume that their device is at fault, simply because they could
not immediately reproduce the issue on another device. How-
ever, the root cause could be in the application itself, only
triggered under specific conditions or inputs. To combat these
assumptions, device vendors are forced to devote ample engi-
neering and support resources to these issues.
Yet, diagnosing performance problems that occur on a
user’s device is extremely challenging, owing to a lack of suf-
ficient runtime information. While approaches like Windows
Error Reporting (WER) [21] are widely adopted, they can
only record runtime information after a problem is detected.
Oftentimes this is too late, as it misses crucial information
just before and during the problem. This is exacerbated for
performance problems, especially intermittent ones, because
the issue may vanish after being detected, before recording
starts. Indeed, the primary use of WER is not to record enough
information to debug an issue, but to collect error statistics
that are then used to prioritize debugging effort.
Recording debugging information before the problem oc-
curs is challenging. We cannot accurately predict when a
problem will occur, so the only option is to continuously
trace the system during normal execution. However, over-
head is a concern. Unlike servers, mobile devices are heavily
resource-constrained and their workloads are overwhelmingly
interactive. Sampling-based profiling tools are available, but
their trade-off between informativeness and performance is
poor. Non-sampling-based profiling tools, on the other hand,
are too heavyweight for continuous tracing. For example,
existing profiling tools on Android like Systrace [25] and
Android Studio’s CPU Profiler [24] can trace every method
call of an application. However, enabling this type of tracing
noticeably slows down an application, sometimes by more
than 10×, which is unsuitable for continuous use in produc-
tion. Individual applications may implement their own in-app
tracing [18,23,46], but such traces are typically only available
to those applications themselves.
As a result, problems reported to Android device vendors
typically only include system logs, sampled statistical metrics,
Design                    Unnoticeable   No src.   Maintain   big.LITTLE
Instrumentation via JIT        ✓            ✓
Ring-buffer & encoding         ✓
Hand-optimized asm             ✓                       ✓           ✓
Lock-free control              ✓

Table 1: Hubble’s designs and the requirements they satisfy. The
headings for the requirements are truncated as follows: “Unnotice-
able” refers to having unnoticeable overhead. “No src.” refers to
not requiring source code. “Maintain” refers to being maintainable.
“big.LITTLE” refers to supporting both big and little cores.
sparse Systrace traces, and details recorded after a problem
has occurred, like the device model, application name, and
symptom. Most times, this is not enough to be useful in de-
bugging intermittent performance problems, and engineers
are left “debugging in the dark.” Consequently, many bug tick-
ets are left open for months without any hope of resolution.
Worse yet, many bugs cannot even be properly triaged, and
after rounds of finger-pointing, it is often the low-level system
engineers that bite the bullet.
1.1 Challenges and Opportunities
Therefore, a production tracing system that can provide fine-
grained observability is desperately needed. However, con-
tinuous tracing in production is challenging; it needs to sat-
isfy a number of stringent requirements. First, the worst-case
overhead must be unnoticeable (it cannot exceed 3% or in-
crease the number of performance regressions throughout
the deployment cycle), regardless of whether the application
is running on the powerful (big) or weaker (little) cores in
ARM’s big.LITTLE architecture. In addition, the tool should
trace applications without access to their source. Finally, it
needs to be easy to maintain, and easy to merge with every
new (and often feature-breaking) Android release.
These goals and constraints are stricter than what is of-
fered by existing solutions. For instance, while record and
replay (R&R) can faithfully replay the entire execution, we
are not aware of any R&R system that can achieve worst-case
overhead below 3%. In fact, most literature [30,33,35,57] em-
phasizes the average overhead; for production tracing tools
on Android devices, engineers are primarily concerned with
the worst-case instead of the average. In addition, R&R tech-
niques typically require deep integration with the Android
runtime which means that they cannot be easily maintained.
Another challenge offered by the Android runtime environ-
ment is the semantic gap between an application written in
a high-level language (Java) and its native execution, which
renders a rich set of system profiling tools such as gprof [27]
ineffective without the runtime’s support. When applied to
runtime workloads, these profilers only profile the runtime’s
execution instead of the applications running on top of it. For
example, applying gprof to a runtime workload only provides
the call graph of the runtime itself (including the interpreter,
GC, and JIT-compiled code), instead of the call graph of the
Java application.
Android [26] and other runtimes [7] can output symbol
information during execution so that system profiling tools
can be applied to profile language-level executions. This ap-
proach does not completely close the semantic gap for a few
reasons. First, each profiling tool must support using these
symbols; currently only the sampling-based perf [39] tool
supports using the symbols, and only for JIT-compiled code.
Android extended and integrated perf such that it can also pro-
file the interpreter’s execution at the language-level [26]. In
addition, perf expects every symbol to have a unique memory
address, which is not always true; for instance, the runtime
may update JIT-compiled code with application hot-patching
or recompilation based on new profiling information, thus
unloading old mapped code and reusing the page [26].
Yet, the runtime environment also presents a unique op-
portunity: trace points can be embedded and removed trans-
parently by the runtime without modifying the application’s
source. This opportunity remains under-exploited despite the
popularity of managed languages (the five most popular lan-
guages on GitHub in 2021 were runtime languages). To the
best of our knowledge, none of the existing language runtimes
offer detailed tracing tools that can be used continuously in
production. For example, the OpenJDK JVM provides a pow-
erful JVMTI debugging interface that can embed breakpoints
in applications. However, this means that execution has to be
deoptimized and run in the interpreter (rather than JIT com-
piled). Therefore, it is mostly suitable for use in development
environments. Many runtimes also provide sampling-based
profiling features that show “hot” code paths, but none provide
continuous method-level tracing suitable for production.
1.2 Contributions
This paper presents the design and implementation of Hubble
that satisfies the aforementioned goals. Hubble can capture
most method entry and exit points of any application’s threads,
just-in-time before a failure. We designed Hubble by combin-
ing several well-known techniques in a novel way that takes
advantage of the Android platform. Table 1 shows Hubble’s
major designs and the requirements they satisfy.
First, Android applications are typically downloaded as
bytecode and then either compiled or interpreted on the de-
vice; Hubble leverages this runtime environment to automat-
ically embed its tracing logic into the compiled binary or
interpreted logic. This enables efficient tracing, as the trac-
ing logic can be inlined into the application, avoiding more
expensive trampolines (i.e., jumps in control flow) that are
common in other tracing tools. In addition, this means that
Hubble is a purely black-box approach that does not depend
on the application’s source code.
In addition, Hubble writes trace points to an in-memory
ring buffer that is only flushed when a problem is detected.
This allows it to run continuously and capture information
just-in-time leading up to a failure. We designed a concise,
variable-length encoding such that most trace points occupy
eight bytes, so a small (32 MB) ring buffer is enough to capture
sufficient debugging information.
Third, Hubble’s performance-sensitive instrumentation
logic is written in assembly. This ensures that performance
is optimal even on a device’s low-power (little) cores, which
cannot perform out-of-order execution or have only small
instruction reordering buffers. In addition, this decouples Hubble
from the Android compiler’s compilation flow, so it avoids
having the compiler affect the correctness of the tracing logic,
and eases maintainability.
Finally, Hubble avoids using expensive synchronization
primitives [14] in two ways: threads write trace points to
thread-local buffers, avoiding inter-thread synchronization;
and, Hubble communicates with these threads by using a
purpose-built lock-free synchronization protocol.
The end result is a highly efficient method-tracing system
sufficient for debugging intermittent performance bugs. In
our microbenchmarks, each trace point costs less than one
nanosecond for nearly empty methods, and tracing overheads
are quickly amortized when methods perform meaningful
operations. Hubble’s tracing overhead is also unnoticeable in
Huawei’s continuous-integration performance testing infras-
tructure, which includes a variety of workloads and devices.
Hubble’s memory overhead is approximately 64 MB by de-
fault, accounting for two 32 MB ring buffers. As of 2021,
Huawei’s lower-end smartphones have at least 4 GB of RAM,
while higher-end ones can have up to 12 GB. Therefore, Hub-
ble’s memory overhead is less than 2%.
Hubble also strives to protect user privacy. Similar to ex-
isting error reporting systems such as WER [21], MacOS [2]
and Mozilla [34] crash reports, Hubble’s traces are only col-
lected with user consent. However, these other systems collect
a minidump of the memory image, whereas Hubble’s traces
are far less sensitive: they only consist of method names and
timestamps and do not contain any variable values.
Hubble has been integrated into Huawei’s core Android
OS codebase and deployed across a wide range of smartphone
and tablet product lines since August 2020. Older devices
may receive Hubble’s functionalities via an over-the-air OS
update. Since deployment, Hubble has significantly eased
the debugging of intermittent performance problems. In fact,
engineers were able to quickly resolve many performance
problems that remained unresolved for months.
This paper makes the following contributions:
• The design and implementation of Hubble, a highly efficient
method-tracing subsystem for Android that satisfies a set
of unique, practical constraints, some of which are rarely
mentioned in existing literature.
• Integration of Hubble’s traces with existing debugging
tools, like Perfetto [40], which can show call charts. This
significantly improves the traces’ utility, as developers
can cross-examine Hubble traces with other runtime data.
• Case studies on how Hubble diagnosed real-world perfor-
mance bugs that could not be resolved without it.
Hubble also has the following limitations. First, it can
only embed tracing logic into executions that go through
the Android compiler or interpreter (from bytecode); Hub-
ble cannot trace native libraries like those invoked through
the Java Native Interface (JNI). In addition, Hubble’s trace
buffer could pollute the CPU cache and slow down cache-
optimized workloads (e.g., loop tiling [8]). However, while
cache-optimization is commonplace in server workloads, it
is uncommon on smartphones, especially in the interactive
UI-thread. Nonetheless, we evaluate this effect in §8.
2 Related Work
Record and replay (R&R) tools [10,15,16,29,30,35,36,38,50,
57] work by recording a user’s input and all non-deterministic
events (e.g., scheduling), so that the execution can be faith-
fully replayed. R&R tools do not meet our requirements for
a few fundamental reasons. The first is overhead. Among
all R&R tools, Reverb [35] reported the best performance,
yet its overhead is still 5.5% on average (the worst-case is
not reported). It works only on JavaScript web applications,
where threads communicate using a message-passing inter-
face. When threads share memory, R&R incurs even higher
overhead. For instance, DoublePlay [57] reported a worst-case
overhead of 11% for network-bound workloads (Apache web-
server), 19% for disk-bound workloads (MySQL), and 278%
for CPU-bound workloads (SPLASH-2 ocean). To achieve
low overhead, some tools [33,38,45] do not record all non-
determinism which prevents accurate replay. Second, since
intermittent performance bugs may take days to occur, R&R
traces will grow untenably large. While checkpointing could
allow replay from a partial trace, the checkpointing operation
itself is expensive [50]. Compared to a call chart, an R&R
trace also imposes much larger privacy concerns. Finally,
R&R tools require deep integration with the Android runtime
and compiler. For instance, applying DoublePlay’s approach
to Android would require the runtime to run a parallel execu-
tion of the application, checkpoint and compare state between
the two processes, and so on. Hence, R&R tools would be
difficult to maintain within Android.
An attractive alternative is to use hardware support, like In-
tel PT or ARM ETM, to record branch-level traces [12,28,60].
These tools have a worst-case runtime overhead of 1–2%.
However, there are two challenges on ARM devices. First,
the semantic gap on Android’s runtime complicates the de-
coding of the branch-level trace, as it only provides the traces
of the runtime’s execution instead of the application. Second,
hardware support for tracing is restricted to development plat-
forms (most ARM processors on production Android devices
do not support the feature) [4].
Only a limited set of bytecode method tracing tools are
available on the Android platform. Android Studio’s CPU
Profiler can trace every method call, but its overhead is incred-
ibly high (a worst-case of 921× in our evaluation), because
instead of embedding the tracing logic into the compiled bi-
nary, it jumps into the Android runtime after every method
call. Internal tracing utilities within Android mostly leverage
Java Agent, JVMTI, or equivalent ART instrumentation in-
terfaces to perform method tracing. These mechanisms are
also expensive as they force applications to be interpreted
only. Aspect-oriented frameworks such as Tai Chi [53] and
Logan [55] are also available to intercept method calls at run-
time to execute arbitrary tracing code. However, they either
require modifications to the application’s source code or root
access. The fastest available method tracing utility that we are
aware of, Nanoscope [54], primarily targets method tracing
inside an x86 Android emulator, costing up to 10× higher
memory usage and performance overhead, so it is mostly
useful in an application development environment.
Some tools are able to perform in-application tracing with
low overhead in production. For instance, Firebase perfor-
mance monitoring [23] collects various metrics (e.g., startup
time) and allows developers to insert additional trace points.
AppInsight [41] instruments Windows Phone application bi-
naries to log whenever the runtime calls into and returns from
application methods. The instrumentation has sufficient de-
tail to allow a server to reconstruct how a user request was
processed across different application threads and what the
critical path is. These tools typically trace the entire run of an
application, but at a low enough granularity that the trace does
not grow untenably large. As a result, they are useful for ap-
plication developers to locate bottlenecks in their application;
but the coarseness of the trace may necessitate additional de-
bugging information to locate the exact root cause, especially
if the bug is in the underlying systems which are not traced.
Timecard [42] goes beyond tracing by using AppInsight’s
traces to adjust the server’s computation quality (in real-time)
to meet an end-to-end response deadline.
There are also a few high-performance logging solu-
tions like NanoLog [58] and Log20 [59] that can provide
nanosecond-level logging. Both write data to thread-local
ring buffers and NanoLog uses a specialized encoding to save
space. NanoLog uses only the existing log statements in the
application while Log20 can be used to determine where best
to place log statements based on profiling the application’s
usage pattern.¹ In any case, the generated trace is only as
detailed as the developers’ instrumentation.

¹In fact, the initial goal of this project was to integrate Log20 into
Huawei’s Android platforms.
Outside of the Android platform, there are many call pro-
filing tools like gprof [27], Fay [17], ftrace [51], perf [39],
DTrace [9], and SystemTap [52]. These tools support various
degrees of tracing from periodically sampling the call stack
to calling user-defined methods using dynamic instrumenta-
tion. However, to capture traces that are detailed enough to
diagnose intermittent bugs, these tools incur overhead that
prevents them from tracing continuously in production sys-
tems. These tools typically require calling a method in their
instrumentation, whereas Hubble directly inlines the tracing
code into each method.
There are a large number of tools designed to trace each
request in a distributed system. Examples include Project5 [1],
MagPie [5], X-Trace [20], Dapper [47], ÜberTrace [11], and
Pivot Tracing [31], as well as commercial tools like Data-
dog [13] and New Relic [44]. These tools typically embed
trace points in critical system or network events, such as RPCs,
and record an ID that is unique to each request.
3 Case Studies
We present two case studies to showcase how Hubble helped
in diagnosing real-world intermittent performance problems.
The first issue was within AppX, a third-party multipurpose
messaging, social media, and mobile payment application
with over a billion monthly active users. Occasionally, AppX
users experienced intermittent UI freezes (janks) of up to two
seconds. Engineers detected this problem by monitoring the
traces that Systrace continuously collects—namely, perfor-
mance alerts, sparse trace points, and metrics sampled at low
frequencies. Figure 1 (A) shows the available trace points ren-
dered as a method call chart in the Perfetto trace-visualization
tool. For the UI thread, this consists of only a few high-level
methods within the Android framework. The only conclusion
engineers can infer from this data is that the UI thread was
blocked for about two seconds during which it was supposed
to prepare the layout and content for rendering.
In contrast, the call chart based on Hubble’s trace, shown
in Figure 1 (B), accurately captures every method call in both
the application and the Android runtime. From the canonical
method names displayed in the chart, engineers were able
to quickly reconstruct the events that occurred before, dur-
ing, and after the UI jank. First, the user swiped back on the
device’s screen within AppX (1). Then, AppX initialized
a software keyboard to respond to the user’s action (2).
However, to display the keyboard, the scrollable chat compo-
nent must be resized (3), and this became the bottleneck.
Drilling down further, we can observe the series of method
calls responsible for generating the list of on-screen content
(4). Specifically, we can see that the UI thread is primarily
blocked by various long-running methods belonging to AppX.
Now with concrete evidence, our engineers concluded that
the root cause was within AppX, and initiated a meaningful
collaboration with AppX’s developers.
The second issue was a longstanding performance bug
B
Figure 1: Screenshot of method call charts in Perfetto for the UI thread, which performs all UI and Android framework operations. (A) Traces
generated by Systrace, (B) Traces with Hubble. Circled in red are 3rd-party application methods with long execution time. (A) includes all of
Systrace’s trace points recorded during this time period, whereas (B) is filtered to render only approximately 10% of all available methods.
within an internal business teleconferencing application. Af-
ter the end of a teleconference, the application occasionally
froze for up to a second on a small number of user devices.
This annoyed users but it was not until months later that a
particularly vocal employee reported the issue to manage-
ment, who then opened a support ticket requesting that the
issue be resolved. Our device support engineers attempted to
reproduce the problem on their own, but all attempts were
unsuccessful. The only method call captured by Systrace was
binder_transaction(), which does not explain why the issue
occurred. Further efforts to collaborate with the disgruntled
users were also ineffective as most users were either too busy
or otherwise unable to provide more detailed reproduction in-
structions. A few users were even invited to collaborate with
an engineer to reproduce the problem, but the intermittent
issue could not be reproduced after multiple attempts.
Several months later, Hubble, in pre-beta at the time, was
available for internal use. The disgruntled employees hap-
pily consented to deploying Hubble onto their mobile device
via an over-the-air Android OS update. Within a few days,
performance anomalies were detected and their associated
trace data was automatically collected. After a quick glance
at Hubble’s call chart, the support engineers identified that the
teleconferencing software was calling Thread.sleep() from
the UI thread after sending an Android Binder (IPC) system
service call. Closer inspection revealed that immediately after
a conference call ended, a series of method calls related to the
Audio Manager were performed, prior to the Thread.sleep().
This behavior was unexpected and if not for the complete
method call trace, which contained both the application and
Android framework layers, we would still be stuck with many
of our initial theories; e.g., the application could be collecting
and sending meeting summary data back to the teleconference
service or an unexplained scheduling issue.
With this new information, we brought in a developer with
expertise in the Android audio stack. After examining Hub-
ble’s call chart, the developer immediately identified the root
cause. The problem can only be reproduced under very spe-
cific conditions where users must be connected to Bluetooth
headsets using a special mode prior to ending the meeting.
After the meeting ended, the application immediately rerouted
audio to Bluetooth devices connected over the A2DP stream-
ing protocol. This rerouting process requires re-initialization
of Bluetooth’s SCO (synchronous connection-oriented) link
where the Thread.sleep() was invoked to wait for the link
to be established. We were unfamiliar with these details, but
with the help of a developer with the necessary domain knowl-
edge, the issue was promptly fixed by moving the connection
and rerouting logic into an asynchronous event handler.
4 Background and Overview
This section first discusses Hubble’s design goals and the
role it plays in the failure diagnosis process, which helps to
understand Hubble’s design. We then provide an overview of
Hubble, leaving the details to the subsequent sections.
4.1 Goals and Requirements
Hubble’s performance overhead and resource usage must be
undetectable in all real-world usage scenarios. In practice,
this translates to two requirements: Hubble’s worst-case over-
head in real-world scenarios, in terms of both latency and
memory usage, should be less than 3%. This target was set
by our quality assurance team since they cannot reliably mea-
sure overhead below 2–3% on mobile platforms, even under
ideal conditions. Nonetheless, this is similar to the target set
by other practitioners; Google, for example, reported a 2%
overhead budget to deploy tracing tools in production server
workloads [32,48]. The second requirement is that the over-
head budget should be respected regardless of whether Hubble
is tracing workloads on big or little cores. Besides not being
as fast as big cores, little cores also tend to lack advanced fea-
tures like out-of-order execution. Thus, they enforce stricter
restrictions on the tolerable overhead for Hubble. In any case,
satisfying the target overhead only allows a tool to pass the
deployment planning review. To be deployed in production,
the tool needs to go through a systematic procedure consisting
of three phases:
1. Internal testing. We simulate users using our devices
by sending a stream of pseudo-random inputs to a large
fleet of physical devices. Each device collects various met-
rics like application startup times, the number of dropped
frames, and so on. Each metric forms a statistical distribu-
tion over a large number of trials. We compare the distri-
bution before Hubble was added with the one after. If the
differences are statistically insignificant, Hubble has not
caused a noticeable change, and we move to phase 2.
2. Internal beta release. We push engineering builds to a
small group of internal beta testers on their daily-use de-
vices. We ask these users to report any performance re-
gressions they notice and any new performance issues will
need to be resolved before moving to the next phase.
3. Public beta release. A build is pushed to all beta users
(tens of thousands of users), and we monitor all new per-
formance anomaly reports. We only consider Hubble’s
overhead as undetectable when the beta build does not
show a statistically significant increase in reports. Only
then can the tool be further released to the entire public.
Android applications are typically distributed as bytecode
compiled from high-level languages like Java. Once down-
loaded, this bytecode is either ahead-of-time (AOT) com-
piled, or executed within the Android runtime (similar to the
Java Virtual Machine). The Android runtime compiles fre-
quently executed code Just-in-Time (JIT), using the same
AOT compiler. An already-compiled application could also
be re-compiled, if runtime profiling reveals new optimization
opportunities. Applications could also contain native libraries,
i.e., code that was already compiled into native instructions.
Thus, Hubble must be able to operate with only access to the
downloaded or generated bytecode or native instructions.
Easy maintainability across Android versions is required.
Android is typically updated every six to twelve months, with
each new release potentially breaking features or making
large-scale changes internally. Thus, Hubble should be modu-
larized and decoupled from the upstream source.
Finally, as the case studies highlight, Hubble needs to be
able to trace both the executions of the application and the An-
droid framework to be useful. Ideally, device vendors would
only be responsible for analyzing and debugging bugs within
Android, and application developers would only be respon-
sible for bugs within the application. The reality, however,
is that bugs in the Android framework may manifest them-
selves in the application and vice versa. Furthermore, system
traces are not available to application developers (in order to
maintain users’ security) and in-application traces may not
be available to or easily understandable by device vendors.
Exacerbating the issue, application developers and even in-
ternal developers at Huawei are reluctant to investigate bug
reports without clear evidence that the bug is in their code.
Whole-system method traces allow engineers to infer roughly
what the application and framework are doing, together, so
that the problem scope can be narrowed down to specific
call chains and system services. Essentially, Hubble needs to
bridge the gap between system and application developers,
which in turn, will significantly ease triaging and debugging
for both parties.
Overall, these requirements highlight the practical chal-
lenges of designing and deploying tracing tools onto a com-
plex user-device platform such as Android.
4.2 The Failure Diagnosis Process
To understand Hubble’s utility, we first need to overview the
failure diagnosis process. Android devices ship with a set
of anomaly detectors to detect common issues like lags in
the UI. When an anomaly detector fires, the system saves
several pieces of data such as logs, metrics, and traces. At an
appropriate time, these data will be uploaded to the device
vendor for analysis.
4.2.1 Anomaly Detection
Since Hubble’s utility depends on an anomaly detector firing,
we first provide background on the detectors available in
the Android Open Source Project (AOSP) and our version
of Android. There are two branches of anomaly detection
mechanisms that device vendors can use in the production
environment: Those implemented by Android itself and those
implemented by in-house engineering teams. Both branches
use information gathered either from the Android runtime
layer or from the Linux kernel. In addition, both branches are
generally tuned to be conservative to reduce the number of
false positives. However, if a severe performance issue occurs,
a signal will most likely be raised.
The anomaly detectors implemented in the AOSP have
been continuously developed for over a decade. For exam-
ple, the most frequently used anomaly detector is the UI jank
(lag) detector, which has an extremely close correlation to
user-observable performance issues. It will alert if a number
of consecutive display frames are delayed longer than a pre-
defined threshold. Android officially groups all its tracing,
profiling, and anomaly detectors under one umbrella term
known as systrace. In production environments, most of these
anomaly detection signals and alerts are continuously cap-
tured and analyzed in real time.
Internally, we utilize a number of additional black-box
anomaly detectors that monitor various kernel-level
indicators and hardware events. For example, we implemented
a system-level, HCI-based detector: Studies show users start
to perceive a delay after 400–600 ms. So by instrumenting the
runtime at the points where (1) a touch is detected by the screen,
(2) the signal is delivered to the application, and (3) the appli-
cation generates a response, we can accurately measure the
delay between (1) and (3) and fire an alert when the delay is
longer than 400ms. Furthermore, we can attribute the delay to
either signal delivery in the runtime or within the application.
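As a minimal illustration of this detector (a sketch under our own assumptions, not the production implementation, which correlates more signals and attributes the delay to the runtime or the application):

  #include <chrono>
  #include <cstdio>

  // Sketch of the touch-to-response latency check described above. The
  // timestamps for (1) and (3) are assumed to be captured by runtime
  // instrumentation and passed in as steady-clock time points.
  using Clock = std::chrono::steady_clock;
  constexpr auto kJankThreshold = std::chrono::milliseconds(400);

  void CheckTouchLatency(Clock::time_point touch_detected,      // (1)
                         Clock::time_point response_generated)  // (3)
  {
    const auto delay = response_generated - touch_detected;
    if (delay > kJankThreshold) {
      // In production this would raise an anomaly signal that eventually
      // triggers Hubble's trace persistence; here we only log the delay.
      const auto ms =
          std::chrono::duration_cast<std::chrono::milliseconds>(delay).count();
      std::fprintf(stderr, "UI latency anomaly: %lld ms\n",
                   static_cast<long long>(ms));
    }
  }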
Other black-box anomaly detectors could be as simple as
monitoring whether the device has entered the thermal throt-
tling mode. Most detectors, however, don’t rely on a single
metric. Instead, they correlate multiple metrics. For example,
if a detector detects that the current GPU memory bandwidth
utilization is high, it then checks other metrics such as the ren-
dering queue backlog length; only if multiple of them suggest
an anomaly does the detector fire a warning. Experimental
anomaly detectors may further leverage real-time machine
learning that monitors Android runtime metrics like the number
of locks held, memory allocation and garbage collection fre-
quency, and so on.
4.2.2 The Utility of Hubble
When Hubble’s traces are collected, they are integrated into
systrace and Perfetto and presented to engineers alongside other
runtime data. Perfetto and systrace are powerful debugging
tools that can visualize a variety of runtime data, including
visualizing the method trace as a call chart or flame graph.
The tools also have search and analytics (e.g., using SQL)
capabilities that allow developers to correlate data from dif-
ferent sources. For instance, developers can cross-examine
traces with logs and hardware metrics. Developers can also
alert based on traces. For example, one use case of Hubble
is to search for the call stack that matches a specific method
invocation order, get an average runtime, and alert when it
exceeds a threshold. As a result, Hubble is not a standalone
tool, nor the only debugging tool. Instead, developers usually
start debugging by first examining the data from existing logs
and metrics, and some bugs can be resolved with these alone.
However, the remaining bugs—typically hard-to-diagnose,
intermittent issues—require more insight, which is where
Hubble excels.²
Key to Hubble’s success is the visibility it provides into
application and framework-level behaviour, without which en-
gineers cannot triage issues. Hubble’s detailed method traces
also allow developers to better understand how a bug can
be reproduced; with a reproduction, developers can repeat-
edly reproduce the bug in a development environment (with
heavyweight tracing) until the issue is understood.
Nonetheless, there are some limitations to Hubble’s utility.
We have found Hubble’s traces are not as useful in the follow-
ing cases: (1) if the bug is in the system’s native code (which
is not traced), (2) if the method-level trace is not fine-grained
enough (e.g., an infinite loop without making any function
calls), or (3) if a bug is caused by incorrect data-flow (i.e.,
an incorrect variable value) that does not affect the call path
(otherwise it could be inferred by Hubble’s trace). However,
Hubble’s traces can still help developers to significantly nar-
row down the problem scope (e.g., they can locate the method
that contains the infinite loop). In theory, if the distance be-
tween the root cause and the symptom is too long, Hubble
could miss the cause due to the ring buffer size. However, we
have not yet encountered such a case in practice.
²We do not have an exact number of issues exclusively resolved by Hub-
ble, because Hubble’s traces are integrated into existing debugging tools
with other traces. However, we noticed the number of bug tickets containing
intermittent and difficult to reproduce bugs quickly dropped after Hubble
was first made available.
4.3 Overview of Hubble
Hubble modifies the compiler and interpreter to instrument
tracing logic at the entry and exit of every non-inlined byte-
code method, whether it is interpreted, ahead-of-time com-
piled, JIT compiled, or recompiled. Portions of the Android
framework itself and factory installed apps, i.e., the apps that
are packaged by the OEM vendor, could be already in com-
piled form instead of bytecode; for these cases, the trace points
are embedded at the vendor’s site. Hubble can also trace calls
made using the JNI (i.e., when an application calls into a native
library and when that call returns). However, function calls made within
native libraries cannot be traced by Hubble.
Hubble adds one system thread, the trace control thread,
to each application’s process that can turn tracing on or off
for any thread in the same process. Although Hubble instru-
ments all bytecode methods, by default, the control thread
only turns on tracing for the UI thread, which performs all
UI and Android framework operations. At every method en-
try and exit, Hubble’s tracing code writes an entry to a fixed
size in-memory ring buffer. When the buffer is full, the buffer
pointer will wrap around so the oldest data will be overwritten.
When a performance anomaly detector detects a perfor-
mance problem, the control thread will be notified. It then
notifies the UI thread to stop tracing, preventing useful de-
bugging data prior to the problem from being overwritten.
Once tracing has stopped, the control thread flushes the ring
buffer to disk, before restarting tracing. The saved trace file
could be sent back to Huawei to aid postmortem debugging,
or post-processed and analyzed on the device, off the critical
path, if a summary needs to be sent.
Each traced thread writes to a private ring buffer local to
itself. Hubble keeps at most N buffers in the system, from the
N threads that most recently executed in the foreground. Older
buffers will be reclaimed by the system. N is configurable and
the method trace logic can be programmatically enabled and
disabled for individual threads, either via the runtime or by
the user application itself. This means that any background
threads from almost any process, even short lived ones, can
be traced. However, if there are too many concurrent threads
being traced, Hubble will run into memory usage issues. To
solve this, we could have a ring buffer per core rather than per
thread; to differentiate trace points from different threads, we
could record the thread’s ID (available from a register in the
runtime) in each trace point. By default, N is set to 2. This
is sufficient to capture both the current foreground and most
recent background application’s UI threads.
5 In-memory Tracing
This section describes the design and implementation of Hub-
ble’s tracing logic. We first explain the information recorded
in each trace point and its encoding. We then discuss how we
integrated the tracing code into Android’s optimizing com-
piler so that compiler optimizations do not affect our instru-
mentation.

Figure 2: Format and encoding of trace points at method entry
and exit, in 64-bit and 32-bit execution modes. “ts” and “ptr”
are the timestamp (Generic Timer count) and method pointer. A
solid-bordered box represents a 64-bit slot. Underscores represent
lossy encodings of timestamps.
5.1 Data Format and Encoding
Figure 2 shows the format of each trace point. As shown,
method entry points have a varying encoding depending on
the CPU’s execution mode and other factors explained later.
The CPU will change mode when executing a 32-bit or 64-bit
application. Method entry trace points contain a timestamp
and a method pointer, while exit points contain a timestamp
and the constant 0x0.
For timestamps, Hubble uses the Generic Timer [3] count
instead of the standard system clock. A Generic Timer is
a high resolution clock (nanosecond precision) and its tick
value can be directly read from a register on modern ARM
SoCs. It ticks at a constant frequency regardless of the CPU
operation speed and the counter value starts at 0 when reset.
When the trace is persisted, Hubble records the current time,
which can be used to reconstruct the absolute timestamp of
each trace point from the Generic Timer count.
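As a standalone illustration (not Hubble’s actual instrumentation, which is emitted directly into compiled code as described in §5.1.3), the virtual counter and its fixed frequency can be read on AArch64 as follows:

  #include <cstdint>

  // Read the ARM Generic Timer's virtual counter (ticks since reset).
  static inline uint64_t ReadGenericTimer() {
    uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
  }

  // The counter's frequency in Hz; constant regardless of CPU clock speed.
  static inline uint64_t GenericTimerFrequencyHz() {
    uint64_t freq;
    asm volatile("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
  }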
The method pointer is the memory address of a metadata
object, ArtMethod, that describes each loaded class-method
and can be used to decode a method’s canonical name. As
part of the ClassLoader initialization process in Android’s run-
time (ART), an array of ArtMethods is allocated in a memory
region outside the managed heap (ignored by garbage collec-
tion). ArtMethods can only be added to this array and never
be modified nor removed. ART ensures that immediately af-
ter entering a method, the address of its ArtMethod is stored
in register r0. Since the lifecycle of the main ClassLoader,
which is responsible for loading all of the executed bytecode
methods, spans the entire duration of the application, we can
safely store the ArtMethod pointer in the trace buffer and
reconstruct the method name after the trace data is persisted,
so long as this happens before the application exits. Note that
applications could use additional custom ClassLoaders with
shorter lifecycles. If we persist the trace data after the custom
ClassLoader exits, we could dereference pointers that are no
longer valid. To avoid this, we install a cleanup hook for cus-
tom ClassLoaders to invalidate the trace buffer (or optionally
persist the trace data).
Figure 3: Iteratively Recover Truncated Timestamps.
For each thread that is traced, the control thread allocates
storage for the trace in the traced thread’s local storage (Java
ThreadLocal [37]). This includes a ring buffer and metadata
such as where content in the buffer begins and ends. The ring
buffer is carved into an array of 64-bit wide integers in both
64-bit and 32-bit mode.
For the timestamp in each trace point, Hubble only stores
the lower 32 bits of the Generic Timer counter, regardless of
execution mode. (Even in 32-bit mode, the Generic Timer
counter is 64 bits wide because the value is fetched from a
co-processor that is not subject to the mode change.) Thus,
the recorded timestamp may wrap around, which we handle
during decoding.
Figure 3 shows how Hubble reconstructs the accurate times-
tamp from truncated ones. The last timestamp is a reference
timestamp (tr), which is the complete 64-bit Generic Timer
counter value recorded when the trace is persisted. Using tr,
we can iteratively reconstruct the upper 32 bits of the previous
three timestamps: if a previous timestamp has a lower value
than the current one (e.g., t3 versus tr), we assume it has the
same upper 32 bits; if it has a higher value (e.g., t1 versus
t2), we assume a wrap around occurred and the upper 32 bits
should be decremented by one.
Theoretically, this could lead to an error: if between two
consecutive trace points more than 2³² ticks occur, the re-
constructed timestamp will be inaccurate. However, this is
unlikely to happen in reality. It takes 223.7 seconds on a Qual-
comm ARM SoC and a little over 37 minutes on a Huawei-
designed SoC for the lower 32-bit Generic Timer counter to
tick 2³² times. So only if a method executes for more than
223.7 seconds, without calling another method or returning,
will an inaccuracy occur.
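A minimal sketch of this reconstruction, as it might run in the server-side decoder (names and signatures are ours, not Hubble’s):

  #include <cstdint>
  #include <vector>

  // Given the truncated 32-bit timestamps (oldest to newest) and the 64-bit
  // reference counter value recorded when the trace was persisted, recover
  // full 64-bit timestamps by walking the trace backwards.
  std::vector<uint64_t> ReconstructTimestamps(
      const std::vector<uint32_t>& truncated, uint64_t reference) {
    std::vector<uint64_t> full(truncated.size());
    uint64_t upper = reference >> 32;                  // Upper bits of the newer timestamp.
    uint32_t newer_lower = static_cast<uint32_t>(reference);
    for (size_t i = truncated.size(); i-- > 0;) {
      const uint32_t lower = truncated[i];
      if (lower > newer_lower) {
        --upper;                                       // The counter wrapped in between.
      }
      full[i] = (upper << 32) | lower;
      newer_lower = lower;
    }
    return full;
  }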
5.1.1 Format under 64-bit Mode
Hubble uses a variable-width encoding for the ArtMethod
pointer when executing in 64-bit mode. In this mode, the
pointer is 64 bits; but for real-world applications, the vast
majority of the pointers’ upper 32 bits have the value 0x0.
We exploited this observation to increase encoding efficiency.
When the upper 32 bits are 0x0, Hubble only records the lower
32 bits of the pointer (Figure 2 (A)). Together with the lower
32 bits of the timer count, a method entry trace point occupies
a single 64-bit buffer slot. If the upper 32 bits of the method
pointer are not 0x0, a method entry trace point occupies two
buffer slots (Figure 2 (B)). The first 64-bit slot is used to save
the complete 64-bit method pointer; in the second slot, the
upper 32 bits store the timer count and the lower 32 bits store
the constant 0x1.
The method exit trace point occupies a single 64-bit slot.
The upper 32 bits store the timer count, and the lower 32-bit
stores 0x0, indicating it is a method exit trace point.
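In C-like form, the 64-bit-mode encoding amounts to the following sketch (illustrative only; Hubble emits this logic as inlined assembly, and the buffer pointer wraps around rather than growing unbounded):

  #include <cstdint>

  // Append a method-entry trace point in 64-bit mode.
  inline void RecordEntry(uint64_t*& slot, uint64_t art_method, uint32_t ts32) {
    if ((art_method >> 32) == 0) {
      // Common case (Figure 2 (A)): one slot; timestamp in the upper 32 bits,
      // truncated method pointer in the lower 32 bits.
      *slot++ = (uint64_t{ts32} << 32) | static_cast<uint32_t>(art_method);
    } else {
      // Rare case (Figure 2 (B)): full pointer in the first slot, then a slot
      // holding the timestamp and the marker constant 0x1.
      *slot++ = art_method;
      *slot++ = (uint64_t{ts32} << 32) | 0x1;
    }
  }

  // Append a method-exit trace point: timestamp above, marker 0x0 below.
  inline void RecordExit(uint64_t*& slot, uint32_t ts32) {
    *slot++ = uint64_t{ts32} << 32;
  }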
Traces in this format can always be unambiguously decoded
in reverse. To decode each trace point, Hubble first
checks the lower 32 bits of the previous slot. Depending on
whether its value is 0x0, 0x1, or another value, Hubble knows
that this trace point is either a method exit, a method entry that
is two slots wide (Figure 2 (B)), or a method entry that is one
slot wide (Figure 2 (A)). 0x0 and 0x1 cannot be method point-
ers since they are invalid method pointer memory addresses. A
method exit point is matched with the corresponding method
entry point in a LIFO manner (implemented using a stack).
Note that the decoding occurs server-side, after the persisted
trace has been sent back.
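A sketch of this reverse decoding (our own illustration of the format; the production decoder also matches entries with exits and resolves ArtMethod pointers to names using the persisted symbol table):

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // Decode a 64-bit-mode trace in reverse. `slots` holds the valid portion
  // of the ring buffer, oldest to newest; output is printed newest-first.
  void DecodeReverse(const std::vector<uint64_t>& slots) {
    for (size_t i = slots.size(); i-- > 0;) {
      const uint32_t ts = static_cast<uint32_t>(slots[i] >> 32);
      const uint32_t low = static_cast<uint32_t>(slots[i]);
      if (low == 0x0) {                                 // Method exit.
        std::printf("exit  ts=%u\n", ts);
      } else if (low == 0x1) {                          // Two-slot entry (Figure 2 (B)).
        const uint64_t art_method = slots[--i];         // Previous slot holds the full pointer.
        std::printf("entry ts=%u method=0x%llx\n", ts,
                    static_cast<unsigned long long>(art_method));
      } else {                                          // One-slot entry (Figure 2 (A)).
        std::printf("entry ts=%u method=0x%x\n", ts, low);
      }
    }
  }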
5.1.2 Format under 32-bit Mode
In 32-bit mode, both method entry and exit trace points use
a single buffer slot. The upper 32 bits are always the lower
32 bits of the timer count, like in 64-bit mode. For method
entry points, the lower 32 bits store the method pointer, and
for method exit points, the lower 32 bits store 0x0.
5.1.3 Efficient Recording
The tracing logic can be efficiently implemented by a few
assembly instructions. For example, Hubble uses only two
assembly instructions to store the method entry trace point
under 32-bit execution mode:
  MRRC(al, scratch1, scratch0, 0b0001, 0b1111, 0b1110);
  STRD(r0, scratch1, MemOperand(buffer, 8, PostIndex));
The first MRRC instruction is used to fetch the 64-bit Generic
Timer counter value into two 32-bit CPU registers: scratch1
and scratch0 (readers can ignore the other operands). Then
a STRD instruction is used to (1) store scratch1, which con-
tains the lower 32 bits of the Generic Timer counter, and r0,
which contains the ArtMethod pointer, to the memory address
stored in the buffer register, and (2) increment buffer by 8 bytes
after the memory operation completes. So after this store
instruction, buffer will point to the next buffer slot.
Hubble’s tracing assembly is directly inlined in the basic
block at each method entry and exit. Comparatively, in other
profilers that use compiler instrumentation, the instrumented
code will call a special tracing function. For example, gcc -pg
instruments a call to the special function mcount(), which is
required for tools like gprof. While easier to maintain and
more portable, the added function call introduces overhead.
When tracing is stopped, the valid portion of the ring buffer
is flushed to disk using an fwrite call. Three metadata files
are generated. First, a complete 64-bit Generic Timer counter
value (i.e., the reference timestamp) and the absolute system
timestamp are collected at the same time; this facilitates the
reconstruction of the actual, non-relative timestamp of each
trace point if needed. Then the current buffer position and size
are recorded. Finally, Hubble computes a symbol table, map-
ping each unique ArtMethod pointer value to the method’s
canonical name.
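A sketch of what gets persisted (the file names and layout here are illustrative assumptions, not Hubble’s actual on-disk format):

  #include <cstdint>
  #include <cstdio>
  #include <ctime>
  #include <map>
  #include <string>

  // Flush the valid buffer region plus the metadata described above: a
  // reference timestamp pair, the buffer position/size, and a symbol table
  // mapping ArtMethod pointers to canonical method names.
  void PersistTrace(const uint64_t* buffer, size_t valid_slots,
                    uint64_t reference_ticks, std::time_t wall_clock,
                    const std::map<uint64_t, std::string>& symbols) {
    if (FILE* f = std::fopen("trace.bin", "wb")) {
      std::fwrite(buffer, sizeof(uint64_t), valid_slots, f);
      std::fclose(f);
    }
    if (FILE* f = std::fopen("trace.meta", "w")) {
      std::fprintf(f, "reference_ticks=%llu\nwall_clock=%lld\nslots=%zu\n",
                   static_cast<unsigned long long>(reference_ticks),
                   static_cast<long long>(wall_clock), valid_slots);
      std::fclose(f);
    }
    if (FILE* f = std::fopen("trace.symbols", "w")) {
      for (const auto& entry : symbols)
        std::fprintf(f, "0x%llx %s\n",
                     static_cast<unsigned long long>(entry.first),
                     entry.second.c_str());
      std::fclose(f);
    }
  }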
5.1.4 Alignment
Each trace point is always eight-byte (a word on 64-bit de-
vices) aligned. Eight-byte aligned memory accesses are cru-
cial to achieving the highest performance in both 32-bit and
64-bit mode on modern ARM SoCs. Unaligned accesses take
at least one more cycle than a properly aligned memory ac-
cess. In the worst case, a single unaligned access can cross a
cache-line boundary and generate two cache misses or even
two consecutive page faults. Worse yet, unaligned memory
accesses are an unsupported operation on low-power or older
ARM processors, so additional memory accesses and mas-
saging logic are required. Accordingly, we use 32 bits to
represent the constants 0x0 and 0x1, since the performance
gains of aligned accesses outweigh encoding inefficiency.
5.2 Hand-optimized Assembly
There are a few reasons to write the tracing logic in assem-
bly. First, it decouples Hubble from the Android compiler’s
compilation flow. If written in C++, the compiler could move,
reorder, or even remove the tracing logic (e.g., the tracing
logic accesses global variables without a memory barrier (§6),
which is an undefined behavior). By writing the logic in as-
sembly, we can insert it after the compilation stage, bypassing
any optimizations that are at odds with the tracing. To do
so, early in the compilation stage, instead of generating the
actual tracing code, we simply insert a special placeholder
instruction at every method entry and exit (including exits
due to exceptions); we then configure the Android compiler
to exempt this instruction from its later optimization stages.
After all the optimizations are performed, we replace this
placeholder instruction with the actual tracing instructions.
This also makes Hubble easy to maintain, as it is decoupled
from any compiler changes that are not backward compatible.
Using assembly also allows us to optimize for both big and
little cores. The Android compiler’s optimization is heavily bi-
ased toward the big core. For example, the compiler skips the
architecture-specific optimizations when they are unnecessary
on big cores that support out-of-order execution. However,
the little cores do not support out-of-order execution, so run-
ning the compiled code will result in poor performance. For
instance, each trace point needs to check if we are at the end
of the ring buffer (and if so, we need to wrap around). This
check requires fetching the value of the ring buffer pointer
from memory. If we manually prefetch this pointer (in assem-
bly), it results in an approximate speedup of 35% on the little
core. The compiler, however, did not perform this prefetching,
because it expects the big core will perform the prefetching
automatically.

Initialization (both threads): start = 0x0; stop = 0x0;

Control thread:
1 start = buffer;
2 wait(signal);
3 stop = 0x1;
4 while (stop == 0x1)
5   sleep(..);
6 persist(..);

Trace point (in the traced thread):
7  if (start)
8    trace...
9    if (stop) {
10     stop = buffer;
11     start = 0x0;
12   }

Figure 4: Lock-free Synchronization Protocol. (The figure’s
timeline of events T1–T7 is described in the text.)
Finally, because we have domain knowledge of the tracing
logic and processor microarchitecture, we can perform better
optimizations than the compiler, regardless of whether it is
on the big or little core.
6 Tracing Control
Recall a system thread is responsible for notifying the traced
thread to turn tracing on or off. The traced thread (e.g., the UI
thread) is only responsible for (1) checking whether tracing
is turned on, and if so, (2) writing the trace points into the
trace buffer, and (3) turning tracing off if necessary. The rest
of this section describes how the two threads communicate
efficiently without synchronization primitives.
Figure 4 shows the communication between the control
thread and the traced thread. Lines 1–6 are the control thread’s
logic, whereas lines 7–12 are executed at every trace point
in the traced thread. Hubble uses two eventually-consistent,
shared variables, start and stop. start is unidirectional, i.e.,
it is set by the control thread and read by the traced thread,
and stop is bidirectional, as it can be set and read by both
threads. Initially, both variables are set to 0x0. To start tracing,
the control thread sets start to the address of the next buffer
slot (line 1 in Figure 4), and waits for a signal to stop tracing.
Therefore, start serves two purposes: it indicates whether
tracing is on or off, and its non-zero value is the buffer position.
At each trace point, the traced thread first checks if start is
0x0, and only proceeds
with tracing if it is not (line 7).
To turn tracing off, the control thread sets stop to 0x1
(line 3 in Figure 4), and then enters a polling loop until
stop is changed to a value greater than 0x1 (lines 4–5). In the
meantime, the traced thread performs tracing and evaluates
the value of stop at the end of every trace point (line 9). Once
the traced thread detects that stop was changed to a non-zero
value, it enters the logic to stop tracing. The traced thread first
sets stop to the address of the current buffer pointer, i.e., the
end position of the buffer, at line 10. So stop also serves dual
purposes: it indicates whether tracing should stop (with value
0 or 1), and it holds the buffer end position. (Note that 0x0 and
0x1 are invalid buffer memory addresses, so after line 10, stop
will be greater than 0x1.) Then the traced thread sets start
to 0x0 at line 11,
to guarantee that tracing will be disabled immediately. Finally,
the control thread detects that the traced thread has stopped
tracing, so it can persist the trace or clean up the ring buffer.
Figure 4 also shows an example trace control-flow. Each
circle represents a trace point, with filled and blank shading
indicating whether trace data is written or not. At the begin-
ning, tracing is off. At T1, the control thread turns tracing
on at line 1 (L1) by setting start to a non-zero value. This new value
is propagated to the traced thread at time T2, as a result of eventual
consistency in the memory cache-coherence protocol. The following three
trace points are then written to the buffer. At T3, the control thread
turns tracing off by setting stop to 0x1, which is propagated to the
traced thread at T4. The traced thread then executes lines 10-11, and
at T5, the control thread detects that stop has changed to a value
greater than 0x1; so it breaks out of the polling loop and persists the
trace. After the trace is persisted, the control thread restarts
tracing at time T6 (line 1).
This design is highly efficient. Each trace point needs to check the
values of start and stop only if the trace has been started. start and
stop are regular shared variables that are almost always cached. In
comparison, any alternative design that uses synchronization primitives
or atomic variables would introduce much higher overhead in each trace
point, which is on the critical path.
Since tracing is stopped and the current ring buffer location is
written to the stop variable by the traced thread itself, no additional
trace point will be written to the buffer afterwards and the buffer
metadata will be consistent. For example, if the last trace point is a
64-bit method entry occupying two slots, it is guaranteed that both
slots are written, with the buffer pointer correctly incremented,
before tracing is stopped.
If the traced thread is executing native code, either through
the JNI or a custom ClassLoader, it cannot respond to the
control thread’s stop tracing request, because the logic to stop
tracing is only instrumented in bytecode methods. Therefore,
the control thread further checks whether the traced thread
is in native execution when it attempts to stop tracing. If so,
the control thread will first obtain ART’s state transition lock
that prevents the traced thread’s execution from changing
state, i.e., from native execution back to the bytecode world
(either the interpreter or compiled code). Then the control
thread forcibly copies the buffer position to stop and sets start to
0x0, followed by a memory fence. Finally, the control
thread can release the state transition lock. A subtle data
race could occur during state transition where just before
the lock is obtained, the traced thread transitions back to the
bytecode world. Debugging this unfortunately took weeks,
but we fixed it by rechecking the traced thread’s execution
state after obtaining the lock.
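A minimal sketch of this forced-stop path follows. The lock, the
thread-state flag, and the TracedThread type are hypothetical
placeholders rather than ART's real interfaces; only the order of
operations (recheck the state under the lock, publish the end position
in stop, clear start, then fence) follows the description above. The
start and stop variables are the ones from the previous sketch.

#include <atomic>
#include <cstdint>
#include <mutex>

// Hypothetical stand-ins for ART's per-thread bookkeeping; not ART's API.
struct TracedThread {
  std::mutex state_transition_lock;   // placeholder for ART's state transition lock
  bool in_native;                     // placeholder for the thread-state check
  std::uint64_t* cursor;              // the thread's current ring-buffer position
};

extern std::atomic<std::uintptr_t> start;   // shared control variables (Figure 4)
extern std::atomic<std::uintptr_t> stop;

void ForceStopTracing(TracedThread& t) {
  std::lock_guard<std::mutex> guard(t.state_transition_lock);
  if (!t.in_native) {
    // Raced with a transition back to bytecode: the traced thread will see
    // stop == 0x1 at its next trace point and run lines 10-11 itself.
    return;
  }
  stop.store(reinterpret_cast<std::uintptr_t>(t.cursor),   // publish the end position
             std::memory_order_relaxed);
  start.store(0x0, std::memory_order_relaxed);             // disable tracing
  std::atomic_thread_fence(std::memory_order_seq_cst);     // memory fence
}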
7 Privacy and Security
Security and privacy are some of our top priorities. Hubble
does not collect personally identifying information, such as
phone numbers or user IDs. Hubble's traces contain only method names
and timestamps; there are no actual data values, not even parameter
values. Widely adopted error-reporting systems such as Windows Error
Reporting (WER) [21], macOS's crash reports [2], or the Mozilla Crash
Reporter [34] record a subset of the memory state or often collect
system logs. In
comparison, Hubble’s traces are far less sensitive. Similar
to WER and other widely-adopted error reporting systems,
Hubble uses an informed consent policy.
Even when user consent is given, Hubble further strives
to minimize the amount of data that leaves the device. Hub-
ble has the capability to perform the same analyses that are
performed server-side, locally on a user’s device, with only a
summary being sent back to the vendor. For example, Hubble
can quickly scan the trace files and compute the top methods
with the longest “self-execution-time”, or it can automatically
isolate and extract the longest method call chains from when
a performance anomaly occurred. Performance bug models
could be distributed to client devices, containing “signatures”
of problematic method names or method call chains, and if
there is a match, statistics could be sent back instead of the
complete trace.
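As an illustration of the kind of on-device analysis described above,
the sketch below computes per-method self-execution-time from a decoded
sequence of entry/exit events. The record layout and types are
hypothetical and do not reflect Hubble's actual trace format.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct TraceEvent {
  bool is_entry;           // method entry (true) or exit (false)
  std::uint32_t method_id; // decoded method identifier
  std::uint64_t ts_ns;     // timestamp in nanoseconds
};

// Attribute the time between consecutive events to the method on top of
// the call stack; the result maps each method to its accumulated self time.
std::unordered_map<std::uint32_t, std::uint64_t>
SelfTimes(const std::vector<TraceEvent>& events) {
  struct Frame { std::uint32_t method_id; std::uint64_t last_ts; };
  std::unordered_map<std::uint32_t, std::uint64_t> self_ns;
  std::vector<Frame> stack;
  for (const TraceEvent& e : events) {
    if (!e.is_entry && stack.empty()) continue;  // unmatched exit at the start of a ring-buffer trace
    if (!stack.empty())
      self_ns[stack.back().method_id] += e.ts_ns - stack.back().last_ts;
    if (e.is_entry) {
      stack.push_back({e.method_id, e.ts_ns});
    } else {
      stack.pop_back();                                    // exit of the current method
      if (!stack.empty()) stack.back().last_ts = e.ts_ns;  // resume timing the caller
    }
  }
  return self_ns;
}

Sorting the resulting map by accumulated time yields the top methods by
self-execution-time, a summary small enough to upload in place of the
full trace.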
Hubble also exploits many built-in data security features in
Android and the Linux kernel to protect trace data. The traces
are stored inside an application-private storage area that is
protected by the kernel-level application sandbox. Only the application
itself (identified by its UID), device vendors, and application
developers (when they configure their mobile device in debug mode) have
access to the trace files.
8 Evaluation
Hubble has been repeatedly tested on Huawei’s performance
testing framework, which included the top 100 popular ap-
plications, with workloads including startup, stress testing
(simulated random screen touches at a high rate), and normal
usage simulations, on all supported devices. Overall, we have
found Hubble’s overhead is statistically insignificant in real-
world use-cases. Hubble tracing is now enabled by default in
all Huawei testing frameworks.
We have designed a few experiments to stress test and
study Hubble’s runtime characteristics, aiming to answer four
questions: (1) What is the runtime cost of Hubble’s tracing?
(2) What is Hubble’s effect on cache behavior and memory
bandwidth? (3) What is Hubble’s overhead in the most de-
manding real-world scenarios? (4) How long of an execution
trace can be stored in the ring buffer? We do not report power
consumption: despite best-effort attempts, we could not reliably
observe any battery overhead in our experiments.
Huawei’s devices are shipped with aggressively tuned power-
saving profiles and thus far, we have not observed an increase
in reports of battery drain.
Unless otherwise specified, experiments were performed
on a Google Pixel 1 phone that is well-supported by the
open-source version of Android (AOSP). The phone con-
tains a Qualcomm Snapdragon 821 processor with two high-
performance cores each with a 64 KB L1 (divided equally
for instructions and data) and 1.5 MB L2 cache, and two
low-power cores each with a 64 KB L1 and 512 KB L2 cache.
We compared three execution modes: (1) baseline – the
phone running unmodified Android; (2) tracing off – Hubble
is enabled and applications are instrumented, but tracing is
turned off; and (3) tracing on. Baseline experiments were
performed on AOSP’s android-10.0.0_r2 [56] branch. We
recompiled the same branch with Hubble enabled.
Hubble’s overhead could only be measured reliably in CPU-
intensive and unrealistic microbenchmarks. Repeatedly run-
ning the two microbenchmarks in §8.1 and §8.2 causes the
CPU to quickly reduce its clock speed due to severe thermal
throttling. To improve the validity and reproducibility of the
experiments, we placed the phone on bags of ice water.
8.1 Trace Point Overhead
Hubble’s tracing overhead is amortized by the amount of work
performed by the traced method. Since Hubble’s tracing logic
does not impose any dependencies on the traced method, nor
does it use synchronization primitives on the critical path,
the amortization effect is further amplified by deep CPU pipelines.
We evaluated both the cost of an individual trace
point as well as the overall runtime overhead as the method
performs more work. For comparison, we also evaluated An-
droid’s built-in method tracing utility, typically invoked via
Android Studio’s CPU profiler, henceforth referred to ASMT.
Listing 1 shows the method used. The amount of work done can be
controlled through the work parameter. To prevent the method from being
inlined by the JIT compiler, we added tail recursion on line 5. In
addition, we executed the method with a depth of 10, since the compiler
still performs inlining at lower depths. sum is carried across calls to
ensure that the loop is not optimized away by dead-code elimination.
We ran the method with work values of 0, 1, 10, 100, and
1,000. We measured the runtime of two billion iterations. The
cost of a trace point is calculated as the overhead of the 0-work
experiment divided by two, since each method call contains
a method-entry and method-exit trace point. To ensure the
method is compiled by the JIT compiler before evaluation, we
ran the experiment until its runtime stabilized to a maximum variance
of five percent. The method is then executed ten times for each
experiment.

                            Average      Standard         Performance
                            Cost (ns)    Deviation (ns)   Overhead (%)
ASMT              32-bit    3,911.575    59.2450          920,587%
Tracing ON        64-bit    3,366.050    57.8026          748,510%
Hubble Method     32-bit        0.725     0.0551              171%
Tracing ON        64-bit        0.650     0.0023              145%
Hubble Method     32-bit        0.001     0.0030                0%
Tracing OFF       64-bit        0.008     0.0027                2%

Table 2: Cost of a Single Trace Point

[Figure 5: Performance Overhead Over Work Iterations. Performance
overhead (%) versus the number of work iterations (0, 1, 10, 100, 1000),
comparing Android Studio Method Tracing (32-bit and 64-bit, trace on)
with Hubble Method Tracing (32-bit and 64-bit, trace on and trace off).]
1 public long Test(int depth, int work, long sum) {
2     for (int i = 0; i < work; i++) {
3         sum *= i - 1; sum /= i + 2;
4     }
5     if (depth > 1) return Test(depth - 1, work, sum);
6     return sum;
7 }
Listing 1: Program used for measurement.
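Spelling the calculation out (our restatement of the methodology above,
with \Delta T denoting the added runtime of the 0-work configuration
relative to its reference, and C the total number of instrumented
method calls executed during the run):

    c_{\text{trace point}} = \frac{\Delta T_{0\text{-work}}}{2\,C}

The factor of two accounts for the entry and the exit trace point
emitted by each method call.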
Table 2 shows the results of the 0-work experiment, with
the other work values in Figure 5. The 0-work experiment
shows that on average, each Hubble trace point costs less
than one nanosecond when tracing is on, and less than 10
picoseconds when tracing is off. This is far less than ASMT's
overhead, which is on the order of microseconds. Figure 5
shows the amortization effect: as the amount of work done by
the method is increased, Hubble’s tracing overhead percentage
decreases quickly. Note that in reality, small methods like this
would likely be inlined, excluding them from being traced.
8.2 Cache Effects Microbenchmark
We used matrix-multiplication (MM) to measure Hubble’s
effects on the cache. MM is a classic workload that can either
benefit heavily from caching or suffer ample cache misses [8].
When multiplying large matrices, a naïve implementation
causes many unnecessary cache misses. However, the majority of these
cache misses can be avoided using loop tiling, i.e., partitioning each
matrix into many small tiles, each of which fits in the cache, and
performing all accesses on one tile before moving on to the next. We
examined Hubble's effect on each level of the cache by gradually
increasing the tile size.

[Figure 6: Cache and Memory Effects. Time per element (μs) versus loop
tiling size (16×16 up to 2048×2048), showing the mean for the baseline,
trace-off, and trace-on configurations.]
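For concreteness, the sketch below shows the loop-tiling transformation
in C++. It is illustrative only: the experiment itself uses Java and
additionally wraps each multiply-and-add step in a small method so that
two trace points are emitted per step.

#include <cstddef>
#include <vector>

// Multiply two n x n matrices in T x T tiles so that each tile is reused
// while it is hot in the cache. The output c must be zero-initialized.
void TiledMultiply(const std::vector<int>& a, const std::vector<int>& b,
                   std::vector<int>& c, std::size_t n, std::size_t T) {
  for (std::size_t ii = 0; ii < n; ii += T)
    for (std::size_t kk = 0; kk < n; kk += T)
      for (std::size_t jj = 0; jj < n; jj += T)
        for (std::size_t i = ii; i < ii + T && i < n; ++i)
          for (std::size_t k = kk; k < kk + T && k < n; ++k) {
            int aik = a[i * n + k];
            for (std::size_t j = jj; j < jj + T && j < n; ++j)
              c[i * n + j] += aik * b[k * n + j];
          }
}

With n = 2048 and four-byte elements, varying T from 16 to 2048
reproduces the range of configurations evaluated next.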
We evaluated Hubble’s effect on MM with eight differ-
ent tile sizes: 16
×
16, .., 2048
×
2048. The input matrices are
2048
×
2048, and each element is a four-byte integer. This
means tile sizes 64
×
64 and below fit within the L1 cache; tile
sizes 256
×
256 and below fit within L2; and all remaining tile
sizes exceed both cache levels. To evaluate the highest amount
of interleaved memory-contention that Hubble may have with
MM, we performed each multiply and add operation inside
a method such that two trace points are produced for each
step of MM. We also inserted a dummy tail recursion call so
that the JIT compiler does not inline the method. For each tile
size, we ran the experiment five times. We did not compare
with ASMT because it was too slow.
Figure 6 shows the results. With Hubble’s tracing turned off,
we could not reliably observe any overhead. With Hubble’s
tracing turned on, for the smallest tile size that fits within the
L1 cache, Hubble has a min / max / mean overhead of 41% /
70% / 54%. When the tile size still fits within the L2 cache
at 128×128, the overhead increased slightly to a min / max
/ mean of 64% / 83% / 70%. Finally, when the tile size is
much larger than the L2 cache, caching is no longer effective.
In this region, the increased execution time when tracing is
turned on did not deviate significantly from smaller tile sizes,
but the amortized overhead decreased.
Thus, in the absolute worst case scenarios, Hubble indeed
affects programs heavily optimized for caching and, to some
extent, memory-bound programs. However, in practice, simi-
lar small methods invoked in a tight loop would be inlined and
excluded from tracing, not to mention that such loop-tiling is
unlikely to be used in an application’s UI thread.
8.3 Startup Overhead Macrobenchmark
We measured Hubble’s overhead on application startup, one of
the most demanding but realistic workloads for a method trac-
ing tool since it comprises hundreds of thousands of method
calls in a short period of time. These methods perform data loading and
processing to prepare the application's UI and are often optimized to
ensure the application loads quickly [22].

[Figure 7: Application Startup Time. Left: real-world baseline startup
times (no tracing), in ms. Right: box-and-whisker startup times with
Hubble tracing off and on for each of the three applications, together
with the relative tracing overhead (about -4% to 8%).]
Since the performance of the application startup process
varies significantly in practice, we took additional measures
to minimize variation across benchmark runs. Specifically,
we ran all experiments while disconnected from the network,
eliminating variance introduced by network connections. We
launched the target application repeatedly until its startup
time stabilized to within a maximum variance of 5% (without
these measures, the normal variance can be as much as 100%
as shown on the left hand side of Figure 7). Each applica-
tion was launched programmatically, avoiding any extraneous
touch input that would occur with manual interactions. The
startup time was obtained from a syslog message that indi-
cates the duration from when the application process launched
to the time after the application’s UI has been drawn on the
screen. To force cold starts (where the application starts com-
pletely unloaded), we manually killed each application before
starting it again. Furthermore, we performed tests in quick
succession to encourage the scheduler to place the application
process on the performance-oriented CPU core operating at
the maximum clock speed.
We ran the benchmark on the three applications that had the
most downloads in 2020 [6]: TikTok, WhatsApp, and Face-
book. The results are presented as a box-and-whisker chart
on the right-hand side of Figure 7. As the figure shows, the
measured startup times vary considerably. To determine if
Hubble causes a statistically significant difference in appli-
cation startup time in our tightly controlled test environment,
we performed two single-tailed dependent (paired sample)
t-tests with a significance level of 5%. The t-test on the results
of tracing turned off produced a p-value of 14.25% and the
t-test of tracing turned on produced a p-value of 33.18%, both
of which exceed the 5% threshold. Thus, we cannot conclude
that Hubble causes a statistically significant difference in
application startup time. In contrast, ASMT increased the av-
erage startup time of the three applications by approximately
10 times.
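For reference, the paired t statistic used in these tests is the
standard one over the per-pair startup-time differences d_i (nothing
here is specific to Hubble):

    t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
    \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
    s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2},

and the one-tailed p-value is taken from Student's t distribution with
n - 1 degrees of freedom.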
Although application startup overhead fluctuates signifi-
cantly under real world scenarios, the number of methods
executed remains nearly constant. When disconnected from
the network, TikTok, WhatsApp, and Facebook filled 6.0 MB, 3.8 MB, and
6.4 MB of Hubble's ring buffer, respectively; this corresponds to
roughly 400,000, 250,000, and 420,000 method invocations. When
connected to the internet, the ring buffer
content increased to 14 MB, 5.1 MB, and 11 MB because
the applications loaded the user's content. In all three applications,
the 32 MB ring buffer proved more than sufficient to capture the entire
application startup sequence. In Huawei's Hubble deployment, the 32 MB
trace buffer is able to cover the full duration of almost all
application startups and the intermittent performance anomalies that
our support engineers have encountered.
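As a rough consistency check on these numbers (our arithmetic from the
figures above, not an additional measurement):

    \frac{6.0\ \text{MB}}{400{,}000\ \text{invocations}} \approx 16\ \text{B per invocation
    (one entry and one exit record)}, \qquad
    \frac{32\ \text{MB}}{16\ \text{B}} \approx 2 \times 10^{6}\ \text{invocations},

so the buffer has room for roughly two million method invocations.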
The results of the macrobenchmark were also in-line with
results from our automated performance-regression testing,
as well as feedback from support engineers and application
developers. Recall that in part one of Huawei's three-phase deployment
process (§4.1), we ran automated tests across a large fleet of devices,
and any significant statistical deviation in the results would prevent
a new build from being deployed.
In the automated performance-regression tests, we measured
the application startup-time (both cold and warm startup) of
the 100 most-downloaded third-party applications in addition
to all our own applications. We categorized startup times into
increments of 500 ms and counted the number of applications
that fell into each increment. After Hubble's deployment, we
have not recorded any statistically-significant changes in the
number of applications in each bucket for both cold and warm
startup times.
The choice of 500 ms may seem high; however, Farrer et al.
showed that users do not feel any loss of control (i.e., that an
application is not responding to their action) until the response
times reach approximately 350 ms [19], and users feel like
they have completely lost control when response times exceed
approximately 750 ms. Thus, our QA teams (and others [43])
have found that 500 ms increments are a good categorization
to qualitatively evaluate loading speed—response times below
500 ms are considered excellent, 500–1000 ms is considered
good, and above one second is considered slow.
9 Experiences
Hubble was shipped in the production branch of Huawei’s An-
droid system in August 2020. An early prototype was merged
into the main development branch in 2019, and engineers have
been using it since. Huawei also runs a beta program where
users can receive new features before public release. There
are currently tens of thousands of beta users, and Hubble is en-
abled on their daily-use devices. For other end users, Hubble
can only be enabled with their express consent.
The trace collection frequencies and retention policies vary
depending on the type of users, the level of consent granted,
operating region and local regulations, and device model. In-
ternal beta users may not have any data upload restrictions.
However, there are often additional restrictions on public
users (including those beta users that are outside of Huawei).
A common policy is that each user device can upload at most
three traces per week. Which three traces to upload is config-
urable. For instance, sometimes there is a targeted campaign
to improve specific applications, so in that case, only traces of
anomalies for those applications are uploaded; other times we
collect traces for anomalies whose symptoms are extremely
severe; or, in the default case, we collect the first three anoma-
lies detected. Although three traces is a low threshold, with
a large user base, we are usually able to collect one or a few
traces for each important issue.
Besides debugging production issues, Hubble is equally
useful for debugging problems discovered during automated
testing. Before Hubble, developers used ASMT to debug per-
formance regressions, but due to its overhead it could only
be enabled when debugging. This is cumbersome, and many
problems simply could not be reproduced while debugging or,
worse, new issues would appear with ASMT enabled. Now,
whenever a performance regression is detected, Hubble’s
traces are automatically collected, helping developers quickly
narrow down the root cause without reproducing the issue.
A happy accident of implementing the tracing in assembly
was that we discovered a bug in ARM’s reference design on
an older CPU model. While optimizing and testing the tracing
assembly on a large number of devices, we found that when a
specific permutation of 32-bit assembly instructions is used
together with the Generic Timer counter, a segmentation fault
could occur on the out-of-order performance cores. The bug
was confirmed by the chip design team and fixed in later CPU
models. On the buggy CPU model, we work around the issue
by using an ISB instruction to flush the CPU pipeline after fetching
the Generic Timer counter.
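A sketch of the workaround, shown here as AArch64 inline assembly for a
GCC/Clang-style toolchain, is below. The affected production code is
hand-written 32-bit assembly, so the exact instruction sequence
differs; this is only illustrative.

#include <cstdint>

// Read the Generic Timer's virtual count, then issue an ISB so the
// pipeline is flushed after the counter read.
static inline std::uint64_t ReadGenericTimerWithBarrier() {
  std::uint64_t ticks;
  asm volatile("mrs %0, cntvct_el0\n\t"
               "isb"
               : "=r"(ticks)
               :
               : "memory");
  return ticks;
}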
10 Concluding Remarks
Call profilers are known to be useful in debugging; however, their use
has been limited to the development environment as a result of their
overhead. Hubble shows that by leveraging Android's on-device
compilation process and a just-in-time flushing strategy, together with
careful system-level design and engineering, we can build a highly
efficient tool that collects fine-grained call traces even in
production environments. Hubble has proved its usefulness by
significantly easing engineers' postmortem debugging processes.
Acknowledgements
We thank our shepherd Jonathan Mace and the anonymous
reviewers for their insightful comments. Adrian Chiu pro-
vided help for us to understand the internals of a language
runtime and its JIT compiler. This research was supported by
a contract between Huawei and University of Toronto.
References
[1]
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener,
Patrick Reynolds, and Athicha Muthitacharoen. Per-
formance Debugging for Distributed Systems of Black
Boxes. In Proceedings of the 19th Symposium on Oper-
ating Systems Principles, SOSP ’03, pages 74–89. ACM,
October 2003.
[2]
Apple Inc. Diagnosing Issues Using Crash Reports and
Device Logs, May 2021.
https://developer.apple.
com/documentation/xcode/diagnosing-issues-
using-crash-reports-and-device-logs.
[3]
Arm Limited. AArch64 Programmer’s
Guides: Generic Timer, August 2019.
https://documentation-service.arm.com/static/600eb3264ccc190e5e68023a.
[4]
Arm Limited. Arm® Architecture Reference Manual, July 2021.
https://documentation-service.arm.com/static/611fa684674a052ae36c7c91.
[5]
Paul Barham, Austin Donnelly, Rebecca Isaacs, and
Richard Mortier. Using Magpie for Request Extrac-
tion and Workload Modelling. In Proceedings of the
6th Symposium on Operating Systems Design and Im-
plementation, OSDI ’04, pages 259–272. USENIX As-
sociation, December 2004.
[6]
Adam Blacker. Worldwide & US Download Leaders
2020. January 2021.
https://blog.apptopia.com/
worldwide-us-download-leaders-2020.
[7]
Brendan Gregg. Linux perf Examples: 4.3 JIT
Symbols (Java, Node.js), July 2020.
https://www.
brendangregg.com/perf.html#JIT_Symbols.
[8]
Randal E. Bryant and David R. O’Hallaron. Computer
Systems: A Programmer’s Perspective. chapter 6, pages
615–629. Pearson, 2nd edition, 2011.
[9]
Bryan M. Cantrill, Michael W. Shapiro, and Adam H.
Leventhal. Dynamic Instrumentation of Production
Systems. In Proceedings of the 10th USENIX Annual
Technical Conference, USENIX ATC ’04, pages 15–28.
USENIX Association, June 2004.
[10]
Jong-Deok Choi and Harini Srinivasan. Deterministic
Replay of Java Multithreaded Applications. In Pro-
ceedings of the SIGMETRICS Symposium on Parallel
and Distributed Tools, SPDT ’98, pages 48–59. ACM,
August 1998.
[11]
Michael Chow, David Meisner, Jason Flinn, Daniel Peek,
and Thomas F. Wenisch. The Mystery Machine: End-
to-end Performance Analysis of Large-scale Internet
Services. In Proceedings of the 11th Symposium on Op-
erating Systems Design and Implementation, OSDI ’14,
pages 217–231. USENIX Association, October 2014.
[12]
Weidong Cui, Xinyang Ge, Baris Kasikci, Ben Niu, Upa-
manyu Sharma, Ruoyu Wang, and Insu Yun. REPT:
Reverse Debugging of Failures in Deployed Software.
In Proceedings of the 13th Symposium on Operating
Systems Design and Implementation, OSDI ’18, pages
17–32. USENIX Association, October 2018.
[13]
Datadog. Cloud Monitoring as a Service.
https://
www.datadoghq.com/.
[14]
Tudor David, Rachid Guerraoui, and Vasileios Trigo-
nakis. Everything You Always Wanted to Know about
Synchronization but Were Afraid to Ask. In Proceedings
of the 24th Symposium on Operating Systems Principles,
SOSP ’13, pages 33–48. ACM, November 2013.
[15]
George W. Dunlap, Samuel T. King, Sukru Cinar, Mur-
taza A. Basrai, and Peter M. Chen. ReVirt: Enabling In-
trusion Analysis through Virtual-Machine Logging and
Replay. In Proceedings of the 5th Symposium on Oper-
ating Systems Design and Implementation, OSDI ’02,
pages 211–224. USENIX Association, December 2002.
[16]
George W. Dunlap, Dominic G. Lucchetti, Michael A.
Fetterman, and Peter M. Chen. Execution Replay of
Multiprocessor Virtual Machines. In Proceedings of
the 4th International Conference on Virtual Execution
Environments, VEE ’08, pages 121–130. ACM, March
2008.
[17]
Úlfar Erlingsson, Marcus Peinado, Simon Peter, and Mi-
hai Budiu. Fay: Extensible Distributed Tracing from
Kernels to Clusters. In Proceedings of the 23rd Sympo-
sium on Operating Systems Principles, SOSP ’11, pages
311–326. ACM, October 2011.
[18]
Facebook, Inc. Profilo - An Android Performance
Library.
https://facebookincubator.github.io/
profilo/.
[19]
Chlöé Farrer, G Valentin, and Jean-Michel Hupé. The
Time Windows of the Sense of Agency. Consciousness
and Cognition, 22(4):1431–1441, December 2013.
[20]
Rodrigo Fonseca, George Porter, Randy H. Katz, Scott
Shenker, and Ion Stoica. X-trace: A Pervasive Network
Tracing Framework. In Proceedings of the 4th Sympo-
sium on Networked Systems Design and Implementation,
NSDI ’07, pages 271–284. USENIX Association, April
2007.
[21]
Kirk Glerum, Kinshuman Kinshumann, Steve Green-
berg, Gabriel Aul, Vince Orgovan, Greg Nichols, David
Grant, Gretchen Loihle, and Galen Hunt. Debugging
in the (Very) Large: Ten Years of Implementation and
Experience. In Proceedings of the 22nd Symposium on
Operating Systems Principles, SOSP ’09, pages 103–
116. ACM, October 2009.
[22]
Google LLC. App Startup Time, April 2021.
https://developer.android.com/topic/
performance/vitals/launch-time.
[23]
Google LLC. Firebase Performance Monitoring, April
2021.
https://firebase.google.com/docs/perf-
mon.
[24]
Google LLC. Inspect CPU activity with CPU Pro-
filer, May 2021.
https://developer.android.com/
studio/profile/cpu-profiler.
[25]
Google LLC. Overview of System Tracing, May
2021.
https://developer.android.com/topic/
performance/tracing.
[26]
Google LLC. Simpleperf Profiling Tool: JIT
Symbols, September 2021.
https://android.
googlesource.com/platform/system/extras/+/
ec8d549d4c4300dcfb4e12353eccbeba17bf7725/
simpleperf/doc/jit_symbols.md.
[27]
Susan L. Graham, Peter B. Kessler, and Marshall K.
Mckusick. Gprof: A Call Graph Execution Profiler. In
Proceedings of the SIGPLAN Symposium on Compiler
Construction, SIGPLAN ’82, pages 120–126. ACM,
June 1982.
[28]
Baris Kasikci, Benjamin Schubert, Cristiano Pereira,
Gilles Pokam, and George Candea. Failure Sketching:
A Technique for Automated Root Cause Diagnosis of
In-production Failures. In Proceedings of the 25th Sym-
posium on Operating Systems Principles, SOSP ’15,
pages 344–360. ACM, October 2015.
[29]
Dongyoon Lee, Peter M. Chen, Jason Flinn, and Satish
Narayanasamy. Chimera: Hybrid Program Analysis for
Determinism. In Proceedings of the 33rd ACM SIG-
PLAN Conference on Programming Language Design
and Implementation, PLDI ’12, pages 463–474. ACM,
June 2012.
[30]
Dongyoon Lee, Benjamin Wester, Kaushik Veeraragha-
van, Satish Narayanasamy, Peter M. Chen, and Jason
Flinn. Respec: Efficient Online Multiprocessor Re-
playvia Speculation and External Determinism. In Pro-
ceedings of the 15th International Conference on Ar-
chitectural Support for Programming Languages and
Operating Systems, ASPLOS XV, pages 77–90. ACM,
March 2010.
[31]
Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca.
Pivot Tracing: Dynamic Causal Monitoring for Dis-
tributed Systems. In Proceedings of the 25th Symposium
on Operating Systems Principles, SOSP ’15, pages 378–
393. ACM, October 2015.
[32]
Gabriel Marin, Alexey Alexandrov, and Tipp Moseley.
Break Dancing: Low Overhead, Architecture Neutral
Software Branch Tracing. In Proceedings of the 22nd
Conference on Languages, Compilers, and Tools for
Embedded Systems, LCTES ’21, pages 122–133. ACM,
June 2021.
[33]
Ali José Mashtizadeh, Tal Garfinkel, David Terei, David
Mazieres, and Mendel Rosenblum. Towards Practical
Default-On Multi-Core Record/Replay. In Proceedings
of the 22nd International Conference on Architectural
Support for Programming Languages and Operating
Systems, ASPLOS ’17, pages 693–708. ACM, April
2017.
[34]
Mozilla. Mozilla Crash Reporter, May
2021.
https://support.mozilla.org/en-
US/kb/mozillacrashreporter.
[35]
Ravi Netravali and James Mickens. Reverb: Speculative
Debugging for Web Applications. In Proceedings of
the ACM Symposium on Cloud Computing, SoCC ’19,
pages 428–440. ACM, November 2019.
[36]
Robert O’Callahan, Chris Jones, Nathan Froyd, Kyle
Huey, Albert Noll, and Nimrod Partush. Engineering
Record and Replay for Deployability. In Proceed-
ings of the 2017 USENIX Annual Technical Conference,
USENIX ATC ’17, pages 377–389. USENIX Associa-
tion, July 2017.
[37]
Oracle Corporation. ThreadLocal (Java
Platform SE 7), December 2020.
https:
//docs.oracle.com/javase/7/docs/api/java/
lang/ThreadLocal.html.
[38]
Soyeon Park, Yuanyuan Zhou, Weiwei Xiong, Zuoning
Yin, Rini Kaushik, Kyu H. Lee, and Shan Lu. PRES:
Probabilistic Replay with Execution Sketching on Mul-
tiprocessors. In Proceedings of the 22nd Symposium
on Operating Systems Principles, SOSP ’09, pages 177–
192. ACM, October 2009.
[39]
Perf Wiki, June 2020.
https://perf.wiki.kernel.
org/index.php/Main_Page.
[40]
Perfetto - System profiling, App Tracing and Trace Anal-
ysis. https://perfetto.dev/.
[41]
Lenin Ravindranath, Jitendra Padhye, Sharad Agarwal,
Ratul Mahajan, Ian Obermiller, and Shahin Shayandeh.
AppInsight: Mobile App Performance Monitoring in the
Wild. In Proceedings of the 10th Symposium on Oper-
ating Systems Design and Implementation, OSDI ’12,
pages 107–120. USENIX Association, October 2012.
[42]
Lenin Ravindranath, Jitendra Padhye, Ratul Mahajan,
and Hari Balakrishnan. Timecard: Controlling User-
Perceived Delays in Server-Based Mobile Applications.
In Proceedings of the 24th Symposium on Operating
Systems Principles, SOSP ’13, pages 85–100. ACM,
November 2013.
[43]
Raygun. Real User Monitoring Performance Metrics,
May 2022.
https://raygun.com/documentation/
product-guides/real-user-monitoring/for-
web/performance-metrics/.
[44] New Relic. New Relic®. https://newrelic.com/.
[45]
Michiel Ronsse and Koen De Bosschere. RecPlay:
A Fully Integrated Practical Record/Replay System.
ACM Transactions on Computer Systems, 17(2):133–
152, May 1999.
[46]
Kedar Sadekar. Netflix Engineering Blog:
Scalable Logging and Tracking. June 2012.
https://netflixtechblog.com/scalable-
logging-and-tracking-882bde0ddca2.
[47]
Benjamin H. Sigelman, Luiz André Barroso, Mike Bur-
rows, Pat Stephenson, Manoj Plakal, Donald Beaver,
Saul Jaspan, and Chandan Shanbhag. Dapper, a Large-
Scale Distributed Systems Tracing Infrastructure. Tech-
nical report, Google, Inc., April 2010.
[48]
Richard L. Sites. Datacenter Computers - Modern
Challenges in CPU Design. Video, February 2015.
https://vimeo.com/121396406.
[49]
Joel Spolsky. How Microsoft Lost the API War. June
2004.
https://www.joelonsoftware.com/2004/06/
13/how-microsoft-lost-the-api-war/.
[50]
Sudarshan M. Srinivasan, Srikanth Kandula, Christo-
pher R. Andrews, and Yuanyuan Zhou. Flashback: A
Lightweight Extension for Rollback and Deterministic
Replay for Software Debugging. In Proceedings of the
10th USENIX Annual Technical Conference, USENIX
ATC ’04, pages 29–44. USENIX Association, June
2004.
[51]
Steven Rostedt. ftrace - Function Tracer, July 2017.
https://www.kernel.org/doc/Documentation/
trace/ftrace.txt.
[52] SystemTap. https://sourceware.org/systemtap/.
[53] Tai Chi, May 2020. https://taichi.cool/doc/.
[54]
Leland Takamine and Brian Attwell. Introducing
Nanoscope: An Extremely Accurate Method Tracing
Tool for Android. April 2018.
https://eng.uber.
com/nanoscope/.
[55]
Jiang Tenglicheng. Logan: Meituan Open Source
Mobile Terminal Basic Log Library. October
2018.
https://tech.meituan.com/2018/10/11/
logan-open-source.html.
[56]
The Android Open Source Project. Android 10.0.0
Release 2, 2019.
https://android.googlesource.
com/platform/build/+/refs/tags/android-
10.0.0_r2.
[57]
Kaushik Veeraraghavan, Dongyoon Lee, Benjamin
Wester, Jessica Ouyang, Peter M. Chen, Jason Flinn,
and Satish Narayanasamy. DoublePlay: Parallelizing
Sequential Logging and Replay. In Proceedings of the
16th International Conference on Architectural Support
for Programming Languages and Operating Systems,
ASPLOS XVI, pages 15–26. ACM, March 2011.
[58]
Stephen Yang, Seo Jin Park, and John Ousterhout.
NanoLog: A Nanosecond Scale Logging System. In
2018 USENIX Annual Technical Conference, USENIX
ATC ’18, pages 335–350. USENIX Association, July
2018.
[59]
Xu Zhao, Kirk Rodrigues, Yu Luo, Michael Stumm,
Ding Yuan, and Yuanyuan Zhou. Log20: Fully Au-
tomated Optimal Placement of Log Printing Statements
under Specified Overhead Threshold. In Proceedings of
the 26th Symposium on Operating Systems Principles,
SOSP ’17, pages 565–581. ACM, October 2017.
[60]
Gefei Zuo, Jiacheng Ma, Andrew Quinn, Pramod Bhato-
tia, Pedro Fonseca, and Baris Kasikci. Execution Recon-
struction: Harnessing Failure Reoccurrences for Fail-
ure Reproduction. In Proceedings of the 42nd ACM
SIGPLAN Conference on Programming Language De-
sign and Implementation, PLDI 2021, pages 1155–1170.
ACM, June 2021.