Root cause analysis (RCA) is an important part of fixing any bug. After all, you can’t solve a problem without getting to the heart of it. But RCA isn’t always simple, especially at a scale like Facebook’s. When billions of people are using an app on a variety of platforms and devices, a single bug can create several different issues on its own, and multiple bugs can happen simultaneously.

There was a time when on-call engineers had to spend hours, or even days, manually combing through error reports, looking for patterns to help them debug. Today, they can use Minesweeper, a technique we’ve developed for automating RCA that identifies the causes of bugs based on their symptoms. Minesweeper-based RCA is completely automated and scalable, and it’s grounded in formal statistical concepts.

Our own evaluations of Minesweeper using real-world bug reports from Facebook’s apps have shown that it can perform RCA for tens of thousands of reports in minutes and can identify the root cause of bugs with 85 percent accuracy. Since it was developed, Minesweeper has become Facebook’s first line of defense against bugs and has helped us prevent potentially wide-scale disruptions from affecting people on our apps.

How does Minesweeper work?

Anytime someone reports a bug through the Facebook app, the error reporting system typically captures a chronological trace of actions (or “events”) that person performed in the app prior to encountering the bug. The idea is to get a snapshot of what might have caused the error to happen.

Minesweeper scans these traces to look for distinctive patterns that could point to the cause of a bug. Traces that contain the bug (the test group) are compared with traces that do not (the control group). Minesweeper finds patterns of events that are statistically distinctive to the test group as opposed to the control group. These patterns are likely to be correlated with the bug and can thereby point toward its root cause.

Suppose (hypothetically) that 10 people are using the Facebook app. Five of these people report a problem, and there are eight possible events tracked in the app (a, b, … through h). We have 10 traces in total on these events, five in the test group (T) from people who encountered the bug, and five in the control group (C) from the remaining people who did not.

When Minesweeper is applied to these traces, it begins by extracting sequential patterns in T and C. A sequential pattern is simply a chronological sequence of events that happened, but not necessarily one after another (i.e., there might be other events in between that were not significant).

An example of a pattern that the system can extract here is a → c. For each pattern, it computes the number of traces in which it appears in each group (T and C); this count is the pattern’s support in that group. In this example, the pattern a → c appears twice in T (t1 and t2), and so its support in T is 2. Its support in C is also 2 (t7 and t8). Note that the pattern space is combinatorial in nature, and therefore it is crucial to employ algorithms that can search this space efficiently without an exponential blowup.

Once all patterns are extracted along with their supports in T and C, the system performs statistical isolation. For each pattern, P, it computes precision and recall using its support: precision is the share of all traces containing P that belong to the test group, and recall is the share of test-group traces that contain P. Informally, precision describes how accurate P is in detecting whether a given trace is in the test group rather than the control group, and recall describes how much of the test group P can cover.

For example, the pattern b has a precision of 0.75 because it occurs in four traces in total, three of which are in the test group (i.e., it is 75 percent specific to the test group). It also has a recall of 0.6 because it occurs in three out of five traces in the test group (i.e., it covers 60 percent of the test group). The harmonic mean of the two, 0.67, is the F1-score.

The system computes the F1-score of all patterns in this manner and returns the list of all patterns ranked by F1-score. In this example, the result shows that the pattern b → c is the highest ranked. In a real setting, an engineer debugging the reports can infer that events b and c, occurring in that order, are suspicious and deserve close inspection.
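The support, precision, recall, and F1 steps described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Minesweeper's actual implementation: the ten traces are invented (chosen so that the patterns b and a → c match the numbers quoted in the text), all function names are hypothetical, and candidate patterns are enumerated by brute force up to length 2, whereas a real system would need an efficient sequential pattern mining algorithm.

```python
from itertools import permutations


def contains(trace, pattern):
    """Return True if `pattern` occurs in `trace` in order, gaps allowed."""
    remaining = iter(trace)
    # `event in remaining` advances the iterator, so event order is enforced.
    return all(event in remaining for event in pattern)


def support(pattern, traces):
    """Number of traces in the group that contain the pattern."""
    return sum(contains(trace, pattern) for trace in traces)


def score(pattern, test, control):
    """Return (precision, recall, F1) computed from the pattern's supports."""
    sup_t = support(pattern, test)
    sup_c = support(pattern, control)
    if sup_t == 0:
        return 0.0, 0.0, 0.0
    precision = sup_t / (sup_t + sup_c)  # share of matching traces in T
    recall = sup_t / len(test)           # share of T that the pattern covers
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def rank_patterns(test, control, events):
    """Brute-force patterns of length 1 or 2 and sort by F1, best first.

    Assumption: exhaustive enumeration is only viable for toy inputs.
    """
    candidates = [(e,) for e in events] + list(permutations(events, 2))
    scored = [(p, score(p, test, control)) for p in candidates]
    return sorted(scored, key=lambda item: item[1][2], reverse=True)


# Hypothetical traces (one character per event); not the article's data.
TEST = ["abc", "adbc", "bec", "dfg", "eh"]    # t1..t5, bug reported
CONTROL = ["ad", "acb", "afc", "gh", "de"]    # t6..t10, no bug
EVENTS = "abcdefgh"

if __name__ == "__main__":
    for pattern, (p, r, f1) in rank_patterns(TEST, CONTROL, EVENTS)[:3]:
        print(" -> ".join(pattern), f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

With these made-up traces, the pattern b gets a precision of 0.75, a recall of 0.6, and an F1-score of about 0.67, and b → c comes out on top of the ranking because it appears in three test traces and no control traces.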