HellaSec joins DARPA’s Cyber Fast Track program

I am very excited to announce that DARPA has awarded funding to HellaSec, my security R&D side business, through the Cyber Fast Track program. I’ll be working with a fellow security researcher to develop novel security technology, which we plan to open source at the conclusion of the program.

In a future blog post I will describe the technology we’re building. For this post though I want to build some context by discussing the Cyber Fast Track program and why it’s important.

What is DARPA?

To understand the Cyber Fast Track you first need to understand DARPA — the United States’ Defense Advanced Research Projects Agency. DARPA hires visionary scientists and engineers to pursue their own big ideas to benefit national security. By charter, DARPA goes after the high-risk, high-payoff R&D that other agencies avoid.

Among many fascinating projects, DARPA has led the development of the world’s fastest legged robot, the first stealth airplane, and the original Internet. Every research program is an outlandlish experiment. Can we build a running robot faster than most humans? Can we hide from radar? Can we build a network capable of joining all other computer networks?

What is the Cyber Fast Track program?

In 2010 Mudge, the famed hacker of L0pht and cDc, joined DARPA as a program manager. What is his big idea?

Rather than proposing a specific new security technology or technique, Mudge’s big idea is to experiment with the R&D process itself. Leave it to a hacker to be so meta.

Mudge created the Cyber Fast Track program to fund lone hackers and small, agile security firms like my own.  The hope is that these smaller efforts can rapidly advance the United State’s security capabilities, while “leapfrogging” over the expensive and slow research efforts lead by the traditional pool of large corporations.

I buy into Mudge’s vision and I am excited to be a part of his program. If all goes according to plan, HellaSec will contribute to the success of the experiment — paving the way for new mechanisms for the United States to keep pace with the security world.

Stay tuned

If you would like to keep posted, follow me on Twitter @MichaelNGagnon.

The Cycle of Gaining and Losing Knowledge

“There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – there are things we do not know we don’t know.” — Donald Rumsfeld

In a press briefing a little over 10 years go, Rumsfeld popularized the concept of known knowns, known unknowns, and unknown unknowns. I hope my figure below communicates the problem with “unknown unknowns.”

In general you want the small green slice of “known knowns” to be as big as possible; this represents your knowledge of the universe. For example,

I know that I know how to program in the Ruby programming language.

You also want the blue slice of “known unknowns” to be as big as possible.  Once you are aware of ignorance about a particular thing, you can then decide to educate yourself (if the benefit of learning outweighs the cost of learning).

I didn’t always know how to program in Ruby, but I knew Ruby existed and had an idea of its value proposition.

Moving from “known unknowns” to “known knowns” is therefore relatively straightforward. However, that big red slice of “unknown unknowns” is really scary. You have no idea what’s in there and you have no idea how big it is.  You can’t really decide on your own to move from “unknown unknown” into a “known unknown,” because, well — you don’t even know what knowledge is in that category.

The best you can do with “unknown unknowns” is be aware that the category exists and maintain an open mind. This way when information presents itself to you, you can cognizantly realize that it was an “unknown unknown” and then you’ll either be in the “known knowns” or “known unknowns” category.

I didn’t always know that Ruby existed. One day someone told me about it, and I became aware of its value proposition.

These transitions of knowledge are presented in the figure below.

I’ve already explained the blue and green arrows, but what the hell is that red arrow?  The point of this blog post to describe the arrow  from “known knowns” to “unknown unknowns;” the transition from knowledge to ignorance.

Becoming ignorant again

It’s often a long road becoming an expert on a topic.

I have been programming for 17 years.  I’ve programmed in Ruby for severals years and I know Ruby pretty well now.

Becoming an expert was hard. I remember what it was like to be a beginner — the perplexing error messages and the mystery of source code.

But today some information came my way that made me realize I don’t actually know what it’s like to be a beginner. I thought I knew what it was like; in actuality though, I didn’t know that I didn’t know what it was like to be a beginner.

I was reading an article in Slate about a journalist’s efforts to teach herself programming in Ruby.  This is the line that shocked me out of the “unknown unknown” category.

“I started off by saving [my first Ruby program] as a Word document” — 

I never would have imagined someone would try that. From 17 years of experience, I know that you need to save your programs in text file format, but I had no idea that a beginner wouldn’t know that.

Even worse than being ignorant, I had been deceiving myself into thinking that I remembered what it was like to be a beginner.

I do remember bits and pieces of learning to program, which fooled me into thinking I remembered what it was like to be a beginner. For instance I remember that when I first started programming, the error messages overwhelmed me.  Perhaps the only reason these memories stand out is because sometimes they still overwhelm me. Maybe I forgot about the text-file issue because it was one particular issue I learned 17 years ago; and once I learned it, I moved on and never thought about it again.

I can’t relate to what its like to be a beginner programmer.  But at least now, I know that I don’t know.

The scientific method is not applicable to computer science (a patently false statement)

While a student in Margo Seltzer‘s research-based Operating Systems class, she challenged each of us to write a one paragraph essay that argues a “patently false statement.” It was an exercise in persuasive writing. Here is mine:

The scientific method is not applicable to computer science

Science only works because scientists share a common philosophical belief—the belief that the laws of the physical universe remain constant.  However, computers are man-made systems, whose laws remain at the whims of engineers and users.  This point is particularly emphasized by the failures of computer security: malicious users are very often capable of breaking the laws of computer systems.  Because the laws vary from one universe (computer system) to another, one cannot safely generalize the results of computer-science experiments.  Scientific induction simply does not apply and, ultimately, the scientific method is fundamentally not applicable to computer science.

What do you think?

What the flock?

Today, through the acquisition of my company Dasient, I have joined the flock at Twitter (they love bird metaphors). I am a proud new member of Twitter’s revenue-security team, and I’m excited!

Until I heard about Twitter’s interest in acquiring Dasient, I was only a light Twitter user. I didn’t really get it. And when I talk to some of my friends about Twitter, they don’t get it either. Let me share what I learned about Twitter that enlightened me.

Twitter helps me find and follow information I can’t get any other way

Just by looking at my Tweets, you can’t tell I’m a power user, because I don’t Tweet much. But I am a power user because I use Twitter to access information I can’t get any other way. Here’s two personal examples of how Twitter helped me within the past couple of weeks:

(1) ATM outage

Last week at Bank of America every ATM was out of order and the bank lobby was closed. Strange.

In the days before smartphones I would normally drive around town looking for another ATM. But now I have a smartphone and understand the value of Twitter. Searching for “bank of america outage” on Twitter revealed 4 tweets from the past hour reporting Bank of America ATM outages all across the United States.

Bank of America outage on Twitter search

No need to look for another ATM; the entire system was down.

Now that I get Twitter, I know how to find that kind of real-time information. But what would I have done if I hadn’t been a Twitter user? When I got back to the office, I did a little experiment. I tried looking up the ATM outage on Google. I searched the news, blogs, BofA’s own site, and the Web at large—and found nothing.

Twitter was the only medium that had the real-time information I needed.

(2) Algorithmic-complexity attacks in the wild

The primary topic of my graduate research at Harvard and MIT Lincoln Laboratory was algorithmic-complexity attacks (AC attacks), a particular type of denial-of-service (DoS) attack, which I have blogged about before.

While surveying the literature, I searched far and wide for public discussions of these attacks occurring in the wild. I found nothing. I felt sure they actually happened in the wild, because of their appealing features from an attacker’s perspective. I just couldn’t find any discussions. In my research paper describing my novel defense against such attacks, I couldn’t mention the occurrence of any real-world AC attacks in the wild because I was unaware of any.

My negative experiences researching AC attacks without Twitter sharply contrasts my recent positive experiences researching AC attacks with Twitter. Specifically, several AC vulnerabilities were recently announced on the web. These particular vulnerabilities represent a subgenre of AC vulnerabilities, known as HashDoS vulnerabilities. Discussions of the vulnerabilities emerged on Twitter using the #hashDoS hashtag. There is even a @HashDoS account on Twitter, which moderates and promotes such discussions.

Within minutes of beginning my search on Twitter, I found the gem I’d been searching for for years: a report of an AC attack against StackOverflow.com by an employee.

From Geoff Dalagas, @SuperDalagas. #StackOverflow just saw somebody attempt to use the #hashdos 42 mins ago - patch your servers folks (yes, we have) bit.ly/tKmTFO

Keep in mind: I am an expert on AC attacks, I am a Google power user, and I am a newcomer to Twitter.

To top it all off, now that I know who the other researchers are in this area, I don’t have to search anymore. I just follow them and their insights get pushed to me automatically.

Flock

I think most people are like me; they consume information more than they produce it. Subsequently, thinking of Twitter as just a microblogging platform grossly understates its value. If you are not already, become a Twitter user and have the world’s real-time status at your fingertips.  Join the flock and follow me @MichaelNGagnon.

The only brain teaser I use in interviews

Interviewing is a controversial topic in Silicon Valley. Should you attract and filter talent using puzzles? Should interviewers avoid puzzles and brain teasers altogether? Of course I don’t have the answer. But I have an opinion.

I always use exactly one puzzle when I interview candidates for tech positions (such as software engineering and security R&D). I believe the ability to solve this puzzle strongly indicates that the candidate thinks scientifically, which is very important in tech positions. I rarely recommend moving forward with a candidate unless he or she passes this puzzle.

The puzzle tests the candidate for a specific cognitive bias and works like this:

I will give you a sequence of three numbers (a triple). This triple adheres to a particular rule. Your goal is to figure out what this rule is. You will figure out what the rule is by proposing new triples to me. For each triple you give me, I will respond by telling you whether that triple adheres to the rule, or breaks the rule. You can give me as many triples as you like, but you only get one chance to guess the rule.

An example triple that follows the rule is 2, 4, 6 . (The check mark indicates that the triple follows the rule).

Confirmation bias

This puzzle tests candidates for their susceptibility to the confirmation bias.

When given the puzzle, people tend to develop an initial hypothesis such as “ascending even numbers” (since it seems to nicely generalize the example triple 2, 4, 6 ). Once candidates have an initial hypothesis, they can go two routes in their investigation: confirmation and falsification.

Confirmation

Scientists have studied the confirmation bias using this exact puzzle, and the research has shown that the average Joe tends to confirm hypotheses, rather than attempt to refute them. For example, if your initial hypothesis is “ascending even numbers” you could confirm your hypothesis by giving triples that follow the rule, such as:

  • 4, 6, 8 
  • 108, 116, 140 
  • -8, -6, -4 

In the particular puzzle that I give, all of the above triples follow the rule. And if you fall victim to the confirmation bias, you might convince yourself that your hypothesis is correct based on the above evidence.

But in fact, the correct rule is not “ascending even numbers.”

So how do you solve the puzzle?

Falsification

To correctly solve the puzzle you must attempt to falsify your hypothesis (in addition to confirming it). For example, if your initial hypothesis is “ascending even numbers” you could refute your hypothesis by giving triples that attempt to break the rule, such as:

  • 6, 4, 2 X
  • 3, 2, 1
  • 1, 2, 3 

In the above cases the last triple (1, 2, 3) follows the rule, while the first two break the rule. This evidence refutes the hypothesis (because if the hypothesis were true, 1, 2, 3 would break the rule).

Once you’ve falsified a hypothesis, you need to develop a new hypothesis and repeat the process.

Solution

The correct rule is “ascending numbers.” However, guessing the correct rule is not enough to get credit for solving the puzzle. To solve the puzzle you must scientifically attempt to confirm and refute your hypotheses until you’ve gathered enough evidence to confidently guess the correct rule.

Relevance

As I mentioned before, research has shown that most people fail to solve this puzzle. What then is the relevance of identifying whether or not someone can pass this puzzle? If they fail, you might argue “So what? Everyone has cognitive biases.”

To succeed in a technical job you must be able to think scientifically, and overcoming the confirmation bias is a cornerstone of scientific thinking.

For example, when debugging faulty software, you might be inclined to stop as soon as the faulty behavior disappears. But what you really need to do is figure out what is causing the faulty behavior? Don’t trick yourself into thinking you’ve figured it out until you’ve honed in on the problem by rigorously testing your hypotheses.

Or for example say you’re writing unit tests for your software. You carefully wrote the software under test and you’re feeling good that it’s correct. You might be tempted to take it easy on the unit tests, but really you must try as best you can to write unit tests that break your code.

So there it is.  Use this puzzle and gain confidence that the people you hire think scientifically!

How to defend against algorithmic-complexity attacks

Algorithmic-complexity attacks have been all over the news lately:

This type of attack is well known, causes significant service failures, and can even happen accidentally (such as when Wikipedia crashed while covering news of Michael Jackson’s Death, [1], [2]).

Ever since I discovered my first algorithmic-complexity vulnerability while in undergrad, I’ve kept an eye out for these vulnerabilities. They’re everywhere. I’ve even used an algorithmic-complexity attack to take down a radar during a live-fire ballistic-missile-defense exercise. Our servers, operating systems, programming languages, and best practices do not prevent them.

So while I was in grad school I set out to change that by developing a method to probabilistically guarantee immunity against algorithmic complexity attacks.

Here’s the gist of how you can defend your software from alg-complexity attacks.

Provably Protecting Servers From Algorithmic Complexity Attacks

  1. Enforce a minimum inter-arrival time between incoming requests (you really only need to do this during an overload)
  2. When the system becomes overloaded, do not queue up incoming requests. Rather, kick out old requests that are using up too many resources.
  3. By assuming a probability distribution on the resource requirements of legitimate requests, this method can provide guarantees of the form: with probability at least X, at least Y legitimate requests will complete on time.

You can read the nitty gritty details in this technical report (including a mathematical proof of the guarantee as well as experimental results from attacking Mediawiki).

To finish the work I’m planning on (1) implementing my technique as an enhancement to FastCGI, (2) finding a bunch of real-world vulnerabilities, (3) then doing experiments that demonstrate the effectiveness of my defense against 0-day attacks. Anyone interested on collaborating?

Hit me up with an email (mikegagnon at gmail) or follow me on Twitter.

Spontaneous radial symmetry

Minutes into January 1, 2009 my wife’s family and I spontaneously jumped together for a group photo.  Several of us threw our hands in the air independently.

Amazingly, our arms made exact 45-degree angles.

Original photo

As soon as I saw the photo I immediately announced it was going to be the best picture of 2009. I still stand by that statement. I also knew that some image manipulation with gimp would transform this photograph into fine art.

Spontaneous Radial Symmetry

I hope it doesn’t scare you.

Platform-independent micro-benchmarks are inherently limited

On Quora someone asked, are there any purportedly objective benchmark tests or comparisons of operating system efficiency?

Last year when I was in grad school I took Margo Seltzer‘s research-based Operating Systems class.  Part of my thesis for my class project was that the accuracy of platform-independent micro-benchmarks suffers inherent limitations (as opposed to platform-specific micro-benchmarks, which do not suffer the same limitations).

Micro-benchmarks are benchmarks that measure the performance of primitive operations, such as systems calls or page faults.

Platform-independent micro-benchmarks are micro-benchmarks designed to run on multiple platforms.

To prove this part of my thesis I analyzed lmbench, the most popular OS benchmark suite that uses platform-independent micro-benchmarks.

“lmbench is a suite of simple, portable, ANSI/C microbenchmarks for UNIX/POSIX. In general, it measures two key features: latency and bandwidth. lmbench is intended to give system developers insight into basic costs of key operations.” — from the lmbench SourceForge page

True to my thesis, the platform-independent aspect of lmbench’s design caused its benchmarks to suffer inherent limitations.  These limitations can only be fixed by abandoning platform-independence and developing platform-specific micro-benchmarks.  The results of my study illustrate why comparative OS benchmarking is hard and why the results are often trustworthy, which leads to my response to the Quora question…

Are there any purportedly objective benchmark tests or comparisons of operating system efficiency?

It depends on what you want to benchmark.

There are two things (that I can think of) that you might want to benchmark:

  1. The effect of the OS on application performance, and
  2. The performance of operating-system primitives such as system calls.

(1) is easier to measure because you can re-run the same application benchmarks on multiple OSes and compare results. However, (2) is much more difficult because it requires portable micro-benchmarks and developing portable micro-benchmarks for OSes is inherently hard.

Lmbench is probably the most popular micro-benchmark suite that benchmarks operating-system primitives. Despite its popularity, lmbench produces inaccurate results on modern operating systems, due to the inherent difficulty of developing portable micro-benchmarks.

Here’s why. Lmbench attempts to achieve portability through ANSI C and by only using POSIX interfaces. Thus any system that supports ANSI C and POSIX may run the lmbench benchmark suite. However, POSIX only specifies interfaces and does not specify many implementation details. The lmbench benchmarks rely on implementation-specific details of the operating system in order to execute specific OS primitives. Thus, lmbench only yields accurate results on operating systems that follow its assumptions.

Example: The lat_pagefault benchmark

I’ll give a concrete example of an inaccurate lmbench benchmark. The lat_pagefault benchmark attempts to measure the average runtime of page faults. I’ve tested it on two operating systems. The benchmark is somewhat accurate on OS X but on Linux it under-reports page-fault runtime by several orders of magnitude.

Here’s how the benchmark works:

  1. Lmbench sets up the benchmark by mmap-ing a sequence of N pages, then attempts to page out the pages. To cause page outs, lmbench uses the POSIX msync() system call with the MS_INVALIDATE flag set.
  2. Once the memory is paged out, then every memory access to the pages causes a page fault. The benchmark itself simply iterates over mmaped pages, one by one, where each access is supposed to cause a page fault. Lmbench reports the page fault latency as the number of page faults divided by the total runtime = N / runtime.

The benchmark fails to be accurate for two reasons:

(Problem 1) POSIX does not specify any means to force page outs. Lmbench fallaciously assumes that msync() causes page outs. This behavior is actually a platform-specific implementation detail; the POSIX standard does not specify that invocations of msync() cause page outs. On Linux, for example, the msync system call does not cause pageouts. Therefore the benchmark only measures the latency of normal memory accesses (which is often faster by several orders of magnitudes).

You can fix the Linux benchmark in a non-portable way by instructing the kernel to flush the swap cache (write the value “1” to the file  /proc/sys/vm/drop_caches).  Note that by applying this fix, the benchmark becomes platform specific and loses portability.

(Problem 2) You can solve Problem 1 by adding platform-specific code to cause page-outs on Linux. However, there remains another problem: the benchmark still significantly under-reports page-fault latencies.

Many modern operating-system kernels (such as Linux and Darwin) employ anticipatory paging. Rather than just paging in the single faulting page, the kernel pages in multiple pages during a fault — hoping to reduce the chance of future page faults. The kernel uses information about the process workload to heuristically determine which pages should be paged in during a fault. Thus, the memory access patterns of the benchmark directly determine the behavior of the virtual-memory manager (VMM).

Anticipatory paging affects benchmark accuracy because a single page-fault may cause many pages to be paged in, making it less likely that other memory accesses in the benchmark will cause page faults. Thus, the benchmark thinks it is causing page faults, but in reality it is only causing some fraction of that. As a result, lmbench severely under-reports the average page-fault runtime.

To fix this problem, you can use the mincore() system call to figure out exactly how many page faults your benchmark causes. Then, you can use that information to develop workloads that cause predictable page-faults in your benchmark. Unfortunately, this system call is not part of the POSIX standard so this solution is not portable (though it does happen to work on both Linux and OS X).

Correct results

I fixed these problems and ran the benchmark on a MacBook Pro running OS X and Linux. Here’s the results (in microseconds):

The graph shows that Linux is slightly faster at handling page faults than OS X. It’s also interesting to note that the on both OS X and Linux page-fault latency seems to scale linearly at the same rate with regard to the number of page-ins per fault (recall, a page fault usually results in multiple page ins because of the anticipatory pager).

Note that these results were only possible by scrutinizing the behavior of the benchmark on different operating systems and writing non-portable, platform-specific benchmarks. In general, portable micro-benchmarks cannot safely measure the performance of platform-specific operations (such as page faults).

Hashing IMEI numbers does not protect privacy

(I originally published this post on Dasient’s blog)

In an effort to protect the privacy of users, mobile apps sometimes hash the user’s IMEI number prior to sending it to a server. We found that hashing IMEIs does not protect the privacy of users, even with the use of cryptographically secure hash functions. This result is due to the fundamental structure of IMEI numbers.

IMEI numbers and privacy concerns

Each modern cell phone has a private IMEI number—a 15-digit number that uniquely identifies the device on the cellular network. IMEIs are primarily useful for tracking stolen devices. As a secondary use, IMEIs are often re-purposed as user IDs in mobile applications. 

Using IMEIs as user IDs represents a threat to privacy because it enables unrelated applications to compare notes on your behavior. For example, say I use two unrelated applications published by two different corporations: a web-browser app and an email app. The web-browser app knows which websites I visit (but not my real name), whereas the email app knows my real name (but not which websites I visit). If both apps use the IMEI as my user ID then the corporations can easily compare notes since they both share a common ID that uniquely identifies me. My privacy is threatened since it is possible for my browsing habits to be tied to my real name.

IMEI numbers can also help attackers spoof your identity. For example, a voicemail app might have the security feature that it only delivers voicemail to the user’s specific cell phone (as opposed to allowing the user to check their voicemail from multiple devices). One way the voicemail app might authenticate the phone is by checking that the phone’s IMEI number matches the user’s registered IMEI number. If an attacker learns the IMEI number of a particular user, it can help them impersonate that user and access their voicemail.

IMEI leakage

We recently behaviorally analyzed 10,000 apps from the Android Market for a mobile-malware survey for Black Hat 2011. To the best of our knowledge this study represents the largest behavioral analysis of Android apps to date. In agreement with previous studies (such as TaintDroid and Taming), we found that Android apps often leak IMEI numbers over the web. We observed that at least 8% of apps leaked the user’s IMEI number. 93% of those apps leaked IMEI numbers in the clear. In the other 7 percent, the apps hashed the IMEI with either MD5 or SHA-256—presumably in an attempt to protect user privacy.

Hashing the IMEI may seem like a good way to protect the privacy of users, since it is supposed to be difficult to reverse a hash value to obtain on original message. However, the inherent structure of IMEI numbers makes them vulnerable to reversing via lookup tables.

Background: lookup tables

Cryptographic hash functions should be irreversible. That is, it should be easy to generate a hash value from an original message but infeasible to reverse a hash value back to an original message. However, if the original messages are not sufficiently random, then hashing can be defeated using lookup tables—which are essentially dictionaries that map hash values back to original messages.

To generate a lookup table for IMEIs you simply need to go through a list of IMEIs, generating the hash value for each IMEI. Then to reverse a hash value, you simply look up the corresponding IMEI in the table. (For improved space efficiency you can use a rainbow table in place of a lookup table.)

First five entries in one of the iPhone lookup tables

Hashed IMEIs are vulnerable to lookup tables

Complete lookup tables are usually infeasible to build because there are so many possible entries. And at a first glance, it seems infeasible to build a useful lookup table for IMEI hashes. There are 1015 = one quadrillion possible IMEI values, necessitating 1015 entries.
For most adversaries, it would require a prohibitive amount of computation to create such a large lookup table. On my 8-core 2.26 GHz machine, I can hash 8 million IMEI numbers in about 2 seconds, using the SHA-1 cryptographic hash function. At that rate it would take about 8 years to compute all the hashes for a complete lookup table.
However, IMEI numbers are not distributed uniformly at random. The first 8 digits of an IMEI represent the Type Allocation Code (TAC), which is determined by the model of the phone. For example, because I have an HTC Thunderbolt, the first 8 digits of my IMEI are 99000032. Although this is the most significant portion of my IMEI number, it is not private information; knowing the model of my phone (or guessing the model) is sufficient to guess most of my IMEI number.After the 8-digit TAC there are 6 digits that uniquely identify the specific cellular device. In the screen shot of my phone, I x’ed out those 6 digits to protect my privacy. These 6 digits are the only digits that are difficult for an attacker to guess. After those 6 digits the last digit is a Luhn-checksum digit, which is computed as a function of the first 14 digits. Thus, in a 15-digit IMEI number there is a relatively low amount of randomness.With this knowledge in mind an adversary can follow a common attack pattern: build lookup tables only for the most common TAC numbers. Since a relatively small number of mobile devices dominates the market, the attacker only needs to build the lookup tables for the most popular TAC numbers.

Attack Demonstration

To demonstrate the practicality of this attack I built 105 lookup tables for 105 different iPhone TACs, using the SHA-1 hash function. Each table took up 55 megabytes of space, yielding 5.6 gigabytes in total (which is larger than the theoretical minimum since I stored data as ASCII). On an 8-core 2.26 GHz machine it took my simple Python script about six and a half minutes to build the iPhone lookup tables. With these tables built I was able to instantly reverse the IMEI hashes for all the iPhones in our office.

Recommendations

We recommend that mobile apps refrain from sending hashes of IMEIs over the web. It is easy for attackers to generate IMEI numbers when given the hash values of IMEIs—even for cryptographically secure hash functions. Salting the hash function (adding random bits to the input) helps to obscure the IMEIs further. However, if the adversary knows the salt value and the model of the phone (or can guesses well), it is easy to rebuild custom lookup tables.

In order to prevent apps from having the ability to compare notes on their users, apps need to refrain from basing their user IDs on device IDs altogether. Even if IMEIs were not vulnerable to lookup tables, two different app publishers could de-anonymize users by hashing the IMEI in the same way.

Last but not least, we recommend attending our Black Hat talk on August 4, where we will present other interesting findings from our dynamic analysis of 10,000 Android apps. ;-)