Intelligent authentication

TL;DR: Can you spend a few minutes participating in a short survey?

I'm writing my master's thesis about intelligent authentication: the idea is to build a system that makes automatic decisions based on location, access times and other datasets. For example, if a user tries to access our internal service from the office with their own computer, right after using an electronic door key, the chance is high that the login attempt is legitimate. If that is the case, there's no need to ask for the password yet again. Everyone hates entering a username and password every single morning, especially after already signing in to their workstation.
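To make the idea concrete, here is a minimal sketch of such a decision in Python. It is not the thesis system: the signal names, the point weights, the ten-minute window and the threshold are all made-up assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoginContext:
    """Hypothetical signals available at login time (illustrative only)."""
    from_office_network: bool
    known_device: bool
    # Seconds since the user's electronic door key was used; None if unknown.
    seconds_since_door_key: Optional[float]


def legitimacy_points(ctx: LoginContext) -> int:
    """Sum assumed integer weights for each signal that looks legitimate."""
    points = 0
    if ctx.from_office_network:
        points += 40
    if ctx.known_device:
        points += 30
    # Assumed window: the door key was used within the last 10 minutes.
    if ctx.seconds_since_door_key is not None and ctx.seconds_since_door_key <= 600:
        points += 30
    return points


def needs_password(ctx: LoginContext, threshold: int = 90) -> bool:
    """Ask for the password unless the context scores at or above the threshold."""
    return legitimacy_points(ctx) < threshold
```

With these assumed weights, a login from the office on a known computer two minutes after a door key event skips the password prompt, while the same login from an unknown network does not.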

I have two features that need more validation data: locating users and identifying people with keystroke timing. The survey collects data for both of these.

First, location data. When you access any service on the internet, the service has two major options for guessing your physical location: geoip databases (for example Maxmind) and WHOIS. Geoip databases try to maintain a mapping from IP addresses to physical locations. The data is usually accurate to the country level, and in some cases down to specific cities. WHOIS data is usually entered by the ISP, and might point to an actual customer location - typically, a company office address - or to the ISP's office. With an internet connection at your home, WHOIS data usually points to your service provider's office or to a PO box. If you want to, you can check these: for WHOIS data, open and click your IP address below the search box. For geoip, try the Maxmind demo service.

Based on a handful of IP addresses where the actual physical address was known - my home address, our offices, a few customer offices - both geoip and WHOIS databases are relatively accurate. However, estimating the accuracy of the databases from five or so IP addresses is not good enough: the accuracy might vary wildly across the world. To properly validate the accuracy, more data is needed. Fortunately, all modern browsers support geolocation. Even though laptops do not have GPS, wi-fi based positioning is really accurate, especially in densely populated areas (it even worked fine in the countryside in Vietnam!). Comparing actual physical locations to what geoip and WHOIS databases report should give more insight into this. Remember that there's absolutely no need to tie this information to your identity - it's all anonymous.
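One way to express that comparison as a number is the great-circle distance between the browser-reported coordinates and the coordinates a database returns. The sketch below uses the standard haversine formula with a mean Earth radius of 6371 km; the function name is hypothetical, and the real analysis may well use a different error measure.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Calling `haversine_km(browser_lat, browser_lon, geoip_lat, geoip_lon)` for each survey response would give a per-response error in kilometres, which can then be summarized per country or per database.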

Second, keystroke timing (or keystroke dynamics) seems to be a rather unique biometric feature that can be used to identify users. Even though it is hard to see any difference between two people who type at approximately the same speed, millisecond-resolution timing tells a different story. However, even though the problem is already relatively well-defined - identify users by the timing of their keypresses - it is hard to develop and tune an algorithm without baseline data: something that can be used to benchmark the accuracy.
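To show what "timing of keypresses" can mean in practice, the sketch below derives two features commonly used in keystroke dynamics: dwell time (how long a key is held) and flight time (the gap between releasing one key and pressing the next). The event format and the naive distance measure are assumptions for illustration, not the algorithm being developed in the thesis.

```python
from typing import List, Tuple

# Assumed event format: (key, press time in ms, release time in ms).
KeyEvent = Tuple[str, float, float]


def timing_features(events: List[KeyEvent]) -> List[float]:
    """Return dwell times for every key followed by flight times between keys."""
    dwell = [release - press for _, press, release in events]
    # Flight time: next key's press minus the current key's release.
    flight = [nxt[1] - cur[2] for cur, nxt in zip(events, events[1:])]
    return dwell + flight


def mean_abs_difference(a: List[float], b: List[float]) -> float:
    """Naive distance between two feature vectors of the same length."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and equally long")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

Two entries of the same password produce vectors of the same length, so a small distance suggests the same typist and a large one suggests a different person. A real classifier would need something more robust than this single number, which is exactly why baseline data is needed.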

Two public keystroke timing databases are already available. Both of these represent datasets where users are already familiar with the passwords, and enter the passwords on exactly the same hardware, with reliable software measuring the timing. This is not the case with a typical web authentication system. First, people enter passwords they don't use very often, and with one-time passwords, the password is entered only once. Second, timing is measured by the web browser, not by the server. Third, the environment is very rarely exactly the same: people have different keyboards (or laptops) and work from different environments. This might add noise and complexity to the data. The goal of this survey is to create a third database that includes these real-life characteristics.

All the data collected with the survey will be published under the CC BY 4.0 license, because there's no need to keep it secret - others might find it useful. All identifying features will be anonymized first. For example, for location data, an appropriate amount of noise will be added. Also, there's no way to connect the location data and your identity, i.e. your name. Keystroke timing data is published separately with only country-level location information - again, this information is not connected to you in any way, as no-one knows it was you who entered the data.
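As one possible reading of "adding noise" to location data, the sketch below moves each point to a random position within a given radius. The radius, the function name and the flat-earth approximation (reasonable for small radii away from the poles) are all assumptions; the amount and type of noise actually applied to the published data may differ.

```python
import math
import random

KM_PER_DEGREE = 111.195  # approximate length of one degree of latitude


def jitter_location(lat: float, lon: float, radius_km: float,
                    rng: random.Random) -> tuple:
    """Return (lat, lon) displaced by a random offset of at most radius_km."""
    # sqrt() makes the points uniform over the disc instead of clustered
    # around the centre.
    distance = radius_km * math.sqrt(rng.random())
    bearing = 2 * math.pi * rng.random()
    dlat = distance * math.cos(bearing) / KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = distance * math.sin(bearing) / (KM_PER_DEGREE * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon
```

For example, `jitter_location(60.17, 24.94, 5.0, random.Random())` returns a point at most about 5 km from the original coordinates. Passing the random generator in explicitly keeps the function easy to test with a fixed seed.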