elehack.net

Web privacy

Today, I made some changes to our web server code and our privacy policy. The primary effect of these changes is that we no longer record the IP addresses of visitors to elehack.net. This change was prompted by our discovery of the search engine Duck Duck Go and particularly its privacy policy.

As you browse the web, a good deal of information is sent to web sites you view. I want to take this opportunity to provide a run-down of what some of this information is and how it can be used.

IP Address

Any time your computer connects to another computer, such as a web server, that other computer receives your IP address. An IP address is like a mailing address: it tells other computers how to send data to yours. Sending your IP address is nearly unavoidable, as the web server must know where to send the data you requested.

The IP address can be somewhat revealing. IP addresses are allocated geographically, so the address can be used to identify your general location. Further, the address can be looked up to identify its owner. The owner is typically your ISP or, for corporate networks, the company. It is rare for an individual to have an IP address that can be uniquely tracked to them without subpoenaing the ISP. Combined with other information, however, it can still be powerful — if a web site owner knows that only 3 of their likely users use a particular ISP, then knowing the IP address tells them quite a bit.
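To make the owner lookup concrete, here is a minimal sketch in Python. The table of address blocks is hypothetical (it uses ranges reserved for documentation, not real ISP allocations); a real lookup would consult a registry or a geolocation database instead.

```python
import ipaddress
from typing import Optional

# Hypothetical allocation table. These are reserved documentation ranges,
# not real ISP or company assignments.
KNOWN_BLOCKS = {
    "Example ISP": ipaddress.ip_network("203.0.113.0/24"),
    "Example Corp": ipaddress.ip_network("198.51.100.0/24"),
}


def owner_of(address: str) -> Optional[str]:
    """Return the name of the block that contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for owner, network in KNOWN_BLOCKS.items():
        if ip in network:
            return owner
    return None


if __name__ == "__main__":
    print(owner_of("203.0.113.7"))
    print(owner_of("192.0.2.1"))
```

Note that the result names a block owner, not a person, which is why the address alone rarely identifies an individual.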

Since they are fundamental to routing traffic, IP addresses are necessary and difficult to cloak. It is possible to cloak them using a service such as Tor, but such services are slow, cumbersome, and difficult to use properly. There is also ongoing research, with some success, on deanonymizing Tor connections, so it may be feasible for an attacker with sufficient resources to find your IP address anyway.

As I mentioned earlier, the big change in our privacy policy is that we no longer record IP addresses for web site visits. We still record them with blog comments, and may use them for short periods of time for security and abuse-control purposes, but they are not stored in our long-term access logs or used for monitoring general web site use.
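As an illustration only (this is not our actual server code), here is one way to drop the address from an access log line. It assumes the log uses the Common Log Format, where the client address is the first space-separated field; the function name `scrub_client_address` is made up for this sketch.

```python
def scrub_client_address(log_line: str) -> str:
    """Replace the first field of a log line with '-'.

    Assumes Common Log Format, where the first space-separated field
    is the client address.
    """
    if not log_line.strip():
        return log_line  # nothing to scrub on a blank line
    _client, sep, rest = log_line.partition(" ")
    return "-" + sep + rest


if __name__ == "__main__":
    line = '203.0.113.7 - - [30/Jul/2010:13:37:00 -0500] "GET /blog/ HTTP/1.1" 200 5120'
    print(scrub_client_address(line))
```

Scrubbing before the line is written, rather than cleaning the logs up later, means the address never reaches long-term storage.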

Request headers

When your browser requests a web page, it also sends several headers to clarify the request and enable the server to better meet the browser’s needs. These headers announce things such as what kinds of files the browser prefers to receive, the last-known version of the page (to avoid re-sending it if it has not changed), and the version of your web browser and operating system (the User-Agent header).

One of these headers deserves particular mention — the Referer header. Poorly-named due to a codified misspelling, this header contains the URL of the page that referred you to the page that is being requested. Typically, this happens when you click a link: your browser sends the address of the page containing the link to the server which will provide the linked-to page. This can be very useful to website operators, as it allows them to see what sites are linking to theirs and whether those links are effective in bringing traffic.
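For a rough picture of what the server sees, here is a sketch that parses a made-up request. The host names, browser name, and dates are all invented for illustration, and a real server would use a proper HTTP parser rather than this simplified one.

```python
# A made-up request, roughly as a browser might send it.
RAW_REQUEST = (
    "GET /blog/web-privacy HTTP/1.1\r\n"
    "Host: www.example.org\r\n"
    "User-Agent: ExampleBrowser/1.0 (X11; Linux x86_64)\r\n"
    "Accept: text/html,application/xhtml+xml\r\n"
    "If-Modified-Since: Fri, 30 Jul 2010 18:00:00 GMT\r\n"
    "Referer: http://links.example.com/interesting-posts\r\n"
    "\r\n"
)


def parse_request_headers(raw: str) -> dict:
    """Return the request headers as a dict keyed by lower-cased name."""
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:  # the first line is the request line, not a header
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


if __name__ == "__main__":
    for name, value in parse_request_headers(RAW_REQUEST).items():
        print(f"{name}: {value}")
```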

When combined with an IP address, the referrer header becomes particularly powerful. It not only lets the operator narrow views down to smaller groups of people and see which links are being followed to their site; it also tells them something about who is following which links. If the referrer is a search results page, then the header may contain your search query (the remarkable thing about Duck Duck Go’s policy is that they prevent this). This tells someone other than the search engine that someone at your IP address issued a particular search.
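Recovering the query takes very little work. The sketch below assumes the search engine puts the query in a URL parameter named `q`; that name and the example URL are assumptions, and engines differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def search_query_from_referer(referer: str, param: str = "q") -> Optional[str]:
    """Return the search query carried in a Referer URL, if there is one.

    The parameter name "q" is an assumption; search engines vary.
    """
    query_string = urlsplit(referer).query
    values = parse_qs(query_string).get(param)
    return values[0] if values else None


if __name__ == "__main__":
    referer = "http://search.example/results?q=privacy+policy&page=2"
    print(search_query_from_referer(referer))
```

A site that logged this next to your IP address would know both what was searched for and roughly who searched for it.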

In some cases, the referrer itself can contain information about your identity. If you visit a link in a page on a service for which you have an account and the URL on that service contains your account identifier, then that can tell the website operator who you are. Often someone else clicking the link from your profile page would have the same effect, but referrer URLs can sometimes identify you based on your accounts on other services.

Cookies

Cookies are actually a part of the request headers, but due to their special and oft-misunderstood nature they deserve special treatment. A cookie is a small piece of text that a web server stores in your browser. When the server sends you a page, it can tell your browser “Here is a piece of text. Please send it back to me whenever you request another page.” This can be extremely useful for handling things such as logging in to web sites. They can also be used to track your “clickstream” — the sequence of pages you viewed — and identify it as a single user’s session.
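The exchange described above can be sketched with Python's standard `http.cookies` module. The cookie name `session` and its value are made up, and the two helper functions are illustrative rather than anything a particular server uses.

```python
from http.cookies import SimpleCookie
from typing import Optional


def set_cookie_header(name: str, value: str) -> str:
    """Build the header a server sends to hand the browser a cookie."""
    cookie = SimpleCookie()
    cookie[name] = value
    return cookie.output(header="Set-Cookie:")


def read_cookie(cookie_header: str, name: str) -> Optional[str]:
    """Pull one cookie out of the Cookie header the browser sends back."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get(name)
    return morsel.value if morsel is not None else None


if __name__ == "__main__":
    # Server to browser: "here is a piece of text, please send it back".
    print(set_cookie_header("session", "abc123"))
    # Browser to server, on a later request.
    print(read_cookie("session=abc123; theme=dark", "session"))
```

Because the same value comes back with every request, the server can tie a sequence of page views together as one session.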

When used properly, cookies should only be sent back to the server that gave them out in the first place, so they don’t let a server know anything it doesn’t already. I believe there are ways to have cookies sent to other sites, but Firefox’s default configuration is to block such cookies. So, in general, cookies allow a single site to track you and remember things about your browsing. They don’t let sites share information or sniff what you have done on other sites.

One problem, though — that last sentence isn’t true. Many (perhaps most) web pages don’t just contain content from the site itself. They also contain content from other sites. This can be ads, embedded video or images, or fonts, JavaScript code, and other assets stored on third-party servers. These servers, however, are also allowed to send you cookies. Their cookies won’t get mixed with the cookies for the site you’re visiting, but they can still set cookies.

Remember that referer header we talked about? It isn’t just sent when you click a link. When your browser displays an image in a page, it makes a separate request to retrieve the image. When retrieving the image, it sends the page it is displaying as the referrer. If that image is stored on a third-party server...

Advertising networks such as DoubleClick use tracking cookies that are effectively Internet-wide. They store a cookie in your browser containing a unique ID number. It is only sent back to their servers. However, their servers are contacted every time you view a page anywhere on the Internet that uses their advertising service. Therefore, they can track you as a unique individual across every site serving their advertising: any time you see an ad, they know your ID number and what page you were looking at. For DoubleClick or AdWords, that’s a lot of sites.
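Here is a toy simulation of that mechanism. Everything in it is hypothetical (the `AdNetwork` and `Browser` classes, the host `ads.example`, the page URLs), and real ad networks are far more elaborate, but the core idea fits in a few lines: one ID cookie plus the Referer of each ad request.

```python
import uuid
from collections import defaultdict
from typing import Optional


class AdNetwork:
    """A toy third-party ad server that logs each ad request."""

    def __init__(self) -> None:
        # visitor ID -> pages on which that visitor was shown an ad
        self.history = defaultdict(list)

    def serve_ad(self, cookie_id: Optional[str], referer: str) -> str:
        """Record the request and return the ID cookie to (re)set."""
        if cookie_id is None:
            cookie_id = uuid.uuid4().hex  # first sighting: issue a new ID
        self.history[cookie_id].append(referer)
        return cookie_id


class Browser:
    """A toy browser that keeps one cookie per host."""

    def __init__(self) -> None:
        self.cookies = {}  # host -> cookie value

    def view_page(self, page_url: str, ads: AdNetwork, ad_host: str = "ads.example") -> None:
        # Loading the embedded ad sends the ad host its own cookie and
        # the containing page's URL as the Referer.
        self.cookies[ad_host] = ads.serve_ad(self.cookies.get(ad_host), page_url)


if __name__ == "__main__":
    ads = AdNetwork()
    browser = Browser()
    browser.view_page("http://news.example/story", ads)
    browser.view_page("http://shop.example/cart", ads)
    for visitor, pages in ads.history.items():
        print(visitor, pages)
```

Neither site ever sees the ad network's cookie, yet the network ends up with a single history covering both.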

There are also more innocuous vectors for this problem. There are some standard JavaScript libraries, such as the excellent jQuery library, that are used by many different web sites. Google provides hosting for jQuery and other libraries; this allows a web site operator to put code in the page to go get jQuery from Google rather than using space and bandwidth to serve it up themselves. Further, it allows your browser to avoid re-fetching the same code for multiple sites. All told, it is a useful service. Similar services are available for things such as fonts.

Unfortunately, it also means your browser sends the pages you visit, in the form of Referer headers, to Google (or whoever is hosting the library or font), and gives them the opportunity to set cookies.

For these reasons, we avoid embedding third-party content in our web site. We may embed some — if we post any videos, they will likely be embedded via Vimeo — but we generally host things ourselves. This problem is also a major reason we do not participate in some affiliate programs such as Amazon’s; they encourage web site operators to embed their code or images and thus provide a way for them to track users’ visits across affiliate sites. We believe that our affiliates should only know that you read our page if you decide to click the link.

JavaScript and Flash

While not in the category of information sent by your browser, JavaScript and Flash provide additional means for web services to collect information. JavaScript code running in a web page can send information about your browser, such as your screen resolution, back to the server. Flash (and Silverlight) allow similar behavior.

JavaScript and Flash also allow more ways for servers to store and retrieve information in your browser. Flash provides a facility for applications to store settings. Web site operators can use Flash (sometimes even hiding the Flash application) to store information in these settings files and send it back to their server in much the same way as cookies, except that this data is not deleted when you clear your cookies and is shared across all Flash-capable browsers installed on your computer. The result of this last point is that “Flash cookies” allow servers to track you as one person, as they would with cookies, even if you use multiple web browsers.

Modern browsers also provide a way for JavaScript to store more information on your computer. This is used for things like allowing you to read your Gmail without an Internet connection. I am not sure if this has security implications beyond those of cookies; I don’t think it does, although the data is not necessarily cleared when you clear cookies.

We don’t use JavaScript in these icky ways. Currently, we use no JavaScript. We will likely start using JavaScript, but only to enhance your interaction with our site and not to perform invasive information-gathering.

To sum up

We like privacy, and we want to respect yours. We are dropping IP addresses from our access logs to decrease the information we find out about our readers.

There are a variety of things you can do to improve your privacy on the Web. I use and recommend Flashblock, a Firefox extension that prevents Flash applications from being loaded unless you explicitly request them. Not only does it decrease the number of Flash cookies you get, it lets your browser run faster by only showing you the Flash you really want to see. There is also the NoScript extension, which blocks JavaScript except on sites where you particularly want it enabled. I do not use this extension myself, but many people enjoy it.

You can also configure your browser to refuse cookies (which will break many sites) or to only accept cookies for the current browsing session (which causes all cookies to be thrown away whenever you close your browser — effectively re-setting the IDs used to track you except for clever sites that try to use your IP address to recover them). Firefox allows you to make exceptions to the policy as well, to always block cookies from certain sites (I do this with some advertisers) or allow cookies from others to stick around (so you don’t have to log back in to Twitter every time you start your browser).
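The default-plus-exceptions idea can be sketched as a small lookup. This is not how Firefox implements it; the `Policy` names, the `cookie_policy` function, and the parent-domain matching rule are all assumptions made for illustration.

```python
from enum import Enum


class Policy(Enum):
    BLOCK = "block"      # never accept cookies from this site
    SESSION = "session"  # accept, but discard when the browser closes
    KEEP = "keep"        # accept and keep between sessions


def cookie_policy(host: str, exceptions: dict, default: Policy = Policy.SESSION) -> Policy:
    """Return the policy for a host, checking the host and its parent domains.

    Matching on parent domains is an assumption of this sketch.
    """
    labels = host.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in exceptions:
            return exceptions[candidate]
    return default


if __name__ == "__main__":
    exceptions = {
        "ads.example": Policy.BLOCK,  # a hypothetical advertiser
        "twitter.com": Policy.KEEP,   # stay logged in
    }
    for host in ("www.twitter.com", "tracker.ads.example", "elehack.net"):
        print(host, cookie_policy(host, exceptions).value)
```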

There are also ad-blockers such as Adblock Plus that block many web ads (and the privacy problems with them!). Unfortunately, as discussed by Ars Technica, that also hurts web site operators: you view their content without viewing the ads, and they get paid (and support their site!) by showing ads. Even if you don’t click ads, many sites are paid just for showing you the ad. There is therefore a touchy balance between protecting your privacy and allowing web authors to get paid. I have no problem with blocking popups or other excessively invasive advertising methods, but blanket-blocking all advertising is harsh to content providers. Flashblock even causes something of a problem here. If you don’t have Flash, your browser won’t claim to support it, and the server will usually send you a non-Flash ad. But Flashblock makes the browser claim to support Flash and then not display it. The result is that the server thinks you can see a Flash ad, but you see no ad at all. At this point, I think the problems with Flash, particularly on Linux, warrant running Flashblock, but I wish that content providers with respectful advertising (such as Ars Technica) could send me fall-back image ads in place of the Flash ads I don’t see.

Privacy is a big topic, and it’s impossible to cover it in any reasonable or accurate depth in a single post. The Electronic Frontier Foundation has a variety of articles providing information on privacy and covering news affecting online privacy. They also provide an interesting service that tells you what information your browser sends to servers.

Comments

Comment from anonymous reader on July 30, 2010 at 1:37 PM CDT

Comment from Tom on July 30, 2010 at 1:38 PM CDT

That "comment from anonymous reader" was me. I wanted to see if "all fields are optional" was in fact true. Guess it is!

Post a Comment

You may post a comment using the form below. All fields are optional. By submitting a comment, you release it to Michael and Jennifer Ekstrand under the Creative Commons Attribution 3.0 license. See our copyright notice for details. You might also want to read our privacy statement.