I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me.
As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note, this post is not related to official Yahoo! policy, it’s just my personal view.
The issue Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar and later noticed some synthetic responses from Bing with the synthetic queries.
There are two kinds of disagreement which people might have with this.
One is the privacy disagreement “Big Brother Microsoft is looking at what I search and using it”. I’m sympathetic on this count, but also sympathetic to the counter argument, that the data collected has value and can enhance the results for all users. In the end, I think companies should simply do their best to accept a user’s wishes, so those who want privacy can have it, and those who want to contribute their data towards improving a search engine can do so. The precise manner for achieving this by opt-in, opt-out, differential privacy, anonymization or other techniques is not entirely clear to me.
Let’s assume the privacy issue is dealt with. This is at least partly and possibly grossly untrue, but I want to focus on the other issue, and this assumption simplifies it’s discussion because a user and their internet browser are synonymous when the privacy issue is dealt with, as the agent’s actions are a true reflection of the user’s preferences.
The other issue is an originality disagreement, which much of the discussion focuses on. What I believe happened was a user feedback process, where users queried Google, clicked on a result, informed Microsoft/Bing of the query and clicked result, and their preference was used to promote the search result within Bing. Now, there is a slippery-slope of questions. Should a user be allowed to:
- Reveal to their chosen search engine their preferred result?
- Reveal to a competitor’s search engine their preferred result?
If you answer ‘no’ to the first, you are deeply against user freedom in a manner I can’t sympathize with. If you answer ‘yes’ to the first, and ‘no’ to the second, then you are still somewhat against user freedom. This isn’t too crazy a stance, as various people sell information and require of their users that it not be retransmitted. One of the more famous examples of this is the Bloomberg Terminal. However, in all instances I’m aware of, users knowingly agree to a contract providing access to the information with limitations. Google never entered into such a contract with it’s users, and I don’t know a sound basis for even an implicit contract. So, my answer are “yes, and yes” here.
But this doesn’t entirely deal with the issue of originality. You could argue that it’s ok for Microsoft to take advantage of revealed user interaction, but it’s still a matter of following rather than leading. This argument is simplistic and wrong, as I expect all informed parties already understand. A basic truth seen in many ways, is that the proper incorporation of new sources of information always improves results. This is true in machine learning where sample complexity results and cotraining formalize mechanisms and values of incorporating additional information, and it was heavily used by all competitive teams in the Netflix Competition. More generally, it’s true in basic knowledge engineering, where people fuse sources of information to create a better system, and I’m virtually certain it’s true of the ranking algorithms behind Google and Bing, which are surely complex beasts taking into account many sources of information. I know no details about the algorithm which Microsoft is using, but it’s quite plausible that they incorporated this information well enough to improve the quality of their results, perhaps in some instances so they are better than Google’s or the earlier version of Bing’s. If that’s the case, Google will either follow Microsoft’s lead taking into account user feedback as Microsoft does, or risk becoming obsolete.
We can also think about things in terms of the future. A basic truth, is that building a successful search engine is extraordinarily difficult. This is revealed by search market share, but also by simply thinking about the logistics involved. You need to crawl the web, have server farms all over the world (because the speed of light just isn’t fast enough), and incorporate many sources of information in just the right way in order to succeed, all while adversaries try to corrupt your results. If we prefer a future where there is a healthy competition amongst search engines, then it’s important to lower these barriers to entry so new people with new ideas can more easily test them out. One way to lower the barrier to entry is to accept that users can share their interaction, even with a competitor’s search engine.
Perhaps it’s inevitable that Amit Singhal has a viewpoint driving towards a monopoly on internet search. However, Google has generally been relatively good about supporting a rich ecosystem of innovation for information technology development, so I am still somewhat surprised. I would be more sympathetic to a position for allowing users of Internet Explorer a built-in means to choose to share their search behavior with Google or other search engines on an equal footing.