Machine Learning (Theory)

2/2/2011

User preferences for search engines

I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me.

As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple of minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note that this post is not related to official Yahoo! policy; it’s just my personal view.

The issue: Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar, and later noticed some synthetic responses from Bing for the synthetic queries.

There are two kinds of disagreement which people might have with this.

One is the privacy disagreement: “Big Brother Microsoft is looking at what I search and using it”. I’m sympathetic on this count, but also sympathetic to the counter argument, that the data collected has value and can enhance the results for all users. In the end, I think companies should simply do their best to accept a user’s wishes, so those who want privacy can have it, and those who want to contribute their data towards improving a search engine can do so. The precise manner for achieving this by opt-in, opt-out, differential privacy, anonymization or other techniques is not entirely clear to me.

Let’s assume the privacy issue is dealt with. This is at least partly and possibly grossly untrue, but I want to focus on the other issue, and this assumption simplifies its discussion because a user and their internet browser are synonymous when the privacy issue is dealt with, as the agent’s actions are a true reflection of the user’s preferences.

The other issue is an originality disagreement, which much of the discussion focuses on. What I believe happened was a user feedback process, where users queried Google, clicked on a result, informed Microsoft/Bing of the query and clicked result, and their preference was used to promote the search result within Bing. Now, there is a slippery-slope of questions. Should a user be allowed to:

  1. Reveal to their chosen search engine their preferred result?
  2. Reveal to a competitor’s search engine their preferred result?

If you answer ‘no’ to the first, you are deeply against user freedom in a manner I can’t sympathize with. If you answer ‘yes’ to the first, and ‘no’ to the second, then you are still somewhat against user freedom. This isn’t too crazy a stance, as various people sell information and require of their users that it not be retransmitted. One of the more famous examples of this is the Bloomberg Terminal. However, in all instances I’m aware of, users knowingly agree to a contract providing access to the information with limitations. Google never entered into such a contract with its users, and I don’t know a sound basis for even an implicit contract. So, my answers are “yes, and yes” here.

But this doesn’t entirely deal with the issue of originality. You could argue that it’s ok for Microsoft to take advantage of revealed user interaction, but it’s still a matter of following rather than leading. This argument is simplistic and wrong, as I expect all informed parties already understand. A basic truth, seen in many ways, is that the proper incorporation of new sources of information always improves results. This is true in machine learning, where sample complexity results and cotraining formalize mechanisms and values of incorporating additional information, and it was heavily used by all competitive teams in the Netflix Competition. More generally, it’s true in basic knowledge engineering, where people fuse sources of information to create a better system, and I’m virtually certain it’s true of the ranking algorithms behind Google and Bing, which are surely complex beasts taking into account many sources of information. I know no details about the algorithm which Microsoft is using, but it’s quite plausible that they incorporated this information well enough to improve the quality of their results, perhaps in some instances so they are better than Google’s or the earlier version of Bing’s. If that’s the case, Google will either follow Microsoft’s lead, taking into account user feedback as Microsoft does, or risk becoming obsolete.
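
To make the idea of fusing information sources concrete, here is a toy sketch. It is not a description of how Bing, Google, or any real search engine works: the function name, the weights, the example URLs, and the click counts are all made up for illustration. It scores each document by a text-match signal plus a smoothed share of observed user clicks, so the click feedback is one input among others rather than the whole answer.

```python
from collections import Counter

def rank(docs, text_score, clicks=None, click_weight=1.0):
    """Order docs by text score plus an optional smoothed click-share signal.

    Hypothetical toy ranker: both signals and the weight are assumptions.
    """
    clicks = clicks or Counter()
    total = sum(clicks.values())

    def score(doc):
        # Smoothed share of observed clicks that went to this document.
        click_share = (clicks[doc] + 1) / (total + len(docs))
        return text_score[doc] + click_weight * click_share

    return sorted(docs, key=score, reverse=True)

docs = ["a.example", "b.example", "c.example"]
# Made-up text-match scores: the text signal alone slightly prefers a.example.
text_score = {"a.example": 0.50, "b.example": 0.45, "c.example": 0.10}
# Made-up feedback: users overwhelmingly clicked b.example for this query.
clicks = Counter({"a.example": 2, "b.example": 18})

without_feedback = rank(docs, text_score)
with_feedback = rank(docs, text_score, clicks)
print(without_feedback)
print(with_feedback)
```

With these made-up numbers the text signal alone puts a.example first, while adding the click signal moves b.example to the top. For a query where no other signal exists at all, as with a synthetic query, the click signal would be the only thing distinguishing the documents.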

We can also think about things in terms of the future. A basic truth is that building a successful search engine is extraordinarily difficult. This is revealed by search market share, but also by simply thinking about the logistics involved. You need to crawl the web, have server farms all over the world (because the speed of light just isn’t fast enough), and incorporate many sources of information in just the right way in order to succeed, all while adversaries try to corrupt your results. If we prefer a future where there is a healthy competition amongst search engines, then it’s important to lower these barriers to entry so new people with new ideas can more easily test them out. One way to lower the barrier to entry is to accept that users can share their interaction, even with a competitor’s search engine.

Perhaps it’s inevitable that Amit Singhal has a viewpoint driving towards a monopoly on internet search. However, Google has generally been relatively good about supporting a rich ecosystem of innovation for information technology development, so I am still somewhat surprised. I would be more sympathetic to a position for allowing users of Internet Explorer a built-in means to choose to share their search behavior with Google or other search engines on an equal footing.

24 Comments to “User preferences for search engines”
  1. Scott says:

    Thank you. I’ve felt the discussion was terribly misleading and thought Google’s holier-than-thou response (http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html) simultaneously comical and mendacious (I don’t think Bing denied what Google is implying they denied.)
    Disclaimer: My wife works in marketing for Windows Live.

  2. [...] This post was mentioned on Twitter by Programming Tweets and Raza Rizvi, Jon Elsas. Jon Elsas said: John Langford (ML Guru @ Yahoo!) on Google v. Bing: http://hunch.net/?p=1660 [...]

  3. Why exactly were Google search engineers entering their pet misspelled query [torsorophy] into Bing in the first place? I’d hate to think they were looking over their competitor’s shoulder to see how they responded to queries.

    Whatever their motivation, running an SEO operation on Bing using 20 engineers and then blogging about it demonstrates just how much they’re willing to invest in competing with Bing. Perhaps they’re still stinging from the accusations that they copied Bing on several interfaces last year.

    What’s especially ironic about Singhal’s claim that “some Bing results increasingly look like an incomplete, stale version of Google results” is that Google suffers from the same sort of irrelevant results for the same reason, namely SEO.

  4. Sebastian says:

    John, you raise very interesting points and you helped me broaden my viewpoint on the matter.

    However, while I would agree with revealing my preferred result to a competitor’s search engine (CSE), how fair is it for a CSE to deliver results which rely 100% on other SEs?

    This seems pretty unfair.

    If you query those synthetic queries on Bing, and then “alaska” on that same search engine, in both cases you would get results, but the former query gives you 100% Google’s results.
    Wouldn’t you at least give credit??

    Imagine Yahoo using Bing but not revealing it, and continuing to pretend it’s their algorithm producing the results.

    I believe it’s a matter of intellectual honesty.

    The Googlers’ experiment hit right on the point, and it is fair for them to raise the issue.
    Maybe the outcome turns out to be what you somehow predict… everyone sharing information and search engines broadening their data sources, including competitors’ data.

    Anyway, I believe this is not how everyone understands this industry right now, so Google’s heads-up raises a good point.

    Best

    • jl says:

      I’m assuming Bing is incorporating information via user feedback, rather than actually copying. If they are actually copying, then of course they should give credit.

      For user feedback, the idea of giving credit is appealing in principle, and it might even be feasible for a subset of the queries which are only issued to one search engine initially. But for anything else (meaning almost all queries), I expect it’s impossible because no one source of information decides the returned results.

      You might think, “well, they should give partial credit”. But it becomes very clumsy very quickly. If search engine A’s second result is clicked on by a user who tells search engine B, and B returns the second result as the first result, does credit go to search engine A? The user? And what if search engine B returns it as the 5th result? And what if data from search engine C is entering the formula? And what about data from the webpage itself? When you have a complex algorithm, there is no one canonical way to describe the influence of an input variable on the output, and trying to give credit appropriately would be a user interface nightmare.
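
      A small arithmetic sketch of why there is no canonical attribution, assuming a made-up ranker in which two signals interact multiplicatively (the function and the numbers are hypothetical, not taken from any real system). The credit assigned to the click signal depends on whether you measure it before or after the other signal is added.

```python
def score(text_match, click_signal):
    # Hypothetical nonlinear ranker: the two signals interact multiplicatively.
    return (1 + text_match) * (1 + click_signal)

text, click = 1.0, 3.0
baseline = score(0, 0)

# Credit to the click signal if it is added first, before the text signal.
click_added_first = score(0, click) - baseline
# Credit to the click signal if it is added last, after the text signal.
click_added_last = score(text, click) - score(text, 0)

print(click_added_first, click_added_last)  # 3.0 6.0
```

      The same input gets credit 3 or credit 6 depending on the order of accounting, and neither answer is more correct than the other.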

      You might think, “well, if they can’t give partial credit, they shouldn’t use the information”. But this would be catastrophic. Suppose search engine A does a study on a subset of 100 queries comparing their results and search engine B’s results. Someone at engine A notices that engine B had a great result for one of the queries and adjusts their algorithm so that engine A produces the same result on that query. Should engine A give credit to engine B? And how much credit? Remember: the same algorithm is used to return the results for many queries, not just the one that was ‘debugged’. I’m nearly certain this kind of thing has happened at every search engine with a pulse, and the world has benefited from better search because of it.

      • Sebastian says:

        I understand the trouble of implementing it, but those synthetic queries relied 100% on another search engine.

        Of course there may be no other cases in which Bing only throws Google’s results, but if others exist, I would not show results without giving credit.

        I believe this is the neat way to do it. It’s an addendum to the implemented algorithm…

        • Bayjinger says:

          Sebastian, on the subject of “giving credit”, you are essentially asking Google and Microsoft to publish their search algorithm. I’m sure Microsoft would more than happily comply, if only to see Google do the same.
          Google crawls the web for information to build its search service. It never says “I’m ranking this page high for this search term because Wikipedia links to it”. It promotes the “open” web, while claiming a need for secrecy of its algorithm. So it should not complain when others, even competitors, use its publicly available free service to enhance the quality of their services.

  5. Paul says:

    The “should users be allowed to reveal…” doesn’t exactly hold, as I doubt it was ever the user’s intention to give these results to Microsoft for this purpose. You could make a good business offering cheap corporate computer repairs, selling all the documents you find on the computers to hedge funds. Using vague EULAs is not the same as active user desires, e.g. users wanting to take their contact lists from Facebook into another application. So while I’ve got no objections to Microsoft, say, recording their employees’ search patterns on Google, this is far from a clear-cut case of user intentions. And the Google post leaves vague whether this was the Bing Toolbar or IE8 (I suspect they know it’s the Toolbar, but left it vague to raise the specter of antitrust), because if it were IE8, that’d strike me as highly anticompetitive: if you own a browser (adjacent market), you get to spy on your search competitors’ users; if you don’t, you have to build your results ‘organically’.

    Of course, Google has an easy solution here. More effective than complaining would be to selectively serve skewed results to all the IP addresses in their control: spam pages for all the products, comic misunderstandings of the queries. When you know you’re being watched, use that to your advantage.

  7. @Sebastian: Singhal makes it clear that Bing was modeling user click data, not the Google results themselves. What Google did to Bing is similar in spirit to Google bombing, which is a kind of SEO.

    @Paul: Google’s gazillion-page EULA and privacy terms for Chrome aren’t exactly light reading. For instance, the first clause of the privacy terms seems to indicate that Google is reserving the right to do exactly what Bing’s doing, but it’s hard to tell given qualifying language and the language in the opt-out instructions.

    @Paul (part two): How could Google serve skewed results in this context? What Bing’s using is feedback from ordinary people interacting with Google through their browser. If Google skewed results going to IE, not only would they be giving up an even larger market than China, but they’d probably be in anti-trust hot water.

    • Paul says:

      I didn’t mean serve bad results to IE, I mean serve bad results to Google computers. With all its data centers, Google surely owns a large swathe of non-contiguous IP addresses, such that there’s no real way for Microsoft to know which addresses are Google-owned and which aren’t. Install the Bing toolbar on a whole bunch of computers, and script them to search for popular search terms and click on poor quality links. Since Google knows which IP addresses are involved, it can serve the spam pages to only those IP addresses. No real users see bad results, but Bing suddenly finds that the results it’s getting from its toolbars aren’t very useful anymore.

      • Todd says:

        Don’t you think that would be an even clearer case of click fraud by Google?

        • Paul says:

          This whole thing occupies a pretty gray area of competition. I personally don’t see giving off false signals as any worse than trying to reverse engineer what your competitor is doing. If Google searches Bing and clicks bad links, yeah, that’s click fraud. But if Google is running fake searches on its own engine, I’d say it’s Bing’s responsibility to make sure it’s not getting bad data. The Google EULA says you aren’t allowed to copy the results; if Microsoft is doing so anyway, it’s at their risk.

  8. alext says:

    @Bob,

    your comparison with Google bomb doesn’t hold: Google bomb uses Google’s own data and algorithms to improve ranking, not the results of another search engine. The point is that Bing’s results were based entirely on the results of another search engine.

  9. @alext: A Google bomb is a set of links with a specific phrase as anchor text that is designed to trick Google into showing the target of the link in searches for the phrase in the anchor text. What Google did to Bing was use IE toolbar to plant fake clickthrough data for artificial queries in order to trick Bing into showing the clicked-to page as a result when searching for the artificial query. That seems similar to me, but then I have a very liberal attitude toward analogies.

    @Paul: Thanks for the clarification. Yes, that’d probably work, though some might construe it as a touch “evil”.

  10. beniz says:

    “One way to lower the barrier to entry is to accept that users can share their interaction”. This is what the Seeks Project does ( http://www.seeks-project.info/ ).

  11. J says:

    Sorry guys, but I disagree. Toying with definitions and semantics won’t suddenly change copying to “legitimate innovation”. It’s like taking someone else’s software, putting a “Microsoft” sticker on the box, and selling it as if it’s yours. They can complain that the barrier to entry is high, but that won’t make copying any more acceptable, just like you can’t plagiarize a paper because “Math is hard”. Either study up (invest more) or don’t do it.

    By the way, I am not a “Microsoft hater” and have always used their products (not all of them of course; I prefer Windows over Linux, but Firefox over IE).

  12. Armchair Guy says:

    I have to disagree with the idea that, since Microsoft is using user clicks, what it’s doing is legitimate. It’s certainly a privacy issue, but *whose* privacy? Microsoft isn’t just monitoring sites users visit. It’s monitoring Google’s processes.

    In other words, when the user gives Microsoft permission to monitor his clicks, Microsoft doesn’t get permission to monitor Google-generated search results. The search results don’t belong to the user, they belong to Google. The user has Google’s permission to use the results, but not necessarily to pass along Google results of all his searches to Microsoft. Microsoft needs Google’s permission to farm Google search results in such a systematic way.

    Thus, it IS a privacy issue, just identified differently.

  13. Yisong Yue says:

    Regardless of moral or ethical issues, I think it’s ultimately inevitable that such data harvesting happens — and I’m fairly certain it will happen more frequently in the future. As John mentioned, these systems are highly complex and are optimized using data-driven approaches.

    I am personally of the view that I should own my search logs. If I am allowed to share my usage data with any company that I choose, then I think that’s a big win for everybody (except for the company currently holding a monopoly over the usage data, of course).

  14. Fluffy says:

    Looking past the fact that you assumed away the most troubling issue, that of user permission, it seems doubtful that allowing this type of behavior results in a significant lowering of the barriers to entry for anyone but a small group of companies.

    To collect a large enough sample of user interactions for enough different search results to really use it in your algorithm, you would need to have your search bar installed on a lot of machines.

    The new people with the good ideas will have to be backed by a large corporation to have that base level of installations.

  15. Tom says:

    Seeing that Google’s own search engine uses web pages built by others without request to create a search result, Google itself is doing what Microsoft did. If I had the time and money to build thousands of fake web pages pointing to useless websites so Google would end up using my results….

    wait, doesn’t that happen exactly to Google? So could I then say that Google stole and copied several organised lists of links, like, say, Yahoo’s?

    Google built its dominance on the back of others’ public web pages and now complains that others are using its public search results?

    Calling the kettle black is a clear example of how you’re no longer concentrating on innovating, but on protecting your turf…

    And on innovation Google’s done the best and the worst (and by worst I mean reverse engineered others to its own gain).

  16. Jeff says:

    This really has nothing to do with user freedom. I should be able to share my preferred search results with Bing, but Bing can’t just use them unmodified, because I found these results _through_ Google. Imagine I had the recipe for Coca Cola, and I “revealed” to Pepsi that this recipe is my preferred soft drink. Does this entitle Pepsi to start using Coke’s recipe? Of course not! In other words, a list of search results is not mine to give to Bing, just as it is illegal for me to copy the phone book and sell that.

    I sympathize with the need to incorporate user feedback into machine learning, but this is not the way to do it. Bing’s argument is that this is only one factor of hundreds that goes into producing search results. True, and I agree Google’s claims are more grandiose than they really are. But Bing shouldn’t be able to copy ANY of Google’s results. Hypothetically, imagine they now decided to remove all of the other factors that go into producing search results. Now they are left with nothing but Google’s results, and the legality is cut and dried.

    Here’s a legitimate way to use user feedback: Bing listens in on users and figures out which search results (from Google) people like. Now, instead of copying these results, Bing should try to modify its algorithm to _produce_ these results on its own. In other words, use a variety of statistical techniques in order to predict the Google results from Bing input cues.

    There’s nothing wrong with Bing acquiring data from people using Google, as long as we’re setting privacy concerns aside. The concern is merely that, because Google’s searches were synthetic, there is no POSSIBLE way Bing could have modified its algorithm to produce those results. They were stolen.

    • jl says:

      There are a couple of failures of analogy and perhaps a wrong surmise on which you base your conclusion.

      (1) In the Coca Cola analogy, it’s as if the recipe is revealed to every drinker. If every drinker knows the recipe, it becomes common knowledge, and the unreasonableness of copying the recipe is far less clear.

      (2) I’m not at all clear that users revealed a _list_ of search results to Bing, rather than merely the result which was clicked on.

      (3) Given this, it’s more like Bing has (sometimes) excerpted single phone numbers from the phone book, something which virtually everyone has done and which is accepted as reasonable.

      Your ‘legitimate method’ is my working hypothesis about what actually occurred. You claim this is not what could have happened, and yet if you read the Google post, you’ll see that they had 20 different people constantly telling Bing/Microsoft what the “right” answer was, and only a few of the synthetic queries were tuned to produce what people were chronically telling Microsoft was correct. This is entirely consistent with your “legitimate way”.
