File this post under, "I meant to publish this a month ago but forgot." Anyway ... In March of 2005 I complained about all of the attention given to the last team left out of the NCAA basketball tournament when the real problem lay in the seeding. My suggestion: seed teams 1-4 and hold a random-draw play-in round among 96 teams for the other 48 spots, with seeds randomly assigned to the winners of the play-in round. The NCAA's seeding of teams 5-12 appeared to rely more on conference affiliation or name recognition than on genuine differences in ability. Over the past few years, the increasing flow of one-and-done players to the NBA has only reinforced my suspicions.
Data from the highly competitive 2011 Tourney strongly suggest that the differences between, say, 5 and 11 seeds are slight, if not non-existent. In fact, seeding 4-13 has little informational content. If all of the first round games are used, though, the seeding doesn't look too bad. Here are the results of a simple regression of score differential on seed differential:
Score Differential = 4.0 - 1.7xSeed Differential (Score Diff Explained = 36%)
In fact, the seed differential does a slightly better job (36 percent to 32 percent) than the Vegas line in predicting scores within the sample:
Score Differential = -1.8 + 1.0xVegas Spread (Score Diff Explained = 32%)
In this setup, the Vegas line displays a 1-to-1 relationship with the score differential. Both the seed differential and the Vegas line jump up 10 percent in explanatory ability if non-linear effects are included. The Selection Committee and the Vegas oddsmakers and bettors appear to find and use useful information.
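For anyone who wants to try this kind of fit on their own numbers, here is a minimal sketch of a one-variable least-squares regression that reports the intercept, the slope, and the share of score differential explained (R-squared). The function name `fit_line` and the sample numbers are my own placeholders, not the 2011 data, and the sign convention for the differentials is an assumption.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x. Returns (a, b, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0:
        raise ValueError("x has no variation, so a slope cannot be estimated")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variation in y that the fitted line accounts for.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / syy if syy > 0 else 0.0
    return intercept, slope, r_squared


# Made-up placeholder numbers for illustration only, NOT the 2011 results.
# Assumed convention: favorite's seed minus underdog's seed (negative values),
# and favorite's score minus underdog's score.
seed_diff = [-15, -13, -11, -9, -7, -5, -3, -1]
score_diff = [30, 18, 21, 9, 12, -2, 5, -4]

a, b, r2 = fit_line(seed_diff, score_diff)
print(f"intercept={a:.1f}, slope={b:.1f}, share explained={r2:.0%}")
```

Swapping the Vegas spread in for `seed_diff` gives the second kind of equation, and restricting both lists to a subset of games is all it takes to rerun the fit on a narrower band of seeds.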
However, cutting the sample to seeds 4-13 produces very different and sketchy results:
Score Differential = -2.8 - 0.05xSeed Differential (Score Diff Explained = 0.2%)
Score Differential = -5.3 - 0.7xVegas Spread (Score Diff Explained = 7%)
While Vegas "explains" more of the score differences, the direction of the effect is wrong -- bigger spread, smaller score differential. Both of these models do a little better with non-linear effects, but not much. Seed Differential increases to only 4%. Beyond score differences, both do a poor job of predicting winners. Of the 20 games, the higher-seeded team lost 7 and the team with a point spread advantage lost 9.
A big gap exists between teams seeded 13 and those seeded 15 and 16, giving the top-seeded teams a huge advantage by effectively removing one round of the tourney. In spite of this one-game "bye," even 1 and 2 seeds are very vulnerable once the truly weak teams are sifted out.
Of course, a single year of data hardly proves anything, but it is suggestive that the NCAA Selection Committee spends countless hours finding ways to interpret data that contains a lot more noise than signal.