S&P Sector Constituent Database – Garbage In, Garbage Out

We are currently conducting research using 23 years of historical constituent data for the S&P 500 sectors.  But if our database isn’t accurate then our test results will be worthless.  I started writing this post about the processes we went through to ensure that the historical data we used was clean and that our constituent list was accurate.  But then I realised that no one cares how many fail-safe cross-checks were made or how difficult the process was.  The only thing people care about (the only thing that matters) is being able to prove that the database is accurate.

So how do we prove that we are working with an accurate database?

Well, the second half of our S&P 500 Sector constituent list (Sept 2001 – March 2013) came directly from our insider at State Street, the company that actually issues the Select Sector SPDR ETFs.  With the data for this period coming straight from the horse’s mouth, it is safe to say that its accuracy can be relied upon.  It also contains an abundance of information, enough to reconstruct the ETFs, including:

Company Name, Symbol, Exchange, Shares, Float, Float Shares, Multiplier, Adjusted Shares, Last Sale, Previous Close, Index Weight, Index Market Value, Market Value (Unadjusted Shares), Current Cap, Divisor, Previous Cap, Number of Components, Sum Of Adjusted Shares, Calculated Index, Published Index, # of Stocks, Sum of Adjusted Shares, Capitalization Using Unadjusted Shares, Estimated Weight of Index Components in the S&P 500…

The first half of our S&P 500 Sector constituent list however (Feb 1990 – Aug 2001) was compiled from several sources of varying reliability and only consists of dates and symbols.  Plus most of the stocks had to be classified into their sectors manually.

The best way to prove the accuracy of our database, then, is to reconstruct the sector indices and measure, for each of the two periods, the correlation coefficient between our reconstruction and the actual indices published by S&P.  If our data is good then we should be able to closely reproduce the Equal Weighted S&P 500 Sector Indices.
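As a rough illustration of that reconstruction, here is a minimal Python sketch.  It assumes the index is rebuilt as a simple average of member returns each day and that correlation is measured on daily returns; the function names, the symbols (AAA, BBB, CCC) and the numbers are all made up for the example and are not taken from our database.

```python
from math import sqrt

def equal_weight_index(daily_returns, members_by_day, base=100.0):
    """Build an index that is equally weighted every day.

    daily_returns:  list of {symbol: simple return} dicts, one per trading day
    members_by_day: list of sets holding the index members on each day
    """
    level = base
    levels = []
    for returns, members in zip(daily_returns, members_by_day):
        # Only current members with a return for the day contribute.
        held = [returns[s] for s in members if s in returns]
        if held:
            level *= 1.0 + sum(held) / len(held)
        levels.append(level)
    return levels

def pct_changes(levels):
    """Convert index levels into day-over-day simple returns."""
    return [b / a - 1.0 for a, b in zip(levels, levels[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical three-day example; CCC replaces AAA on the third day.
daily_returns = [
    {"AAA": 0.01, "BBB": -0.01, "CCC": 0.02},
    {"AAA": 0.02, "BBB": 0.00, "CCC": -0.03},
    {"AAA": -0.01, "BBB": 0.03, "CCC": 0.01},
]
members_by_day = [{"AAA", "BBB"}, {"AAA", "BBB"}, {"BBB", "CCC"}]

levels = equal_weight_index(daily_returns, members_by_day)
rebuilt = pct_changes(levels)
```

In practice `pearson(rebuilt, official)` would be run once per period, where `official` stands for the daily returns of the published equal weight index over the same dates.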

In this post there is reference to several different indices.  Here are a number of relevant links:

Index and ETF Link Matrix

Sector | Market Capitalization Weighted Index | Select Sector Index | Select Sector SPDR ETF | Equal Weight Index | Equal Weighted ETF
S&P 500 | SPX / GSPC / INX | - | SPY | SPW / SPXEW | RSP
Materials | S5MATR / SPXM | IXB | XLB | S15 | RTM
Energy | SPN / SPXE | IXE | XLE | S10 | RYE
Industrials | S5INDU / SPXI | IXI | XLI | S20 | RGI
Financials | SPF / SPXF | IXM | XLF | S40 | RYF
Cons Staples | S5CONS / SPXS | IXR | XLP | S30 | RHS
Technology | S5INFT / SPXT | IXT | XLK | S45 | RYT
Utilities | S5UTIL / SPXU | IXU | XLU | S55 | RYU
Health Care | S5HLTH / SPXA | IXV | XLV | S35 | RHY
Cons Discret | S5COND / SPXD | IXY | XLY | S25 | RCD

 

Now, to keep things simple the ETFHQ constructed indices will be equally weighted on a daily basis rather than quarterly.  For this reason our results won’t be identical to those of S&P, but this is not an issue.  As long as the level of correlation Feb 1990 – Aug 2001 is not far below the level of correlation Sept 2001 – March 2013 then our hard work and patience have paid off:

[Figure: Correlation - S&P EW Index vs ETFHQ]

(Special thanks to Mr Anonymous for sending us some data that we needed for these tests).  As you can see above, the results are even better than we could have hoped.  In many cases the correlation for the first half of our data is greater than that for the second.  How is this possible when we know that the data from Sept 2001 – March 2013 is from a reliable source?  Because during this period the market has endured some extreme turmoil.  Extreme stock behavior will result in greater index discrepancies when the component weightings are not identical.
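To make that last point concrete, here is a toy Python sketch comparing a basket that is reset to equal weights every day with one whose weights are left to drift (standing in for the stretch between quarterly rebalances).  The two-stock paths and the function names are invented for the illustration, and the sketch shows a gap in index level rather than in correlation.

```python
def daily_rebalanced(paths):
    """Growth of $1 when weights are reset to equal every day."""
    level = 1.0
    for day_returns in zip(*paths):
        level *= 1.0 + sum(day_returns) / len(day_returns)
    return level

def drifting_weights(paths):
    """Growth of $1 split equally at the start and then left alone."""
    finals = []
    for path in paths:
        growth = 1.0
        for r in path:
            growth *= 1.0 + r
        finals.append(growth)
    return sum(finals) / len(finals)

def gap(paths):
    """Absolute difference between the two weighting schemes."""
    return abs(drifting_weights(paths) - daily_rebalanced(paths))

# Hypothetical two-day paths: one stock rising, one falling.
calm = [[0.01, 0.01], [-0.01, -0.01]]
turbulent = [[0.10, 0.10], [-0.10, -0.10]]

print(gap(calm), gap(turbulent))
```

With these made-up numbers the gap between the two schemes is roughly 0.0001 in the calm case and roughly 0.01 in the turbulent case, so moves ten times larger produce a discrepancy about a hundred times larger.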

So with this we have definitive proof that our data and constituent list are extremely accurate.  Let the testing begin!

But before we do that, for those that are interested, below you will find charts that display each index: the S&P version vs the ETFHQ version, including a rolling 252-day (one trading year) correlation coefficient.

[Chart: S&P 500 vs S&P 500 Equal Weighted Index]

The chart above actually shows the correlation between the S&P 500 (official) and the S&P 500 Equal Weighted Index (official).  I have included it to illustrate why we didn’t test our results against the standard market cap weighted indices.

Stocks in companies of different sizes can behave very differently at times and for that reason market cap and equally weighted indices perform very differently.  In fact, in this case the two diverged to such an extent that the rolling correlation dropped to -44.52% (a coefficient of -0.4452).  That means that, over a full trading year, they tended to move in opposite directions despite tracking the exact same 500 stocks!
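For anyone wanting to reproduce a rolling correlation like the one quoted above, here is a minimal Python sketch.  It assumes the coefficient is computed on daily returns over a trailing window (252 days by default); whether the charts in this post use returns or index levels is not spelled out here, so treat that as an assumption.  The function names and sample numbers are illustrative only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def rolling_correlation(xs, ys, window=252):
    """Correlation of each trailing `window` observations of xs and ys."""
    if len(xs) != len(ys):
        raise ValueError("series must be the same length")
    return [
        pearson(xs[end - window:end], ys[end - window:end])
        for end in range(window, len(xs) + 1)
    ]

# Hypothetical returns; b is the mirror image of a, so every window
# should come out at (approximately) -1.
a = [0.01, -0.02, 0.03, 0.01, -0.01]
b = [-x for x in a]
rolling = rolling_correlation(a, b, window=3)
```

A reading of -44.52% on the chart corresponds to a value of about -0.4452 in the list this function returns.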

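[Chart: S&P 500 Equal Weight Index vs ETFHQ]

[Chart: Materials Equal Weight Index vs ETFHQ]

[Chart: Energy Equal Weight Index vs ETFHQ]

[Chart: Industrials Equal Weight Index vs ETFHQ]

[Chart: Financials Equal Weight Index vs ETFHQ]

[Chart: Consumer Staples Equal Weight Index vs ETFHQ]

[Chart: Technology Equal Weight Index vs ETFHQ]

[Chart: Utilities Equal Weight Index vs ETFHQ]

[Chart: Health Care Equal Weight Index vs ETFHQ]

[Chart: Consumer Discretionary Equal Weight Index vs ETFHQ]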

S&P 500 Sectors – Historical Holdings Data

“Diversification is protection against ignorance.  It makes little sense if you know what you are doing.”
– Warren Buffett

Well when it comes to selecting individual companies on the basis of value, I certainly don’t know what I am doing and you know what?  I don’t care to learn.

That is the #1 draw card of ETFs; they provide diversification that protects me from my ignorance.  Furthermore by tracking the average of the stocks in an ETF, the noise found in the data of each individual holding is largely canceled out leaving numbers that are easier to decipher through technical analysis.

BUT, the data from an ETF is NOT the data from the underlying assets.  Yes, an ETF’s price changes reflect the net asset value (NAV) of its holdings, but nothing more.  Quality breadth data is difficult to come by and historical breadth data going back more than 5-10 years is almost non-existent.  Access to such data is only a dream for most trading system engineers.
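To show the kind of thing constituent-level data makes possible, here is a small Python sketch of two common breadth readings for a single day: the advance/decline count and the percentage of members advancing.  The function names, symbols and returns are hypothetical and are not taken from our database.

```python
def advance_decline(returns, members):
    """Count members that rose and members that fell on one day."""
    advancers = sum(1 for s in members if returns.get(s, 0.0) > 0)
    decliners = sum(1 for s in members if returns.get(s, 0.0) < 0)
    return advancers, decliners

def pct_advancing(returns, members):
    """Percentage of priced members that closed higher on the day."""
    priced = [s for s in members if s in returns]
    if not priced:
        return None
    return 100.0 * sum(1 for s in priced if returns[s] > 0) / len(priced)

# Hypothetical one-day returns for a four-stock index.
returns = {"AAA": 0.012, "BBB": -0.004, "CCC": 0.007, "DDD": 0.0}
members = {"AAA", "BBB", "CCC", "DDD"}

ad = advance_decline(returns, members)
pct = pct_advancing(returns, members)
```

None of this can be recovered from the ETF price alone, which is why the member list for each date matters.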

We contacted ‘S&P Dow Jones’ looking for such information and discovered that historical constituent data for the S&P 500 would cost $1,800 USD a year… 20 years would cost $36,000 and to include each of the 9 S&P sectors they would do us a deal; just $120,000.00 bucks…  We do have a budget for data, but…

So as luck would have it I managed to make friends with Mr XXXX from State Street who was kind enough to give me monthly S&P sector constituent data back to 2001.  But a lot has changed over the last 12 years.  Many of the S&P 500 holdings have been de-listed, renamed, given new ticker codes, merged, acquired, broken up etc.  Hunting down the last trading name, ticker code and clean data for these stocks is not a task for the faint of heart (or short of patience).

I could write a book about the difficulty of this task but instead will give you one example:

The old ‘General Motors’ (GM) stock was de-listed in March 2011 following bankruptcy.  What was remaining of the old GM at that time was trading under the name ‘Motors Liquidation Company’ (MTLQQ).  You will not find this name or ticker code in any historical holdings data for the S&P 500 or the S&P Consumer Discretionary Index because GM was removed from these indices in June 2009, before the name change.  However in November 2010 the new ‘General Motors’ was re-listed under the same name and symbol and in June 2013 returned to the S&P 500.  Very confusing!  Hundreds of similar yet different scenarios have faced the constituents of the S&P 500 over the last 23 years so you can imagine how difficult it was reconciling this database.
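One way to keep cases like this straight is to resolve every symbol against a date before touching price data.  The Python sketch below is only an illustration: the table layout, the entity labels and the `resolve` function are invented for the example, and the dates are month-level placeholders loosely based on the story above, not verified records.

```python
from datetime import date

# (symbol, first_day, last_day, entity). A last_day of None means the
# symbol is still in use. All dates are illustrative placeholders.
SYMBOL_HISTORY = [
    ("GM", date(1990, 1, 1), date(2009, 6, 30), "old_general_motors"),
    ("MTLQQ", date(2009, 7, 1), date(2011, 3, 31), "old_general_motors"),
    ("GM", date(2010, 11, 1), None, "new_general_motors"),
]

def resolve(symbol, on):
    """Return the entity trading under `symbol` on date `on`, or None."""
    for sym, first, last, entity in SYMBOL_HISTORY:
        if sym == symbol and first <= on and (last is None or on <= last):
            return entity
    return None

before = resolve("GM", date(2009, 1, 15))       # the old company
after = resolve("GM", date(2012, 1, 15))        # the new company
shell = resolve("MTLQQ", date(2010, 12, 1))     # still the old company
missing = resolve("GM", date(2009, 12, 1))      # no match in this toy table
```

Keying on an entity label rather than the ticker keeps the old and new ‘GM’ from being spliced into a single price history.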

Anyway, with that hard work done we received some help from Frank Hassler over at Engineering Returns who provided us with fairly clean S&P 500 holdings data back to 1990.  Then the hard work began again and after multiple cross-checks it was a matter of researching several hundred stocks individually (many of which had been de-listed for over 15 years) and classifying them into the corresponding sectors.  Several sources were used for this process including:

http://www.moodys.com
http://en.wikipedia.org
http://www.bloomberg.com
http://www.fundinguniverse.com/company-histories/
http://www.nytimes.com
http://www.nndb.com

We logged about 270 hours on the project and now have a very exciting, quality database to work with (proof the data is good).  Realistically, most people wouldn’t know how to use this database even if they wanted to, but I am happy to provide you with a copy at no cost on request.  All I ask is three things, or your request will be ignored: (1) let me know what ideas you want to test, (2) I must agree that these ideas are worth testing, and (3) I kindly ask that you share your findings 🙂

Over the coming months we will be publishing a variety of tests using this data including:

  • Correlation, Beta and Volume – Does the tail now wag the dog?  Has there been an increase in the correlation of stocks since the proliferation of ETFs?
  • Momentum – Emulating the results seen in published papers on momentum and looking for new findings.
  • Volume – How can an index’s internal volume best be utilised in a trading system?
  • Breadth Data – What is effective?
  • Identifying The Best – A rising tide lifts all boats but how can one identify the best/worst performers within an asset group?

What kinds of tests would you like to see us perform?  Please leave your suggestions below: