Dec 192011
 

So far, in my previous posts, we have been downloading DAILY stock prices from the well known sites Yahoo and Google. Both sites offer very easy access to free daily historical data. Both sites also offer nice charts of the stock prices. But some of you may have noticed that Google charts not only daily prices, but they also chart INTRADAY prices down to two minute resolution. However, there are no published links to get the raw data used to build the charts. Bummer. Let’s do a little investigating and see what we can figure out.

For those of you using the excellent Chrome browser, it has a built in utility to examine virtually all aspects of a web page (or site). It is called Developer Tools. It can be started clicking on the wrench icon in the upper right corner, selecting Tools, then Developer Tools (or simply Ctrl+Shift+I). This utility has an enormous amount of functionality, but we are mainly interested in any network traffic a page generates. So lets start by pulling up the Google Finance summary page for AAPL. Then start the developer tools and click on the Network tab. Finally, on the chart for AAPL, click on the 2min interval link. At the bottom of the page will be a list of all the network requests that were generated by clicking on the 2min link. Looking at them, there is one called getprices. That looks promising! Now click on that link and then the “response” tab along the top to see what the server response was to the request. Here is what my browser looked like:

The data we are interested in looks like this:

EXCHANGE%3DNASDAQ
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=120
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-300
a1324322040,383.71,383.82,383.56,383.655,8114
1,383.89,383.9,383.61,383.7,15512
2,383.844,384.03,383.81,383.83,15864
3,383.51,383.91,383.41,383.84,25327
4,383.58,383.7,383.44,383.44,16367
5,383.16,383.53,383,383.51,37838
6,383,383.19,382.95,383.18,16285
7,383.15,383.21,383,383.13,0
8,382.99,383.2,382.95,383.1998,0
9,382.888,383.018,382.77,383.018,0
10,382.96,383.02,382.75,382.9,0

Now we are getting somewhere! That sure looks like price and volume data, along with some sort of time stamp and a list of offsets. Here is what the link looked like:

http://www.google.com/finance/getprices?q=AAPL&x=NASD&i=120&sessions=ext_hours&p=5d&f=d,c,v,o,h,l&df=cpct&auto=1&ts=1324323553905

If we copy that link and put it directly in the url bar of our browser, we can get the full data set. It appears to have some header data followed by a list of prices. The dates appear to consist of a periodic time value followed by a series of offsets.

So there are two things to figure out. First we need to decipher the url string just like we did for the daily data, and then we need to decode the time stamp data in the returned results.

Let’s tackle the url string first. We will do it just like the other two previous attempts. We end up with:

  • The base url is http://www.google.com/finance/getprices
  • q is the symbol (AAPL)
  • x is the exchange (NASD)
  • i is the interval in seconds (120 = seconds = 2 minutes)
  • sessions is the session requested (ext_hours)
  • p is the time period (5d = 5 days)
  • f is the requested fields (d,c,v,o,h,l)
  • df ?? (cpct)
  • auto ?? (1)
  • ts is potentially a time stamp (1324323553 905)

Again, I spent some time playing around with the query parameters and found that you can get away with only q, i, p, and f. A simplified url (to get 10 days of 5 minute AAPL data) looks like this:

http://www.google.com/finance/getprices?q=AAPL&i=300&p=10d&f=d,o,h,l,c,v

Go ahead and click on the link to try it. Now we can download intraday data for a specific period back in time. There may be a way to get it between any two dates, but I haven’t figured that out yet. If anyone knows, I sure would be interested in finding out how.

The next problem is to figure out how to deal with the weird date formats. In data returned from the example above, I got “a1323181800” in the first date field and moving down, the last “a” prefixed number displayed was “a1324045800”. Those look suspiciously like Unix time stamps to me. All major programming languages support them and they are simply the number of seconds since 1/1/1970. Again, using my friend Google, I found an online Unix time stamp converter and plugged in the above numbers (without the “a”). I got: “Tue, 06 Dec 2011 14:30:00 GMT” and “Fri, 16 Dec 2011 14:30:00 GMT” for the second. Bingo! They are Unix time stamps. And the data appears to be listed with the oldest first. So, to get the date for each time stamp, we just need to convert any “a” prefixed times to the Unix time, then for each succeeding time, we add the offset multiplied by the number of interval seconds to the last recorded Unix time until we get an new “a” prefixed time. It sounds a lot more complicated than it really is.

Since I am only interested in the DOHLC values, I am going to ignore the header fields in our dataset. And since this post is getting pretty long, I will present the Python class I wrote to parse the data:

class GoogleIntradayQuote(Quote):
  ''' Intraday quotes from Google. Specify interval seconds and number of days '''
  def __init__(self,symbol,interval_seconds=300,num_days=5):
    super(GoogleIntradayQuote,self).__init__()
    self.symbol = symbol.upper()
    url_string = "http://www.google.com/finance/getprices?q={0}".format(self.symbol)
    url_string += "&i={0}&p={1}d&f=d,o,h,l,c,v".format(interval_seconds,num_days)
    csv = urllib.urlopen(url_string).readlines()
    for bar in xrange(7,len(csv)):
      if csv[bar].count(',')!=5: continue
      offset,close,high,low,open_,volume = csv[bar].split(',')
      if offset[0]=='a':
        day = float(offset[1:])
        offset = 0
      else:
        offset = float(offset)
      open_,high,low,close = [float(x) for x in [open_,high,low,close]]
      dt = datetime.datetime.fromtimestamp(day+(interval_seconds*offset))
      self.append(dt,open_,high,low,close,volume)

It should be pretty self explanatory. The main difference between this class and the others is that is takes an interval (in seconds) and a lookback (in days) instead of a start/end time. I ignore the first five header fields, then check for date fields prefixed with an “a”. I keep track of the offset and calculate the correct date with the line:

dt = datetime.datetime.fromtimestamp(day+(interval_seconds*offset))

There is one last tricky part: Notice that the OHLC data is listed in a non-standard format. It actually is returned as CHLO, no matter what order the f fields are listed.

It should be noted that it appears that there is only the most recent 10-20 days of intraday data available at any given time. Buy hey, what do you want for nothing! Later on, we will try to find some interesting things to do with this data and these limitations.

And finally, a complete example: google_intraday.py

Dec 192011
 

Now that we have a very simple function to download historical stock data from Yahoo, let’s see if we can use the same technique to download data from Google Finance.

If we head over to the main page for Google finance and pull up a quote for AAPL, we will see a link for historical prices on the upper left corner. Clicking on that link will bring up the historical prices page. By default, the page shows a nicely formatted table of prices with the latest price at the top of the page. Notice that there is an option to pick any two date ranges for the prices, and, just like Yahoo, there is a link to download the data to a spreadsheet. Now we pick a date range (in this case, Nov 1st – Nov 30th) and click the update button. Then we can download the data using the link on the right side of the page. By default it downloads as “data.csv”. We end up with the following data:

Date,Open,High,Low,Close,Volume
30-Nov-11,381.29,382.28,378.30,382.20,14502386
29-Nov-11,375.84,378.83,370.20,373.20,13429694
28-Nov-11,372.35,376.72,370.33,376.12,12380122
.
.
.
3-Nov-11,399.07,403.40,395.36,403.07,15768725
2-Nov-11,400.09,400.44,395.11,397.41,11723381
1-Nov-11,397.41,399.50,393.22,396.51,18992411

This data looks very similar to the Yahoo data, but without the adjusted close. It appears that Google adjusts the prices for us, so we will not have to. Also, notice the dates are formatted differently. It should be very easy to convert to our standard format that we defined in our Quote class in the previous post.

Now we need to decipher the link used to get the data. We copy the link by right clicking it and selecting “Copy link address” and end up with the following:

http://www.google.com/finance/historical?cid=22144&startdate=Nov+1%2C+2011&enddate=Nov+30%2C+2011&num=30&output=csv

We have the query options: cid, startdate, enddate, num, and output. Take a look at the date values. What’s up with the percent signs? Anytime you see “%” signs in a url, it means that it has been url encoded. By searching google, we can easily find many sites that will encode and decode url strings. By decoding the raw url string, we end up with:

http://www.google.com/finance/historical?cid=22144&startdate=Nov 1, 2010&enddate=Nov 30, 2011&num=30&output=csv

That looks much easier to figure out.

We now have:

  • cid = 22144
  • startdate = Nov 1, 2011
  • enddate = Nov 30, 2011
  • num = 30
  • output = csv

There is only one problem. What is the cid? Probably some sort of database id. But by observing the query string for the main page, we can see that it uses a “q” option that takes the exchange and symbol name. By playing around with different combinations of these parameters, I found that we can use the q option without the exchange, and the num option does not seem to be necessary. We can simplify the url string to:

http://www.google.com/finance/historical?q=AAPL&startdate=Nov 1, 2011&enddate=Nov 30, 2011&output=csv

The only thing left to do now is to subclass our original Quote class and customize it to build our url string and convert the dates to our common format. Here is what I ended up with:

class GoogleQuote(Quote):
  ''' Daily quotes from Google. Date format='yyyy-mm-dd' '''
  def __init__(self,symbol,start_date,end_date=datetime.date.today().isoformat()):
    super(GoogleQuote,self).__init__()
    self.symbol = symbol.upper()
    start = datetime.date(int(start_date[0:4]),int(start_date[5:7]),int(start_date[8:10]))
    end = datetime.date(int(end_date[0:4]),int(end_date[5:7]),int(end_date[8:10]))
    url_string = "http://www.google.com/finance/historical?q={0}".format(self.symbol)
    url_string += "&startdate={0}&enddate={1}&output=csv".format(
                      start.strftime('%b %d, %Y'),end.strftime('%b %d, %Y'))
    csv = urllib.urlopen(url_string).readlines()
    csv.reverse()
    for bar in xrange(0,len(csv)-1):
      ds,open_,high,low,close,volume = csv[bar].rstrip().split(',')
      open_,high,low,close = [float(x) for x in [open_,high,low,close]]
      dt = datetime.datetime.strptime(ds,'%d-%b-%y')
      self.append(dt,open_,high,low,close,volume)

This one is even simpler than the Yahoo version. There is no need to calculate adjusted prices. The main difference is the different url format and time formatting.

You can download a file with the Quote class and a few command prompt tests here: google.py

In a future post, we will consolidate all of these classes and create a Python module/command line utility to download historical data from these various sources save them to disk in a consistent format. Stay tuned!

Dec 032011
 

In order to create and run a trading system, the first thing you need is some historical data to work with. Fortunately, there are several excellent free sources available on the internet. Probably the most popular is from the Yahoo Finance site.

Normally, in order to access the data, you would enter in a stock symbol and pull up the overview page for a company. For example, if we pull up the page for Apple Inc., you will see a link on the left side for Historical Prices. Clicking on that link will bring up a page with a nicely formatted table of the most recent prices for Apple. At the bottom of that page, there is a link to download all the prices as a comma separated file that is suitable for importing into a spreadsheet. The downloaded file is formatted like this:

Date,Open,High,Low,Close,Volume,Adj Close
2011-12-02,389.83,393.63,388.58,389.70,13537700,389.70
2011-12-01,382.54,389.00,380.75,387.93,13709400,387.93
2011-11-30,381.29,382.28,378.30,382.20,14497800,382.20
.
.
.
1984-09-11,26.62,27.37,26.62,26.87,5444000,3.07
1984-09-10,26.50,26.62,25.87,26.37,2346400,3.01
1984-09-07,26.50,26.87,26.25,26.50,2981600,3.02

The data is very useful, but it is not very easy to download and manage lots of different symbols for different date ranges.

Suppose we wanted to automate this process? If we could figure out the url format, it would allow us to automatically generate new url’s and download the data via a Python script.

Let’s start by examining the url that generates the download data. In this example, it is from September 7th, 1984 to December 3rd, 2011:

http://ichart.finance.yahoo.com/table.csv?s=AAPL&d=11&e=3&f=2011&g=d&a=8&b=7&c=1984&ignore=.csv

It looks pretty easy to decipher. There is the main part of the url, followed by a number of query options. They are:

s=AAPL, d=11, e=3, f=2011, g=d, a=8, b=7, c=1984, ignore=.csv

  • s is obviously the symbol
  • d appears to be the end month minus one
  • e is the end day
  • f is the end year
  • g Not sure. Perhaps d is for daily?
  • a appears to be the start month minus one
  • b is the start day
  • c is the start year
  • ignore=.csv appears to specify csv as the output format

If we play around and build some new urls by hand by replacing the query values with new data and test them out, you find that we can leave the g and ignore options out and the downloaded files are identical.

So, now we know how to build a url to download historical data for any stock, and that’s all we need are the stock symbol and the start/end dates.

In the next post, we will create a set of Python classes to easily download and store the data in a consistent way that will be useful for our trading system.