Dec 19 2011
 

So far, in my previous posts, we have been downloading DAILY stock prices from the well-known sites Yahoo and Google. Both sites offer very easy access to free daily historical data, and both offer nice charts of the stock prices. But some of you may have noticed that Google charts not only daily prices but also INTRADAY prices, down to two-minute resolution. However, there are no published links to get the raw data used to build the charts. Bummer. Let’s do a little investigating and see what we can figure out.

For those of you using the excellent Chrome browser, it has a built-in utility to examine virtually all aspects of a web page (or site). It is called Developer Tools. It can be started by clicking on the wrench icon in the upper right corner, selecting Tools, then Developer Tools (or simply Ctrl+Shift+I). This utility has an enormous amount of functionality, but we are mainly interested in any network traffic a page generates. So let’s start by pulling up the Google Finance summary page for AAPL. Then start the developer tools and click on the Network tab. Finally, on the chart for AAPL, click on the 2min interval link. At the bottom of the page will be a list of all the network requests that were generated by clicking on the 2min link. Looking at them, there is one called getprices. That looks promising! Now click on that link and then the “response” tab along the top to see what the server response was to the request. Here is what my browser looked like:

The data we are interested in looks like this:

EXCHANGE%3DNASDAQ
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=120
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-300
a1324322040,383.71,383.82,383.56,383.655,8114
1,383.89,383.9,383.61,383.7,15512
2,383.844,384.03,383.81,383.83,15864
3,383.51,383.91,383.41,383.84,25327
4,383.58,383.7,383.44,383.44,16367
5,383.16,383.53,383,383.51,37838
6,383,383.19,382.95,383.18,16285
7,383.15,383.21,383,383.13,0
8,382.99,383.2,382.95,383.1998,0
9,382.888,383.018,382.77,383.018,0
10,382.96,383.02,382.75,382.9,0

Now we are getting somewhere! That sure looks like price and volume data, along with some sort of time stamp and a list of offsets. Here is what the link looked like:

http://www.google.com/finance/getprices?q=AAPL&x=NASD&i=120&sessions=ext_hours&p=5d&f=d,c,v,o,h,l&df=cpct&auto=1&ts=1324323553905

If we copy that link and put it directly in the url bar of our browser, we can get the full data set. It appears to have some header data followed by a list of prices. The dates appear to consist of a periodic time value followed by a series of offsets.

So there are two things to figure out. First we need to decipher the url string just like we did for the daily data, and then we need to decode the time stamp data in the returned results.

Let’s tackle the url string first. We will do it just like the two previous attempts. We end up with:

  • The base url is http://www.google.com/finance/getprices
  • q is the symbol (AAPL)
  • x is the exchange (NASD)
  • i is the interval in seconds (120 seconds = 2 minutes)
  • sessions is the session requested (ext_hours)
  • p is the time period (5d = 5 days)
  • f is the requested fields (d,c,v,o,h,l)
  • df ?? (cpct)
  • auto ?? (1)
  • ts is potentially a time stamp (1324323553905)

Again, I spent some time playing around with the query parameters and found that you can get away with only q, i, p, and f. A simplified url (to get 10 days of 5 minute AAPL data) looks like this:

http://www.google.com/finance/getprices?q=AAPL&i=300&p=10d&f=d,o,h,l,c,v

Go ahead and click on the link to try it. Now we can download intraday data for a specific period back in time. There may be a way to get it between any two dates, but I haven’t figured that out yet. If anyone knows, I sure would be interested in finding out how.
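
If you would rather poke at the raw response from a script instead of the browser, a minimal sketch using plain urllib (assuming the getprices endpoint responds the way it does above) looks like this:

import urllib

# Same simplified query as above: 10 days of 5-minute AAPL bars
url = "http://www.google.com/finance/getprices?q=AAPL&i=300&p=10d&f=d,o,h,l,c,v"
lines = urllib.urlopen(url).readlines()

# Print the header block plus the first few data rows
for line in lines[:12]:
    print line.rstrip()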

The next problem is to figure out how to deal with the weird date formats. In the data returned from the example above, I got “a1323181800” in the first date field and, moving down, the last “a” prefixed number displayed was “a1324045800”. Those look suspiciously like Unix time stamps to me. All major programming languages support them, and they are simply the number of seconds since 1/1/1970. Again, using my friend Google, I found an online Unix time stamp converter and plugged in the above numbers (without the “a”). I got “Tue, 06 Dec 2011 14:30:00 GMT” for the first and “Fri, 16 Dec 2011 14:30:00 GMT” for the second. Bingo! They are Unix time stamps. And the data appears to be listed with the oldest first. So, to get the date for each bar, we just need to convert any “a” prefixed time to a Unix time, then for each succeeding row add its offset multiplied by the interval in seconds to the last recorded Unix time, until we hit a new “a” prefixed time. It sounds a lot more complicated than it really is.
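
Here is a quick sketch of that arithmetic, using the first timestamp above and the 300 second interval from our example request. (I use utcfromtimestamp so the output matches the GMT values from the online converter; the class below uses fromtimestamp, which gives local time.)

import datetime

interval_seconds = 300
base = 1323181800     # the "a"-prefixed value: seconds since 1/1/1970

print datetime.datetime.utcfromtimestamp(base)
# 2011-12-06 14:30:00

# A row whose date field is "3" is three intervals after the last "a" row
offset = 3
print datetime.datetime.utcfromtimestamp(base + interval_seconds*offset)
# 2011-12-06 14:45:00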

Since I am only interested in the DOHLC values, I am going to ignore the header fields in our dataset. And since this post is getting pretty long, I will present the Python class I wrote to parse the data:

class GoogleIntradayQuote(Quote):
  ''' Intraday quotes from Google. Specify interval seconds and number of days '''
  def __init__(self,symbol,interval_seconds=300,num_days=5):
    super(GoogleIntradayQuote,self).__init__()
    self.symbol = symbol.upper()
    url_string = "http://www.google.com/finance/getprices?q={0}".format(self.symbol)
    url_string += "&i={0}&p={1}d&f=d,o,h,l,c,v".format(interval_seconds,num_days)
    csv = urllib.urlopen(url_string).readlines()
    for bar in xrange(7,len(csv)):            # skip the header block at the top of the response
      if csv[bar].count(',')!=5: continue     # only keep six-field data rows
      offset,close,high,low,open_,volume = csv[bar].split(',')
      if offset[0]=='a':
        day = float(offset[1:])               # new "a"-prefixed Unix time stamp
        offset = 0
      else:
        offset = float(offset)                # bar count since the last time stamp
      open_,high,low,close = [float(x) for x in [open_,high,low,close]]
      dt = datetime.datetime.fromtimestamp(day+(interval_seconds*offset))
      self.append(dt,open_,high,low,close,volume)

It should be pretty self explanatory. The main difference between this class and the others is that it takes an interval (in seconds) and a lookback (in days) instead of a start/end time. I skip the header lines at the top of the response (and any row that does not contain six comma-separated fields), then check for date fields prefixed with an “a”. I keep track of the offset and calculate the correct date with the line:

dt = datetime.datetime.fromtimestamp(day+(interval_seconds*offset))

There is one last tricky part: notice that the price data is listed in a non-standard order. No matter what order the f fields are requested in, the price columns come back as CLOSE,HIGH,LOW,OPEN (with DATE first and VOLUME last), which is why the split above unpacks them that way.
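
Putting it together, the class gets used something like this (assuming the Quote base class from the earlier post, plus urllib and datetime, are available in the same file; the csv filename is just an example):

q = GoogleIntradayQuote('aapl',300,10)   # 10 days of 5-minute AAPL bars
print q                                  # dumps symbol,date,time,open,high,low,close,volume rows
q.write_csv('aapl_5min.csv')             # save a local copy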

It should be noted that there appears to be only the most recent 10-20 days of intraday data available at any given time. But hey, what do you want for nothing! Later on, we will try to find some interesting things to do with this data within these limitations.

And finally, a complete example: google_intraday.py

Dec 19 2011
 

Now that we have a very simple function to download historical stock data from Yahoo, let’s see if we can use the same technique to download data from Google Finance.

If we head over to the main page for Google Finance and pull up a quote for AAPL, we will see a link for historical prices in the upper left corner. Clicking on that link will bring up the historical prices page. By default, the page shows a nicely formatted table of prices with the latest price at the top of the page. Notice that there is an option to pick any date range for the prices, and, just like Yahoo, there is a link to download the data to a spreadsheet. Now we pick a date range (in this case, Nov 1st – Nov 30th) and click the update button. Then we can download the data using the link on the right side of the page. By default it downloads as “data.csv”. We end up with the following data:

Date,Open,High,Low,Close,Volume
30-Nov-11,381.29,382.28,378.30,382.20,14502386
29-Nov-11,375.84,378.83,370.20,373.20,13429694
28-Nov-11,372.35,376.72,370.33,376.12,12380122
.
.
.
3-Nov-11,399.07,403.40,395.36,403.07,15768725
2-Nov-11,400.09,400.44,395.11,397.41,11723381
1-Nov-11,397.41,399.50,393.22,396.51,18992411

This data looks very similar to the Yahoo data, but without the adjusted close. It appears that Google adjusts the prices for us, so we will not have to. Also, notice the dates are formatted differently. It should be very easy to convert them to the standard format that we defined in our Quote class in the previous post.
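
Converting those dates is a one-liner with strptime (the class later in this post does the same thing):

import datetime

dt = datetime.datetime.strptime('30-Nov-11','%d-%b-%y')
print dt.date().isoformat()   # 2011-11-30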

Now we need to decipher the link used to get the data. We copy the link by right clicking it and selecting “Copy link address” and end up with the following:

http://www.google.com/finance/historical?cid=22144&startdate=Nov+1%2C+2011&enddate=Nov+30%2C+2011&num=30&output=csv

We have the query options: cid, startdate, enddate, num, and output. Take a look at the date values. What’s up with the percent signs? Anytime you see “%” signs in a url, it means that it has been URL encoded. By searching Google, we can easily find many sites that will encode and decode url strings. By decoding the raw url string, we end up with:

http://www.google.com/finance/historical?cid=22144&startdate=Nov 1, 2011&enddate=Nov 30, 2011&num=30&output=csv

That looks much easier to figure out.
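
Incidentally, you don’t even need a website for this; Python’s urllib will decode (and re-encode) these strings for you:

import urllib

print urllib.unquote_plus('Nov+1%2C+2011')   # Nov 1, 2011
print urllib.quote_plus('Nov 30, 2011')      # Nov+30%2C+2011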

We now have:

  • cid = 22144
  • startdate = Nov 1, 2011
  • enddate = Nov 30, 2011
  • num = 30
  • output = csv

There is only one problem. What is the cid? Probably some sort of database id. But by observing the query string for the main page, we can see that it uses a “q” option that takes the exchange and symbol name. By playing around with different combinations of these parameters, I found that we can use the q option without the exchange, and the num option does not seem to be necessary. We can simplify the url string to:

http://www.google.com/finance/historical?q=AAPL&startdate=Nov 1, 2011&enddate=Nov 30, 2011&output=csv

The only thing left to do now is to subclass our original Quote class and customize it to build our url string and convert the dates to our common format. Here is what I ended up with:

class GoogleQuote(Quote):
  ''' Daily quotes from Google. Date format='yyyy-mm-dd' '''
  def __init__(self,symbol,start_date,end_date=datetime.date.today().isoformat()):
    super(GoogleQuote,self).__init__()
    self.symbol = symbol.upper()
    start = datetime.date(int(start_date[0:4]),int(start_date[5:7]),int(start_date[8:10]))
    end = datetime.date(int(end_date[0:4]),int(end_date[5:7]),int(end_date[8:10]))
    url_string = "http://www.google.com/finance/historical?q={0}".format(self.symbol)
    url_string += "&startdate={0}&enddate={1}&output=csv".format(
                      start.strftime('%b %d, %Y'),end.strftime('%b %d, %Y'))
    csv = urllib.urlopen(url_string).readlines()
    csv.reverse()                             # Google returns newest first; we want oldest first
    for bar in xrange(0,len(csv)-1):          # the last line (after the reverse) is the header row
      ds,open_,high,low,close,volume = csv[bar].rstrip().split(',')
      open_,high,low,close = [float(x) for x in [open_,high,low,close]]
      dt = datetime.datetime.strptime(ds,'%d-%b-%y')
      self.append(dt,open_,high,low,close,volume)

This one is even simpler than the Yahoo version. There is no need to calculate adjusted prices. The main differences are the url format and the date formatting.

You can download a file with the Quote class and a few command prompt tests here: google.py

In a future post, we will consolidate all of these classes and create a Python module/command line utility to download historical data from these various sources and save it to disk in a consistent format. Stay tuned!

Dec 03 2011
 

In the previous post, we learned to manually build a url to download historical data for any stock symbol between any two dates. Now we want to build a set of simple Python classes to retrieve the data and store it in a consistent manner.

However, there is one more thing to figure out. Some of you may have noticed that there is an extra field (Adj Close) in addition to the normal Open/High/Low/Close/Volume data associated with a daily stock price. It turns out that stock prices can be adjusted for either dividends paid out or stock splits. For example, if a stock is trading at $500, a company board may decide to create 5 shares of $100 stock for each original $500 share. That way it will be easier for some investors to buy the stock. The day after this happens, the quoted price will drop from $500 to $100, but the company will still be worth the same (with five times as many shares). Yahoo data includes the actual OHLC prices on the date they were published, as well as an Adjusted Close price. Normally, for the most recent prices, the close price is the same as the adjusted close price and nothing needs to be done. But for data that was published before the most recent split or dividend, the two prices will be different. In order to get a consistent representation of the price for those days, the OHLC prices must be multiplied by the adjusted close divided by the close for that day. You end up with a new set of prices for those days that are different from what was published at the time, but proportionally accurate relative to today’s prices. This is essential for back testing trading systems accurately.
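
As a concrete sketch of that adjustment, using made-up numbers that match the 5-for-1 split example above:

# Illustrative values only: the day we care about closed at 500.00,
# and a later 5-for-1 split makes its adjusted close 100.00
open_, high, low, close = 495.00, 505.00, 490.00, 500.00
adj_close = 100.00

factor = adj_close / close                                       # 0.2
open_, high, low, close = [x*factor for x in [open_, high, low, close]]
print open_, high, low, close                                    # 99.0 101.0 98.0 100.0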

Now we can finally start writing some code! Let’s think about what we want to accomplish.

  • We want to save the symbol name
  • We want to save the date and time. (Time is not required for daily quotes, but we will be adding an intraday source later.)
  • We want to save the Open, High, Low, Close, and Volume (OHLCV) values for each day.
  • We want to save the OHLCV as Lists (arrays) in memory that are easy to manipulate.
  • We want the ability to save the data locally in comma separated files.
  • We want to be able to read our local files back into memory.
  • We want consistent date/time formatting.
  • We want the prices ordered from oldest to newest to make it easy on our later calculations.

First we will create a base class that contains all of the common functions we require.

class Quote(object):
  
  DATE_FMT = '%Y-%m-%d'
  TIME_FMT = '%H:%M:%S'
  
  def __init__(self):
    self.symbol = ''
    self.date,self.time,self.open_,self.high,self.low,self.close,self.volume = ([] for _ in range(7))

  def append(self,dt,open_,high,low,close,volume):
    self.date.append(dt.date())
    self.time.append(dt.time())
    self.open_.append(float(open_))
    self.high.append(float(high))
    self.low.append(float(low))
    self.close.append(float(close))
    self.volume.append(int(volume))
      
  def to_csv(self):
    return ''.join(["{0},{1},{2},{3:.2f},{4:.2f},{5:.2f},{6:.2f},{7}\n".format(self.symbol,
              self.date[bar].strftime('%Y-%m-%d'),self.time[bar].strftime('%H:%M:%S'),
              self.open_[bar],self.high[bar],self.low[bar],self.close[bar],self.volume[bar]) 
              for bar in xrange(len(self.close))])
    
  def write_csv(self,filename):
    with open(filename,'w') as f:
      f.write(self.to_csv())
        
  def read_csv(self,filename):
    self.symbol = ''
    self.date,self.time,self.open_,self.high,self.low,self.close,self.volume = ([] for _ in range(7))
    for line in open(filename,'r'):
      symbol,ds,ts,open_,high,low,close,volume = line.rstrip().split(',')
      self.symbol = symbol
      dt = datetime.datetime.strptime(ds+' '+ts,self.DATE_FMT+' '+self.TIME_FMT)
      self.append(dt,open_,high,low,close,volume)
    return True

  def __repr__(self):
    return self.to_csv()

The __init__ method creates the instance attributes that will store our data. There are also two string constants that will be used later to format dates and times consistently.
The append method takes a Python datetime object and the OHLCV values and appends them to the end of the lists created above.
The csv methods should be self explanatory. They convert the Python lists to csv format and vice versa. They also handle reading and writing the data to disk for permanent storage.
For those of you new to Python, the __repr__ method returns a printable representation of an object.
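
To make the storage format concrete, here is what a single appended bar comes back as from to_csv (made-up values, assuming the Quote class above and an import datetime):

q = Quote()
q.symbol = 'TEST'
q.append(datetime.datetime(2011,12,19,9,30),10.0,10.5,9.75,10.25,1000)
print q.to_csv()   # TEST,2011-12-19,09:30:00,10.00,10.50,9.75,10.25,1000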

Now it is time to subclass the Quote class and customize it to download our Yahoo data.

class YahooQuote(Quote):
  ''' Daily quotes from Yahoo. Date format='yyyy-mm-dd' '''
  def __init__(self,symbol,start_date,end_date=datetime.date.today().isoformat()):
    super(YahooQuote,self).__init__()
    self.symbol = symbol.upper()
    start_year,start_month,start_day = start_date.split('-')
    start_month = str(int(start_month)-1)     # Yahoo's month parameters are zero-based
    end_year,end_month,end_day = end_date.split('-')
    end_month = str(int(end_month)-1)
    url_string = "http://ichart.finance.yahoo.com/table.csv?s={0}".format(symbol)
    url_string += "&a={0}&b={1}&c={2}".format(start_month,start_day,start_year)
    url_string += "&d={0}&e={1}&f={2}".format(end_month,end_day,end_year)
    csv = urllib.urlopen(url_string).readlines()
    csv.reverse()                             # Yahoo returns newest first; we want oldest first
    for bar in xrange(0,len(csv)-1):          # the last line (after the reverse) is the header row
      ds,open_,high,low,close,volume,adjc = csv[bar].rstrip().split(',')
      open_,high,low,close,adjc = [float(x) for x in [open_,high,low,close,adjc]]
      if close != adjc:
        factor = adjc/close
        open_,high,low,close = [x*factor for x in [open_,high,low,close]]
      dt = datetime.datetime.strptime(ds,'%Y-%m-%d')
      self.append(dt,open_,high,low,close,volume)

This class first calls the superclass initializer, then builds the proper url string using the given symbol and dates. After retrieving the csv data from Yahoo, it reverses the list so we get it in the order we desire. It then loops through each day, adjusts the OHLC prices if necessary (as determined by the adjusted close), and stores them in the lists.

Next, we need a way to demonstrate and test the code when it is run from the command line:

if __name__ == '__main__':
  q = YahooQuote('aapl','2011-01-01')              # download year to date Apple data
  print q                                          # print it out
  q = YahooQuote('orcl','2011-02-01','2011-02-28') # download Oracle data for February 2011
  q.write_csv('orcl.csv')                          # save it to disk
  q = Quote()                                      # create a generic quote object
  q.read_csv('orcl.csv')                           # populate it with our previously saved data
  print q                                          # print it out

That’s it! There isn’t much there, is there? That is mostly because of the power of Python. If there are any bits you don’t understand, they should be very easy to figure out with the Python docs. If all else fails, you can always ask me.

Finally, here is a link to the complete file: yahoo.py