@andrewchen

The first 6 steps to homegrowing basic startup analytics

Quick intro to getting set up on analytics
I’ve been asked a few times recently, “Wow, these analytics you write about are great, but how does a startup begin to bite off the relevant parts?” This post is my attempt to answer that question.

First, let me recommend reading a previous post, called omg I’m just a startup, I can’t do those fancy metrics. In it, I cover some more general philosophical ideas about how to approach what to measure and what not to measure. It’s worth a look before you dive in here.

Now let’s move on to the steps themselves:

Step 0: Pre-product
Initially, the product development process should be focused on big-picture qualitative information, like whether your business is addressing the right audience and what that audience’s preferences are. So don’t measure anything yet :)

Instead, spend your time gathering qualitative data, interviewing users, understanding the problem-behind-the-problem you’re trying to solve, and prototyping concepts.

Do this for a couple weeks!

Step 1: Prototypes
As you create prototypes of your product, you should throw up some free, simple analytics to get you some rough ideas of what’s happening inside the functionality. This likely means something like Google Analytics, although there is a very large universe of equivalent tools out there as well.

Google Analytics can’t really tell you much at this point – it’s not very actionable. The main things I like to look at are new versus return visitors, top content pages, which pages are causing bounces, etc. Again, at this stage you are still primarily driven by qualitative research and ideas, and it’s hard for analytics to drive much of your thinking.

This prototype phase might last a month or a couple of months.

Step 2: Traffic comes in, so data must be collected
As your product begins to mature, and you get a better sense for what you are trying to do with it, the next thing I might do is figure out what the important pieces of data are, and confirm that they’re being measured. Nothing is worse than throwing data away that you might want to use later.

Generally, I prefer a single table or log that can be queried later that stores events. The right granularity of events is at the “business” event level, like “someone updated their profile” or “someone downloaded a video” rather than at the URL level. This ensures that you are getting a good amount of information from the logs but it’s not so overwhelming that you’re blowing up your database.

You might, for example, hold events in the rough key/value form:

user_id, event_name, value, datetime

Where it might look something like:

1000, profile.photo.update, 1, 9:30AM 3/14/2008

Make sense?

I prefer to start out via SQL so that the manipulations of the data are easy, although many large-scale systems eventually move to flat-files of some format.
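
To make the event log concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name `events`, the column name `created_at`, and the helper `log_event` are assumptions made for illustration, not a prescribed schema; any SQL database would work the same way.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema mirroring the user_id, event_name, value, datetime form.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        user_id    INTEGER NOT NULL,
        event_name TEXT    NOT NULL,
        value      REAL    NOT NULL DEFAULT 1,
        created_at TEXT    NOT NULL
    )
    """
)


def log_event(conn, user_id, event_name, value=1, when=None):
    """Append one business-level event to the events table."""
    when = when or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO events (user_id, event_name, value, created_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, event_name, value, when.isoformat()),
    )
    conn.commit()


# The example row from above: user 1000 updated a profile photo.
log_event(conn, 1000, "profile.photo.update", 1, datetime(2008, 3, 14, 9, 30))
rows = conn.execute(
    "SELECT user_id, event_name, value, created_at FROM events"
).fetchall()
print(rows)
```

Storing the timestamp as an ISO 8601 string keeps it sortable and easy to slice by day or hour in later queries.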

Design-wise, here are some things to consider:

  • What’s your “event” hierarchy and what level of granularity do you want?
  • Do you want your analytics DB to be the same as your webapp DB?
  • How should you join data between your webapp stats and your analytics stats?
  • Where does it make sense to throw data away versus trying to store it forever?
  • How do you pass data into the analytics DB? Via a JS interface called by the client (like Google Analytics) or server-side within your methods?

There are really no wrong answers to the above – I’ve seen it done in many ways.
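
As one illustration of the join question, here is a rough sketch that assumes the simplest setup: the analytics events live in the same database as the webapp tables. The `users` table and its `signup_source` column are hypothetical stand-ins for whatever your webapp actually stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical webapp table.
    CREATE TABLE users (id INTEGER PRIMARY KEY, signup_source TEXT);
    -- Hypothetical analytics event log.
    CREATE TABLE events (
        user_id INTEGER, event_name TEXT, value REAL, created_at TEXT
    );
    INSERT INTO users VALUES (1000, 'ads'), (1001, 'seo'), (1002, 'seo');
    INSERT INTO events VALUES
        (1000, 'profile.photo.update', 1, '2008-03-14T09:30:00'),
        (1001, 'video.download',       1, '2008-03-14T10:00:00'),
        (1002, 'video.download',       1, '2008-03-14T11:15:00');
    """
)

# Count events per signup source by joining on the shared user id.
events_by_source = conn.execute(
    """
    SELECT u.signup_source, e.event_name, COUNT(*) AS n
    FROM events e
    JOIN users u ON u.id = e.user_id
    GROUP BY u.signup_source, e.event_name
    ORDER BY u.signup_source, e.event_name
    """
).fetchall()
print(events_by_source)
```

If the analytics data lives in a separate database, the same join needs a copy of the user attributes on the analytics side, which is part of the trade-off the questions above are getting at.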

Step 3: Identifying your user flows
Every web product ultimately has a bunch of user flows contained within it. For example, there might be a series of flows in how users come into the site, starting with ads, SEO, or otherwise. Similarly, once they get on the site, you might be trying to optimize their usage of the site.

Identifying these flows is key since you are trying to find the “critical path” that is then optimized. Figure these flows out, and make sure you’re collecting the right data to optimize.

A good place to learn about these user flows is to read about ecommerce “funnels” and how folks go about breaking those down and optimizing them.
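
A funnel over an event log can be sketched roughly like this. The step names (`landing.view`, `signup.complete`) and the `events` table layout are assumptions for illustration. Note that this simple version only counts distinct users per step; it does not check that each user actually passed through the earlier steps in order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, 1, ?)",
    [
        (1, "landing.view", "2008-03-14T09:00:00"),
        (2, "landing.view", "2008-03-14T09:01:00"),
        (3, "landing.view", "2008-03-14T09:02:00"),
        (4, "landing.view", "2008-03-14T09:03:00"),
        (1, "signup.complete", "2008-03-14T09:05:00"),
        (2, "signup.complete", "2008-03-14T09:06:00"),
        (1, "profile.photo.update", "2008-03-14T09:30:00"),
    ],
)

# Hypothetical critical path, from first touch to an engaged user.
FUNNEL = ["landing.view", "signup.complete", "profile.photo.update"]


def funnel_counts(conn, steps):
    """Return (step, distinct users who logged it) for each step, in order."""
    counts = []
    for step in steps:
        (n,) = conn.execute(
            "SELECT COUNT(DISTINCT user_id) FROM events WHERE event_name = ?",
            (step,),
        ).fetchone()
        counts.append((step, n))
    return counts


counts = funnel_counts(conn, FUNNEL)
previous = None
for step, n in counts:
    # Skip the rate for the first step, or if the previous step had no users.
    rate = f" ({n / previous:.0%} of previous step)" if previous else ""
    print(f"{step}: {n}{rate}")
    previous = n
```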

Step 4: Trying ad hoc queries
As users are coming into the system, it can then become a good idea to start gathering data into a standard format. This means creating a small set of queries that you might try to run to learn more about the critical paths that users are taking, and where you can adjust their flow. At this point, it’s important to have the vision of the product become fairly stable so that you are starting to optimize the edges rather than reinventing the core constantly.

The kinds of ad hoc queries worth doing revolve around whatever the tactical goals of your business are. If you are trying to come up with a monetization strategy, you should try to figure out your average order size and what percentage of users who start a buying process finish it. Once you create a small list of these queries, then you can start to formalize the ideas into specific metrics that you track daily.
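
The two monetization questions above could be answered with queries along these lines. This is only a sketch: the event names `checkout.start` and `order.complete`, and the convention of storing the order amount in `value`, are assumptions rather than anything prescribed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "checkout.start", 1, "2008-03-14T09:00:00"),
        (2, "checkout.start", 1, "2008-03-14T09:10:00"),
        (3, "checkout.start", 1, "2008-03-14T09:20:00"),
        (4, "checkout.start", 1, "2008-03-14T09:30:00"),
        # For order.complete, value holds the order amount (an assumption).
        (1, "order.complete", 20.0, "2008-03-14T09:05:00"),
        (3, "order.complete", 40.0, "2008-03-14T09:25:00"),
    ],
)


def distinct_users(conn, event_name):
    """Count distinct users who logged the given event at least once."""
    (n,) = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM events WHERE event_name = ?",
        (event_name,),
    ).fetchone()
    return n


# Average order size across all completed orders.
(avg_order,) = conn.execute(
    "SELECT AVG(value) FROM events WHERE event_name = 'order.complete'"
).fetchone()

# Share of users who started a checkout and went on to finish one.
started = distinct_users(conn, "checkout.start")
finished = distinct_users(conn, "order.complete")
completion_rate = finished / started if started else 0.0

print(f"average order size: {avg_order:.2f}")
print(f"checkout completion: {completion_rate:.0%}")
```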

If any ad hoc queries return data that is similar to what you could get out of Google Analytics (for example, aggregate numbers like pageviews and uniques), it’s probably a dumb idea to try to do those in-house. Don’t do more work than you have to! Instead, the only homegrown stuff should be so specific to your business that it’s easier to do in-house than to shoehorn it into a 3rd-party analytics tool. Don’t waste your effort on numbers an off-the-shelf analytics package would get you.

Assuming that your product is stable, most startups will want to tackle this within the first few weeks (but obviously not until you have data).

Step 5: Formal in-house reporting
Once the product features (and thus the user flows) are sufficiently mature to invest in this area, then it makes sense to formalize the reports. Typically I would start out with a series of pretty plain HTML pages using tables that just print out the results of SQL queries. You can add finishing touches like percentages, key ratios, etc. as you go. I generally invest zero time into cute visualizations and graphs, and prefer to read the key numbers.
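
A plain HTML report of the kind described above might be sketched like this. The helper name `render_report`, the sample data, and the `events` table are hypothetical; the point is only that a report can be little more than a SQL query printed as a table.

```python
import html
import sqlite3


def render_report(conn, title, sql):
    """Run one SQL query and return its result as a bare-bones HTML table."""
    cursor = conn.execute(sql)
    headers = [col[0] for col in cursor.description]
    parts = [f"<h1>{html.escape(title)}</h1>", "<table border='1'>"]
    header_cells = "".join(f"<th>{html.escape(h)}</th>" for h in headers)
    parts.append(f"<tr>{header_cells}</tr>")
    for row in cursor:
        cells = "".join(f"<td>{html.escape(str(c))}</td>" for c in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, 1, ?)",
    [
        (1000, "profile.photo.update", "2008-03-14T09:30:00"),
        (1000, "video.download", "2008-03-14T09:45:00"),
        (1001, "video.download", "2008-03-14T10:00:00"),
    ],
)

page = render_report(
    conn,
    "Events by type",
    """
    SELECT event_name,
           COUNT(*) AS events,
           COUNT(DISTINCT user_id) AS users
    FROM events
    GROUP BY event_name
    ORDER BY event_name
    """,
)
print(page)
```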

How many reports should you generate? I find that it’s pretty addictive to build reports and get a clear understanding of what’s actually happening in your product. So create enough that you can make key decisions, but don’t go too far either – you’ll hit diminishing returns quickly. Generally, 2-3 reports are good enough to start, but ultimately you’ll probably track dozens of dashboards, each focusing on a specific aspect of your business, like:

  • System performance and uptime
  • User acquisition via each method you use
  • Aggregate metrics
  • Retention
  • Engagement
  • Content creation?
  • Ads and monetization?
  • Pricing and revenue?
  • etc.

Anyway, get enough data but not too much – it’s a fine balance. For timing, it probably only makes sense to do this once the product is quite stable and the key user flows are stable as well. This is likely at least a month or two out from the prototype stage.

Step 6: Too much data! Reports are too slow!
If you’re lucky, eventually your reports will be too slow. At Revenue Science, we were gathering something like 1 billion pixel hits per day, and that had to be translated into reporting. Ouch. So you will likely go through a couple of specific steps:

  • Reports will initially query the production server – eventually this doesn’t work and slows down the site
  • Reports and data are then moved off to a slave machine, where the queries still happen in real-time – but eventually this doesn’t work either because it’s too slow and there’s too much data
  • Reports and data are then pre-processed every hour, and then served up – which is fine, until your queries take too long, and you have to keep moving
  • Data is then replicated across a number of slave machines, where the pre-processing happens
  • etc.
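
The hourly pre-processing step in the list above can be sketched as a rollup into a small summary table that reports read from instead of the raw log. The table `hourly_event_counts` and the function `rollup_hour` are hypothetical names, and in practice the function would be triggered by a scheduler such as cron.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        user_id INTEGER, event_name TEXT, value REAL, created_at TEXT
    );
    -- Hypothetical summary table that reports query instead of raw events.
    CREATE TABLE hourly_event_counts (
        hour       TEXT    NOT NULL,
        event_name TEXT    NOT NULL,
        events     INTEGER NOT NULL,
        users      INTEGER NOT NULL,
        PRIMARY KEY (hour, event_name)
    );
    INSERT INTO events VALUES
        (1000, 'video.download', 1, '2008-03-14T09:05:00'),
        (1001, 'video.download', 1, '2008-03-14T09:40:00'),
        (1000, 'video.download', 1, '2008-03-14T10:10:00');
    """
)


def rollup_hour(conn, hour):
    """Recompute summary rows for one hour, given as e.g. '2008-03-14T09'."""
    # Delete first so that re-running the job for the same hour is safe.
    conn.execute("DELETE FROM hourly_event_counts WHERE hour = ?", (hour,))
    conn.execute(
        """
        INSERT INTO hourly_event_counts (hour, event_name, events, users)
        SELECT ?, event_name, COUNT(*), COUNT(DISTINCT user_id)
        FROM events
        WHERE substr(created_at, 1, 13) = ?
        GROUP BY event_name
        """,
        (hour, hour),
    )
    conn.commit()


rollup_hour(conn, "2008-03-14T09")
summary = conn.execute(
    "SELECT hour, event_name, events, users FROM hourly_event_counts"
).fetchall()
print(summary)
```

Reports then hit only the summary table, so their cost depends on the number of hours and event types rather than on the raw event volume.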

There are many, many layers of incremental improvements you can make here – but the toughest nut to crack, in the case where your web product is HUGE, is that you will be inserting more data into the system than the system can process within a reasonable time.

That’s when the more exotic technologies like Hadoop, HBase, Hypertable, etc. start to make a difference. Most sites don’t have to deal with this, so I’ll stop here!

Conclusion
Eventually, most serious analytics-driven businesses have to build their own internal analytics. It’s not pretty, but it has to be done. Hopefully the above article gives some background on the key issues you might want to look at as you scale up your product.

If you liked this blog post, please recommend it to a colleague and/or click here to get updates via email or RSS.


  • dmourati

    Bangin! Checked out Splunk?

    http://www.splunk.com

  • http://andrewchen.typepad.com Andrew Chen

    yeah, splunk is really rad…

  • http://andrewchen.typepad.com Andrew Chen

    splunk is neat since it's somewhere in-between writing your own log parsing and buying off-the-shelf reporting. Comes with a nice query language, etc.

  • http://clouin.com Pierre Henri Clouin

    Very interesting roadmap to building a robust web analytics capability.

    Regarding step 6 about data overload, have you looked into CEP (complex event processing) to filter out the noise and process relevant data in real time? It is also useful not only for reporting but also to take advantage of a real time feedback loop.

  • spanky

    On top of that, generally the first thing I do from an acquisition standpoint is to set up a landing page system – where you can dynamically create new landing pages that write to a field in the database called acquired_by. For all traffic landing on that page, you can easily calculate CPA on each campaign by querying the database against each unique acquired_by. Then over time you can add reports that will give you the CPA, activation rate, viral rates, retention rates, etc. against each acquired_by source.

  • http://www.twilio.com Jeff Lawson

    Good ideas Andrew. WRT your mysql table, I've found that storing raw data (for later analysis) can potentially become an application bottleneck, as that table will grow large and probably be abused in ad-hoc reporting. That's why we're using Amazon's SimpleDB to throw lots of data at. We can pull it back down later into optimized table formats, or write reports direcly from their API, but we'll never worry about the growth and scale of that database, and we'll be free to over-log, instead of losing data.

  • http://andrewchen.typepad.com Andrew Chen

    yeah, I think it just depends on what phase of the business you're at. Early on, I think that you want to optimize for easy investigation of ideas, which is why I prefer SQL. If the tables get large, you can always just rename the table and start writing into a new one (believe it or not this works fine and we did it with multi-million row tables). Later on, I agree that some of the more exotic technologies can work – but you want to make sure you're not prematurely optimizing.

  • http://andrewchen.typepad.com Andrew Chen

    Yep, you definitely want to capture both the source as well as potentially referrer URL – great idea.

  • http://500hats.typepad.com dave mcclure

    wow, andrew…

    in the infamous words of Wayne & Garth:

    “WE'RE NOT WORTHY! WE'RE NOT WORTHY!”

  • http://andrewchen.typepad.com Andrew Chen

    My inclination on this stuff, especially in the early product stages, is to keep things as simple as possible and make sure you're spending the bulk of your attention on product with analytics as a important 2nd priority. So I tend to architect initial analytics systems very simplistically.

  • http://20bits.com Jesse Farmer

    I generally invest zero time into cute visualizations and graphs, and prefer to read the key numbers.

    I don't know about “cute,” but I think this is a mistake. Good visualizations are hard to make, but they remove unwanted detail and clearly reveal the nature of the underlying data in a way that's easy to understand.

    Doing it for every little report is wasted effort, but not for key metrics — especially metrics you might want to show to other people some day (say, investors).

  • Ed Baker

    awesome post, andrew!!!

  • http://www.nosnivelling.com daveschappell

    Andrew — this was fantastic — I got it from a Dave McClure tweet, and it's one of the rare posts that I've actually read top-to-bottom in weeks! Your flow was precise, and has mirrored our progress almost perfectly — we're trapped right now in a little bit of step 4-5 madness.

    Thanks,

    Dave (Founder and CEO – TeachStreet)

  • http://www.leadsexplorer.com/blog Engago team

    You're quite right: one needs to measure in order to know.
    Still metrics about traffic are not leads. One can have traffic that don't convert to leads.
    It is leads you need to make sales in B2B.
    Get the company names of your website visitors. These “warm” companies you can call.

  • Pete Mauro

    When I suggested this post I had not idea you were gonna blow it out like this. Awesome work Andrew!

    Regarding #3 User flows: I would love some more detail about collecting and reporting data as it pertains to user flows. GA does a decent job with funnels but I have found it limiting. How do you approach collection and reporting and what tools do you use?

  • http://andrewchen.typepad.com Andrew Chen

    all in house, unfortunately… I agree that GA sucks ;-)

  • http://doctype.cx andrew korf

    Thanks for excellent clear concise post. Really helpful and valuable for those of working on starting something.

  • http://andrewchen.typepad.com Andrew Chen

    spam

  • http://hyperbio.net Leila Boujnane

    Bloody awesome blog post Andrew. I am sitting between step 2 and 3 for one of our products so this was amazingly timely. Thanks!

  • gargouri2001

    Nice write up and blog , Thanks for sharing all those good info

    best regards
    John
    http://xtupload.com

  • http://www.femgineer.com Poornima

    I just finished implementing an internal tool to help with analytics, and its hella slow, and yes I'm querying slaves, but its the production db. I'm going to think about your suggestion to move it into another data driven db for analytics. Thanks!
