Fixed the publisher information (to enable leaderboard) to show the number
of child publishers and the total for all sub-publishers

file:a/.gitignore -> file:b/.gitignore
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,6 @@
 # Packages
@@ -13,6 +15,10 @@
+# Private info
 # Installer logs

file:a/ (deleted)
--- a/
+++ /dev/null
@@ -1,4 +1,1 @@
-For creating detailed reports of CKAN analytics, sliced by group

file:b/README.rst (new)
--- /dev/null
+++ b/README.rst
@@ -1,1 +1,102 @@
+**Status:** Development
+**CKAN Version:** 1.7.1+
+For creating detailed reports of CKAN analytics, including totals per group.
+Whereas ckanext-googleanalytics focusses on providing page view stats for a recent period and for all time (aimed at end users), ckanext-ga-report is more interested in building regular periodic reports (more for site managers to monitor).
+Contents of this extension:
+ * Use the CLI tool to download Google Analytics data for each time period into this extension's database tables
+ * Users can view the data as web page reports
+1. Activate your CKAN python environment and install this extension's software::
+    $ . pyenv/bin/activate
+    $ pip install -e  git+
+2. Ensure your development.ini (or similar) contains the info about your Google Analytics account and configuration::
+ = UA-1010101-1
+      googleanalytics.account = Account name (i.e., see top level item at
+      ga-report.period = monthly
+   Note that your credentials will be readable by system administrators on your server. Rather than use sensitive account details, it is suggested you give access to the GA account to a new Google account that you create just for this purpose.
+3. Set up this extension's database tables using a paster command. (Ensure your CKAN pyenv is still activated, run the command from ``src/ckanext-ga-report``, alter the ``--config`` option to point to your site config file)::
+    $ paster initdb --config=../ckan/development.ini
+4. Enable the extension in your CKAN config file by adding it to ``ckan.plugins``::
+    ckan.plugins = ga-report
+Before you can access the data, you need to set up the OAuth details by following the `instructions <>`_. The outcome will be a file called credentials.json, which should look like credentials.json.template with the relevant fields completed. These steps are reproduced below for convenience:
+1. Visit the `Google APIs Console <>`_
+2. Sign-in and create a project or use an existing project.
+3. In the `Services pane <>`_ , activate Analytics API for your project. If prompted, read and accept the terms of service.
+4. Go to the `API Access pane <>`_
+5. Click Create an OAuth 2.0 client ID....
+6. Fill out the Branding Information fields and click Next.
+7. In Client ID Settings, set Application type to Installed application.
+8. Click Create client ID
+9. The details you need below are Client ID, Client secret, and Redirect URIs
+Once you have set up your credentials.json file you can generate an oauth token file by using the
+following command, which will store your oauth token in a file called token.dat once you have finished
+giving permission in the browser::
+    $ paster getauthtoken --config=../ckan/development.ini
+Download some GA data and store it in CKAN's db. Ensure your CKAN pyenv is still activated, run the command from ``src/ckanext-ga-report``, alter the ``--config`` option to point to your site config file, and specify the name of your auth file (token.dat by default) from the previous step::
+    $ paster loadanalytics token.dat latest --config=../ckan/development.ini
+The value after the token file is how much data you want to retrieve. This can be:
+* **all**         - data for all time (since 2010)
+* **latest**      - (default) just the 'latest' data
+* **YYYY-MM-DD**  - just data for all time periods going back to (and including) this date
+Software Licence
+This software is developed by Cabinet Office. It is Crown Copyright and opened up under the Open Government Licence (OGL) (which is compatible with Creative Commons Attribution License).
+OGL terms:

--- /dev/null
+++ b/ckanext/
@@ -1,1 +1,8 @@
+# this is a namespace package
+try:
+    import pkg_resources
+    pkg_resources.declare_namespace(__name__)
+except ImportError:
+    import pkgutil
+    __path__ = pkgutil.extend_path(__path__, __name__)

--- /dev/null
+++ b/ckanext/ga_report/
@@ -1,1 +1,8 @@
+# this is a namespace package
+try:
+    import pkg_resources
+    pkg_resources.declare_namespace(__name__)
+except ImportError:
+    import pkgutil
+    __path__ = pkgutil.extend_path(__path__, __name__)

--- /dev/null
+++ b/ckanext/ga_report/
@@ -1,1 +1,101 @@
+import logging
+import datetime
+from ckan.lib.cli import CkanCommand
+# No other CKAN imports allowed until _load_config is run,
+# or logging is disabled
+class InitDB(CkanCommand):
+    """Initialise the extension's database tables
+    """
+    summary = __doc__.split('\n')[0]
+    usage = __doc__
+    max_args = 0
+    min_args = 0
+    def command(self):
+        self._load_config()
+        import ckan.model as model
+        model.Session.remove()
+        model.Session.configure(bind=model.meta.engine)
+        log = logging.getLogger('')
+        import ga_model
+        ga_model.init_tables()
+        log.info("DB tables are set up")
+class GetAuthToken(CkanCommand):
+    """ Gets the Google auth token
+    Usage: paster getauthtoken <credentials_file>
+    Where <credentials_file> is the file name containing the details
+    for the service (obtained from
+    By default this is set to credentials.json
+    """
+    summary = __doc__.split('\n')[0]
+    usage = __doc__
+    max_args = 1  # the optional <credentials_file> argument
+    min_args = 0
+    def command(self):
+        """
+        In this case we don't want a valid service, but rather just to
+        force the user through the auth flow. We allow this to complete to
+        act as a form of verification instead of just getting the token and
+        assuming it is correct.
+        """
+        from ga_auth import init_service
+        init_service('token.dat',
+                      self.args[0] if self.args
+                                   else 'credentials.json')
+class LoadAnalytics(CkanCommand):
+    """Get data from Google Analytics API and save it
+    in the ga_model
+    Usage: paster loadanalytics <tokenfile> <time-period>
+    Where <tokenfile> is the name of the auth token file from
+    the getauthtoken step.
+    And where <time-period> is:
+        all         - data for all time
+        latest      - (default) just the 'latest' data
+        YYYY-MM-DD  - just data for all time periods going
+                      back to (and including) this date
+    """
+    summary = __doc__.split('\n')[0]
+    usage = __doc__
+    max_args = 2
+    min_args = 1
+    def command(self):
+        self._load_config()
+        from download_analytics import DownloadAnalytics
+        from ga_auth import (init_service, get_profile_id)
+        try:
+            svc = init_service(self.args[0], None)
+        except TypeError:
+            print ('Have you correctly run the getauthtoken task and '
+                   'specified the correct file here?')
+            return
+        downloader = DownloadAnalytics(svc, profile_id=get_profile_id(svc))
+        time_period = self.args[1] if self.args and len(self.args) > 1 \
+            else 'latest'
+        if time_period == 'all':
+            downloader.all_()
+        elif time_period == 'latest':
+            downloader.latest()
+        else:
+            since_date = datetime.datetime.strptime(time_period, '%Y-%m-%d')
+            downloader.since_date(since_date)
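
Review note: the `<time-period>` handling at the end of `LoadAnalytics.command` can be pulled out and tested on its own. The sketch below is a minimal illustration only; `parse_time_period` is a hypothetical helper name that does not exist in this commit, and it is written to run on current Python rather than the Python 2 used by the extension.

```python
import datetime


def parse_time_period(arg=None):
    """Map the CLI value to 'all', 'latest', or a datetime.

    Hypothetical helper: the command in this diff dispatches inline.
    """
    if arg is None or arg == 'latest':
        # 'latest' is the documented default when no value is given.
        return 'latest'
    if arg == 'all':
        return 'all'
    # Anything else must be a YYYY-MM-DD date; strptime raises
    # ValueError for strings that do not match.
    return datetime.datetime.strptime(arg, '%Y-%m-%d')
```

Returning a single value keeps the dispatch in the command to three branches, and a malformed date fails with a `ValueError` before any download starts.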

--- /dev/null
+++ b/ckanext/ga_report/
@@ -1,1 +1,10 @@
+import logging
+from ckan.lib.base import BaseController, c, render
+import report_model
+log = logging.getLogger('')
+class GaReport(BaseController):
+    def index(self):
+        return render('index.html')

--- /dev/null
+++ b/ckanext/ga_report/
@@ -1,1 +1,299 @@
+import os
+import logging
+import datetime
+from pylons import config
+import ga_model
+#from ga_client import GA
+log = logging.getLogger('')
+class DownloadAnalytics(object):
+    '''Downloads and stores analytics info'''
+    def __init__(self, service=None, profile_id=None):
+        self.period = config['ga-report.period']
+        self.service = service
+        self.profile_id = profile_id
+    def all_(self):
+        self.since_date(datetime.datetime(2010, 1, 1))
+    def latest(self):
+        if self.period == 'monthly':
+            # from first of this month to today
+            now = datetime.datetime.now()
+            first_of_this_month = datetime.datetime(now.year, now.month, 1)
+            periods = ((now.strftime(FORMAT_MONTH),
+                        now.day,
+                        first_of_this_month, now),)
+        else:
+            raise NotImplementedError
+        self.download_and_store(periods)
+    def since_date(self, since_date):
+        assert isinstance(since_date, datetime.datetime)
+        periods = [] # (period_name, period_complete_day, start_date, end_date)
+        if self.period == 'monthly':
+            first_of_the_months_until_now = []
+            year = since_date.year
+            month = since_date.month
+            now = datetime.datetime.now()
+            first_of_this_month = datetime.datetime(now.year, now.month, 1)
+            while True:
+                first_of_the_month = datetime.datetime(year, month, 1)
+                if first_of_the_month == first_of_this_month:
+                    periods.append((now.strftime(FORMAT_MONTH),
+                                    now.day,
+                                    first_of_this_month, now))
+                    break
+                elif first_of_the_month < first_of_this_month:
+                    in_the_next_month = first_of_the_month + datetime.timedelta(40)
+                    last_of_the_month = datetime.datetime(in_the_next_month.year,
+                                                           in_the_next_month.month, 1)\
+                                                           - datetime.timedelta(1)
+                    # Name the period after the month it covers, not the current month
+                    periods.append((first_of_the_month.strftime(FORMAT_MONTH), 0,
+                                    first_of_the_month, last_of_the_month))
+                else:
+                    # first_of_the_month has got to the future somehow
+                    break
+                month += 1
+                if month > 12:
+                    year += 1
+                    month = 1
+        else:
+            raise NotImplementedError
+        self.download_and_store(periods)
+    @staticmethod
+    def get_full_period_name(period_name, period_complete_day):
+        if period_complete_day:
+            return period_name + ' (up to %ith)' % period_complete_day
+        else:
+            return period_name
+    def download_and_store(self, periods):
+        for period_name, period_complete_day, start_date, end_date in periods:
+            log.info('Downloading Analytics for period "%s" (%s - %s)',
+                     self.get_full_period_name(period_name, period_complete_day),
+                     start_date.strftime('%Y %m %d'),
+                     end_date.strftime('%Y %m %d'))
+            """
+            data = self.download(start_date, end_date, '~/dataset/[a-z0-9-_]+')
+            log.info('Storing Dataset Analytics for period "%s"',
+                     self.get_full_period_name(period_name, period_complete_day))
+            self.store(period_name, period_complete_day, data, )
+            data = self.download(start_date, end_date, '~/publisher/[a-z0-9-_]+')
+            log.info('Storing Publisher Analytics for period "%s"',
+                     self.get_full_period_name(period_name, period_complete_day))
+            self.store(period_name, period_complete_day, data,)
+            """
+            ga_model.update_publisher_stats(period_name) # about 30 seconds.
+            self.sitewide_stats( period_name )
+    def download(self, start_date, end_date, path='~/dataset/[a-z0-9-_]+'):
+        '''Get data from GA for a given time period'''
+        start_date = start_date.strftime('%Y-%m-%d')
+        end_date = end_date.strftime('%Y-%m-%d')
+        query = 'ga:pagePath=%s$' % path
+        metrics = 'ga:uniquePageviews, ga:visitors'
+        sort = '-ga:uniquePageviews'
+        # Supported query params at
+        #
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 filters=query,
+                                 start_date=start_date,
+                                 metrics=metrics,
+                                 sort=sort,
+                                 dimensions="ga:pagePath",
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        if os.getenv('DEBUG'):
+            import pprint
+            pprint.pprint(results)
+            print 'Total results: %s' % results.get('totalResults')
+        packages = []
+        for entry in results.get('rows'):
+            (loc,pageviews,visits) = entry
+            packages.append( ('http:/' + loc, pageviews, visits,) ) # Temporary hack
+        return dict(url=packages)
+    def store(self, period_name, period_complete_day, data):
+        if 'url' in data:
+            ga_model.update_url_stats(period_name, period_complete_day, data['url'])
+    def sitewide_stats(self, period_name):
+        import calendar
+        year, month = period_name.split('-')
+        _, last_day_of_month = calendar.monthrange(int(year), int(month))
+        start_date = '%s-01' % period_name
+        end_date = '%s-%s' % (period_name, last_day_of_month)
+        print 'Sitewide_stats for %s (%s -> %s)' % (period_name, start_date, end_date)
+        funcs = ['_totals_stats', '_social_stats', '_os_stats',
+                 '_locale_stats', '_browser_stats', '_mobile_stats']
+        for f in funcs:
+            print ' + Fetching %s stats' % f.split('_')[1]
+            getattr(self, f)(start_date, end_date, period_name)
+    @staticmethod
+    def _get_results(result_data, f):
+        data = {}
+        for result in result_data:
+            key = f(result)
+            data[key] = data.get(key,0) + result[1]
+        return data
+    def _totals_stats(self, start_date, end_date, period_name):
+        """ Fetches distinct totals, total pageviews etc """
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 start_date=start_date,
+                                 metrics='ga:uniquePageviews',
+                                 sort='-ga:uniquePageviews',
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        result_data = results.get('rows')
+        ga_model.update_sitewide_stats(period_name, "Totals", {'Total pageviews': result_data[0][0]})
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 start_date=start_date,
+                                 metrics='ga:pageviewsPerVisit,ga:bounces,ga:avgTimeOnSite,ga:percentNewVisits',
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        result_data = results.get('rows')
+        data = {
+            'Pages per visit': result_data[0][0],
+            'Bounces': result_data[0][1],
+            'Average time on site': result_data[0][2],
+            'Percent new visits': result_data[0][3],
+        }
+        ga_model.update_sitewide_stats(period_name, "Totals", data)
+    def _locale_stats(self, start_date, end_date, period_name):
+        """ Fetches stats about language and country """
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 start_date=start_date,
+                                 metrics='ga:uniquePageviews',
+                                 sort='-ga:uniquePageviews',
+                                 dimensions="ga:language,ga:country",
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        result_data = results.get('rows')
+        data = {}
+        for result in result_data:
+            data[result[0]] = data.get(result[0], 0) + int(result[2])
+        ga_model.update_sitewide_stats(period_name, "Languages", data)
+        data = {}
+        for result in result_data:
+            data[result[1]] = data.get(result[1], 0) + int(result[2])
+        ga_model.update_sitewide_stats(period_name, "Country", data)
+    def _social_stats(self, start_date, end_date, period_name):
+        """ Finds out which social sites people are referred from """
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 start_date=start_date,
+                                 metrics='ga:uniquePageviews',
+                                 sort='-ga:uniquePageviews',
+                                 dimensions="ga:socialNetwork,ga:referralPath",
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        result_data = results.get('rows')
+        twitter_links = []
+        data = {}
+        for result in result_data:
+            if not result[0] == '(not set)':
+                data[result[0]] = data.get(result[0], 0) + int(result[2])
+                if result[0] == 'Twitter':
+                    twitter_links.append(result[1])
+        ga_model.update_sitewide_stats(period_name, "Social sources", data)
+    def _os_stats(self, start_date, end_date, period_name):
+        """ Operating system stats """
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 start_date=start_date,
+                                 metrics='ga:uniquePageviews',
+                                 sort='-ga:uniquePageviews',
+                                 dimensions="ga:operatingSystem,ga:operatingSystemVersion",
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        result_data = results.get('rows')
+        data = {}
+        for result in result_data:
+            data[result[0]] = data.get(result[0], 0) + int(result[2])
+        ga_model.update_sitewide_stats(period_name, "Operating Systems", data)
+        data = {}
+        for result in result_data:
+            key = "%s (%s)" % (result[0],result[1])
+            data[key] = result[2]
+        ga_model.update_sitewide_stats(period_name, "Operating Systems versions", data)
+    def _browser_stats(self, start_date, end_date, period_name):
+        """ Information about browsers and browser versions """
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 start_date=start_date,
+                                 metrics='ga:uniquePageviews',
+                                 sort='-ga:uniquePageviews',
+                                 dimensions="ga:browser,ga:browserVersion",
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        result_data = results.get('rows')
+        data = {}
+        for result in result_data:
+            data[result[0]] = data.get(result[0], 0) + int(result[2])
+        ga_model.update_sitewide_stats(period_name, "Browsers", data)
+        data = {}
+        for result in result_data:
+            key = "%s (%s)" % (result[0], result[1])
+            data[key] = result[2]
+        ga_model.update_sitewide_stats(period_name, "Browser versions", data)
+    def _mobile_stats(self, start_date, end_date, period_name):
+        """ Info about mobile devices """
+        results = self.service.data().ga().get(
+                                 ids='ga:' + self.profile_id,
+                                 start_date=start_date,
+                                 metrics='ga:uniquePageviews',
+                                 sort='-ga:uniquePageviews',
+                                 dimensions="ga:mobileDeviceBranding, ga:mobileDeviceInfo",
+                                 max_results=10000,
+                                 end_date=end_date).execute()
+        result_data = results.get('rows')
+        data = {}
+        for result in result_data:
+            data[result[0]] = data.get(result[0], 0) + int(result[2])
+        ga_model.update_sitewide_stats(period_name, "Mobile brands", data)
+        data = {}
+        for result in result_data:
+            data[result[1]] = data.get(result[1], 0) + int(result[2])
+        ga_model.update_sitewide_stats(period_name, "Mobile devices", data)
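
Review note: the month-walking logic in `since_date` is easier to check when the clock is passed in rather than read inside the method. The following is a sketch under stated assumptions, not the code in this commit: `monthly_periods` is a hypothetical name, and `FORMAT_MONTH = '%Y-%m'` is assumed because `sitewide_stats` splits the period name on `-` into year and month. Each past month is named after its own first day.

```python
import datetime

# Assumed value: sitewide_stats splits period names as 'YYYY-MM'.
FORMAT_MONTH = '%Y-%m'


def monthly_periods(since_date, now):
    """Return (period_name, period_complete_day, start, end) tuples,
    one per month from since_date's month up to now's month."""
    periods = []
    first_of_this_month = datetime.datetime(now.year, now.month, 1)
    year, month = since_date.year, since_date.month
    while True:
        first = datetime.datetime(year, month, 1)
        if first > first_of_this_month:
            # since_date lies in a future month: nothing to report.
            break
        if first == first_of_this_month:
            # The current month is incomplete, so record the day reached.
            periods.append((first.strftime(FORMAT_MONTH), now.day, first, now))
            break
        # The 1st plus 40 days always falls in the following month.
        in_next_month = first + datetime.timedelta(40)
        last = (datetime.datetime(in_next_month.year, in_next_month.month, 1)
                - datetime.timedelta(1))
        # 0 marks a complete month.
        periods.append((first.strftime(FORMAT_MONTH), 0, first, last))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return periods
```

Taking `now` as a parameter makes the function deterministic, so year rollover and the partial current month can be exercised with fixed dates.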

--- /dev/null
+++ b/ckanext/ga_report/
@@ -1,1 +1,70 @@
+import os
+import httplib2
+from apiclient.discovery import build
+from oauth2client.client import flow_from_clientsecrets
+from oauth2client.file import Storage
+from oauth2client.tools import run
+from pylons import config
+def _prepare_credentials(token_filename, credentials_filename):
+    """
+    Either returns the user's oauth credentials or uses the credentials
+    file to generate a token (by forcing the user to login in the browser)
+    """
+    storage = Storage(token_filename)
+    credentials = storage.get()
+    if credentials is None or credentials.invalid:
+        flow = flow_from_clientsecrets(credentials_filename,
+                scope='',
+                message="Can't find the credentials file")
+        credentials = run(flow, storage)
+    return credentials
+def init_service(token_file, credentials_file):
+    """
+    Given a file containing the user's oauth token (and another with
+    credentials in case we need to generate the token) will return a
+    service object representing the analytics API.
+    """
+    http = httplib2.Http()
+    credentials = _prepare_credentials(token_file, credentials_file)
+    http = credentials.authorize(http)  # authorize the http object
+    return build('analytics', 'v3', http=http)
+def get_profile_id(service):
+    """
+    Get the profile ID for this user and the service specified by the
+    '' configuration option. This function iterates
+    over all of the accounts available to the user who invoked the
+    service to find one where the account name matches (in case the
+    user has several).
+    """
+    accounts = service.management().accounts().list().execute()
+    if not accounts.get('items'):
+        return None
+    accountName = config.get('googleanalytics.account')
+    webPropertyId = config.get('')
+    for acc in accounts.get('items'):
+        if acc.get('name') == accountName:
+            accountId = acc.get('id')
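+    webproperties = service.management().webproperties().list(
+        accountId=accountId).execute()
+    profiles = service.management().profiles().list(
+        accountId=accountId, webPropertyId=webPropertyId).execute()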
+    if profiles.get('items'):
+        return profiles.get('items')[0].get('id')
+    return None
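
Review note: in `get_profile_id`, `accountId` is only assigned inside the loop, so if no account name matches the configured one the later lookup fails with a `NameError`. A minimal sketch of a safer lookup follows. `find_account_id` is a hypothetical helper, and the sample response is made-up data that only mirrors the `items`, `name` and `id` fields the function already reads.

```python
def find_account_id(accounts, account_name):
    """Return the id of the first account whose name matches, else None.

    `accounts` is the dict returned by the accounts listing call; only
    the 'items', 'name' and 'id' fields used in the diff are assumed.
    """
    for acc in accounts.get('items') or []:
        if acc.get('name') == account_name:
            return acc.get('id')
    # No match: let the caller decide how to report it.
    return None


# Made-up response for illustration only.
sample = {'items': [{'name': 'Other site', 'id': '111'},
                    {'name': 'Example account', 'id': '222'}]}
matched = find_account_id(sample, 'Example account')
missing = find_account_id(sample, 'No such account')
```

With this shape the caller can return `None` early, which matches what the function already does when the account list is empty.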

--- /dev/null
+++ b/ckanext/ga_report/
@@ -1,1 +1,250 @@
+import re
+import uuid
+from sqlalchemy import Table, Column, MetaData
+from sqlalchemy import types
+from sqlalchemy.sql import select
+from sqlalchemy.orm import mapper
+from sqlalchemy import func
+import ckan.model as model
+from ckan.lib.base import *
+def make_uuid():
+    return unicode(uuid.uuid4())
+class GA_Url(object):
+    def __init__(self, **kwargs):
+        for k,v in kwargs.items():
+            setattr(self, k, v)
+class GA_Stat(object):
+    def __init__(self, **kwargs):
+        for k,v in kwargs.items():
+            setattr(self, k, v)
+class GA_Publisher(object):
+    def __init__(self, **kwargs):
+        for k,v in kwargs.items():
+            setattr(self, k, v)
+metadata = MetaData()
+url_table = Table('ga_url', metadata,
+                      Column('id', types.UnicodeText, primary_key=True,
+                             default=make_uuid),
+                      Column('period_name', types.UnicodeText),
+                      Column('period_complete_day', types.Integer),
+                      Column('pageviews', types.UnicodeText),
+                      Column('visitors', types.UnicodeText),
+                      Column('url', types.UnicodeText),
+                      Column('department_id', types.UnicodeText),
+                )
+mapper(GA_Url, url_table)
+stat_table = Table('ga_stat', metadata,
+                  Column('id', types.UnicodeText, primary_key=True,
+                         default=make_uuid),
+                  Column('period_name', types.UnicodeText),
+                  Column('stat_name', types.UnicodeText),
+                  Column('key', types.UnicodeText),
+                  Column('value', types.UnicodeText), )
+mapper(GA_Stat, stat_table)
+pub_table = Table('ga_publisher', metadata,
+                  Column('id', types.UnicodeText, primary_key=True,
+                         default=make_uuid),
+                  Column('period_name', types.UnicodeText),
+                  Column('publisher_name', types.UnicodeText),
+                  Column('views', types.UnicodeText),
+                  Column('visitors', types.UnicodeText),
+                  Column('toplevel', types.Boolean, default=False),
+                  Column('subpublishercount', types.Integer, default=0),
+                  Column('parent', types.UnicodeText), )
+mapper(GA_Publisher, pub_table)
+def init_tables():
+    metadata.create_all(model.meta.engine)
+cached_tables = {}
+def get_table(name):
+    if name not in cached_tables:
+        meta = MetaData()
+        meta.reflect(bind=model.meta.engine)
+        table = meta.tables[name]
+        cached_tables[name] = table
+    return cached_tables[name]
+def _normalize_url(url):
+    '''Strip off the hostname etc. Do this before storing it.
+    >>> _normalize_url('')
+    '/dataset/weekly_fuel_prices'
+    '''
+    url = re.sub('https?://(www\.)?', '', url)
+    return url
+def _get_department_id_of_url(url):
+    # e.g. /dataset/fuel_prices
+    # e.g. /dataset/fuel_prices/resource/e63380d4
+    dataset_match = re.match('/dataset/([^/]+)(/.*)?', url)
+    if dataset_match:
+        dataset_ref = dataset_match.groups()[0]
+        dataset = model.Package.get(dataset_ref)
+        if dataset:
+            publisher_groups = dataset.get_groups('publisher')
+            if publisher_groups:
+                return publisher_groups[0].name
+    else:
+        publisher_match = re.match('/publisher/([^/]+)(/.*)?', url)
+        if publisher_match:
+            return publisher_match.groups()[0]
+def update_sitewide_stats(period_name, stat_name, data):
+    for k,v in data.iteritems():
+        item = model.Session.query(GA_Stat).\
+            filter(GA_Stat.period_name==period_name).\
+            filter(GA_Stat.key==k).\
+            filter(GA_Stat.stat_name==stat_name).first()
+        if item:
+            item.period_name = period_name
+            item.key = k
+            item.value = v
+            model.Session.add(item)
+        else:
+            # create the row
+            values = {'id': make_uuid(),
+                     'period_name': period_name,
+                     'key': k,
+                     'value': v,
+                     'stat_name': stat_name
+                     }
+            model.Session.add(GA_Stat(**values))
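
Review note: `update_sitewide_stats` follows an update-or-insert pattern keyed on (`period_name`, `stat_name`, `key`). The sketch below shows the same pattern against an in-memory SQLite table so it can be run without CKAN. It swaps the SQLAlchemy session for the standard `sqlite3` module, and `upsert_stats` is a hypothetical name; the column names are taken from `stat_table` in this diff.

```python
import sqlite3
import uuid


def upsert_stats(conn, period_name, stat_name, data):
    """Update the row for each key if it exists, otherwise insert it."""
    for key, value in data.items():
        row = conn.execute(
            'SELECT id FROM ga_stat '
            'WHERE period_name = ? AND stat_name = ? AND "key" = ?',
            (period_name, stat_name, key)).fetchone()
        if row:
            conn.execute('UPDATE ga_stat SET value = ? WHERE id = ?',
                         (str(value), row[0]))
        else:
            conn.execute(
                'INSERT INTO ga_stat (id, period_name, stat_name, "key", value) '
                'VALUES (?, ?, ?, ?, ?)',
                (str(uuid.uuid4()), period_name, stat_name, key, str(value)))
    conn.commit()


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ga_stat (id TEXT PRIMARY KEY, period_name TEXT, '
             'stat_name TEXT, "key" TEXT, value TEXT)')
# Loading the same period twice should leave one row with the newer value.
upsert_stats(conn, '2012-10', 'Totals', {'Total pageviews': 100})
upsert_stats(conn, '2012-10', 'Totals', {'Total pageviews': 250})
rows = conn.execute('SELECT "key", value FROM ga_stat').fetchall()
```

Re-running a download for the same period therefore overwrites values instead of duplicating rows, which is the property the loader relies on when `latest` is run repeatedly within a month.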