<?php
include_once('include/common.inc.php');
include_header('About');
?>
<div class="foundation-header">
    <h1><a href="about.php">About/FAQ</a></h1>
</div>
<h2>What is this?</h2>
Disclo.gs is a project to monitor Australian Federal Government agencies'
compliance with their <a href="http://www.oaic.gov.au/publications/other_operational/foi_policy_frequently_asked_questions.html#_Toc291837571">proactive disclosure requirements</a>, building the transparency league table suggested by the Gov 2.0 Taskforce (<a href="http://gov2.net.au/blog/2009/09/19/a-league-ladder-of-psi-openness/">a league ladder of PSI openness</a>).
<h2>Attributions</h2>
National Archives of Australia, Australian Governments’ Interactive Functions Thesaurus, 2nd edition, September 2005, published at http://www.naa.gov.au/recordkeeping/thesaurus/index.htm <br/>
data.gov.au http://data.gov.au/dataset/directory-gov-au-full-data-export/ <br/>
directory.gov.au <br/>
australia.gov.au http://australia.gov.au/about/copyright <br/>
<h2>Open everything</h2>
All documents released CC BY 3.0 AU.<br/>
Open source git @
<h2>Organisational Data Sources</h2>
http://www.comlaw.gov.au/Browse/Results/ByTitle/AdministrativeArrangementsOrders/Current/Ad/0 defines departments.<br>
Agencies can be found in the Schedule to an Appropriation Bill (budget), the Schedule to the FMA Regulations and/or the Public Service Act.<br>
http://www.finance.gov.au/publications/flipchart/docs/FMACACFlipchart.pdf summarises these. view-source:https://www.tenders.gov.au/?event=public.advancedsearch.home is useful for suspended/active status.<br>
Fraud in government departments, reported by Fairfax Media: http://www.smh.com.au/national/public-service-keeps-fraud-cases-private-20110923-1kpdr.html <br>
When defining the hierarchy, this system is designed towards monitoring accountability. Thus large agencies that have registered their own ABN
and have their own accountability mechanisms/website receive a separate record as a child of their department.<br>
Some small agencies will choose to simply rely on their parent department's accountability measures.<br>
This flows through to organisation names and other/past names. A department that completely accounts for an agency will list that agency as an other child name.<br>
As agencies themselves shift between departments, there may be scope for providing time ranges, but typically the newest hierarchy will be the one recorded.<br>
A department/agency name will be the newest active name assigned to that ABN.<br>
ABN information is derived from the ABR. This is the definitive umpire on which former name should be linked to which current name.<br>
For example, "Department of Transport and Regional Services" became "Department of Infrastructure, Transport, Regional Development and Local Government" (same ABN);
however, it later split into "Department of Infrastructure and Transport" (same ABN)
and "Department of Regional Australia, Regional Development and Local Government" (new ABN).<br>
Statistical information from http://www.apsc.gov.au/stateoftheservice/1011/statsbulletin/section1.html#t2total https://www.apsedii.gov.au/apsedii/CustomQueryx33.shtml
and individual annual reports.<br>
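As a minimal sketch, an agency record combining these sources might look like the following (the field names name, parentOrg, orgType and the statistics/employees layout match the visualisation and scraper code elsewhere in this repository; the values here are purely illustrative):<br>
<pre>
$agency = Array(
    "name" => "Department of Example",      // newest active name for this ABN
    "parentOrg" => "parent-department-id",  // set on child agency records only
    "orgType" => "FMA-DepartmentOfState",   // departments act as their own portfolio
    "statistics" => Array(
        "employees" => Array("2010-2011" => Array("value" => 1000)),
    ),
);
</pre>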
<h2>Webpage Assessment</h2>
Much care has been taken to correctly record disclosure URLs. Typically the "About", "Corporate", "Publications" and "Sitemap" sections are checked at the very least.
Occasionally it is necessary to use a site or Google search. In several rare cases, there is a secret "Disclosure" navigation menu that you can only find via one of the mandatory publishing obligations in that category (seriously).<br>
Some rules about leniency:<br>
<ul>
<li>An empty FOI disclosure log counts; a page outlining what the FOI Act is does not.</li>
<li>A disclosure log in PDF or Word format counts :(</li>
<li>An empty file/record list counts (although it is very minimalist to claim no files, electronic or paper).</li>
<li>Only a current Information Publication Scheme page counts, not a s.9 FOI Act page or an organisation chart.</li>
<li>If there isn't a page easily listing all current and past annual reports, the most current one (HTML, PDF) counts.</li>
<li>Consultancy contracts might not need their own webpage (if covered in the annual report), and grants/appointments might not apply to all organisations, but Legal Services Expenditure (and all other obligations) does need a webpage.</li>
</ul>
<h2>Open Government Scoring</h2>
+1 point for every true Has... attribute<br/>
-1 point for every false Has... (i.e. Has Not) attribute<br/>
Don't like this? Make your own score; suggest a better scoring mechanism.<br/>
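A minimal sketch of this scoring rule (the helper name scoreAgency and the "has" key-prefix convention are assumptions for illustration, not part of the stored schema):<br>
<pre>
function scoreAgency($agency) {
    // +1 for each true Has... attribute, -1 for each false (Has Not) one
    $score = 0;
    foreach ($agency as $key => $value) {
        if (stripos($key, "has") === 0) {
            $score += ($value ? 1 : -1);
        }
    }
    return $score;
}
</pre>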
<?php
include_footer();
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="UTF-8"/>
    <title>Minimal BubbleTree Demo</title>
    <script type="text/javascript" src="http://code.jquery.com/jquery-1.7.2.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/jquery.history.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/raphael.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/vis4.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/Tween.js"></script>
    <script type="text/javascript" src="js/bubbletree/build/bubbletree.js"></script>
    <link rel="stylesheet" type="text/css" href="js/bubbletree/build/bubbletree.css" />
    <script type="text/javascript" src="js/bubbletree/styles/cofog.js"></script>
    <script type="text/javascript">
        $(function() {
<?php
include_once('include/common.inc.php');
include("lib/Color.php");
$color = new Lux_Color();

$portfolios = Array();
$total = 0;
$db = $server->get_db('disclosr-agencies');
try {
    $rows = $db->get_view("app", "byDeptStateName", null, true)->rows;
    foreach ($rows as $row) {
        $portfolios[trim(str_replace(Array("Department of", "Department", "the", "'", "`"), "", $row->key))] = $row->value;
    }
} catch (SetteeRestClientException $e) {
    setteErrorHandler($e);
}

$agencies = Array();
try {
    $rows = $db->get_view("app", "byCanonicalName", null, true)->rows;
    //print_r($rows);
    foreach ($rows as $row) {
        $employees = 0;
        $portfolioid = 0;
        if (isset($row->value->employees)) {
            $employees = $row->value->employees;
        }
        if (isset($row->value->statistics->employees)) {
            $agencyEmployeesArray = object_to_array($row->value->statistics->employees);
            if (isset($agencyEmployeesArray["2010-2011"]["value"])) {
                $employees = $agencyEmployeesArray["2010-2011"]["value"];
            } else {
                // bail out for agencies that are closed for business
                continue;
            }
        }
        if (!($employees > 0)) {
            $employees = 0;
        }
        if (isset($row->value->parentOrg)) {
            $portfolioid = $row->value->parentOrg;
        }
        if (isset($row->value->orgType) && $row->value->orgType == "FMA-DepartmentOfState") {
            $portfolioid = $row->id;
        }
        $agencies[$portfolioid][$row->value->name] = $employees;
    }
} catch (SetteeRestClientException $e) {
    setteErrorHandler($e);
}
//print_r($portfolios);
//print_r($agencies);

// http://martin.ankerl.com/2009/12/09/how-to-create-random-colors-programmatically/
$golden_ratio_conjugate = 0.618033988749895;
$h = 0.00 + rand(0, 10) / 10; // use a random start value
$nodes = Array();
foreach ($portfolios as $portfolioName => $portfolioID) {
    $h += $golden_ratio_conjugate;
    $h = fmod($h, 1);
    $portfolioColor = $color->hsv2hex(Array($h, .3, .99));
    $subnodes = Array();
    $portfolioEmployees = 0;
    if (isset($agencies[$portfolioID])) { // guard against portfolios with no agencies recorded
        foreach ($agencies[$portfolioID] as $agencyName => $agencyEmployees) {
            $agencyColor = $color->hsv2hex(Array($h / 10, rand(1, 10) / 10, abs(($h * (1 / 10)) - .5) + .5));
            $subnodes[] = Array(
                "label" => str_replace(Array("'", "`"), "", $agencyName),
                "amount" => $agencyEmployees,
                //"color" => "#" . $agencyColor
            );
            $portfolioEmployees += $agencyEmployees;
        }
    }
    $nodes[] = Array(
        "label" => $portfolioName,
        "amount" => $portfolioEmployees,
        //"color" => "#" . $portfolioColor,
        "children" => $subnodes
    );
    $total += $portfolioEmployees;
}
$data = Array(
    "label" => "Australian Federal Government",
    "amount" => $total,
    //"color" => "#000000",
    "children" => $nodes
);
// json_encode already emits a valid JavaScript literal; no eval() needed
echo "var data = " . json_encode($data) . ";";
?>
            new BubbleTree({
                data: data,
                container: '.bubbletree'
            });
        });
    </script>
</head>
<body>
    <div class="bubbletree-wrapper">
        <div class="bubbletree"></div>
    </div>
</body>
</html>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="UTF-8"/>
    <title>Minimal BubbleTree Demo</title>
    <script type="text/javascript" src="http://code.jquery.com/jquery-1.7.2.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/jquery.history.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/raphael.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/vis4.js"></script>
    <script type="text/javascript" src="js/bubbletree/lib/Tween.js"></script>
    <script type="text/javascript" src="js/bubbletree/build/bubbletree.js"></script>
    <link rel="stylesheet" type="text/css" href="js/bubbletree/build/bubbletree.css" />
    <script type="text/javascript" src="js/bubbletree/styles/cofog.js"></script>
    <script type="text/javascript">
        $(function() {
<?php
include_once('include/common.inc.php');
include("lib/Color.php");
$color = new Lux_Color();

$portfolios = Array();
$total = 0;
$db = $server->get_db('disclosr-agencies');
try {
    $rows = $db->get_view("app", "byDeptStateName", null, true)->rows;
    foreach ($rows as $row) {
        $portfolios[trim(str_replace(Array("Department of", "Department", "the", "'", "`"), "", $row->key))] = $row->value;
    }
} catch (SetteeRestClientException $e) {
    setteErrorHandler($e);
}

$agencies = Array();
try {
    $rows = $db->get_view("app", "byCanonicalName", null, true)->rows;
    //print_r($rows);
    foreach ($rows as $row) {
        // in this variant the bubble sizes are 2011-2012 budget figures, not employee counts
        $employees = 0;
        $portfolioid = 0;
        if (isset($row->value->statistics->budget)) {
            $agencyBudgetArray = object_to_array($row->value->statistics->budget);
            //print_r($agencyBudgetArray);
            if (isset($agencyBudgetArray["2011-2012"]["value"])) {
                $employees = $agencyBudgetArray["2011-2012"]["value"];
            } else {
                // bail out for agencies that are closed for business
                continue;
            }
        }
        if (!($employees > 0)) {
            $employees = 0;
        }
        if (isset($row->value->parentOrg)) {
            $portfolioid = $row->value->parentOrg;
        }
        if (isset($row->value->orgType) && $row->value->orgType == "FMA-DepartmentOfState") {
            $portfolioid = $row->id;
        }
        $agencies[$portfolioid][$row->value->name] = $employees;
    }
} catch (SetteeRestClientException $e) {
    setteErrorHandler($e);
}
//print_r($portfolios);
//print_r($agencies);

// http://martin.ankerl.com/2009/12/09/how-to-create-random-colors-programmatically/
$golden_ratio_conjugate = 0.618033988749895;
$h = 0.00 + rand(0, 10) / 10; // use a random start value
$nodes = Array();
foreach ($portfolios as $portfolioName => $portfolioID) {
    $h += $golden_ratio_conjugate;
    $h = fmod($h, 1);
    $portfolioColor = $color->hsv2hex(Array($h, .3, .99));
    $subnodes = Array();
    $portfolioEmployees = 0;
    if (isset($agencies[$portfolioID])) { // guard against portfolios with no agencies recorded
        foreach ($agencies[$portfolioID] as $agencyName => $agencyEmployees) {
            $agencyColor = $color->hsv2hex(Array($h / 10, rand(1, 10) / 10, abs(($h * (1 / 10)) - .5) + .5));
            $subnodes[] = Array(
                "label" => str_replace(Array("'", "`"), "", $agencyName),
                "amount" => $agencyEmployees,
                //"color" => "#" . $agencyColor
            );
            $portfolioEmployees += $agencyEmployees;
        }
    }
    $nodes[] = Array(
        "label" => $portfolioName,
        "amount" => $portfolioEmployees,
        //"color" => "#" . $portfolioColor,
        "children" => $subnodes
    );
    $total += $portfolioEmployees;
}
$data = Array(
    "label" => "Australian Federal Government",
    "amount" => $total,
    //"color" => "#000000",
    "children" => $nodes
);
// json_encode already emits a valid JavaScript literal; no eval() needed
echo "var data = " . json_encode($data) . ";";
?>
            new BubbleTree({
                data: data,
                container: '.bubbletree'
            });
        });
    </script>
</head>
<body>
    <div class="bubbletree-wrapper">
        <div class="bubbletree"></div>
    </div>
</body>
</html>
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__) or '.', '../'))
import scrape
from bs4 import BeautifulSoup
from time import mktime
import feedparser
import abc
import unicodedata
import re
import dateutil
from dateutil.parser import *
from datetime import *
import codecs
import difflib
import zipfile  # needed by the DOCX scraper below
from lxml import etree  # needed by the DOCX scraper below
from docx import getdocumenttext  # legacy python-docx helper used by the DOCX scraper
from StringIO import StringIO
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams


class GenericDisclogScraper(object):
    __metaclass__ = abc.ABCMeta
    agencyID = None
    disclogURL = None

    def remove_control_chars(self, input):
        return "".join([i for i in input if ord(i) in range(32, 127)])

    def getAgencyID(self):
        """ disclosr agency id """
        if self.agencyID is None:
            self.agencyID = os.path.basename(sys.argv[0]).replace(".py", "")
        return self.agencyID

    def getURL(self):
        """ disclog URL """
        if self.disclogURL is None:
            agency = scrape.agencydb.get(self.getAgencyID())
            self.disclogURL = agency['FOIDocumentsURL']
        return self.disclogURL

    @abc.abstractmethod
    def doScrape(self):
        """ do the scraping """
        return


class GenericHTMLDisclogScraper(GenericDisclogScraper):

    def doScrape(self):
        foidocsdb = scrape.couch['disclosr-foidocuments']
        (url, mime_type, rcontent) = scrape.fetchURL(scrape.docsdb,
            self.getURL(), "foidocuments", self.getAgencyID())
        content = rcontent
        dochash = scrape.mkhash(content)
        doc = foidocsdb.get(dochash)
        if doc is None:
            print "saving " + dochash
            description = "This log may have updated but as it was not in a table last time we viewed it, we cannot extract what has changed. Please refer to the agency's website Disclosure Log to see the most recent entries"
            diff = ""  # was left undefined when no previous attachment existed
            last_attach = scrape.getLastAttachment(scrape.docsdb, self.getURL())
            if last_attach is not None:
                html_diff = difflib.HtmlDiff()
                diff = html_diff.make_table(last_attach.read().split('\n'),
                    content.split('\n'))
            edate = date.today().strftime("%Y-%m-%d")
            doc = {'_id': dochash, 'agencyID': self.getAgencyID(),
                   'url': self.getURL(), 'docID': dochash,
                   "date": edate, "title": "Disclosure Log Updated",
                   "description": self.remove_control_chars(description), "diff": diff}
            foidocsdb.save(doc)
        else:
            print "already saved"


class GenericPDFDisclogScraper(GenericDisclogScraper):

    def doScrape(self):
        foidocsdb = scrape.couch['disclosr-foidocuments']
        (url, mime_type, content) = scrape.fetchURL(scrape.docsdb,
            self.getURL(), "foidocuments", self.getAgencyID())
        laparams = LAParams()
        rsrcmgr = PDFResourceManager(caching=True)
        outfp = StringIO()
        device = TextConverter(rsrcmgr, outfp, codec='utf-8',
            laparams=laparams)
        fp = StringIO()
        fp.write(content)
        process_pdf(rsrcmgr, device, fp, set(), caching=True,
            check_extractable=True)
        description = outfp.getvalue()
        fp.close()
        device.close()
        outfp.close()
        dochash = scrape.mkhash(description)
        doc = foidocsdb.get(dochash)
        if doc is None:
            print "saving " + dochash
            edate = date.today().strftime("%Y-%m-%d")
            doc = {'_id': dochash, 'agencyID': self.getAgencyID(),
                   'url': self.getURL(), 'docID': dochash,
                   "date": edate, "title": "Disclosure Log Updated",
                   "description": self.remove_control_chars(description)}
            foidocsdb.save(doc)
        else:
            print "already saved"


class GenericDOCXDisclogScraper(GenericDisclogScraper):

    def doScrape(self):
        foidocsdb = scrape.couch['disclosr-foidocuments']
        (url, mime_type, content) = scrape.fetchURL(scrape.docsdb,
            self.getURL(), "foidocuments", self.getAgencyID())
        mydoc = zipfile.ZipFile(StringIO(content))  # was ZipFile(file), which referenced the builtin, not the fetched content
        xmlcontent = mydoc.read('word/document.xml')
        document = etree.fromstring(xmlcontent)
        ## Fetch all the text out of the document we just created
        paratextlist = getdocumenttext(document)
        # Make explicit unicode version
        newparatextlist = []
        for paratext in paratextlist:
            newparatextlist.append(paratext.encode("utf-8"))
        ## Join the document's text with two newlines under each paragraph
        description = '\n\n'.join(newparatextlist).strip(' \t\n\r')
        dochash = scrape.mkhash(description)
        doc = foidocsdb.get(dochash)
        if doc is None:
            print "saving " + dochash
            edate = date.today().strftime("%Y-%m-%d")  # was time().strftime(...), which is not callable
            doc = {'_id': dochash, 'agencyID': self.getAgencyID(),
                   'url': self.getURL(), 'docID': dochash,
                   "date": edate, "title": "Disclosure Log Updated",
                   "description": self.remove_control_chars(description)}
            foidocsdb.save(doc)
        else:
            print "already saved"


class GenericRSSDisclogScraper(GenericDisclogScraper):

    def doScrape(self):
        foidocsdb = scrape.couch['disclosr-foidocuments']
        (url, mime_type, content) = scrape.fetchURL(scrape.docsdb,
            self.getURL(), "foidocuments", self.getAgencyID())
        feed = feedparser.parse(content)
        for entry in feed.entries:
            #print entry
            print entry.id
            dochash = scrape.mkhash(entry.id)
            doc = foidocsdb.get(dochash)
            #print doc
            if doc is None:
                print "saving " + dochash
                edate = datetime.fromtimestamp(
                    mktime(entry.published_parsed)).strftime("%Y-%m-%d")
                doc = {'_id': dochash, 'agencyID': self.getAgencyID(),
                       'url': entry.link, 'docID': entry.id,
                       "date": edate, "title": entry.title}
                self.getDescription(entry, entry, doc)
                foidocsdb.save(doc)
            else:
                print "already saved"

    def getDescription(self, content, entry, doc):
        """ get description from rss entry """
        doc.update({'description': content.summary})
        return


class GenericOAICDisclogScraper(GenericDisclogScraper):
    __metaclass__ = abc.ABCMeta

    @abc.abstractmethod
    def getColumns(self, columns):
        """ rearranges columns if required """
        return

    def getColumnCount(self):
        return 5

    def getDescription(self, content, entry, doc):
        """ get description from table cell """
        descriptiontxt = ""
        for string in content.stripped_strings:
            descriptiontxt = descriptiontxt + " \n" + string
        doc.update({'description': descriptiontxt})

    def getTitle(self, content, entry, doc):
        doc.update({'title': (''.join(content.stripped_strings))})

    def getTable(self, soup):
        return soup.table

    def getRows(self, table):
        return table.find_all('tr')

    def getDate(self, content, entry, doc):
        date = ''.join(content.stripped_strings).strip()
        (a, b, c) = date.partition("(")
        date = self.remove_control_chars(a.replace("Octber", "October").replace("1012", "2012"))
        print date
        edate = parse(date, dayfirst=True, fuzzy=True).strftime("%Y-%m-%d")
        print edate
        doc.update({'date': edate})
        return

    def getLinks(self, content, entry, doc):
        links = []
        for atag in entry.find_all("a"):
            if atag.has_attr('href'):  # has_key is deprecated in bs4
                links.append(scrape.fullurl(content, atag['href']))
        if links != []:
            doc.update({'links': links})
        return

    def doScrape(self):
        foidocsdb = scrape.couch['disclosr-foidocuments']
        (url, mime_type, content) = scrape.fetchURL(scrape.docsdb,
            self.getURL(), "foidocuments", self.getAgencyID())
        if content is not None:
            if mime_type == "text/html" or mime_type == "application/xhtml+xml" or mime_type == "application/xml":
                # http://www.crummy.com/software/BeautifulSoup/documentation.html
                print "parsing"
                soup = BeautifulSoup(content)
                table = self.getTable(soup)
                for row in self.getRows(table):
                    columns = row.find_all('td')
                    if len(columns) == self.getColumnCount():  # was "is", which only compares ints by accident
                        (id, date, title,
                         description, notes) = self.getColumns(columns)
                        print self.remove_control_chars(
                            ''.join(id.stripped_strings))
                        if id.string is None:
                            dochash = scrape.mkhash(
                                self.remove_control_chars(
                                    url + (''.join(date.stripped_strings))))
                        else:
                            dochash = scrape.mkhash(
                                self.remove_control_chars(
                                    url + (''.join(id.stripped_strings))))
                        doc = foidocsdb.get(dochash)
                        if doc is None:
                            print "saving " + dochash
                            doc = {'_id': dochash,
                                   'agencyID': self.getAgencyID(),
                                   'url': self.getURL(),
                                   'docID': (''.join(id.stripped_strings))}
                            self.getLinks(self.getURL(), row, doc)
                            self.getTitle(title, row, doc)
                            self.getDate(date, row, doc)
                            self.getDescription(description, row, doc)
                            if notes is not None:
                                doc.update({'notes': (
                                    ''.join(notes.stripped_strings))})
                            badtitles = ['-', 'Summary of FOI Request',
                                'FOI request(in summary form)',
                                'Summary of FOI request received by the ASC',
                                'Summary of FOI request received by agency/minister',
                                'Description of Documents Requested', 'FOI request',
                                'Description of FOI Request', 'Summary of request',
                                'Description', 'Summary',
                                'Summary of FOIrequest received by agency/minister',
                                'Summary of FOI request received',
                                'Results 1 to 67 of 67']
                            if doc['title'] not in badtitles \
                                    and doc['description'] != '':
                                print "saving"
                                foidocsdb.save(doc)
                        else:
                            print "already saved " + dochash
                    elif len(row.find_all('th')) == self.getColumnCount():
                        print "header row"
                    else:
                        print "ERROR number of columns incorrect"
                        print row
#http://packages.python.org/CouchDB/client.html
import couchdb
import urllib2
from BeautifulSoup import BeautifulSoup
import re
import hashlib
from urlparse import urljoin
import time
import os
import mimetypes
import urllib
import urlparse
import socket


def mkhash(input):
    return hashlib.md5(input).hexdigest().encode("utf-8")


def canonurl(url):
    r"""Return the canonical, ASCII-encoded form of a UTF-8 encoded URL, or ''
    if the URL looks invalid.
    >>> canonurl('\xe2\x9e\xa1.ws')  # tinyarro.ws
    'http://xn--hgi.ws/'
    """
    # strip spaces at the ends and ensure it's prefixed with 'scheme://'
    url = url.strip()
    if not url:
        return ''
    if not urlparse.urlsplit(url).scheme:
        url = 'http://' + url
    # turn it into Unicode
    #try:
    #    url = unicode(url, 'utf-8')
    #except UnicodeDecodeError:
    #    return ''  # bad UTF-8 chars in URL
    # parse the URL into its components
    parsed = urlparse.urlsplit(url)
    scheme, netloc, path, query, fragment = parsed
    # ensure scheme is a letter followed by letters, digits, and '+-.' chars
    if not re.match(r'[a-z][-+.a-z0-9]*$', scheme, flags=re.I):
        return ''
    scheme = str(scheme)
    # ensure domain and port are valid, eg: sub.domain.<1-to-6-TLD-chars>[:port]
    match = re.match(r'(.+\.[a-z0-9]{1,6})(:\d{1,5})?$', netloc, flags=re.I)
    if not match:
        return ''
    domain, port = match.groups()
    netloc = domain + (port if port else '')
    netloc = netloc.encode('idna')
    # ensure path is valid and convert Unicode chars to %-encoded
    if not path:
        path = '/'  # eg: 'http://google.com' -> 'http://google.com/'
    path = urllib.quote(urllib.unquote(path.encode('utf-8')), safe='/;')
    # ensure query is valid
    query = urllib.quote(urllib.unquote(query.encode('utf-8')), safe='=&?/')
    # ensure fragment is valid
    fragment = urllib.quote(urllib.unquote(fragment.encode('utf-8')))
    # piece it all back together, truncating it to a maximum of 4KB
    url = urlparse.urlunsplit((scheme, netloc, path, query, fragment))
    return url[:4096]


def fullurl(url, href):
    href = href.replace(" ", "%20")
    href = re.sub('#.*$', '', href)
    return urljoin(url, href)


#http://diveintopython.org/http_web_services/etags.html
class NotModifiedHandler(urllib2.BaseHandler):

    def http_error_304(self, req, fp, code, message, headers):
        addinfourl = urllib2.addinfourl(fp, headers, req.get_full_url())
        addinfourl.code = code
        return addinfourl


def getLastAttachment(docsdb, url):
    hash = mkhash(url)
    doc = docsdb.get(hash)
    if doc is not None:
        last_attachment_fname = doc["_attachments"].keys()[-1]
        last_attachment = docsdb.get_attachment(doc, last_attachment_fname)
        return last_attachment
    else:
        return None


def fetchURL(docsdb, url, fieldName, agencyID, scrape_again=True):
    url = canonurl(url)
    hash = mkhash(url)
    print "Fetching %s (%s)" % (url, hash)
    # validate before building the request
    if url is None or url == "" or url.startswith("mailto") \
            or url.startswith("javascript") or url.startswith("#"):
        print "Not a valid HTTP url"
        return (None, None, None)
    req = urllib2.Request(url)
    doc = docsdb.get(hash)
    if doc is None:
        doc = {'_id': hash, 'agencyID': agencyID, 'url': url, 'fieldName': fieldName, 'type': 'website'}
    else:
        # don't re-fetch within two weeks (time.time() is in seconds; the original mixed units)
        if ('page_scraped' in doc) and (time.time() - doc['page_scraped']) < 60 * 60 * 24 * 14:
            print "Uh oh, trying to scrape URL again too soon!" + hash
            last_attachment_fname = doc["_attachments"].keys()[-1]
            last_attachment = docsdb.get_attachment(doc, last_attachment_fname)
            return (doc['url'], doc['mime_type'], last_attachment.read())
        if scrape_again == False:
            print "Not scraping this URL again as requested"
            # was returning an undefined "content"; reuse the stored attachment instead
            last_attachment_fname = doc["_attachments"].keys()[-1]
            last_attachment = docsdb.get_attachment(doc, last_attachment_fname)
            return (doc['url'], doc['mime_type'], last_attachment.read())
    req.add_header("User-Agent", "Mozilla/4.0 (compatible; Prometheus webspider; owner maxious@lambdacomplex.org)")
    # if there is a previous version stored in couchdb, load caching helper tags
    if 'etag' in doc:
        req.add_header("If-None-Match", doc['etag'])
    if 'last_modified' in doc:
        req.add_header("If-Modified-Since", doc['last_modified'])
    opener = urllib2.build_opener(NotModifiedHandler())
    try:
        url_handle = opener.open(req, None, 20)
        doc['url'] = url_handle.geturl()  # may have followed a redirect to a new url
        headers = url_handle.info()  # the addinfourls have the .info() too
        doc['etag'] = headers.getheader("ETag")
        doc['last_modified'] = headers.getheader("Last-Modified")
        doc['date'] = headers.getheader("Date")
        doc['page_scraped'] = time.time()
        doc['web_server'] = headers.getheader("Server")
        doc['via'] = headers.getheader("Via")
        doc['powered_by'] = headers.getheader("X-Powered-By")
        doc['file_size'] = headers.getheader("Content-Length")
        content_type = headers.getheader("Content-Type")
        if content_type is not None:
            doc['mime_type'] = content_type.split(";")[0]
        else:
            (type, encoding) = mimetypes.guess_type(url)
            doc['mime_type'] = type
        if hasattr(url_handle, 'code') and url_handle.code == 304:
            print "the web page has not been modified" + hash
            last_attachment_fname = doc["_attachments"].keys()[-1]
            last_attachment = docsdb.get_attachment(doc, last_attachment_fname)
            return (doc['url'], doc['mime_type'], last_attachment.read())
        else:
            print "new webpage loaded"
            content = url_handle.read()
            docsdb.save(doc)
            doc = docsdb.get(hash)  # need to get a _rev
            # store the page as an attachment named epoch-filename
            docsdb.put_attachment(doc, content, str(time.time()) + "-" + os.path.basename(url), doc['mime_type'])
            return (doc['url'], doc['mime_type'], content)
    except (urllib2.URLError, socket.timeout) as e:
        print "error!"
        error = ""
        if hasattr(e, 'reason'):
            error = "error %s in downloading %s" % (str(e.reason), url)
        elif hasattr(e, 'code'):
            error = "error %s in downloading %s" % (e.code, url)
        print error
        doc['error'] = error
        docsdb.save(doc)
        return (None, None, None)


def scrapeAndStore(docsdb, url, depth, fieldName, agencyID):
    (url, mime_type, content) = fetchURL(docsdb, url, fieldName, agencyID)
    badURLs = ["http://www.ausport.gov.au/supporting/funding/grants_and_scholarships/grant_funding_report"]
    if content is not None and depth > 0 and url not in badURLs:
        if mime_type == "text/html" or mime_type == "application/xhtml+xml" or mime_type == "application/xml":
            # http://www.crummy.com/software/BeautifulSoup/documentation.html
            soup = BeautifulSoup(content)
            navIDs = soup.findAll(
                id=re.compile('nav|Nav|menu|bar|left|right|sidebar|more-links|breadcrumb|footer|header'))
            for nav in navIDs:
                print "Removing element", nav['id']
                nav.extract()
            navClasses = soup.findAll(
                attrs={'class': re.compile('nav|menu|bar|left|right|sidebar|more-links|breadcrumb|footer|header')})
            for nav in navClasses:
                print "Removing element", nav['class']
                nav.extract()
            links = soup.findAll('a')  # soup.findAll('a', id=re.compile("^p-"))
            linkurls = set([])
            for link in links:
                if link.has_key("href"):
                    # elif chain so external/mailto/javascript links are actually skipped
                    if link['href'].startswith("http"):
                        # lets not do external links for now
                        pass
                    elif link['href'].startswith("mailto"):
                        # not http
                        pass
                    elif link['href'].startswith("javascript"):
                        # not http
                        pass
                    else:
                        # remove anchors and spaces in urls
                        linkurls.add(fullurl(url, link['href']))
            for linkurl in linkurls:
                #print linkurl
                scrapeAndStore(docsdb, linkurl, depth - 1, fieldName, agencyID)


#couch = couchdb.Server('http://192.168.1.148:5984/')
#couch = couchdb.Server('http://192.168.1.113:5984/')
couch = couchdb.Server('http://127.0.0.1:5984/')

# select database
agencydb = couch['disclosr-agencies']
docsdb = couch['disclosr-documents']

if __name__ == "__main__":
    for row in agencydb.view('app/all'):  # not recently scraped agencies view?
        agency = agencydb.get(row.id)
        print agency['name']
        for key in agency.keys():
            if key == "FOIDocumentsURL" and "status" not in agency.keys() and False:  # disabled
                scrapeAndStore(docsdb, agency[key], 0, key, agency['_id'])
            if key == 'website' and True:
                scrapeAndStore(docsdb, agency[key], 0, key, agency['_id'])
                if "metadata" not in agency.keys():
                    agency['metadata'] = {}
                agency['metadata']['lastScraped'] = time.time()
            if key.endswith('URL') and False:  # disabled
                print key
                depth = 1
                if 'scrapeDepth' in agency.keys():
                    depth = agency['scrapeDepth']
                scrapeAndStore(docsdb, agency[key], depth, key, agency['_id'])
        agencydb.save(agency)
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__) or '.', '../'))
import genericScrapers
import traceback
try:
    import amonpy
    amonpy.config.address = 'http://amon_instance:port'
    amonpy.config.secret_key = 'the secret key from /etc/amon.conf'
    amon_available = True
except ImportError:
    amon_available = False


class ScraperImplementation(genericScrapers.GenericPDFDisclogScraper):

    def __init__(self):
        super(ScraperImplementation, self).__init__()


if __name__ == '__main__':
    print 'Subclass:', issubclass(ScraperImplementation,
        genericScrapers.GenericPDFDisclogScraper)
    print 'Instance:', isinstance(ScraperImplementation(),
        genericScrapers.GenericPDFDisclogScraper)
    try:
        ScraperImplementation().doScrape()
    except Exception, err:
        sys.stderr.write('ERROR: %s\n' % str(err))
        print "Error Reason: ", err.__doc__
        print "Exception: ", err.__class__
        print traceback.format_exc()
        if amon_available:
            data = {
                'exception_class': '',
                'url': '',
                'backtrace': ['exception line ', 'another exception line'],
                'enviroment': '',
                # In 'data' you can add request information, session variables - it's a recursive
                # dictionary, so you can literally add everything important for your specific case
                # The dictionary doesn't have a specified structure, the keys below are only an example
                'data': {'request': '', 'session': '', 'more': ''}
            }
            amonpy.exception(data)
        pass
import sys, os
sys.path.insert(0, os.path.join(os.path.dirname(__file__) or '.', '../'))
import genericScrapers
import scrape
from bs4 import BeautifulSoup


#http://www.doughellmann.com/PyMOTW/abc/
class ScraperImplementation(genericScrapers.GenericOAICDisclogScraper):

    def getTable(self, soup):
        return soup.find(class_="article-content").table  # was find(_class=...), a typo for the bs4 class_ keyword

    def getColumnCount(self):
        return 5

    def getColumns(self, columns):
        (id, title, date, description, notes) = columns
        return (id, date, title, description, notes)


if __name__ == '__main__':
    print 'Subclass:', issubclass(ScraperImplementation, genericScrapers.GenericOAICDisclogScraper)
    print 'Instance:', isinstance(ScraperImplementation(), genericScrapers.GenericOAICDisclogScraper)
    ScraperImplementation().doScrape()
import sys, os
sys.path.insert(0, os.path.join(os.path.dirname(__file__) or '.', '../'))
import genericScrapers
import scrape
from bs4 import BeautifulSoup
import dateutil
from dateutil.parser import *
from datetime import *


#http://www.doughellmann.com/PyMOTW/abc/
class ScraperImplementation(genericScrapers.GenericOAICDisclogScraper):

    def getColumnCount(self):
        return 3

    def getColumns(self, columns):
        (date, title, description) = columns
        return (date, date, title, description, None)

    def getTitle(self, content, entry, doc):
        # the title is built from the first two strings in the cell
        i = 0
        title = ""
        for string in content.stripped_strings:
            if i < 2:
                title = title + string
            i = i + 1
        title = self.remove_control_chars(title)
        doc.update({'title': title})
        print title
        return


if __name__ == '__main__':
    print 'Subclass:', issubclass(ScraperImplementation, genericScrapers.GenericOAICDisclogScraper)
    print 'Instance:', isinstance(ScraperImplementation(), genericScrapers.GenericOAICDisclogScraper)
    ScraperImplementation().doScrape()
import sys,os | import sys,os |
sys.path.insert(0, os.path.join(os.path.dirname(__file__) or '.', '../')) | sys.path.insert(0, os.path.join(os.path.dirname(__file__) or '.', '../')) |
import genericScrapers | import genericScrapers |
#RSS feed not detailed | #RSS feed not detailed |
#http://www.doughellmann.com/PyMOTW/abc/ | #http://www.doughellmann.com/PyMOTW/abc/ |
class ScraperImplementation(genericScrapers.GenericRSSDisclogScraper): | class ScraperImplementation(genericScrapers.GenericRSSDisclogScraper): |
def getColumns(self,columns): | def getColumns(self,columns): |
(id, date, title, description, notes) = columns | (id, date, title, description, notes) = columns |
return (id, date, title, description, notes) | return (id, date, title, description, notes) |
if __name__ == '__main__': | if __name__ == '__main__': |
print 'Subclass:', issubclass(ScraperImplementation, genericScrapers.GenericRSSDisclogScraper) | print 'Subclass:', issubclass(ScraperImplementation, genericScrapers.GenericRSSDisclogScraper) |
print 'Instance:', isinstance(ScraperImplementation(), genericScrapers.GenericRSSDisclogScraper) | print 'Instance:', isinstance(ScraperImplementation(), genericScrapers.GenericRSSDisclogScraper) |
ScraperImplementation().doScrape() | ScraperImplementation().doScrape() |
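# For reference, a generic RSS disclog scraper can be as small as the sketch
# below (hypothetical, using the feedparser library; the real logic lives in
# genericScrapers.GenericRSSDisclogScraper):
#
#   import feedparser
#
#   def scrape_rss(url):
#       feed = feedparser.parse(url)
#       for entry in feed.entries:
#           yield (entry.id, entry.published, entry.title, entry.summary, None)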
# Feed: www.finance.gov.au/foi/disclosure-log/foi-rss.xml
<?php | <?php |
include_once('include/common.inc.php'); | include_once('include/common.inc.php'); |
// Render a single agency field as a read-only table row ("view" mode) or as
// a form input ("edit" mode).
function displayValue($key, $value, $mode) {
    global $db, $schemas;
    $ignoreKeys = Array("metadata", "metaTags", "statistics", "rtkURLs", "rtkDescriptions");
if ($mode == "view") { | if ($mode == "view") { |
if (strpos($key, "_") === 0 || in_array($key,$ignoreKeys)) | if (strpos($key, "_") === 0 || in_array($key,$ignoreKeys)) |
return; | return; |
echo "<tr>"; | echo "<tr>"; |
echo "<td class='$key'>"; | echo "<td class='$key'>"; |
if (isset($schemas['agency']["properties"][$key])) { | if (isset($schemas['agency']["properties"][$key])) { |
echo $schemas['agency']["properties"][$key]['x-title'] . "<br><small>" . $schemas['agency']["properties"][$key]['description'] . "</small>"; | echo $schemas['agency']["properties"][$key]['x-title'] . "<br><small>" . $schemas['agency']["properties"][$key]['description'] . "</small>"; |
} | } |
echo "</td><td>"; | echo "</td><td>"; |
if (is_array($value)) { | if (is_array($value)) { |
echo "<ol>"; | echo "<ol>"; |
foreach ($value as $subkey => $subvalue) { | foreach ($value as $subkey => $subvalue) { |
echo "<li "; | echo "<li "; |
if (isset($schemas['agency']["properties"][$key]['x-property'])) { | if (isset($schemas['agency']["properties"][$key]['x-property'])) { |
echo ' property="' . $schemas['agency']["properties"][$key]['x-property'] . '" '; | echo ' property="' . $schemas['agency']["properties"][$key]['x-property'] . '" '; |
} if (isset($schemas['agency']["properties"][$key]['x-itemprop'])) { | } if (isset($schemas['agency']["properties"][$key]['x-itemprop'])) { |
echo ' itemprop="' . $schemas['agency']["properties"][$key]['x-itemprop'] . '" '; | echo ' itemprop="' . $schemas['agency']["properties"][$key]['x-itemprop'] . '" '; |
} | } |
echo " >"; | echo " >"; |
echo "$subvalue</li>"; | echo "$subvalue</li>"; |
} | } |
echo "</ol></td></tr>"; | echo "</ol></td></tr>"; |
} else { | } else { |
if (isset($schemas['agency']["properties"][$key]['x-property'])) { | if (isset($schemas['agency']["properties"][$key]['x-property'])) { |
echo '<span property="' . $schemas['agency']["properties"][$key]['x-property'] . '">'; | echo '<span property="' . $schemas['agency']["properties"][$key]['x-property'] . '">'; |
} else { | } else { |
echo "<span>"; | echo "<span>"; |
} | } |
if ((strpos($key, "URL") > 0 || $key == 'website') && $value != "") { | if ((strpos($key, "URL") > 0 || $key == 'website') && $value != "") { |
echo "<a " . ($key == 'website' ? 'itemprop="url"' : '') . " href='$value'>$value</a>"; | echo "<a " . ($key == 'website' ? 'itemprop="url"' : '') . " href='$value'>$value</a>"; |
} else if ($key == 'abn') { | } else if ($key == 'abn') { |
echo "<a href='http://www.abr.business.gov.au/SearchByAbn.aspx?SearchText=$value'>$value</a>"; | echo "<a href='http://www.abr.business.gov.au/SearchByAbn.aspx?SearchText=$value'>$value</a>"; |
} else { | } else { |
echo "$value"; | echo "$value"; |
} | } |
echo "</span>"; | echo "</span>"; |
} | } |
echo "</td></tr>"; | echo "</td></tr>"; |
} | } |
if ($mode == "edit") { | if ($mode == "edit") { |
if (is_array($value)) { | if (is_array($value)) { |
echo '<div class="row"> | echo '<div class="row"> |
<div class="seven columns"> | <div class="seven columns"> |
<fieldset> | <fieldset> |
<h5>' . $key . '</h5>'; | <h5>' . $key . '</h5>'; |
foreach ($value as $subkey => $subvalue) { | foreach ($value as $subkey => $subvalue) { |
echo "<label>$subkey</label><input class='input-text' type='text' id='$key$subkey' name='$key" . '[' . $subkey . "]' value='$subvalue'/></tr>"; | echo "<label>$subkey</label><input class='input-text' type='text' id='$key$subkey' name='$key" . '[' . $subkey . "]' value='$subvalue'/></tr>"; |
} | } |
echo "</fieldset> | echo "</fieldset> |
</div> | </div> |
</div>"; | </div>"; |
} else { | } else { |
if (strpos($key, "_") === 0) {
    echo "<input type='hidden' id='$key' name='$key' value='$value'/>";
} else if ($key == "parentOrg") { | } else if ($key == "parentOrg") { |
echo "<label for='$key'>$key</label><select id='$key' name='$key'><option value=''> Select... </option>"; | echo "<label for='$key'>$key</label><select id='$key' name='$key'><option value=''> Select... </option>"; |
$rows = $db->get_view("app", "byDeptStateName")->rows; | $rows = $db->get_view("app", "byDeptStateName")->rows; |
//print_r($rows); | //print_r($rows); |
foreach ($rows as $row) { | foreach ($rows as $row) { |
echo "<option value='{$row->value}'" . (($row->value == $value) ? "SELECTED" : "") . " >" . str_replace("Department of ", "", $row->key) . "</option>"; | echo "<option value='{$row->value}'" . (($row->value == $value) ? "SELECTED" : "") . " >" . str_replace("Department of ", "", $row->key) . "</option>"; |
} | } |
echo" </select>"; | echo" </select>"; |
} else { | } else { |
echo "<label>$key</label><input class='input-text' type='text' id='$key' name='$key' value='$value'/>"; | echo "<label>$key</label><input class='input-text' type='text' id='$key' name='$key' value='$value'/>"; |
if ((strpos($key, "URL") > 0 || $key == 'website') && $value != "") { | if ((strpos($key, "URL") > 0 || $key == 'website') && $value != "") { |
echo "<a " . ($key == 'website' ? 'itemprop="url"' : '') . " href='$value'>view</a>"; | echo "<a " . ($key == 'website' ? 'itemprop="url"' : '') . " href='$value'>view</a>"; |
} | } |
if ($key == 'abn') { | if ($key == 'abn') { |
echo "<a href='http://www.abr.business.gov.au/SearchByAbn.aspx?SearchText=$value'>view abn</a>"; | echo "<a href='http://www.abr.business.gov.au/SearchByAbn.aspx?SearchText=$value'>view abn</a>"; |
} | } |
} | } |
} | } |
} | } |
} | } |
// Pad an agency record with every schema-defined property (plus a few blank
// array slots) so the edit form renders inputs for values not yet set.
function addDefaultFields($row) {
global $schemas; | global $schemas; |
$defaultFields = array_keys($schemas['agency']['properties']); | $defaultFields = array_keys($schemas['agency']['properties']); |
foreach ($defaultFields as $defaultField) { | foreach ($defaultFields as $defaultField) { |
if (!isset($row[$defaultField])) { | if (!isset($row[$defaultField])) { |
if ($schemas['agency']['properties'][$defaultField]['type'] == "string") { | if ($schemas['agency']['properties'][$defaultField]['type'] == "string") { |
$row[$defaultField] = ""; | $row[$defaultField] = ""; |
} | } |
if ($schemas['agency']['properties'][$defaultField]['type'] == "array") { | if ($schemas['agency']['properties'][$defaultField]['type'] == "array") { |
$row[$defaultField] = Array(""); | $row[$defaultField] = Array(""); |
} | } |
} else if ($schemas['agency']['properties'][$defaultField]['type'] == "array") { | } else if ($schemas['agency']['properties'][$defaultField]['type'] == "array") { |
if (is_array($row[$defaultField])) { | if (is_array($row[$defaultField])) { |
$row[$defaultField][] = ""; | $row[$defaultField][] = ""; |
$row[$defaultField][] = ""; | $row[$defaultField][] = ""; |
$row[$defaultField][] = ""; | $row[$defaultField][] = ""; |
} else { | } else { |
$value = $row[$defaultField]; | $value = $row[$defaultField]; |
$row[$defaultField] = Array($value); | $row[$defaultField] = Array($value); |
$row[$defaultField][] = ""; | $row[$defaultField][] = ""; |
$row[$defaultField][] = ""; | $row[$defaultField][] = ""; |
} | } |
} | } |
} | } |
return $row; | return $row; |
} | } |
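// Example (hypothetical field names; the real set comes from the agency
// schema): addDefaultFields(Array("name" => "Treasury")) returns a row where
// every string property is at least "" and every array property gains blank
// slots, e.g. $row["website"] === "" and $row["otherNames"] ends with "", "", "".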
$db = $server->get_db('disclosr-agencies'); | $db = $server->get_db('disclosr-agencies'); |
if (isset($_REQUEST['id'])) {
    // Fetch one agency record (by CouchDB document id) and render it as HTML.
    // The views also support search by name or ABN.
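    // A name prefix search works because CouchDB orders view keys as strings
    // and \ufff0 sorts after every printable character, so startkey/endkey
    // bracket all names beginning with the prefix. Illustrative raw query
    // (not code from this project):
    //   GET /disclosr-agencies/_design/app/_view/byCanonicalName?startkey="Ham"&endkey="Ham\ufff0"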
$obj = $db->get($_REQUEST['id']); | $obj = $db->get($_REQUEST['id']); |
include_header(isset($obj->name) ? $obj->name : ""); | include_header(isset($obj->name) ? $obj->name : ""); |
//print_r($row); | //print_r($row); |
if (sizeof($_POST) > 0) { | if (sizeof($_POST) > 0) { |
//print_r($_POST); | //print_r($_POST); |
foreach ($_POST as $postkey => $postvalue) { | foreach ($_POST as $postkey => $postvalue) { |
if ($postvalue == "") { | if ($postvalue == "") { |
unset($_POST[$postkey]); | unset($_POST[$postkey]); |
} | } |
if (is_array($postvalue)) { | if (is_array($postvalue)) { |
if (count($postvalue) == 1 && $postvalue[0] == "") { | if (count($postvalue) == 1 && $postvalue[0] == "") { |
unset($_POST[$postkey]); | unset($_POST[$postkey]); |
} else { | } else { |
foreach ($_POST[$postkey] as $key => $value) {
if ($value == "") { | if ($value == "") { |
unset($_POST[$postkey][$key]); | unset($_POST[$postkey][$key]); |
} | } |
} | } |
} | } |
} | } |
} | } |
if (isset($_POST['_id']) && $db->get_rev($_POST['_id']) == $_POST['_rev']) {
    // optimistic concurrency: only save if the edit was based on the latest revision
    echo "Edited version was based on the latest revision; saving.";
    $newdoc = $_POST;
    $newdoc['metadata']['lastModified'] = time();
    $obj = $db->save($newdoc);
} else {
    echo "Alert: this document was revised by someone else while you were editing. Changes not saved.";
}
} | } |
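// CouchDB enforces this itself too: a write quoting a stale _rev is rejected
// with HTTP 409 Conflict. Illustrative exchange (not code from this project):
//   PUT /disclosr-agencies/<id>  body {"_rev": "1-abc", ...}
//   -> 409 Conflict if the current revision is 2-def; re-fetch and re-apply edits.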
$mode = "view"; // editing is currently disabled; set to "edit" to render the form
$rowArray = object_to_array($obj); | $rowArray = object_to_array($obj); |
ksort($rowArray); | ksort($rowArray); |
if ($mode == "edit") { | if ($mode == "edit") { |
$row = addDefaultFields($rowArray); | $row = addDefaultFields($rowArray); |
} else { | } else { |
$row = $rowArray; | $row = $rowArray; |
} | } |
if ($mode == "view") { | if ($mode == "view") { |
echo ' <div class="container-fluid"> | echo ' <div class="container-fluid"> |
<div class="row-fluid"> | <div class="row-fluid"> |
<div class="span3"> | <div class="span3"> |
<div class="well sidebar-nav"> | <div class="well sidebar-nav"> |
<ul class="nav nav-list"> | <ul class="nav nav-list"> |
<li class="nav-header">Statistics</li>'; | <li class="nav-header">Statistics</li>'; |
if (isset($row['statistics']['employees'])) { | if (isset($row['statistics']['employees'])) { |
echo '<div><i class="icon-user" style="float:left"></i><p style="margin-left:16px;">'; | echo '<div><i class="icon-user" style="float:left"></i><p style="margin-left:16px;">'; |
$keys = array_keys($row['statistics']['employees']); | $keys = array_keys($row['statistics']['employees']); |
$lastkey = $keys[count($keys)-1]; | $lastkey = $keys[count($keys)-1]; |
echo $row['statistics']['employees'][$lastkey]['value'].' employees <small>('.$lastkey.')</small>'; | echo $row['statistics']['employees'][$lastkey]['value'].' employees <small>('.$lastkey.')</small>'; |
echo '</p></div>';
} | } |
if (isset($row['statistics']['budget'])) { | if (isset($row['statistics']['budget'])) { |
echo '<div><i class="icon-shopping-cart" style="float:left"></i><p style="margin-left:16px;">'; | echo '<div><i class="icon-shopping-cart" style="float:left"></i><p style="margin-left:16px;">'; |
$keys = array_keys($row['statistics']['budget']); | $keys = array_keys($row['statistics']['budget']); |
$lastkey = $keys[count($keys)-1]; | $lastkey = $keys[count($keys)-1]; |
// number_format is locale-independent, unlike money_format
echo "$" . number_format(floatval($row['statistics']['budget'][$lastkey]['value'])) . ' <small>(' . $lastkey . ' budget)</small>';
echo '</p></div>';
} | } |
echo ' </ul> | echo ' </ul> |
</div><!--/.well --> | </div><!--/.well --> |
</div><!--/span--> | </div><!--/span--> |
<div class="span9">'; | <div class="span9">'; |
echo '<div itemscope itemtype="http://schema.org/GovernmentOrganization" typeof="schema:GovernmentOrganization" about="#' . $row['_id'] . '">'; | echo '<div itemscope itemtype="http://schema.org/GovernmentOrganization" typeof="schema:GovernmentOrganization" about="#' . $row['_id'] . '">'; |
echo '<div class="hero-unit"> | echo '<div class="hero-unit"> |
<h1 itemprop="name">' . $row['name'] . '</h1>'; | <h1 itemprop="name">' . $row['name'] . '</h1>'; |
if (isset($row['description'])) { | if (isset($row['description'])) { |
echo '<p>'.$row['description'].'</p>'; | echo '<p>'.$row['description'].'</p>'; |
} | } |
echo '</div><table width="100%">'; | echo '</div><table width="100%">'; |
echo "<tr><th>Field Name</th><th>Field Value</th></tr>"; | echo "<tr><th>Field Name</th><th>Field Value</th></tr>"; |
} | } |
if ($mode == "edit") { | if ($mode == "edit") { |
?> | ?> |
<input id="addfield" type="button" value="Add Field"/> | <input id="addfield" type="button" value="Add Field"/> |
<script> | <script> |
$(document).ready(function() {
    // dynamically add form fields:
    // http://charlie.griefer.com/blog/2009/09/17/jquery-dynamically-adding-form-elements/
$('#addfield').click(function() { | $('#addfield').click(function() { |
var field_name = window.prompt("Field name?", "");
if (field_name) { // null (cancelled prompt) and "" are both falsy
$('#submitbutton').before($('<span></span>') | $('#submitbutton').before($('<span></span>') |
.append("<label>"+field_name+"</label>") | .append("<label>"+field_name+"</label>") |
.append("<input class='input-text' type='text' id='"+field_name+"' name='"+field_name+"'/>") | .append("<input class='input-text' type='text' id='"+field_name+"' name='"+field_name+"'/>") |
); | ); |
} | } |
    });
});
</script> | </script> |
<form id="editform" class="nice" method="post"> | <form id="editform" class="nice" method="post"> |
<?php | <?php |
} | } |
foreach ($row as $key => $value) { | foreach ($row as $key => $value) { |
displayValue($key, $value, $mode); // displayValue echoes directly and returns nothing
} | } |
if ($mode == "view") { | if ($mode == "view") { |
echo "</table></div>"; | echo "</table></div>"; |
echo ' </div><!--/span--> | echo ' </div><!--/span--> |
</div><!--/row--> | </div><!--/row--> |
</div><!--/span--> | </div><!--/span--> |
</div><!--/row-->'; | </div><!--/row-->'; |
} | } |
if ($mode == "edit") { | if ($mode == "edit") { |
echo '<input id="submitbutton" type="submit"/></form>'; | echo '<input id="submitbutton" type="submit"/></form>'; |
} | } |
} else { | } else { |
// show all list | // show all list |
include_header('Agencies'); | include_header('Agencies'); |
echo ' <div class="container-fluid"> | echo ' <div class="container-fluid"> |
<div class="row-fluid"> | <div class="row-fluid"> |
<div class="span3"> | <div class="span3"> |
<div class="well sidebar-nav"> | <div class="well sidebar-nav"> |
<ul class="nav nav-list"> | <ul class="nav nav-list"> |
<li class="nav-header">Sidebar</li>'; | <li class="nav-header">Sidebar</li>'; |
echo ' </ul> | echo ' </ul> |
</div><!--/.well --> | </div><!--/.well --> |
</div><!--/span--> | </div><!--/span--> |
<div class="span9"> | <div class="span9"> |
<div class="hero-unit"> | <div class="hero-unit"> |
<h1>Australian Government Agencies</h1>
<p>Explore the collected information about Australian Government agencies below.</p>
</div> | </div> |
<div class="row-fluid"> | <div class="row-fluid"> |
<div class="span4">'; | <div class="span4">'; |
try { | try { |
$rows = $db->get_view("app", "byCanonicalName")->rows; | $rows = $db->get_view("app", "byCanonicalName")->rows; |
//print_r($rows); | //print_r($rows); |
$rowCount = count($rows); | $rowCount = count($rows); |
foreach ($rows as $i => $row) { | foreach ($rows as $i => $row) { |
// break the alphabetical list into three roughly equal columns
if ($i % ($rowCount / 3) == 0 && $i != 0 && $i != $rowCount - 2) echo '</div><div class="span4">';
// print_r($row); | // print_r($row); |
echo '<span itemscope itemtype="http://schema.org/GovernmentOrganization" typeof="schema:GovernmentOrganization foaf:Organization" about="getAgency.php?id=' . $row->value->_id . '"> | echo '<span itemscope itemtype="http://schema.org/GovernmentOrganization" typeof="schema:GovernmentOrganization foaf:Organization" about="getAgency.php?id=' . $row->value->_id . '"> |
<a href="getAgency.php?id=' . $row->value->_id . '" rel="schema:url foaf:page" property="schema:name foaf:name" itemprop="url"><span itemprop="name">' . | <a href="getAgency.php?id=' . $row->value->_id . '" rel="schema:url foaf:page" property="schema:name foaf:name" itemprop="url"><span itemprop="name">' . |
(isset($row->value->name) ? $row->value->name : "ERROR NAME MISSING") | (isset($row->value->name) ? $row->value->name : "ERROR NAME MISSING") |
. '</span></a></span><br><br>'; | . '</span></a></span><br><br>'; |
} | } |
} catch (SetteeRestClientException $e) { | } catch (SetteeRestClientException $e) { |
setteErrorHandler($e); | setteErrorHandler($e); |
} | } |
echo ' </div><!--/span--> | echo ' </div><!--/span--> |
</div><!--/row--> | </div><!--/row--> |
</div><!--/span--> | </div><!--/span--> |
</div><!--/row-->'; | </div><!--/row-->'; |
} | } |
include_footer(); | include_footer(); |
?> | ?> |
<!DOCTYPE html> | |
<html xmlns="http://www.w3.org/1999/xhtml"> | |
<head> | |
<meta charset="UTF-8"/> | |
<title>Minimal BubbleTree Demo</title> | |
<script type="text/javascript" src="http://code.jquery.com/jquery-1.7.2.js"></script> | |
<script type="text/javascript" src="js/bubbletree/lib/jquery.history.js"></script> | |
<script type="text/javascript" src="js/bubbletree/lib/raphael.js"></script> | |
<script type="text/javascript" src="js/bubbletree/lib/vis4.js"></script> | |
<script type="text/javascript" src="js/bubbletree/lib/Tween.js"></script> | |
<script type="text/javascript" src="js/bubbletree/build/bubbletree.js"></script> | |
<link rel="stylesheet" type="text/css" href="js/bubbletree/build/bubbletree.css" /> | |
<script type="text/javascript" src="js/bubbletree/styles/cofog.js"></script> | |
<script type="text/javascript"> | |
$(function() { | |
<?php | |
include_once('include/common.inc.php'); | |
include("lib/Color.php"); | |
$color = new Lux_Color(); | |
$portfolios = Array(); | |
$total = 0; | |
$db = $server->get_db('disclosr-agencies'); | |
try { | |
$rows = $db->get_view("app", "byDeptStateName", null, true)->rows; | |
foreach ($rows as $row) { | |
$portfolios[trim(str_replace(Array("Department of", "Department", "the", "'", "`"), "", $row->key))] = $row->value; | |
} | |
} catch (SetteeRestClientException $e) { | |
setteErrorHandler($e); | |
} | |
$agencies = Array(); | |
try { | |
$rows = $db->get_view("app", "byCanonicalName", null, true)->rows; | |
//print_r($rows); | |
foreach ($rows as $row) { | |
$employees = 0; | |
$portfolioid = 0; | |
if (isset($row->value->employees)) { | |
$employees = $row->value->employees; | |
} | |
if (isset($row->value->statistics->employees)) { | |
$agencyEmployeesArray = object_to_array($row->value->statistics->employees); | |
if (isset($agencyEmployeesArray["2010-2011"]["value"])) { | |
$employees = $agencyEmployeesArray["2010-2011"]["value"]; | |
} else { | |
// skip agencies that are closed for business (no current employee count)
continue; | |
} | |
} | |
if (!($employees > 0)) { | |
$employees = 0; | |
} | |
if (isset($row->value->parentOrg)) { | |
$portfolioid = $row->value->parentOrg; | |
} | |
if (isset($row->value->orgType) && $row->value->orgType == "FMA-DepartmentOfState") { | |
$portfolioid = $row->id; | |
} | |
$agencies[$portfolioid][$row->value->name] = $employees; | |
} | |
} catch (SetteeRestClientException $e) { | |
setteErrorHandler($e); | |
} | |
//print_r($portfolios); | |
//print_r($agencies); | |
// http://martin.ankerl.com/2009/12/09/how-to-create-random-colors-programmatically/ | |
$golden_ratio_conjugate = 0.618033988749895; | |
$h = rand(0, 10) / 10; # random start hue
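# Stepping the hue by the golden-ratio conjugate (mod 1) spreads successive
# hues almost evenly around the colour wheel, so portfolio colours stay
# visually distinct: e.g. from h = 0.10 the sequence runs 0.72, 0.34, 0.95, ...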
$nodes = Array();
foreach ($portfolios as $portfolioName => $portfolioID) {
$h += $golden_ratio_conjugate; | |
$h = fmod($h,1); | |
$portfolioColor = $color->hsv2hex(Array($h, .3, .99)); | |
$subnodes = Array(); | |
$portfolioEmployees = 0; | |
if (!isset($agencies[$portfolioID])) continue; // portfolio with no listed agencies
foreach ($agencies[$portfolioID] as $agencyName => $agencyEmployees) {
$agencyColor = $color->hsv2hex(Array($h / 10, rand(1, 10) / 10, abs(($h * (1 / 10)) - .5) + .5)); | |
$subnodes[] = Array( | |
"label" => str_replace(Array("'", "`"), "", $agencyName), | |
"amount" => $agencyEmployees, | |
//"color" => "#" . $agencyColor | |
); | |
$portfolioEmployees += $agencyEmployees; | |
} | |
$nodes[] = Array( | |
"label" => $portfolioName, | |
"amount" => $portfolioEmployees, | |
//"color" => "#" . $portfolioColor, | |
"children" => $subnodes | |
); | |
$total += $portfolioEmployees; | |
} | |
$data = Array( | |
"label" => "Australian Federal Government", | |
"amount" => $total, | |
//"color" => "#000000", | |
"children" => $nodes | |
); | |
echo "var data =eval('('+'" . json_encode($data) . "'+')');"; | |
?> | |
new BubbleTree({ | |
data: data, | |
container: '.bubbletree' | |
}); | |
}); | |
</script> | |
</head> | |
<body> | |
<div class="bubbletree-wrapper"> | |
<div class="bubbletree"></div> | |
</div> | |
</body> | |
</html> | |
<?php | <?php |
function include_header($title) { | function include_header($title) { |
global $basePath; | global $basePath; |
?> | ?> |
<!DOCTYPE html> | <!DOCTYPE html> |
<!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ --> | <!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ --> |
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]--> | <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]--> |
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]--> | <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]--> |
<!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]--> | <!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]--> |
<!--[if gt IE 8]><!--> <html lang="en"> <!--<![endif]--> | <!--[if gt IE 8]><!--> <html lang="en"> <!--<![endif]--> |
<head> | <head> |
<meta charset="utf-8" /> | <meta charset="utf-8" /> |
<!-- Set the viewport width to device width for mobile --> | <!-- Set the viewport width to device width for mobile --> |
<meta name="viewport" content="width=device-width" /> | <meta name="viewport" content="width=device-width" /> |
<title><?php echo $title; ?> - Disclosr</title> | <title><?php echo $title; ?> - Disclosr</title> |
<!-- Included CSS Files --> | <!-- Included CSS Files --> |
<link href="<?php echo $basePath ?>css/bootstrap.min.css" rel="stylesheet"> | <link href="<?php echo $basePath ?>css/bootstrap.min.css" rel="stylesheet"> |
<style type="text/css"> | <style type="text/css"> |
body { | body { |
padding-top: 60px; | padding-top: 60px; |
padding-bottom: 40px; | padding-bottom: 40px; |
} | } |
.sidebar-nav { | .sidebar-nav { |
padding: 9px 0; | padding: 9px 0; |
} | } |
</style> | </style> |
<link href="<?php echo $basePath ?>css/bootstrap-responsive.min.css" rel="stylesheet"> | <link href="<?php echo $basePath ?>css/bootstrap-responsive.min.css" rel="stylesheet"> |
<!--[if lt IE 9]> | <!--[if lt IE 9]> |
<link rel="stylesheet" href="<?php echo $basePath ?>stylesheets/ie.css"> | <link rel="stylesheet" href="<?php echo $basePath ?>stylesheets/ie.css"> |
<![endif]--> | <![endif]--> |
<!-- IE Fix for HTML5 Tags --> | <!-- IE Fix for HTML5 Tags --> |
<!--[if lt IE 9]> | <!--[if lt IE 9]> |
<script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script> | <script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script> |
<![endif]--> | <![endif]--> |
</head> | </head> |
<body xmlns:schema="http://schema.org/" xmlns:foaf="http://xmlns.com/foaf/0.1/"> | <body xmlns:schema="http://schema.org/" xmlns:foaf="http://xmlns.com/foaf/0.1/"> |
<div class="navbar navbar-inverse navbar-fixed-top"> | <div class="navbar navbar-inverse navbar-fixed-top"> |
<div class="navbar-inner"> | <div class="navbar-inner"> |
<div class="container-fluid"> | <div class="container-fluid"> |
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> | <a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> |
<span class="icon-bar"></span> | <span class="icon-bar"></span> |
<span class="icon-bar"></span> | <span class="icon-bar"></span> |
<span class="icon-bar"></span> | <span class="icon-bar"></span> |
</a> | </a> |
<a class="brand" href="#">Disclosr</a> | <a class="brand" href="#">Disclosr</a> |
<div class="nav-collapse collapse"> | <div class="nav-collapse collapse"> |
<ul class="nav"> | <ul class="nav"> |
<li><a href="getAgency.php">Agencies</a></li> | <li><a href="getAgency.php">Agencies</a></li> |
<li><a href="headcount.php">Employee Headcount Graph</a></li> | |
<li><a href="budget.php">Budget Graph</a></li> | |
<li><a href="about.php">About/FAQ</a></li> | <li><a href="about.php">About/FAQ</a></li> |
</ul> | </ul> |
</div><!--/.nav-collapse --> | </div><!--/.nav-collapse --> |
</div> | </div> |
</div> | </div> |
</div> | </div> |
<div class="container-fluid"> | <div class="container-fluid"> |
<?php } | <?php } |
function include_footer() { | function include_footer() { |
global $basePath; | global $basePath; |
?> | ?> |
</div> <!-- /container --> | </div> <!-- /container --> |
<hr> | <hr> |
<footer> | <footer> |
<p>Not affiliated with or endorsed by any government agency.</p> | <p>Not affiliated with or endorsed by any government agency.</p> |
</footer> | </footer> |
<!-- Included JS Files --> | <!-- Included JS Files --> |
<script src="http://code.jquery.com/jquery-1.7.1.min.js"></script> | <script src="http://code.jquery.com/jquery-1.7.1.min.js"></script> |
<script type="text/javascript" src="<?php echo $basePath ?>js/flotr2/flotr2.js"></script> | <script type="text/javascript" src="<?php echo $basePath ?>js/flotr2/flotr2.js"></script> |
<?php | <?php |
if (strpos($_SERVER['SERVER_NAME'], ".gs") !== false) { // analytics only on the live disclo.gs site
?> | ?> |
<script type="text/javascript"> | <script type="text/javascript"> |
var _gaq = _gaq || []; | var _gaq = _gaq || []; |
_gaq.push(['_setAccount', 'UA-12341040-2']); | _gaq.push(['_setAccount', 'UA-12341040-2']); |
_gaq.push(['_trackPageview']); | _gaq.push(['_trackPageview']); |
(function() { | (function() { |
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; | var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; |
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; | ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; |
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); | var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); |
})(); | })(); |
    </script>
<?php } ?>
</body>
</html>
<?php
} | } |