Change has* fields to arrays of URLs that confirm those assertions
Former-commit-id: 83e592c1490220f07899eff9a51c768ee6067720
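
A rough illustration of the data model change, not part of this commit. It assumes a has... field now counts as "true" when its array holds at least one confirming URL, and applies the +1/-1 rule described in about.php. The helper names (has_field_is_true, score_agency) and the example document are hypothetical; the field names are taken from schemas/agency.json.php.

```python
# Hypothetical sketch only: these helpers do not exist in the repository.

def has_field_is_true(value):
    """Treat a has... field as true when it lists at least one non-empty URL."""
    return isinstance(value, list) and any(
        isinstance(url, str) and url.strip() for url in value
    )


def score_agency(doc):
    """+1 for every true has... field, -1 for every false one (about.php rule)."""
    score = 0
    for name, value in doc.items():
        if name.startswith("has"):
            score += 1 if has_field_is_true(value) else -1
    return score


# Made-up agency document using the new array-of-URLs representation.
example = {
    "name": "Example Agency",
    "hasTwitter": ["http://twitter.com/example"],
    "hasFlickr": [],
    "hasRSS": ["http://example.gov.au/rss", "http://example.gov.au/news.xml"],
}

print(score_agency(example))
```

Under this assumption a legacy string value such as "true" scores as false, so existing records would need migrating to arrays before being scored this way.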
--- a/.gitmodules
+++ b/.gitmodules
@@ -10,4 +10,7 @@
[submodule "lib/php-diff"]
path = lib/php-diff
url = https://github.com/chrisboulton/php-diff.git
+[submodule "javascripts/flot"]
+ path = javascripts/flot
+ url = https://github.com/paradoxxxzero/flot.git
--- a/about.php
+++ b/about.php
@@ -13,48 +13,49 @@
Prometheus is the agent which polls agency websites to assess compliance.
<h2> Open everything </h2>
-all documents released CC-BY 3 AU
+All documents released CC-BY 3 AU
Open source git @
<h2>Organisational Data Sources</h2>
http://www.comlaw.gov.au/Browse/Results/ByTitle/AdministrativeArrangementsOrders/Current/Ad/0 defines departments
-Agencies can be found in the Schedule to an Appropriation Bill (budget), Schedule to FMA Regulations and/or Public Service Act.
+Agencies can be found in the Schedule to an Appropriation Bill (budget), Schedule to FMA Regulations and/or Public Service Act.<br>
-http://www.finance.gov.au/publications/flipchart/docs/FMACACFlipchart.pdf summarises these. view-source:https://www.tenders.gov.au/?event=public.advancedsearch.home is great for the suspended/active status
+http://www.finance.gov.au/publications/flipchart/docs/FMACACFlipchart.pdf summarises these. view-source:https://www.tenders.gov.au/?event=public.advancedsearch.home is great for the suspended/active status<br>
When defining the hierachy, this system is designed towards monitoring accountablity. Thus large agencies that have registered their own ABN
and have their own accountablity mechanisms/website receive a seperate record as a child of their department.
-Some small agencies will choose to simply rely on their parent department's accountablity measures.
+Some small agencies will choose to simply rely on their parent department's accountability measures.<br>
This flows through to organisation name and other/past names. A department that completely accounts for an agency will list that agency as an other child name.
As agencies themselves shift between departments, there may be scope for providing time ranges but typically the newest hierarchy will be the one recorded.
-A department/agency name will be the newest active name assigned to that ABN.
+A department/agency name will be the newest active name assigned to that ABN.<br>
ABN information is derived from the ABR. This is the definitive umpire about which former name should be linked to which current name.
For example "Department of Transport and Regional Services" became "Department of Infrastructure, Transport, Regional Development and Local Government" (same ABN)
however it later split into "Department of Infrastructure and Transport" (same ABN)
-and "Department of Regional Australia, Regional Development and Local Government" (new ABN).
+and "Department of Regional Australia, Regional Development and Local Government" (new ABN).<br>
Statistical information from http://www.apsc.gov.au/stateoftheservice/1011/statsbulletin/section1.html#t2total https://www.apsedii.gov.au/apsedii/CustomQueryx33.shtml
-and individual annual reports.
+and individual annual reports.<br>
-Webpage Assessment
+<h2>Webpage Assessment</h2>
Much due care has been put into correctly recording disclosure URLs. Typically the "About", "Corporate", "Publications" and "Sitemap" sections are checked at the very least.
-Occasionally it is nessicary to use a site or Google search. In several rare cases, there is a secret "Disclosure" navigation menu you can find if you find one of the mandatory publishing obligations in that category (seriously).
-Some rules about leniency:
- An empty FOI disclosure log counts, a page outlining what the FOI Act is does not.
- A disclosure log in PDF or Word format counts :(
- An empty File/Record list counts (although that's very minimalistic that you have no files, electronic or paper)
- Only a current information publication scheme page counts, not a s.9 FOI Act page or an organisation chart.
- If there isn't a page easily listing all current and past Annual Reports, the most current one (html, pdf) counts.
- Consultancy contracts might not need it's own webpage (if in Annual Report), grants/appointments might not apply to all organisations but Legal Services Expenditure (and all other obligations) does need a webpage.
+Occasionally it is necessary to use a site or Google search. In several rare cases, there is a secret "Disclosure" navigation menu you can find if you find one of the mandatory publishing obligations in that category (seriously).<br>
+Some rules about leniency:<br>
+<ul>
+ <li>An empty FOI disclosure log counts, a page outlining what the FOI Act is does not.</li>
+ <li>A disclosure log in PDF or Word format counts :(</li>
+ <li>An empty File/Record list counts (although that's very minimalistic that you have no files, electronic or paper)</li>
+ <li>Only a current information publication scheme page counts, not a s.9 FOI Act page or an organisation chart.</li>
+ <li>If there isn't a page easily listing all current and past Annual Reports, the most current one (html, pdf) counts.</li>
+ <li>Consultancy contracts might not need their own webpage (if in Annual Report), grants/appointments might not apply to all organisations but Legal Services Expenditure (and all other obligations) does need a webpage.</li>
+</ul>
<h2>Open Government Scoring</h2>
-+1 point for every true Has... attribute
--1 point for every false Has... (ie. Has Not) attribute
++1 point for every true Has... attribute<br>
+-1 point for every false Has... (ie. Has Not) attribute<br>
-Don't like this? Make your own score, suggest a better scoring mechanism.
+Don't like this? Make your own score, suggest a better scoring mechanism.<br>
<?php
include_footer();
--- a/alaveteli/exportCategories.rb.php
+++ b/alaveteli/exportCategories.rb.php
@@ -1,19 +1,20 @@
<?php
+
include_once("../include/common.inc.php");
setlocale(LC_CTYPE, 'C');
- header('Content-Type: text/csv');
- header('Content-Disposition: attachment; filename="public_body_categories_en.rb"');
- header('Pragma: no-cache');
- header('Expires: 0');
-echo 'PublicBodyCategories.add(:en, ['.PHP_EOL;
-echo ' "Portfolios",'.PHP_EOL;
+header('Content-Type: text/csv');
+header('Content-Disposition: attachment; filename="public_body_categories_en.rb"');
+header('Pragma: no-cache');
+header('Expires: 0');
+echo 'PublicBodyCategories.add(:en, [' . PHP_EOL;
+echo ' "Portfolios",' . PHP_EOL;
$db = $server->get_db('disclosr-agencies');
try {
$rows = $db->get_view("app", "byDeptStateName", null, true)->rows;
//print_r($rows);
foreach ($rows as $row) {
- echo ' [ "'.phrase_to_tag(dept_to_portfolio($row->key)).'","'. dept_to_portfolio($row->key).'","part of the '.dept_to_portfolio($row->key).' portfolio" ],'.PHP_EOL;
+ echo ' [ "' . phrase_to_tag(dept_to_portfolio($row->key)) . '","' . dept_to_portfolio($row->key) . '","part of the ' . dept_to_portfolio($row->key) . ' portfolio" ],' . PHP_EOL;
}
} catch (SetteeRestClientException $e) {
setteErrorHandler($e);
--- /dev/null
+++ b/charts.php
@@ -1,1 +1,102 @@
+<?php
+include_once('include/common.inc.php');
+include_header();
+$db = $server->get_db('disclosr-agencies');
+
+?>
+<div class="foundation-header">
+ <h1><a href="about.php">Charts</a></h1>
+ <h4 class="subheader">Lorem ipsum.</h4>
+</div>
+<div id="placeholder" style="width:900px;height:600px;"></div>
+<script id="source">
+window.onload = function() {
+ $(document).ready(function() {
+ var d1 = [];
+ var labels = [];
+ <?php
+ try {
+ $rows = $db->get_view("app", "scoreHas?group=true", null, true)->rows;
+
+ /*foreach ($rows as $key => $row) {
+ echo " d1.push([$key, {$row->value}]);".PHP_EOL;
+ echo " labels.push('{$row->key}');".PHP_EOL;
+ }*/
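+    // key by field name so that two fields with the same count do not overwrite each other
+    $dataValues = Array();
+    foreach ($rows as $row) {
+        $dataValues[$row->key] = $row->value;
+    }
+    $i = 0;
+    asort($dataValues); // sort by count, ascending, keeping field names as keys
+    foreach($dataValues as $key => $value) {
+
+echo "       d1.push([$i, $value]);".PHP_EOL;
+echo "       labels.push('$key');".PHP_EOL;
+ $i++;
+}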
+} catch (SetteeRestClientException $e) {
+ setteErrorHandler($e);
+}
+?>
+
+ $.plot($("#placeholder"), [ d1], {
+ grid: { hoverable: true },
+
+ series: {
+ bars: { show: true, barWidth: 0.6 }
+ },
+ xaxis: {
+ tickFormatter: function formatter(val, axis) {
+ if (labels[val]) {
+ return(labels[val]);
+
+ } else {
+ return "";
+ }
+
+ },
+ labelAngle: 90
+ }
+ });
+
+var previousPoint = null;
+$("#placeholder").bind("plothover", function (event, pos, item) {
+ if (item) {
+        if (previousPoint != item.dataIndex) {
+            previousPoint = item.dataIndex;
+
+ $("#tooltip").remove();
+ var x = item.datapoint[0],
+ y = item.datapoint[1] - item.datapoint[2];
+
+ showTooltip(item.pageX, item.pageY, y );
+ }
+ }
+ else {
+ $("#tooltip").remove();
+ previousPoint = null;
+ }
+});
+
+
+
+});
+};
+ function showTooltip(x, y, contents) {
+ $('<div id="tooltip">' + contents + '</div>').css( {
+ position: 'absolute',
+ display: 'none',
+ top: y + 5,
+ left: x + 5,
+ border: '1px solid #fdd',
+ padding: '2px',
+ 'background-color': '#fee',
+ opacity: 0.80
+ }).appendTo("body").fadeIn(200);
+ }
+</script>
+
+<?php
+include_footer();
+?>
--- a/getAgency.php
+++ b/getAgency.php
@@ -58,9 +58,7 @@
echo "<option value='{$row->value}'" . (($row->value == $value) ? "SELECTED" : "") . " >" . str_replace("Department of ", "", $row->key) . "</option>";
}
echo" </select>";
- } else if (strpos($key, "has") === 0) {
- echo "<label for='$key'><input type='checkbox' id='$key' name='$key' " . (($value == 'on' || $value == 'true') ? "checked='$value'" : "") . "> $key</label>";
- } else {
+ } else {
echo "<label>$key</label><input class='input-text' type='text' id='$key' name='$key' value='$value'/>";
if ((strpos($key, "URL") > 0 || $key == 'website') && $value != "") {
echo "<a href='$value'>view</a>";
@@ -80,11 +78,9 @@
foreach ($defaultFields as $defaultField) {
if (!isset($row[$defaultField])) {
if ($schemas['agency']['properties'][$defaultField]['type'] == "string") {
- if (strpos($defaultField, "has") === 0) {
- $row[$defaultField] = "false";
- } else {
+
$row[$defaultField] = "";
- }
+
}
if ($schemas['agency']['properties'][$defaultField]['type'] == "array") {
@@ -124,7 +120,7 @@
}
}
- $mode = "view";
+ $mode = "edit";
if ($mode == "edit") {
$row = addDefaultFields(object_to_array($row));
} else {
--- a/include/couchdb.inc.php
+++ b/include/couchdb.inc.php
@@ -1,8 +1,8 @@
<?php
-include $basePath."schemas/schemas.inc.php";
+include $basePath . "schemas/schemas.inc.php";
-require ($basePath.'couchdb/settee/src/settee.php');
+require ($basePath . 'couchdb/settee/src/settee.php');
function createAgencyDesignDoc() {
global $db;
@@ -11,38 +11,50 @@
$obj->language = "javascript";
$obj->views->all->map = "function(doc) { emit(doc._id, doc); };";
$obj->views->byABN->map = "function(doc) { emit(doc.abn, doc); };";
- $obj->views->byCanonicalName->map = "function(doc) {
+ $obj->views->byCanonicalName->map = "function(doc) {
if (doc.parentOrg || doc.orgType == 'FMA-DepartmentOfState') {
emit(doc.name, doc);
}
};";
- $obj->views->byDeptStateName->map = "function(doc) {
+ $obj->views->byDeptStateName->map = "function(doc) {
if (doc.orgType == 'FMA-DepartmentOfState') {
emit(doc.name, doc._id);
}
};";
- $obj->views->parentOrgs->map = "function(doc) {
+ $obj->views->parentOrgs->map = "function(doc) {
if (doc.parentOrg) {
emit(doc._id, doc.parentOrg);
}
};";
- $obj->views->byName->map = "function(doc) {
+ $obj->views->byName->map = 'function(doc) {
+ if (typeof(doc["status"]) == "undefined" || doc["status"] != "suspended") {
emit(doc.name, doc._id);
for (name in doc.otherNames) {
-if (doc.otherNames[name] != '' && doc.otherNames[name] != doc.name) {
+if (doc.otherNames[name] != "" && doc.otherNames[name] != doc.name) {
emit(doc.otherNames[name], doc._id);
}
}
-};";
-
- $obj->views->foiEmails->map = "function(doc) {
+ }
+};';
+
+ $obj->views->foiEmails->map = "function(doc) {
emit(doc._id, doc.foiEmail);
};";
-
+
$obj->views->byLastModified->map = "function(doc) { emit(doc.metadata.lastModified, doc); }";
$obj->views->getActive->map = 'function(doc) { if (doc.status == "active") { emit(doc._id, doc); } };';
$obj->views->getSuspended->map = 'function(doc) { if (doc.status == "suspended") { emit(doc._id, doc); } };';
- $obj->views->getScrapeRequired->map = "function(doc) { emit(doc.abn, doc); };";
+ $obj->views->getScrapeRequired->map = "function(doc) {
+
+var lastScrape = Date.parse(doc.metadata.lastScraped);
+
+var today = new Date();
+
+if (!lastScrape || lastScrape + 1000 != today.getTime()) {
+ emit(doc._id, doc);
+}
+
+};";
$obj->views->showNamesABNs->map = "function(doc) { emit(doc._id, {name: doc.name, abn: doc.abn}); };";
$obj->views->getConflicts->map = "function(doc) {
if (doc._conflicts) {
@@ -50,39 +62,74 @@
}
}";
// http://stackoverflow.com/questions/646628/javascript-startswith
- $obj->views->score->map = 'if(!String.prototype.startsWith){
+ $obj->views->scoreHas->map = 'if(!String.prototype.startsWith){
String.prototype.startsWith = function (str) {
return !this.indexOf(str);
}
}
-
+if(!String.prototype.endsWith){
+ String.prototype.endsWith = function(suffix) {
+ return this.indexOf(suffix, this.length - suffix.length) !== -1;
+ };
+}
function(doc) {
-count = 0;
if (typeof(doc["status"]) == "undefined" || doc["status"] != "suspended") {
for(var propName in doc) {
- if(typeof(doc[propName]) != "undefined" && propName.startsWith("l")) {
- count++
+ if(typeof(doc[propName]) != "undefined" && (propName.startsWith("has") || propName.endsWith("URL"))) {
+ emit(propName, 1);
}
}
- emit(count+doc._id, {id:doc._id, name: doc.name, score:count});
+ emit("total", 1);
}
}';
-
+ $obj->views->scoreHas->map = 'if(!String.prototype.startsWith){
+ String.prototype.startsWith = function (str) {
+ return !this.indexOf(str);
+ }
+}
+if(!String.prototype.endsWith){
+ String.prototype.endsWith = function(suffix) {
+ return this.indexOf(suffix, this.length - suffix.length) !== -1;
+ };
+}
+function(doc) {
+if (typeof(doc["status"]) == "undefined" || doc["status"] != "suspended") {
+for(var propName in doc) {
+ if(typeof(doc[propName]) != "undefined" && (propName.startsWith("has") || propName.endsWith("URL"))) {
+ emit(propName, 1);
+ }
+}
+ emit("total", 1);
+ }
+}';
+ $obj->views->scoreHas->reduce = 'function (key, values, rereduce) {
+ return sum(values);
+}';
+ $obj->views->fieldNames->map = '
+function(doc) {
+for(var propName in doc) {
+ emit(propName, doc._id);
+ }
+
+}';
+    $obj->views->fieldNames->reduce = 'function (key, values, rereduce) {
+    return rereduce ? sum(values) : values.length;
+}';
// allow safe updates (even if slightly slower due to extra: rev-detection check).
return $db->save($obj, true);
}
+if (php_uname('n') == "vanille") {
-if( php_uname('n') == "vanille") {
+ $server = new SetteeServer('http://192.168.178.21:5984');
+} else
+if (php_uname('n') == "KYUUBEY") {
-$server = new SetteeServer('http://192.168.178.21:5984');
-} else
- if( php_uname('n') == "KYUUBEY") {
-
-$server = new SetteeServer('http://192.168.1.148:5984');
+ $server = new SetteeServer('http://192.168.1.148:5984');
} else {
$server = new SetteeServer('http://127.0.0.1:5984');
}
+
function setteErrorHandler($e) {
echo $e->getMessage() . "<br>" . PHP_EOL;
}
--- a/include/template.inc.php
+++ b/include/template.inc.php
@@ -69,6 +69,9 @@
<script src="<?php echo $basePath; ?>javascripts/foundation.js"></script>
<script src="<?php echo $basePath; ?>javascripts/app.js"></script>
<script src="http://code.jquery.com/jquery-1.7.1.min.js"></script>
+
+ <!--<script language="javascript" type="text/javascript" src="javascripts/jquery.js"></script>-->
+        <script language="javascript" type="text/javascript" src="<?php echo $basePath; ?>javascripts/flot/jquery.flot.js"></script>
</body>
</html>
--- a/javascripts/app.js
+++ b/javascripts/app.js
@@ -43,7 +43,7 @@
/* PLACEHOLDER FOR FORMS ------------- */
/* Remove this and jquery.placeholder.min.js if you don't need :) */
- $('input, textarea').placeholder();
+ //$('input, textarea').placeholder();
--- /dev/null
+++ b/javascripts/flot
--- a/schemas/agency.json.php
+++ b/schemas/agency.json.php
@@ -24,17 +24,30 @@
"consultanciesURL" => Array("type" => "string", "required" => true, "x-title" => "Consultants Hired", "description" => ""),
"legalExpenditureURL" => Array("type" => "string", "required" => true, "x-title" => "Legal Services Expenditure", "description" => "Legal Services Expenditure mandated by Legal Services Directions 2005"),
"recordsListURL" => Array("type" => "string", "required" => true, "x-title" => "Files/Records Held", "description" => "Indexed lists of departmental and agency files, <a href='http://www.aph.gov.au/senate/pubs/standing_orders/d05.htm'>mandated by the Senate</a>"),
- "FOIDocumentsURL" => Array("type" => "string", "required" => true, "x-title" => "FOI Documents Released", "description" => ""),
- "infoPublicationSchemeURL" => Array("type" => "string", "required" => true, "x-title" => "Information Publication Scheme", "description" => ""),
+ "FOIDocumentsURL" => Array("type" => "string", "required" => true, "x-title" => "FOI Documents Released", "description" => "FOI Disclosure Log URL"),
+ "FOIDocumentsRSSURL" => Array("type" => "string", "required" => false, "x-title" => "RSS Feed of FOI Documents Released", "description" => "FOI Disclosure Log in RSS format"),
+ "hasFOIPDF" => Array("type" => "array", "required" => false, "x-title" => "Has FOI Documents Released in PDF", "description" => "FOI Disclosure Log contains any PDFs",
+ "items" => Array("type" => "string")),
+ "infoPublicationSchemeURL" => Array("type" => "string", "required" => true, "x-title" => "Information Publication Scheme", "description" => ""),
"appointmentsURL" => Array("type" => "string", "required" => true, "x-title" => "Agency Appointments/Boards", "description" => "Departmental and agency appointments and vacancies , <a href='http://www.aph.gov.au/senate/pubs/standing_orders/d05.htm'>mandated by the Senate</a>"),
"advertisingURL" => Array("type" => "string", "required" => true, "x-title" => "Approved Advertising Campaigns", "description" => " Agency advertising and public information projects, <a href='http://www.aph.gov.au/senate/pubs/standing_orders/d05.htm'>mandated by the Senate</a> "),
- "hasRSS" => Array("type" => "string", "required" => true, "x-title" => "Has RSS", "description" => ""),
- "hasMailingList" => Array("type" => "string", "required" => true, "x-title" => "Has Mailing List", "description" => ""),
- "hasTwitter" => Array("type" => "string", "required" => true, "x-title" => "Has Twitter", "description" => ""),
- "hasFacebook" => Array("type" => "string", "required" => true, "x-title" => "Has Facebook", "description" => ""),
- "hasYouTube" => Array("type" => "string", "required" => true, "x-title" => "Has YouTube", "description" => ""),
- "hasFlickr" => Array("type" => "string", "required" => true, "x-title" => "Has Flickr", "description" => ""),
- "hasCCBY" => Array("type" => "string", "required" => true, "x-title" => "Has CC-BY", "description" => "Has any page licenced Creative Commons - Attribution"),
+            "hasRSS" => Array("type" => "array", "required" => true, "x-title" => "Has RSS", "description" => "",
+                "items" => Array("type" => "string")),
+ "hasMailingList" => Array("type" => "array", "required" => true, "x-title" => "Has Mailing List", "description" => "",
+ "items" => Array("type" => "string")),
+ "hasTwitter" => Array("type" => "array", "required" => true, "x-title" => "Has Twitter", "description" => "",
+ "items" => Array("type" => "string")),
+ "hasFacebook" => Array("type" => "array", "required" => true, "x-title" => "Has Facebook", "description" => "",
+ "items" => Array("type" => "string")),
+ "hasYouTube" => Array("type" => "array", "required" => true, "x-title" => "Has YouTube", "description" => "",
+ "items" => Array("type" => "string")),
+ "hasFlickr" => Array("type" => "array", "required" => true, "x-title" => "Has Flickr", "description" => "",
+ "items" => Array("type" => "string")),
+ "hasCCBY" => Array("type" => "array", "required" => true, "x-title" => "Has CC-BY", "description" => "Has any page licenced Creative Commons - Attribution",
+ "items" => Array("type" => "string")),
+ "hasRestrictiveLicence" => Array("type" => "array","required" => true, "x-title" => "Has Restrictive Licence", "description" => "Has any page licenced under terms more restrictive than Crown Copyright",
+ "items" => Array("type" => "string")),
+ "hasCrownCopyright" => Array("type" => "array", "required" => true, "x-title" => "Has Standard Crown Copyright licence", "description" => "Has any page still licenced under the former Commonwealth Copyright Administration",
+ "items" => Array("type" => "string")),
),
/* "org":{"type":"object",
"properties":{
--- a/scrape.py
+++ b/scrape.py
@@ -3,15 +3,10 @@
import urllib2
from BeautifulSoup import BeautifulSoup
import re
-
-couch = couchdb.Server('http://192.168.1.148:5984/')
-
-# select database
-agencydb = couch['disclosr-agencies']
-
-for row in agencydb.view('app/getScrapeRequired'): #not recently scraped agencies view?
- agency = agencydb.get(row.id)
- print agency['agencyName']
+import hashlib
+from urlparse import urljoin
+import time
+import os
#http://diveintopython.org/http_web_services/etags.html
class NotModifiedHandler(urllib2.BaseHandler):
@@ -20,46 +15,102 @@
addinfourl.code = code
return addinfourl
-def scrapeAndStore(URL, depth, agency):
- URL = "http://www.hole.fi/jajvirta/weblog/"
- req = urllib2.Request(URL)
+def fetchURL(docsdb, url, fieldName, agencyID, scrape_again=True):
+ hash = hashlib.md5(url).hexdigest()
+ req = urllib2.Request(url)
+ print "Fetching %s" % url
+ doc = docsdb.get(hash)
+ if doc == None:
+ doc = {'_id': hash, 'agencyID': agencyID, 'url': url, 'fieldName':fieldName}
+ else:
+ if (time.time() - doc['page_scraped']) < 3600:
+ print "Uh oh, trying to scrape URL again too soon!"
+            last_attachment_fname = sorted(doc["_attachments"].keys())[-1] # attachment names start with the epoch time
+ last_attachment = docsdb.get_attachment(doc,last_attachment_fname)
+ return (doc['mime_type'],last_attachment)
+ if scrape_again == False:
+ print "Not scraping this URL again as requested"
+ return (None,None)
+
+ time.sleep(3) # wait 3 seconds to give webserver time to recover
- #if there is a previous version sotred in couchdb, load caching helper tags
- if etag:
- req.add_header("If-None-Match", etag)
- if last_modified:
- req.add_header("If-Modified-Since", last_modified)
+ #if there is a previous version stored in couchdb, load caching helper tags
+ if doc.has_key('etag'):
+ req.add_header("If-None-Match", doc['etag'])
+ if doc.has_key('last_modified'):
+ req.add_header("If-Modified-Since", doc['last_modified'])
opener = urllib2.build_opener(NotModifiedHandler())
url_handle = opener.open(req)
headers = url_handle.info() # the addinfourls have the .info() too
- etag = headers.getheader("ETag")
- last_modified = headers.getheader("Last-Modified")
- web_server = headers.getheader("Server")
- file_size = headers.getheader("Content-Length")
- mime_type = headers.getheader("Content-Type")
-
- if hasattr(url_handle, 'code')
+ doc['etag'] = headers.getheader("ETag")
+ doc['last_modified'] = headers.getheader("Last-Modified")
+ doc['date'] = headers.getheader("Date")
+ doc['page_scraped'] = time.time()
+ doc['web_server'] = headers.getheader("Server")
+ doc['powered_by'] = headers.getheader("X-Powered-By")
+ doc['file_size'] = headers.getheader("Content-Length")
+ doc['mime_type'] = headers.getheader("Content-Type").split(";")[0]
+ if hasattr(url_handle, 'code'):
if url_handle.code == 304:
print "the web page has not been modified"
+ return (None,None)
else:
- #do scraping
- html = url_handle.read()
- # http://www.crummy.com/software/BeautifulSoup/documentation.html
- soup = BeautifulSoup(html)
- links = soup.findAll('a') # soup.findAll('a', id=re.compile("^p-"))
- for link in links:
- print link['href']
- #for each unique link
- #if html mimetype
- # go down X levels,
- # diff with last stored attachment, store in document
- #if not
- # remember to save parentURL and title (link text that lead to document)
-
+ content = url_handle.read()
+ docsdb.save(doc)
+ doc = docsdb.get(hash) # need to get a _rev
+ docsdb.put_attachment(doc, content, str(time.time())+"-"+os.path.basename(url), doc['mime_type'])
+ return (doc['mime_type'], content)
#store as attachment epoch-filename
else:
- print "error %s in downloading %s", url_handle.code, URL
- #record/alert error to error database
-
-
+        print "error %s in downloading %s" % (url_handle.code, url)
+        doc['error'] = "error %s in downloading %s" % (url_handle.code, url)
+ docsdb.save(doc)
+ return (None,None)
+
+
+
+def scrapeAndStore(docsdb, url, depth, fieldName, agencyID):
+ (mime_type,content) = fetchURL(docsdb, url, fieldName, agencyID)
+ if content != None and depth > 0:
+ if mime_type == "text/html" or mime_type == "application/xhtml+xml" or mime_type =="application/xml":
+ # http://www.crummy.com/software/BeautifulSoup/documentation.html
+ soup = BeautifulSoup(content)
+ navIDs = soup.findAll(id=re.compile('nav|Nav|menu|bar'))
+ for nav in navIDs:
+ print "Removing element", nav['id']
+ nav.extract()
+ navClasses = soup.findAll(attrs={'class' : re.compile('nav|menu|bar')})
+ for nav in navClasses:
+ print "Removing element", nav['class']
+ nav.extract()
+ links = soup.findAll('a') # soup.findAll('a', id=re.compile("^p-"))
+ linkurls = set([])
+ for link in links:
+ if link.has_key("href"):
+ if link['href'].startswith("http"):
+ # lets not do external links for now
+ # linkurls.add(link['href'])
+                        pass
+ else:
+ linkurls.add(urljoin(url,link['href'].replace(" ","%20")))
+ for linkurl in linkurls:
+ #print linkurl
+ scrapeAndStore(docsdb, linkurl, depth-1, fieldName, agencyID)
+
+couch = couchdb.Server('http://127.0.0.1:5984/')
+
+# select database
+agencydb = couch['disclosr-agencies']
+docsdb = couch['disclosr-documents']
+
+for row in agencydb.view('app/getScrapeRequired'): #not recently scraped agencies view?
+ agency = agencydb.get(row.id)
+ print agency['name']
+ for key in agency.keys():
+        if (key == 'website' or key.endswith('URL')) and agency[key]: # skip fields with no URL recorded
+ print key
+ scrapeAndStore(docsdb, agency[key],agency['scrapeDepth'],key,agency['_id'])
+    agency['metadata']['lastScraped'] = time.time() # same key that the getScrapeRequired view reads
+ agencydb.save(agency)
+