Preserving Public Government Information: The 2008 End of Term Crawl Project

Slide 1

Preserving Public Government Information: The 2008 End of Term Crawl Project
Abbie Grotke, Library of Congress
Mark Phillips, University of North Texas Libraries
George Barnum, U.S. Government Printing Office
CNI Fall Task Force Meeting, December 9, 2008

Slide 2

Outline
- Project Goals and History
- Nomination of URLs
- Partner Activities
- Future Work

Slide 3

Project Goals
- Work collaboratively to preserve public U.S. Government Web sites at the end of the current presidential administration, ending January 19, 2009.
- Document federal agencies' presence on the Web during the transition of Presidential administrations.
- Enhance the existing collections of the five partner institutions.

Slide 4

Project History
Collaborating Institutions:
- Library of Congress
- Internet Archive
- California Digital Library
- University of North Texas
- U.S. Government Printing Office

Slide 5

Project History
First Meeting – Canberra, Australia
- Early April 2008, at the National Library of Australia, International Internet Preservation Consortium (IIPC) – formed the partnership and discussed ideas and possible roles for each institution.
- Agreed from the very beginning to share all content with any partner who wanted a copy.

Slide 6

Project History
Monthly meetings since that time – conference calls and one face-to-face meeting.
- Defined roles
- Released an announcement
- Sought help from specialists to select URLs for collection
- Shared technology planning
- Developed URL Nomination Tool

Slide 7

Pause for Vocabulary
- Seed List – List of URLs fed to the crawler for harvesting.
- Crawler – Software that downloads a document, parses its content to extract URLs, adds those URLs to its list, and repeats.
- Scope – Whether a URL should be included or not.
- Crawl – Running a crawler on a given seed list.
- SURT – Sort-friendly URI Reordering Transform.
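
The crawler definition above can be sketched as a short loop. This is only an illustration under stated assumptions: `FAKE_WEB`, `in_scope`, and `crawl` are hypothetical names, the "web" is an in-memory dictionary rather than real HTTP, and the scope rule is a toy one. It is not the crawler used by the project partners.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> document text.
# A real crawler would download these over HTTP instead.
FAKE_WEB = {
    "http://a.example/": (
        'see <a href="http://a.example/one">one</a> '
        'and <a href="http://b.example/">b</a>'
    ),
    "http://a.example/one": 'back to <a href="http://a.example/">home</a>',
    "http://b.example/": "no links here",
}

HREF = re.compile(r'href="([^"]+)"')


def in_scope(url):
    """Toy scope rule for this sketch: keep only URLs on a.example."""
    return url.startswith("http://a.example/")


def crawl(seed_list):
    """Run the crawler on a seed list; return harvested URLs in visit order."""
    frontier = deque(seed_list)   # URLs waiting to be downloaded
    seen = set(seed_list)         # URLs already queued, to avoid repeats
    harvested = []
    while frontier:
        url = frontier.popleft()
        document = FAKE_WEB.get(url)          # "download" the document
        if document is None:
            continue
        harvested.append(url)
        for link in HREF.findall(document):   # parse content to extract URLs
            if link not in seen and in_scope(link):
                seen.add(link)
                frontier.append(link)         # add to the list, then repeat
    return harvested
```

Calling `crawl(["http://a.example/"])` visits the seed and the one in-scope page it links to, and skips the out-of-scope `b.example` link.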

Slide 8

In Scope versus Out of Scope
- In scope: Federal government Web sites (.gov, .mil, etc.) in the Legislative, Executive, or Judicial branches of government. Of particular interest for prioritization are sites likely to change dramatically or disappear during the transition of government.
- Out of scope: Local or state government Web sites, or any other site not part of the above federal government domain.

Slide 11

– really a .gov?
- Content copyright © 2008 by Obama-Biden Transition Project, a 501(c)(4) organization.
- A query of .gov whois says the domain was registered by GSA (General Services Administration).
- Other official transition sites:

Slide 12

Tool Building
URL Nomination Tool
- Allows for combining multiple seed lists
- Allows for collaboration with subject experts
- Helps create future seed lists
- Helps to define overall scope of project

Slide 13

URL Nomination Tool
- Allows for collaboration with subject experts
- Ingest seed lists from different sources
- Record known metadata for seed: Branch, Title, Comment, Who nominated
- Allow people to help with nomination
- Search
- Browse
- "Easy to use"
- Create seed lists for crawls

Slide 14

Tool Concepts
URL – Single instance of metadata in system
- URL
- Attribute – (metadata element)
- Value – (metadata value)
- Nominator ID
- Project ID
- Timestamp
- Nominator Email Address
- Nominator Name
- Nominator Institution
- Project Metadata

Slide 15

List of URLs

Slide 16

List of SURTs
- gov.accessamerica
- gov.cancer.2001
- gov.nasa.gsfc.accesstospace
- gov.nasa.gsfc.adc
- gov.nasa.jpl.acrim
- gov.nih.nci.2001
- gov.noaa.fls.acweb
- gov.noaa.nos.acc
- gov.usda.program.1890scholars
- gov.usgs.access
- gov.wa.access
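
The entries above look like host names with their labels reversed, so that sites in the same domain sort next to each other. A minimal sketch of that idea follows. The function name `dotted_surt` is hypothetical, dropping a leading "www" is an assumption, and input URLs are assumed to carry a scheme such as `http://`. This produces only the simplified dotted form shown on the slide; the SURT form used by crawlers such as Heritrix uses commas and parentheses, roughly `http://(gov,nasa,gsfc,adc,)/`.

```python
from urllib.parse import urlsplit


def dotted_surt(url):
    """Reverse the host labels of a URL: adc.gsfc.nasa.gov -> gov.nasa.gsfc.adc."""
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    if labels and labels[0] == "www":   # assumption: a leading "www" is dropped
        labels = labels[1:]
    return ".".join(reversed(labels))


# Sorting the reversed forms groups URLs by domain, then by subdomain.
urls = [
    "http://adc.gsfc.nasa.gov/",
    "http://access.usgs.gov/",
    "http://accesstospace.gsfc.nasa.gov/",
]
grouped = sorted(dotted_surt(u) for u in urls)
```

Here `grouped` places the two `gov.nasa.gsfc` entries side by side ahead of `gov.usgs.access`, which is the property that makes the reversed form convenient for browsing a seed list.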

Slide 17

Back to the tool…
Tool Requirements
- Ingest seed lists from different sources
- Keep track of who nominated seed
- Record known metadata for seed
- Allow people to help with nomination
- Search
- Browse
- "Easy to use"
- Create seed lists for crawls

Slide 18

Batch Ingest
- Administrator can import CSV files with URLs and associated metadata using the batch importer.
- An ingest must be associated with a Nominator and a Project.
- Arbitrary metadata is recognized and added to the system.
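
A batch ingest along these lines might look like the sketch below. Everything here is illustrative: `batch_ingest`, the `url` column name, and the record layout are assumptions, not the Nomination Tool's actual importer or schema. The sketch only shows the three rules from the slide: each row supplies a URL, the ingest is tied to a Nominator and a Project, and any other column is kept as arbitrary metadata.

```python
import csv
import io


def batch_ingest(csv_text, nominator_id, project_id, url_column="url"):
    """Parse CSV text into one record per URL, keeping extra columns as metadata."""
    if not nominator_id or not project_id:
        raise ValueError("an ingest needs both a Nominator and a Project")
    ingested = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get(url_column) or "").strip()
        if not url:
            continue  # skip rows that carry no URL
        # Every other non-empty column is treated as arbitrary metadata.
        metadata = {
            name: value.strip()
            for name, value in row.items()
            if name not in (None, url_column) and value and value.strip()
        }
        ingested.append(
            {
                "url": url,
                "nominator_id": nominator_id,
                "project_id": project_id,
                "metadata": metadata,
            }
        )
    return ingested


# Hypothetical input; the column names other than "url" are free-form.
sample = (
    "url,Branch,Title\n"
    "http://www.example.gov/,Executive,Example Agency\n"
    "http://other.example.gov/,,\n"
)
rows = batch_ingest(sample, nominator_id="admin", project_id="eot2008")
```

The second sample row has no metadata, so it is ingested with an empty `metadata` dictionary rather than being rejected.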

Slide 19

Nomination – In Scope/Out of Scope
- On batch import a URL is given a positive nomination (+1).
- A user of the Nomination Tool can nominate a URL as in scope (+1) or out of scope (-1).
- Nominations are tallied to give a possible measure of importance for a project.
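
The +1 / -1 tally can be sketched in a few lines. The names `tally` and `ranked` are hypothetical, and the example URLs are made up; how the real tool stores or weights nominations is not stated in the slides, so a plain sum is assumed here.

```python
from collections import defaultdict


def tally(nominations):
    """Sum +1 (in scope) and -1 (out of scope) nominations per URL."""
    scores = defaultdict(int)
    for url, vote in nominations:
        if vote not in (1, -1):
            raise ValueError(f"vote must be +1 or -1, got {vote!r}")
        scores[url] += vote
    return dict(scores)


def ranked(scores):
    """Order URLs from highest to lowest score, ties broken by URL."""
    return sorted(scores, key=lambda url: (-scores[url], url))


nominations = [
    ("http://a.example.gov/", 1),   # +1 from batch import
    ("http://a.example.gov/", 1),   # +1 from a volunteer nominator
    ("http://b.example.gov/", 1),   # +1 from batch import
    ("http://b.example.gov/", -1),  # marked out of scope by a nominator
]
scores = tally(nominations)
```

With this input, `a.example.gov` ends with a score of 2 and `b.example.gov` with 0, so `ranked(scores)` lists `a.example.gov` first.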

Slide 20

EOT 2008 Project
Metadata fields defined for EOT 2008 Harvest:
- Branch
- Title
- Comment
Nominators don't need to register, but Name, Email and Institution are required.

Slide 32

Volunteer Nominators
Call for volunteers at end of August sent to lists targeting:
- Government information specialists
- Librarians
- Political and social science researchers
- Academics
- Web archivists (IIPC, Archive-It communities)
31 individuals signed up to help

Slide 33

Nominator To-Dos
- Make nominations based on their interests/expertise
- Nominate the most critical URLs for capture as "in scope"
- Add new URLs not already included in the list
- Mark irrelevant or obsolete sites as "out of scope"
- Add minimal URL metadata

Slide 34

What Did They Do?
- Deadline was November 30, 2008
- 24 volunteers nominated one or more sites (including project team)
- 500 URLs nominated in scope or out of scope

Slide 35

Partner Roles
- Internet Archive – Broad, comprehensive harvests
- Library of Congress – In-depth Legislative branch crawls
- University of North Texas – Sites/Agencies that meet current UNT interests, e.g. environmental policy, and collections, as well as several "deep web" sites
- California Digital Library – Multiple crawls of all seeds in EOT database; sites of interest to their curators
- Government Printing Office – Support and analysis of "official documents" found in collection

Slide 36

Crawl Schedule
Two Approaches:
- Broad, comprehensive crawls
- Prioritized, selective crawls
Key dates:
- Election Day, November 4
- Inauguration Day, January 20

Slide 37

Library of Congress crawl plan
Legislative:
- Enhance existing, ongoing monthly crawls (congressional, aoc, loc, gao, gpo and other miscellaneous URLs) to include prioritized legislative URLs.
- Identified new congressional websites not on ... and ... during an intensive pre-crawl review.
- Continue monthly crawl, but crawl deeper and longer (one month instead of one week), October – February.
- Will also support crawl of all prioritized seeds between election and before inauguration.

Slide 38

October EOT Crawl
- Approximately doubled the number of documents harvested during the crawl.
- Discovering: Twitter, YouTube, MySpace, Flickr

Slide 39

UNT crawl plan
Crawling selected URLs of interest to UNT Libraries, including:
- FEMA
- Energy Information Administration
- Department of Agriculture
- Homeland Security
- Office of Faith-Based and Community Initiatives
- Department of Education
- Fuel Economy
- Environmental ethics and policy materials across agencies
Snapshots: around the election; pre-inauguration; post-inauguration; possibly one year later

Slide 40

CDL crawl plan
- Comprehensive crawl of all seeds in nomination tool
- Before the election; after the election but before inauguration; shortly after inauguration; and six months after inauguration
- Focused crawling of sites of interest to University of California curators
- Using their Web Archive Service

Slide 41

IA crawl plan
- Performed baseline harvest of 2522 seeds from Sept 15, 2008 until Election Day.
- Plan to support interim captures between Election Day and Inauguration Day to support selective harvests.
- Will begin final comprehensive harvest of all URLs nominated and/or visited on Jan. 21, 2009. Will end when full scope of materials has been visited.
- Will only collect new material, i.e. content that was added or that has changed since the baseline harvest.
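
One common way to collect only content that was added or changed is to compare a digest of each page against the digest recorded in the baseline harvest. The sketch below assumes that approach; the slides do not say how the Internet Archive detects changes, and `digest` and `new_or_changed` are hypothetical names.

```python
import hashlib


def digest(content):
    """Hex SHA-1 digest of a page body given as bytes."""
    return hashlib.sha1(content).hexdigest()


def new_or_changed(baseline, current):
    """Return URLs whose content is missing from, or differs from, the baseline.

    baseline: dict mapping URL -> digest recorded during the baseline harvest
    current:  dict mapping URL -> page body (bytes) seen in the later harvest
    """
    return sorted(
        url for url, body in current.items() if baseline.get(url) != digest(body)
    )


baseline = {
    "http://a.example.gov/": digest(b"old home page"),
    "http://a.example.gov/about": digest(b"about us"),
}
current = {
    "http://a.example.gov/": b"new home page",   # changed since baseline
    "http://a.example.gov/about": b"about us",   # unchanged, so skipped
    "http://a.example.gov/news": b"press item",  # added since baseline
}
to_collect = new_or_changed(baseline, current)
```

Only the changed home page and the newly added news page end up in `to_collect`; the unchanged page is left out.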

Slide 42

Near-Future Work
- Centralizing web data into a single collection at the Internet Archive
- Providing Wayback access to content
- Providing search access to content
- Distributing collection among partners (25-35 TB expected)
- Investigation of browse by Agency/Branch

Slide 43

Other Future Work
- Extracting topical collections from crawl data
- Providing programmatic access for data mining
- Research in calculating "size" of collection in relation to real-world measures:
  - Number of pages of text collected
  - Number of 8x10 in. equivalent images collected
  - Hours of audio
  - Hours of video
  - Number of PDFs
  - Physical library space requirements to hold collection if in physical format

Slide 44

