Electronic Data Service

A. Purpose and Program Description

Access to machine-readable numeric data files is provided at Columbia by the Electronic Data Service (EDS), a joint project of the Libraries and Columbia University Information Technology (CUIT). EDS was created to support research and instruction at Columbia that require numeric data available only in machine-readable form.

The program focuses on delivering numeric data resources to users in the Columbia community. Individual data resources, called data studies, can have many components and are formatted and delivered using a variety of techniques. EDS services include building the data collection (largely done by the Libraries), providing the technical support necessary to store and deliver the collection (largely done by CUIT), and assisting users with identifying and accessing studies (by both CUIT and Libraries). In support of this effort, we:

  • participate in membership organizations that archive and/or distribute data or that support members by serving as forums for awareness of the trends and availability of data resources (organizations such as Inter-University Consortium for Political and Social Research (ICPSR), the National Center for Health Statistics (NCHS) Data Dissemination Program, Roper Center for Public Opinion Research, the Association of Public Data Users (APDU), and the New York State Data Center (SDC));
  • support the use of data products received as part of the Federal Depository Library Program (FDLP);
  • monitor the commercial and non-commercial markets for data products and select key data resources from these providers when they are not available through our membership organizations but are suited to our collection;
  • participate with other interested groups within Columbia in an effort to promote awareness and use of GIS as an analytic tool;
  • maintain a searchable catalog (DataGate) for those data resources we have in our collection;
  • assist users in identifying data resources, whether they are in our collection or available elsewhere, and in manipulating that data into a format they will use in their work;
  • support users in understanding the relationship among the parts of a data study: documentation (meta-data), raw data, and software applications needed to read the data;
  • maintain a PC network with the software applications and space needed to deliver data products, both those that are received over the Internet and those that need a PC device.

B. General Selection Guidelines (see classed analysis for detailed statement)

The EDS collection is at the study and teaching support level. There are research level materials in the collection, but we are limited in our ability to fully support a research level collection for a variety of reasons. Two factors stem from the characteristics of some research data: they can be in formats we do not support, or they may come with restrictions that preclude anyone except the end user from negotiating acquisition of, or access to, the data. In addition, when an expensive data study is very narrow in focus, the cost can make a purchase difficult to justify. In such cases cost sharing with other library funds or with the user may occur.

Most of the titles in the collection are received automatically through our membership arrangements. The focus is on social science and health-related topics. Purchases of individual titles are made to fill a request by a user, for new products that complement an area of active research, and for products that switch from being available in an archive to commercial distribution.

Our spatial data is based primarily on products from the U.S. Census Bureau, ESRI, New York City Planning Department, and the New York State GIS Data Sharing Cooperative. It is also supported by the relationship we maintain with the Urban Planning Department and Columbia University's Center for International Earth Science Information Network (CIESIN). The collection will grow with interest in and demand for GIS.

C. Specific Delimitations

  1. Formats and Access
    Titles are acquired only when we are permitted to make the data available to any user at Columbia and when we can store and deliver the product over the Internet, on the CUIT CUNIX platform, or on the PC network in EDS. DataGate serves as the catalog that brings together all the EDS titles.
        DataGate was designed to handle the unique demands of providing access, both intellectual and direct, to data studies. A "data study" is made up of a file of numeric data together with supporting meta-data that explain the format and makeup of the data. A data study may also include program code for using the data within a standard statistical application, or it may come with a custom software interface to the data and its meta-data. The "access" DataGate provides to data studies covers both bibliographic descriptions for all the components of a study and, based on the study's format, either direct online access to these components or information on how to obtain the study. Below is a list describing some of the common types of entries in DataGate.
    • Studies available for FTP
      These studies can be obtained using the standard Internet FTP download feature. The data may be stored either on the CUIT CUNIX platform or at a remote archive maintained by the provider. The DataGate entry points to the study's location, and from that location users can save the study to their own computer. Access can be controlled either by IP authentication (in which case users can download the study themselves) or by password (in which case EDS staff move the data to CUNIX). Providers sometimes include a file in which the data has been formatted to open automatically within a standard statistical software package; when they have not, EDS can often enhance our holdings by creating one.
          Given the trend toward data providers permitting individuals to access their archives based on IP authentication, and the ease of transferring files across the Internet using an Internet browser, EDS has adopted the policy of storing studies on our own Unix server only when:
      • the provider does not permit individual users to download and EDS must do it on their behalf;
      • EDS creates enhanced versions of the data geared to the demands of our users;
      • there is high demand for a study;
      • a user requests that we help with the download.
    • Studies available using a web-based interface directly to the data
      Some studies, rather than allowing users to work with the full study, consist of a web-based interface that allows users to select and save only the portion of the data they need. The interface provides the meta-data users need to make their choices and may also have features that allow for online analysis of the data. The DataGate entry briefly describes the study and contains the URL (web address) for the site.
          Most studies of this type are services to which we subscribe, with access based on IP authentication. In such cases the study will appear on LibraryWeb as an electronic resource. In addition to LibraryWeb resources, EDS is selectively adding public Internet sites of this type to DataGate, especially when they represent continuations of on-going data studies that were once delivered by FTP.
    • Studies that are not delivered over the Internet
      Most studies in this group are ones we receive on a PC-based medium such as floppy disk, CD-ROM, or DVD, though some are delivered to us via email or FTP. They are described in DataGate, but the data is not stored on CUNIX and users are directed to come to EDS to retrieve the data from the PC network. The data may be kept off CUNIX for one or more of the following reasons:
      • the data product includes a custom PC-based software application;
      • the data has a license restriction that does not permit the type of wide distribution that DataGate provides;
      • the information contained in a study received on a PC-based medium is in low demand and the work of moving the data to CUNIX is not warranted (this applies to many of the titles we receive automatically through the FDLP, SDC, and NCHS programs).
  2. Other Formats
    EDS collects data that can be delivered using a technology supported by CUIT. This applies to the hardware, operating system, and application software needed. When it is possible to convert data from a format we do not support to one that we do support, we will consider doing so. For studies already in our collection, if it becomes impossible to deliver data because of changes in technology, we will look for alternative sources, attempt to migrate the data to a format we can support, or remove the study from our collection.
  3. Restricted Data
    Restricted data contain information deemed by the data producer to be confidential. Users typically must sign an agreement with the data provider covering issues such as who has access, how the data are to be used, and how they are to be kept secure. The only services EDS provides to users of restricted data are helping them to identify producers of such data and making them aware of the extra work involved in acquiring the data. EDS neither negotiates the acquisition of such data nor adds it to the collection.
  4. Unique Data
    EDS does not accept data offered for the purpose of archiving. Researchers wishing to archive data they produce are directed to ICPSR, which has the infrastructure to support archiving activity.
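
As an illustration only, the rule given under Formats and Access for when an FTP study is stored on the EDS Unix server can be written as a short check. The sketch below is in Python. The class name, field names, and function name are hypothetical, chosen for this example, and are not part of DataGate or any other EDS system.

```python
from dataclasses import dataclass


@dataclass
class FtpStudy:
    """Hypothetical record for a study obtainable by FTP (illustrative only)."""
    title: str
    provider_allows_user_download: bool = True  # e.g. IP-authenticated access
    eds_enhanced_version: bool = False          # EDS built an enhanced version
    high_demand: bool = False                   # study is in high demand
    user_requested_help: bool = False           # a user asked EDS to download it


def store_on_eds_server(study: FtpStudy) -> bool:
    """Return True if any one of the four stated policy conditions holds."""
    return (
        not study.provider_allows_user_download
        or study.eds_enhanced_version
        or study.high_demand
        or study.user_requested_help
    )
```

Under these assumptions, a study that users can download themselves, and that meets none of the other three conditions, would be left at the provider's archive, with DataGate pointing to that location.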

D. Other Considerations

The primary bibliographic access to the EDS collection is DataGate. The only holdings that also appear in the CLIO OPAC are FDLP products, products purchased individually and delivered on a physical medium like CD-ROM, and LibraryWeb electronic resources that are delivered by IP access supported by the Library Systems Office. Together these represent a small but important selection of our holdings. Methods for providing bibliographic access to spatial data are being developed. Currently some of these files appear in CLIO and/or DataGate, while others appear in lists of resources available on the EDS web site. Work is underway to develop an efficient method of creating and storing the metadata that describes them; that metadata will be the basis for a spatial catalog.