Finding Data (for Journalists)


Intro

How to Use this Guide

This guide will help you find data & statistics. Start with the steps for "finding data" below, then navigate through the guide using the tabs above.

Finding Data

Before you begin to search for your data, work through these questions and note your answers. They will help you in your search. There are tabbed sections at the top of this guide that correspond to each data question below.
 
  • Do you already know this data exists?
    • For instance, if you found it referenced in a newspaper or other article, note the citation, or any related information like who collected the data and when. You may be able to use this to find the original data source--even if it takes a little detective work with your friendly librarian!
  • Do you need raw data or statistics?
  • What geography level do you need? (country, state, city, neighborhood, etc.)
  • Are there time constraints (a range of years, monthly, quarterly, annually)?
  • What is the unit of analysis?
    • Are you comparing individuals or groups?
    • Do you need microdata or macrodata?
  • What is the topic or subject?
    • You can search by topic in the Data Sources guide.
    • Think about who might collect this data, and for what purpose (collectors of data include: government agencies, nonprofits, NGOs, businesses, and academic researchers).
  • Do you need demographic data (characteristics that define a population, such as: gender, age, ethnicity, language, housing, employment)?
    • If so, then you may find that most of the data you need is available from the US Census Bureau and/or the NYC Department of City Planning (DCP).
  • Do you know what kind of analysis you want to perform on your data?
  • Are you interested in creating maps (using spatial data)?
  • What software will you use? Where can you get training and help? Can you download this software and/or use it on campus?

Data vs Statistics

Data Versus Statistics

Data is the raw "stuff" that is analyzed to create statistics. They both come in several forms: numeric (or quantitative), spatial, and qualitative (includes interviews, opinion data). 

DATA:

  • require some analysis to be used in a story
  • let you customize your own analyses, for instance by comparing two different data sets/types
  • often consist (for quantitative data) of large sets of variables (for instance: gender, income, ethnicity)

[Images: raw data; cleaned data]

 

STATISTICS:

  • are often easier and quicker to use
  • provide quick facts, pre-crunched numbers, pre-interpreted analyses
  • are often in the form of tables or graphs

[Images: statistical table; single stat]


There is some semantic overlap between the two terms, so don't be surprised to hear the same figure referred to as data or a statistic. Plus, one person's statistic (interpreted data) may in turn be someone else's data, to be re-interpreted and used for a different purpose! Don't worry too much about the terminology--the important point is:

If you need a quick figure or single fact, check existing statistics first.

 

Popular Statistical Sources

The following are some of the best resources to check for statistics. (Others appear in the more comprehensive Data Sources guide.)

International Stats

United Nations' Demographic Yearbook
Provides international statistics on population size and composition, births, deaths, marriage and divorce (annual).

United Nations: General Statistics
The UN's website has a menu on the left providing access to economic, environmental, and other statistics.

Eurostat
The statistical office of the European Union. Its publications include the Eurostat Yearbook, the Eurostat Regional Yearbook, Statistics in Focus, and others on a variety of topics.

United States Stats

Statistical Abstract of the US
A compilation of statistical tables from various government agencies. It shows which agency produces the info you want. 

-- Current edition (CU-restricted access through ProQuest)

-- Historic editions: 1878 to 2012 (publicly available)

ProQuest Statistical Insight (CU-restricted access)
Database compiling statistics from a variety of sources and topical areas. 

NY State Stats

New York State Statistical Yearbook
Similar resource to the US Statistical Abstract (above), but for the state of New York.

NYC (city & local) Stats

The Almanac of NYC (print only)
Basic statistics and information for New York City. Print only; located in Journalism Library Reference: F128.3 .A455 2008

Geography

Geography of Data

Determining the geography of your data need is a key part of your data search.

Geography deals not only with the specific location of your data (e.g., Astoria, Queens, NY) but also with the scope of your data--the geographic level at which you wish the data to be collected. Some of the typical levels at which you might want data:

  • International: multiple countries
  • National: single country (e.g., US)
  • State
  • Local (county/borough, city, town)
  • Smaller localized levels (census tracts, community districts, neighborhoods, etc.) 
 

NYC Neighborhoods

New York neighborhoods are a special case, because there are no formally-defined boundaries for many of these neighborhoods. You will have to make your own decision about what geography to use to best describe a neighborhood. 

Some of your options (in order of descending size) are:

Boroughs (Counties)

  • Manhattan (New York County)
  • Brooklyn (Kings County)
  • Queens (Queens County)
  • Bronx (Bronx County)
  • Staten Island (Richmond County)

Data available:

  • 2010 census
  • ACS 1, 3, 5-year

PUMAs (Public Use Microdata Areas)

PUMAs are areas of about 100,000 residents, may be roughly analogous to NYC Community Districts, and often contain multiple neighborhoods. Because each PUMA covers a large population, its estimates tend to have smaller margins of error than those for smaller areas.

Data available:

  • ACS 1, 3, 5-year

ZCTAs (Zip Code Tabulation Areas)

ZCTAs are roughly analogous (but not equal) to USPS-defined Zip Codes, and as such do not correspond well with NYC neighborhoods. They are smaller than PUMAs.

Data available:

  • 2010 census
  • ACS 5-year

NTAs (Neighborhood Tabulation Areas)

The Dept. of City Planning (DCP) created these areas to roughly correspond with NYC neighborhoods. Data in this format is only available from the DCP, not through traditional census sources.

Data available:

  • 2010 census
  • ACS 5-year

Census Tracts

Census tracts are smaller than neighborhoods, so typically you combine several to create a neighborhood. Be aware that estimates for these areas have large margins of error (since each area is so small).

Data available:

  • 2010 census
  • ACS 5-year

Many thanks to Frank Donnelly at Baruch College for his excellent work that inspired this table. See his NYC Census Data handout for more info.

As you search different data sources, be aware that each source may use a different geography to define a given NYC area. (For instance, the NYC Health Dept. uses "Health Districts.")
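
When you combine several census tracts to approximate a neighborhood, their margins of error combine as well. The sketch below uses the root-sum-of-squares approximation that the Census Bureau describes for sums of ACS estimates. The tract labels and all figures are invented for illustration and are not real data.

```python
import math

# Hypothetical ACS 5-year estimates for three tracts (not real figures):
# (tract label, population estimate, margin of error)
tracts = [
    ("Tract A", 3200, 310),
    ("Tract B", 4100, 405),
    ("Tract C", 2750, 290),
]


def combine_tracts(rows):
    """Sum the estimates; approximate the combined MOE as sqrt(sum of squared MOEs)."""
    total = sum(estimate for _, estimate, _ in rows)
    moe = math.sqrt(sum(margin ** 2 for _, _, margin in rows))
    return total, moe


total, moe = combine_tracts(tracts)
print(f"Neighborhood estimate: {total} +/- {moe:.0f}")
print(f"MOE as a share of the estimate: {moe / total:.1%}")
```

If the margin of error is a large share of the estimate, consider a larger geography (such as a PUMA) or ask a librarian for advice.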

 

Topic-Specific Geographies

Occasionally, you will run into geographies that are unique to a specific topic. Some of the most common topic-specific geographies are:

  • U.S. Congressional Districts 
  • Health Districts 
  • School Districts 
Some geographies can be easily mapped to other established geographies (like Census PUMAs); others cannot. If in doubt, contact a librarian!

Time

Define Your "Time" Data Needs

Consider the temporal parameters of the data you wish to find. Some things to consider:

  • What time frequency should your data be measured in (weekly, monthly, annually)?
  • If annual, do you need data for a single year (2012), or for a range of years (2002-2010)?
  • Are you interested in tracking changes over time? For instance, the Decennial Census tracks variables every ten years (2010, 2000, 1990, etc. back to 1790). 

You may not always find data that matches these temporal parameters. For more on this, see the section below on problems.
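
Tracking change over time usually comes down to a percent-change calculation between two points. Here is a minimal sketch in Python; the population counts are made up for illustration, and `percent_change` is a helper name chosen for this example.

```python
def percent_change(old, new):
    """Return the percent change from old to new."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (new - old) / old * 100


# Hypothetical neighborhood population counts by census year (not real data).
population = {1990: 48000, 2000: 51000, 2010: 49500}

years = sorted(population)
for earlier, later in zip(years, years[1:]):
    change = percent_change(population[earlier], population[later])
    print(f"{earlier} to {later}: {change:+.1f}%")
```

Always divide by the earlier value, and state both years when you report the result.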

 

Common Problems & Solutions

Having problems finding the right time period for your data? Here are the two most common issues:

  • The data you want exists, but it isn't measured (or recorded) at the frequency that you need. In this case, you may have to adjust your expectations. For instance, instead of looking at crime statistics for your neighborhood on a monthly basis, perhaps you will have to use annual numbers.
  • The data doesn't appear current. It takes time to gather (and format) data! Frequently, the most current data may be 1, 2, or 3 years old, and there's nothing you can do but use the most current year available. 
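
The frequency problem only runs in one direction: data recorded at a finer frequency can be rolled up to a coarser one, but annual figures cannot be split back into months. The sketch below rolls hypothetical monthly counts up to annual totals; the numbers are invented, and only three months per year are shown to keep it short.

```python
from collections import defaultdict

# Hypothetical monthly incident counts keyed by (year, month); not real data.
monthly_counts = {
    (2011, 1): 14, (2011, 2): 9, (2011, 3): 12,
    (2012, 1): 11, (2012, 2): 13, (2012, 3): 7,
}


def to_annual(monthly):
    """Add up monthly counts into one total per year."""
    annual = defaultdict(int)
    for (year, _month), count in monthly.items():
        annual[year] += count
    return dict(annual)


annual_counts = to_annual(monthly_counts)
for year in sorted(annual_counts):
    print(year, annual_counts[year])
```

Before comparing years, check that each year has the same months present; a year with missing months will look artificially low.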

Units

Unit of Analysis

The unit of analysis is the entity that you're analyzing. It's called this because it's your analysis (what you want to examine) that determines what this unit is, rather than the data itself.

For instance, let's say that you have a dataset with 40 students, divided between two classrooms of 20 students each, and a test score for each student. You can analyze this data in several ways:

Individual unit of analysis:

Compare the test scores of each student to the other students.
(You're analyzing students, individuals.)

Group unit of analysis:

Compare the average test scores of the two classrooms.
(You're analyzing the classrooms, comparing two groups of individuals.)
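
The two approaches can be sketched with the same dataset. This example invents 40 random test scores (they are not real data) and analyzes them first by individual, then by group; the field names are assumptions made for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the made-up scores are reproducible

# Hypothetical dataset: 40 students, 20 per classroom, one test score each.
students = [
    {"student_id": i, "classroom": "A" if i < 20 else "B", "score": random.randint(50, 100)}
    for i in range(40)
]

# Individual unit of analysis: compare students to one another.
ranked = sorted(students, key=lambda s: s["score"], reverse=True)
top_student = ranked[0]
print("Top score:", top_student["score"], "by student", top_student["student_id"])

# Group unit of analysis: compare the two classroom averages.
classroom_avg = {
    room: mean(s["score"] for s in students if s["classroom"] == room)
    for room in ("A", "B")
}
print("Classroom averages:", classroom_avg)
```

The data is identical in both cases; only the question being asked changes.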

Knowing your unit of analysis is helpful, because it helps you determine what kind of data you need. The other piece of this puzzle is whether you need macrodata (aggregated data) or microdata.

 

Microdata & Macrodata

So what is the difference between macrodata (aggregated data) and microdata? The Census Bureau has a good explanation of its microdata and macrodata, summarized below:

MICRODATA

Contains a record for every individual (e.g., a person or a company) in the survey/study.

Source for US Census microdata: IPUMS

MACRODATA (Aggregated Data)

Higher-level data compiled from smaller (individual) units of data. For example, Census data in American FactFinder has been aggregated to preserve the confidentiality of individual respondents.

Source for US Census macrodata: American FactFinder
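
To make the distinction concrete, here is a small sketch that turns microdata into macrodata. The records are invented for illustration and do not come from IPUMS or any real survey.

```python
from collections import Counter

# Hypothetical microdata: one record per person (not real survey responses).
microdata = [
    {"person_id": 1, "borough": "Queens", "age": 34},
    {"person_id": 2, "borough": "Queens", "age": 61},
    {"person_id": 3, "borough": "Bronx", "age": 27},
    {"person_id": 4, "borough": "Brooklyn", "age": 45},
    {"person_id": 5, "borough": "Brooklyn", "age": 19},
    {"person_id": 6, "borough": "Brooklyn", "age": 52},
]

# Macrodata: one count per borough. Individual records are no longer visible.
macrodata = Counter(record["borough"] for record in microdata)

for borough, count in sorted(macrodata.items()):
    print(borough, count)
```

With microdata you can build whatever groupings you like (by age band, for instance); with macrodata you are limited to the groupings the publisher chose.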

Topic

Data Sources Covering Multiple Topics

Two good places to start looking for data are NYC Open Data (New York City data) and Data.gov (U.S. data). Each website organizes data by topic and also offers search capabilities. Neither website is comprehensive--not all data can be found there--but it's always good to look.

 

Data Sources by Topic

You can find many data sources, organized by topic, in this library guide:

 

Multiple Data Sources--On a Map!

An alternate way to view NYC data is on the NYCity Map (see image below). Map a neighborhood or intersection, then find nearby subway stops, schools, hospitals, police precincts, and more (user guide here). Also shows municipal boundaries like community districts & census tracts. 

[Image: NYCity Map snapshot]

Demographics

Census Surveys

The major source for demographic data in the United States is the U.S. Census Bureau. This agency produces several surveys; the two major ones are:

Decennial Census: What is generally called "the census" (or sometimes "Census of Population & Housing"). Survey of the entire U.S. population, performed once every 10 years (2010, 2000, 1990, etc. back to 1790). 

ACS (American Community Survey): A more detailed survey, performed more frequently than the Decennial Census (different areas on a rolling basis). If the decennial census data is too outdated or not detailed enough, you may use the ACS instead. See: Subjects included in the ACS

Other surveys produced by the Census Bureau include:

 

Accessing Census Data

Once you know which survey data you need, you must choose a source from which to get that data.

American FactFinder (Census Bureau)
From the U.S. Census Bureau, American FactFinder is a tool that allows you to build custom searches using existing data tables from the Census Bureau’s many surveys, including the Decennial Census, American Community Survey (ACS), the Economic Census, and more.

NYC Department of City Planning (DCP) 
The NYC DCP has great information on population change and demographic information on New Yorkers, including the foreign-born population of New York. (Data drawn from Census Bureau.)

Infoshare Online 
How many people live in your beat neighborhood? How many are on public assistance? How many have asthma? What are the crime statistics? Immigration trends? Answer these questions and more with this collection of statistics about NYC. The data can be retrieved by community district number or neighborhood name, covers demographic, health, and socio-economic variables, and in many cases is available for multiple years, which lets you chart trends.  Select Area Profile to begin your data retrieval.

Analysis Methods

Basic Statistical Analysis (Descriptives)

We'll start by considering quantitative methods--that is, methods that use data that is quantifiable or measurable. Frequently this means numeric data, though your datasets may contain text such as individual or group names, gender designations, etc.

Descriptive statistics simply describe what the data itself shows. These methods are common in journalism, particularly because the general public can usually understand them.

Some of the most commonly-used descriptives are the measures of central tendency, which describe the center of a set of data. There are three common measures of central tendency:

  • Mean (or "average"): this is the most commonly-used measure--you're probably already familiar with it! To calculate a mean, simply add all the values and divide by the total number of values.
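  • Median: The median is the middle value when the values are sorted from lowest to highest. For example, in a sorted list of 21 values, value #11 is the median. With an even number of values, such as 20, the median is the average of the two middle values (#10 and #11).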
  • Mode: The mode is the most frequently-occurring value in a set. 
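
The three measures can give quite different answers for the same data, which matters when one extreme value is present. Here is a minimal sketch using Python's built-in `statistics` module; the income figures are made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical household incomes in thousands of dollars (not real data).
# Note the single very high value at the end.
incomes = [32, 45, 45, 51, 58, 62, 250]

print("Mean:  ", round(mean(incomes), 1))  # pulled upward by the 250
print("Median:", median(incomes))          # middle of the 7 sorted values
print("Mode:  ", mode(incomes))            # most frequent value

# With an even number of values, the median averages the two middle ones.
print("Median of four values:", median([32, 45, 51, 58]))
```

When a few extreme values skew the mean, the median is often the fairer figure to report as "typical."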

You can also look at the distribution of the data. Frequencies (counts) and percentages are common descriptives for this, and frequency distributions are commonly displayed using a histogram or bar chart.
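
A frequency distribution is simple to build by hand or in code. This sketch counts hypothetical survey answers (invented for illustration), converts the counts to percentages, and prints a crude text bar chart.

```python
from collections import Counter

# Hypothetical survey responses about commuting (not real data).
responses = ["subway", "bus", "subway", "walk", "subway", "bike", "bus", "walk"]

counts = Counter(responses)
total = len(responses)

for answer, count in counts.most_common():
    pct = count / total * 100
    bar = "#" * count  # one mark per response, as a rough bar chart
    print(f"{answer:<7}{count:>3}{pct:>7.1f}%  {bar}")
```

Report the total number of responses alongside any percentage so readers can judge how much weight it deserves.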

See the Research Methods Knowledgebase for more on these concepts, as well as standard deviations and correlations.  

 

Other Methods of Statistical Analysis

Inferential statistics interpret data and draw conclusions beyond what is immediately apparent in the data itself. Inferential methods are common in academic research. If you plan to use these types of methods, UCLA has a great chart showing how to choose the appropriate inferential statistical analysis.

Qualitative methods consider data that can't be quantified, but only described. Data analyzed using these methods often take the form of interviews, video or audio recordings, photographs, or other observational data. See the Research Methods Knowledgebase for more on qualitative methods.

Spatial analysis (geographic analysis & maps) may involve qualitative and/or quantitative data analysis. See the next tab in this guide for more.

 

Spatial Data

What is Spatial Data?

Spatial (or geospatial) data is suitable for use in GIS mapping software (like ArcGIS or Google Earth) and can come in a variety of formats: vector, raster, imagery, or data tables.
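
As a rough illustration of two of these formats, the sketch below writes the same location first as a vector feature and then as a row in a data table. It assumes GeoJSON as the vector format (one common open format; this guide does not name a specific one), and the place name and coordinates are placeholders rather than a real site.

```python
import json

# A single vector feature (a point) in GeoJSON.
# Coordinates are illustrative only; GeoJSON order is [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-73.96, 40.81]},
    "properties": {"name": "Example location"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)

# The same location as a row in a plain data table with coordinate columns.
table_row = {"name": "Example location", "longitude": -73.96, "latitude": 40.81}
print(table_row)
```

Most GIS programs can read either form, though a data table usually needs you to tell the software which columns hold the coordinates.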

 

Spatial Data Resources, Tutorials

Need help using ArcGIS or another program?

  • There are a variety of GIS resources on this page, including online ESRI courses (limited to CU-affiliated students, faculty, staff).
  • On-Demand Workshops: If you have a group of six or more people and are interested in learning ArcGIS or another GIS program, please email a request to Data Services at dssc.data@columbia.edu.
  • Get help with various spatial programs through Lynda.com.
  • CU Library books on using GIS: RSS feed from CLIO

 

Where to Find Spatial Data

Find spatial data using Columbia University Libraries' GeoData Portal (beta)!

 

Looking for NYC spatial data?

NY City Map (from DoITT): Map a neighborhood or intersection, then find nearby subway stops, schools, hospitals, police precincts, and more (user guide here). Also shows municipal boundaries like community districts & census tracts. Other city data maps include:

Software

Journalism Library Data Workstation

The J-Library has a Data Workstation with a large monitor, pre-loaded with a variety of programs, including:

  • Statistical: R, Stata, SPSS, SAS, NVivo
  • Spatial: ArcGIS, Google Earth
  • Adobe Suite (Photoshop, Illustrator, InDesign, & more)  

You're welcome to come use the Data Workstation anytime the J-Library is open (hours)!

Statistical Software Resources

General Statistical Help

 

R Software Help

 

 

Need More Help?

Struggling with a particular software program or statistical method?