Finding Data (for Journalists)
How to Use this Guide
This guide will help you find data & statistics. Start with the steps for "finding data" below, then navigate through the guide using the tabs above.
Need more help?
- Do you already know this data exists?
- For instance, if you found it referenced in a newspaper or other article, note the citation, or any related information like who collected the data and when. You may be able to use this to find the original data source--even if it takes a little detective work with your friendly librarian!
- Do you need raw data or statistics?
- What geography level do you need? (country, state, city, neighborhood, etc.)
- Are there time constraints (a range of years, monthly, quarterly, annually)?
- What is the unit of analysis?
- Are you comparing individuals or groups?
- Do you need microdata or macrodata?
- What is the topic or subject?
- You can search by topic in the Data Sources guide.
- Think about who might collect this data, and for what purpose (collectors of data include: government agencies, nonprofits, NGOs, businesses, and academic researchers).
- Do you need demographic data (characteristics that define a population, such as: gender, age, ethnicity, language, housing, employment)?
- If so, then you may find that most of the data you need is available from the US Census Bureau and/or the NYC Department of Planning (DCP).
- Do you know what kind of analysis you want to perform on your data?
- Are you interested in creating maps (using spatial data)?
- What software will you use? Where can you get training and help? Can you download this software and/or use it on campus?
Data vs Statistics
Data is the raw "stuff" that is analyzed to create statistics. They both come in several forms: numeric (or quantitative), spatial, and qualitative (includes interviews, opinion data).
There is some semantic overlap between the two terms, so don't be surprised to hear the same figure referred to as data or a statistic. Plus, one person's statistic (interpreted data) may in turn be someone else's data, to be re-interpreted and used for a different purpose! Don't worry too much about the terminology--the important point is:
If you need a quick figure or single fact, check existing statistics first.
Popular Statistical Sources
The following are some of the best resources to check for statistics. (Others appear in the more comprehensive Data Sources guide.)
United Nations' Demographic Yearbook
Provides international statistics on population size and composition, births, deaths, marriage and divorce (annual).
United Nations: General Statistics
The UN's website has a menu on the left providing access to economic, environmental, and other statistics.
United States Stats
Statistical Abstract of the US
A compilation of statistical tables from various government agencies. It shows which agency produces the info you want.
-- Current edition (CU-restricted access through ProQuest)
-- Historic editions: 1878 to 2012 (publicly available)
ProQuest Statistical Insight (CU-restricted access)
Database compiling statistics from a variety of sources and topical areas.
NY State Stats
New York State Statistical Yearbook
Similar resource to the US Statistical Abstract (above), but for the state of New York.
NYC (city & local) Stats
Geography of Data
Determining the geography of your data need is a key part of your data search.
Geography deals not only with the specific location of your data (Astoria, Queens, NY) but also with the scope of your data--the geographic level at which you want the data to have been collected. Some of the typical levels at which you might want data:
- International: multiple countries
- National: single country (e.g., US)
- Local (county/borough, city, town)
- Smaller localized levels (census tracts, community districts, neighborhoods, etc.)
New York neighborhoods are a special case, because there are no formally-defined boundaries for many of these neighborhoods. You will have to make your own decision about what geography to use to best describe a neighborhood.
Some of your options (in order of descending size) are:
PUMAs (Public Use Microdata Areas)
PUMAs are areas of about 100,000 residents, are roughly analogous to NYC Community Districts, and often contain multiple neighborhoods. Because each area covers so many residents, estimates at this level tend to have a relatively small margin of error.
ZCTAs (Zip Code Tabulation Areas)
ZCTAs are roughly analogous to (but not identical to) USPS-defined ZIP Codes, and as such do not correspond well with NYC neighborhoods. They are smaller than PUMAs.
NTAs (Neighborhood Tabulation Areas)
The Dept. of Planning (DCP) created these areas to roughly correspond with NYC neighborhoods. Data in this format is only available from the DCP, not through traditional census sources.
Census Tracts
Census tracts are smaller than neighborhoods, so you typically combine several to approximate a neighborhood. Be aware that estimates for these areas have a large margin of error (since the measurement area is so small).
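Combining several census tracts into one neighborhood can be sketched in a few lines of Python. This is only an illustration: the tract names and numbers below are invented, and the root-sum-of-squares rule used for the combined margin of error is a common approximation for sample-based estimates such as the ACS. Check your data source's documentation before relying on it.

```python
import math

# Hypothetical tract-level estimates: (population estimate, margin of error).
# The tract names and numbers are invented for illustration.
tracts = {
    "tract_A": (3200, 310),
    "tract_B": (4100, 405),
    "tract_C": (2750, 290),
}

def combine_tracts(estimates):
    """Sum tract estimates and approximate the margin of error of the sum."""
    total = sum(est for est, _ in estimates.values())
    # Root-sum-of-squares approximation for the MOE of a sum of estimates.
    moe = math.sqrt(sum(m ** 2 for _, m in estimates.values()))
    return total, moe

neighborhood_total, neighborhood_moe = combine_tracts(tracts)
print(f"Estimated population: {neighborhood_total} +/- {neighborhood_moe:.0f}")
```

The combined margin of error is larger than any single tract's in absolute terms but smaller relative to the total, which is one reason to report a neighborhood figure rather than individual tract figures.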
As you search different data sources, be aware that each source may use a different geography to define a given NYC area. (For instance, the NYC Health Dept. uses "Health Districts.")
Occasionally, you will run into geographies that are unique to a specific topic. Some of the most common topic-specific geographies are:
- U.S. Congressional Districts
- Health Districts
- School Districts
Define Your "Time" Data Needs
Consider the temporal parameters of the data you wish to find. Some things to consider:
- What time frequency should your data be measured in (weekly, monthly, annually)?
- If annual, do you need data for a single year (2012), or for a range of years (2002-2010)?
- Are you interested in tracking changes over time? For instance, the Decennial Census tracks variables every ten years (2010, 2000, 1990, etc. back to 1790).
Whether or not you can find the data you need formatted according to these specific temporal parameters may vary. For more on this, see the section below on problems.
Common Problems & Solutions
Having problems finding the right time period for your data? Here are the two most common issues:
- The data you want exists, but it isn't measured (or recorded) at the frequency that you need. In this case, you may have to adjust your expectations. For instance, instead of looking at crime statistics for your neighborhood on a monthly basis, perhaps you will have to use annual numbers.
- The data doesn't appear current. It takes time to gather (and format) data! Frequently, the most current data may be 1, 2, or 3 years old, and there's nothing you can do but use the most current year available.
Unit of Analysis
The unit of analysis is the entity that you're analyzing. It's called this because it's your analysis (what you want to examine) that determines what this unit is, rather than the data itself.
For instance, let's say that you have a dataset with 40 students, divided between two classrooms of 20 students each, and a test score for each student. You can analyze this data in several ways:
Individual unit of analysis:
Compare the test scores of each student to the other students.
(You're analyzing students, individuals.)
Group unit of analysis:
Compare the average test score of the two classrooms.
(You're analyzing the classrooms, comparing two groups of individuals.)
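The classroom example above can be sketched in Python to show that the same dataset supports both units of analysis. The student IDs and scores here are hypothetical, generated by a simple formula so the example is reproducible.

```python
from statistics import mean

# Hypothetical data: 40 students, 20 per classroom, one test score each.
# Scores come from a simple formula so the example is reproducible.
students = [
    {
        "student": f"S{i:02d}",
        "classroom": "A" if i < 20 else "B",
        "score": 60 + (i * 7) % 40,
    }
    for i in range(40)
]

# Individual unit of analysis: compare students to one another.
ranked = sorted(students, key=lambda s: s["score"], reverse=True)
top_student = ranked[0]

# Group unit of analysis: compare the classrooms' average scores.
def classroom_average(rows, classroom):
    return mean(r["score"] for r in rows if r["classroom"] == classroom)

averages = {room: classroom_average(students, room) for room in ("A", "B")}

print("Highest-scoring student:", top_student["student"], top_student["score"])
print("Classroom averages:", averages)
```

Nothing about the data changes between the two analyses. Only the question being asked changes.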
Knowing your unit of analysis is helpful, because it helps you determine what kind of data you need. The other piece of this puzzle is whether you need macrodata (aggregated data) or microdata.
Microdata & Macrodata
So what is the difference between macrodata (aggregated data) and microdata? The Census Bureau has a good explanation of its microdata and macrodata; a summary follows:
MICRODATA
Contains a record for every individual (e.g., person or company) in the survey/study.
Source for US Census microdata: IPUMS
MACRODATA (Aggregated Data)
Higher-level data compiled from smaller (individual) units of data. For example, Census data in American FactFinder has been aggregated to preserve the confidentiality of individual respondents.
Source for US Census macrodata: American FactFinder
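To make the distinction concrete, here is a minimal Python sketch that turns a handful of microdata records into macrodata. The records, field names, and the helper function `aggregate_by` are all hypothetical and do not reflect the layout of any real Census file.

```python
from collections import Counter

# Hypothetical microdata: one record per survey respondent.
# Field names and values are invented for illustration.
microdata = [
    {"person_id": 1, "borough": "Queens", "employed": True},
    {"person_id": 2, "borough": "Queens", "employed": False},
    {"person_id": 3, "borough": "Bronx", "employed": True},
    {"person_id": 4, "borough": "Queens", "employed": True},
    {"person_id": 5, "borough": "Bronx", "employed": False},
]

def aggregate_by(records, key):
    """Collapse individual records into per-group totals (macrodata)."""
    totals = Counter()
    employed = Counter()
    for record in records:
        totals[record[key]] += 1
        if record["employed"]:
            employed[record[key]] += 1
    return {
        group: {"respondents": totals[group], "employed": employed[group]}
        for group in totals
    }

macrodata = aggregate_by(microdata, "borough")
print(macrodata)
```

Once the data is aggregated, the individual respondents can no longer be recovered from it. That is why published tables protect confidentiality, and why you need microdata if you want to define your own groupings.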
Data Sources Covering Multiple Topics
Data Sources by Topic
You can find many data sources, organized by topic, in this library guide:
Multiple Data Sources--On a Map!
The major source for demographic data in the United States is the U.S. Census Bureau. This agency produces several surveys; the two major ones are:
Decennial Census: What is generally called "the census" (or sometimes "Census of Population & Housing"). Survey of the entire U.S. population, performed once every 10 years (2010, 2000, 1990, etc. back to 1790).
ACS (American Community Survey): A more detailed survey, performed more frequently than the Decennial Census (different areas on a rolling basis). If the decennial census data is too outdated or not detailed enough, you may use the ACS instead. See: Subjects included in the ACS
Other surveys produced by the Census Bureau include:
Accessing Census Data
Once you know which survey data you need, you must choose a source from which to get that data.
American FactFinder (Census Bureau)
From the U.S. Census Bureau, American FactFinder is a tool that allows you to build custom searches using existing data tables from the Census Bureau’s many surveys, including the Decennial Census, American Community Survey (ACS), the Economic Census, and more.
NYC Department of City Planning (DCP)
The NYC DCP has great information on population change and demographic information on New Yorkers, including the foreign-born population of New York. (Data drawn from Census Bureau.)
- Population Page
- Basic Population Info (from ACS)
- Social Characteristics (Foreign-born, nativity of recent immigrants)
- Population Change (Immigration / Migration)
How many people live in your beat neighborhood? How many are on public assistance? How many have asthma? What are the crime statistics? Immigration trends? Answer these questions and more with this collection of statistics about NYC. The data can be retrieved by community district number or neighborhood name, covers demographic, health, and socio-economic variables, and in many cases is available for multiple years, which lets you chart trends. Select Area Profile to begin your data retrieval.
Basic Statistical Analysis (Descriptives)
We'll start by considering quantitative methods--that is, methods that use data that is quantifiable or measurable. Frequently this means numeric data, though your datasets may contain text such as individual or group names, gender designations, etc.
Descriptive statistics simply describe what the data itself shows. These methods are common in journalism, largely because they are easy for the general public to understand.
Some of the most commonly-used descriptives are the measures of central tendency, which describe the center of a set of data. There are three common measures of central tendency:
- Mean (or "average"): this is the most commonly-used measure--you're probably already familiar with it! To calculate a mean, simply add all the values and divide by the total number of values.
- Median: The median is the middle value once the values are sorted. For example, in a sorted list of 21 values, value #11 is the median. With an even number of values, such as 20, the median is the average of the two middle values (#10 and #11).
- Mode: The mode is the most frequently-occurring value in a set.
Frequencies (counts) and percentages are also common descriptives; they let you look at the distribution of the data. Frequency distributions are commonly displayed using a histogram or bar chart.
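As a rough sketch, the descriptives above can be computed with Python's standard library alone. The values below are hypothetical (think of them as weekly incident counts). Any of the statistical programs mentioned in this guide would give the same results.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical values, e.g. number of incidents reported per week.
values = [3, 5, 5, 6, 8, 9, 5, 7, 3, 9]

average = mean(values)      # sum of the values divided by how many there are
middle = median(values)     # middle of the sorted values (mean of the two middle values when the count is even)
most_common = mode(values)  # most frequently occurring value

# Frequency distribution: count and percentage for each distinct value.
counts = Counter(values)
frequencies = {
    value: {"count": n, "percent": 100 * n / len(values)}
    for value, n in sorted(counts.items())
}

print("Mean:", average, "Median:", middle, "Mode:", most_common)
for value, row in frequencies.items():
    print(value, row["count"], f"{row['percent']:.0f}%")
```

Reporting the mean and median side by side is a quick check for skew: when a few extreme values pull the mean away from the median, the median is usually the fairer "typical" figure.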
Other Methods of Statistical Analysis
Inferential statistics interpret data and make conclusions beyond what is immediately apparent in the data itself. Inferential methods are common in academic research. If you plan to use these types of methods, UCLA has a great chart showing how to choose the appropriate inferential statistical analysis.
Qualitative methods consider data that can't be quantified, but only described. Data analyzed using these methods often take the form of interviews, video or audio recordings, photographs, or other observational data. See the Research Methods Knowledgebase for more on qualitative methods.
Spatial analysis (geographic analysis & maps) may involve qualitative and/or quantitative data analysis. See the next tab in this guide for more.
What is Spatial Data?
Spatial (or geospatial) data is suitable for use in GIS mapping software (like ArcGIS or Google Earth) and can come in a variety of formats: vector, raster, imagery, or data tables.
Spatial Data Resources, Tutorials
Need help using ArcGIS or another program?
- There are a variety of GIS resources on this page, including online ESRI courses (limited to CU-affiliated students, faculty, staff).
- On-Demand Workshops: If you have a group of six or more people and are interested in learning ArcGIS or another GIS program, please email a request to Data Services at email@example.com.
- Get help with various spatial programs through Lynda.com.
- CU Library books on using GIS: RSS feed from CLIO
Where to Find Spatial Data
Find spatial data using Columbia University Libraries' GeoData Portal (beta)!
Looking for NYC spatial data?
NY City Map (from DoITT): Map a neighborhood or intersection, then find nearby subway stops, schools, hospitals, police precincts, and more (user guide here). Also shows municipal boundaries like community districts & census tracts. Other city data maps include:
Journalism Library Data Workstation
The J-Library has a Data Workstation with a large monitor, pre-loaded with a variety of programs, including:
- Statistical: R, Stata, SPSS, SAS, NVivo
- Spatial: ArcGIS, Google Earth
- Adobe Suite (Photoshop, Illustrator, InDesign, & more)
You're welcome to come use the Data Workstation anytime the J-Library is open (hours)!
Statistical Software Resources
General Statistical Help
- UCLA stats guide (covers various software programs)
- Lynda.com (online tutorials, free through CU Libraries!)
- Get help on-campus at Data Services (in Lehman Library)
- For help, tutorials, and more on statistical programs, see: Lehman Library DSSC Statistical Software
- Download NVivo for free (CU-affiliate restricted)