Frequently Asked Questions (FAQ)
What is SESTAT?
What is the difference between SESTAT and SDR?
What weight should I use?
Are Higher Ed data nationally representative?
Who is considered a scientist or engineer?
What types of data are available?
What are the differences between logical skip, blank, missing, and survey exclusion/confidentiality?
Are IPUMS Higher Ed variable names the same as the original public use variable names?
Is longitudinal data analysis possible with IPUMS Higher Ed?
Are there longitudinal weights I can use for SESTAT component surveys?
Is it possible to link NSCG survey respondents to the ACS?
Can I link IPUMS Higher Ed data with the original public use SESTAT data?
Where should a new user start?
How do I get access to IPUMS Higher Ed data?
What are microdata?
What are "weights"?
What does "universe" mean in the variable descriptions?
How do I obtain data?
What format are the data in?
How long does a data extract take?
How does "sample selection" work on the IPUMS Higher Ed web site?
What does "add to cart" mean?
Why can't I open the data file?
Is there a preferred statistical package for using the IPUMS?
Can I get the original data?
General information about the project
What is SESTAT?
The Scientists and Engineers Statistical Data System (SESTAT) is a database that provides demographic, educational, employment, and earnings data about scientists and engineers in the US. Scientists and engineers in this database are defined as individuals who have earned a post-secondary degree in science, engineering, or, in later surveys, health sciences. Individuals who work in an occupation related to science or engineering but do not hold a science or engineering post-secondary degree are also included in the SESTAT file.
The database is composed of individuals who meet these criteria from three surveys conducted by the National Science Foundation: the National Survey of College Graduates, the Survey of Doctorate Recipients, and the National Survey of Recent College Graduates. The three surveys are fielded at the same time, use the same reference period for the questions they have in common, and have a similar target population (non-institutionalized individuals under the age of 76 with a bachelor's degree or higher).
Each SESTAT component survey was designed to be a cross-sectional representation of the target population for that year. However, some respondents in the SESTAT are interviewed in multiple surveys over time, allowing for limited longitudinal research. For more information on longitudinal data linking, visit the Longitudinal Data page.
The SESTAT files are composed of microdata. Each record is a person, with all characteristics numerically coded. Unlike other IPUMS data projects, SESTAT records are not organized by households. Because the data are individual records rather than tables, researchers must use a statistical package to analyze the millions of records in the database. A data extraction system enables users to select only the samples and variables they require.
What is the difference between SESTAT and SDR?
The Survey of Doctorate Recipients (SDR) is one of three sub-component surveys of the Scientists and Engineers Statistical Data System. The sampling frame for the SDR includes all doctorate recipients residing in the US who received a research doctorate degree from a US institution in a science or engineering field. All SDR respondents are included in the SESTAT version of the data. The original full SDR sample has some variables that were not included in the SESTAT version. Please see our User Note on survey designs for information about the surveys that comprise the SESTAT files.
What weight should I use?
Are Higher Ed data nationally representative?
The target population for the SESTAT surveys is non-institutionalized individuals under the age of 76 with a bachelor's degree or higher, or an occupation, in a science, engineering, or health field. The weights provided with the data cause the resulting estimates to be nationally representative of the target population in the United States in the survey year.
Who is considered a scientist or engineer?
In the SESTAT surveys' sampling frames, a scientist or engineer is an individual who holds a bachelor's degree or higher in a science, engineering, or health field, or in a science-, engineering-, or health-related field. More specifically, the following list describes the degree fields the National Science Foundation nests beneath science and engineering or science- and engineering-related fields.
Science and Engineering
Computer and math sciences
Biological, agricultural, and environmental life sciences
Physical sciences (Physics, chemistry, geosciences)
Social sciences (Psychology, economics, political science)
Science and Engineering related
Health (Medicine, audiology, nursing, physical therapy)
Science and math teacher education
Technology and technical fields (Engineering technology)
Other Science and Engineering related fields
Actuarial science, architectural or environmental design
Respondents who have a degree in a field other than a science, engineering, or health-related field might also be included in the SESTAT if they had an occupation in a science, engineering, or health sciences field.
What types of data are available?
The SESTAT surveys collect information related to demographics, labor force participation, educational history, family, and professional activities. For confidentiality, geographic information such as birthplace, location of academic institution, or current residence, is sometimes suppressed or only available at the region level.
What are the differences between logical skip, blank, missing, and survey exclusion/confidentiality?
In the original NSF data, there were three types of invalid response. There were also unexplained blanks, which IPUMS Higher Ed has assigned an invalid response value.
Logical Skip: Based on the answer a respondent gave to a survey question, they might have been directed by instructions on the survey to skip a question. In this case, a non-response was considered a logical skip. Another way to frame this is that the respondent did not fall within the universe of a survey question.
Missing: A missing code indicates that the respondent was in universe for a question but did not answer it.
Survey Exclusion/Confidentiality: In the original NSF files, if a survey question was asked in only one survey, the observations of respondents in the other two surveys will be coded with a "97". In some cases, responses such as the respondent's ethnic background or location may be suppressed to prevent identification of the individual.
Blank: In the original NSF files, there were unexplained blank values that we have chosen to code differently from the cases explicitly identified as missing.
In IPUMS Higher Ed, we have imposed the following consistent coding scheme for these values:
97: Survey Exclusion/Confidentiality
98: Logical Skip
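As an illustration of how an analyst might handle these codes, the following minimal Python sketch drops special-code values before any statistics are computed. The response values are hypothetical, and only the codes 97 and 98 come from the scheme listed above; individual variables may use other code values, so check each variable's codes page before applying a filter like this.

```python
# Codes taken from the scheme above. Codes for other invalid-response
# types are not listed here and must be looked up on the codes page.
SURVEY_EXCLUSION = 97
LOGICAL_SKIP = 98
INVALID_CODES = {SURVEY_EXCLUSION, LOGICAL_SKIP}


def valid_values(values, invalid_codes=INVALID_CODES):
    """Return only the values that are not special invalid-response codes."""
    return [v for v in values if v not in invalid_codes]


# Hypothetical responses to a categorical question coded 1-5.
responses = [1, 3, 98, 2, 97, 5, 98]
kept = valid_values(responses)
print(kept)
```

Filtering this way keeps out-of-universe and suppressed cases from being mistaken for substantive categories.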
Are IPUMS Higher Ed variable names the same as the original public use variable names?
When possible, the IPUMS Higher Ed variable names are identical to the original variable names. In some instances, variable names changed over time, and we typically keep the more recent variable name. Please see our User Note for variable name correspondence.
Is longitudinal data analysis possible with IPUMS Higher Ed?
Yes, because some respondents are surveyed multiple times across NSF surveys, there are some opportunities for longitudinal analysis. The variable PERSONID identifies individuals across survey years. (It replaces the original variable, REFID, which contains non-numeric values in some samples.) IPUMS Higher Ed does not provide linked longitudinal files; however, users can download the samples they desire and easily link them using a statistical program, merging by PERSONID. More detailed information about the patterns of follow-up surveys can be found in our User Note on longitudinal data.
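The merge by PERSONID can be sketched in plain Python as follows. This is a minimal illustration, not a recommended workflow: the two record lists, the variable name `EMPLOYED`, and the `_a`/`_b` suffixes are all hypothetical, and a statistical package's own merge command would normally be used instead.

```python
def link_by_personid(wave_a, wave_b):
    """Inner-join two lists of record dicts on PERSONID.

    Returns one combined dict per person present in both waves, with
    non-key fields suffixed "_a" and "_b" (hypothetical labels).
    """
    index_b = {rec["PERSONID"]: rec for rec in wave_b}
    linked = []
    for rec_a in wave_a:
        rec_b = index_b.get(rec_a["PERSONID"])
        if rec_b is None:
            continue  # person does not appear in the second sample
        row = {"PERSONID": rec_a["PERSONID"]}
        row.update({f"{k}_a": v for k, v in rec_a.items() if k != "PERSONID"})
        row.update({f"{k}_b": v for k, v in rec_b.items() if k != "PERSONID"})
        linked.append(row)
    return linked


# Hypothetical records from an earlier and a later sample.
wave_early = [{"PERSONID": 101, "EMPLOYED": 1}, {"PERSONID": 102, "EMPLOYED": 0}]
wave_late = [{"PERSONID": 102, "EMPLOYED": 1}, {"PERSONID": 103, "EMPLOYED": 1}]
panel = link_by_personid(wave_early, wave_late)
print(panel)
```

Only respondents who appear in both samples survive the join, which is why linked files are usually much smaller than either cross-section.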
Are there longitudinal weights I can use for SESTAT component surveys?
Longitudinal weights were created by the NSF, but they are not yet publicly available.
Is it possible to link NSCG survey respondents to the ACS?
Both the ACS and the NSCG survey data are protected under US Code, Title 13. Because of the disclosure avoidance techniques used by the Census Bureau, it is not possible to link ACS respondents to respondents of the NSCG in the public use data. Researchers who wish to create these linkages should contact the Federal Statistical Research Data Center (FSRDC) administration to apply to use the restricted version of the NSCG data.
Can I link IPUMS Higher Ed data with the original public use SESTAT data?
Yes, the original person identification number REFID is available for every IPUMS Higher Ed sample.
Where should a new user start?
The natural starting point is the "Select Data" or "Browse and Select Data" links on the left navigation bar and the top banner. These links open the variables page: the primary tool for exploring the contents of IPUMS Higher Ed. By default, the variables page displays one variable group at a time for all samples in the data series. You can change the view option to show all groups simultaneously, but the page can get very large and slow to load. However, you can filter the information at any point to include only the samples of interest to you ("Select samples").
When you select samples, the page will display only variables present in those samples. An "x" indicates the availability of a variable for a particular sample.
On the variables page, clicking on a variable name brings up its documentation. The information about the variable is contained on a number of tabs. The default tab is the brief description of the variable. More information is usually available on the "comparability" tab. The variables page also has direct links to the codes page for each variable (they are also accessible as a tab in the variable description). The codes page shows the codes and labels for the variable, and the availability of categories across samples. These categories can suggest the types of research possible with a given sample.
Throughout the variable documentation system there are buttons to "Add to cart." Any variables you select in this way are put in your data cart to include in a data extract. Your selections only last for the current web session.
The Data Cart in the upper right keeps track of your variable and sample selections. Once you have made some selections you can click on "View Cart" to review your choices. If you have selected variables and samples you can enter the data extract system. To make a data extract you must be registered to use IPUMS Higher Ed. If you are already registered to use IPUMS Higher Ed, you can click on "create an extract" and use the data access system. The instructions for the extraction system are here.
How do I get access to IPUMS Higher Ed data?
Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.
What are microdata?
Microdata are composed of individual records containing information collected on persons. The unit of observation is the individual. The responses of each person to the different questions are recorded in separate variables.
Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the SESTAT data.
Microdata are inherently flexible. One need not depend on published statistics that are compiled in a certain way, if at all. Users can generate their own statistics from the data in any manner desired, including individual-level multivariate analyses.
What are "weights"?
All IPUMS Higher Ed datasets include a weight variable. Each individual has a weight that ranges from 1 to 13,508.49, which adjusts for the frequency at which individuals with characteristics matching the respondent occur in the actual target population. Most statistical programs have functions to automatically use weights to adjust statistical analyses.
The weight variable WTSURVY should be used for full survey samples, that is, the original SDR survey samples as opposed to the pooled SESTAT samples. The appropriate variable for analyzing SESTAT files is WEIGHT. Because it is possible for the same individual to be in the target population for more than one NSF survey, the NSF adjusts WTSURVY of each survey for use when they are combined into a SESTAT file. For more information about survey designs and the file structure of SESTAT, see our User Note on survey design.
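To show what applying a weight means in practice, here is a minimal Python sketch of a weighted mean. The salary-like values and weights are hypothetical; in a pooled SESTAT extract the weights would come from WEIGHT, and in a full survey sample from WTSURVY, as described above. The sketch gives a point estimate only and does not address standard errors.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical extract rows: (salary-like value, weight).
rows = [(50000, 100.0), (80000, 300.0), (120000, 100.0)]
values = [v for v, _ in rows]
weights = [w for _, w in rows]

print(weighted_mean(values, weights))   # weighted estimate
print(sum(values) / len(values))        # unweighted mean, for comparison
```

The two results differ because the middle record represents three times as many people in the target population as either of the others.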
What does "universe" mean in the variable descriptions?
The universe is the population that is given an opportunity to answer a particular survey question, or that has a valid value for a variable. The population is determined by individual characteristics or particular responses to previous questions. For example, children are not usually asked employment questions, and men and children are not asked fertility questions. Cases that are outside of the universe for a variable are labeled "Logical Skip" on the codes page. Differences in a variable's universe across samples are a common data comparability issue. The label "Logical Skip" is comparable to "NIU" in other IPUMS data projects.
How do I obtain data?
All IPUMS data are delivered through our data extraction system. Users select the variables and samples they are interested in, and the system creates a custom-made extract containing only this information. To start, users can reference our instructions for the data extraction system and instructions for opening an IPUMS extract on your computer.
Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on their local machine. Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.
What format are the data in?
IPUMS produces fixed-column ASCII data. Data are entirely numeric. In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer.
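For readers unfamiliar with fixed-column files, the following Python sketch shows the idea: each variable occupies a fixed range of character positions on every line. The layout and the sample line below are hypothetical; the real column positions for an extract are those recorded in its syntax file and codebook, and the supplied syntax files remain the supported way to read the data.

```python
# Hypothetical layout: (variable name, start, end), 0-based, end exclusive.
# Real positions come from the extract's syntax file or codebook.
LAYOUT = [("YEAR", 0, 4), ("PERSONID", 4, 12), ("AGE", 12, 14)]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-column record into a dict of integer values."""
    return {name: int(line[start:end]) for name, start, end in layout}


# Hypothetical 14-character record: 4 + 8 + 2 columns.
sample_line = "20130000012345"
record = parse_line(sample_line)
print(record)
```

This sketch treats every field as an integer; variables with decimal places, such as weights, need the extra handling that the syntax file provides.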
A codebook file is also created with each extract. It records the characteristics of your extract and should be downloaded for record-keeping.
All data files are created in gzip compressed format. You must uncompress the file to analyze it. Most data compression utilities will handle the files.
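If no compression utility is at hand, the file can also be uncompressed with Python's standard library. This is a minimal sketch; the helper name `gunzip` and the file name used in the demonstration are hypothetical, and the demonstration writes its own small .gz file in a temporary directory rather than touching a real extract.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file and return the path of the uncompressed copy."""
    src = Path(src)
    dest = Path(dest) if dest else src.with_suffix("")  # drop ".gz"
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Demonstration on a throwaway file with a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "extract_example.dat.gz"
    with gzip.open(gz_path, "wb") as f:
        f.write(b"20130000012345\n")
    out_path = gunzip(gz_path)
    print(out_path.name, out_path.read_bytes())
```

For a downloaded extract, the same function would be called with the path to the .gz file, after which the syntax file can be pointed at the uncompressed data.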
How long does a data extract take?
The time needed to make an extract differs depending on the number and size of samples requested, whether case selection is performed, and the load on our server. Extracts will generally take a few minutes. The system sends an email when the extract is completed, so there is no need to stay active on the IPUMS site while the extract is being made.
How does "sample selection" work on the IPUMS Higher Ed web site?
When a user first enters the variable documentation system, all samples are selected by default. Every variable in the system will display on all relevant screens.
Users can filter the information displayed by selecting only the samples of interest to them. Only the variables available in one of the selected samples will appear in the variable lists. The integrated variable descriptions and codes pages will also be filtered to display only the text and columns corresponding to the selected samples. Sample selections can be altered at any time in your session. Selections do not persist beyond the current session.
When a user enters the extract system after selecting samples, those selections are carried into the data extract system.
What does "add to cart" mean?
While browsing variables in the documentation system, you can place them into your data cart. Checkboxes and buttons labeled "Add to cart" are available in different contexts for this purpose. Any variables you identify in this way will be selected for you when you enter the data extract system. Once in the extract system, you can return to the variable list to make more selections.
Why can't I open the data file?
There are two likely explanations:
1) The data produced by the extract system are gzipped (the file has a .gz extension). You must use a data compression utility to uncompress the file before you can analyze it.
2) You cannot open the data file directly with a statistical package. The file is a simple ASCII file, not a system file in the format of any statistical package. The extract system does, however, generate a syntax (set-up) file to read the ASCII file into your statistical package. You must download the syntax file along with the data file from our server, open the syntax file with your statistical package, and edit the path in the syntax file to point to the location of the data on your local computer. Now you are ready to read in the data.
Is there a preferred statistical package for using the IPUMS?
IPUMS supports SPSS, SAS and Stata. The system does not make data files in those formats, but does generate syntax files with which to read in the ASCII data.
Can I get the original data?
Original public use SESTAT and full NSF survey data files are located at the following National Center for Science and Engineering Statistics