Frequently Asked Questions (FAQ)
What is SESTAT?
What is the difference between SESTAT and SDR?
What weight should I use?
Are Higher Ed data nationally representative?
Who is considered a scientist or engineer?
What types of data are available?
What are the differences between logical skip, blank, missing, and survey exclusion/confidentiality?
Are IPUMS Higher Ed variable names the same as the original public use variable names?
Is longitudinal data analysis possible with IPUMS Higher Ed?
Are there longitudinal weights I can use for SESTAT component surveys?
Is it possible to link NSCG survey respondents to the ACS?
Can I link IPUMS Higher Ed data with the original public use SESTAT data?
Where should a new user start?
How do I get access to IPUMS Higher Ed data?
What are microdata?
What are "weights"?
What does "universe" mean in the variable descriptions?
How do I obtain data?
What format are the data in?
How long does a data extract take?
How does "sample selection" work on the IPUMS Higher Ed web site?
What does "add to cart" mean?
Why can't I open the data file?
Is there a preferred statistical package for using the IPUMS?
Can I get the original data?
How is a record uniquely identified?
Using IPUMS data
Are there tricky aspects of IPUMS data to be particularly aware of?
What are the major limitations of the data?
Can I find particular individuals in the data?
How do I cite IPUMS Higher Ed?
Using the variables page
Variables page menu
Variables page details
Using the data extract system
Your data cart
Why are some variables in my data cart preselected?
Extract request page
Extract option: Select cases
Extract option: Describe your extract
General information about the project
What is SESTAT?
The Scientists and Engineers Statistical Data System (SESTAT) is a database that provides demographic, educational, employment, and earnings data about scientists and engineers in the US. Scientists and engineers in this database are defined as individuals who have earned a post-secondary degree in science, engineering, or, in later surveys, health sciences. Individuals who work in an occupation related to science or engineering are also included in the SESTAT file, even if they do not hold a science or engineering post-secondary degree.
The database is composed of individuals who meet these criteria from three surveys conducted by the National Science Foundation: the National Survey of College Graduates, the Survey of Doctorate Recipients, and the National Survey of Recent College Graduates. The three surveys are fielded at the same time, use the same reference period for the questions they have in common, and have a similar target population (non-institutionalized individuals under the age of 76 with a bachelor's degree or higher).
Each SESTAT component survey was designed to be a cross-sectional representation of the target population for that year. However, some respondents in the SESTAT are interviewed in multiple surveys over time, allowing for limited longitudinal research. For more information on longitudinal data linking, visit the Longitudinal Data page.
The SESTAT files are composed of microdata. Each record is a person, with all characteristics numerically coded. Unlike other IPUMS data projects, SESTAT records are not organized by households. Because the data are individual records rather than tables, researchers must use a statistical package to analyze the millions of records in the database. A data extraction system enables users to select only the samples and variables they require.
What is the difference between SESTAT and SDR?
The Survey of Doctorate Recipients (SDR) is one of three sub-component surveys of the Scientists and Engineers Statistical Data System. The sampling frame for the SDR includes all doctorate recipients residing in the US who received a research doctorate degree from a US institution in a science or engineering field. All SDR respondents are included in the SESTAT version of the data. The original full SDR sample has some variables that were not included in the SESTAT version. Please see our User Note on survey designs for information about the surveys that comprise the SESTAT files.
What weight should I use?
Use WEIGHT when analyzing SESTAT files and WTSURVY when analyzing full survey samples. See the answer to What are "weights"? below for details.
Are Higher Ed data nationally representative?
The target population for the SESTAT surveys is non-institutionalized individuals under the age of 76 with a bachelor's degree or higher, or an occupation, in a science, engineering, or health field. The weights provided with the data cause the resulting estimates to be nationally representative of the target population in the United States in the survey year.
Who is considered a scientist or engineer?
In the SESTAT surveys' sampling frames, a scientist or engineer is an individual who holds a bachelor's degree or higher in a science, engineering, or health field, or in a science-, engineering-, or health-related field. More specifically, the following list describes the degree fields the National Science Foundation nests beneath science and engineering or science- and engineering-related fields.
Science and Engineering
Computer and math sciences
Biological, agricultural, and environmental life sciences
Physical sciences (Physics, chemistry, geosciences)
Social Sciences (Psychology, economics, political science)
Science and Engineering related
Health (Medicine, audiology, nursing, physical therapy)
Science and math teacher education
Technology and technical fields (Engineering technology)
Other Science and Engineering related fields (Actuarial science, architectural or environmental design)
Respondents who have a degree in a field other than a science, engineering, or health related field might also be included in the SESTAT if they had an occupation in a science, engineering, or health sciences field.
What types of data are available?
The SESTAT surveys collect information related to demographics, labor force participation, educational history, family, and professional activities. For confidentiality, geographic information such as birthplace, location of academic institution, or current residence is sometimes suppressed or available only at the region level.
What are the differences between logical skip, blank, missing, and survey exclusion/confidentiality?
In the original NSF data, there were three types of invalid response. There were also unexplained blanks, which IPUMS Higher Ed has assigned an invalid response value.
Logical Skip: Based on the answer a respondent gave to a survey question, they might have been directed by instructions on the survey to skip a question. In this case, a non-response was considered a logical skip. Another way to frame this is that the respondent did not fall within the universe of a survey question.
Missing: A missing code indicates that the respondent was in universe for a question but did not answer it.
Survey Exclusion/Confidentiality: In the original NSF files, if a survey question was asked in only one survey, the observations of respondents in the other two surveys will be coded with a "97". In some cases, responses such as the respondent's ethnic background or location may be suppressed to prevent identification of the individual.
Blank: In the original NSF files, there were unexplained blank values that we have chosen to code differently from the cases explicitly identified as missing.
In IPUMS Higher Ed, we have imposed the following consistent coding scheme for these values:
97: Survey Exclusion/Confidentiality
98: Logical Skip
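As an illustration of how these codes might be handled during analysis, the following is a minimal Python sketch that sets the two codes listed above to None before computing statistics. The sample values are hypothetical. The sketch assumes a variable for which 97 and 98 are not valid substantive values; codes for other invalid-response types are not listed here, so check the codes page for each variable before applying a rule like this.

```python
# Invalid-response codes taken from the coding scheme listed above.
SURVEY_EXCLUSION = 97
LOGICAL_SKIP = 98
INVALID_CODES = {SURVEY_EXCLUSION, LOGICAL_SKIP}


def valid_or_none(value):
    """Return the value unchanged, or None if it is an invalid-response code."""
    return None if value in INVALID_CODES else value


# Hypothetical responses for a single variable.
responses = [1, 2, 98, 1, 97, 3]

cleaned = [valid_or_none(v) for v in responses]
valid_only = [v for v in cleaned if v is not None]

print(cleaned)     # [1, 2, None, 1, None, 3]
print(valid_only)  # [1, 2, 1, 3]
```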
Are IPUMS Higher Ed variable names the same as the original public use variable names?
When possible, the IPUMS Higher Ed variable names are identical to the original variable names. In some instances, variable names changed over time, and we typically keep the more recent variable name. Please see our User Note for variable name correspondence.
Is longitudinal data analysis possible with IPUMS Higher Ed?
Yes, because some respondents are surveyed multiple times across NSF surveys, there are some opportunities for longitudinal analysis. The variable PERSONID identifies individuals across survey years. (It replaces the original variable, REFID, which contains non-numeric values in some samples.) IPUMS Higher Ed does not provide linked longitudinal files; however, users can download the samples they desire and easily link them using a statistical program, merging by PERSONID. More detailed information about the patterns of follow-up surveys can be found in our User Note on longitudinal data.
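The merge by PERSONID can be sketched in a few lines of Python. This is only an illustration, not an official linking procedure: the records and the SALARY values are hypothetical, and the sketch assumes each PERSONID appears at most once within a sample.

```python
def link_by_personid(earlier, later):
    """Pair records from two samples that share a PERSONID (inner join)."""
    later_by_id = {rec["PERSONID"]: rec for rec in later}
    linked = []
    for rec in earlier:
        match = later_by_id.get(rec["PERSONID"])
        if match is not None:
            linked.append((rec, match))
    return linked


# Hypothetical records from two survey years.
sample_a = [
    {"PERSONID": 101, "SALARY": 50000},
    {"PERSONID": 102, "SALARY": 61000},
]
sample_b = [
    {"PERSONID": 102, "SALARY": 66000},
    {"PERSONID": 103, "SALARY": 48000},
]

pairs = link_by_personid(sample_a, sample_b)
for first, second in pairs:
    # Change in the illustrative SALARY value between the two samples.
    print(first["PERSONID"], second["SALARY"] - first["SALARY"])  # 102 5000
```

Only respondents present in both samples are kept, so the linked file is usually much smaller than either sample.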
Are there longitudinal weights I can use for SESTAT component surveys?
Longitudinal weights were created by the NSF, but they are not yet publicly available.
Is it possible to link NSCG survey respondents to the ACS?
Both the ACS and the NSCG survey data are protected under Title 13 of the US Code. Because of the disclosure avoidance techniques used by the Census Bureau, it is not possible to link ACS respondents to respondents of the NSCG. Researchers who wish to create these linkages should contact the Federal Statistical Research Data Center (FSRDC) administration to apply to use the restricted version of the NSCG data.
Can I link IPUMS Higher Ed data with the original public use SESTAT data?
Yes, the original person identification number REFID is available for every IPUMS Higher Ed sample.
Where should a new user start?
The natural starting point is the "Select Data" or "Browse and Select Data" links on the left navigation bar and the top banner. These links open the variables page: the primary tool for exploring the contents of IPUMS Higher Ed. By default, the variables page displays one variable group at a time for all samples in the data series. You can change the view option to show all groups simultaneously, but the page can get very large and slow to load. However, you can filter the information at any point to include only the samples of interest to you ("Select samples").
When you select samples, the page will display only variables present in those samples. An "x" indicates the availability of a variable for a particular sample.
On the variables page, clicking on a variable name brings up its documentation. The information about the variable is contained on a number of tabs. The default tab is the brief description of the variable. More information is usually available on the "comparability" tab. The variables page also has direct links to the codes page for each variable (they are also accessible as a tab in the variable description). The codes page shows the codes and labels for the variable, and the availability of categories across samples. These categories can suggest the types of research possible with a given sample.
Throughout the variable documentation system there are buttons to "Add to cart." Any variables you select in this way are put in your data cart to include in a data extract. Your selections only last for the current web session.
The Data Cart in the upper right keeps track of your variable and sample selections. Once you have made some selections you can click on "View Cart" to review your choices. If you have selected variables and samples you can enter the data extract system. To make a data extract you must be registered to use IPUMS Higher Ed. If you are already registered to use IPUMS Higher Ed, you can click on "create an extract" and use the data access system. The instructions for the extraction system are here.
How do I get access to IPUMS Higher Ed data?
Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.
What are microdata?
Microdata are composed of individual records containing information collected on persons. The unit of observation is the individual. The responses of each person to the different questions are recorded in separate variables.
Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the SESTAT data.
Microdata are inherently flexible. One need not depend on published statistics that are compiled in a certain way, if at all. Users can generate their own statistics from the data in any manner desired, including individual-level multivariate analyses.
What are "weights"?
All IPUMS Higher Ed datasets include a weight variable. Each individual has a weight, ranging from 1 to 13,508.49, that adjusts for how frequently individuals with characteristics matching the respondent occur in the actual target population. Most statistical programs have functions that automatically use weights to adjust statistical analyses.
The weight variable WTSURVY should be used for full survey samples, that is, the original SDR survey samples as opposed to the pooled SESTAT samples. The appropriate variable for analyzing SESTAT files is WEIGHT. Because the same individual can be in the target population for more than one NSF survey, the NSF adjusts each survey's WTSURVY for use when the surveys are combined into a SESTAT file. For more information about survey designs and the file structure of SESTAT, see our User Note on survey design.
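To show what applying a weight means in practice, here is a minimal Python sketch of a weighted total and a weighted mean. The records and values are hypothetical, and the functions are illustrative helpers, not part of any IPUMS tool. The sketch produces point estimates only; standard errors for survey data require methods available in statistical packages.

```python
def weighted_total(records, weight_var="WEIGHT"):
    """Estimated number of people in the target population represented."""
    return sum(rec[weight_var] for rec in records)


def weighted_mean(records, value_var, weight_var="WEIGHT"):
    """Weighted mean of value_var, using weight_var as the weight."""
    total_weight = sum(rec[weight_var] for rec in records)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    weighted_sum = sum(rec[value_var] * rec[weight_var] for rec in records)
    return weighted_sum / total_weight


# Hypothetical records from a SESTAT file.
records = [
    {"AGE": 30, "WEIGHT": 100.0},
    {"AGE": 50, "WEIGHT": 300.0},
]

print(weighted_total(records))        # 400.0
print(weighted_mean(records, "AGE"))  # 45.0
```

For a full survey sample, the same functions could be called with weight_var="WTSURVY".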
What does "universe" mean in the variable descriptions?
The universe is the population that is given an opportunity to answer a particular survey question, or that has a valid value for a variable. The population is determined by individual characteristics or particular responses to previous questions. For example, children are not usually asked employment questions, and men and children are not asked fertility questions. Cases that are outside of the universe for a variable are labeled "Logical Skip" on the codes page. Differences in a variable's universe across samples are a common data comparability issue. The label "Logical Skip" is comparable to "NIU" in other IPUMS data projects.
How do I obtain data?
All IPUMS data are delivered through our data extraction system. Users select the variables and samples they are interested in, and the system creates a custom-made extract containing only this information. To start, users can reference our instructions for the data extraction system and instructions for opening an IPUMS extract on your computer.
Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on their local machine. Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.
What format are the data in?
IPUMS produces fixed-column ASCII data. Data are entirely numeric. In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer.
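For readers unfamiliar with fixed-column files, the following Python sketch shows the idea behind what the syntax file does: each variable occupies a fixed range of character positions on every line. The layout below is entirely hypothetical; the real column positions for an extract come from its syntax file or codebook. The sketch also treats every field as a whole number, which may not hold for variables that carry decimals.

```python
# Hypothetical layout: (name, start, end) using 0-based, end-exclusive offsets.
# Real positions must be taken from the syntax file or codebook of your extract.
LAYOUT = [("PERSONID", 0, 6), ("SAMPLE", 6, 10), ("AGE", 10, 12)]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-column line into a dict of integer values."""
    return {name: int(line[start:end]) for name, start, end in layout}


# Two hypothetical lines of fixed-column data.
lines = ["000101160135", "000102160142"]
rows = [parse_line(line) for line in lines]

print(rows[0])  # {'PERSONID': 101, 'SAMPLE': 1601, 'AGE': 35}
```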
A codebook file is also created with each extract. It records the characteristics of your extract and should be downloaded for record-keeping.
All data files are created in gzip compressed format. You must uncompress the file to analyze it. Most data compression utilities will handle the files.
How long does a data extract take?
The time needed to make an extract differs depending on the number and size of samples requested, whether case selection is performed, and the load on our server. Extracts will generally take a few minutes. The system sends an email when the extract is completed, so there is no need to stay active on the IPUMS site while the extract is being made.
How does "sample selection" work on the IPUMS Higher Ed web site?
When a user first enters the variable documentation system, all samples are selected by default. Every variable in the system will display on all relevant screens.
Users can filter the information displayed by selecting only the samples of interest to them. Only the variables available in one of the selected samples will appear in the variable lists. The integrated variable descriptions and codes pages will also be filtered to display only the text and columns corresponding to the selected samples. Sample selections can be altered at any time in your session. Selections do not persist beyond the current session.
When a user enters the extract system after selecting samples, those selections are carried into the data extract system.
What does "add to cart" mean?
While browsing variables in the documentation system, you can place them into your data cart. Checkboxes and buttons labeled "Add to cart" are available in different contexts for this purpose. Any variables you identify in this way will be selected for you when you enter the data extract system. Once in the extract system, you can return to the variable list to make more selections.
Why can't I open the data file?
There are two likely explanations:
1) The data produced by the extract system are gzipped (the file has a .gz extension). You must use a data compression utility to uncompress the file before you can analyze it.
2) You cannot open the data file directly with a statistical package. The file is a simple ASCII file, not a system file in the format of any statistical package. The extract system does, however, generate a syntax (set-up) file to read the ASCII file into your statistical package. You must download the syntax file along with the data file from our server, open the syntax file with your statistical package, and edit the path in the syntax file to point to the location of the data on your local computer. Now you are ready to read in the data.
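If no data compression utility is at hand, the first step can also be done with Python's standard library. This is a minimal sketch; the file names in the commented example call are hypothetical and should be replaced with the names of your own extract files.

```python
import gzip
import shutil


def gunzip(src_path, dest_path):
    """Write the uncompressed contents of a .gz file to dest_path."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


# Example call with hypothetical file names:
# gunzip("highered_00001.dat.gz", "highered_00001.dat")
```

The uncompressed file is still plain ASCII, so the syntax file is needed to read it into a statistical package.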
Is there a preferred statistical package for using the IPUMS?
IPUMS supports SPSS, SAS and Stata. The system does not make data files in those formats, but does generate syntax files with which to read in the ASCII data.
Can I get the original data?
Original public use SESTAT and full NSF survey data files are located at the following National Center for Science and Engineering Statistics webpage.
How is a record uniquely identified?
The variables REFID and PERSONID (a recode of REFID) represent an individual respondent, though respondents can appear in different samples. The combination of PERSONID (or REFID) and SAMPLE constitutes a unique identifier for every individual record in IPUMS Higher Ed.
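After combining several samples, it can be useful to confirm that this identifier behaves as expected. The Python sketch below counts repeated key combinations; the records and SAMPLE values are hypothetical, and duplicate_keys is an illustrative helper rather than an IPUMS function.

```python
from collections import Counter


def duplicate_keys(records, keys=("PERSONID", "SAMPLE")):
    """Return the key combinations that occur more than once."""
    counts = Counter(tuple(rec[k] for k in keys) for rec in records)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical records pooled from two samples.
records = [
    {"PERSONID": 101, "SAMPLE": 1601},
    {"PERSONID": 101, "SAMPLE": 1701},  # same person, different sample
    {"PERSONID": 102, "SAMPLE": 1601},
]

print(duplicate_keys(records))                      # []
print(duplicate_keys(records, keys=("PERSONID",)))  # [(101,)]
```

PERSONID alone repeats when a respondent appears in more than one sample, which is why SAMPLE is part of the identifier.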
Using IPUMS data
Are there tricky aspects of IPUMS data to be particularly aware of?
The IPUMS Higher Ed samples are weighted: each individual does not represent the same number of persons in the population. It is important to use the weight variables when performing analyses with these samples.
It is important to examine the documentation for the variables you are using. The codes and labels for variable categories do not tell the whole story. In other words, the syntax labels are not enough. There are two things to pay particular attention to. The universe for a variable -- the population at risk for answering the question -- can differ subtly or markedly across samples. Also, read the variable comparability discussions for the samples you are interested in. Important comparability issues should be mentioned there. If a variable is of particular importance in your research (for example, it is your dependent variable), you are also well served to read the enumeration text associated with it. This text is linked directly to the variable, so it is quite easy to call it up.
What are the major limitations of the data?
IPUMS is composed entirely of sample data, and some subpopulations may be too small to study with the sample data.
Because the data are public-use, measures have been taken to assure confidentiality. Names and other identifying information are suppressed. Most importantly for many researchers, geographic information is limited.
Can I find particular individuals in the data?
No. A variety of steps have been taken to ensure the confidentiality of the data. Most fundamentally, the samples do not contain names or addresses. The data are only samples, so there is no guarantee any given individual will be in the dataset.
How do I cite IPUMS Higher Ed?
Reports and publications using IPUMS Higher Ed data must be cited appropriately. The citation is:
Minnesota Population Center. IPUMS Higher Ed: Version 1.0. [Machine-readable database]. Minneapolis: University of Minnesota, 2016.
Any publications, research reports, presentations, or educational material making use of the data or documentation should be added to our Bibliography. Continued funding for the IPUMS depends on our ability to show our sponsor agencies that researchers are using the data for productive purposes.
Using the variables page
Variables page menu
Use the "Variables" menu to browse or search variables:
Topics: variables by group
A-Z: integrated variables by letter
Search: display only variables that contain specified text in particular fields
Use the links on the right side of the menu to:
Select Samples: limit the display of variable information to selected samples
Help: directs you to our FAQs for instructions about the extract system
Display options: alter how the variable list is displayed or get help for this page
Sample label key: show a descriptive key to the sample labels and how they correspond to different NSF surveys
Variables page details
The variables page allows you to browse variables while limiting and controlling how the information is displayed.
The "Select Variables" menu is for browsing the variables. You may also search variables by specifying search terms for specific fields of variable metadata. The system will return a list of variables that include any of the search terms you indicate.
When you "Select Samples" you limit the variable list to display only variables that are available in at least one of those samples. But the effect of selecting samples extends into all the variable descriptions and codes pages you can access through the variable system. Only information relevant to your selected samples will be displayed in any context while you browse the variables. You can change your sample selections at any point.
Selecting samples is a good practice when exploring the IPUMS, because the amount of information can be unwieldy. On the other hand, sometimes you need to see everything to determine what kinds of research are possible using the database.
The final choices are "Options" and "Help." The "Display Options" item brings up a screen that offers a number of choices regarding the display of the variable list. Each selection has a default choice.
Switch between the long version of a sample label and the short version. For example, the long version of the SESTAT NSCG file in 1993 is S-NSCG 1993, while the short version is S-CG 1993.
Switch between viewing one variable group at a time and viewing all variable groups on one screen. Unless you have a limited number of samples selected, your browser may be slow to display all groups. The default view is one group at a time.
Variable availability information
Switch between displaying the full sample-specific availability matrix, and a view that only displays the total number of samples that contain each variable. Both views only display or sum the samples that the user has selected in "Select samples." The default view is the detailed availability information.
Variables that are not available for the selected samples
Switch between a view that only displays variables present in one of your selected samples, and a view that displays every variable, even if they are not available. The default view is to only display available variables.
Ordering of sample columns
Display the samples columns indicating variable availability in chronological order (oldest to newest) or reverse chronological order (newest to oldest). The default is reverse chronological (newest to oldest).
The Variable List
As you browse the variables, they are displayed in a list containing a number of columns. The variable name links to the variable description, which includes detailed comparability discussions, universes, and enumeration text. The variable codes -- and their associated labels -- can be accessed directly using the "codes" links.
In the area to the right of the "codes" column is a column for every sample that the user chose in "Select samples." By default, the most commonly requested samples from each year are selected. The country abbreviation and last two digits of the sample year identify each sample at the top of every column. Hover over the year with the mouse to see the full country name. If a variable is available in a given sample, an "x" is printed in that column.
Each variable has a box on the far left in the column labeled "Add to cart." Use these to identify variables you wish to include in a data extract.
Using the data extract system
Your data cart
You must be logged in to use the data extract system. If you are not registered, you must apply for access.
At the top right corner of the variables page is a summary of your data cart. This box displays the number of variables and samples you have selected. Clicking the yellow circle next to a variable places it in your data cart. You can view your data cart at any time by clicking "View Cart." The "View Cart" link only becomes operative when you have selected a variable or sample.
Your data cart lists the variables pre-selected by the extract system as well as any variables you selected while browsing the documentation. As with the variable selection page, you can remove variables from your extract in this step by clicking the checkbox next to the variable in the "Add to cart" column. If you chose a variable but subsequently altered your sample selections in such a way that the variable is no longer available, it is indicated by an "i" icon.
The data cart also includes record type, links to codes pages, and sample availability for the variables in your cart.
Buttons are provided to return to the variable list to make more selections or to alter your sample choices. If you return to the variable list, click on "View Cart" again to return to the data cart.
When you are satisfied with your data selections, click "Create Data Extract" to finalize your extract request.
Why are some variables in my data cart preselected?
Certain variables appear in your data cart even if you did not select them, and they are not included in the constantly updated count of variables in your data cart.
Unless you are absolutely certain you will not need one of these variables, we recommend that you not remove them from your data cart.
Extract request page
When you click "Create Data Extract" in the Data Cart, you come to the Extract Request page. All of the actions on this page are optional. If you wish, you can simply hit the "Submit" button and create your data extract. You will be prompted to log in if you have not done so already.
The page summarizes your data extract and provides a number of options for customizing it. A link at the top expands to show the samples you selected. If any samples have notes associated with them, a message will appear on the samples bar to encourage you to review that information. Click the appropriate links to go back to the variable browsing and sample selection pages to alter your choices. You return to the extract request page via the data cart, where you can review the availability matrix for selections and easily drop variables by unchecking them.
When you submit an extract, there will be a short delay, rarely longer than a few minutes. You do not need to wait on our site for the job to be completed. The system will send you an email when your extract is ready.
The definitions of every extract will remain on our server indefinitely, but the data files are subject to deletion after three days. However, the screen where you download extracts has a feature that lets you revise old extracts. When you click on "revise," all your selections for that extract will be loaded into the system, after which you can edit or regenerate it. Note, however, that each successive data release can create difficulties for recreating old extracts, because codes might change.
Extract option: Select cases
The "select cases" feature allows users to limit their dataset to contain only records with specific values for selected variables, such as persons age 65 and older. Multiple variables can be used in combination during case selection. Selections for multiple variables are combined with an implicit logical "AND": a record must satisfy every selection to be included. You can only perform case selection on either the general or the detailed version of a variable, not both.
Users should be careful with the case selection feature. It is possible to select a specific variable category that does not exist across all the samples in your extract, thereby inadvertently excluding those samples from your dataset.
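The logic of case selection, and the caution about dropping whole samples, can be illustrated with a short Python sketch. This is not the extract system's implementation. The records, sample identifiers, and variable codes are hypothetical; the point is that every criterion must hold for a record to be kept, and that a simple count by sample reveals samples left with no records.

```python
from collections import Counter


def select_cases(records, criteria):
    """Keep records whose value is in the allowed set for every variable."""
    return [
        rec for rec in records
        if all(rec[var] in allowed for var, allowed in criteria.items())
    ]


# Hypothetical records from two samples.
records = [
    {"SAMPLE": 1, "AGE": 70, "LFSTAT": 1},
    {"SAMPLE": 1, "AGE": 66, "LFSTAT": 2},
    {"SAMPLE": 2, "AGE": 40, "LFSTAT": 1},
]

# Both conditions must hold: age 65 to 75 AND the hypothetical LFSTAT code 1.
criteria = {"AGE": set(range(65, 76)), "LFSTAT": {1}}
selected = select_cases(records, criteria)

# Check whether any sample lost all of its records.
per_sample = Counter(rec["SAMPLE"] for rec in selected)
missing_samples = {rec["SAMPLE"] for rec in records} - set(per_sample)

print(selected)         # [{'SAMPLE': 1, 'AGE': 70, 'LFSTAT': 1}]
print(missing_samples)  # {2}
```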
Extract option: Describe your extract
You can describe your extract for future reference. Our system will display the description on the page where you download your data extract.