Trajectories of Secondary Data Use in the Social Sciences

[Amy] Okay, thank you, Dharma. So this
presentation is a little bit of a mash-up of what I do at ICPSR. It’s also
a little bit of how I came to ICPSR because some of that is rooted in my own
Summer Program experience. And it’s a little bit about the research that we do
at ICPSR to help make data archiving a more effective and efficient activity. So
a little bit of a research presentation, as much of research as I do at ICPSR, and
also a little bit about the role. So thank you to Dharma for that nice
introduction. Okay. So this is really my beginning, random introductory slide. I
mentioned that my path to ICPSR is partly because of my own Summer Program
experience. But even before talking about that, I’m going to go a little bit further
back in time to talk about how I became a sociologist. And I think that’s
kind of interesting given the commitment that the staff at ICPSR make to the
social sciences, the disciplines we come from, it’s very varied. So I’ll just tell
a little bit of my story. So I came from a really really… I came from a
blue-collar background, my parents didn’t finish high school and I had no
benchmark for what a sociologist would even do when I was in high school. But
that said, I knew I wanted to go to college because college was a good thing.
I’m from Buffalo and so you can see the insignia there. That’s where I went
to school as an undergrad, to the University of Buffalo. But the way that I
afforded going to college was, in part because I have older parents and my
older parents retired. They retired when I was still in school before I was 18.
And because of that, you receive Social Security dependent benefits. So I got to
save for college even though I might not have had resources to go to college
anyway and so I got there. The picture of Ronald Reagan is because
during the Reagan era that Social Security dependent benefit was changed.
It was made less generous, in part to keep Social Security solvent. Social
Security dependent benefits accrue… are given to children, dependent children of
retirees until you’re 18. Before 1981, I think, you actually got them all the way
through the end of college. So had I had all those resources, I’m sure I would
have gone to Harvard and not a state school in New York. But here I am, so I
went to sociology. So the other thing that I think is interesting,
what I have here, this picture of the steel factory where I grew up in
Buffalo. This was Bethlehem Steel in the city of
Lackawanna. My dad worked there, as I mentioned. He didn’t go to high school, so
this was a great job, it afforded him lots of benefits. He was forced to
actually retire when he did. It was before he was actually retirement age.
And it was these kinds of experiences that shaped my sociological curiosity
about the world. At the time, again, didn’t know what the word sociology meant. I
certainly didn’t consider myself a future sociologist. But I knew I was
interested in the social problems of the time. And so for me, the forced retirement
of my dad was one of those, you know, really interesting, interesting topics.
Okay. So I started at school. I picked sociology, again not knowing what
it was, because I knew I wanted to take some classes in aging or gerontology,
something aging related. And I went through the course catalog, the very
first discipline that I hit upon that had an aging course was Sociology of
Aging. It was my first class in sociology. And so with that, I was kind of off to
the races. I loved the topic. I made a connection with my instructor of the
class. She, kind of, guided me then both through my undergraduate career but
then also into the graduate program in sociology. Because again, I didn’t even really
know what grad school was. I’m not kidding. So I was like, “I think I want
to be a lawyer or I might go get an MBA.” But she was like, “No no no. What you
really want to do is stay in this department and get your PhD in sociology.”
So I was like, “Okay, that sounds good too.” and I did it. And my own particular
interest in aging combined with what my strength was as a student, which was
sort of a quantitative approach and being good at quantitative methods, led
me to be sort of unique in my department. The department itself was relatively
small. Anyone from the University of Buffalo? Just as a per chance. And maybe
this is an experience that might resonate for some of you. I was in a
program that had a lot of people who studied theoretical approaches to
sociology, they studied qualitative research but there weren’t very many
people who had sort of a quantitative bent. And that quantitative bent, of
course, introduced me to the world of what, I did not know at the time, would be
important to both my graduate career and finishing my PhD but then eventually my
course work, which was ICPSR providing data for the work that we do in our
graduate programs. And so my advisor at the time… I was well matched, you know
also did quantitative demographic approaches to social problems and social
questions… handed me the catalog at ICPSR and said pick a dataset. You know,
so if you’re going to be here as a graduate student these are the data set options.
You pick one and then you study it. And so it’s literally a paper catalog
at the time, and you can order the tapes, and dot dot dot. So I loved it, that was
ultimately how I did write my dissertation, of course was a secondary
data analysis with data that I got from the University of Michigan. Actually the
Health and Retirement Study, which is not a study of ICPSR, but part of the
Institute where we are. And all of a sudden I felt like there were resources
that could support the kind of work that I wanted to do as a sociologist. The
other, sort of like, run-in with ICPSR– the weird acronym that came along– was the
different training programs like you are in here in the four week program. But there were one-week programs or one-week
workshops that I was interested in. I applied, I got in, and I came to ICPSR. And
what was really life-changing was the fact that I wasn’t like one graduate
student interested in quantitative methods and secondary data analysis. I
was one of many people who were like me and it was the first time in my,
sort of, new career that I felt at home because there were other people
interested in the approach to work that I wanted to do. So anyway, just a
little bit of background of why the Summer Program is near and dear to
my heart. I still have colleagues that I made in the Summer Program that support
me at different points through my career. Some of the more senior ones support me
in really expressive ways. I also have, sort of like, same peer colleagues that I write with still
today. And so it’s a really… the Summer Program for me, it was a really
transformative process and I hope it is for you as well. So when I was done with
my PhD, I was actually starting to get knowledgeable about what the whole thing
was about, and landed a good job. It was here at Wayne State University as a new
assistant professor working in an interdisciplinary research center. And I
mentioned this because it really began to shape how I thought about the
kinds of work the different scientists did. So I now knew that
sociologists, at least some of them, valued secondary data analysis but I was
the only sociologist in the research center where I was at Wayne State
University. And I remember distinctly my colleague saying to me, “Oh well that’s
really nice that you did a dissertation on the Health and Retirement Study and
it’s easy to publish papers, but when are you really going to do research?” So
anyway again, shaping this future passion to change and revolutionize how we think
about the kinds of work that scientists do. So here I am, several years later,
after several years on the assistant professor tenure track, returning
to ICPSR to actually make a difference. And what interested me, of
course, in the job at ICPSR was the ability to come here and be part of the
infrastructure that supports the social sciences. And so I was,
you know, publishing papers at a good rate and I thought I could publish one
more paper or I could do this. I could go to ICPSR and I could publish a data set
that hundreds of people might use and so that was really really appealing
to me. So I was recruited to ICPSR out of, I was at the University of Florida at
the time, to become the Director of Acquisitions. And really the Director of
Acquisitions is the person at ICPSR, it was actually a new position, but the task
of that role was to sort of survey the field, determine the datasets that we
might archive at ICPSR, and go out and get them. And what it really was akin to
was a bit of a sales job. And because it’s all about, sort of like you know,
identifying who might want your services and attracting and recruiting them to
what you do and ultimately bringing the datasets into ICPSR. It’s just
remarkably akin to the sales process. We actually use some of the tools in the
business world to support the pipelines of data that come to ICPSR. So I and
eventually my team that worked with me, work one-on-one with researchers, like
all of you, and budding researchers … one-on-one with researchers and project
teams convincing them to share their data, about how ICPSR will be good
caretakers for their datasets. We are interested in data from all social
science and behavioral disciplines and datasets that are related to the social
sciences. So if you’re a climate scientist, your data aren’t out of scope
for ICPSR because climate impacts humans. We are interested in that too. And so
that has been the work that I have done over the last decade at ICPSR. In
addition to, kind of, these one-on-one relationships and our one-on-one
approach to recruiting data. Another big activity that I’ve been involved in has
been, sort of, I would say programmatic acquisitions. So defining a new
discipline that needs awareness about data archiving services and figuring out
how to bring datasets from that discipline that might not share data
with a place like ICPSR, which is known to some disciplines better than others, and
figuring out what the challenges will be to that group archiving their data and
working through those. It also is supporting substantive areas and domains
and methodologies where a particular funding agency might want to build a
base of data to support the research in that substantive area or, again,
methodologies. So a lot of the work that I’ve come to do over time has been,
as Dharma mentioned, working on large archiving projects at ICPSR. So writing
the proposal and leading the design and then the execution of that work with
funding. She mentioned the National Addiction and HIV Data Archive Program,
that’s been going on for over eleven years, I think, at ICPSR. I was talking
with the National Institute on Drug Abuse today about the project and giving
them a status update. It’s a project that archives longitudinal data sets that
have been funded by NIDA. Datasets that receive a lot of funding over a long
period of time, NIDA invests in a lot of long term studies following people at
risk for substance use or users of illegal substances for long periods of
time. And so we work with researchers who, in some ways, have amassed data, sometimes
over 20 years 30 years, to determine how best to help them get their data
into a stable archive like ICPSR so that the data can live longer than the
project and also so that a broader group of users can come to the
data. So that’s one of the challenges for example of that particular audience. We
also have an archive of data on disability, it’s funded by NICHD and a
few other different NIH institutes. You’ll probably get the sense, so my
background in health and aging and work has led me to be the person who’s very
much working on the health datasets and health projects of ICPSR in addition to
acquisitions. So this is a project that currently has funding to support
rehabilitation sciences and help build inroads into this group
who’ve never thought about data sharing at all. I talk to people routinely on
that project who are like, “You want my data and why? It’s like with 30 people
and we did a bunch of sensors.” I know, they said they have a range
of studies in this particular field and so we’re working to help that community
archive their data. And then my final project is funded by the Robert Wood
Johnson Foundation, at least one of my current projects. I think there’s a couple more. And
this is the Health and Medical Care Archive. And this is for RWJ grantees, they have a
requirement to archive their data with us at HMCA at ICPSR. So this is a great group, they’re actually compelled to archive data.
This is relatively new for me, this was a project that one of my predecessors ran
for a number of years so I’m just beginning to learn the landscape of who
these grantees are and ways that we can help them share their data. But I wanted to
mention that one too. Okay so to the research. Long preamble. So one of the
questions that I’m always asking in my role at ICPSR is, “What data should ICPSR
add?” Obviously we have relatively limited staff resources to go out and get data.
We certainly have limited staff resources to fix the data sets up and
provide the nice ICPSR codebooks that you might have come across if you’ve
downloaded a data set from ICPSR. And so as we ask our question, “What data should
we add?” we’ve attacked that in a variety of different ways through user groups
through setting priorities for, you know, data related to highly cited researchers.
Like over the time that I’ve been here, we’ve thought a lot of different
ways about it. But a few years ago some of my colleagues and I, Dharma included,
thought we actually have data at ICPSR itself in our systems that can help us
understand what users want. Yes, we could go out and do a user survey and we’ve
done those too, but there’s other things that we have, traces of data in our
systems that tell us about what users come for. So we started to ask the question, “What is it, what data do users actually want?”
And more specifically, “What do they want that we don’t have a lot of already?” So
if you’ve been to ICPSR you know that we have thousands of, (coughs) excuse me, datasets and
studies in our archive that you can just download and take and use. You know, what
are we possibly missing? So we were excited, this said “research project” to us,
and the first thing, of course, we did is, “Where can we get data on this?” So we
looked amongst our team for some expertise. This is the group that
actually worked on assembling the data set and some of the analyses and related
publications. So Dharma included. And really what we were doing, the data set that we were working
on is a Google Analytics data set. It comes from the search boxes of ICPSR’s
website where over 500,000 searches happen every year of people looking for
data sets on our website. We scraped all of that information for both 2014 and
2015 from, like I said, the Google Analytics reporting. So for example, in
this data set in 2014, we got a range of things that happen. So a
lot of people come and they search, they search for something that’s really
unique. That’s 34 percent of searches but that’s also, you know, misspellings and
probably people who found our site when they were really looking
for amazondata.com, I don’t know. So we get a lot of unique searches and so that was
kind of interesting. And at first we had the idea maybe we should look at the
unique searches because that’ll tell us about things we don’t have. On the other
hand, the most frequent search of 2014 was, I don’t know the actual what was
searched for, I’m sorry. Maybe I will in a later slide but the top search was performed 2,727 times so that’s the range, 1 to almost
3000. So instead, you know, that first idea was, you know,
pretty much a wrong instinct. We’re not going to look at the really rare
searches of ICPSR, as fun as that might be. But looking even at the top 500 searches, the most frequent searches, we could
actually characterize a lot of the activity of what our users are
interested in. And even there we have a lot of places where we don’t have
datasets, it turns out. So the top 500 searches are around 20% of all searches
that happen in a given year. The frequency range of that top 500 is
92 up to, again, that top number of 2700. In addition to that Google Analytics data
set that we scraped that was really easy to create, we had to add stuff to it, of
course. So what we added to the data set were the number of results that were
returned by that search term. So we went and entered the 500 searches
ourselves in a couple of different ways, but the bottom line was, “How many results
did the user see when they did that search?” And we also did some basic
classifications of the search: was it a keyword search, were they
looking for the name because they knew the name of a person who collected a
data set, you know, dot dot dot. And so this is the … is it big enough? Oh good
it looks big enough, good. So these are the top 500 searches,
well the top 10, right? This is 10, 10 searches. So for exact phrase searches,
education was the number one search term on the ICPSR website, I knew it was
going to be there. And this was with quotes around the phrase which is why
the number varies a little bit from that 2700 number. And so these are
the things that people look for most frequently: education, crime, health, China,
income, domestic violence. A lot of these were things that we would have expected.
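
One way to read a table like this for collection gaps is to divide how often a term was searched by how many studies that search returned, so that heavily searched terms with few results stand out. What follows is only a minimal sketch of that idea, not the team's actual pipeline: the function name, the toy search log, and the result counts are all made up for illustration, and treating a search with zero results as an infinite ratio is an assumption made here.

```python
from collections import Counter


def search_to_study_ratio(search_counts, studies_returned):
    """Return {term: searches / studies returned}, widest gap first.

    Assumption for this sketch: a term whose search returned no studies
    gets float("inf") so that it sorts to the top as the widest gap.
    """
    ratios = {}
    for term, n_searches in search_counts.items():
        n_studies = studies_returned.get(term, 0)
        ratios[term] = n_searches / n_studies if n_studies else float("inf")
    # Sort so the largest ratio (biggest gap) comes first.
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical search-box log (not ICPSR data).
search_log = (
    ["education"] * 4 + ["social media"] * 3 + ["crime"] * 2
)
search_counts = Counter(search_log)

# Hypothetical number of studies each search returned.
studies_returned = {"education": 400, "crime": 100, "social media": 2}

ratios = search_to_study_ratio(search_counts, studies_returned)
for term, ratio in ratios.items():
    print(f"{term}: {ratio:.2f}")
```

In this toy example "social media" comes out on top because it is searched often but matches very few studies, while "education" is searched the most yet has plenty of studies behind it.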
A lot of them reflect some of the strengths of our collection and some of
them reflect the sort of current topics of the time of 2014 and 20… ultimately also we replicated this for 2015 to see how many things are stable across year to year and how many things, sort of, crop up as being a new, hot search term. And ultimately what we did is, looking at the column that’s third to the right, the search study ratio, we created a ratio of how many studies were returned relative to the
search. And so the search to study ratio shows, the largest of that
number, shows the biggest gap in our collection. So social media, in 2014, was
the “gapiest” topic in our collection that people cared a lot about. And so dot
dot dot, NCAA, LGBT, restorative justice, the 2012 election, and so on and so forth-
stop and frisk- down at the bottom, demoralization. Okay so anyway, as Dharma
mentioned, we undertook this so that we could then begin to focus our activities
on identifying datasets to fill some of these gaps and holes. We do that, we
continue to do that today. We do that in the journals, we do that in grant search
databases, and when we’re talking to people we keep a priority list of these
kinds of things so that we can use them to help guide and shape our decisions
about bringing data sets into ICPSR. But to return to the question of, “What data
should ICPSR add?” The other question that we thought we could answer with our
own data, our next research project, was the data that the users are actually
going to use. And so we wondered, “Could we look at our current collection and
predict (according to the title of the talk today) what data actually gets used
over time?” And so we do keep data use statistics for our studies. And we began
to do that, this is a really old version of the ICPSR website I’m not sure why I
put that in there, but in 2000, ICPSR began disseminating downloadable data
sets from the web and so that’s when we actually have then really good records
of things like data use statistics. But really in 2006 our data delivery improved at ICPSR so that we could actually build a more systematic
data set around those things. So we really started our inquiry looking at
ICPSR 2006 onwards. We selected a period of three years from
2006 to 2008, looking at studies that were released to the public to use for
those years, that three-year period, that then we could follow for seven
years and so that’s sort of why we looked at this initial period of years.
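
The cohort design just described (studies released in 2006 to 2008, each followed for seven years) could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the analysis code behind the talk: the function name, the dictionary layout, and the toy numbers are hypothetical, and counting the calendar year of release as "year one" is an assumption about how study years are aligned.

```python
from collections import defaultdict


def mean_users_by_study_year(release_year, yearly_users,
                             first=2006, last=2008, follow_years=7):
    """Average unique users per study for each year since release.

    release_year: {study_id: calendar year the study was released}
    yearly_users: {(study_id, calendar_year): unique users who downloaded}
    Study year 1 is the (often partial) calendar year of release.
    """
    # Keep only studies released inside the cohort window.
    cohort = {s for s, y in release_year.items() if first <= y <= last}
    if not cohort:
        return {}
    totals = defaultdict(int)
    for (study, cal_year), users in yearly_users.items():
        if study not in cohort:
            continue
        study_year = cal_year - release_year[study] + 1
        if 1 <= study_year <= follow_years:
            totals[study_year] += users
    # Divide by cohort size so a study with no downloads counts as zero.
    return {yr: totals[yr] / len(cohort) for yr in range(1, follow_years + 1)}


# Hypothetical toy data: study "C" falls outside the release window.
release_year = {"A": 2006, "B": 2008, "C": 2010}
yearly_users = {
    ("A", 2006): 10, ("A", 2007): 50,
    ("B", 2008): 20, ("B", 2009): 40,
    ("C", 2010): 99,
}
trajectory = mean_users_by_study_year(release_year, yearly_users)
print(trajectory)
```

Aligning every study on years since release, rather than on calendar year, is what lets studies released in different years be averaged into a single curve.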
So we have seven years of data use statistics on our studies, for 614
studies. And here’s the trend. So data use in the first
year, a lot of the first year itself is not a complete year, so the first full
complete year of a study if I cut this first year off, would be year two. And so
that’s, sort of, the peak of use across all studies that we release at ICPSR. The
metric here that you’re looking at is the number of unique users that
downloaded some of the datasets. So over 45 unique users on average in the first
full year, which is year two, downloaded an ICPSR dataset. It then drops off,
of course, a little bit more steeply and levels off some as the seven-year period
progresses. So this is the basic trend. But then we wanted to ask and we could
ask… ICPSR has lots of different streams of data coming into ICPSR, we have a lot
of different workflows and processes around how we prepare data to go
back out for release. And we wondered if the datasets we took better care of, if
those datasets got more use? Now that’s a really simple question and it’d be great if
we could actually answer it in and of itself, but the other problem, of
course, is the confounding of the fact that we tend to take really good care of
and curate the datasets that we expect are going to get the most use. So yeah,
there’s a causality area. So as an association we just wanted to show, could
we demonstrate that some of the different ways that we curate data and
take care of and improve data at ICPSR, if it differentiates the use ultimately?
And so we have categories for replication data. Replication data are
data sets that ICPSR does not touch and that was around 7 percent of the studies
across that three-year period. And then we have two categories: membership and
sponsored. So what we do for ICPSR membership is a little bit more resource
constrained because we have a lot of datasets that get donated and deposited
to the ICPSR membership. Sponsors, when we apply to places like NIH for funding to
curate datasets, we have a set amount of studies and we well resource how much
those data sets will be curated. So just as a comparison member versus sponsored,
you’d expect that sponsored things would be curated to a higher level meaning
that they’d be cleaner and nicer and easier for people to use. And again,
perhaps selected for high use. And then within that we make a differentiation
between non-intensive and intensive in our database, and so I’m able to further
differentiate membership… member data and sponsored data by those two levels. So
we’ve got five levels in the end. So here’s the same curve. Oh no and I can’t
even see the… so the blue line on the bottom is the replication data. The
replication data are the ones, as I mentioned, get no love and care from
ICPSR. People just self-publish them. Most of them are just in support of a
journal article and those get the lowest use at the outset and stay low over
the seven-year period. The green line is sponsored projects
that are highly intensively curated and they’re, of course, it’s the
highest line as we’d expect. The other lines differentiate amongst all those
things. So anyway, as hypothesized the data we take more care of and curate to
a higher standard get more use. Some of the other things that we could look at
in our data were some of the things about the study itself and this is sort
of interesting. If it’s a more complex and comprehensive study you would expect
that it would be something that would be more used and of interest to users, like
yourselves, who are downloading data. Narrow range, single topic studies are going to appeal to a smaller group of users and so we
wanted to test things like that. So we differentiated the findings by the
number of variables, the data use trajectories; the number of subject terms
and we broke these into some categories that you’ll see in the next charts; and whether
or not they were part of a series, and I’ll come to what that means. So here
you’re looking at the number of variables. So when a study has over 200–
200 or more variables, that’s the gold line at the top, it gets the most use.
There’s not so much differentiation though between the next two, either a
really small study that’s under 100 variables or a study
that’s 100 to 200 variables. And then there’s some missing
data that plagues the data set when you’re looking at this kind of variable
that we’ve created. So I guess our hypothesis stands. Similarly, the number of
subject terms… study covers more subjects, 21 or more subjects it gets
the highest use and a little bit less for the other two categories. And then
finally, here the question I asked about series which is, I guess, interesting
because it changes and flips in terms of what’s important early on versus later.
Data in series are data sets where there’s either an annual or a regular
update to the series and perhaps then it’s not surprising that when somebody
is a user, perhaps, of a series they’re waiting for the latest most updated data.
And they want that, and they know that it’s coming, and they download it
when it’s new, and then use drops off. So that’s the blue line so if a study is
part of a series, it’s really hot and then a little bit not. And then if it’s
just not part of a series, and I actually find this other part interesting, these 163 studies that
were not part of a series, where it’s just a one-off study, use stays high and really
high for the seven year period, which I think is pretty remarkable. And so over
40 unique users over the course of a seven year period, I think for a study is
pretty fascinating. Okay so we learned a lot in the course of these couple of
research papers. One of them is a published paper, one is a paper that we
are working on and hope to present at a conference later this year. But we
learned that we can use our own systems to monitor our users and to inform the
way that we think about building our collection. The user behavior data at
ICPSR leaves these traces in our systems that can help us understand what
datasets we should recruit and which datasets we should spend money to curate
more. And it answers these questions of what it is, perhaps, that the users are
looking for, what data is ICPSR lacking, and what data sets are going to
get use if we bring them to the archive. The first data set, the Google Analytics
data set is available, you can download it and study and crunch the numbers
yourselves. So I put that there. The other data set is a little bit newer
and will be out when our first publication goes under review, which is
at least decent practice. But we have to clean it up, of course just like anybody,
before giving it to ourselves and so the data sets that underlie some of the
things that I’ve talked about you can go out and find. And then I have just two
slides to talk about the… because I cannot, you know, I said this was a sales
job like acquisitions is a sales job, and I cannot, for the life of me, talk to 30
people and not ask you for your data, either current or future. So I just want
to stress that ICPSR is… considers itself to be a life line partner in sharing
data. Even though I focused on, sort of, what data are most desirable, perhaps to
our users or to ICPSR I don’t want it to be mistaken or a myth that we wouldn’t
be interested in your data because it doesn’t maybe meet one of those criteria
that I focused on today. So sharing data with ICPSR is free. People can deposit
data. We’re interested in data of any size and we have a lot of different
services and ways to handle the data sets and so people can
self-publish their data at ICPSR. So no data set is too small or too
insignificant. I think all data are good, I’ve thought that my entire time here. I
focused on what ICPSR is pretty heavily known for which is our quantitative data
sets. I didn’t exclude other things but by design, when you’re talking about
variables and things like that it might not resonate if you collected a
qualitative or a mixed methods study that you have something that ICPSR could help
with, it’s true we can help with that too. We have a qualitative study that arrived
just a couple of weeks ago that’s working its way through the pipeline.
Some datasets are public use and downloadable because you can fully
de-identify them but not all of our data are, we also have restricted
data. And I don’t want to say too much about all these things, I know you’ll
hear from Lynette tomorrow if you’re able to come. A lot of graduate students
give us data, it doesn’t have to be that it was a professor who collected the
data, it’s possible. We get a lot of NSF dissertation award datasets, for example.
Online surveys, data that led to null results, I just kept going with my
brainstorming of, like, no data are too small. So I’ll stop there. And then
finally, obviously there’s really good reasons to do this, to share datasets.
Many of you throughout your careers will come up against funders who expect you
to share your data, who want you to have your data in the public domain
because they were funded with public dollars, with tax dollars to collect the
data and so the public also should have access to the data.
It also supports other things that are… that matter in our careers. When I
was putting my recent promotion packet together, I was super excited to be like
here’s my published dataset, here’s how many times it got used, all those things
were possible because of ICPSR being a home for datasets, even the small ones
that I’ve collected myself. We track data use and related publications for studies
that we have and so it’s really gratifying to take a study, look at it,
and say, “Wow!” 40… so this was 45 people, on average. But it’s not, you know, unheard of
that I open up a study and it has 400 people that downloaded it in a given
year. And not even necessarily those, like, usual suspect datasets, like big national
surveys but other things get used really heavily too. It’s also one of the things
that ICPSR thinks about, is not just giving data to researchers who can put
the data into SPSS, and analyze it, and make publications, and do secondary data
analysis, and build their careers but we’re also thinking about ways to
disseminate data to really broad audiences. Yes, it’s free for many people
to download many of our datasets but trying to make those datasets actually
really easily useful for practitioners, for people in communities, those are the
kinds of things that ICPSR is thinking about because we want dissemination
to mean more than dissemination for research purposes. Preservation is
something that we worry about when you give us data and you no longer have to
worry about it. We’ve done a lot of… I’ve done studies of scientists where I’ve
asked them what happened to your data from your NSF award from 1986 and I get,
“Well it was lost in a flood.” or “My wife threw it out.” and, you know, you get this,
like, range of things and very rarely do people say, “It’s in an archive. Let me
show, I can give you the link.” You know so, at ICPSR the datasets that first
came into the archive 55 years ago, are still accessible today because
ICPSR uses resources to migrate those things to new formats, to make sure
they’re documented and useful. And so we like to do that for researchers and as
you develop datasets in your career, hopefully you’ll let us do that for you.
And then finally, one of the other benefits of archiving with ICPSR is
around user support and assistance. There’s a lot of really low-level
questions that ICPSR can answer around a study, even if it doesn’t know exactly
how you collected your data. That means that nobody bugged you and
so I love to tell people that we can handle all that tier one user support
kinds of questions. We have a really good staff that do that. Part of
that staff reports to Dharma and they’re really good at helping users find
datasets, download datasets, get started using them, and that just means that your
data then actually do get used. So with that… so I’ve shared a lot of
personal things at the beginning and I had to end with my brand-new puppy. On
Saturday we adopted little Olive. She’s a Bernie-doodle, a Bernese Mountain Dog and
Poodle. And I encourage you, if you want to talk Bernie-doodles to reach out to
me. Should you have questions about either the research that we’re doing at
ICPSR, or I really hope that you have questions for me about datasets that you
think we might archive, either your own or others that you know in the field
that aren’t at a place like ICPSR, those are the conversations we’d love to have
with you. Thank you. [applause]