Statistical Methodology in the Social Sciences 2017: Session 3

Statistical Methodology in the Social Sciences 2017: Session 3


There we go.>>So we have three more talks, and some people asked about this. This is the most attended this conference has been, this is the third,
and it will continue. So next year around this time, it’ll be on a Friday within a week or
two of this time. And I will say that I am
always looking for speakers, and ideally,
we have nine speakers from nine different social science departments and
groups on campus. So, and usually,
it’s a faculty presentation, particularly good for newcoming faculty. But also, it can be advanced, very good students, and I’m always open to suggestions and
so on for speakers. So without further ado,
we have Jenny McDougall from sociology, presenting on something [CROSSTALK]. You’ve heard of this topic before?>>[INAUDIBLE]
>>Okay, well, thank you Collin for the invitation to be here, and
to ISS for putting this together. I know a lot of you are students,
so I hope you realize there are a lot of institutions, where this sort
of interdisciplinary exchange of ideas and opportunities to come together and
talk aren’t available. Where everything’s very siloed, and people just sort of exist
within their own departments. That’s a real shame, so it’s a really nice
thing, that we have things like this. I really appreciate it, and I hope you find this useful
to all of you as well. So I’ll start with the talk today by
explaining my motivation for the topic, and sort of the content of this talk. It begins back in my musician days, back then, there was this essay or short story by Ralph Ellison,
where he describes jazz performance. In the back of the room,
there’s a guy by himself, standing quietly by the radiator
in the back of the room. You would almost miss him,
he’s sort of quiet, obscure in the back. But he’s giving the performers
his full attention, he’s completely locked in and
totally absorbed in the music. So the takeaway from the story I
used to reflect on a lot was that, that’s the person you’re playing for
when you’re playing music. Even at the crappiest gig where
there’s very few people there, and people there are talking,
not even listening to the band. If you kind of imagine that the guy
in the back of the room is there, and that he’s giving his full
attention to the music. Then that helps you fully sort of
honor the music and the process and yourself, and
sort of maximize your creativity. Back when I was a musician, I used to
think of the guy in the back of the room. Then I quit music, did academia, and now when I’m at the front of a room,
it’s usually at an academic conference, or maybe I’m visiting another university,
giving a talk or something. And it’s a different person in the back
of the room now that I think about, and you all should think about too. It’s not the sort of
quiet jazz aficionado, it’s a different guy,
it’s the snarky stats nerd. He’s a snarky stats nerd,
he’s in the back, and he’s got his Linux machine in his lap. He’s on Twitter, and he’s waiting for you to say something dumb, so
he can put it up on Twitter. And he can cross out a box on
his bad stats bingo card, and expose you to the entire world. That’s who I’m thinking about
when I give a talk, and so today, I wanna help you not let
snarky stats nerd cross out this box in his bingo card
when you give your next talk. So this is my motivation,
my first motivation, we’ll get to my second one,
second one’s much more selfish. Okay, so count data, when people talk
about count data, we’re referring to a dependent variable that reflects
the number of times something happens. Some event is observed zero times,
one time, two times, so forth. And there are, of course, a number of
research questions in the social sciences that resolve around count
data type questions. The reason snarky stats nerd has linear model of counts on his bingo card is because a linear progression
model is an inappropriate way for modeling this sort of dependent variable,
for a couple of reasons. What is the correct way to do it,
generally, approaches hinge on some version
of a generalized linear model, it uses a Poisson data
distribution assumption. The hallmark of the Poisson distribution
is that its shape changes as the expectation increases,
the mean of the count variable. As it is closer to zero,
the Poisson distribution has this highly right-skewed kind of peaky distribution. The more the expected count increases, the closer the distribution
moves to a normal shape. So especially when we
have low expected counts, the normality assumptions of the linear
regression model is violated. Another reason to avoid using linear
models and model count data is that, it can, when you draw a straight line, you
can end up making predictions of negative counts, which is not possible. So kind of this similar
rationale to why we use lodgit model to model probabilities. Using the lodgit length and the Poisson
data distribution, the lodgit length by applying the logarithmic transformation
of the right hand side of the model. Takes these linear predictions,
forces them to be positive numbers, which is what we want when
we’re modeling event counts. There’s these couple of classes of count
models that you see being used a lot, either the word Poisson shows up in them,
or negative binomials. But they’re all based on
the Poisson distribution, so what I would like to do now
is take the next two and a half hours to go through
each of these models.>>[LAUGH]
>>Talk about the technical characteristics with pros and cons. Of course, there are a number reasons
why that’s a silly thing for me to say. One of the reasons that’s a silly
thing for me to come in here and say is because we’re all
in the room right now with the person who literally wrote
the book on these types of models. In addition, so step one,
before you begin a project where you’re going to do [INAUDIBLE] would
be getting your hands on this book, talking to Collin, maybe taking his class. There are another couple of books that
i recommend that are especially useful. But I’ll also note that Scott Long himself
describes Collin’s book as, quote unquote, the definitive text.>>[INAUDIBLE]
>>We’re concluding the kissing up portion.>>[LAUGH]
>>But no, so I’m not going talk much about the
technical specifications of the models. What I want to do is just talk about some
sort of practical considerations, walk through how a person using Stata software
might approach modeling count data. We then talk about how a project
I’m currently working on sort of exceeds what
Stata I can really do, talk about some ways to hopefully Work
around some of those limitations. Any questions forward? So when selecting your account model,
the things to consider, generally when deciding
which approach to use, generally have to do with the distribution
of your particular account variable. The first key issue is establishing
whether you have equid-ispersion or over-dispersion of your account variable. So the standard Poisson progression,
the base line model for doing this sort of work
assumes equi-dispersion. Which with the account variable means
that the variance is equal to the mean. Which, in practice,
is a pretty infrequent, rare occurence. Most of the time researchers are working
with a variable that’s over-dispersed where the variance is
greater than the mean. When you have over- dispersion and
you use a standard Poisson model, your standard error estimates are downwardly
biased which puts you in the position of being more likely to reject a null
hypothesis when you shouldn’t. Type one error, which is a position we don’t want to
be in generally as researchers. The negative binomial
regression model adjusts for this over-dispersion by
including additional error term. One of the ways to check whether you have over-dispersion is to estimate
the negative binomial regression and see whether it is error return is
significantly different from zero. It happens by default in this data. And then there’s this other, I mentioned a
couple slides ago, zero truncated models, and we’ll talk about them today. Those are useful when you have a situation
where your counts are only recorded when they happen one or more times. Basically you only start
recording events when they occur, you miss all the zero observations. Zero inflated model is used
in the opposite situation, where you have lots and
lots of zero accounts. For many of the observations,
the event doesn’t occur. Zero inflated models I
think are very interesting. Conceptually the way they
handle this is like, so assuming there are two classes of
observations in the population as group, for which there’s no chance that the event
will occur, it’s always zero group. And then there’s another latent group
of observations that’s not necessarily always zero. So they can have a zero count, but
there’s some nonzero probability that the event will occur one or
more times for that group. And what zero inflated models do, and
hurdle models, or if it’s class as well, is estimates the probability of
membership in both of those categories. Or membership in the always zero category
and then estimates the predicted number of conditional and
not being a zero type of observation. Combines those two probabilities
into a prediction. Paul Allison, among others,
answers this question with a probably not. Zero inflated models are appealing
in some ways because we can talk ourselves into thinking that if there any
sort of dual processes generating counts, the process that predicts having
some non zero probability and then the second process that generates
the actual count of the event. But arguments like Paul Allison and
others make, generally against zero inflated models focus on
a sort of overkill argument. Using a zero inflated model much
of the time is kind of like using a sledgehammer to
drive in a little nail. And at any rate, the Regular Negative
Binomial Aggression Model generally fits the deal just as well as the zero inflated
negative binomial, so why make things more computationally intensive and complex
[INAUDIBLE] things than we need to? So from this strictly data driven
argument, there may be reasons not to use zero inflated models, but given
that we’re social scientists we have to remind ourselves that theory
comes into play sometimes. We’re not just number crunchers,
we’re modeling some social process. There could be reasons why we would expect
there to be these two separate processes predicting zeroes and predicted counts. In this, this is a blog post from Paul
Allison’s webpage, he gives an example of if you’re modelling the number of
children a woman has by age 50. There are some women for whom the
biological or physiological causes make it impossible for them to have kids,
they’re always going to have zero kids. Then among women, you’ll have a non
zero probability of having children. They’re separate processes that
the determined number of cases. In that case, there’s theoretical
reason to model the account generating processes and
the zero inflated approach. Okay, I’m going to move on to
some hands on examples now. So imagine the first thing you
might look for is over-dispersion. These are the data that I’m going to. Where the outcome, this variable I’m
interested in is the number of charter schools that are opened withing a school
district over a 12 year period of time. The first thing that were interested
in is identifying whether there is over-dispersion in this variable. So among the 10,500 school districts,
the average number of charter schools opened 0.5, and
the variance is 11 times that. So for some model students, these numbers
are equal, they’re very clearly not equal. You can also see the distribution of
this variable’s strongly right skewed, and very peaky as well. For the snarky stats nerd.5
>>[LAUGH]>>The reason we have this right skew distribution with this strong peak down
at the bottom is because most school districts have no charter school. 84%, that’s why they
are sixth in the country. Then you have another 8% that has a single
charter school, and by the time you get beyond two or three charter schools,
very few districts represent it. So you have this strong right
skew count distribution. [COUGH] Okay, so what you might do is
estimate a bunch of different models. Your Poisson regression, your negative binomial regression, the
Hurdle and tick some of your estimates and compare them for another,
use like a ratio test to compare them. You could use those models
to generate predictions, and then you also have interesting things, comparing predictive counts across models
or comparing predicted to observed counts. And that can be a really good way
to do things if you have lots and lots of time you’re trying to kill, or
if you really like debugging your own code and trying to figure out what
went wrong over and over again. So that’s the first option or
what I wanna show you is this. Countfit command written by Scott Long and
Jeremy Freese that when using Stata makes this model
comparison much, much easier. Much more straightforward and let’s you
do some interesting comparisons, really you hone in on the type of estimator you
should be using with your account data. So countfit, much simpler unless it’s
totally not and it doesn’t work at all. Let me give you some
background really quick. So in 1994, William Green points
out the zero-inflated and non-zero inflated models are not nested. So you need nested models if you’re gonna
use a likelihood ratio based comparison of model fit depending on which one’s better. Says they’re not nested, so
you need to use a different test. And he proposes using something called the
Vuang test to compare zero inflated and non zero inflated versions
of the count model. Then Allison comes along and
he’s all, [SOUND].>>[LAUGH]
>>He says that model’s, he’s like William Greene claims the models are not
nested because there’s no parametric restriction on the zero inflated model
that reproduces the non finlated model. This is incorrect. A simple re-parameterization of
the zero inflated negative binomial model allows for such a restriction,
so likely the test isn’t appropriate. And then Long and Freese 2014 say,
chill out guys, it’s going to inflate, it really doesn’t matter. Even if you agree with Allison,
you can still use the Vuong test. Let’s just use the Vuong test. And leave [INAUDIBLE] right there.>>[LAUGH]
>>I’m only a little bit being facetious. So this 2012 I’m referring to is that same
blog post I showed you the screen shot of. And there’s a comment section. And that comment section
is like 60 comments long. And it’s these two guys
fighting with each other.>>[LAUGH]
>>Which is its own, you can do a sort of
nerdiness test on yourself. Give it a read back and
see how much you are entertained by it. I loved it, I thought it was great, it was
like watching a boxing match or something. It wasn’t anything like that it
was a lot nerdier and a lot more. Okay, so they’re disagreeing. Long and Freese say it doesn’t matter,
let’s just use the Vuong test. So up until, so as of recently, let’s say like Monday of this week,
you run the Calc fit command that I was just talking about and it includes
among other things, the Vuong test. That let’s you compare the zero-inflated
and non-zero-inflated models [INAUDIBLE]. I’m not sure exactly when this changed. It updates data when I open it. But imagine my shock and
dismay yesterday when I’m going through putting together slides for a talk where
I’m gonna show you how great Countfit is, run it to start producing some output,
and I get this error message. Vuong test is not appropriate for
testing zero inflation. Specify option- forcevuong
to perform the test anyway. But since this just happened they
haven’t updated the countfit command. There is no force Vuong option. There’s also no way to
cut out the Vuong test, which makes the whole countfit
thing blow up and go nowhere. So my I can’t do my talk anymore.>>[LAUGH]
>>Cuz I can’t show you this really interesting command.>>[LAUGH]
>>Okay, so I know you are all thinking right now, you
are thinking first it’s global warming,>>[LAUGH]>>Elected officials are behaving erratically, unpredictably, and
now the Vuong test doesn’t work.>>[LAUGH]
>>For comparing zero inflated and
non-zero inflated models. I feel your dismay. I’m in the same boat. What Does though is it’s nice as I’m
explaining their reasoning here. The main points of this
paper by Wilson in 2015 as demonstrating unequivocally that you
can’t use the Vuong test anymore. So basically,
the distribution of the test statistic for the Vuong test is not standard normal or
key values from our meeting list. The actual distribution is unknown,
which is interesting. It can’t be used for reference. But the take away here, you may consider
using information criteria to choose between the standard and
zero inflated models. Not something most of us are used
to doing, taking our AIC-BIC, compare those, fine, all is not lost. But the problem is,
the software hasn’t caught up yet. I had an old version of SData
installed on my computer. And so I could, I used that,
and went back in and rerun, the good old days is like Monday,
but it’s changes since then. And this is going to be fine again
someday, too, once they let us just submit the Vuong test, it’s a very small
part of what countfit does. Countfit is awesome, a big fan. What a person starts with is specifying
the model as a zero inflated model. When you write in SData the commands
to estimate a zero inflated model, the first part, which is the part that
protects the number of event occurrences. Then after the comma, the inflate option, that’s the other model that predicts
always being in the zero category. Here I admitted all of the flow variants. And I’m just showing you a portion of
the output, but the first thing that countfit gives you is the point
estimates for the slow coefficients and their standard errors across
each of the different models. And so in this example,
we can go through and see. The thing that jumps out at me first
is that the negative binomial and the zero negative binomial. Pretty different, with respect to
the coefficients they’re predicting for the effect of being in a, this is
an established Latino gateway school district, on the number of charters
that open, and that sort of. These are, I believe exponentiated,
coefficients, so odds ratios, in the count model world we
call it incidents just a [INAUDIBLE] IRR. So I wrote it with an interaction there. You can specify what you know,
the complex model, the thing you’re actually going to
end up interpreting down the road. One caution I would give you though is, so countfit, what it does is
actually goes and estimates each of these models because quietly, the state
of [INAUDIBLE] behind the scenes. If you have convergence problems with
any one of these you won’t see it. It will kind of look like it’s frozen or
whatever finish, but you can’t see what the problem is. So I recommend going through and
fitting each one of these individually. Just to make sure that all go
into the in this countfit. So this would go on for as many
covariances you have in your model and then down at the bottom, statistics. Going to be preferable. The next thing countfit spits,
so this is just one, I just wrote half of that in my model and
that follows comes out. For each of the models, it gives you
the value of the account variable for which the estimator is
most wrong basically. The difference between predicted and
observed counts. The Poisson model was most wrong for
predicting zeros. It under predicts them,
this is pretty common. The negative binomial and zero inflated
negative binomial, their biggest wrongness is substantially less than the biggest
wrongness for the Poisson model. You can see it gives you
the average wrongness over here. Other measures of which model is doing
the best job of fitting the data. Then it will give you this sort of table,
once for every estimator. It gives the actual
probability of certain count, the predicted probability of that count,
based on that model. The difference, whether those differences are significant
adds up the total wrongness over there. And so you can sort of read
across these four tables and you get a sense of how these
different estimators are doing. Then countfit gives you a graphical
presentation of the same sort of information. So here it’s the observe minus
predicted counts at each value of your account variable. What you wanna see is lots of is
basically zero, [INAUDIBLE] are the same. My goodness. Poisson is doing a terrible job,
[INAUDIBLE]. Last thing you got is this formal
comparison of different statistics, it’s nice. Over here you get a sort of narrative
summary of how strong the evidence is, for one over the other. Tells you which one is
preferred over which. This is that God forsaken Vuong test,
that we should not pay any attention to. But this table gives you a summary of,
which model is doing the best job. So, running this one command
does all of this work for you. It lets you onto one model. So the last piece that I’ll talk of today,
the title of the talk. I haven’t talked about
multilevel models so far. But each of these hypothetical questions
that I started out with can be thought as a multilevel sort of process whether it’s
repeated measures within observations over time or spacial or
bureaucratic cluttering of observations. It’s nice, it would be nice to be able
to use multilevel models, fixed random, fixed effects. Models with this sort of zero-inflated framework to model these count off There’s
no reason that this shouldn’t be possible. So each of these are just a sampling of
papers laying out how someone would use a multilevel framework, the random effect
framework or mixed effect framework with the zero-inflated account model
using in this case, SAS MIXED, LIMDEP, S-Plus but conspicuously
absent from all of this data. There is no pre-packaged or
user in set of commands that would let you do multi level modelling for
zero inflated account data. Here is my,
I wanted to make sure I get to this so that the smart people in the room
can tell me whether this is okay. So I’m not sure. Here is my proposal work around. What I’m interested in here, so I’m
measuring, I’m trying to model the number of charter schools that are formed in
school districts across the US over time. But there is reason to suspect there will
be some between state variations, for a number of reasons. Chiefly, policy differences
that some states don’t have any laws that would allow
you to open charter schools. The laws that have been passed in certain
states or passed at different times having sort of different period of rest for
charter schools that have been opened. What I wanna do is make
within state comparisons. I wanna use state fixed effects when
estimating this zero inflated negative binomial model. So what you’re doing
includes safe dummies in both portions of the zero inflated
negative binomial model and then use cluster of standard
errors at the state level. Okay? This is my best sort of a shot
at doing whats data doesn’t have mechanism for
letting you do, as it stands. The last little trick I’ll show you,
as I’m sure I’m pretty much out of time. Am I all the way out of time?>>No, you have two minutes.>>Two more minutes. So, the reason I have this version 14.2
up here, if you’re using version 15, the list collapse command doesn’t work,
it hasn’t sort of been updated. This is a little trick, we call up a previous versions of the data
to make commands work with each other. One of the downsides to using this first
fixed effects is I get, so there’s 52 states because we have DC and Puerto Rico
so I have that many dummy variables twice. Once in this first part of the model and
once and second parts are 104 coefficients
that are just sort of meaningless. I don’t want to look at them. This coeff lets you select the independent
variables that you want to see coefficients for and lets you tailor how
you want those coefficients presented. How I asked for them to be presented is,
as the raw coefficient, the test statistic, the p-value. And then these percents columns
are really, really handy for these sorts of models. You get the percent change in the expected
count for one unit increase in x, or one standard deviation increase in x. And the other thing I really like is next
to the one standard deviation increase in x column, it gives you what the standard
deviation of that variable is. It’s a small little thing, but
it’s very, very handy [INAUDIBLE]. Because my model has interactions, this actually isn’t a very
good way to interpret things. It’s a logistic progression model,
interactions and logistic progression are complex. The margins, the margins plot
commands work very well with zero inflated negative binomial,
zero inflated plus sign. Sorry I’m rushing and wearing myself out. [COUGH] What I have here is the, so the predicted count of charter
schools in school districts. As the change in the percent switch of
the residential population is Latino, over time, from the previous
decade increases separately for school districts that
are established Latino destinations. The non established destinations here,
the interpretation is when the Latino child population is increasing in relative
size like a new phenomenon when there didn’t used to be Latino
kids in the district. A bunch of charter schools opened. It’s about exactly the same thing. And the destination, [COUGH] sorry, a school district that has a longer
history of Latino child [INAUDIBLE]. This is the average marginal
effect of destination type, that’s the way of doing a significance
test with the two slopes. With the confidence interval,
doesn’t have to include zero. It’s an indication of a significant
difference so, we interpret it as signs like once the Latino child population
increases by more than about 20% it’s more than a previous decade to see this
significant difference between established and
non-established destinations. But we’ll stop there. Sorry for rushing. Thank you.>>[APPLAUSE]>>Any questions? All right.>>So, the question is [INAUDIBLE]. We have a data set that is not half data. It’s officially the percent of time that
students have sort of behaved in class on the zero to ten scale,with ten being
very behaved and zero being not at all.>>Mm-hm.
>>But if you were to look at the distribution
it looks a whole lot like a Poisson Distribution. And we have been going back and
forth about whether or not to use the Poisson Distribution
even though it’s not theoretically or to try to come up with something else and
I was wondering if either one of you had, given that this is the topic of your talk,
if you had an opinion about that?>>So if you treated the, so
it looks kind of like one of these?>>It looks like the first black line.>>Yeah, so Poisson distributions are
typically described as rates, you know, in terms of rates [INAUDIBLE]. I’ll have to wait, it could be. I don’t know. This is another one where
I shouldn’t be answering.>>So let’s say on that one,
your thought it was a continuous measure, between zero and a 100, you can re-scale it to zero and
one and do a logical protocol. [INAUDIBLE] proportion state. I think that would be,
I know you’ve got some wonkyness but still I think that’s what
I would consider doing. Because a larger model you could
apply not just to zero one data. It’s for something where the conditional
meaning lies Is between zero and one. Similarly, I’d say with account data. Acount data is not restricted to counts. It’s really for any model where
the conditional model is exponential, ex transposed figure and, so
that answers that question. Just a couple other things. With the fixed effects, by the time
you get to a quite parametric model you have the incidental
frame of this problem. So the fix effects are in the state. As long as you have a considerable number
of observations per state, you’ll be okay. But in your example, it may be that
you want to drop some small states, states who just have
a couple of observations. [INAUDIBLE] but
I think by the time you get up to certainly 30 observations in
the state you’ll be five, maybe 20. And then you gave this
introduction I would say another big reason for doing counts. Two reasons, doing as Poisson. I think in most applications
with account with exchanges, if someone gets a year older I
don’t think that means point three more visit to the Doctor,
I think it means. What, a 5% increase. I think all the effects are multiplicular. And in that case, you either want Y has
conditionally exponential exponents or you could go log Y is exponential
You go the log route. You’re taking the log of zero
at times which is problematic. But also, at the end of the day,
I want to produce the y, not log of y. All right. And then the other reason for doing it, is that how hard is it to go from
regress y comma x comma v c u, Yo poisson y x, we see [INAUDIBLE]. Once upon of time it was incredible. And then our long and freeze,
it’s a really good book, right? The book that Ryan and I have on counts
is more of a research monograph, so it’s doing much more [INAUDIBLE]. It’s just not gonna be ethisized,
the basics, as long as piece is very good. Also, the Department might
have the chapter on counts, it’s kind of the essentials, and
that would be a good place to start. This has been a lot of fun actually,
and it’s unusual for us to have so
much detail on a given set of data. I think, in future times, next,
you will have one on something else. [INAUDIBLE] or something. And finally with counts, the data often
requires you to [INAUDIBLE] Right? [INAUDIBLE] Is often you ask people that
actually kind to the doctor’s office and [INAUDIBLE] But
you don’t see the ones that never came. So you are forced, so the parametrics, and I think item I think maybe things like
a binomial actually work quite well.>>Great thank you.>>All right, so next we have Maggie Molsh
from the School of Education.>>All right, so, I am Meghan Welsh, and I do actually know what you
are going to do [INAUDIBLE]. I work in the school of education and
I am a [INAUDIBLE]. I care a lot about measures, and how we
measure what people know, think, believe. So one sort of shift in thinking
that I think is necessary in the world of cyber methods,
is that I am no longer interested in thinking about the effect
of something on people, where I am sampling in people,
or district, or type of organization from
a population of people. Instead, the work in psychometrics
starts with the idea that there is An infinite universe of items,
that we sample from create a task. And so, what I am interested in,
is statistics around items. Just as when we’re measuring people,
we have measures both on safe tests and on people. When we’re measuring, when we’re
evaluating the property of test items, we do the same thing. We have some data on people and on items. But, it takes a little bit of a shift in
thinking to think about the fact that, what I care about when you’re
dealing in measurement, is the emphasis we can make about items,
not about people, because those sorts of
variables are in my analysis. So on that note,
most of what I’m gonna talk about today, sorta falls within the realm of
something called test validity If you’ve ever taken a psychometrics
class, and it’s at all been a while ago, you should know that, sort of the definition of validity
has changed relatively recently. And what we’re really concerned about is,
the degree to which evidence, and theory support both the interpretation of test
scores, and the proposed uses of them. So we’re not only concerned with,
when you look at the score, dedicate how well does
a kid understand math? But we’re also interested in how that
information about how someone understands math is being used. Can you use that information
to make a change in policy? It would be appropriate for that use. Would it be appropriate to make
an inference about how good a job a teacher is doing? Those are the types of things that we
need to sort of additionally investigate, whereas even so recently we say, less than ten years ago, we were
only concerned with interpretations. I, today gonna talk about something
called instructional sensitivity, which is one aspect of validity. And what this has to do with,
it does have some policy relevance, is that the grade in which
students performance on the test, reflect the quality of instruction
that students are receiving. So you may be familiar with
lots of educational reforms, that have come out where we tend to quite
explicitly make inferences about schools, or about teachers,
based on student level test scores. So during No Child Left Behind,
there was a lot of very sort of rigorous evaluations of schools,
based on test performance. More recently, in states outside of
California, California lives in some ways in this wonderful bubble, teachers
are now being evaluated in this way. So there are many states in which there is
a state level teacher evaluation system, where student performance on test
aggregated out to the teacher level, is used to design whether to label
a teacher as someone who’s successful, or someone who is failing, and therefore
need sort of intensive intervention after professional development,
after supervision from their principal. And so, in particular, what I’m
concerned about, is the way that we think about tests capturing the effect
of-it can be teaches or schools, but I’m gonna focus on teachers today. I also wanna point out that in many,
many areas of research, we use test scores as a factor,
as a proxy for the effectiveness of some intervention, that were
having teachers implement, right? So it’s easy to adjust the policy concern. This is an education of a such concern,
from the perspective value, often do thing to teachers, and then see
if that thing we did to teachers, or we trained teachers to do differently,
changes how kids do. And in reality, the way that
we tend to evaluate tests for, in terms of test validity,
never takes into account whether, or not, they can be used to make inferences
about instruction at all. And in fact there’s a far amount of
work by people like Bruno Zumbo at the University of British Columbia,
who has looked at things like the factor structure of measures, when you measure
them at the individual level, and then the factor structure when you
measure them at say the group level. And with the same data set,
you can have entirely different factors at the student level, and now when
you aggregate that to the teacher, it looks like you’re measuring
different constructs all together. And this talk today is gonna focus
specifically on instructional sensitivity of items. There’s also been a lot of work on
instructional sensitivity of test. And in the scholarly side of things,
not in the test developer side of things. And I’m particularly interested
in item level evaluation, because it gives us information about
which test items we have to improve in the test development process. So as I said, test and item sensitivity, are not currently evaluated by highs
takes testing programs at all. My adviser, and I wrote a paper evaluating
the test sensitivity of a test in Arizona, and I think it was 2008, and
that’s the last published piece that I’m aware of that looks, at this at
examining of a particular instrument. When we think about item sensitivity,
there is no one established method. And the reason for this is that,
when you think about how large scaled tests are structured,
there’s usually about 50 items per test. There’s thousands of teachers per state,
hundreds per district, and 25 students per classroom. So we have a problem where often don’t
have, for the number of items we have, then when we get up to the student level,
things don’t estimate correctly because we don’t have enough students to
then make good teacher estimates. So historically, sensitivity has been an evaluating just
with a quasi experimental approaches. One approach is that you would give
a test, you hired a group of kids, you have a teacher provide instruction
on the content of the test and then you give the test again. And there’s a bunch of different
statistics you can populate to think about the differences in the pre-test and
post-test scores. If the items that go up the most are the
most sensitive to instruction, on average, and the ones that don’t change must
not be sensitive to instruction. So if it gets much easier to
answer the item, it’s a good item. And if it’s not, there’s no change,
it’s a bad item. There’s also some work in this
bill called opportunity to learn, where we actually go out and measure
what teachers are doing in classrooms. Are they teaching the skills
that are on the test? Are they doing a good job
of teaching those skills? How much are they emphasizing
what’s on the test? Also putting in some other things
like overall achievement and student demographics to think about how well these
two things predict item performance. If things like teaching this
skill that the item is measuring. Improves the probability the answer
of getting the item correct, then it is a sensitive item and if OTL doesn’t seem to predict
item performance, then it’s not. There’s a couple of
problems with this method. One is that it’s burdensome because you
have to do additional data collection, beyond collecting operational test data. The operational test data is just
the stuff you give to kids in the testing moment. Let’s go back. If we use something like this in
particular and the instructional effect. If you think about a state test ,it
normally gets that measurement and it gets that number sense and it gets
that geometry and it gets that algebra. So you end up having to have a lot
of different lessons here to measure the effect of instruction on the test And
in addition to that, figuring out what’s actually going on in a
classroom is really, really, really hard. Because there are about
180 days of instruction, and things change a lot day to day. I bet anyone here who’s actually taught
even in a university classroom, and sort of acknowledge that any given
lecture may be their best lecture or maybe not quite their best lecture. So I’ve been playing around with and I really do want people to
write other ideas here, statistical models that we might use
to figure out, to item sensitivity. And what I’ve been doing is borrowing
a lot from the teacher effects literature and trying to think about teachers
effects on either performance. I wanna acknowledge that there’s this
whole literature out there on teacher effects or themes like multi
level measurement models. The people who write these papers aren’t
even aware that the instructional sensitivity exist. And then people who are aware that
the structural sensitivity exists, for the most part, are not methodologists. So there has been very little sort
of meeting of the minds here. And so here’s what a multi-level
measurement model looks like, what we’re talking about is predicting the
probability that a student answers an item correctly. With the item level performance on
all the items on the test except for that one item that you are predicting. And then we have estimates. We are taking the clustering
of the items with student into count and
in In an IRT model, which is again, predicted on the probability of
answering an item correctly, we would have a random effect just for
the intercept at the student level. And so, what we can do then is estimate
the, you have a probability of answering the item correctly that’s based on
the student level random effect. Which, Jason just referred to as, or
Jacob, just referred to as error or noise. But in our thinking,
if it’s at the student level, then it’s something about the student,
difference from the typical student is a characteristic of how
much the student knows. And then we take off the item difficulty,
And it’s a logistic model. Questions about this model? We can add a third level to
get to that teacher effect. And again I bet when we are just treating
that as a measurement model, not trying to get at the teacher effect on each item we
just have an error for the teacher here. Which is sort of the teacher,
the difference of a hidden teacher from the average teacher writes
that stat residual. Now the item in what you know and
both the characteristics of the student, whether they are above the residual or
not. The characteristics of the teacher, whether they are above
the average teacher or not. Minus the characteristic of the average
item and the particular item So this is just,
if you’ve heard of item response theory, this is just item response theory
written as a multilevel model. If I want to estimate
the actual teacher effect, now I have to add in a whole
much more error terms, right? So each one of these If the attitude and
effect on any given item, right? And then this is the difference
of any other particular student from the average student, right? So this is the student level residual for item two and this is one for item three
and this is the one for item four. So I can see how much variability there is
in student performance, around each item. Then I have the same thing for
the classrooms. How much variability is there of teachers
away from the teacher level average performance in terms of the probability
of answering specific items correctly? And again, I’ve moved from the world
where this is noise to where this is some sort of meaningful effect of a teacher. Given what I said about
the structure of my data, can anyone predict what
the problem is with this model?>>So you get a shoe of all these errors,
are normally distributed? Are they normals, is that?>>Well I would have to,
except think about if I have 55 items instead of bk- 1j,
it’s literally beta naught to beta 49. And then at the student level I only have 25 students to estimate the variability
around each item average item performance. And at the teacher level I’ve got
at least 100’s of teachers here. So here’s what happens when
I actually try to run it. It just wont even converge. Right, cause what I’m trying to estimate,
the number of parameters I’m trying to estimate is greater than the number of
observations left, data points I have. And in fact, when people sort of study
multi level IRT, through simulation studies using a [INAUDIBLE] They’re almost
always focusing on about five items. And I’m pretty sure that more than five
items is where things just won’t converge. So the problem we really have is that
the way that we’re approaching multilevel modeling from a basic
research perspective, doesn’t actually apply to sort of
operational testing environments where we’re dealing with
50 items per grade level. Yes?
>>How many students would you need for the [INAUDIBLE] converge? I’m thinking about like schools where
a teacher may have several different classrooms of the same subject, so that in
fact they have more than 25 students but even then it might not be enough. So I’m just curious.>>So in the high school level,
this might work. In the high stakes testing world, there’s usually sort of one grade
level in high school that has a test. So that’s a really good point and I haven’t played around
with the high school level. There has been some work around a
different kind of modeling that I’m going to present, that has found that things
sort of hold up pretty well once you hit around 50 students per classroom. So I have two workarounds I’ve been
playing with in an applied way. One is that I just estimate
the two level model or I estimate the variability around the
student and I sort of drop the teachers, I’m sorry on the classroom, and I drop
the student level out all together, right? And if I brought the kids out,
I can get classroom level residuals. And if I do that, you might have
heard something called [INAUDIBLE] in the education world, it was a buzzword for a while because it’s the way that test
scores are used to evaluate teachers. And what you do essentially,
is fit a regression line for each student predicting their test
score based on prior performance and lots of student demographic
characteristics. And then you look at
the magnitude of the residual, the difference between the predicted and
the observed score for that student. And you aggregate all those
residuals all within a classroom. And if on average kids are doing
better than projected, then the teacher must be successful. And if the kids are doing worse,
then it must be the teacher’s fault. So I sort of borrowed that idea and
instead though, what I’ve done is, I’ve sort of
aggregated up the student level, sorry the classroom level
residuals around each item. So I’ve done the student level residuals
around each item and I’ve aggregated them to the classroom level but
instead of looking at the mean residual, I’m looking at the variability between
the fast groups and the mean residuals. But this is like, I don’t know,
it’s like a back of an envelope, not necessarily statistically
correct approach. And then the other approach I’ve used,
is something called item difficulty variation, which is
developed by [INAUDIBLE]. And at this point, they also just sort
of get rid of the student [INAUDIBLE]. Instead of looking just at the student
level, they get rid of the students and just look at the classroom level. And so this is a very similar sort of
situation where they thought the average classrooms test performance and then they
think about, then they take off what would be sort of the average difficulty
for an item, the difficulty of an item and they adjust it for sort of the mean
test performance across the states. So is this item more difficult then
typical items or less difficult? And then they add in a random difficulty
item, and that’s associated with sort of how much more difficult is this item in
this classroom after we’ve accounted for average performance in the classroom and
general difficulty of the item. This will converge and run, so I can, but you’ll notice that as opposed to this
model where I’ve incorporated all of the items into one model,
this has to be run item by item. So when I do that,
what’s the big problem if I have 50 items? This is like a very basic
statistics question and I have several graduate students in
the room who I’m not afraid to call out.>>[LAUGH]
>>[INAUDIBLE]>>Yeah, I have like a type one error issue, right? Where I’m sort of testing things over and
over and over again. And when you get to 50 items,
if you wanna even make some sort of post hoc adjustment, you’re talking
about very, very, very small numbers. You have to have a very small probability
of having the result by chance in order to have it work. It also ignores the student
level information altogether, which seems problematic as well.>>So these aren’t, are they likely to
be independent each of those tests or will there be positive correlations?>>The tests?>>When you did the 50 items.>>So the items themselves,
in most testing situations, are assumed to be independent. That’s quite common in the testing world,
especially when you get to situations like if you’ve ever taken a language arts test,
usually you read a passage, right, and then you answer
several items from the passage. And so we know that they’re
not really independent because they’re all linked to the same passage.>>Well I think it’s not so
much the item itself but the statistic that you’re using, that
you’re getting at the end [INAUDIBLE].>>The probability of
answering it correctly?>>No I think, you said this problem
with doing, on the next slide.>>Ignore student level information, yeah.>>Right.>>Yeah.>>I thought you said that doing this
test separately, 50 times, this->>Yeah, as I do the items 50 times. And so you are right that you can do this
either where you get the overall item, we take the overall test
performance into account, right? And so there are models that sort
of take overall test performance, holding out the item
that you’re evaluating. So it’s overall test performance on say,
49 of the 50 items one at a time, to adjust for that fact that the item
would be part of the ability measure. Is that the question?>>I can’t talk [INAUDIBLE]
>>Okay, there’s a final general approach I wanna talk about, which is called
multi-root confirmatory factor analysis. And what we do there, is we have a factor
analysis where we have a factor loading. We have item indicators that are sort
of loading, that are predicting math achievement, we have sort of an average
performance of the item difficulty, and we’ve got the relationship between
the item and the blade construct, that’s the lambda there. And we can create these sort of factor
models for, often you will see these when we’re validating tests where it’s just for
all students, all together. But we can run them for
sort of, at the teacher levels. So we do it for each classroom or for each
school, and then we compare whether or not both the difficulties and the loadings
are equivalent across the classrooms. And if there are any differences in
the relationship between the item and say overall math achievement,
then we take that to mean that there’s something different about what
math achievement looks in this classroom, that’s related to the item. So if this is a stronger relationship and
this is algebra, and over here there’s a stronger relationship
here in it’s number sense, then we’re going to assume that this teacher
teaches number sense really well and this teacher teaches algebra really well. So then we could make some inferences
about differences in instruction or teacher effects that are being detected. But in reality, the problem we have,
is that when you think about multi-group measurements it varies, but
these kinds of studies, again, we’re normally talking about small
numbers of groups, often just two. But if we have even, say, on one data set
I was working with, I had 40 teachers. Now I have 40 different CFA
models to compare, and that, again, It will actually converge,
I will give it that, but it won’t generate fit statistics,
which seems like a pretty big problem. And I had a student who worked on that, cuz she was really convinced that
the fit statistics were not a problem. So we ran some simulation studies,
and found that our parameter estimates were really, really
biased, that we couldn’t trust them. Although, if you move up to clusters
of 100 students in a classroom, Aspera, Hook, and
Moutin suggest that it does work. Okay, so that’s sort of, it is just different ways I’ve tried
to run multilevel measurement studies. And the problems with trying to generate,
to try and think about the effectiveness of
a test to measure instruction, given the way that schools
are structured today. Questions or thoughts, things I
might try that I haven’t tried yet?>>[APPLAUSE]>>[INAUDIBLE].>>What, converge, it-
>>[INAUDIBLE] too much time, or it’s not adding time, [INAUDIBLE]?>>It basically gets looped up and
never generates estimates. You can leave your computer on for
five days, and you can adjust the number of iterations upwards to
the point that it seems ridiculous. And it won’t ever actually hit the,
it will never come to a solution.>>[INAUDIBLE] is not justified, maybe if you change other words and
stuff somewhere.>>Yeah, so I’ve actually talked to
people when I was trying to make this particular stuff work, here. I actually spent some
time with Linda Mutane. I don’t know how many of you
are familiar with Amp Plus, but it’s a very big, sort of very
popular psychometrics package. It also does a lot of interesting
modeling, really, in general, and they just kept saying. So you’re trying to estimate 4,000
parameters with 12,000 participants, that’s just a silly idea to begin with. And in particular, this approach, we used
Bayesian and [INAUDIBLE] because Bayesian modelling, it is better at dealing with
small numbers of a small sample size. But you can’t fix everything
with Bayesian statistics. If you’re trying to make too many
inferences from too little data, there’s very little I think you can
actually do about it, at least, that’s the conclusion I’m coming to.>>Is it correct to say that none
of these methods will let you recover the [INAUDIBLE] between
student variants estimates, that the number of students is
too small in the classroom?>>So what I can do is, if I take the classroom out I can get
between student variance estimates. So one of my workarounds was to
sort of get rid of level three and just look at the between
student variance estimates. And then to take those estimates and
aggregate them up to the classroom. And by hand, get a proxy between classroom
variance estimate, by just tapping the variance between the aggregate
classroom, these aggregated residuals. So that was my back of
the envelope approach, but I’m sure there’s all kinds of
reasons why that’s a bad idea.>>[INAUDIBLE]
>>But I don’t know, I was waiting for someone here to be like,
why on Earth would you ever do that? But that’s the closest
thing I’ve come up with.>>Okay, so this is one of the big
divides between economics and the rest of the world. We do not do multi level model,
except for random effects model. And that’s why I’m not [INAUDIBLE] to,
probably couldn’t anyway if I knew it, but I’m not the person to ask. But I’m curious to know, first of all,
who here actually is part of the research, either now or going to be soon,
to be doing multi level modeling? All right, so
we’re in two different worlds here, okay. And the second question is, it already
came up once to [INAUDIBLE] economics, but it wasn’t well advertised. Chuck Huber from Stata will give a talk
on structural equation modeling. Who would be interested in
having him come here to do that?>>You know that psychology offers
amazing [INAUDIBLE] classes too, are you familiar with that?>>Yeah, but the thing is,
a class is a class.>>That is true.>>[LAUGH]
>>The benefits of this is much more efficient, and
I can read books and so on. But there’s many things I need to know, the people around me don’t do
structural equation modeling. I’ve actually worked on it,
cuz it’s just a question of time, as to how much we put in the new
edition of [INAUDIBLE] Stata. Just to point out to people who don’t use
it what they could be missing out on. But I’m just saying,
it’s two different worlds, yeah.>>So
this isn’t really a stats question, but would you be able to get anything
from grouping items together?>>So there are these things called
differential bundle functioning and differential test functioning models,
that do allow for clustering of items. And that is actually the next
place we’re starting to move. I’m still trying to really
understand sort of theoretically that the items are clustered
the way that Paul was describing. And they’re considered to be independent,
but I’m not deep enough into the meaningful differences in the models
yet to feel comfortable running them.>>Okay, thank you.>>[APPLAUSE]
>>I apologize. I closed it and it went to sleep.>>Okay, so last speaker is
Eduardo Estrada from psychology. And I’ll give you a five minute warning,
25 minutes. Good afternoon, and
I am a Rhodes Scholar in the second. I will be and I’m going to share some of
my ideas that we have been working on, and I would really like to
hear your comments about it. So feel free to tell me and
say [INAUDIBLE]. For this product, we are collaborating with some
colleagues in the school of medicine. This is, we’re talking about [INAUDIBLE]
method we want to propose, but in this context, we apply it to reading abilities,
the development of reading abilities. So in pretty much all
fields identifying change in the individual level is very useful,
but not in psychology. For example, if you are a physician, and you are planning intervention,
a behavioral intervention, you might want to know which
particular persons are not changing or not changing enough, so they need some
pharmacological treatment, for example. In the context of inflation, for
the usual question, you often have them. Okay, tell me, which children are not
catching up with that in class? Which ones? Which specific person [INAUDIBLE]. And the traditional methods for this
[INAUDIBLE] statistics focus on samples, and the whole sample [INAUDIBLE]. So, our purpose is to propose a method
that can predict individual change and identify the scores that show that
person is not changing enough or is changing too much. [INAUDIBLE] And one of the things I’m right by is I try to propose things
that can be actually used by people. And when I say people,
I don’t mean the people in this room. I mean the people in schools, which often don’t know as much
as we do about statistics. So I try to keep things simple. To [INAUDIBLE] that might
Something that might have some [INAUDIBLE] too; you can [INAUDIBLE] something. So, what are we doing? Linear regression. How many of you have used linear regression?>>[INAUDIBLE]
>>[LAUGH]
>>Some people have [INAUDIBLE], not aware. How many of you have computed the confidence interval for a linear regression prediction? So this should be about half; here, here. So in a linear regression, let’s assume this is a [INAUDIBLE] score. This is the time one score, and this is the time two score. So we are trying to predict the time two score from the time one score, and for this point estimate we can compute a confidence interval. The way people usually do this is to compute the confidence interval for the mean prediction. But you can also compute a confidence interval for each individual prediction, okay? Am I making sense? And the way of computing the two is different; there is a different computation, a different formula, [INAUDIBLE] for a different level of [INAUDIBLE]. The confidence band for the mean has this shape, and if we compare it with the band for the individuals, it is much wider; a similar shape, but wider.
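For readers who want the algebra behind that distinction, these are the standard textbook intervals for a simple regression of a time-two score on a time-one score; the notation here is generic and not taken from the speaker’s slides.

```latex
% Model: y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, fitted on n people,
% with residual standard deviation s and a baseline score x_0 of interest.
% Confidence interval for the MEAN response at x_0:
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\,n-2}\; s\,
  \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}
% Prediction interval for ONE INDIVIDUAL with baseline x_0:
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\,n-2}\; s\,
  \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}
```

The extra 1 under the second square root is the individual’s own error variance, which is why the individual band is much wider and does not shrink toward zero as the sample grows.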
But how can this band be used to interpret change? Well, we can look at where each person falls [INAUDIBLE]. This person shows an acceptable change [INAUDIBLE]. This person changed more than [INAUDIBLE], but is still within what we would expect [INAUDIBLE]. This person [INAUDIBLE] stops at three. And this person is crossing the limit, so the change is bigger than expected, even if he or she [INAUDIBLE]; this is higher than this. [INAUDIBLE] One feature of this is that, judging by the [INAUDIBLE], we are testing each case against the whole sample, okay, as compared to a single-case exam,
for example. So we basically use an [INAUDIBLE] as a measure of [INAUDIBLE], and we can set the confidence level that defines a typical change. If we compute the [INAUDIBLE] confidence interval for the individual, we are testing a null hypothesis, which in this case states that these cases are from this population. So these two are not from this population, [INAUDIBLE]. Okay so far, does it make sense? So any score above the upper limit means that the person actually changed more than the average group change, the expected change for people with that starting score; and anything below the lower limit means the opposite. Even in the context of [INAUDIBLE], this person [INAUDIBLE] is enough to [INAUDIBLE]. This method has been used [INAUDIBLE] in the context of neuropsychology, with [INAUDIBLE] cases and two [INAUDIBLE], showing that it works very well, with [INAUDIBLE] type one errors. So when you set the confidence level, you control the rate of type one error. So what we are doing here is to go one step beyond and extend the [INAUDIBLE], not to two time points, but to several time points. [INAUDIBLE] This is not an actual autoregressive model, because we are not saying that [INAUDIBLE] what happened before. We ask whether [INAUDIBLE] will give this information, okay?
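To make the flagging step concrete, here is a minimal sketch in Python of how a prediction-interval check of this kind could be implemented. The data are simulated stand-ins and the column names (grade1, grade3, grade5) are illustrative, not the study’s variables.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in data: reading scores for 500 children at grades 1, 3, 5.
rng = np.random.default_rng(0)
n = 500
grade1 = rng.normal(100, 15, n)
grade3 = 0.7 * grade1 + rng.normal(30, 8, n)
grade5 = 0.5 * grade1 + 0.4 * grade3 + rng.normal(12, 7, n)
df = pd.DataFrame({"grade1": grade1, "grade3": grade3, "grade5": grade5})

# Regress the later score on the earlier scores.
X = sm.add_constant(df[["grade1", "grade3"]])
fit = sm.OLS(df["grade5"], X).fit()

# 95% interval for each INDIVIDUAL prediction (not the mean).
bands = fit.get_prediction(X).summary_frame(alpha=0.05)

# Flag children whose observed grade-5 score falls outside their own band:
# below the lower limit = changing less than expected given their history.
df["flag_low"] = df["grade5"] < bands["obs_ci_lower"]
df["flag_high"] = df["grade5"] > bands["obs_ci_upper"]
print(df[["flag_low", "flag_high"]].mean())  # each should sit near 0.025
```

The obs_ci_* columns are the individual prediction limits, as opposed to the mean_ci_* columns for the mean response.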
>>[INAUDIBLE]
>>This is for computing the confidence->>I think it’s easier if you go back to the picture. Stop there.>>[INAUDIBLE]
>>Yes. So the first confidence band is based on the fact that we got a beta hat and not a beta; it’s just controlling for estimation noise across everything. Now for that one, you can control [INAUDIBLE], you can get [INAUDIBLE] correctly. To get the second one, you’re having to make assumptions about the [INAUDIBLE] and the [INAUDIBLE]. My guess is that it’s assumed that the errors are [INAUDIBLE], all right? You’re adding in the same amount everywhere. But actually, you could argue that out in the tails you should be having a bigger one.>>So if this is going to [INAUDIBLE], there is no displacement here.>>But I think it’s coming across as beta hat rather than your assumption. Well, if you come back to your equation, you are having [INAUDIBLE] with the [INAUDIBLE] term. All you [CROSSTALK]
>>I have [INAUDIBLE] at the end of the presentation [INAUDIBLE].>>But you are going to have to say something about what that [INAUDIBLE] is for an individual. Yes, that one is saying everyone has the same individual sigma squared.>>[INAUDIBLE]. That’s not [INAUDIBLE].>>Actually, I wouldn’t show that, no, it doesn’t [INAUDIBLE]>>Yeah, it’s a [INAUDIBLE].>>Yeah.
>>That’s something [INAUDIBLE]. [INAUDIBLE] individual change. It doesn’t go to zero; it gets to a point where you cannot make it thinner with more time points or more [INAUDIBLE]
>>It’s a variance [INAUDIBLE] that has always been intriguing [INAUDIBLE]
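The point being debated can be written compactly. Under the usual OLS assumptions, in particular a common error variance for every person, the variance of an individual prediction error splits into an estimation part and an irreducible part; this is textbook material rather than anything taken from the slides.

```latex
\operatorname{Var}\!\left(y_{0} - \hat{y}_{0}\right)
  \;=\; \underbrace{\sigma^{2}}_{\text{individual noise}}
  \;+\; \underbrace{\sigma^{2}\, x_{0}^{\top}\!\left(X^{\top}X\right)^{-1} x_{0}}_{\text{uncertainty in }\hat{\beta}}
```

More data shrinks only the second term; the first one stays, which is why the individual band stops getting thinner, and why heteroscedasticity, a sigma squared that differs across people, would change who gets flagged.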
>>Okay, so the first thing we need to do with this idea is to apply it to some interesting empirical data. We are using data from the Connecticut Longitudinal Study, which is an ongoing study following a very large group of children for whom measures of learning, attention, and reading disabilities, not abilities, are taken. This is the sample size of the study, and these kids were tested at first grade and then annually assessed all the way up to grade nine. [INAUDIBLE], but basically, the main idea here is that they have [INAUDIBLE]. So the measures I’m using here are from grades 1, 3, 5, 7, 9, particularly the [INAUDIBLE]. That’s a measure of their non-verbal ability, [INAUDIBLE] the [INAUDIBLE] from the [INAUDIBLE], that’s probably more narrow abilities.>>[INAUDIBLE]
>>This is the way the data look [INAUDIBLE] in a subsample of 45. As you can see, it is very common in the context of education [INAUDIBLE] raw number scores; if you have IQ scores, the [INAUDIBLE] scores, the mean at each grade is 100. So in these curves, the higher the curve, the more ability, and you can see the deviation of each person from their mean. And because this is addressed first to [INAUDIBLE] educational leadership, we are using the raw scores, but the same method can be used either way.>>Raw scores, that’s the number
correct or the percent correct? The raw score, is that the number of items answered correctly or the percent of items answered correctly?>>In these three cases it is.>>So technically they’re ordinal, not interval, because the differences in item difficulty are not being taken into account.>>Right.
>>So the IQ method is a better one for parametric statistics, from that perspective.>>Seems fair, I hadn’t noticed that; it may not be that, but that’s not a generalization to use.>>[LAUGH]
>>But my point here is that this technique can be applied to both types of data; I was focusing here on the shape. So this is [INAUDIBLE] and this is [INAUDIBLE]; with the other metric the scores [INAUDIBLE], the mean is going to be 100 always, and the method can be used for both, even for scores that decrease. So we are going to use this [INAUDIBLE] method to predict these data, fitting it to the data here, and see how it works; we’ll see what happens.
The first thing you need to do is to find out how many prior measures are useful. We have four prior measures [INAUDIBLE], so a basic approach to this is that you just use a stepwise version of regression and use [INAUDIBLE] as a measure of how well [INAUDIBLE]: if a further measure improves the prediction enough, we keep it; if not, we [INAUDIBLE] based on the smaller model. So basically, for grade three we have only one previous measure, grade one. For grade five, we first include grade three together with grade one, and the improvement is significant. For grade seven, we can use grades five and one; and for grade nine, after we enter the same [INAUDIBLE], adding grade three is not very useful. So what we see with these data is that about three prior measures are enough, and with that we can move on to the example.
this, this, this and this computer. And this is one of the ways
we can plot it, okay. These are the predictive and
observance code for break free. And this is very similar to
the first problem I showed, because this is not a in day one,
but it’s are linear transformation. So this would be the least case,
showed the exact same thing. So it has the value we
have 60 cases [INAUDIBLE], cases below the [INAUDIBLE] in this case. In the remaining three plots, this is actually a linear combination
of the previous occasions, right? And mirror conditions are real
nice errors of the cases, and we can see that the type one errors
are close to 5% in all cases. Another way of plotting this or
carrying this, and this is actually probably more
informative for applied practitioners to need to of one person,
because we want to [INAUDIBLE]. So let’s take case 5570 [INAUDIBLE]. And I have list here is the SR95 for
this group, for the whole group and
they [INAUDIBLE] and the medium, so these three [INAUDIBLE]
group indicators and the first resource [INAUDIBLE]
with this information, and the information for the whole group
[INAUDIBLE] for this person [INAUDIBLE]. So, basic comes and we see that this
special [INAUDIBLE] was the [INAUDIBLE]. But with this data with this kind of
data [INAUDIBLE] Which a break free near around the north, which is very long
top piece low line and this under line. So we the next person and we verify
discussion in much better perspective so great center correcting up close or
incorrect information. So in the next step, it is correct again. And then red line,
the group prices are listed. That’s the still the model,
that you validate our [INAUDIBLE]. [INAUDIBLE],
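As a rough illustration of this kind of per-child display, and not a reproduction of the actual figure, one could plot a single child’s observed scores against the model’s predictions and 95% individual bands; every number below is made up.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values for one child (not real study data): observed scores,
# model predictions, and 95% individual prediction bands at grades 3, 5, 7, 9.
grades = np.array([3, 5, 7, 9])
observed = np.array([96, 99, 97, 104])
predicted = np.array([98, 101, 103, 106])
lower = np.array([88, 92, 94, 97])
upper = np.array([108, 110, 112, 115])

plt.fill_between(grades, lower, upper, alpha=0.3, label="95% individual band")
plt.plot(grades, predicted, "--", label="predicted from earlier grades")
plt.plot(grades, observed, "o-", label="observed")
plt.xlabel("Grade")
plt.ylabel("Reading score")
plt.title("Case 5570 (illustrative values only)")
plt.legend()
plt.show()
```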
Any questions so far? So one of the things I left out of this, and I will illustrate it so you can see that this is useful, is that this could be used in a school. We could have panels like these for different kids; here we have three different kids. We can see that these two kids started very similarly, but the evolution is very different for each of them. We can see that this person is very good, she has a very high level, but [INAUDIBLE] some of these [INAUDIBLE] performance. So these other [INAUDIBLE] go in here and [INAUDIBLE] what can I do about [INAUDIBLE] in here [INAUDIBLE] this [INAUDIBLE] again [INAUDIBLE] positions [INAUDIBLE] receive an intervention for [INAUDIBLE].
To sum up, this [INAUDIBLE] based on linear regression allows us, first, to estimate the expected change for the sample at each occasion. Second, to compute the expected confidence intervals for the [INAUDIBLE] of observed values. Third, to determine which individuals show atypical change, higher or lower than expected. Fourth, to study the trajectories on different variables for those individuals, and to [INAUDIBLE] estimate [INAUDIBLE] for different individuals even without them being in the original sample, right? And this last one is interesting and unusual.
So I will say a bit more about that. So here we are complementing this with information that we already have. Once we have the fitted model, and assuming that we have new cases from the same population, we could actually estimate the whole [INAUDIBLE] for these new cases [INAUDIBLE] and then update it at each new time point.
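Continuing the earlier hedged sketch, scoring a new child who was not in the original sample is just an out-of-sample prediction with the already fitted model; the data and numbers here are again simulated stand-ins.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Fit on the original (here: simulated) sample, as in the earlier sketch.
rng = np.random.default_rng(2)
g1 = rng.normal(100, 15, 500)
g3 = 0.7 * g1 + rng.normal(30, 8, 500)
g5 = 0.5 * g1 + 0.4 * g3 + rng.normal(12, 7, 500)
train = pd.DataFrame({"grade1": g1, "grade3": g3, "grade5": g5})
fit = sm.OLS(train["grade5"], sm.add_constant(train[["grade1", "grade3"]])).fit()

# A new child from the same population, not in the original sample.
new = pd.DataFrame({"grade1": [92.0], "grade3": [95.0]})
new_X = sm.add_constant(new, has_constant="add")
band = fit.get_prediction(new_X).summary_frame(alpha=0.05)
print(band[["obs_ci_lower", "obs_ci_upper"]])  # expected range for this child's next score
```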
And so we can actually focus [INAUDIBLE] in this case. If this person [INAUDIBLE]. [INAUDIBLE] for each new occasion [INAUDIBLE] whether the person shows atypical change for those particular scores [INAUDIBLE]. And in this case, one of the findings for this reading data is that [INAUDIBLE] occasions are enough. And one of the main [INAUDIBLE] of this part is that we have a [INAUDIBLE] of this model [INAUDIBLE]; it may not be a good idea. [INAUDIBLE] But I think that this is interesting. And a [INAUDIBLE] nice thing [INAUDIBLE] is what distinguishes this from [INAUDIBLE], the different estimation procedures. [INAUDIBLE] this is [INAUDIBLE]. Time?>>I forgot to reset my clock,
we’ll give you, say, another three minutes.>>So another thing we could do is change the confidence level. If we want to gain sensitivity, then we don’t need to work with the 95%; we can [INAUDIBLE] and set the rejection [INAUDIBLE] in the lower bound of the plot, until we get to 100%, but [INAUDIBLE] not increasing at the right.
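In the statsmodels-style sketch shown earlier, that trade-off is just the alpha argument of the interval call; this fragment assumes the fit and X objects from that sketch and is illustrative only.

```python
# Reusing fit and X from the earlier sketch (illustrative only).
bands_90 = fit.get_prediction(X).summary_frame(alpha=0.10)  # narrower bands: more flags, more sensitivity
bands_99 = fit.get_prediction(X).summary_frame(alpha=0.01)  # wider bands: fewer flags, fewer false alarms
```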
Now, some things that this model does not do. First, it does not characterize change; this is not a model for change, we are not describing the change, we are just trying to predict this kind of score. And it doesn’t explain why people don’t change as they should. You can add covariates to the prediction, to the model, but the [INAUDIBLE] of this is [INAUDIBLE], because if you have the right covariates, that [INAUDIBLE]. So you’re going to [INAUDIBLE], does that make sense? So, the [INAUDIBLE] covariates. [INAUDIBLE]. What it does is [INAUDIBLE] checking the real cases and [INAUDIBLE] changing more at this step and less at this step. I mean, [INAUDIBLE] the one you do not [INAUDIBLE]. [INAUDIBLE] so, as I said before, once we have the model, we can focus on these new cases and see where they go. And one of the applications of this, and most of the payoff, will be to detect the cases with dyslexia, for example, and to decide possible interventions for those cases based on this [INAUDIBLE].
Now, something we would like to do is to incorporate this method when there is a change [INAUDIBLE]. How can we do this estimation? For example, in situations where you have structural paths that need to [INAUDIBLE]. It should be easy to [INAUDIBLE] if [INAUDIBLE]; I don’t think [INAUDIBLE] is what [INAUDIBLE]. And another thing that we have to do is to compare this with [INAUDIBLE], a [INAUDIBLE] model that tries to find the different subgroups within the population. That is conceptually the same thing this model is doing, so we want to see whether it will tell us the same thing or not. Okay, that was everything I wanted to say.>>[APPLAUSE]
>>Any questions?>>So I have a quick one. Can you go back to your equation? The one with the epsilon in it. It has an error term in it.>>One second. Okay, suppose we could perfectly estimate this. Then the only thing to worry about is this. So we’re asking: when this gets big enough, there’s a flag, something’s gone wrong. Okay, and using the confidence interval approach, we’re basically saying that if it’s bigger than 1.96 times the estimated standard deviation, we flag it, okay? Now what if you have one person who
year after year is just really steady on the test? They have small variability of epsilon. And then there’s someone else who’s just all over the place, all right? And one year they’re way up while you’re going down. Well, that person you’re going to misidentify as needing attention when they don’t need attention, right? So I think the heteroscedasticity here may matter. Now, given your results, it suggests to me that you were working in a setting where there wasn’t much heteroscedasticity, there wasn’t much variability in the error. But with your data you could check that. And you could run simulations with wildly heteroscedastic data and just see whether you’re still getting error rates adding up to 5%, right? That’s what I wanted to say. I’ve concluded this is very, very useful, and I was persuaded by your particular data example. It’s just a matter of robustness across settings, right? And with this sort of data, it may be that’s the way it always is. Okay, so yeah.
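The robustness check suggested above could be simulated along these lines; the data-generating choices are made up for illustration, not taken from the study. The idea is to let the error spread depend on the baseline score and then see how the flags distribute.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
baseline = rng.normal(100, 15, n)
# Heteroscedastic errors: children with lower baseline scores are noisier here.
sigma = 4 + 0.3 * np.abs(baseline - 130)
followup = 20 + 0.8 * baseline + rng.normal(0, sigma)

X = sm.add_constant(baseline)
fit = sm.OLS(followup, X).fit()
bands = fit.get_prediction(X).summary_frame(alpha=0.05)

flagged = (followup < bands["obs_ci_lower"]) | (followup > bands["obs_ci_upper"])
print(flagged.mean())  # compare the overall flag rate with the nominal 5%
# The noisier (low-baseline) children absorb far more of the flags:
print(flagged[baseline < 90].mean(), flagged[baseline > 110].mean())
```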
>>I just would like to suggest another place you might want to go. So we know that the growth of children, especially in reading, is not linear, especially depending on where they start, right? So there are people who have done some work in quantile regression, where they sort of calculated those regression curves completely separately for the different levels of initial performance. And it would be interesting to just sort of see how this might be applied. I mean, I know this is sort of one kind of slice of it, but I think it works better than OLS for things where you’ve got growth that you know is going to look very different at different parts of the distribution, in terms of where you start. Did that make sense?>>Yeah, it makes sense, it really does. We might adopt a new plan.
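For what it’s worth, that suggestion maps onto something like statsmodels’ quantile regression; a hedged sketch with simulated scores follows, and the formula and quantiles are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
g1 = rng.normal(100, 15, 1000)
# Growth whose spread depends on where children start (illustrative only).
g3 = 25 + 0.75 * g1 + rng.normal(0, 3 + 0.1 * g1)
df = pd.DataFrame({"grade1": g1, "grade3": g3})

# Separate regression curves for the lower, middle, and upper parts
# of the grade-3 distribution, conditional on the grade-1 score.
for q in (0.10, 0.50, 0.90):
    fit = smf.quantreg("grade3 ~ grade1", df).fit(q=q)
    print(q, fit.params["Intercept"], fit.params["grade1"])
```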
>>Okay, we’ll make this the last question.>>Yeah, I have a comment. I think it’s kind of related to what Collin said, which is about the effect of measurement error on the method. Because if you don’t take into account measurement error, it’s basically confounded with the prediction error, right? So if you get a huge epsilon, it could be a prediction error, but it could also be that you’re just not measuring it->>It’s a measure of both.>>Right, yeah, so one direction that you may want to explore is how measurement error can have an effect on this.>>And how would you, what is your intuition? How do we attack that? Because I don’t know [INAUDIBLE] latent variables, and I really don’t know how to use this with latent variables, because you are using the whole matrix of scores of the manifest variables. So you would need, I think, the factor scores, and that opens Pandora’s box.>>Yes.>>I understand that well enough. That’s very interesting. I definitely want to go over it, too.>>Okay, we’re done. Well, thank you very much. Thank you.>>[APPLAUSE]