### Statistical Methodology in the Social Sciences 2017: Session 3

There we go.>>So we have three more talks, and some people asked about this. This is the most attended this conference has been, and this is the third one,

and it will continue. So next year around this time, it’ll be on a Friday within a week or

two of this time. And I will say that I am

always looking for speakers, and ideally,

we have nine speakers from nine different social science departments and

groups on campus. And usually,

it’s a faculty presentation, which is particularly good for incoming faculty. But it can also be advanced, very good students, and I’m always open to suggestions and

so on for speakers. So without further ado,

we have Jenny McDougall from sociology, presenting on something [CROSSTALK]. You’ve heard of this topic before?>>[INAUDIBLE]

>>Okay, well, thank you Collin for the invitation to be here, and

to ISS for putting this together. I know a lot of you are students,

so I hope you realize there are a lot of institutions where this sort

of interdisciplinary exchange of ideas and opportunities to come together and

talk aren’t available. Where everything’s very siloed, and people just sort of exist

within their own departments. That’s a real shame, so it’s a really nice

thing that we have things like this. I really appreciate it, and I hope

all of you find it useful as well. So I’ll start the talk today by

explaining my motivation for the topic and the content of this talk. It begins back in my musician days. Back then, there was this essay or short story by Ralph Ellison,

where he describes a jazz performance. In the back of the room,

there’s a guy by himself, standing quietly by the radiator.

You would almost miss him,

he’s sort of quiet, obscure in the back. But he’s giving the performers

his full attention, he’s completely locked in and

totally absorbed in the music. So the takeaway from the story I

used to reflect on a lot was that, that’s the person you’re playing for

when you’re playing music. Even at the crappiest gig, where

there are very few people, and the people who are there are talking,

not even listening to the band. If you imagine that the guy

in the back of the room is there, and that he’s giving his full

attention to the music, then that helps you fully sort of

honor the music and the process and yourself, and

sort of maximize your creativity. Back when I was a musician, I used to

think of the guy in the back of the room. Then I quit music, did academia, and now when I’m at the front of a room,

it’s usually at an academic conference, or maybe I’m visiting another university,

giving a talk or something. And it’s a different person in the back

of the room now that I think about, and you all should think about too. It’s not the sort of

quiet jazz aficionado, it’s a different guy,

it’s the snarky stats nerd. He’s a snarky stats nerd,

he’s in the back, and he’s got his Linux machine in his lap. He’s on Twitter, and he’s waiting for you to say something dumb, so

he can put it up on Twitter. And he can cross out a box on

his bad stats bingo card, and expose you to the entire world. That’s who I’m thinking about

when I give a talk, and so today, I wanna help you not let

snarky stats nerd cross out this box in his bingo card

when you give your next talk. So this is my motivation,

my first motivation, we’ll get to my second one,

second one’s much more selfish. Okay, so count data, when people talk

about count data, we’re referring to a dependent variable that reflects

the number of times something happens. Some event is observed zero times,

one time, two times, so forth. And there are, of course, a number of

research questions in the social sciences that revolve around count

data. The reason snarky stats nerd has "linear model of counts" on his bingo card is because a linear regression

model is an inappropriate way to model this sort of dependent variable,

for a couple of reasons. What is the correct way to do it?

Generally, approaches hinge on some version

of a generalized linear model that uses a Poisson

distributional assumption. The hallmark of the Poisson distribution

is that its shape changes as the expectation increases,

the mean of the count variable. As it is closer to zero,

the Poisson distribution has this highly right-skewed kind of peaky distribution. The more the expected count increases, the closer the distribution

moves to a normal shape. So especially when we

have low expected counts, the normality assumption of the linear

regression model is violated. Another reason to avoid using linear

models to model count data is that, when you draw a straight line, you

can end up making predictions of negative counts, which is not possible. So it’s a similar

rationale to why we use the logit model to model probabilities. Using the log link with the Poisson

distribution, and applying the exponential transformation

to the right-hand side of the model, takes these linear predictions and

forces them to be positive numbers, which is what we want when

we’re modeling event counts. There are these couple of classes of count

models that you see being used a lot, either the word Poisson shows up in them,

or the words negative binomial do. But they’re all based on

the Poisson distribution, so what I would like to do now

is take the next two and a half hours to go through

each of these models.>>[LAUGH]
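The shape point from a moment ago can be sketched quickly in Python (an illustration added here, not part of the talk's Stata workflow): the same Poisson formula gives a peaky, right-skewed distribution at a low mean, and a roughly bell-shaped one at a higher mean, with the variance always equal to the mean.

```python
from math import exp, factorial

def poisson_pmf(k: int, mu: float) -> float:
    """P(Y = k) for a Poisson distribution with mean mu."""
    return (mu ** k) * exp(-mu) / factorial(k)

# At a low mean, the mass piles up near zero (right-skewed, "peaky"):
low = [round(poisson_pmf(k, 0.5), 3) for k in range(5)]
# At a higher mean, the distribution looks much closer to a normal shape:
high = [round(poisson_pmf(k, 20), 3) for k in range(15, 26)]

print(low)   # most of the probability sits on 0 and 1
print(high)  # roughly bell-shaped around the mean of 20
```

The equi-dispersion property discussed below falls out of the same formula: summing k·P(Y=k) and (k-mean)²·P(Y=k) over k both give mu.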

>>Talk about the technical characteristics with pros and cons. Of course, there are a number of reasons

why that’s a silly thing for me to say. One of the reasons that’s a silly

thing for me to come in here and say is because we’re all

in the room right now with the person who literally wrote

the book on these types of models. In addition, so step one,

before you begin a project where you’re going to do [INAUDIBLE] would

be getting your hands on this book, talking to Collin, maybe taking his class. There are another couple of books that

I recommend that are especially useful. But I’ll also note that Scott Long himself

describes Collin’s book as, quote unquote, the definitive text.>>[INAUDIBLE]

>>We’re concluding the kissing up portion.>>[LAUGH]

>>But no, so I’m not going to talk much about the

technical specifications of the models. What I want to do is just talk about some

sort of practical considerations, walk through how a person using Stata software

might approach modeling count data. We then talk about how a project

I’m currently working on sort of exceeds what

Stata can really do, and talk about some ways to hopefully work

around some of those limitations. Any questions so far? So when selecting your count model,

the things to consider when deciding

which approach to use generally have to do with the distribution

of your particular count variable. The first key issue is establishing

whether you have equi-dispersion or over-dispersion of your count variable. So the standard Poisson regression,

the baseline model for doing this sort of work,

assumes equi-dispersion, which with a count variable means

that the variance is equal to the mean. Which, in practice,

is a pretty infrequent, rare occurrence. Most of the time researchers are working

with a variable that’s over-dispersed where the variance is

greater than the mean. When you have over- dispersion and

you use a standard Poisson model, your standard error estimates are downwardly

biased which puts you in the position of being more likely to reject a null

hypothesis when you shouldn’t. Type one error, which is a position we don’t want to

be in generally as researchers. The negative binomial

regression model adjusts for this over-dispersion by

including additional error term. One of the ways to check whether you have over-dispersion is to estimate

the negative binomial regression and see whether it is error return is

significantly different from zero. It happens by default in this data. And then there’s this other, I mentioned a

couple slides ago, zero truncated models, and we’ll talk about them today. Those are useful when you have a situation

where your counts are only recorded when they happen one or more times. Basically you only start

recording events when they occur; you miss all the zero observations. The zero-inflated model is used

in the opposite situation, where you have lots and

lots of zero counts. For many of the observations,

the event doesn’t occur. Zero inflated models I

think are very interesting. Conceptually the way they

handle this is like, so assuming there are two classes of

observations in the population as group, for which there’s no chance that the event

will occur, it’s always zero group. And then there’s another latent group

of observations that’s not necessarily always zero. So they can have a zero count, but

there’s some nonzero probability that the event will occur one or

more times for that group. And what zero-inflated models do, and

hurdle models as well, is estimate the probability of

membership in the always-zero category,

and then estimate the predicted number of events conditional on

not being an always-zero type of observation. They combine those two pieces

into a prediction. So, should you use a zero-inflated model? Paul Allison, among others,

answers that question with a probably not. Zero-inflated models are appealing

in some ways because we can talk ourselves into thinking that there is some

sort of dual process generating counts: a process that predicts having

some nonzero probability of the event, and then a second process that generates

the actual count of the event. But the arguments that Paul Allison and

others make against zero-inflated models generally focus on

a sort of overkill argument. Using a zero-inflated model much

of the time is kind of like using a sledgehammer to

drive in a little nail. And at any rate, the regular negative

binomial regression model generally fits the data just as well as the zero-inflated

negative binomial, so why make things more computationally intensive and complex

[INAUDIBLE] than we need to? So from this strictly data-driven

argument, there may be reasons not to use zero inflated models, but given

that we’re social scientists we have to remind ourselves that theory

comes into play sometimes. We’re not just number crunchers,

we’re modeling some social process. There could be reasons why we would expect

there to be these two separate processes predicting zeroes and predicting counts. This is a blog post from Paul

Allison’s webpage, where he gives an example: if you’re modeling the number of

children a woman has by age 50, there are some women for whom

biological or physiological causes make it impossible to have kids;

they’re always going to have zero kids. Then, among the other women, you’ll have a nonzero

probability of having children. They’re separate processes that

determine the number of kids. In that case, there’s theoretical

reason to model the count-generating processes with

the zero-inflated approach. Okay, I’m going to move on to

some hands-on examples now. So the first thing you

might look for is over-dispersion. These are the data that I’m going to use, where the outcome variable I’m

interested in is the number of charter schools that are opened within a school

district over a 12-year period. The first thing that we’re interested

in is identifying whether there is over-dispersion in this variable. So among the 10,500 school districts,

the average number of charter schools opened is 0.5, and

the variance is 11 times that. For the Poisson model, these numbers

should be equal, and they’re very clearly not equal. You can also see the distribution of

this variable is strongly right-skewed, and very peaky as well. For the snarky stats nerd.
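A minimal sketch of this over-dispersion check (the counts below are made up to be shaped loosely like the distribution just described, not the actual school-district data):

```python
# Toy right-skewed count data: mostly zeros, a few small counts, a couple of large ones.
counts = [0] * 84 + [1] * 8 + [2] * 4 + [3] * 2 + [10] * 2

n = len(counts)
mean = sum(counts) / n
variance = sum((y - mean) ** 2 for y in counts) / (n - 1)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, ratio = {variance / mean:.1f}")
# A ratio well above 1 signals over-dispersion: the plain Poisson model's
# equi-dispersion assumption (variance == mean) is clearly violated here.
```

With real data you would compute the same two numbers for your outcome variable before choosing between Poisson and negative binomial.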

>>[LAUGH]>>The reason we have this right skew distribution with this strong peak down

at the bottom is because most school districts have no charter school, 84% of

the districts in the country. Then you have another 8% that have a single

charter school, and by the time you get beyond two or three charter schools,

very few districts are represented. So you have this strong right

skew count distribution. [COUGH] Okay, so what you might do is

estimate a bunch of different models: your Poisson regression, your negative binomial regression, the

hurdle model, and take some of your estimates and compare them to one another,

use a likelihood ratio test to compare them. You could use those models

to generate predictions, and then you can also do interesting things, comparing predicted counts across models

or comparing predicted to observed counts. And that can be a really good way

to do things if you have lots and lots of time you’re trying to kill, or

if you really like debugging your own code and trying to figure out what

went wrong over and over again. So that’s the first option. Or,

what I wanna show you is this countfit command, written by Scott Long and

Jeremy Freese, that when using Stata makes this model

comparison much, much easier. Much more straightforward, and lets you

do some interesting comparisons, really lets you hone in on the type of estimator you

should be using with your count data. So countfit, much simpler, unless it’s

totally not and it doesn’t work at all. Let me give you some

background really quick. So in 1994, William Greene points

out that the zero-inflated and non-zero-inflated models are not nested. You need nested models if you’re gonna

use a likelihood-ratio-based comparison of model fit to decide which one’s better. He says they’re not nested, so

you need to use a different test. And he proposes using something called the

Vuong test to compare zero-inflated and non-zero-inflated versions

of the count model. Then Allison comes along and

he’s all, [SOUND].>>[LAUGH]

>>He says, he’s like, William Greene claims the models are not

nested because there’s no parametric restriction on the zero-inflated model

that reproduces the non-inflated model. This is incorrect. A simple re-parameterization of

the zero-inflated negative binomial model allows for such a restriction,

so the Vuong test isn’t appropriate. And then Long and Freese 2014 say,

chill out, guys, for testing zero inflation it really doesn’t matter. Even if you agree with Allison,

you can still use the Vuong test. Let’s just use the Vuong test. And leave [INAUDIBLE] right there.>>[LAUGH]
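For reference, the classic Vuong (1989) statistic the debate is about is simple to compute from two models' per-observation log-likelihoods; this is a sketch with made-up numbers (and comparing the result to a standard normal is exactly the practice that gets criticized later in the talk).

```python
from math import sqrt

def vuong_statistic(loglik1, loglik2):
    """Classic Vuong (1989) z-statistic for comparing two non-nested models:
    the mean pointwise log-likelihood difference, scaled by its standard error.
    Positive values favor model 1; negative values favor model 2."""
    m = [a - b for a, b in zip(loglik1, loglik2)]
    n = len(m)
    mbar = sum(m) / n
    sd = sqrt(sum((mi - mbar) ** 2 for mi in m) / (n - 1))
    return sqrt(n) * mbar / sd

# Made-up per-observation log-likelihoods from two fitted count models.
ll_zinb = [-1.20, -0.90, -1.05, -0.80, -1.10, -0.95, -1.00, -0.85]
ll_nb   = [-1.30, -1.00, -1.10, -0.95, -1.25, -1.05, -1.15, -0.90]

z = vuong_statistic(ll_zinb, ll_nb)
print(f"Vuong z = {z:.2f}")  # classically compared to standard normal critical values
```

Swapping the two argument lists simply flips the sign of the statistic, which is why its sign indicates which model is favored.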

>>I’m only being a little bit facetious. So this Allison 2012 I’m referring to is that same

blog post I showed you the screenshot of. And there’s a comment section. And that comment section

is like 60 comments long. And it’s these two guys

fighting with each other.>>[LAUGH]

>>Which is its own, you can do a sort of

nerdiness test on yourself. Give it a read back and

see how much you are entertained by it. I loved it, I thought it was great, it was

like watching a boxing match or something. It wasn’t anything like that, it

was a lot nerdier. Okay, so they’re disagreeing. Long and Freese say it doesn’t matter,

let’s just use the Vuong test. So up until recently, let’s say like Monday of this week,

you could run the countfit command that I was just talking about, and it included,

among other things, the Vuong test. That lets you compare the zero-inflated

and non-zero-inflated models [INAUDIBLE]. I’m not sure exactly when this changed. Stata updates when I open it. But imagine my shock and

dismay yesterday when I’m going through putting together slides for a talk where

I’m gonna show you how great countfit is, run it to start producing some output,

and I get this error message: "Vuong test is not appropriate for

testing zero inflation. Specify option forcevuong

to perform the test anyway." But since this just happened, they

haven’t updated the countfit command. There is no forcevuong option in countfit. There’s also no way to

cut out the Vuong test, which makes the whole countfit

thing blow up and go nowhere. So I can’t do my talk anymore.>>[LAUGH]

>>Cuz I can’t show you this really interesting command.>>[LAUGH]

>>Okay, so I know you are all thinking right now, you

are thinking first it’s global warming,>>[LAUGH]>>Elected officials are behaving erratically, unpredictably, and

now the Vuong test doesn’t work.>>[LAUGH]

>>For comparing zero inflated and

non-zero-inflated models. I feel your dismay. I’m in the same boat. What Stata does, though, which is nice, is

explain their reasoning here. The main point of this

paper by Wilson in 2015 is demonstrating unequivocally that you

can’t use the Vuong test anymore. So basically,

the distribution of the test statistic for the Vuong test is not standard normal, so

the usual critical values are misleading. The actual distribution is unknown,

which is interesting. It can’t be used for inference. But the takeaway here: you may consider

using information criteria to choose between the standard and

zero-inflated models. Not something most of us are used

to doing, taking our AIC and BIC and comparing those, but fine, all is not lost. The problem is,

the software hasn’t caught up yet. I had an old version of Stata

installed on my computer. And so I used that,

and went back in and reran things, the good old days being like Monday,

but it’s changed since then. And this is going to be fine again

someday, too, once they let us just omit the Vuong test, it’s a very small

part of what countfit does. Countfit is awesome, I’m a big fan. What a person starts with is specifying

the model as a zero-inflated model. When you write in Stata the commands

to estimate a zero-inflated model, there’s the first part, which is the part that

predicts the number of event occurrences. Then after the comma, the inflate option, that’s the other model that predicts

always being in the zero category. Here I included all of the covariates. And I’m just showing you a portion of

the output, but the first thing that countfit gives you is the point

estimates for the slope coefficients and their standard errors across

each of the different models. And so in this example,

we can go through and see. The thing that jumps out at me first

is that the negative binomial and the zero-inflated negative binomial are pretty different, with respect to

the coefficients they’re predicting for the effect of being in a, this is

an established Latino gateway school district, on the number of charters

that open, and so on. These are, I believe, exponentiated

coefficients, so like odds ratios, but in the count model world we

call them incidence rate ratios, IRR. So I wrote it with an interaction there. You can specify, you know,

the complex model, the thing you’re actually going to

end up interpreting down the road. One caution I would give you, though: so countfit, what it does is

actually go and estimate each of these models, using quietly in Stata,

behind the scenes. If you have convergence problems with

any one of these, you won’t see it. It will kind of look like it’s frozen, or it will never

finish, but you can’t see what the problem is. So I recommend going through and

fitting each one of these individually, just to make sure they all converge

before going into countfit. So this would go on for as many

covariates you have in your model, and then down at the bottom, fit statistics for which model is going to be preferable. The next thing countfit spits out,

so this is just one, I just put half of that in my model and

this is what comes out. For each of the models, it gives you

the value of the count variable for

most wrong basically. The difference between predicted and

observed counts. The Poisson model was most wrong for

predicting zeros. It under predicts them,

this is pretty common. The negative binomial and zero inflated

negative binomial, their biggest wrongness is substantially less than the biggest

wrongness for the Poisson model. You can see it gives you

the average wrongness over here. Other measures of which model is doing

the best job of fitting the data. Then it will give you this sort of table,

once for every estimator. It gives the actual

probability of a certain count, the predicted probability of that count

based on that model, the difference, whether those differences are significant,

and adds up the total wrongness over there. And so you can sort of read

across these four tables and you get a sense of how these

different estimators are doing. Then countfit gives you a graphical

presentation of the same sort of information. So here it’s the observed minus

predicted counts at each value of your count variable. What you wanna see is values that are

basically zero, [INAUDIBLE] are the same. My goodness. Poisson is doing a terrible job,

[INAUDIBLE]. Last thing you get is this formal

comparison of different fit statistics, which is nice. Over here you get a sort of narrative

summary of how strong the evidence is for one over the other. It tells you which one is

preferred over which. This is that God forsaken Vuong test,

that we should not pay any attention to. But this table gives you a summary of,

which model is doing the best job. So, running this one command

does all of this work for you. It lets you onto one model. So the last piece that I’ll talk of today,

the title of the talk. I haven’t talked about

multilevel models so far. But each of these hypothetical questions

that I started out with can be thought of as a multilevel sort of process, whether it’s

repeated measures within observations over time, or spatial or

bureaucratic clustering of observations. It would be nice to be able

to use multilevel models, random effects, fixed effects, models with this sort of zero-inflated framework to model these count outcomes. There’s

no reason that this shouldn’t be possible. So each of these is just a sampling of

papers laying out how someone would use a multilevel framework, the random effect

framework or mixed-effect framework with the zero-inflated count model,

using, in this case, SAS MIXED, LIMDEP, S-Plus. But conspicuously

absent from all of this is Stata. There is no pre-packaged or

user-written set of commands that would let you do multilevel modeling for

zero-inflated count data. Here is my,

I wanted to make sure I get to this so that the smart people in the room

can tell me whether this is okay. So I’m not sure. Here is my proposed workaround. What I’m interested in here, so I’m

measuring, I’m trying to model the number of charter schools that are formed in

school districts across the US over time. But there is reason to suspect there will

be some between-state variation, for a number of reasons. Chiefly, policy differences:

some states don’t have any laws that would allow

you to open charter schools. And the laws that have been passed in certain

states were passed at different times, giving sort of different periods of risk for

charter schools to have been opened. What I wanna do is make

within state comparisons. I wanna use state fixed effects when

estimating this zero-inflated negative binomial model. So what I’m doing

is including state dummies in both portions of the zero-inflated

negative binomial model and then using clustered standard

errors at the state level. Okay? This is my best sort of shot

at doing what Stata doesn’t have a mechanism for

letting you do, as it stands. The last little trick I’ll show you,

as I’m sure I’m pretty much out of time. Am I all the way out of time?>>No, you have two minutes.>>Two more minutes. So, the reason I have this version 14.2

up here: if you’re using version 15, the listcoef command doesn’t work,

it hasn’t sort of been updated. This is a little trick, we call up a previous version of Stata

to make commands work with each other. One of the downsides to using the state

fixed effects is I get, so there’s 52 states because we have DC and Puerto Rico,

so I have that many dummy variables twice. Once in the first part of the model and

once in the second part, so 104 coefficients

that are just sort of meaningless. I don’t want to look at them. This listcoef lets you select the independent

variables that you want to see coefficients for and lets you tailor how

you want those coefficients presented. How I asked for them to be presented is,

as the raw coefficient, the test statistic, the p-value. And then these percent-change columns

are really, really handy for these sorts of models. You get the percent change in the expected

count for a one-unit increase in x, or a one-standard-deviation increase in x. And the other thing I really like is next

to the one standard deviation increase in x column, it gives you what the standard

deviation of that variable is. It’s a small little thing, but

it’s very, very handy [INAUDIBLE]. Because my model has interactions, this actually isn’t a very

good way to interpret things. Like in a logistic regression model,

interactions are complex. The margins and marginsplot

commands work very well with zero-inflated negative binomial and

zero-inflated Poisson. Sorry I’m rushing and wearing myself out. [COUGH] What I have here is, so, the predicted count of charter

schools in school districts, as the change in the percent of

the residential population that is Latino, over time from the previous

decade, increases, separately for school districts that

are established Latino destinations and non-established destinations. Here,

the interpretation is, when the Latino child population is increasing in relative

size as a new phenomenon, when there didn’t used to be Latino

kids in the district, a bunch of charter schools open. It’s not exactly the same thing in an established destination, [COUGH] sorry, a school district that has a longer

history of Latino child [INAUDIBLE]. This is the average marginal

effect of destination type, which is a way of doing a significance

test between the two slopes. Where the confidence interval

doesn’t include zero, it’s an indication of a significant

difference. So we interpret it as, once the Latino child population

increases by more than about 20% over the previous decade, we see this

significant difference between established and

non-established destinations. But we’ll stop there. Sorry for rushing. Thank you.>>[APPLAUSE]>>Any questions? All right.>>So, the question is [INAUDIBLE]. We have a data set that is not count data. It’s basically the percent of time that

students have sort of behaved in class on a zero-to-ten scale, with ten being

very behaved and zero being not at all.>>Mm-hm.

>>But if you were to look at the distribution

it looks a whole lot like a Poisson Distribution. And we have been going back and

forth about whether or not to use the Poisson distribution,

even though it’s not theoretically a count, or to try to come up with something else, and

I was wondering if either one of you had, given that this is the topic of your talk,

if you had an opinion about that?>>So if you treated the, so

it looks kind of like one of these?>>It looks like the first black line.>>Yeah, so Poisson distributions are

typically described as rates, you know, in terms of rates [INAUDIBLE]. I’d have to think, it could be. I don’t know. This is another one where

I shouldn’t be answering.>>So let’s say, on that one,

you thought of it as a continuous measure between zero and 100; you can re-scale it to zero and

one and do a fractional logit, [INAUDIBLE] proportion data. I think that would be,

I know you’ve got some wonkiness, but still I think that’s what

I would consider doing. Because the logit model you could

apply not just to zero-one data. It’s for anything where the conditional

mean lies between zero and one. Similarly, I’d say, with count data. Count data is not restricted to counts. It’s really for any model where

the conditional mean is exponential, e to the x-transpose-beta. So

that answers that question. Just a couple other things. With the fixed effects, by the time

you get to a nonlinear parametric model, you have the incidental

parameters problem. So the fixed effects are at the state level. As long as you have a considerable number

of observations per state, you’ll be okay. But in your example, it may be that

you want to drop some small states, states that just have

a couple of observations. [INAUDIBLE] but

I think by the time you get up to certainly 30 observations in

a state you’ll be fine, maybe 20. And then you gave this

introduction, I would say another big reason for doing counts. Two reasons for doing it as Poisson: I think in most applications

with count outcomes, if someone gets a year older, I

don’t think that means 0.3 more visits to the doctor,

I think it means, what, a 5% increase. I think all the effects are multiplicative. And in that case, you either want Y

conditionally exponential, or you could go log Y. If

you go the log route, you’re taking the log of zero

at times, which is problematic. But also, at the end of the day,

I want to predict y, not log of y. All right. And then the other reason for doing it is, how hard is it to go from

regress y x, vce(robust), to poisson y x, vce(robust) [INAUDIBLE]. Once upon a time it was incredible. And the Long and Freese,

it’s a really good book, right? The book that Ryan and I have on counts

is more of a research monograph, so it’s doing much more [INAUDIBLE]. It’s just not gonna emphasize

the basics; the Long and Freese is very good for that. Also, our other book might

have the chapter on counts, it’s kind of the essentials, and

that would be a good place to start. This has been a lot of fun actually,

and it’s unusual for us to have so

much detail on a given set of data. I think, in future times, next,

you will have one on something else. [INAUDIBLE] or something. And finally with counts, the data often

requires you to [INAUDIBLE] Right? [INAUDIBLE] Often you ask people who

actually came to the doctor’s office, and [INAUDIBLE] But

you don’t see the ones that never came. So you are forced, so the parametrics, and I think maybe things like

a negative binomial actually work quite well.>>Great, thank you.>>All right, so next we have Maggie Molsh

from the School of Education.>>All right, so, I am Meghan Welsh, and I do actually know what you

are going to do [INAUDIBLE]. I work in the school of education and

I am a [INAUDIBLE]. I care a lot about measures, and how we

measure what people know, think, believe. So one sort of shift in thinking

that I think is necessary in the world of psychometric methods,

is that I am no longer interested in thinking about the effect

of something on people, where I am sampling people,

or districts, or types of organizations from

a population. Instead, the work in psychometrics

starts with the idea that there is an infinite universe of items

that we sample from to create a test. And so, what I am interested in

is statistics around items. Just as when we’re measuring people,

we have measures both on, say, tests and on people. When we’re measuring, when we’re

evaluating the properties of test items, we do the same thing. We have some data on people and on items. But it takes a little bit of a shift in

thinking to think about the fact that what I care about, when you’re

dealing in measurement, is the inferences we can make about items,

not about people, because those are the sorts of

variables in my analysis. So on that note,

most of what I’m gonna talk about today, sorta falls within the realm of

something called test validity. If you’ve ever taken a psychometrics

class, and it’s been at all a while ago, you should know that, sort of, the definition of validity

has changed relatively recently. And what we’re really concerned about is,

the degree to which evidence and theory support both the interpretation of test

scores and the proposed uses of them. So we’re not only concerned with,

when you look at the score, dedicate how well does

a kid understand math? But we’re also interested in how that

information about how someone understands math is being used. Can you use that information

to make a change in policy? It would be appropriate for that use. Would it be appropriate to make

an inference about how good a job a teacher is doing? Those are the types of things that we

need to sort of additionally investigate, whereas even so recently we say, less than ten years ago, we were

only concerned with interpretations. I, today gonna talk about something

called instructional sensitivity, which is one aspect of validity. What this has to do with, and it does have some policy relevance, is the degree to which students' performance on the test reflects the quality of instruction that students are receiving. So you may be familiar with lots of educational reforms that have come out where we quite explicitly make inferences about schools, or about teachers, based on student-level test scores. During No Child Left Behind, there were a lot of very rigorous evaluations of schools based on test performance. More recently, in states outside of California, California lives in some ways in this wonderful bubble, teachers are now being evaluated in this way. So there are many states in which there is a state-level teacher evaluation system, where student performance on tests, aggregated up to the teacher level, is used to decide whether to label a teacher as someone who's successful or someone who is failing, and who therefore needs sort of intensive intervention, extra professional development, extra supervision from their principal. And so, in particular, what I'm

concerned about, is the way that we think about tests capturing the effect of, it can be teachers or schools, but I'm gonna focus on teachers today. I also wanna point out that in many, many areas of research, we use test scores as a proxy for the effectiveness of some intervention that we're having teachers implement, right? So it isn't just a policy concern. In education research, from an evaluation perspective, we often do things to teachers and then see if that thing we did to teachers, or trained teachers to do differently,

changes how kids do. And in reality, the way that we tend to evaluate tests, in terms of test validity, never takes into account whether or not they can be used to make inferences about instruction at all. In fact, there's a fair amount of work by people like Bruno Zumbo at the University of British Columbia, who has looked at things like the factor structure of measures when you measure them at the individual level, and then the factor structure when you measure them at, say, the group level. With the same data set, you can have entirely different factors at the student level, and then when you aggregate to the teacher level, it looks like you're measuring different constructs altogether. This talk today is gonna focus specifically on instructional sensitivity of items. There's also been a lot of work on instructional sensitivity of tests, on the scholarly side of things, not on the test developer side of things. And I'm particularly interested in item-level evaluation, because it gives us information about which test items we have to improve in the test development process. So as I said, test and item sensitivity are not currently evaluated by high-stakes testing programs at all. My adviser and I wrote a paper evaluating the sensitivity of a test in Arizona, I think it was 2008, and that's the last published piece that I'm aware of that looks at examining a particular instrument. When we think about item sensitivity,

there is no one established method. And the reason for this is that, when you think about how large-scale tests are structured, there's usually about 50 items per test, thousands of teachers per state, hundreds per district, and 25 students per classroom. So we have a problem where, for the number of items we have, when we get up to the student level, things don't estimate correctly, because we don't have enough students to then make good teacher estimates. Historically, sensitivity has been evaluated with quasi-experimental approaches. One approach is that you would give a test, you have a group of kids, you have a teacher provide instruction on the content of the test, and then you give the test again. And there's a bunch of different statistics you can compute to think about the differences in the pre-test and post-test scores. The items that go up the most are the most sensitive to instruction, on average, and the ones that don't change must not be sensitive to instruction. So if it gets much easier to answer the item, it's a good item. And if there's no change, it's a bad item. There's also some work in this

field called opportunity to learn (OTL), where we actually go out and measure what teachers are doing in classrooms. Are they teaching the skills that are on the test? Are they doing a good job of teaching those skills? How much are they emphasizing what's on the test? We also put in some other things, like overall achievement and student demographics, to think about how well these two things predict item performance. If teaching the skill that the item is measuring improves the probability of getting the item correct, then it is a sensitive item, and if OTL doesn't seem to predict item performance, then it's not. There's a couple of problems with this method. One is that it's burdensome, because you have to do additional data collection beyond collecting operational test data. The operational test data is just the stuff you give to kids in the testing moment. Let's go back. If we use something like this in

particular to get at the instructional effect, if you think about a state test, it normally gets at measurement, and it gets at number sense, and it gets at geometry, and it gets at algebra. So you end up having to have a lot of different lessons here to measure the effect of instruction on the test. And in addition to that, figuring out what's actually going on in a classroom is really, really, really hard, because there are about 180 days of instruction, and things change a lot day to day. I bet anyone here who's actually taught, even in a university classroom, would sort of acknowledge that any given lecture may be their best lecture or maybe not quite their best lecture. So I've been playing around with, and I really do want people to

offer other ideas here, statistical models that we might use to figure out item sensitivity. What I've been doing is borrowing a lot from the teacher effects literature and trying to think about teacher effects on item performance. I wanna acknowledge that there's this whole literature out there on teacher effects, on things like multilevel measurement models. The people who write these papers aren't even aware that instructional sensitivity exists. And the people who are aware that instructional sensitivity exists, for the most part, are not methodologists. So there has been very little sort of meeting of the minds here. And so here's what a multi-level

measurement model looks like. What we're talking about is predicting the probability that a student answers an item correctly, with the item-level performance on all the items on the test except for that one item that you are predicting. And then we have estimates, taking the clustering of the items within students into account. In an IRT model, which again predicts the probability of answering an item correctly, we would have a random effect just for the intercept at the student level. So what we can do then is estimate, you have a probability of answering the item correctly that's based on the student-level random effect. Which Jason, or Jacob, just referred to as error or noise. But in our thinking, if it's at the student level, then it's something about the student; the difference from the typical student is a characteristic of how much the student knows. And then we take off the item difficulty. And it's a logistic model. Questions about this model? We can add a third level to

get to that teacher effect. And again, when we are just treating that as a measurement model, not trying to get at the teacher effect on each item, we just have an error for the teacher here, which is sort of the difference of a given teacher from the average teacher, that's the residual. Now the probability of answering an item correctly reflects both the characteristics of the student, whether they are above the average student or not, and the characteristics of the teacher, whether they are above the average teacher or not, minus the difficulty of the particular item relative to the average item. So this is just, if you've heard of item response theory, this is just item response theory written as a multilevel model. If I want to estimate

the actual teacher effect, now I have to add in a whole bunch more error terms, right? So each one of these is the teacher effect on any given item, right? And then this is the difference of any particular student from the average student, right? So this is the student-level residual for item two, and this is the one for item three, and this is the one for item four. So I can see how much variability there is in student performance around each item. Then I have the same thing for the classrooms. How much variability is there of teachers away from the teacher-level average performance, in terms of the probability of answering specific items correctly? And again, I've moved from the world where this is noise to where this is some sort of meaningful effect of a teacher. Given what I said about the structure of my data, can anyone predict what the problem is with this model?>>So you assume all of these errors

are normally distributed? Are they normal, is that it?>>Well, I would have to, but think about it: if I have 50 items, instead of b k minus 1 j, it's literally beta naught to beta 49. And then at the student level, I only have 25 students to estimate the variability around each item's average performance. And at the teacher level, I've got at least hundreds of teachers here. So here's what happens when I actually try to run it: it just won't even converge. Right, cause the number of parameters I'm trying to estimate is greater than the number of observations, data points, I have. And in fact, when people study multilevel IRT through simulation studies using a [INAUDIBLE], they're almost always focusing on about five items. And I'm pretty sure that more than five items is where things just won't converge. So the problem we really have is that the way that we're approaching multilevel modeling from a basic research perspective doesn't actually apply to operational testing environments, where we're dealing with 50 items per grade level. Yes?
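(The two-level model on the slide can be sketched in code. The following is a minimal simulation, not her actual analysis: it assumes a simple Rasch form, P(correct) = logistic(student ability minus item difficulty), and fits it by joint maximum likelihood with plain gradient ascent. All sizes and names are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_students=2000, n_items=50):
    """Rasch responses: P(correct) = logistic(theta_j - b_i)."""
    theta = rng.normal(0, 1, n_students)      # student abilities
    b = rng.normal(0, 1, n_items)             # item difficulties
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random((n_students, n_items)) < p), b

def fit_rasch(y, n_iter=200, lr=0.5):
    """Joint ML fit by gradient ascent -- a sketch, not an operational estimator."""
    n, k = y.shape
    theta, b = np.zeros(n), np.zeros(k)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = y - p                          # gradient of the Bernoulli log-likelihood
        theta += lr * resid.sum(axis=1) / k
        b -= lr * resid.sum(axis=0) / n
        b -= b.mean()                          # identification: difficulties centered at zero
    return theta, b

y, b_true = simulate()
theta_hat, b_hat = fit_rasch(y)
print(round(float(np.corrcoef(b_true, b_hat)[0, 1]), 2))
```

With one intercept per student and 50 centered difficulties, this converges easily; the trouble she describes starts when each teacher also gets a random effect per item, so the parameter count grows roughly as teachers times items, while each classroom contributes only about 25 students.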

>>How many students would you need for the [INAUDIBLE] to converge? I'm thinking about, like, schools where a teacher may have several different classrooms of the same subject, so that in fact they have more than 25 students, but even then it might not be enough. So I'm just curious.>>So at the high school level, this might work. In the high-stakes testing world, there's usually sort of one grade level in high school that has a test. So that's a really good point, and I haven't played around with the high school level. There has been some work around a different kind of modeling that I'm going to present, that has found that things sort of hold up pretty well once you hit around 50 students per classroom. So I have two workarounds I've been

playing with in an applied way. One is that I just estimate the two-level model: either I estimate the variability around the students and I sort of drop the teachers, sorry, the classrooms, or I drop the student level out altogether, right? And if I drop the kids out, I can get classroom-level residuals. And if I do that, you might have heard of something called [INAUDIBLE] in the education world, it was a buzzword for a while, because it's the way that test scores are used to evaluate teachers. What you do, essentially, is fit a regression line for each student, predicting their test score based on prior performance and lots of student demographic characteristics. And then you look at the magnitude of the residual, the difference between the predicted and the observed score for that student. And you aggregate all those residuals within a classroom. If on average kids are doing better than projected, then the teacher must be successful. And if the kids are doing worse, then it must be the teacher's fault. So I sort of borrowed that idea, but instead, what I've done is aggregated up the student-level, sorry, the classroom-level residuals around each item. So I've taken the student-level residuals around each item and aggregated them to the classroom level, but instead of looking at the mean residual, I'm looking at the variability between the classrooms in the mean residuals. But this is like, I don't know, it's like a back-of-an-envelope, not necessarily statistically correct approach. And then the other approach I've used,

is something called item difficulty variation, which was developed by [INAUDIBLE]. And at this point, they also just sort of get rid of the student [INAUDIBLE]. Instead of looking at the student level, they get rid of the students and just look at the classroom level. So this is a very similar sort of situation, where they take the average classroom's test performance, and then they take off what would be sort of the average difficulty for an item, the difficulty of an item adjusted for the mean test performance across the state. So is this item more difficult than typical items, or less difficult? And then they add in a random difficulty for the item, and that's associated with how much more difficult this item is in this classroom, after we've accounted for average performance in the classroom and general difficulty of the item. This will converge and run. But you'll notice that, as opposed to this model where I've incorporated all of the items into one model, this has to be run item by item. So when I do that, what's the big problem if I have 50 items? This is like a very basic statistics question, and I have several graduate students in the room who I'm not afraid to call out.>>[LAUGH]

>>[INAUDIBLE]>>Yeah, I have like a type one error issue, right? Where I'm sort of testing things over and over and over again. And when you get to 50 items, if you wanna make some sort of post hoc adjustment, you're talking about very, very, very small numbers. You have to have a very small probability of having the result by chance in order to have it work. It also ignores the student-level information altogether, which seems problematic as well.>>So are they likely to be independent, each of those tests, or will there be positive correlations?>>The tests?>>When you did the 50 items.>>So the items themselves, in most testing situations, are assumed to be independent. That's quite common in the testing world, even in situations like, if you've ever taken a language arts test, usually you read a passage, right, and then you answer several items from the passage. And so we know that they're not really independent, because they're all linked to the same passage.

not really independent because they’re all linked to the same passage.>>Well I think it’s not so

much the item itself but the statistic that you’re using, that

you’re getting at the end [INAUDIBLE].>>The probability of

answering it correctly?>>No I think, you said this problem

with doing, on the next slide.>>Ignore student level information, yeah.>>Right.>>Yeah.>>I thought you said that doing this

test separately, 50 times, this->>Yeah, as I do the items 50 times. And so you are right that you can do this

either where you get the overall item, we take the overall test

performance into account, right? And so there are models that sort

of take overall test performance, holding out the item

that you’re evaluating. So it’s overall test performance on say,

49 of the 50 items one at a time, to adjust for that fact that the item

would be part of the ability measure. Is that the question?>>I can’t talk [INAUDIBLE]

>>Okay, there’s a final general approach I wanna talk about, which is called

multi-root confirmatory factor analysis. And what we do there, is we have a factor

analysis where we have a factor loading. We have item indicators that are sort

of loading, that are predicting math achievement, we have sort of an average

performance of the item difficulty, and we’ve got the relationship between

the item and the blade construct, that’s the lambda there. And we can create these sort of factor

models for, often you will see these when we’re validating tests where it’s just for

all students, all together. But we can run them for

sort of, at the teacher levels. So we do it for each classroom or for each

school, and then we compare whether or not both the difficulties and the loadings

are equivalent across the classrooms. And if there are any differences in

the relationship between the item and say overall math achievement,

then we take that to mean that there’s something different about what

math achievement looks in this classroom, that’s related to the item. So if this is a stronger relationship and

this is algebra, and over here there’s a stronger relationship

here in it’s number sense, then we’re going to assume that this teacher

teaches number sense really well and this teacher teaches algebra really well. So then we could make some inferences

about differences in instruction or teacher effects that are being detected. But in reality, the problem we have,

is that when you think about multi-group measurement, it varies, but in these kinds of studies, again, we're normally talking about small numbers of groups, often just two. But say, on one data set I was working with, I had 40 teachers. Now I have 40 different CFA models to compare, and that, again, it will actually converge, I will give it that, but it won't generate fit statistics, which seems like a pretty big problem. And I had a student who worked on that, cuz she was really convinced that the fit statistics were not a problem. So we ran some simulation studies, and found that our parameter estimates were really, really biased, that we couldn't trust them. Although, if you move up to clusters of 100 students in a classroom, Aspera, Hook, and Moutin suggest that it does work. Okay, so those are just different ways I've tried to run multilevel measurement studies, and the problems with trying to think about the effectiveness of a test to measure instruction, given the way that schools are structured today. Questions or thoughts, things I might try that I haven't tried yet?>>[APPLAUSE]>>[INAUDIBLE].>>What, converge, it-

>>[INAUDIBLE] too much time, or it's not adding time, [INAUDIBLE]?>>It basically gets looped up and never generates estimates. You can leave your computer on for five days, and you can adjust the number of iterations upwards to the point that it seems ridiculous, and it won't ever actually hit the, it will never come to a solution.>>[INAUDIBLE] is not justified, maybe if you change other words and stuff somewhere.>>Yeah, so I've actually talked to people when I was trying to make this particular stuff work. I actually spent some time with Linda Muthén. I don't know how many of you are familiar with Mplus, but it's a very big, sort of very popular psychometrics package. It also does a lot of interesting modeling in general. And they just kept saying, so you're trying to estimate 4,000 parameters with 12,000 participants, that's just a silly idea to begin with. And in particular, for this approach, we used Bayesian [INAUDIBLE], because Bayesian modeling is better at dealing with small sample sizes. But you can't fix everything with Bayesian statistics. If you're trying to make too many inferences from too little data, there's very little I think you can actually do about it; at least, that's the conclusion I'm coming to.>>Is it correct to say that none

of these methods will let you recover the [INAUDIBLE] between-student variance estimates, that the number of students is too small in the classroom?>>So what I can do is, if I take the classroom out, I can get between-student variance estimates. So one of my workarounds was to sort of get rid of level three and just look at the between-student variance estimates, and then to take those estimates and aggregate them up to the classroom, and by hand get a proxy between-classroom variance estimate, by just taking the variance between the aggregated classroom residuals. So that was my back-of-the-envelope approach, but I'm sure there's all kinds of reasons why that's a bad idea.>>[INAUDIBLE]

>>But I don’t know, I was waiting for someone here to be like,

why on Earth would you ever do that? But that’s the closest

thing I’ve come up with.>>Okay, so this is one of the big

divides between economics and the rest of the world. We do not do multi level model,

except for random effects model. And that’s why I’m not [INAUDIBLE] to,

probably couldn’t anyway if I knew it, but I’m not the person to ask. But I’m curious to know, first of all,

who here actually is part of the research, either now or going to be soon,

to be doing multi level modeling? All right, so

we’re in two different worlds here, okay. And the second question is, it already

came up once to [INAUDIBLE] economics, but it wasn’t well advertised. Chuck Huber from Stata will give a talk

on structural equation modeling. Who would be interested in

having him come here to do that?>>You know that psychology offers

amazing [INAUDIBLE] classes too, are you familiar with that?>>Yeah, but the thing is,

a class is a class.>>That is true.>>[LAUGH]

>>The benefit of this is it's much more efficient, and I can read books and so on. But there's many things I need to know, and the people around me don't do structural equation modeling. I've actually worked on it, cuz it's just a question of time, as to how much we put in the new edition of [INAUDIBLE] Stata, just to point out to people who don't use it what they could be missing out on. But I'm just saying, it's two different worlds, yeah.>>So this isn't really a stats question, but would you be able to get anything from grouping items together?>>So there are these things called differential bundle functioning and differential test functioning models, that do allow for clustering of items. And that is actually the next place we're starting to move. I'm still trying to really understand sort of theoretically how the items are clustered, the way that Paul was describing, where they're considered to be independent, but I'm not deep enough into the meaningful differences in the models yet to feel comfortable running them.>>Okay, thank you.>>[APPLAUSE]
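(A postscript on the multi-group comparison described in this talk: a full per-classroom CFA is what failed to produce fit statistics, but the logic can be shown with a much cruder stand-in, the item-rest correlation computed separately per group. The two simulated classrooms and the weakened loading on item 0 below are invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(3)

def classroom(n, loadings):
    """Continuous item scores: common factor times loadings, plus noise."""
    f = rng.normal(0, 1, n)                       # latent math ability
    noise = rng.normal(0, 1, (n, len(loadings)))
    return f[:, None] * np.asarray(loadings) + noise

# Classroom B's instruction changes how item 0 relates to the construct.
A = classroom(200, [0.8, 0.8, 0.8, 0.8, 0.8, 0.8])
B = classroom(200, [0.1, 0.8, 0.8, 0.8, 0.8, 0.8])

def item_rest_corr(scores):
    """Correlation of each item with the sum of the other items --
    a cheap proxy for a factor loading."""
    total = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
                     for i in range(scores.shape[1])])

gap = item_rest_corr(A) - item_rest_corr(B)
print(np.argmax(gap))    # the item whose group difference is largest
```

An item whose relationship to the rest of the test differs sharply across classrooms is the kind of item the loading comparison would flag as possibly reflecting a difference in instruction.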

>>I apologize. I closed it and it went to sleep.>>Okay, so our last speaker is Eduardo Estrada from psychology. And I'll give you a five minute warning, 25 minutes.>>Good afternoon. I'm going to share some of the ideas that we have been working on, and I would really like to hear your comments about it. So feel free to tell me and say [INAUDIBLE]. For this project, we are collaborating with some colleagues in the school of medicine. We're talking about a [INAUDIBLE] method we want to propose, but in this context, we apply it to reading abilities, the development of reading abilities. So in pretty much all

fields, identifying change at the individual level is very useful, not only in psychology. For example, if you are a physician and you are planning an intervention, a behavioral intervention, you might want to know which particular persons are not changing, or not changing enough, so they need some pharmacological treatment, for example. In the context of education, the usual question you often get is: okay, tell me, which children are not catching up with the rest of the class? Which ones? Which specific person [INAUDIBLE]. And the traditional methods for this, [INAUDIBLE] statistics focus on samples, and the whole sample [INAUDIBLE]. So our purpose is to propose a method that can predict individual change and identify the scores that show that a person is not changing enough, or is changing too much. [INAUDIBLE] And one of the things I go by is I try to propose things that can actually be used by people. And when I say people, I don't mean the people in this room. I mean the people in schools, who often don't know as much as we do about statistics. So I try to keep things simple. To [INAUDIBLE] that might have some [INAUDIBLE] too. You can [INAUDIBLE] something. So what are we doing? Linear regression. How many of you have

used linear regression?>>[INAUDIBLE]>>[LAUGH]>>Some people have [INAUDIBLE] not aware. How many of you have computed the confidence interval for [INAUDIBLE] linear regression? So this should be about half, here, here. So in a linear regression, let's assume this is a [INAUDIBLE] score, [INAUDIBLE]. This is the time one score and this is the time two score. So we are trying to predict [INAUDIBLE] time two with time one, and for this point estimate, we can compute a confidence interval. And the way people usually do this is, okay, for the people [INAUDIBLE] what is going to be the mean, then, [INAUDIBLE] the confidence interval for the mean [INAUDIBLE]. But you can also compute the confidence interval for each individual prediction, okay? Am I making sense? The other thing that this changed [INAUDIBLE], and the way of computing the [INAUDIBLE] is different too, it must be [INAUDIBLE], there is a different computation, a different formula. So, [INAUDIBLE] with [INAUDIBLE] it could be the [INAUDIBLE] for a different level of [INAUDIBLE]. The confidence band for the mean has this shape, and if we compare that to the one for the individuals, it's much wider. But how can this be used

to interpret change? Well, we can look at people [INAUDIBLE]. This person changed an acceptable [INAUDIBLE]. This person changed more than just [INAUDIBLE] but still is, we think, [INAUDIBLE]. This person [INAUDIBLE]. And this person crossing [INAUDIBLE] is worse than expected, even if he or she [INAUDIBLE] this is higher than this. [INAUDIBLE] One thing to note is that, judging by the [INAUDIBLE], and judging by the [INAUDIBLE], we are testing the whole sample, okay, as compared to a single-case exam, for example. So we basically use an [INAUDIBLE] as a measure of [INAUDIBLE]. And we can set the confidence level to show atypical change. If [INAUDIBLE] confidence for the individual, we reject the null hypothesis, which in this case states that these cases are from this population. So these two are not from this population, [INAUDIBLE]. Okay, so far, does it make sense? So any score above the upper limit means that the person changed more than the average group change, the expected change for people with that score, and anyone below the [INAUDIBLE]. Even in the context of an [INAUDIBLE], this person [INAUDIBLE] is enough to ignore [INAUDIBLE]. So this method has been used in the context of neuropsychology, [INAUDIBLE] cases with two [INAUDIBLE] in [INAUDIBLE], showing that this works very well, [INAUDIBLE] type one errors. So when you set the confidence level, you have the rate of type one error. So what we are doing here is to go one step beyond and extend the [INAUDIBLE] not to two time points, but several time points. [INAUDIBLE] This is not an actual autoregressive model, because we are not saying that [INAUDIBLE] we are not [INAUDIBLE] what happened before. We ask [INAUDIBLE] will give this information, okay?>>[INAUDIBLE]>>This is so for computing the confidence-

>>I think it’s easy if you go back to the picture, stop.>>[INAUDIBLE]

>>Yes. So the first confidence band is based on the fact that we got a beta hat and

not a beta. Just controlling for

estimation noise across everything. Now for that one,

you can control [INAUDIBLE] You can get [INAUDIBLE] correctly. You get the second one, you’re having to

make assumptions about the [INAUDIBLE] and the topics of the [INAUDIBLE]. My guess is that it’s assumed that

the errors are [INAUDIBLE], all right? You’re adding in the same amount. But actually,

you could argue that out in the tails, you should be Having a bigger one.>>So if this is going to

there is no displacement here.>>But I think it’s coming across to

beat a hat rather than your assumption. Well, if you come back to your equation, you are having,

to one with the [INAUDIBLE] term start. All you [CROSSTALK]

>>I have [INAUDIBLE] at the end of the presentation [INAUDIBLE].>>But you are going to

have to say something about what that [INAUDIBLE] for individual life. Yes, the one, that one is saying everyone has the surname, individual there seems squared.>>[INAUDIBLE]. That’s not caused the [INAUDIBLE]

speaks.>>Actually, what I wouldn’t show that,

no it doesn’t [INAUDIBLE]>>Yeah, it’s a [INAUDIBLE].>>Yeah.

>>That’s something [INAUDIBLE]. [INAUDIBLE] individual change. It doesn’t go to, it gets to a point where

you cannot make it thinner by more time or more [INAUDIBLE]

>>It’s a varience [INAUDIBLE] that always has been intriguing [INAUDIBLE]

>>Okay, so the first thing we need to do with this idea is apply it to some interesting empirical data. We are using data from the Connecticut Longitudinal Study, which is an ongoing study following a very large group of children, for which measures of learning, attention, and reading disabilities, and abilities, are taken. This is the sample size of the study, and these kids were tested at first grade and then annually assessed all the way up to grade nine. [INAUDIBLE] but basically the main idea here is [INAUDIBLE]. So the measures I'm using here are at grades 1, 3, 5, 7, 9, particularly the [INAUDIBLE]. That's a measure of their non-verbal ability, [INAUDIBLE] the [INAUDIBLE] from the [INAUDIBLE], that's probably more narrow abilities.>>[INAUDIBLE]

>>This is the way the data look, [INAUDIBLE] sample. As you can see, it is very common in the context of education, [INAUDIBLE] if you have IQ scores, the raw scores. So in this curve, the higher the curve, the more ability. Here, the mean at each grade is 100, so you can see the deviation from each person to their mean. And because [INAUDIBLE] educational [INAUDIBLE], we are using this [INAUDIBLE], but the same method can be used here.>>Raw scores, that's the number correct or the percent correct? The raw score, is that the number of items answered correctly or the percent of items answered correctly?>>In these three cases it is.>>So technically they're ordinal, not interval, because the differences in item difficulty are not being taken into account.>>Right.>>So the IQ metric is a better one for parametric statistics, from that perspective.>>Seems fair, I hadn't noticed that. It may not be that, but that's not [INAUDIBLE] to use.>>[LAUGH]

>>But my point here is that this technique can be applied to both types of data; I was focusing here on the shape. So this is [INAUDIBLE] and this is [INAUDIBLE]. Raw scores tend to increase, while the IQ mean is going to be 100 always, and the method can be used both with increasing and with decreasing scores. So we are going to use this [INAUDIBLE] method to predict these data and see how it works; we’ll see what happens.

The first thing you need to do is find out how many previous measures are useful. We have four previous measures [INAUDIBLE], so the basic approach is a stepwise version of regression, using [INAUDIBLE] as a measure of fit. If further measures do not improve the predictions much, we keep the more parsimonious model. So, basically, for grade 5 we have one previous measure; when we add to it, we first include grade 3 with grade 1, and it is significant. For grade 7 we can use grades 5 and 1, but for grade 9, after we enter the same [INAUDIBLE] three, the rest are not very useful. So we see with these data that three previous measures are enough, and with that we can move on. So now we need to estimate the CIs for

these grades [INAUDIBLE]. And this is one of the ways we can plot it, okay. These are the predicted and observed scores for grade 3. And this is very similar to the first plot I showed, because this is not [INAUDIBLE], but it is a linear transformation of it. So this would be, in the least-squares case, showing the exact same thing. So [INAUDIBLE] we have 60 cases [INAUDIBLE], cases below the [INAUDIBLE] in this case. In the remaining three plots, the prediction is actually a linear combination of the previous occasions, right? And the [INAUDIBLE] are the errors of the cases, and we can see that the type I errors are close to 5% in all cases.

Another way of plotting this, and this is actually probably more informative for applied practitioners, is to look at one person,

because we want to [INAUDIBLE]. So let’s take case 5570 [INAUDIBLE]. What I have listed here is the SR95 for this group, for the whole group, and the [INAUDIBLE] and the median, so these three [INAUDIBLE] group indicators, and the first [INAUDIBLE] with this information, and the information for the whole group [INAUDIBLE] for this person [INAUDIBLE]. So [INAUDIBLE], and we see that this [INAUDIBLE] was the [INAUDIBLE]. But with this kind of data, [INAUDIBLE] grade 3 is near around the [INAUDIBLE], which is very close to the lower line, below this line. So we go to the next person, and we see this in a much better perspective, [INAUDIBLE] close to or [INAUDIBLE] information. So in the next step, it is correct again. And the red line, the group predictions, is listed. That’s still the model that we validated [INAUDIBLE]. [INAUDIBLE],

any questions so far? So, one of the things [INAUDIBLE], and I will [INAUDIBLE] to let you see that this is useful: this could be used in a school. We could have panels like these for different kids; here we have three different kids. We can see that these two kids started very similarly, but the evolution is very different for each of them. And we can see that this person is very good, she has a very high level, but [INAUDIBLE] some of these [INAUDIBLE] performance. So [INAUDIBLE] here [INAUDIBLE] this [INAUDIBLE] again [INAUDIBLE] positions [INAUDIBLE] receive an intervention for [INAUDIBLE]. To sum up, the [INAUDIBLE]

method based on linear regression allows us, first, to estimate the expected change for the sample at each occasion; second, to compute the expected confidence intervals for the [INAUDIBLE] of observed values; third, to determine which individuals show atypical change, higher or lower than expected; and fourth, to study the trajectories in different variables for these individuals, and to [INAUDIBLE] estimate [INAUDIBLE] for different individuals even without their being in the original sample, right? And this is one of the features

that is interesting and unusual. So I will move on. Here we are completing this with information that we already have. Once we have the model, and assuming that we have new cases from the same population, we could actually estimate the whole [INAUDIBLE] and then update it at each point. And so we can actually focus [INAUDIBLE] in this case. If this person [INAUDIBLE]. [INAUDIBLE] for each new [INAUDIBLE] occasion [INAUDIBLE] for the person [INAUDIBLE] atypical change for those particular scores [INAUDIBLE], and in this case one of the findings for this reading data is that [INAUDIBLE] occasions are enough. And one of the main [INAUDIBLE] of this part is that we have a [INAUDIBLE] of this model [INAUDIBLE]. [INAUDIBLE] But I think that this is interesting. And a nice thing [INAUDIBLE] is that this distinguishes [INAUDIBLE] the different estimation procedures. [INAUDIBLE]. Time?

>>I forgot to reset my clock,

we’ll give you, say, another three minutes.

>>So another thing we could do is change the confidence level. If we want to gain sensitivity, then we need to work with less than the 95% [INAUDIBLE], and set the rejection [INAUDIBLE] in the lower bound of the plot, until we get to 100%, but [INAUDIBLE] not increasing at the right.

Now, some things that this model does not do. First, it does not characterize change: this is not a model for change. We are not describing the change; we are just trying to predict it. And it doesn’t explain why people don’t change as they should. You can add covariates to the prediction, to the model, but the interpretation of that is [INAUDIBLE], because if you have the right covariates, [INAUDIBLE]. Does that make sense? So, the [INAUDIBLE] covariates. [INAUDIBLE] What it does is [INAUDIBLE] checking the real cases and [INAUDIBLE]. As I said before, once we have the model, we can focus on these new cases and see [INAUDIBLE]. One of the applications of this [INAUDIBLE] would be to detect the cases with dyslexia, for example, and to decide on possible interventions for those cases based on this [INAUDIBLE].

Now, some things we would like to do. One is to incorporate this method when there is a change [INAUDIBLE]. How can we do the estimation, for example, in situations where you have structured [INAUDIBLE]? It should be easy to [INAUDIBLE]; I don’t think [INAUDIBLE]. And another thing we would like to do is compare this with [INAUDIBLE], a model that tries to find the different subgroups within the population, which is conceptually the same thing this model is doing, and see whether it tells us the same thing or not. Okay, that was everything I wanted to say.

>>[APPLAUSE]
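[Editor’s note: the flagging procedure described in the talk — predict each new score from the earlier grades by least squares, then flag anyone whose observed score falls outside a band of about 1.96 residual standard deviations — can be sketched roughly as follows. All data here are simulated and every coefficient is invented; the sketch also uses the pooled residual SD rather than the exact per-case prediction standard error.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated IQ-like scores (mean 100, SD 15) at grades 1, 3, 5.
# The dependence of later grades on earlier ones is invented.
n = 2000
g1 = rng.normal(100, 15, n)
g3 = 100 + 0.8 * (g1 - 100) + rng.normal(0, 8, n)
g5 = 100 + 0.5 * (g1 - 100) + 0.4 * (g3 - 100) + rng.normal(0, 7, n)

# OLS: predict the grade-5 score from grades 1 and 3 (with intercept).
X = np.column_stack([np.ones(n), g1, g3])
beta, *_ = np.linalg.lstsq(X, g5, rcond=None)
resid = g5 - X @ beta
sigma = resid.std(ddof=X.shape[1])  # residual standard deviation

# Flag individuals whose observed score falls outside the ~95% band,
# i.e. whose change is atypically high or low given their history.
flagged = np.abs(resid) > 1.96 * sigma
print(f"flag rate: {flagged.mean():.3f}")  # close to 0.05 by construction
```

With well-behaved (homoscedastic, normal) errors, the proportion flagged matches the nominal type I error of about 5%, which is the calibration check reported in the talk.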

>>Any questions?

>>So I have a quick one. Can you go back to your equation? The one with the epsilon in it. It has an error term in it.

>>One second.

>>Okay, suppose we could perfectly estimate this. Then the only thing to worry about is this. So we’re asking, when this gets big enough, there’s a flag: something’s gone wrong. Okay, and with the confidence-interval approach, we’re basically saying the flag goes up if it’s bigger than 1.96 times the estimated standard deviation, okay? Now what if you have one person who year after year is just really steady on the test? They have small variability of epsilon. And then there’s someone else who’s just all over the place, all right? One year they’re way up, the next they’re down. Well, that person you’re going to misidentify as needing attention when they don’t need attention, right? So I think the heteroscedasticity here may matter. Now, given your results, it suggests to me that you were working in a setting where there wasn’t much heteroscedasticity, there wasn’t much variability in the error. But with your data you could check that. And you could run simulations with wildly heteroscedastic data and just see whether the flags are then not adding up to 5%, right? That’s what I wanted to say. I’ve concluded this is very, very useful, and I was persuaded by your particular data example. It’s just a matter of robustness, right? And with this sort of data, it may be that’s the way it always is. Okay, so yeah.

>>I just would like to suggest

another place you might want to go. So we know that the growth of children, especially in reading, is not linear, especially based on where they start, right? So there are people who have done some work in quantile regression, where they calculated those regression curves completely separately for the different levels of initial performance. And it would be interesting to just sort of see how this might be applied. I mean, I know this is just one kind of slice of it, but I think it works better than OLS for things where you’ve got growth that you know is going to look very different at different parts of the distribution, in terms of where you start. Did that make sense?

>>Yeah, it makes sense, it really does. We might adopt a new plan.

>>Okay,

we’ll make this the last question.>>Yeah, I have a comment. I think it’s kind of related

to what Collin said, which is about the effect of

measurement error on the method. Because if you don’t take into

account measurement error, it’s basically confounded with

a prediction error, right? So, if you get a huge epsilon, it could

be a prediction error, but it could also be that you’re just not measuring it-

>>It’s a measure of both.

>>Right, yeah, so one direction that you may want to explore is how measurement error can have an effect on this.

>>And how would you, what is your intuition? How do we attack that? Because I don’t know [INAUDIBLE] latent variables, and I really don’t know how to use this with latent variables. Because you are using the whole matrix [INAUDIBLE] of scores [INAUDIBLE], you would need, I think, the factor scores, and that opens Pandora’s box.

>>Yes.

>>I understand that well enough. That’s very interesting. I definitely want to go over it, too.

>>Okay, we’re done. Well, thank you very much. Thank you.

>>[APPLAUSE]
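[Editor’s note: the robustness check suggested earlier in the discussion — simulate wildly heteroscedastic data and see what happens to the flags — might look roughly like this sketch. All numbers are invented; each simulated person gets their own noise SD, and a single pooled band is used as in the talk.]

```python
import numpy as np

rng = np.random.default_rng(1)

# Give each simulated person their own noise SD, from tight to wild.
n = 5000
noise_sd = rng.uniform(2, 20, n)
g1 = rng.normal(100, 15, n)
g3 = 100 + 0.9 * (g1 - 100) + noise_sd * rng.normal(0, 1, n)

# Pooled OLS fit and a single pooled 1.96-SD band, as in the talk.
X = np.column_stack([np.ones(n), g1])
beta, *_ = np.linalg.lstsq(X, g3, rcond=None)
resid = g3 - X @ beta
flagged = np.abs(resid) > 1.96 * resid.std(ddof=2)

# Nobody here changed in any systematic way, yet the high-variability
# people are flagged far more often than the stable ones.
print(f"noisy people flagged:  {flagged[noise_sd > 15].mean():.2f}")
print(f"stable people flagged: {flagged[noise_sd < 5].mean():.2f}")
```

This illustrates the misidentification concern raised in the Q&A: with a pooled band, the flag rate concentrates on the individuals with large epsilon variance rather than staying near 5% for everyone.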