This page belongs to EPFL's web archive and is no longer updated.

Daniel Grollman

Last Entry

Today is my last day.  I've gotten all of my stuff squared away, and also managed to get a little writing done.  Unfortunately, not enough, so I'll need to put in a solid day, or maybe two, next week.  But I should be able to do that.

The scholarpedia article needs a fairly extensive rewrite.  Fortunately, that's mostly textual; the formulation itself is in good shape.  I may need to redo an image or two, but that should be ok.

The journal article is more troublesome; I'll need to do a fair bit of math.  I'm going to try to present everything we've done on the LfF side of things, even if it didn't work.  Negative results are still results, yes?

Posted by Daniel Grollman at 17:43

I was supposed to spend today drafting my final documents, but I got bogged down in some other last-minute details that had to be done.  Nevertheless, I've got some outlines; I just need to put some meat on them.

I'm a bit disappointed that there's a whole bunch of stuff I've done that's not being published and will likely get lost once I leave.  For example, the comparison between GMR and GPR, the linear Donut stuff, the MultiDonut and the full GPR gradient.  My hope, of course, is that I find some use for it down the road, but I still feel like a white paper or two may be doable.  Perhaps my upcoming vacation will be more work-heavy than I expect.

Posted by Daniel Grollman at 17:23
Well now

That didn't work.  Lots of interesting math, some cool stuff, but it doesn't do what we want.  And I think it's time to stop barking up this tree, so tomorrow I start writing, and start running some experiments to actually give some results.  I aim to show, at least, the utility of learning from failures.  So it might end up being more of a position paper...hmmm.

Posted by Daniel Grollman at 18:26
which way did he go?

Is the big question.  Related, of course, to the gradients and the reward stuff.  I think I've got the gradients mostly understood.  I can shape the Gaussian, but I'm having trouble coming up with correct covariance matrices, given directions.  Man, that is harder than it should be.
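For the record, one way to build a covariance matrix from a direction is to rotate a diagonal one; a minimal 2D numpy sketch (the `along`/`across` variances are placeholder numbers, not values from the actual system):

```python
import numpy as np

def oriented_covariance(direction, along=1.0, across=0.1):
    """Build a 2D covariance matrix whose principal axis is aligned
    with `direction`, with variance `along` in that direction and
    `across` perpendicular to it."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)          # unit vector along the gradient
    p = np.array([-d[1], d[0]])        # perpendicular unit vector
    R = np.column_stack([d, p])        # rotation: columns are eigenvectors
    return R @ np.diag([along, across]) @ R.T

cov = oriented_covariance([1.0, 1.0], along=2.0, across=0.5)
# symmetric positive definite, eigenvalues {2.0, 0.5} by construction
```

Since the result is an eigendecomposition by construction, it stays symmetric positive definite for any direction and positive variances.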

But, before it gets too complicated, I want to run some experiments with the simple method.

Posted by Daniel Grollman at 17:53
Not dead yet

Nor am I where I wanted to be.  Alas, I'm still doing research, although I did clean out my desk during a break today.

I'm doing more with this gradient stuff.  I've been focusing on learning how the kernel parameters change the shape of the predicted values.  (I'm still not using anchors.)  The key seems to be to have good kernel widths, and then the values kinda take care of themselves.  In 1D, I derived that the maximum correlation between gradient and value occurs when distance = sqrt(width).  And that gave pretty decent results, making sure that the gradient 'extends' into the next point.  There were also decent results just taking width = gradient, although I'm not sure why.
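That 1D result can be checked numerically; a small sketch, assuming the kernel is parameterized as k(r) = exp(-r^2 / (2*width)), so that the cross-covariance between a derivative observation and a value is (r/width)*k(r):

```python
import numpy as np

def grad_value_cov(r, width):
    """Cross-covariance between a derivative observation at x and a
    value at x + r, under an RBF kernel k(r) = exp(-r^2 / (2*width)):
    cov(f'(x), f(x+r)) = (r / width) * exp(-r^2 / (2*width))."""
    return (r / width) * np.exp(-r**2 / (2.0 * width))

width = 0.5
r = np.linspace(1e-3, 3.0, 10000)
r_best = r[np.argmax(grad_value_cov(r, width))]
# maximum sits at r = sqrt(width), matching the 1D derivation
```

Setting the derivative of r*exp(-r^2/(2w)) to zero gives 1 - r^2/w = 0, i.e. r = sqrt(w), which is the "gradient extends into the next point" spacing.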

Going into 2D, however, I ran into some trouble.  The formulation assumed equal widths in all dimensions, and no cross-dimensional terms.  So I extended the derivation to deal with those.  Now we can have values in any Gaussian-shaped region we want.  But I don't know if that makes sense.  Sort of intuitively, I expect more of a wedge shape to my influence on reward.  Or, almost, even a plane.  Or Gaussian in the direction of the gradient, with peak at the next point, and infinite in the perpendicular direction?
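The extended formulation amounts to a squared-exponential kernel with a full width matrix; a hedged numpy sketch (the matrix W below is an arbitrary example, not from the derivation):

```python
import numpy as np

def aniso_rbf(x1, x2, W):
    """Squared-exponential kernel with a full width matrix W (symmetric
    positive definite), allowing different widths per dimension and
    cross-dimensional terms: k = exp(-0.5 * d^T W^{-1} d)."""
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.exp(-0.5 * d @ np.linalg.solve(W, d)))

W = np.array([[2.0, 0.8],
              [0.8, 1.0]])   # cross-term skews the region of influence
k_same = aniso_rbf([0, 0], [0, 0], W)   # 1.0 at zero distance
```

Pushing one eigenvalue of W toward infinity gives exactly the "infinite in the perpendicular direction" shape mused about above, so the full-matrix form covers that case too.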

Still need to USE this somehow as well.

Posted by Daniel Grollman at 18:18
More Gradients

I'm running with this IRL-from-failure thing, trying to estimate the reward function given only gradient information.  It's working, kinda, but not exactly the way I want/expect.  I've one set of issues centered around whether or not I'm doing it right.  See, I'm concerned that the predicted gradients don't exactly match the ones I'm training on.  But I just did a test with learning grads from grads, and it looks right, so perhaps there's a code bug.  The other issue in this set is that predicting values just from grads may not be a sensible thing to do, in that I'm not sure if I'm doing it right, or if it's even possible.  I'm thinking about putting in an anchor point - assume the first demo has zero reward, that sort of thing.
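For reference, predicting values from derivative-only observations with an RBF-kernel GP might be sketched like this in 1D (zero-mean prior, so values are only pinned down by the prior's pull toward zero; `ell` and `noise` are placeholder settings):

```python
import numpy as np

def predict_value_from_grads(x_star, X, g, ell=1.0, noise=1e-6):
    """GP posterior mean of f(x_star) given only derivative
    observations g_i = f'(X_i), under an RBF kernel with
    lengthscale ell.  Uses the derivative-derivative covariance
    cov(f'(x_i), f'(x_j)) = (1/ell^2 - d^2/ell^4) * k(d) and the
    value-derivative covariance cov(f(x*), f'(x_i)) = (d*/ell^2)*k(d*)."""
    X = np.asarray(X, float)
    D = X[:, None] - X[None, :]
    K = np.exp(-0.5 * D**2 / ell**2)
    Kgg = (1.0/ell**2 - D**2/ell**4) * K + noise * np.eye(len(X))
    d_star = x_star - X
    k_star = (d_star / ell**2) * np.exp(-0.5 * d_star**2 / ell**2)
    return k_star @ np.linalg.solve(Kgg, g)
```

With all observed gradients positive, the posterior mean should increase left to right, which is a cheap sanity check to run alongside the grads-from-grads test.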

The second set of problems has to do with the kernel function.  Right now, it looks to me as if gradients have a limited area of effect that is constant in size.  But I think larger gradients should have effect on larger areas, so we may need a non-stationary kernel function.
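One standard non-stationary construction is Gibbs's kernel, where every input carries its own lengthscale; a sketch, with a made-up lengthscale function standing in for "larger gradient means larger area of effect":

```python
import numpy as np

def gibbs_kernel(x1, x2, ell):
    """Gibbs's non-stationary RBF: each input carries its own
    lengthscale ell(x); the prefactor keeps it positive definite."""
    l1, l2 = ell(x1), ell(x2)
    s = l1**2 + l2**2
    return np.sqrt(2.0 * l1 * l2 / s) * np.exp(-(x1 - x2)**2 / s)

# hypothetical lengthscale growing away from the origin, a stand-in
# for something like 0.5 + |gradient(x)|
ell = lambda x: 0.5 + abs(x)
k_near = gibbs_kernel(0.1, 0.2, ell)   # small lengthscales
k_far = gibbs_kernel(3.0, 3.1, ell)    # same separation, larger lengthscales
```

At equal separation, the pair sitting where lengthscales are larger is more correlated, which is exactly the "bigger gradients affect bigger areas" behaviour wanted.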

Posted by Daniel Grollman at 18:02

Today I looked at using the velocity of human trials in parameter space to constrain the gradient of the reward function.  Using a GPR, we can condition the value at a point on the gradients of the value at another point.  The novelty here is that we never observe any actual values, but we can predict them.

Doing the math, things work.  But the results don't quite make sense.  The predicted values are HUGE, which I guess is ok since I never observe real values, and have fairly large gradients.  But the predicted gradients are also ginormous, and not always pointing the right way.

And yes, I had to drop BACK to 1D to test this.  Rargh.

Posted by Daniel Grollman at 18:08

Well, that didn't work.  After all this work, I come up with what should've been an unsurprising result.  Basically, the best thing to do overall is to plop a big Gaussian down over the human's last trial and go from there.  A circular search going outward should find the solution better than anything else.  We can speed it up a bit by stretching the circle to go along with the human's trials (actually, maybe not if the human missed something), and we can of course skip over the areas that the human already tried, but that's about as good as we get.
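That outward circular search, with skipping of already-tried areas, might be sketched like so (step sizes and tolerances are invented for illustration):

```python
import numpy as np

def ring_search(center, evaluate, tried, r_step=0.2, n_angles=8,
                max_rings=50, tol=0.3):
    """Search outward in rings around `center` (the human's last trial),
    skipping candidates within `tol` of already-tried points.
    `evaluate(x)` returns True on success."""
    tried = [np.asarray(t, float) for t in tried]
    center = np.asarray(center, float)
    for ring in range(1, max_rings + 1):
        r = ring * r_step
        for a in np.linspace(0, 2 * np.pi, n_angles, endpoint=False):
            cand = center + r * np.array([np.cos(a), np.sin(a)])
            if any(np.linalg.norm(cand - t) < tol for t in tried):
                continue  # human already tried here; skip
            if evaluate(cand):
                return cand
    return None
```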

And it's not good enough.

It comes down to the fact that without reward information, we're flying blind.  And, since we're not getting a reward function, we're going to have to infer one.  Which means IRL of some sort.  But, from failure....dum dum dummmmmmm.

So, if we assume reward is linear in feature space (parameters or gaussian kernels or what?), can we use the demonstrated search pattern to infer a reward function?  And then try the max?  And when that doesn't work, update?
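Assuming reward really is linear in features, the inference could be as simple as a least-squares fit, treating each demonstrated velocity as a noisy observation of the reward gradient; a sketch (the feature Jacobian `features_grad` is a stand-in for whatever feature map we'd actually pick):

```python
import numpy as np

def fit_linear_reward(X, V, features_grad):
    """Least-squares fit of w in R(x) = w . phi(x), assuming observed
    search velocities V[i] are proportional to the reward gradient
    grad R(x) = J_phi(x)^T w at each visited point X[i]."""
    rows, targets = [], []
    for x, v in zip(X, V):
        J = features_grad(x)      # (n_features, n_dims) Jacobian of phi
        rows.append(J.T)          # grad R = J^T w, shape (n_dims, n_features)
        targets.append(v)
    A = np.vstack(rows)
    b = np.concatenate(targets)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w
```

With w in hand, "try the max" is just maximizing w . phi(x), and "update" is refitting with the new failed trial appended.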

I'm thinking perhaps a hetGPR.  I remember reading a paper where the gradient information is used to constrain the values, need to look that up.  It was: GP Implicit Surfaces for shape estimation and grasping

Posted by Daniel Grollman at 18:00

Today I looked into exponential weights.  Of course, the issue is how to set them.  So I tried fitting exponential curves to all of the data I have, and taking the average.  Unfortunately, this didn't seem to work too well.  Different parameters may do better...
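For the curve fitting itself, one quick approach is linear regression on the log (only valid for positive data; the parameters below are synthetic):

```python
import numpy as np

def fit_exponential(t, y):
    """Fit y = a * exp(b*t) by linear regression on log(y)
    (requires y > 0); returns (a, b)."""
    b, log_a = np.polyfit(t, np.log(y), 1)
    return np.exp(log_a), b

t = np.arange(5, dtype=float)
y = 2.0 * np.exp(0.5 * t)
a, b = fit_exponential(t, y)
# recovers a = 2.0, b = 0.5 on noiseless data
```

One caveat: fitting in log space down-weights the large-y points relative to a direct least-squares fit, which could be part of why the averaged parameters behave differently than expected.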

But, I ran a big comparison.  All datasets with all approaches.  One nice result is that temporally sized, positively shaped negative holes are in the top 3 scoring algs for all datasets.  Generally the difference in the mean is less than 3 trials, with one exception.

This exception is related to another issue, in that I'm using all of the demos to train the model.  Which means that the human only needed one more try to get it right, which is kinda silly.  When I roll this back, things break.  I think it's because the correct solution is not sufficiently close to the demos, so I'm trying the old 'multiply the positive model by 3' trick to see if that works.

Posted by Daniel Grollman at 17:45
Temporal Scaling

Again with the stepping back.  Simplify, simplify, simplify, and once it works, make it more complicated.  I think I leapt too fast from 1D to high-D and didn't make sure things worked thoroughly enough.

So, I collected a bunch of data from my labmates, to test things out.  I realized that I might be biasing the algorithms to my own method of searching.  So I've 5 additional datasets now.

Things I am trying:

1) Just build a Gaussian model of the demos

2) Weigh the demos linearly with time, build a Gaussian

3) Center the Gaussian from 1 on the last demo

4) Constant sized and shaped fingers on the demos (of depth 0)

5) Shaped fingers to the weighted Gaussian, sized to sum to one

6) Scale the shaped fingers linearly with time.


And, huzzah, 6 seems to be the best overall.  Which I like.  Some notes:

1) Non-zero weights don't really seem to make sense, since we KNOW these are bad demos.  They may re-appear if we use pseudo-demos (merging)

2) Perhaps an exponential weight decay would be better?  Fitting exponentials to the data seems to be reasonable...
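A rough sketch of how approach 6 might fit together, with isotropic variances standing in for the actual shapes and sizes (the numbers are placeholders):

```python
import numpy as np

def gauss(x, mu, var):
    """Isotropic Gaussian density in any dimension."""
    d = np.asarray(x, float) - np.asarray(mu, float)
    k = d.size
    return np.exp(-0.5 * d @ d / var) / np.sqrt((2 * np.pi * var)**k)

def finger_score(x, demos, var_demo=1.0, var_finger=0.05):
    """Sketch of approach 6: a Gaussian over the (failed) demos minus
    per-demo negative 'fingers', weighted linearly with time so later
    demos carve deeper holes; the finger weights sum to one."""
    demos = [np.asarray(d, float) for d in demos]
    mu = np.mean(demos, axis=0)
    w = np.arange(1, len(demos) + 1, dtype=float)
    w /= w.sum()                       # linear-in-time, sums to one
    score = gauss(x, mu, var_demo)
    for wi, d in zip(w, demos):
        score -= wi * gauss(x, d, var_finger)   # negative hole at each demo
    return score
```

The score stays high near the cloud of demos but dips at the exact tried points, which is the intended "search near, but not at, the failures" behaviour.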

Posted by Daniel Grollman at 18:02