Download - Fcv poster parikh

Transcript
Page 1: Fcv poster parikh

Image   Binary  descrip0ons   Rela0ve  descrip0ons  

not  natural,  not  open,  perspec0ve    

more  natural  than  tallbuilding;  less  natural  than  forest;  more  open  than  tallbuilding;  less  open  than  coast;  more  perspec0ve  than  tallbuilding;  

not  natural,  not  open,  perspec0ve  

more  natural  than  insidecity;  less  natural  than  highway;  more  open  than  street;  less  open  than  coast;  more  perspec0ve  than  highway;  less  

perspec0ve  than  insidecity  natural,  open,  perspec0ve  

more  natural  than  tallbuilding;  less  natural  than  mountain;  more  open  than  mountain;  less  perspec0ve  than  opencountry;    

White,  not  Smiling,  VisibleForehead  

more  White  than  AlexRodriguez;  more  Smiling  than  JaredLeto;  less  Smiling  than  ZacEfron;  more  VisibleForehead  than  JaredLeto;  less  

VisibleForehead  than  MileyCyrus  

White,  not  Smiling,  not  VisibleForehead  

more  White  than  AlexRodriguez;  less  White  than  MileyCyrus;  less  Smiling  than  HughLaurie;  more  VisibleForehead  than  ZacEfron;  less  

VisibleForehead  than  MileyCyrus  not  Young,  

BushyEyebrows,  RoundFace  

more  Young  than  CliveOwen;  less  Young  than  ScarleMJohansson;  more  BushyEyebrows  than  ZacEfron;  less  BushyEyebrows  than  AlexRodriguez;  

more  RoundFace  than  CliveOwen;  less  RoundFace  than  ZacEfron  

2.  Learning  Rela-ve  A0ributes  

Categorical  (binary)  aMributes  are  restric0ve  and  can  be  unnatural  

Rela-ve  A0ributes  Devi  Parikh  (TTIC)  and  Kristen  Grauman  (UT  Aus0n)  

6.  Datasets  Proposed  idea:  Rela-ve  A0ributes   Richer  communica0on  between  

humans  and  machines   Describe  images  or  categories  rela0vely  

e.g.  “dogs  are  furrier  than  giraffes”,  “find  less  congested  downtown  Chicago  scene  than  

 Learn  a  ranking  func0on  for  each  aMribute  

1.  Main  Idea   4.  Rela-ve  Zero-­‐shot  Learning  Mo-va-on:  

Natural   Not  Natural  ?  

Smiling   Not  Smiling  ?  

Enables  new  applica-ons   Novel  zero-­‐shot  learning  from  aMribute  

comparisons   Precise  automa0cally  generated  textual  

descrip0ons  of  images  

5.  Describing  Images  Rela-vely  { }{ },. . .∼},( ){ ,

. . .�For  each  aMribute  

rm(xi) = wTmxiLearn  a  scoring  func0on    

∀(i, j) ∈ Om : wTmxi > wT

mxj ∀(i, j) ∈ Sm : wTmxi = wT

mxj

that  best  sa0sfies  constraints:  

Supervision  is  am, Sm:Om:

Max-­‐margin  learning  to  rank  formula-on  

� ∼ �Young:   Smiling:  …  

Learnt  rela-ve  a0ributes  

Rela-ve  a0ributes  space  

 Training:  Images  from  S  seen  and  descrip0ons  of  U  unseen  categories  

 Tes0ng:  Categorize  image  into  N  (=S+U)  categories  

Young:   H  C  S  � � Z  M  �

Unseen  categories  

Smiling:   Z  M  � Need  not  use  all  aMributes   Need  not  relate  to  all  S  

Smiling  

Youth  

S  

H   Z  C  

M  

Auto  -­‐  generate  textual  descrip-on  of:  

�Density:   …  Learnt  rela-ve  a0ributes  

3.  Ranking  Func-on  vs.  Binary  Classifier  Score  

min

�1

2||wT

m||22 + C

��ξ2ij +

�γ2ij

��

s.t wTm(xi − xj) ≥ 1− ξij , ∀(i, j) ∈ Om; |wT

m(xi − xj)| ≤ γij , ∀(i, j) ∈ Sm;

ξij ≥ 0; γij ≥ 0

%  correctly  ordered  pairs   Classifier   Ranker  

Outdoor  scenes   80%   89%  

Celebrity  faces   67%   82%  

Rela-ve  a0ributes  space  

Density  

1/8  dataset  

“more  dense  than    ,  less  dense  than  

C   C   H   H   H   C   F   H   H   M   F   F   I   F  

“more  dense  than  Highways,  less  dense  than  Forests”  

”  

Binary RelativeOSR TI S HCOMF

natural 0 0 0 0 1 1 1 1 T≺I∼S≺H≺C∼O∼M∼Fopen 0 0 0 1 1 1 1 0 T∼F≺I∼S≺M≺H∼C∼O

perspective 1 1 1 1 0 0 0 0 O≺C≺M∼F≺H≺I≺S≺Tlarge-objects 1 1 1 0 0 0 0 0 F≺O∼M≺I∼S≺H∼C≺Tdiagonal-plane 1 1 1 1 0 0 0 0 F≺O∼M≺C≺I∼S≺H≺Tclose-depth 1 1 1 1 0 0 0 1 C≺M≺O≺T∼I∼S∼H∼FPubFig ACHJ MS VZ

Masculine-looking 1 1 1 1 0 0 1 1 S≺M≺Z≺V≺J≺A≺H≺CWhite 0 1 1 1 1 1 1 1 A≺C≺H≺Z≺J≺S≺M≺VYoung 0 0 0 0 1 1 0 1 V≺H≺C≺J≺A≺S≺Z≺MSmiling 1 1 1 0 1 1 0 1 J≺V≺H≺A∼C≺S∼Z≺MChubby 1 0 0 0 0 0 0 0 V≺J≺H≺C≺Z≺M≺S≺A

Visible-forehead 1 1 1 0 1 1 1 0 J≺Z≺M≺S≺A∼C∼H∼VBushy-eyebrows 0 1 0 1 0 0 0 0 M≺S≺Z≺V≺H≺A≺C≺JNarrow-eyes 0 1 1 0 0 0 1 1 M≺J≺S≺A≺H≺C≺V≺ZPointy-nose 0 0 1 0 0 0 0 1 A≺C≺J∼M∼V≺S≺Z≺HBig-lips 1 0 0 0 1 1 0 0 H≺J≺V≺Z≺C≺M≺A≺S

Round-face 1 0 0 0 1 1 0 0 H≺V≺J≺C≺Z≺A≺S≺M

0 1 2 3 4 50

20

40

60

80

Accu

racy

# unseen categories

OSR

0 1 2 3 4 50

20

40

60

Accu

racy

# unseen categories

PubFig

DAP SRA Proposed

1 2 5 150

20

40

60

Accu

racy

# labeled pairs

OSR

1 2 5 150

20

40

60

Accu

racy

# labeled pairs

PubFig

DAP SRA Proposed

6 5 4 3 2 10

20

40

60

Accu

racy

# att to describe unseen

OSR

11109 8 7 6 5 4 3 2 10

20

40

60

Accu

racy

# att to describe unseen

PubFig

DAP SRA Proposed

1 2 30

20

40

60

Accu

racy

Looseness of constraints

OSR

1 2 30

20

40

60

Accu

racy

Looseness of constraints

PubFig

DAP SRA Proposed

1 2 30

20

40

60

80

100

# top choices

% c

orre

ct im

age

in to

p ch

oice

s

BinaryRelative

  Outdoor  Scene  Recogni-on  (OSR):  2688  images,  8  categories:  coast  (C),  forest  (F),  highway  (H),  inside-­‐city  (I),  mountain  (M),  open-­‐country  (O),  street  (S)  and  tall-­‐building  (T),  gist  features;    

  Public  Figure  Face  (PubFig):  800  images,  8  categories:  Alex  Rodriguez  (A),  Clive  Owen  (C),  Hugh  Laurie  (H),  Jared  Leto  (J),  Miley  Cyrus  (M),  ScarleM  Johansson  (S),  Viggo  Mortensen  (V)  and  Zac  Efron  (Z),  gist  and  color  features  

8.  Zero-­‐shot  Learning  Results  

Whereas  conven0onal  Binary  descrip-on:   “not  dense”  

Not  dense:  Dense:  

Rela-ve  descrip-on:  

7.  Image  Descrip-on  Results  

wb wm

Number  of  unseen  categories  

Amt.  of  labeled  data  to  learn  a0ributes  

Amount  of  descrip-on  

Quality  of  descrip-on  

Baselines:     Direct  AMribute  Predic0on  (DAP)              [Lampert  et  al.  2009]  (binary)  

   Classifier  instead  of  ranker  (SRA)  

min

�1

2||wT

m||22 + C

��ξ2ij +

�γ2ij

��

s.t wTm(xi − xj) ≥ 1− ξij , ∀(i, j) ∈ Om; |wT

m(xi − xj)| ≤ γij , ∀(i, j) ∈ Sm;

ξij ≥ 0; γij ≥ 0

min

�1

2||wT

m||22 + C

��ξ2ij +

�γ2ij

��

s.t wTm(xi − xj) ≥ 1− ξij , ∀(i, j) ∈ Om; |wT

m(xi − xj)| ≤ γij , ∀(i, j) ∈ Sm;

ξij ≥ 0; γij ≥ 0

Adapted  objec0ve  from  [Joachims,  2002]  

How  do  learned  ranking  func0ons  

differ  from  classifier  outputs?  

Infer  image  category  using  max-­‐likelihood  

More  smiling  than  

Less  smiling  than  

More  VisFHead  than  

Less  VisFHead  than  

More  chubby  than  

Less  chubby  than  

Human  subject  experiment:  Which  image  is?  

OSR   PubFig  

c(x) = argmax�

m

P (acm|x)

classical  recogni0on  problem   binary  ~  rela0ve  supervision  

<<  baseline  supervision   can  give  unique  ordering  on  all  classes  

An  aMribute  is  more  discrimina0ve  when  used  rela0vely  

Rela0ve  aMributes  jointly  carve  out  space  for  unseen  category  

”  

?   ?   ?  Example  descrip-ons