Design an Experiment

Course note of A/B testing on Udacity: Lesson 4


#Design an Experiment

Choose “subject”

Unit of diversion

What an individual subject is in the experiment.

Unit of diversion examples:

  • commonly used:

    • user id

      • stable, unchanging

      • personally identifiable

    • Anonymous id (cookie)

      • changes when switching browser or device
      • users can clear cookies
    • Event

      • no consistent experience
      • use only for non-user-visible changes
  • less common:

    • Device id
      • only available for mobile (so here the device id is just for mobile)
      • tied to specific device
      • unchangeable by user
    • IP address
      • changes when location changes
  • **Quiz: **

    When would the user be switched to another group in experiement? (When the unit of diversion will change? )

    User-id: only valid when signing in

    Cookie: do not change if users do not clear cookies through these events, otherwise it would change (so some are “?")

    Event: change among every events

    Device id: only valid for mobile

    IP address: (hard to tell whether it changes in “?” )

1. Consistency of Diversions
  • user consistency

  • Quiz:

    Choose event when event is enough; choose cookie when event is not good and cookie is enough; choose user-id when event and cookie are not enough for experiment

    Events: for the changes that users will not notice

    Cookie: it iwll be very strange if the button size and color change when user reload the pages.

    User-id: when cross device consistency is important

2. Ethical Considerations for Diversion
  1. User-id is identified, so using user-id must be careful.
  2. Informed consent

1st: when user log in, they have already submit their email and may have already approved that their data can be collected.

2nd: if emails are collected by cookies, then it will need inform user.

3. variability

empirically/analytically computed varaibility

Sometimes empirically will be higher than analytically because unit of analysis is different from unit of diversion (unit of analysis: the denominator of metrics. For example, pageviews in CTR (clicks/pageviews)) is unit of analysis. If unit of diversion is pageviews, empirically $\approx$ analytically. If unit of diversion is cookie/user-id, varaibility will be higher, so it is better to use empirically given unit of diversion. - why: when compute analytically, make assumption on distribution of data and indenpendence

In the example above, metric is coverage and unit of analysis is query. If use cookie as unit of diversion, the standard error will be higher than that with query as unit of diversion.

1st: unit of analysis: pageviews; unit of diversion: cookie; unit of analysis (event) is “larger” than unit of diversion(cookie). All pageviews of one cookie will belong to one group.

2nd: unit of analysis: cookie; unit of diversion: pageview; unit of diversion (pageview) is too fine-grained. That is, one cookie may have multiple pageviews. So, 2 pageviews of one cookie may belong to different groups.

3rd: two units are consistent.

Choose “population”

Inter-User experiment vs. Intra-User experiment

Inter-User experiment: A/B testing

Intra-User experiment: show different things to same user in different time

Target Population

restrict the population by geo, language, browser; Also good for run multiple testing (not overlap) in company;

In the example below, practical significance d_min is not given. So, we just use statistical significance as 0.

For global, must add all data from New Zealand and data from Other. d_hat-m<0, so it is not significant.

The difference between these two results is due to the difference of d_hat.

Population vs. Cohort

For population, people may exit the experiment, or maybe their IP address can change in experiment. For cohort based on time, it is just people who enter the experiment as the same time. It is harder to analyze since it lose data. Typically, only use cohort when we need user to be stable and established.

Size

In lesson 1, size based on practical significance, statistical significance and sensitivity.

Choice of metric/unit of diversion/population will influence variability, and variability will change size.

  • Quiz:

    1st: Change the unit of diversion to the unit of analysis will decrese the variability (also, SE).

    2nd: Other traffic will dilute the results and increase the variability.

    3rd: often it doesn’t make significant difference. But if there is a difference, varaibility would probably go down because unit of analysis and unit od diversion are same.

  • Targeting the size to specific traffic will cause some problems:

    1. We do not know what results will this traffic lead to. It is hard to predict.
    2. Hard to detect impact.

    Therefore, run a pilot, turn on the experiemnt for a little to see who’s affected, or just use the first day/week to get a better guess.

Duration

  1. what duration?
  2. when to run the experiment?
  3. what fraction of traffic are going to use? - given size, less proportion lead to a longer duration.

Why not run on all traffic: safety. press. run on multiple days for weekdays and weekends/ holidays and normal days. run multiple experiments and tasks.

Risky: may cause big problem if the result is not good enough.

**Learning effect: **

Some users do not like new changes, while other users are excited to see new changes.

Measure user learning:

If there is learning effect, there won’t be a big change right from beginning but increasing over time.