The Danger of Not Looking Beyond Numbers in Experimentation

Published on May 13, 2024

by Jonas Alves

A/B testing metrics

Numbers often serve as the foundation of decision-making in the nuanced field of data analysis and A/B testing. However, the real challenge, and the real opportunity, lies in interpreting these numbers correctly, particularly by understanding the reasons behind them. This approach helps uncover genuine insights that can drive meaningful improvements, rather than leaving you misled by surface-level data. I hope you find it useful in your experimentation efforts; please feel free to share.

The Misleading Surge in Bookings

At Booking.com, we encountered a situation that highlighted the importance of digging deeper into the data. We implemented a new interface tweak aimed at enhancing the user experience, and the change led to a surprising 20% increase in bookings. Traffic was large enough that we often had sufficient statistical power to make decisions within a few days. Still, for better predictability of future results, and because weekly seasonality was extreme, we always tried to collect two full weeks of data in every experiment. But the initial data was so compelling that we decided to roll the change out fully after just one week.
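
For illustration, here is a minimal sketch in Python of the kind of pre-experiment sizing that motivates such a rule. The baseline conversion rate, target lift, and daily traffic figures are hypothetical rather than Booking.com's actual numbers, and the sample-size formula is the standard two-proportion normal approximation.

from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variant(p_base, rel_lift, alpha=0.05, power=0.80):
    """Two-proportion z-test sample size per variant (normal approximation)."""
    p_new = p_base * (1 + rel_lift)
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p_base + p_new) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))) ** 2
    return ceil(num / (p_new - p_base) ** 2)

n = sample_size_per_variant(p_base=0.03, rel_lift=0.05)  # detect a 5% relative lift
daily_visitors_per_variant = 200_000                     # hypothetical traffic
days_for_power = ceil(n / daily_visitors_per_variant)

# Even if the test is powered within days, round the run length up to whole
# weeks (at least two) so every weekday/weekend cycle is represented equally.
weeks_to_run = max(2, ceil(days_for_power / 7))
print(n, days_for_power, weeks_to_run)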

The Oversight

However, a colleague, Nick Alliwel, made a critical observation the next day. He suspected the decision to roll out the change might have been premature. Nick pointed out that although there was a noticeable increase in bookings, this did not necessarily indicate a better user experience or business outcome. His suspicion was that the new UI made it significantly more difficult for users to understand that multiple rooms could be added to a single booking. 

Unintended User Behaviour

As a result, instead of making a single booking for multiple rooms, users went through the booking process multiple times when they needed more than one room. This redundancy inflated the booking numbers without reflecting any increase in business performance: there were no corresponding increases in revenue or total room nights, which would have been the indicators of real growth. On top of that, the user experience became much worse, since forcing users to enter their details multiple times to achieve a result that previously took a single step was far from an improvement.
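
A guardrail check along the following lines would have surfaced the problem: compare not just bookings per visitor, but also room nights and revenue per visitor, between control and treatment. This is only a sketch; the column names and CSV export are assumptions for illustration, not Booking.com's actual pipeline.

import pandas as pd
from scipy.stats import ttest_ind

def lift_and_pvalue(df, metric):
    """Relative lift of treatment over control and Welch t-test p-value."""
    control = df.loc[df["variant"] == "control", metric]
    treatment = df.loc[df["variant"] == "treatment", metric]
    lift = treatment.mean() / control.mean() - 1
    _, p = ttest_ind(control, treatment, equal_var=False)
    return lift, p

# Assumed layout: one row per visitor with variant, bookings, room_nights, revenue.
visitors = pd.read_csv("experiment_visitors.csv")
for metric in ["bookings", "room_nights", "revenue"]:
    lift, p = lift_and_pvalue(visitors, metric)
    print(f"{metric:12s} lift={lift:+.1%}  p={p:.3f}")
# A lift in bookings with flat room_nights and revenue points to split
# bookings rather than real growth.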

Lessons Learned and Twyman's Law

This experience underscored several key lessons:

  • Surface Metrics Can Deceive: An increase in one performance metric (like bookings) does not necessarily equate to overall business success.

  • User Experience is Paramount: Changes that make essential functions less intuitive can degrade the user experience, leading to inefficient behaviours that might initially appear as positive growth.

  • Always Invoke Twyman's Law: Named after the media researcher Tony Twyman, who stressed the importance of scepticism in data analysis, Twyman's Law suggests that any result that looks surprisingly large or interesting should be questioned by default. Applying it prompts teams to scrutinise extraordinary data more critically before making decisions (a small sketch of such a check follows this list).

  • Deep Dive Into the 'Why': Understanding why changes in metrics occur is crucial. This involves looking beyond the numbers and considering how users interact with the platform, seeking qualitative feedback, and observing the broader implications of data trends.

  • Maintain Rigorous Testing Protocols: Even when early results are promising, maintaining standard testing protocols is essential. This includes waiting for sufficient data to accumulate that reflects typical user behaviour across multiple cycles.
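
As a minimal sketch of the Twyman's-Law check mentioned above, one could flag any lift that falls outside the range comparable changes have historically produced and hold the rollout until it has been investigated. The 10% threshold below is purely an illustrative assumption.

def twyman_check(metric_name, observed_lift, plausible_max_lift=0.10):
    """Flag suspiciously large movements for investigation before rollout."""
    if abs(observed_lift) > plausible_max_lift:  # illustrative threshold
        return (f"SUSPECT: {metric_name} moved {observed_lift:+.0%}; check guardrail "
                "metrics, instrumentation and segment breakdowns before shipping.")
    return f"OK: {metric_name} moved {observed_lift:+.0%}; proceed with normal review."

print(twyman_check("bookings per visitor", 0.20))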

This scenario at Booking.com is a powerful reminder of the importance of context in data analysis. By looking beyond the numbers and seeking to understand the underlying user behaviours and experiences, businesses can make more informed and effective decisions. The story illustrates not only the potential pitfalls of A/B testing but also the strategic value of a more thoughtful, analytical approach to interpreting data and of higher-quality experimentation.
