An algorithm which performs well on the dev/test set according to evaluation metrics, but does not satisfy customer requirements (i.e. performs badly when deployed), indicates that we are missing the right target data in our dataset. In this situation we need to make changes to our dataset since it is not representative enough for the target application. Consider classification of cat images. If the train/dev/test set is using high resolution, good quality images (perfectly posed cats), while the target application is looking at the images which have cats from different viewpoints or that are in motion (blurry), we can expect our algorithm to perform badly when deployed.