Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, Lora Aroyo

When people worry about AI, their concerns often center on robots taking over the world. As real as these concerns may seem, they are far-fetched at this stage of the field's development. This paper [1] highlights the circumstances the world should actually worry about. The authors state that all AI research and implementation depends heavily on the data that models learn from, yet data work is not given the importance it deserves. Through their research, the authors identified data cascades prevalent in current AI practice and research, discussing their triggers, signals, and impacts, as well as the downstream effects of incomplete or incorrect data. The shortcomings they surfaced were supplemented with reasons for their presence, such as lack of awareness, lack of resources, over-reliance on the generalizability of existing datasets, and the implications of these issues. Through these, they established the need for education about and awareness of data jobs in the field, while highlighting the need to incentivize them. Sambasivan et al. point out that everyone in computer science wants to be involved in glamorous AI model development work.

The research methods used in this study are a literature review and qualitative data collection through semi-structured interviews. The sampling methods were purposive sampling and snowball sampling. The sample was drawn from India, the United States, and African countries. This choice is a small step toward the decolonization of HCI research, covering high-income and low-income countries alike. Moreover, the interviewees' roles spanned academia and industry, and domains such as health and wellness, food, environment, finance, and education. As mentioned in the paper, the gender ratio was heavily skewed toward males but reflected the industry's disparity, highlighting the underrepresentation of women in STEM, especially in AI. Although the sample represents different sectors, it still lacks the population engaged in the "dirty" data work: the people hired to collect data, sometimes from scratch. Their perspectives would have enriched the dataset, since their work is crucial. The research team was diverse as well, with backgrounds in HCI, AI, human computation, and data quality research; their experience gave the study a socio-technical perspective.

The research is novel. HCAI is an emerging domain, and this study offers an interesting perspective on the under-valuation and de-glamorization of a segment of the field. The title is justified: it quotes a participant and does not overstate the work.

By commenting that "[h]umans are relatively uncharted waters for AI researchers," the paper also makes the case for HCAI in interfaces, measurement, incentives, and education, rather than in AI alone. I found particularly interesting the choice not to blame cascades on malpractice or algorithmic inconsistencies, but on the basic human tendency to follow incentives and the absence of incentives for data work. It is simple yet unobvious from a purely AI standpoint.

The authors' research maintains a balance between AI and humans' inclusion as developers and influencers in its scenario. More could have been said about the impact of cascades on end users. The paper indirectly raises questions of accountability in case of cascades by naming the primary stakeholders and influencers, such as governments and field partners. Overall, the discussion section of the paper is well worded: it clearly summarizes the issue in this fragile domain, why it arises, its impact, and what should be done first to prevent mishaps in the future. The suggested changes, though qualitative, are measurable in the long run by comparing the number of people dedicated to data work for high-stakes AI systems.

Along with education about data collection, and informing the people involved about the requirements and possible impact of the AI model they are collecting data for, the need of the hour is for regulatory data collection policies to advance hand in hand with these efforts. In a world where big data keeps growing without corresponding attention to quality, the research can be extended to explore what happens once all the required, correct data has been collected: how will it be selected, refined, and used? The complications that arise at that stage would be an exciting find, and this extension will be useful in the near future.

References

  1. Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora Aroyo. 2021. "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21).