A wee thought on machine learning and data provenance.
This morning on Twitter I read a rather amusing story.
Almost every article on Scots Wikipedia is written by one American teenager, who does not speak Scots and is just writing English in an "accent".
— Robyn Speer (@r_speer) August 25, 2020
If you have a multilingual language model, this fakery might be your _entire training data_ for Scots https://t.co/Rc1wkA0S2P
This would go some way, but not all the way, towards explaining this (BBC voice recognition comedy).
In all seriousness though, if you are looking at using machine learning / AI from an HR tech vendor, you need to ask a lot more questions about the provenance of the data. How exactly did they get the data? How have they cleaned it? What assumptions have they made about data quality? How will the data be augmented over time? Oh, and do mention our friend GDPR.