Why an AI?
We wanted the video artwork to be disseminated via the internet. The most accessible way of reaching people who might be interested in the video was Twitter: we could browse through tweets related to certain themes and reply to some. Spending days searching for tweets to reply to was never an option. The next best thing was to automate the process: retrieve tweets that match certain keywords and reply to them.
Filtering with a keyword search is a very coarse way of identifying potentially interested users. Many of the results were simply not relevant to what we were looking for. Moreover, some tweets were relevant, but we wanted to exclude those with a very inflammatory tone, i.e. vulgarity, racism and/or misogyny.
So we got the idea of developing an AI to mimic the choices a human would make about who to reply to (in this case Gina, as the trainer). We felt it would make an interesting case study of how and why AI is used, and what one can truly expect from it.
What is AI?
AI is not used because of its supposed intelligence. In most cases, AI is simply a productivity tool, used to automate a process. Replying to tweets on a daily basis is doable, but it is time-consuming. For a company, the equation is: it can be done by a human for a certain salary or it can be done for free by an AI.
As with any automation process, there is the question of quality versus cost. As you will see, our AI is not very good at mimicking the trainer. It is, however, much better than randomly choosing who to reply to, which was our initial option. For an analogy closer to the real commercial world, imagine you are reading books on mobile devices, and you want a service that gives you recommendations for new books to read. The service could be based on a few keywords attached to each e-book you've read. If you've purchased a lot of crime novels, then the service will recommend more crime novels.
You could use an AI to do this. One could say that this is so simple that you don't even need an AI for it. The truth is that some AI algorithms are extremely simple. In fact, if you were to take the recommendation services that Netflix was using in its early days (when it was about DVD rental), they were probably using some algorithms that are still at the core of AI today.
For instance, a very simple but effective tool is "Bayesian Inference". Google "Bayesian Inference AI" and you'll find millions of hits. It is named after Thomas Bayes, a famous AI scientist born in... 1701. Well, not really an AI scientist: he was a Presbyterian minister who liked statistics, but his theorem is one of the core principles of AI.
So back in the early 2000s, Netflix was probably already using algorithms that would have fallen into the AI category. They were not using the term AI though, because it would have sounded a bit daft to use such a grand term, "Artificial Intelligence", for something so simple.
What has changed since then? Not the AI algorithms. Most of them have been here for decades (or centuries). So it's mostly two things. Marketing first: today a service based on the 18th-century Bayes' theorem would be shamelessly called "AI". The other thing is the quantity of information that is available, coupled with the computing power to process it.
If we go back to our book recommendation service, another alternative is to actually analyse the books you've read. Parsing through the book, an algorithm will extract the vocabulary used and find your "taste" in books. At a very basic level, a romance will contain the word "love" quite often, while a crime novel will contain the word "murder" more often. Because we not only have the whole book available to analyse (the quantity of information) but also the computing power to analyse it, the analysis can be more refined.
For instance, instead of using a simple one-word vocabulary, the algorithm can look at all the 2-word expressions ("I love guns" is "I love" and "love guns"), or 3-word expressions, or more. To get a feel for the style of writing you like, it can add the length of the sentences, the richness of the vocabulary, the sentiment of the words and how they are used in the same sentence (to identify sarcasm, dark humor, etc). To find out how much you've liked the book, it can look at how many days you've taken to read it, how long per reading session, etc.
So what kind of cutting-edge AI algorithms are used to do the analysis? Well, Bayesian inference works really well for text analysis... So no, if AI is everywhere today, it is really not because AI theory has come up with disruptive algorithms lately.
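The 2-word and 3-word expressions mentioned above are usually called n-grams. As a minimal sketch (the function name ngrams is our own choice for illustration, and splitting on whitespace is a simplifying assumption), they can be extracted like this:

```python
def ngrams(text, n):
    """Return the list of n-word expressions found in `text`, in order."""
    words = text.split()  # naive whitespace split, enough for illustration
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("I love guns", 1))  # ['I', 'love', 'guns']
print(ngrams("I love guns", 2))  # ['I love', 'love guns']
```

A sentence of three words has only one 3-word expression, and a shorter text simply yields an empty list.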
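To give an idea of how simple a Bayes-based text classifier can be, here is a toy sketch of a naive Bayes classifier, one common way of applying Bayes' theorem to text. The training sentences, the genre labels and the function names are all invented for illustration, and this is not the code of any real recommendation service.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, genre) pairs invented for illustration.
TRAINING = [
    ("love kiss love heart", "romance"),
    ("heart love wedding", "romance"),
    ("murder detective knife", "crime"),
    ("murder police murder", "crime"),
]


def train(samples):
    """Count words per genre and number of texts per genre."""
    word_counts = {}
    doc_counts = Counter()
    for text, genre in samples:
        doc_counts[genre] += 1
        word_counts.setdefault(genre, Counter()).update(text.split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the genre with the highest naive Bayes score for `text`."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_genre, best_score = None, -math.inf
    for genre, counts in word_counts.items():
        # log P(genre) + sum of log P(word | genre), with add-one smoothing
        score = math.log(doc_counts[genre] / total_docs)
        total_words = sum(counts.values())
        for word in text.split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre


word_counts, doc_counts = train(TRAINING)
print(classify("love story", word_counts, doc_counts))      # romance
print(classify("murder mystery", word_counts, doc_counts))  # crime
```

The whole "intelligence" is counting how often each word shows up in each genre, which is why the quantity of text available matters so much more than the algorithm.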
The consequences of AI?
Now to finish on the book recommendation, there is a third alternative: you enter your local book shop and talk to the passionate owner, tell him or her what you've read lately and ask for his or her advice. And this is what AI is all about: there are services that humans can provide with the highest quality (bookstore recommendation, or our initial option of replying to each and every tweet). There are AI-based services that give a pretty good alternative (as far as recommendations are concerned) but are quite costly: the processing required for full-book analysis is quite intense and not likely to be done for each and every customer, or at least not for free. And then there is the down-and-dirty AI that is called AI mainly to sound good in a marketing campaign, but that would give a very low-quality recommendation.
Of course, AI is not only about book recommendation. It can, will (and already does) spread to all aspects of our society. Unfortunately, the quality vs cost trade-off will always be there. Many companies will choose to stop providing a human-based service in favour of a possibly lesser-quality AI-based service because that will reduce their cost of sale. For consumer services, the good old rules of capitalism will work in our favour: those providing lesser quality will lose market share to those providing better services.
But what about AI in medical services? Hiring? Getting a loan?
Example: bank loan application
Let's take the example of banking services. At the end of the day, a bank does not really care whether it discriminates in its loan applications. The ONLY thing that matters is to minimize its nonperforming loans (NPL): the loans for which the repayments have not been made for some time. From a purely economic point of view, if, for instance, discriminating against single mums improves its NPL ratio (i.e. reduces the number of bad loans), then it is the right thing to do... for the bank. Clearly not for the single mum who would have had her loan approved if she had not divorced her husband, even if said husband did not earn any income.
So, when a bank hires a firm to develop an AI to automate the loan application process, the metric it is going to use (in other words, what would qualify as a good AI) is its NPL ratio: what kind of NPL ratio would the bank get if it were using the AI as opposed to human staff? If the NPL ratio is good and no one checks the AI for bias, then it might discriminate forever against single mums.
Maybe the ethical AI scientist checks for bias and tells the bank: "I can work some more, it will take some time, maybe more money, but I could remove the bias; unfortunately, the NPL ratio might be a little bit worse off". The bank will have to make a decision: does it delay the launch, possibly at a cost, and lose some money over time on bad loans, or does it just go ahead?
One could say that it's part of the job of the IT company to deliver without bias. And indeed, they can, but it means more complex work to achieve a good NPL ratio, therefore higher cost. If it is not required, then there will be IT companies offering a cheaper quotation with no check for bias.
How many banks will choose the more expensive ethical AI company, knowing that the service might, in fact, have a worse NPL ratio? In other words, if we leave it to the banks, how many will choose to pay more and to earn less but be ethical? How many will choose to save on cost, increase profit and not give a damn about single mums?
Bottom line: AI regulation
So the bottom line is not whether AIs can be biased. They are.
It is not even whether bias can be fixed. It can (at least to some large extent).
It is not whether companies will willingly choose the ethical route. They won't.
The bottom line is that it is not a decision companies should have to make. It wouldn't occur to anyone to let the car industry decide what level of pollution is acceptable. It is obviously a matter of national, or even international, regulation. AI is going to be so omnipresent in our lives that it must be regulated at a national or supra-national level, for the sake of society.
What would AI regulation mean? It would mean, for example, that any sensitive AI service would have to report how the AI was built, and be tested against bias before going live. It would mean that there would not be any cheap discriminating AI services because bias-free AI would be the only legal option in an IT service quotation. It would mean that inspectors could come and test the sensitive AIs on a regular basis, which would require for instance that the bank employs someone who knows the AI, knows how it works and why it's working this way.
So the bottom line is simple: AIs must be built in such a way that they are transparent, accountable and fair, so that society can be protected against possible side effects. If you agree with that statement, I suggest you take a look at our petition demanding that AI be regulated.
A case study: our AI
Our AI is a text analysis tool that aims at classifying tweets into three categories: Reply, Ignore and Not Relevant. Over 4,000 tweets were read during training and a response (reply, ignore, not relevant) was chosen for each tweet. The AI we developed goes through the text of the tweet and tries to find a correlation between the words and the response.
First, the text of the tweet is normalised. It is broken into words (tokens), removing punctuation and stop words. Then it is lemmatised: "works", "worked", "working" become the same token: "work". From this normalised text, we build our own dictionary containing all the words of all the tweets we are processing. What worked the best for us was to use a simple 1-word (uni-gram) dictionary ("I loved guns" is "I", "love" and "gun").
Then the normalised tweets are vectorised: "I loved guns" becomes 1 count of our custom-made dictionary word "I", 1 count of "love" and 1 count of "gun".
There are more elaborate variants that we've tried (using bi-grams, tri-grams, the TF-IDF method instead of a simple count, etc) but the simplest one gave the best results. The reason is that we are trying to do something quite difficult (mimic the trainer's feelings) with very little information (a single tweet has at most 280 characters). Applying complex techniques probably diluted the information per tweet.
Then the vectorised data is fed into an AI algorithm. Our dear Bayes is a very good candidate for text processing. The Complement Naive Bayes algorithm did give a decent result, but better precision was achieved using a Support Vector Machine. And if you like the nitty-gritty details, we used a linear kernel with a One-Vs-Rest classifier.
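The normalisation and vectorisation steps described above can be sketched in a few lines. This is only an illustration, not our actual code: the stop-word list and the lemma table below are tiny hypothetical stand-ins for what a real NLP library would provide, and the function names are made up.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny resources; a real pipeline would rely on a
# proper stop-word list and lemmatiser from an NLP library.
STOP_WORDS = {"the", "a", "an", "of", "to"}
LEMMAS = {"loved": "love", "loves": "love", "guns": "gun",
          "works": "work", "worked": "work", "working": "work"}


def normalise(tweet):
    """Split into tokens, drop punctuation and stop words, then lemmatise."""
    cleaned = tweet.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in cleaned.split() if t.lower() not in STOP_WORDS]
    return [LEMMAS.get(t.lower(), t) for t in tokens]


def build_dictionary(tweets):
    """Sorted list of every distinct token across all normalised tweets."""
    return sorted({token for tweet in tweets for token in normalise(tweet)})


def vectorise(tweet, dictionary):
    """Count how often each dictionary word occurs in the tweet."""
    counts = Counter(normalise(tweet))
    return [counts[word] for word in dictionary]


tweets = ["I loved guns", "She works, he worked!"]
dictionary = build_dictionary(tweets)
print(normalise("I loved guns"))  # ['I', 'love', 'gun']
print(dict(zip(dictionary, vectorise("I loved guns", dictionary))))
```

Each tweet ends up as a list of counts with one position per dictionary word, which is the only thing the classification algorithm ever sees.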
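For the curious, a setup of this kind can be assembled with scikit-learn roughly as follows. This is a sketch under assumptions, not our production code: the six tweets and their labels are invented, the parameters are library defaults, and lemmatisation is left out. Note also that CountVectorizer's default tokeniser ignores one-letter words such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical miniature training set, invented for illustration only.
tweets = [
    "this video about surveillance is fascinating",
    "great art project on privacy and surveillance",
    "surveillance is rubbish and so are you",
    "privacy activists are all idiots",
    "buy cheap cameras for your home today",
    "new phone sale with great camera deals",
]
labels = ["reply", "reply", "ignore", "ignore", "not relevant", "not relevant"]

# Modules are chained one after the other: word counts, then a linear-kernel
# Support Vector Machine wrapped in a One-Vs-Rest classifier.
model = make_pipeline(
    CountVectorizer(),  # uni-gram counts by default
    OneVsRestClassifier(SVC(kernel="linear")),
)
model.fit(tweets, labels)

print(model.predict(["an interesting video on privacy"]))
```

Swapping the last step for sklearn.naive_bayes.ComplementNB() would be one way to try the Bayes variant instead.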
A portion of the data was used to train the AI, while the rest was used to test how well the resulting AI performed: with the testing set, we compare what the AI would respond against the trainer's original response (reply, ignore, not relevant).
The result is not particularly fabulous: the AI chose to reply correctly 72% of the time. Missing good replies (false negative: the AI chose to ignore while originally 'reply' was the course of action) is not a big issue because we would have to limit the number of tweets anyway. The bigger problem was replying to tweets that Gina deemed not relevant, or that she chose to ignore (false positive). This happened 17% of the time.
So the AI was far from being a digital clone of the trainer's brain, but we let it go live and let it tweet because it was still much better than randomly choosing tweets. It certainly could have been better, but we didn't want to spend too much time on it. This is a trade-off many companies will make as well, particularly because you can come up with something of average quality in no time at all. We used an AI library called scikit-learn, which allows you to simply add AI modules one after the other. As a result, the code was only a few dozen lines long and the development took only a few days, including testing many algorithms and parameters.
To give an idea, it took much longer to develop the web environment to create the training set: tweets had to be loaded from Twitter's servers according to certain user-defined queries, then stored, anonymized, rendered, etc. Of the whole IT project, the real AI design was actually the fastest.
This is not to say that AI design is fast and easy. It means that average quality that focuses exclusively on precision can be attained quickly. Good quality that takes bias into consideration requires much more time and expertise. And this is why the down-and-dirty approach (what we did) should never be allowed in anything that matters.
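Counting false positives and false negatives for the 'reply' decision only takes a few lines. The sketch below uses invented responses and a made-up function name. It returns raw counts only, because this article does not spell out which denominators sit behind its percentages.

```python
def reply_outcomes(trainer, predicted):
    """Compare predicted 'reply' decisions with the trainer's original ones."""
    pairs = list(zip(trainer, predicted))
    return {
        # AI replied and so did the trainer
        "true_positive": sum(1 for t, p in pairs if t == "reply" and p == "reply"),
        # AI replied, trainer had chosen 'ignore' or 'not relevant'
        "false_positive": sum(1 for t, p in pairs if t != "reply" and p == "reply"),
        # AI did not reply, trainer had chosen 'reply'
        "false_negative": sum(1 for t, p in pairs if t == "reply" and p != "reply"),
    }


# Hypothetical test-set responses, invented for illustration.
trainer = ["reply", "reply", "ignore", "not relevant", "reply", "ignore"]
predicted = ["reply", "ignore", "reply", "not relevant", "reply", "ignore"]
print(reply_outcomes(trainer, predicted))
# {'true_positive': 2, 'false_positive': 1, 'false_negative': 1}
```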
So what about bias?
We wanted to experiment with bias. The only sensitive information we could get from a tweet is gender, based on the user name. So we "pre-processed" the tweets:
What we found out was that, on average, replies were made to 52% of the tweets sent by men and to 61% of the tweets sent by women. So there was a slight statistical bias towards women.
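Measuring this kind of statistical bias is nothing more than computing a reply rate per group. A minimal sketch, with invented records and a hypothetical function name:

```python
from collections import defaultdict


def reply_rate_by_group(records):
    """records: (group, response) pairs. Returns the share of 'reply' per group."""
    totals = defaultdict(int)
    replies = defaultdict(int)
    for group, response in records:
        totals[group] += 1
        if response == "reply":
            replies[group] += 1
    return {group: replies[group] / totals[group] for group in totals}


# Hypothetical records, invented for illustration.
records = [
    ("man", "reply"), ("man", "ignore"), ("man", "ignore"),
    ("woman", "reply"), ("woman", "reply"), ("woman", "not relevant"),
]
rates = reply_rate_by_group(records)
print({group: round(rate, 2) for group, rate in rates.items()})
# {'man': 0.33, 'woman': 0.67}
```

The same function can be run on the trainer's responses and on the AI's predictions to compare the two.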
It is possible that the trainer had a positive bias towards women and, even though the tweets were anonymised, guessed the author's gender from the text. Identifying an author's gender from text analysis is a popular topic in natural language and AI research. It is also possible that there were simply more women on Twitter who shared her views on the themes she searched for, which is not the same as gender bias. It is also possible that men have more of a tendency to write nonsense on Twitter and were, therefore, more often discarded than women...
So there is not much we can conclude from the bias in the training set.
Now if we look at the testing result, the AI chose to reply to 57% of the men and to 62% of the women. The bias is smaller, but it is still there. We can't really say much about this reduction simply because the AI precision is quite poor.
Following up on this idea, we tried to actually inform the AI about gender. For every tweet sent by a woman, we added the word "thisisawoman" and for every tweet sent by a man, the word "thisisaman". The result is an AI where the bias is greatly increased: now the AI replies to only 55% of men and to 66% of women. The quality, however, has barely changed.
What has happened is that the algorithm has picked up that the word "thisisawoman" (in other words, the gender) appears statistically more often among tweets that are going to be replied to. So if it finds a tweet with that word, it is more likely to reply to it, hence the increase in bias. But in reality, the trainer didn't have a gender bias, and favouring women users does not improve the precision (i.e. quality) of the AI. Definitely not very intelligent...
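The gender-tagging step described above amounts to appending one artificial word to the text before it is vectorised. A minimal sketch (the function name and the gender labels used as keys are assumptions; the two marker words are the ones we used):

```python
# Marker words appended to the tweet text so that gender becomes a "word"
# the classifier can count like any other.
GENDER_TOKENS = {"woman": "thisisawoman", "man": "thisisaman"}


def add_gender_token(text, gender):
    """Append the gender marker, or leave the text unchanged if gender is unknown."""
    token = GENDER_TOKENS.get(gender)
    return f"{text} {token}" if token else text


print(add_gender_token("I loved guns", "woman"))  # I loved guns thisisawoman
print(add_gender_token("I loved guns", None))     # I loved guns
```

Once the marker is in the dictionary, the algorithm treats it exactly like "love" or "gun", which is how a statistical imbalance in the training set turns into a learned rule.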
There are a few things we can learn from this:
That is the tricky bit about bias: if the objective of the AI scientist is to achieve great precision, and today, that is mostly the case, then it is actually often best for him to make use of the bias.
For instance, let's go back to our loan application example. The AI designer has come up with a version that has a relatively good result in terms of nonperforming loans. The gender was available to him but he did not use it as a feature in his AI because it didn't seem relevant. Banks are not supposed to decide on loans based on gender, right? But then he tries a new version by simply adding gender as a feature. And, oh surprise, the precision shoots up.
What happened is that his training and testing data are real loan applications processed by real banking staff over the years. Many of these staff are men, and many of them tend to be more confident that a man would pay back his loan than a woman. They had a gender bias, and the training set shows it. Informing the AI about gender greatly improved the AI precision. It is not that the AI thinks that it is better to lend money to men than to women. In fact, the AI does not know what a man is. But providing the gender information to the AI helps it behave in the same way as the banking staff did.
Note that it does not mean that it is best for the bank to lend more to men than to women. In fact, scientific studies have shown exactly the opposite: if staff in commercial banks were to favour women applicants over men, all other things being equal, then the banks would probably see a reduction in bad loans. The problem is that all the AI knows is the training data it has received. Among the rejected applications in the training set, some would have turned out to be bad loans (as the staff feared), others not. Having a bias against women means that bank staff incorrectly reject good businesswomen candidates more often than they reject good businessmen. But this information is not present in the training data because the loans were rejected and no one knows what would have happened if they had been approved. So if the training data shows that banks have been lending more to men than women, then that is how it should be, from an AI perspective.
Now ask yourself: if it's up to the AI scientist, do you think he/she (most likely he) will remove the gender feature and deliver a lower-performing AI to the bank?
If it's up to the bank, do you think they will choose a lower-performing AI (meaning one that will cost the bank more money in bad loans) because it's bad to discriminate against women?
Without regulation, how likely do you think it is that the "looking down on women when it comes to money" attitude will now be literally hard-coded by a supposedly intelligent and non-discriminating AI?