Today in our ‘blc stories’ it’s all about machine translation (MT) – we’ve all heard about it, of course. But ever since neural MT significantly improved the quality of machine translation, it has been spreading like this unspeakable virus. And if you don’t know how to use MT properly, it might even end badly.
I love this machine!
“Have you seen this site?” my friend asks and hands me her phone. “It’s crazy! You can upload whole files and get translation results immediately. And it’s really good!” I take a quick look at the screen. “Yep, I know them. They’re doing machine translation. Formerly statistical, now neural.” “Anyway,” my friend continues unperturbed, “we use it in the company now as well. We’ve already put all our brochures in there and the result was surprisingly good. For a machine like this.”
What do they want with our data?
I’m so surprised I choke on my ice cream. “You do know – uh – that your data could end up on the Internet forever? I hope it wasn’t something unreleased?” She shrugs her shoulders and pats me on the back. “Nope, don’t think so. But I mean, what could be the problem with our brochures?” I am still coughing. “Making them freely available on the World Wide Web, for example… anyone could steal your product labels – uh – before you put them out there.” My friend keeps frowning and patting.
Engine with your own terminology?
“Anyway,” she continues, when my face is restored to a healthier color, “we were thrilled with the results. It’s just that sometimes our brand names were missing.” “Did you train it with your own data or did you just use the ‘naked’ MT system?” I ask. Two question marks stare at me from her eyes. I try explaining from another angle:
“I mean, did you simply upload your text and click on ‘Translate’ or did you first create your own engine with other texts of yours?” She shakes her head vigorously: “Used it just the way it was.” – “If you had trained it beforehand, the system could have learned the terms from your texts. But that requires a lot of data and training.” My friend raises an eyebrow. “And then the system would know all by itself what our products are called?”
Machine translation is like a memory game
“Well, basically, yes.” I go for a longer explanation. “But it would have to be taught how to do it with training, re-training, tests and tuning. It’s basically like a memory game: In the beginning everything is a black box. You don’t know what is on the other side of the cards. Then you start the game, you make mistakes and learn in each round by remembering what is underneath each card. After a couple of rounds you turn over a card…” – “… and know which other card it belongs to,” my friend interrupts and claps her hands. “Bingo!”
Great differences in quality for machine translation
I can literally see the wheels turning in her head. “But tell me, how come sometimes the results are really good in one language – English for example is awesome! – and in others not at all? Does the machine treat languages differently?” I tap on her mobile phone, pointing to the overview of available languages. “This is due to the language data that is available.”
“If the system has a lot of training data available, it can remember where all the cards are much faster. It sees them more often, so to speak. If it has less data, it may only see cards once or twice. Then it will miss them more often. There is so much English data available on the Internet that these engines can be trained much better than, for example, engines for Ukrainian, which is digitally underrepresented. In addition, when two languages belong to the same language family and share a similar structure, such as English and German, this supports the learning process – these languages can benefit from each other.” My friend’s face turns into an impressed emoji. “Wow!!”
Machines for everything
“But I can’t just build my own… What’s it called again?” – “Engine.” – “Right, build my own engine, can I?” I take the phone out of her hand and scroll down. “Here, see? Custom engine. You’d have to get in touch with them and then they’d offer you an empty system to fill with your own data, or they’d even fill it for you. But hold your horses,” I add, because my friend has already opened the contact form.
First: Think about what you really need!
“There is a huge number of providers who have developed MT systems. You have to check whether one of them already has suitable material for your subject area – your domain. Then you wouldn’t need quite so much of your own data. And think about how it should be integrated into your IT infrastructure!”
“And, of course, there is also data safety and security to consider: You waved it off earlier, but your boss might be very interested to know where certain company data ends up on the Internet. So take your time when deciding on a system!” To protect her, I press ‘power’ and the screen of her mobile goes blank.
“My head is spinning,” she complains. I laugh and put an arm around her shoulders. “Don’t worry! We have experts for all this!”