A Search For Easier Video Game Translation Methods

The Motivation

I’ve always wanted to play games that were never ported to the US or translated into my native language, English. I’m also curious to play the games I love without the changes and censorship introduced in their American releases. One of my biggest nostalgia trips is playing Illusion of Gaia for the SNES, but that game is known for a less-than-stellar translation in its American release. That made me want to play and truly understand the original Japanese title, “Gaia Gensouki,” but I don’t understand Japanese well enough to do it myself. Of course, fan translation patches exist for various games (including a reputable one for “Gaia Gensouki”/“Illusion of Gaia”), but what if there were a solution for any game, without needing to run a patch or rely on someone else’s translation work? I also found a decent amount of interest among my friends and related retro gaming Twitch communities. Thus began my search for an easy way to play an untranslated game and have it dynamically translated as I play.

The First Attempt

Fighting the urge to create something from scratch, I started searching for existing solutions to this problem. My only exposure to anything remotely like this was Google Lens through the Google Translate mobile application, which translates content in situ with a beautiful overlay. Another major benefit, as I noticed later when testing other potential solutions, is that its OCR (Optical Character Recognition) is of remarkably high quality. I’ve seen other Twitch streamers mount a phone running the application and point it at their monitor to translate as they play, but from what I saw, their Twitch chat couldn’t see the translation without an additional camera.

There is no desktop version of Google Lens, but I knew I could run it through an Android emulator. I often use the BlueStacks Android emulator to play mobile games on my PC, so I installed the Google Translate application in it and set out to overcome my first obstacle: how do you get an Android emulator to use the output of another application on the host machine as its camera? The emulator did not appear to support this for arbitrary applications unless a virtual camera was set up. My capture card could broadcast my SNES and could also act as a virtual camera, but for some reason the emulator settings did not treat it as a supported device.

After quite a bit of research, I found a ray of hope: BlueStacks support had apparently heard the requests of many a Snapchat user to get desktop Android emulation working with arbitrary video input. The article below gave me enough information to get started with OBS and an OBS Virtual Camera plugin, with the caveat that it requires very specific versions, including an updated instance of BlueStacks:
https://support.bluestacks.com/hc/en-us/articles/10814798704653-How-to-use-OBS-Virtual-Camera-on-BlueStacks-5
(aside: major props to the people at BlueStacks and in their support Discord/ticketing system)

I finally got it all set up and working properly, with my entire OBS scene being sent through the OBS Virtual Camera. My first successful screenshot below shows the general idea of it working (mind my mug in the corner).

This was really cool to finally see work. However, it was not a perfect solution. The biggest problems I found were:

- The Google Lens/Translate UI overlay is always there. This includes the big white circle at the bottom middle, the gradient shading at the top, and the language selection and branding at the top middle. I could live with the back button, flash control, and hamburger menu, as they were not very intrusive.
- Contrary to my expectations, it did not detect changes in real time. Even when I ran around to find new text dialogs, I had to “jolt” the camera view before it realized there was something new worth translating. I guess the primary use of Google Lens is taking photographs, or saying “hey, I’m looking at this menu or sign, please translate it for me in situ.” Requiring a large enough change before refreshing is likely an intended feature to account for camera shake and other environmental factors. Because of this, even outside of a video game context, it does not readily refresh translations without user input: you either press the large white button to trigger a snapshot/refresh, or change what the camera sees enough to cross that threshold. Fellow Twitch streamers said that to get their external phone approach to detect a change, they often had to interact with the phone or wave a hand between the phone and the monitor.
- This process (at least currently) cannot grab the capture card feed directly. It relies on the full OBS scene, and the translated video then appears in its own window with added latency. Routing that output back into the OBS scene creates a circular dependency: OBS can’t show only the translated version, because then there would be no Japanese text left to feed to Google Lens. I’d need OBS to show both the Japanese original and the translated version routed back from the emulator.
- It’s slow because of how many layers of abstraction this setup needs. Gameplay latency was on the order of several seconds instead of milliseconds. Sometimes dialog would advance faster than I could refresh the camera.
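
To make the change-threshold idea from the second problem more concrete, here is a minimal Python sketch of how such a check could work. This is only my guess at the general technique, not how Google Lens actually does it. The function names, the flat list-of-grayscale-pixels frame format, and the threshold value of 10 are all made up for illustration.

```python
# Hypothetical sketch: decide whether a new frame differs enough from the
# previous one to justify re-running OCR and translation.
# Frames are flat lists of grayscale pixel values (0-255); this format and
# the default threshold are assumptions chosen only for illustration.

def mean_abs_diff(prev, curr):
    """Average per-pixel absolute difference between two same-sized frames."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same number of pixels")
    if not prev:
        raise ValueError("frames must not be empty")
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev)


def should_refresh(prev, curr, threshold=10.0):
    """True when the frame changed by at least `threshold` on average."""
    return mean_abs_diff(prev, curr) >= threshold


steady = [100] * 16                  # a tiny 16-pixel "frame"
shaky = [p + 2 for p in steady]      # slight jitter, like camera shake
new_dialog = [100] * 8 + [255] * 8   # half of the frame is replaced

print(should_refresh(steady, shaky))       # False: jitter is ignored
print(should_refresh(steady, new_dialog))  # True: big change triggers a refresh
```

A threshold like this would explain why a small dialog box changing inside an otherwise static game screen might not register, while waving a hand in front of the camera does.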

Here is how this could look in OBS, in a few variations after moving and cropping sources:

For comparison with the most recent OBS image, here is what it looks like on the emulator side:

At least we can fix most of the UI issues in the final OBS picture. Unfortunately, the performance and interaction/UX issues are janky enough to be a bother, and they would be a real problem if the game were played on stream.

Overall, Google Lens/Translate is not the ideal solution. I strongly believed that something better was out there: a purpose-built desktop application that could be used to achieve my goal. If not, maybe reinventing the wheel would be the only true way to go, albeit not an ideal one.

Looking For a Better Solution

So far, I’ve tried out a handful of different desktop applications. I’ve put together a list of expectations and requirements for evaluating potential solutions.

OCR
- Character recognition must be accurate.
- There must also be a way to disable vertical character recognition for Japanese.
Translation
- Quality of translation must be “good enough” for the given characters.
Performance
- Recognition of characters, translation, and changes should be quick and smooth.
- There should be no flickering of translated output.
UI/UX
- The UI should not impose on the viewing/gameplay experience.
- Ideally, the translated output should appear as an overlay over the Japanese text, like Google Lens.
Compatibility
- For now, I am only focusing on applications that can run on a modern version of Windows. I know there are many people out there looking for a *nix solution, but I wanted to limit the scope of my research for this initial foray into the project.

Two notable desktop applications that checked most of the boxes above are:

DeskTranslate (link to website)
- A decent application overall. It relies on Tesseract for OCR. You can switch which online translation engine it uses, but Google Translate seems to be the best option (so far). The UI and the screen reading/translation are both pretty fast. The one downside is that the translated text appears in a separate window rather than as an overlay on top of the text it is translating.
MORT (link to Korean introduction article) (link to GitHub repository)
- This is quite a stellar application. You can choose which OCR to use as well as which translation service. It even supports offline dictionaries for translation, if you need that kind of thing. Performance is pretty fast, though it depends on which OCR and translation service you use. MORT requires a bit more setup and understanding of how it works overall, but this is not necessarily a bad thing for power users. There are also options for how the translated text appears, with both an overlay method and a separate window option. The availability of the overlay methods seems to depend on which OCR you choose and how compatible its implementation is. I was also able to reach out to the application’s principal developer, and the support was very high quality. The documentation and primary language support are Korean, but the pages can be translated by the browser, and any language can be worked with if you configure the application accordingly.

I think they both have potential, but I put the bulk of my effort into MORT because it already supports a text overlay similar to Google Lens. DeskTranslate checked a lot of boxes, and its roadmap says the developers eventually want to look into overlays, but I wanted something that would take less time and effort to turn into an ideal solution.

One obstacle arose while testing with my hardware SNES and capture card. Apparently the captured video contained graphical noise, imperceptible to the human eye, that caused the OCRs to rapidly detect changes when nothing had actually changed, which resulted in an unpleasant rapid flicker of translations. To isolate the cause, I had to test with an emulator. With no noise there was no flicker, so I will likely have to circle back to the capture card approach later with some kind of video smoothing filter.
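
As a rough idea of what “some kind of video smoothing filter” could mean, here is a minimal Python sketch of one option: an exponential moving average applied per pixel across frames. I have not tried this against the capture card yet. The function names, the list-of-pixels frame format, and the alpha value of 0.2 are assumptions for illustration, and neither MORT nor OBS is involved here.

```python
# Hypothetical sketch: temporal smoothing with an exponential moving average
# (EMA). Each output pixel blends the newest frame with the running average,
# so small frame-to-frame noise is damped before it reaches the OCR.
# Frames are flat lists of grayscale values; alpha is an assumed tuning knob.

def smooth_frames(frames, alpha=0.2):
    """Return EMA-smoothed copies of `frames` (lower alpha = more smoothing)."""
    smoothed = None
    result = []
    for frame in frames:
        if smoothed is None:
            # The first frame seeds the running average.
            smoothed = [float(p) for p in frame]
        else:
            smoothed = [alpha * p + (1.0 - alpha) * s
                        for p, s in zip(frame, smoothed)]
        result.append(smoothed)
    return result


def peak_to_peak(values):
    """Size of the swing between the smallest and largest value."""
    return max(values) - min(values)


# One-pixel "video" that flickers between 97 and 103 on every frame.
noisy = [[100 + (3 if i % 2 else -3)] for i in range(20)]
filtered = smooth_frames(noisy)

# Compare the flicker after the filter has had a few frames to settle.
raw_swing = peak_to_peak([f[0] for f in noisy[10:]])
smooth_swing = peak_to_peak([f[0] for f in filtered[10:]])
print(raw_swing, smooth_swing)
```

The trade-off is lag: the same averaging that hides noise also makes real changes, like a new dialog box, fade in over several frames, so alpha would need tuning against how quickly the game’s text appears.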

Looking forward, I will be spending more time learning how to configure MORT into an optimal solution. I also plan to look into training Tesseract to better understand video game fonts, since the OCR would sometimes fail to recognize a character or confuse it with a similar-looking one. Of the available OCR options, Tesseract seemed to be the most open and trainable, which is why I’m likely to go in that direction. My next post will go into more depth on MORT, Tesseract, and training the model specifically for video game fonts.

Thanks for reading! If you’re interested in sharing ideas or talking more about this, please feel free to join my Discord, where I’ve created a channel specifically for discussion and news about my journey toward a real-time translation solution for retro video games!