
OCR with Image Translation

Instantly recognize and translate text directly from any visual source.


eyepop.structured-OCR.translate-image:1.0.0

Prompt

You are given a single image that may contain text in any language.

Your task is to read ALL legible text in the image and translate it into English.

Return ONLY valid JSON.

Do not include explanation.

Do not include markdown.

Do not include...

The prompt is truncated here. Run the full prompt in your EyePop.ai dashboard.


Input: Image

Output: JSON

Image size: 512x512 (Small)

Model type: EyePop.ai VLM

How It Works

In a globalized world, understanding text in other languages is incredibly important. Language barriers can keep users from understanding products, signs, instructions, and much more, and manually retyping foreign text for translation is often impractical or error-prone. Being able to instantly recognize and translate text directly from a visual source is therefore vital for accessibility. The Structured OCR task on the Abilities tab can detect non-English text within an image and provide an accurate English translation.

For example, if a user uploads an image of a Spanish street sign, the model should identify the Spanish text and output the English translation: "Revolution Street Downtown Area".


In contrast, if the text in the image is unreadable, partially obscured, or too blurry to distinguish individual character strokes, the model may return nothing or produce an inaccurate translation.

Our expected inputs are images containing foreign-language text, and the expected output is structured text, a JSON document in this example, containing the extracted and translated text.
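For illustration, the result for the street-sign example above might look like the following. The exact schema comes from the full prompt, which is truncated in this excerpt, so the field names and the original Spanish text here are hypothetical:

{
  "detected_language": "Spanish",
  "original_text": "Calle Revolución Zona Centro",
  "translated_text": "Revolution Street Downtown Area"
}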

SDK Tutorial


First, let’s define the ability.

# VlmAbilityCreate, TransformInto, and InferRuntimeConfig come from the EyePop SDK;
# NAMESPACE_PREFIX and translation_prompt are defined earlier in your setup.
ability_prototypes = [
    VlmAbilityCreate(
        name=f"{NAMESPACE_PREFIX}.structured-OCR.translate-image",
        description="Translate the text on the image into English",
        worker_release="qwen3-instruct",
        text_prompt=translation_prompt,
        transform_into=TransformInto(),
        config=InferRuntimeConfig(
            max_new_tokens=250,
            image_size=512
        ),
        is_public=False
    )
]
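Here, max_new_tokens=250 caps how much JSON the model can generate per image, image_size=512 matches the 512x512 (Small) input size listed above, and is_public=False keeps the ability private rather than publicly listed.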

The prompt we can use here is:

"You are given a single image that may contain text in any language.

Your task is to read ALL legible text in the image and translate it into English.

Return ONLY valid JSON.

Do not include explanation.

Do not include markdown.

Do not include..."
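In code, this text can live in the translation_prompt variable referenced by the ability definition above. A minimal sketch, keeping the truncation from this excerpt (paste the full prompt from your dashboard):

# `translation_prompt` as referenced in the VlmAbilityCreate above.
# Truncated here to match the excerpt; use the full prompt from the dashboard.
translation_prompt = """You are given a single image that may contain text in any language.

Your task is to read ALL legible text in the image and translate it into English.

Return ONLY valid JSON.
Do not include explanation.
Do not include markdown.
Do not include..."""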

Next, we can create the ability with the following code:

with EyePopSdk.dataEndpoint(api_key=EYEPOP_API_KEY, account_id=EYEPOP_ACCOUNT_ID) as endpoint:
    for ability_prototype in ability_prototypes:
        # Create a group to hold all versions of this ability.
        ability_group = endpoint.create_vlm_ability_group(VlmAbilityGroupCreate(
            name=ability_prototype.name,
            description=ability_prototype.description,
            default_alias_name=ability_prototype.name,
        ))
        # Create the ability itself inside the group.
        ability = endpoint.create_vlm_ability(
            create=ability_prototype,
            vlm_ability_group_uuid=ability_group.uuid,
        )
        # Publish it under its alias...
        ability = endpoint.publish_vlm_ability(
            vlm_ability_uuid=ability.uuid,
            alias_name=ability_prototype.name,
        )
        # ...and tag the published version as "latest".
        ability = endpoint.add_vlm_ability_alias(
            vlm_ability_uuid=ability.uuid,
            alias_name=ability_prototype.name,
            tag_name="latest"
        )
        print(f"created ability {ability.uuid} with alias entries {ability.alias_entries}")

That’s it! To run the prompt against an image, here is some sample evaluation code:

import json
from pathlib import Path

# Pop and InferenceComponent come from the EyePop SDK's worker types.
pop = Pop(components=[
    InferenceComponent(
        ability=f"{NAMESPACE_PREFIX}.structured-OCR.translate-image:latest"
    )
])

with EyePopSdk.workerEndpoint(api_key=EYEPOP_API_KEY) as endpoint:
    endpoint.set_pop(pop)
    sample_img_path = Path("/content/sample_img.jpg")
    job = endpoint.upload(sample_img_path)
    while result := job.predict():
        print(json.dumps(result, indent=2))

print("Done")

After running the evaluation, you can see what the model labelled and compare it to your source of truth. With this feedback, you can refine your prompts and thus improve your accuracy.
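As a minimal sketch of that comparison, assuming the translated text appears somewhere in the result JSON (the expected string and helper below are hypothetical, not part of the SDK):

import json

# Hypothetical hand-labeled ground truth for the sample image.
expected_translation = "Revolution Street Downtown Area"

def contains_expected(result: dict, expected: str) -> bool:
    """Naive check: does the expected English text appear anywhere in the result?"""
    return expected.lower() in json.dumps(result).lower()

You could call contains_expected(result, expected_translation) inside the evaluation loop above and flag mismatches for prompt tuning.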

Get early access

Want to move faster with visual automation? Request early access to Abilities and get notified as new vision capabilities roll out.
