OCR with Image Translation
Instantly recognize and translate text directly from any visual source.

eyepop.structured-OCR.translate-image:1.0.0
Prompt
You are given a single image that may contain text in any language.
Your task is to read ALL legible text in the image and translate it into English.
Return ONLY valid JSON.
Do not include explanation.
Do not include markdown.
Do not include...
...Run the full prompt in your EyePop.ai dashboard
Input
Image
Output
JSON
Image size
512x512 - Small
Model type
EyePop.ai VLM
How It Works
In a globalized world, being able to understand different languages is incredibly important. Language barriers can prevent users from understanding products, signs, instructions, and much more, and manually retyping text in an unfamiliar script is often impractical or error-prone. Being able to instantly recognize and translate text directly from a visual source is therefore vital for accessibility. The Structured OCR task on the Abilities tab can detect non-English text within an image and provide an accurate English translation.
For example, if a user uploads an image of a Spanish street sign, the model should identify the Spanish text and output the English translation: "Revolution Street Downtown Area".

In contrast, if the image contains text that is unreadable, partially obscured, or too blurry to distinguish individual character strokes, the model may return nothing or an inaccurate translation.
The expected inputs are images containing foreign-language text, and the expected output is structured text, in this example a JSON document containing the extracted and translated text.
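As a rough illustration, the structured output can be consumed like any other JSON document. The field names below are hypothetical, not the ability's actual schema; the shape of the JSON depends on how you structure the prompt:

```python
import json

# Hypothetical response for the street-sign example above; the actual
# field names depend on the JSON schema you describe in the prompt.
raw_output = (
    '{"detected_language": "es", '
    '"original_text": "Calle Revolucion Centro", '
    '"translated_text": "Revolution Street Downtown Area"}'
)

result = json.loads(raw_output)
print(result["translated_text"])  # Revolution Street Downtown Area
```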
SDK Tutorial
First, let’s define the ability. Get early access to Abilities here >
ability_prototypes = [
    VlmAbilityCreate(
        name=f"{NAMESPACE_PREFIX}.structured-OCR.translate-image",
        description="Translate the text on the image into English",
        worker_release="qwen3-instruct",
        text_prompt=translation_prompt,
        transform_into=TransformInto(),
        config=InferRuntimeConfig(
            max_new_tokens=250,
            image_size=512
        ),
        is_public=False
    )
]
The prompt we can use here is:
"You are given a single image that may contain text in any language.
Your task is to read ALL legible text in the image and translate it into English.
Return ONLY valid JSON.
Do not include explanation.
Do not include markdown.
Do not include..."
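The ability definition above references a `translation_prompt` variable. A minimal way to define it, using only the portion of the prompt shown here (the full text is available in the EyePop.ai dashboard), is:

```python
# Visible portion of the prompt; the full version is available in the
# EyePop.ai dashboard.
translation_prompt = (
    "You are given a single image that may contain text in any language.\n"
    "Your task is to read ALL legible text in the image and translate it into English.\n"
    "Return ONLY valid JSON.\n"
    "Do not include explanation.\n"
    "Do not include markdown.\n"
)
```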
Next, we can actually create the ability with the following code:
with EyePopSdk.dataEndpoint(api_key=EYEPOP_API_KEY, account_id=EYEPOP_ACCOUNT_ID) as endpoint:
    for ability_prototype in ability_prototypes:
        ability_group = endpoint.create_vlm_ability_group(VlmAbilityGroupCreate(
            name=ability_prototype.name,
            description=ability_prototype.description,
            default_alias_name=ability_prototype.name,
        ))
        ability = endpoint.create_vlm_ability(
            create=ability_prototype,
            vlm_ability_group_uuid=ability_group.uuid,
        )
        ability = endpoint.publish_vlm_ability(
            vlm_ability_uuid=ability.uuid,
            alias_name=ability_prototype.name,
        )
        ability = endpoint.add_vlm_ability_alias(
            vlm_ability_uuid=ability.uuid,
            alias_name=ability_prototype.name,
            tag_name="latest"
        )
        print(f"created ability {ability.uuid} with alias entries {ability.alias_entries}")
That's it! To run the prompt against an image, here is some sample evaluation code:
import json
from pathlib import Path

pop = Pop(components=[
    InferenceComponent(
        ability=f"{NAMESPACE_PREFIX}.structured-OCR.translate-image:latest"
    )
])

with EyePopSdk.workerEndpoint(api_key=EYEPOP_API_KEY) as endpoint:
    endpoint.set_pop(pop)
    sample_img_path = Path("/content/sample_img.jpg")
    job = endpoint.upload(sample_img_path)
    while result := job.predict():
        print(json.dumps(result, indent=2))
    print("Done")
After running the evaluation you can see what the model labelled and compare it to your source of truth. With this, you can improve your prompts and thus improve your accuracy.
Get early access
Want to move faster with visual automation? Request early access to Abilities and get notified as new vision capabilities roll out.