PDF formula recognition issues #2339

moro0v0 · 2025-03-08T06:43:10Z

moro0v0
Mar 8, 2025

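We need to process a large number of academic documents that contain many mathematical formulas. The tool already recognizes many of them correctly, but quite a few formulas are still recognized incorrectly. I have two questions: 1. Is it possible to improve formula recognition accuracy by modifying the formula recognition model? 2. If not, can formulas be extracted as images instead? I am not very familiar with the project or the technology it uses, so any guidance would be appreciated.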

moro0v0 · 2025-03-08T07:44:49Z

moro0v0
Mar 8, 2025
Author

In cut_image.py I also added a branch to ocr_cut_image_and_table that saves interline equations as images:

    def ocr_cut_image_and_table(spans, page, page_id, pdf_bytes_md5, imageWriter):
        def return_path(type):
            return join_path(pdf_bytes_md5, type)

        for span in spans:
            span_type = span['type']
            if span_type == ContentType.Image:
                if not check_img_bbox(span['bbox']) or not imageWriter:
                    continue
                span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('images'),
                                               imageWriter=imageWriter)
            elif span_type == ContentType.Table:
                if not check_img_bbox(span['bbox']) or not imageWriter:
                    continue
                span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('tables'),
                                               imageWriter=imageWriter)
            # Added: crop interline equations and save them as images as well
            elif span_type == ContentType.InterlineEquation:
                if not check_img_bbox(span['bbox']) or not imageWriter:
                    continue
                span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('interline_equation'),
                                               imageWriter=imageWriter)
        return spans

With this code for saving interline equations, will they be saved in the folder named images, just like ordinary images? And will a link to them then appear in the md file?
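The three branches above differ only in the subdirectory name, so the same dispatch can be written as a lookup table. The following is a minimal standalone sketch of that idea, not the project's code: `SUBDIR_BY_TYPE`, `cut_spans`, `fake_cut` and `valid` are hypothetical names, and the string constants are assumed stand-ins for the `ContentType` values.

```python
# Assumed stand-ins for ContentType.Image / Table / InterlineEquation.
IMAGE, TABLE, INTERLINE_EQUATION = "image", "table", "interline_equation"

# Span type -> subdirectory passed to the cropping function.
SUBDIR_BY_TYPE = {
    IMAGE: "images",
    TABLE: "tables",
    INTERLINE_EQUATION: "interline_equation",
}


def cut_spans(spans, cut, is_valid_bbox, writer):
    """Attach an image_path to every span whose type has a configured subdirectory."""
    for span in spans:
        subdir = SUBDIR_BY_TYPE.get(span["type"])
        # Skip unknown types, missing writers and degenerate boxes.
        if subdir is None or not writer or not is_valid_bbox(span["bbox"]):
            continue
        span["image_path"] = cut(span["bbox"], subdir)
    return spans


def fake_cut(bbox, subdir):
    """Hypothetical cropper: only builds a file name, writes nothing."""
    return f"{subdir}/{'_'.join(str(int(v)) for v in bbox)}.jpg"


def valid(bbox):
    """Hypothetical bbox check: the box must have positive width and height."""
    return bbox[0] < bbox[2] and bbox[1] < bbox[3]


spans = [
    {"type": INTERLINE_EQUATION, "bbox": [10, 20, 110, 60]},
    {"type": "text", "bbox": [0, 0, 5, 5]},   # no subdirectory configured
    {"type": IMAGE, "bbox": [5, 5, 5, 9]},    # zero width, skipped
]
result = cut_spans(spans, fake_cut, valid, writer=object())
```

Only the first span receives an `image_path` here. Where the file finally lands depends on how the real `cut_image` and the writer combine the path, which this sketch does not model.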

0 replies

moro0v0 · 2025-03-08T09:15:34Z

moro0v0
Mar 8, 2025
Author

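I can now see the interline equation image files in the images folder, but after a full run the interline equations are still output as recognized formulas. If I want them to appear as links, the way ordinary images do, where should I make the change?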

0 replies

moro0v0 · 2025-03-08T09:48:45Z

moro0v0
Mar 8, 2025
Author

In ocr_mkcontent.py, inside ocr_mk_markdown_with_para_core_v2, how should I change the elif para_type == BlockType.InterlineEquation: branch so that the formula image is placed at that position? This is what I tried:

    elif para_type == BlockType.InterlineEquation:
        # para_text = merge_para_with_text(para_block)
        para_text += "被我找到了吧！"  # debug marker ("Found it!")
        # for block in para_block['blocks']:  # 1st: assemble image_body
        #     if block['type'] == BlockType.ImageBody:
        #         for line in block['lines']:
        #             for span in line['spans']:
        #                 if span['type'] == ContentType.Image:
        #                     if span.get('image_path', ''):
        #                         para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"

Changing it directly like this produces an error saying there is no 'blocks' field.
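One possible direction, sketched under assumptions rather than taken from the project: the missing 'blocks' error suggests that an interline equation block carries its 'lines' directly, without the nested 'blocks' list that image blocks have. If that holds, the loop can walk `para_block['lines']` and emit a link for each equation span that has an `image_path`. The function name `equation_block_to_image_md`, the constant value, and the sample dict below are hypothetical.

```python
import posixpath

# Assumed value of ContentType.InterlineEquation.
INTERLINE_EQUATION = "interline_equation"


def equation_block_to_image_md(para_block, img_bucket_path):
    """Return markdown image links for equation spans that carry an image_path."""
    parts = []
    # Assumption: equation blocks hold 'lines' directly, with no nested 'blocks'.
    for line in para_block.get("lines", []):
        for span in line.get("spans", []):
            if span.get("type") != INTERLINE_EQUATION:
                continue
            image_path = span.get("image_path", "")
            if image_path:
                link = posixpath.join(img_bucket_path, image_path)
                parts.append(f"![]({link})")
    return "\n".join(parts)


# Hypothetical block shaped like the assumption above.
block = {
    "type": INTERLINE_EQUATION,
    "lines": [
        {"spans": [{"type": INTERLINE_EQUATION,
                    "content": "E=mc^2",
                    "image_path": "abc.jpg"}]}
    ],
}
print(equation_block_to_image_md(block, "images"))  # prints ![](images/abc.jpg)
```

Inside the real branch the returned string would be appended to `para_text` in place of the merged LaTeX text. Spans without an `image_path` yield nothing, so a fallback to the recognized text may still be wanted.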

0 replies

jonny4589 · 2025-03-11T03:10:56Z

jonny4589
Mar 11, 2025

Also, formulas that contain text come out garbled after parsing. Does anyone have a solution for this?

0 replies
