How to resolve misplaced text and other visual aspects of a PDF manipulated using redactions and insert_htmlbox. #3906
-
I don't have exhaustive advice, but here are a few separate comments:
-
Thanks a lot for your reply.
I now have two main problems -
By the way, I cannot fathom what I would have done without this amazing package! Thanks again for maintaining it and being so proactive!
-
Hello again.
But now I am struggling with the font-size.
Without styling code -
With styling code -
What I would like the outcome to be -
-
I am trying to use PyMuPDF to help translate a foreign-language PDF document into an English-language PDF document while maintaining the formatting as best I can. This involves maintaining the text location, font color, font styling (bold, italics), annotations (strikethrough, underline), hyperlinks, images, tables, etc.
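To make the styling part of this goal concrete, here is a minimal sketch (not the code used for the outputs in this post) of how one span from `page.get_text("dict")` could be turned into styled HTML for `insert_htmlbox`. It assumes the span dictionary layout documented by PyMuPDF (`color` as an sRGB integer, `size` in points, `flags` with bit value 2 for italic and 16 for bold). The helper name `span_to_html` is hypothetical.

```python
from html import escape

# Bit values of a span's "flags" field, as documented by PyMuPDF.
FLAG_ITALIC = 2
FLAG_BOLD = 16


def span_to_html(span: dict, translated_text: str) -> str:
    """Wrap translated text in a <span> that mimics the source span's style.

    `span` is assumed to look like one entry of
    page.get_text("dict")["blocks"][i]["lines"][j]["spans"].
    """
    color = span.get("color", 0)   # sRGB integer, e.g. 0xFF0000 for red
    flags = span.get("flags", 0)
    size = span.get("size", 11)    # font size in points

    styles = [f"color:#{color:06x}", f"font-size:{size:g}pt"]
    if flags & FLAG_BOLD:
        styles.append("font-weight:bold")
    if flags & FLAG_ITALIC:
        styles.append("font-style:italic")

    # Escape the text so characters such as "&" or "<" do not break the HTML.
    return f'<span style="{";".join(styles)}">{escape(translated_text)}</span>'
```

The resulting string could then be passed to `page.insert_htmlbox(rect, html)` once the original text has been redacted. Note that `insert_htmlbox` may scale the content down to fit the rectangle, so the rendered size can end up smaller than the `font-size` requested here.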
I am working with this example Chinese document -
CHINESE.pdf
Here are the output translations -
Option 1 -
CHINESE_OPTION_1_custom_redactions_with_dict.pdf
Option 2 -
CHINESE_OPTION_2_translated_custom_redactions_with_blocks.pdf
Issues -
1. For some reason the text "ff" is missing throughout the PDF (for example in "Office" and "Affairs"), and I am unsure why this is happening. The text is still searchable: searching for "Affairs" highlights it. Here's the screenshot - [screenshot: ff text missing from Office and Affairs]
2. The final PDF is not consistent between Option 1 and Option 2. I like the Option 2 output for the Chinese document in question, but the same logic fails when the text is spaced out within a line, for example in a table. I would like to achieve the Option 2 output in Option 1 (since Option 1 lets me gather font, size, and other information, and performs well for other languages). How can I achieve this?
   1. Table output for Option 1, which is good - [screenshot: table output option 1]
   2. Table output for Option 2, which is bad - [screenshot: table output option 2]
   3. Chinese document text misplaced in Option 1 - [screenshot: chinese doc text misplaced using option 1]
   4. Chinese document text looks good in Option 2 - [screenshot: chinese doc text correct using option 2]
3. Hyperlink underlines are too long. Is there a way to resolve this as well? I think it happens for two reasons: the font used in the output is not the same as in the input, and the translated text is not necessarily the same size as the input text. [screenshot: hyperlinks underline being long]
References -
Here's my example code -
Any feedback would be appreciated.