Commit f1d7e5b (1 parent: 0415603)
Auto. Make Doomgrad HF Review on 16 January
Showing 13 changed files with 809 additions and 1,246 deletions.
@@ -0,0 +1,91 @@
[
  {
    "header": "Abstract",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08828/x1.png",
        "caption": "Figure 1. MMDocIR comprises 313 lengthy documents across 10 different domains, along with 1,685 questions. For each question, page-level annotations are provided via selected screenshots. Red boundary boxes represent layout-level annotations.",
        "position": 157
      },
      {
        "img": "https://arxiv.org/html/2501.08828/x2.png",
        "caption": "Figure 2. Area ratio of different modalities (1) overall and (2) by domain in the MMLongBench-Doc benchmark (Ma et al., 2024b). Note that white spaces, headers, and footers are removed from the area counting.",
        "position": 160
      }
    ]
  },
  {
    "header": "1. Introduction",
    "images": []
  },
  {
    "header": "2. Dual-Task Retrieval Definition",
    "images": []
  },
  {
    "header": "3. MMDocIR: Evaluation Set",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08828/x3.png",
        "caption": "Table 2. Detailed statistics for the MMDocIR evaluation set. “#Lay/Page” is the average number of layouts per page, reflecting the page’s layout complexity. “%Lay” refers to the area ratio of useful layouts (excluding white spaces, headers, and footers) over the entire page.",
        "position": 375
      },
      {
        "img": "https://arxiv.org/html/2501.08828/x3.png",
        "caption": "Table 4. Document statistics for the collected training datasets.",
        "position": 581
      }
    ]
  },
  {
    "header": "4. MMDocIR: Training Set",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08828/x3.png",
        "caption": "Table 5. Main results for page-level retrieval. “OCR-text” and “VLM-text” refer to converting the multi-modal content of a document page using OCR and VLM, respectively. “Image” refers to processing the document page as a screenshot image.",
        "position": 744
      }
    ]
  },
  {
    "header": "5. Model Training: DPR-Phi3 & Col-Phi3",
    "images": []
  },
  {
    "header": "6. Experiment",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08828/x3.png",
        "caption": "Table 6. Main results for layout-level retrieval. “OCR-text” and “VLM-text” refer to converting multi-modal layouts using OCR and VLM, respectively. “Pure-Image” and “Hybrid” refer to reading textual layouts in image and text format, respectively.",
        "position": 1058
      },
      {
        "img": "https://arxiv.org/html/2501.08828/x3.png",
        "caption": "(a) Avg word length",
        "position": 1466
      },
      {
        "img": "https://arxiv.org/html/2501.08828/x3.png",
        "caption": "(a) Avg word length",
        "position": 1469
      },
      {
        "img": "https://arxiv.org/html/2501.08828/x4.png",
        "caption": "(b) Distribution density of word length",
        "position": 1474
      }
    ]
  },
  {
    "header": "7. Related Work",
    "images": []
  },
  {
    "header": "8. Conclusion",
    "images": []
  },
  {
    "header": "References",
    "images": []
  }
]
@@ -0,0 +1,123 @@
[
  {
    "header": "Abstract",
    "images": []
  },
  {
    "header": "1 Introduction",
    "images": []
  },
  {
    "header": "2 Related Works",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08983/x1.png",
        "caption": "Figure 1: Overview of CityDreamer4D. 4D city generation is divided into static and dynamic scene generation, conditioned on $\\mathbf{L}$ and $\\mathbf{T}_{t}$, produced by the Unbounded Layout Generator and Traffic Scenario Generator, respectively. City Background Generator uses $\\mathbf{L}$ to create background images $\\hat{\\mathbf{I}}_{G}$ for stuff like roads, vegetation, and the sky, while Building Instance Generator renders the buildings $\\{\\hat{\\mathbf{I}}_{B_{i}}\\}$ within the city. Using $\\mathbf{T}_{t}$, Vehicle Instance Generator generates vehicles $\\{\\hat{\\mathbf{I}}_{V_{i}}^{t}\\}$ at time step $t$. Finally, Compositor combines the rendered background, buildings, and vehicles into a unified and coherent image $\\hat{\\mathbf{I}}_{C}^{t}$. “Gen.”, “Mod.”, “Cond.”, “BG.”, “BLDG.”, and “VEH.” denote “Generation”, “Modulation”, “Condition”, “Background”, “Building”, and “Vehicle”, respectively.",
        "position": 212
      }
    ]
  },
  {
    "header": "3 Method",
    "images": []
  },
  {
    "header": "4 Datasets",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08983/x2.png",
        "caption": "Figure 2: Overview of the OSM and GoogleEarth Datasets. (a) Examples of the 2D and 3D annotations in the GoogleEarth dataset, which can be automatically generated using the OSM dataset. (b) The automatic annotation pipeline can be readily adapted for worldwide cities. (c) The dataset statistics highlight the diverse perspectives in the GoogleEarth dataset.",
        "position": 1020
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x3.png",
        "caption": "Figure 3: Overview of the CityTopia Dataset. (a) The virtual city generation pipeline. “Pro. Inst.”, “Sur. Spl.”, and “3D Inst. Anno.” denote “Prototype Instantiation”, “Surface Sampling”, and “3D Instance Annotation”, respectively. (b) Examples of 2D and 3D annotations in the CityTopia dataset are shown from both daytime and nighttime street-view and aerial-view perspectives, automatically generated during virtual city generation. (c) The dataset statistics highlight the diverse perspectives in both street and aerial views.",
        "position": 1023
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x4.png",
        "caption": "Figure 4: Qualitative Comparison on Google Earth. For SceneDreamer [7] and CityDreamer4D, vehicles are generated using models trained on CityTopia due to the lack of semantic annotations for vehicles in Google Earth. For DimensionX [107], the initial frame is provided by CityDreamer4D. The visual results of InfiniCity [26], provided by the authors, have been zoomed in for better viewing. “Pers. Nature” stands for “PersistentNature” [105].",
        "position": 1485
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x5.png",
        "caption": "Figure 5: Qualitative Comparison on CityTopia. The initial frame for DimensionX and the input frames for DreamScene4D are chosen from the dataset. “Pers. Nature” refers to “PersistentNature” [105].",
        "position": 1488
      }
    ]
  },
  {
    "header": "5 Experiments",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08983/x6.png",
        "caption": "Figure 7: Qualitative Comparison of City Layout Generators. The height map values are normalized to a range of $[0,1]$ by dividing each value by the maximum value within the map.",
        "position": 1594
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x7.png",
        "caption": "Figure 8: Qualitative Comparison of Building Instance Generator (BIG) Variants. (a) and (b) illustrate the effects of removing BIG and instance labels, respectively. (c)–(f) present the results of various scene parameterizations. Note that “Enc.” is an abbreviation for “Encoder”.",
        "position": 1796
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x8.png",
        "caption": "Figure 9: Qualitative Comparison of Vehicle Instance Generator (VIG) Variants. (a) and (b) illustrate the effects of removing VIG and canonicalization, respectively. (c)–(f) present the results of various scene parameterizations. Note that “Enc.” is an abbreviation for “Encoder”.",
        "position": 1912
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x9.png",
        "caption": "Figure 10: Localized Editing on the Generated Cities. (a) and (c) show vehicle editing results, while (b) and (d) present building editing results.",
        "position": 1937
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x10.png",
        "caption": "Figure 11: Text-driven City Stylization with ControlNet. The multi-view consistency is preserved in stylized Minecraft and Cyberpunk cities.",
        "position": 1940
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x11.png",
        "caption": "Figure 12: COLMAP Reconstruction of 600-frame Orbital Videos. The red ring shows the camera positions, and the clear point clouds demonstrate CityDreamer4D’s consistent rendering. Note that “Recon.” stands for “Reconstruction”.",
        "position": 1967
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x12.png",
        "caption": "Figure 13: Directional Light Relighting Effect. (a) and (b) show the lighting intensity. (c) illustrates the relighting effect. Note that “S.M.” denotes “Shadow Mapping”.",
        "position": 1970
      },
      {
        "img": "https://arxiv.org/html/2501.08983/x13.png",
        "caption": "Figure 14: Night-view Generation Results. Despite achieving realistic effects, managing global illumination in the generated scenes remains a challenge.",
        "position": 1973
      }
    ]
  },
  {
    "header": "6 Conclusion",
    "images": [
      {
        "img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/haozhe-xie.jpg",
        "caption": "",
        "position": 2831
      },
      {
        "img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/zhaoxi-chen.jpg",
        "caption": "",
        "position": 2847
      },
      {
        "img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/fangzhou-hong.jpg",
        "caption": "",
        "position": 2864
      },
      {
        "img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/ziwei-liu.jpg",
        "caption": "",
        "position": 2880
      }
    ]
  },
  {
    "header": "References",
    "images": []
  }
]
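Both files share the same schema: a list of sections, each with a "header" string and an "images" list whose entries carry an "img" URL, a "caption", and a numeric "position". As a minimal sketch of consuming that structure (the sample data below is abbreviated from the file above; field semantics are assumed from the data, not documented anywhere), the per-section image counts can be derived like this:

```python
import json

# Abbreviated sample in the same schema as the generated review files.
sample = """
[
  {"header": "Abstract",
   "images": [{"img": "https://arxiv.org/html/2501.08828/x1.png",
               "caption": "Figure 1. MMDocIR comprises 313 lengthy documents...",
               "position": 157}]},
  {"header": "1. Introduction", "images": []}
]
"""

sections = json.loads(sample)

# Map each section header to its number of attached images.
counts = {s["header"]: len(s["images"]) for s in sections}
print(counts)  # {'Abstract': 1, '1. Introduction': 0}
```

Validating that every entry has the three expected image fields before rendering would catch the empty-caption author photos seen in the second file.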