Auto. Make Doomgrad HF Review on 16 January
actions-user committed Jan 16, 2025
1 parent 0415603 commit f1d7e5b
Showing 13 changed files with 809 additions and 1,246 deletions.
91 changes: 91 additions & 0 deletions assets/img_data/2501.08828.json
@@ -0,0 +1,91 @@
[
{
"header": "Abstract",
"images": [
{
"img": "https://arxiv.org/html/2501.08828/x1.png",
"caption": "Figure 1.MMDocIRcomprises 313 lengthy documents across 10 different domains, along with 1,685 questions. For each question, page-level annotations are provided via selected screenshots. Red boundary boxes represent layout-level annotations.",
"position": 157
},
{
"img": "https://arxiv.org/html/2501.08828/x2.png",
"caption": "Figure 2.Area ratio of different modalities (1) in overall and (2) by domains in MMLongBench-Doc benchmark(Ma et al.,2024b). Note that the white spaces, headers, and footers are removed from the area counting.",
"position": 160
}
]
},
{
"header": "1.Introduction",
"images": []
},
{
"header": "2.Dual-Task Retrieval Definition",
"images": []
},
{
"header": "3.MMDocIR: Evaluation Set",
"images": [
{
"img": "https://arxiv.org/html/2501.08828/x3.png",
"caption": "Table 2.Detailed statistics forMMDocIRevaluation set. “#Lay/Page” is the averaging layouts per page, reflecting page’s layout complexity. “%Lay” refers to the area ratio of useful layouts (excluding white spaces, headers, and footers) over entire page.",
"position": 375
},
{
"img": "https://arxiv.org/html/2501.08828/x3.png",
"caption": "Table 4.Document statistics for Training Datasets collected.",
"position": 581
}
]
},
{
"header": "4.MMDocIR: Training Set",
"images": [
{
"img": "https://arxiv.org/html/2501.08828/x3.png",
"caption": "Table 5.Main results for page-level retrieval. “OCR-text” and “VLM-text” refer to converting multi-modal content in the document page using OCR and VLM respectively. “Image” refers to processing document page as screenshot image.",
"position": 744
}
]
},
{
"header": "5.Model Training: DPR-Phi3&Col-Phi3",
"images": []
},
{
"header": "6.Experiment",
"images": [
{
"img": "https://arxiv.org/html/2501.08828/x3.png",
"caption": "Table 6.Main results for layout-level retrieval. “OCR-text” and “VLM-text” refer to converting multi-modal layouts using OCR and VLM respectively. “Pure-Image” and “Hybrid” refer to reading textual layouts in image and text format respectively.",
"position": 1058
},
{
"img": "https://arxiv.org/html/2501.08828/x3.png",
"caption": "(a)Avg word length",
"position": 1466
},
{
"img": "https://arxiv.org/html/2501.08828/x3.png",
"caption": "(a)Avg word length",
"position": 1469
},
{
"img": "https://arxiv.org/html/2501.08828/x4.png",
"caption": "(b)Distribution density of word length",
"position": 1474
}
]
},
{
"header": "7.Related Work",
"images": []
},
{
"header": "8.Conclusion",
"images": []
},
{
"header": "References",
"images": []
}
]
123 changes: 123 additions & 0 deletions assets/img_data/2501.08983.json
@@ -0,0 +1,123 @@
[
{
"header": "Abstract",
"images": []
},
{
"header": "1Introduction",
"images": []
},
{
"header": "2Related Works",
"images": [
{
"img": "https://arxiv.org/html/2501.08983/x1.png",
"caption": "Figure 1:Overview of CityDreamer4D.4D city generation is divided into static and dynamic scene generation, conditioned on𝐋𝐋\\mathbf{L}bold_Land𝐓tsubscript𝐓𝑡\\mathbf{T}_{t}bold_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, produced by Unbounded Layout Generator and Traffic Scenario Generator, respectively. City Background Generator uses𝐋𝐋\\mathbf{L}bold_Lto create background images𝐈^Gsubscript^𝐈𝐺\\mathbf{\\hat{I}}_{G}over^ start_ARG bold_I end_ARG start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPTfor stuff like roads, vegetation, and the sky, while Building Instance Generator renders the buildings{𝐈^Bi}subscript^𝐈subscript𝐵𝑖\\{\\mathbf{\\hat{I}}_{B_{i}}\\}{ over^ start_ARG bold_I end_ARG start_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT }within the city. Using𝐓tsubscript𝐓𝑡\\mathbf{T}_{t}bold_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, Vehicle Instance Generator generates vehicles{𝐈^Vit}superscriptsubscript^𝐈subscript𝑉𝑖𝑡\\{\\mathbf{\\hat{I}}_{V_{i}}^{t}\\}{ over^ start_ARG bold_I end_ARG start_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT }at time stept𝑡titalic_t. Finally, Compositor combines the rendered background, buildings, and vehicles into a unified and coherent image𝐈^Ctsuperscriptsubscript^𝐈𝐶𝑡\\mathbf{\\hat{I}}_{C}^{t}over^ start_ARG bold_I end_ARG start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. “Gen.”, “Mod.“, “Cond.”, “BG.”, “BLDG.”, and “VEH.” denote “Generation”, “Modulation”, “Condition”, “Background”, “Building”, and “Vehicle”, respectively.",
"position": 212
}
]
},
{
"header": "3Method",
"images": []
},
{
"header": "4Datasets",
"images": [
{
"img": "https://arxiv.org/html/2501.08983/x2.png",
"caption": "Figure 2:Overview of the OSM and GoogleEarth Datasets.(a) Examples of the 2D and 3D annotations in the GoogleEarth dataset, which can be automatically generated using the OSM dataset. (b) The automatic annotation pipeline can be readily adapted for worldwide cities. (c) The dataset statistics highlight the diverse perspectives in the GoogleEarth dataset.",
"position": 1020
},
{
"img": "https://arxiv.org/html/2501.08983/x3.png",
"caption": "Figure 3:Overview of the CityTopia Dataset.(a) The virtual city generation pipeline. “Pro.Inst.”, “Sur.Spl”, and “3D Inst. Anno.” denote “Prototype Instantiation”, “Surface Sampling”, and “3D Instance Annotation”, respectively. (b) Examples of 2D and 3D annotations in the CityTopia dataset are shown from both daytime and nighttime street-view and aerial-view perspectives, automatically generated during virtual city generation. (c) The dataset statistics highlight the diverse perspectives in both street and aerial views.",
"position": 1023
},
{
"img": "https://arxiv.org/html/2501.08983/x4.png",
"caption": "Figure 4:Qualitative Comparison on Google Earth.For SceneDreamer[7]and CityDreamer4D, vehicles are generated using models trained on CityTopia due to the lack of semantic annotations for vehicles in Google Earth. For DimensionX[107], the initial frame is provided by CityDreamer4D. The visual results of InfiniCity[26], provided by the authors, have been zoomed in for better viewing. “Pers.Nature” stands for “PersistentNature”[105].",
"position": 1485
},
{
"img": "https://arxiv.org/html/2501.08983/x5.png",
"caption": "Figure 5:Qualitative Comparison on CityTopia.The initial frame for DimensionX and the input frames for DreamScene4D are chosen from the dataset. “Pers.Nature” refers to “PersistentNature”[105].",
"position": 1488
}
]
},
{
"header": "5Experiments",
"images": [
{
"img": "https://arxiv.org/html/2501.08983/x6.png",
"caption": "Figure 7:Qualitative Comparison of City Layout Generators.The height map values are normalized to a range of[0,1]01[0,1][ 0 , 1 ]by dividing each value by the maximum value within the map.",
"position": 1594
},
{
"img": "https://arxiv.org/html/2501.08983/x7.png",
"caption": "Figure 8:Qualitative Comparison of Building Instance Generator (BIG) Variants.(a) and (b) illustrate the effects of removing BIG and instance labels, respectively. (c)–(f) present the results of various scene parameterizations. Note that “Enc.” is an abbreviation for “Encoder”.",
"position": 1796
},
{
"img": "https://arxiv.org/html/2501.08983/x8.png",
"caption": "Figure 9:Qualitative Comparison of Vehicle Instance Generator (VIG) Variants.(a) and (b) illustrate the effects of removing VIG and canonicalization, respectively. (c)–(f) present the results of various scene parameterizations. Note that “Enc.” is an abbreviation for “Encoder”.",
"position": 1912
},
{
"img": "https://arxiv.org/html/2501.08983/x9.png",
"caption": "Figure 10:Localized Editing on the Generated Cities.(a) and (c) show vehicle editing results, while (b) and (d) present building editing results.",
"position": 1937
},
{
"img": "https://arxiv.org/html/2501.08983/x10.png",
"caption": "Figure 11:Text-driven City Stylization with ControlNet.The multi-view consistency is preserved in stylized Minecraft and Cyberpunk cities.",
"position": 1940
},
{
"img": "https://arxiv.org/html/2501.08983/x11.png",
"caption": "Figure 12:COLMAP Reconstruction of 600-frame Orbital Videos.The red ring shows the camera positions, and the clear point clouds demonstrate CityDreamer4D’s consistent rendering. Note that ”Recon.” stands for ”Reconstruction.”",
"position": 1967
},
{
"img": "https://arxiv.org/html/2501.08983/x12.png",
"caption": "Figure 13:Directional Light Relighting Effect.(a) and (b) show the lighting intensity. (c) illustrates the relighting effect. Note that “S.M.” denotes “Shadow Mapping”.",
"position": 1970
},
{
"img": "https://arxiv.org/html/2501.08983/x13.png",
"caption": "Figure 14:Night-view Generation Results.Despite achieving realistic effects, managing global illumination in the generated scenes remains a challenge.",
"position": 1973
}
]
},
{
"header": "6Conclusion",
"images": [
{
"img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/haozhe-xie.jpg",
"caption": "",
"position": 2831
},
{
"img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/zhaoxi-chen.jpg",
"caption": "",
"position": 2847
},
{
"img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/fangzhou-hong.jpg",
"caption": "",
"position": 2864
},
{
"img": "https://arxiv.org/html/2501.08983/extracted/6115531/authors/ziwei-liu.jpg",
"caption": "",
"position": 2880
}
]
},
{
"header": "References",
"images": []
}
]
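Both JSON files above follow the same img_data schema: a top-level array of section objects, each carrying a "header" string and an "images" list whose entries hold an "img" URL, a "caption", and a numeric "position". A minimal Python sketch of how such a file could be loaded and inspected is below; the local path and the summarize helper are illustrative assumptions, not part of this repository.

```python
import json
from pathlib import Path


def load_img_data(path):
    """Load an img_data JSON file: a list of sections, each with a
    'header' string and an 'images' list of {'img', 'caption', 'position'}."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(sections):
    """Print each section header with its image count, URLs, and positions."""
    for section in sections:
        images = section.get("images", [])
        print(f"{section['header']}: {len(images)} image(s)")
        for image in images:
            print(f"  - {image['img']} (position {image['position']})")


if __name__ == "__main__":
    # Hypothetical local path mirroring the repository layout.
    summarize(load_img_data("assets/img_data/2501.08828.json"))
```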