Boost.XML

Name : Gopi Krishna Menon

College : Dronacharya College of Engineering, Gurgaon, Haryana

Course : Computer Science Engineering

Degree Program : B.TECH (Bachelor of Technology)

Availability :

How much time do you plan to spend on your GSOC?

I like to stay punctual and plan my day each morning 🙂. For GSOC, I intend to spend an average of 36 hours per week (6 hours * 6 days) during the 3 months.

Effective Timings :

Weekdays:

6:00pm - 12:00am : Monday - Friday

Weekends:

10:00am - 2:00pm : Morning Adventures (Coding)

4:00pm - 5:00pm : Evening Adventures (Coding)

7:00pm - 9:00pm : Night Adventures (Coding)

What are your intended start and end dates?

I can start working on the project from 18th May onwards. Initially, during the community bonding period, I will require about 2 weeks to research the existing APIs of different libraries and design the interface for both the XML reader and writer. I intend to achieve all the proposed milestones by August 10, 2021.

What other factors affect your availability ?

During the GSOC period, I don't have any major engagements that would affect my availability. Only in July would I require 2-3 days for submitting my project report in college.

Background Information

I am a senior at Dronacharya College of Engineering, Gurgaon, Haryana. Some of the courses that I have pursued to date are:

Object Oriented Programming in C++
Principles of Operating System
Digital System Design
Digital Electronics
Theory of Computation
Core Java
Discrete Mathematics
Computer Architecture and Organization
Data Structures and Algorithms in C++
Microprocessors and Interfacing
Computer Graphics
Compiler Design
Software Project Management
Software Testing

Academic Performance

3rd Semester :
- Total Marks Secured : 926/1150
- University Rank : 16
- College Rank : 2
4th Semester :
- Total Marks Secured : 936/1150
- University Rank : 12
- College Rank : 1
5th Semester :
- Total Marks Secured : 986/1150
- University Rank : 2
- College Rank : 2
6th Semester :
- Total Marks Secured : 1009/1150
- University Rank : 2
- College Rank : 2

I also secured 100/100 Marks in Computer Science in Class 12 CBSE Board Exams. I completed my high school education from St.Thomas Sr.Sec School , Bahadurgarh, Haryana.

Internship/Jobs/Courses Audited

Google Summer of Code 2020 : For the development of FITS parser in Boost.Astronomy library
Student Faculty Status at Dronacharya College of Engineering : I conducted a 6 week course on Advanced C++ for the students of all semesters and received a testimonial for selfless contribution and outstanding performance during the course period.
Programming in C++ : An 8 week course conducted by National Programme on Technology Enhanced Learning. I also placed in the top 5% category.

My Articles

Structured Binding in C++ : https://www.geeksforgeeks.org/structured-binding-c/

Attributes in C++ : https://www.geeksforgeeks.org/attributes-in-c/

std::any class in C++ : https://www.geeksforgeeks.org/stdany-class-in-c/

Programming Interests

C++ has been my daily driver for all my needs. I am quite comfortable working and experimenting with C++. Apart from C++, I also love working in C#, Python, Rust and Visual Basic.

Reason for Contributing to Boost Libraries

I am a C++ fanboy and one of the reasons is simply Boost libraries. Here are some of the reasons why I love Boost Libraries

Exceptional Code Quality
Purely Open Source
Flexible/Adaptable design
Gold Class Industry Standard
Influenced many of the libraries and features in the standard

Boost has been one of the main influences on the development of modern C++, and it would be my honour and pride to become a part of developing this beautiful language (C++) and its ecosystem.

Have you done any previous work in this area before or on similar projects?

Yes. One of my early projects was developing a pseudo programming language called GML. It is a push parser (I didn't know that back then). I started that language as an experiment to avoid the massive amount of typing required by markup languages such as SGML and XML. Eventually, however, I understood that many of XML's decisions follow from its design, which has a lot of advantages. Apart from that, I also wrote a FITS parser for the Boost.Astronomy project last year, which we intend to submit for review by the end of this year.

What are your plans beyond this Summer of Code time frame for your proposed work?

I believe that Boost.XML is a very vast project. There are several aspects of this library and during the GSOC period, we are tackling just one of the aspects of this project.

At the end of GSOC, I believe that we will have a low-level parser that will act as the foundation for all the features and tools that will be added/proposed later to the Boost.XML project.

After completing GSoC, I would fine-tune the existing low-level parser and continue my research on the further development of the Boost.XML project and propose the designs and prototypes to my mentor for features such as (example)

Validating XML Processor
DOM support
Improving the handling of Unicode characters

Please rate, from 0 to 5 (0 being no experience, 5 being expert), your knowledge of the following languages, technologies, or tools:

C++ 98/03 (traditional C++) : 4
C++ 11-20 (modern C++) : 3
C++ Standard Library : 4
Boost C++ Libraries : 3
Git : 4

What software development environments are you most familiar with (Visual Studio, Eclipse, KDevelop, etc.)?

Windows Platform : Visual Studio, Visual Studio Code, CLion

Linux Platform : Visual Studio Code

Since 6th grade I have been a big fan of Visual Studio, but this year I decided to migrate my development environment to Visual Studio Code. In VS Code I usually develop inside a container (to avoid dependency problems, unnecessary environment setup, managing multiple versions, etc.). Scaffolding a project is nowadays very simple thanks to containerized development: just create a folder, open it in a container that uses my custom cpp-base image, and boom, I have a new dev environment with all the base packages, libraries, tools, etc. (Boost, vcpkg, ...)

What software documentation tool are you most familiar with (Doxygen, DocBook, Quickbook, etc.)

I am quite familiar with Doxygen, Sphinx, etc. I love building the documentation with Doxygen and then using Sphinx with Exhale to generate beautiful documentation.

Earlier I used only Doxygen with mosra's m.css framework to generate beautiful and clean documentation, but it was quite hard to work with. This is why I moved on to Sphinx with Exhale for generating the project's documentation.

Project Proposal

In the current design of Boost.XML, the xml namespace consists of two main classes

reader
writer

xml::reader: As the name itself suggests reader class is used for parsing the XML document. The parser in turn generates the lower-level events which can be then utilized either by the application or by the user for different purposes.

xml::writer: The writer class is used for generating an XML document.

Given below is a brief overview of both the reader and writer class of Boost.XML. The information for each of the two classes is presented under 4 different sections dealing with the design and implementation details of the class.

xml::reader

Overview:

As mentioned above xml::reader is a class used for parsing the XML documents to generate the lower-level events, which can then be used by the Application or user for different purposes ( preparing DOM tree, performing operations on a particular element, etc).

Along with generating the lower-level events, xml::reader also provides the user/application with a bunch of convenient methods that can be used for querying different types of data (based on current event) associated with the markup.

Examples:

Querying attribute values on reading a start tag
Querying the namespace URI on reading the start tag
Querying the prefix of the tag on reading the start tag in an element
Querying the entity reference or character reference

Diagrammatic Representation of XML::Reader

Design Goals:

Internal Design and Working of Reader

The XML reader will be a simple handwritten Recursive Descent Parser that has been modified according to our design. Inside reader, almost all of the events are represented in the form of mini automatons that generate events such as START DOCUMENT and necessary metadata such as attribute list, namespace URI, prefix, etc.

Jumping from one automaton to another can only occur once the final state is reached within an automaton. If the automaton halts or ends up in a dead state, an ERROR event will be flagged along with an error string that tells what kind of error occurred.
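To make the idea of a mini automaton concrete, here is a minimal sketch for a single construct, the XML comment. It is only an illustration under stated assumptions: the names automaton_result, comment_match and match_comment are hypothetical and not part of any proposed or existing Boost interface. It shows the three outcomes discussed above: reaching the final state, reaching a dead state, and running out of input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical outcome of one mini automaton (illustrative names only).
enum class automaton_result { accepted, dead, need_more_input };

struct comment_match {
    automaton_result result;
    std::string text;      // comment body, meaningful only when accepted
    std::size_t consumed;  // characters consumed, meaningful only when accepted
};

// Mini automaton for XML 1.0 production [15] Comment: '<!--' body '-->',
// where the body must not contain "--".
inline comment_match match_comment(std::string_view in) {
    constexpr std::string_view open = "<!--";
    const std::size_t n = std::min(in.size(), open.size());
    if (in.substr(0, n) != open.substr(0, n))
        return {automaton_result::dead, {}, 0};             // not a comment at all
    if (in.size() < open.size())
        return {automaton_result::need_more_input, {}, 0};  // "<", "<!" or "<!-"

    std::string body;
    for (std::size_t i = open.size(); i < in.size(); ++i) {
        if (in[i] == '-' && i + 1 < in.size() && in[i + 1] == '-') {
            if (i + 2 >= in.size())
                return {automaton_result::need_more_input, {}, 0};
            if (in[i + 2] == '>')
                return {automaton_result::accepted, body, i + 3};  // final state
            return {automaton_result::dead, {}, 0};  // "--" inside the body
        }
        body.push_back(in[i]);
    }
    return {automaton_result::need_more_input, {}, 0};  // closing "-->" not seen yet
}

// Usage sketch.
inline void demo_comment_automaton() {
    const comment_match ok = match_comment("<!-- hi -->");
    assert(ok.result == automaton_result::accepted && ok.text == " hi ");
    assert(match_comment("<!-- a -- b -->").result == automaton_result::dead);
    assert(match_comment("<!-- a").result == automaton_result::need_more_input);
}
```

In a real reader the accepted result would turn into a COMMENT event and the dead state into an ERROR event; this sketch simply restarts from the beginning of the input instead of keeping state between calls.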

Now there are three concerns with this Top-Down Approach, each of which has been addressed

Grammar for XML: To write the automaton by hand I will require the grammar for XML so that the automaton can be verified. Fortunately, the XML specification lists out the production for each of the markup and entities. Hence I can simply refer to the XML specification 1.0 for learning about the behavior of a mini automaton

Left Recursive Grammar: Until now I haven't found a production that is directly or indirectly left recursive. Even then as the parser is handwritten it would be fairly easy to remove the left recursion

Left Factoring: In top-down parsers, the complexity increases significantly if the productions are not left factored (for an LL(1) parser, choosing between such alternatives becomes impossible). Fortunately, we can 'left factor' the grammar easily by extracting the common prefix and treating it as a separate production altogether. Doing so will keep the code clean, easy to understand, and flexible for future additions. Without understanding the grammar and left factoring the productions, the underlying code has a high chance of turning into one big blob.
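As an illustration of left factoring, consider the markup forms that begin with '<'. Comments, CDATA sections and the document type declaration all share the longer prefix "<!", so that prefix can be consumed once and the decision delegated to a separate function. The sketch below is an assumption-laden example, not the proposed implementation; markup_kind, classify_markup and classify_after_bang are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <string_view>
#include <utility>

enum class markup_kind {
    start_tag, end_tag, processing_instruction,
    comment, cdata, doctype, not_markup, incomplete
};

// Alternatives that share the factored prefix "<!".
inline markup_kind classify_after_bang(std::string_view rest) {
    constexpr std::array<std::pair<std::string_view, markup_kind>, 3> table{{
        {"--", markup_kind::comment},
        {"[CDATA[", markup_kind::cdata},
        {"DOCTYPE", markup_kind::doctype},
    }};
    for (const auto& entry : table)
        if (rest.substr(0, entry.first.size()) == entry.first) return entry.second;
    for (const auto& entry : table)
        if (rest.size() < entry.first.size() &&
            entry.first.substr(0, rest.size()) == rest)
            return markup_kind::incomplete;  // more input could still complete it
    return markup_kind::not_markup;
}

// Alternatives that share the factored prefix "<".
inline markup_kind classify_markup(std::string_view in) {
    if (in.empty()) return markup_kind::incomplete;
    if (in[0] != '<') return markup_kind::not_markup;
    if (in.size() < 2) return markup_kind::incomplete;
    switch (in[1]) {
    case '/': return markup_kind::end_tag;
    case '?': return markup_kind::processing_instruction;
    case '!': return classify_after_bang(in.substr(2));
    default:  return markup_kind::start_tag;  // name validation is left to the tag automaton
    }
}

// Usage sketch.
inline void demo_classify_markup() {
    assert(classify_markup("<!-- x -->") == markup_kind::comment);
    assert(classify_markup("<![CDATA[x]]>") == markup_kind::cdata);
    assert(classify_markup("<!DOC") == markup_kind::incomplete);
    assert(classify_markup("</a>") == markup_kind::end_tag);
}
```

Each function looks only at what follows the prefix it owns, so adding another "<!" construct later means adding one table row rather than touching every branch.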

Input of XML Reader

The XML reader can be thought of as a Turing machine that computes over an unbounded input tape, across which the machine moves in one direction only

We can think of the input as a memory stream which supplies data to the reader. Based on the availability of data in the memory stream, the XML processor processes the input data and generates the tokens.

Error Handling

If the error is recoverable (such as latency in a network stream, or no input currently available in the memory stream), the parser will be able to recover from it, provided the user has taken the necessary action.

If the error is nonrecoverable (an automaton reaching a dead state), the parser must handle it in a draconian fashion, i.e. simply report a fatal error.

Most errors, both recoverable and nonrecoverable, are reported by the internal automatons themselves through ERROR_EVENT and the error_string method.
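The difference between the two kinds of error can be sketched with a deliberately tiny reader that only understands tags of the form <name>. This is a simplified illustration, not the proposed xml::reader: the class tiny_tag_reader, the enum reader_status and the feed() method are assumptions made for the example, while has_error() and error_string() follow the names used in this proposal. Running out of input is recoverable (feed more data and call next() again), whereas malformed input halts the reader for good.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

enum class reader_status { token, need_input, fatal };

class tiny_tag_reader {
public:
    // The user recovers from need_input by supplying more data.
    void feed(std::string_view chunk) { buffer_.append(chunk); }

    reader_status next() {
        if (fatal_) return reader_status::fatal;  // draconian: stay halted
        if (pos_ == buffer_.size()) return reader_status::need_input;
        if (buffer_[pos_] != '<') return fail("expected '<'");
        std::size_t i = pos_ + 1;
        while (i < buffer_.size() &&
               std::isalpha(static_cast<unsigned char>(buffer_[i])))
            ++i;
        if (i == buffer_.size())
            return reader_status::need_input;  // tag not finished; position is kept
        if (buffer_[i] != '>' || i == pos_ + 1)
            return fail("malformed start tag");
        name_ = buffer_.substr(pos_ + 1, i - pos_ - 1);
        pos_ = i + 1;
        return reader_status::token;
    }

    bool has_error() const { return fatal_; }
    const std::string& error_string() const { return error_; }
    const std::string& name() const { return name_; }

private:
    reader_status fail(std::string message) {
        fatal_ = true;
        error_ = std::move(message);
        return reader_status::fatal;
    }

    std::string buffer_;
    std::string name_;
    std::string error_;
    std::size_t pos_ = 0;
    bool fatal_ = false;
};

// Usage sketch.
inline void demo_tiny_tag_reader() {
    tiny_tag_reader r;
    r.feed("<ab");
    assert(r.next() == reader_status::need_input);  // recoverable
    r.feed("c>");
    assert(r.next() == reader_status::token && r.name() == "abc");
    r.feed("<1>");
    assert(r.next() == reader_status::fatal && r.has_error());
    r.feed("<ok>");
    assert(r.next() == reader_status::fatal);  // further input does not revive it
}
```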

Being a library designer one must be quite aware of what the library is and what it is not!

What XML Reader can do?

Generate low level events
Provide metadata about the different markups and text through convenience functions
Translate Character references (optional/feature can be removed if required)

What XML Reader cannot do?

Validate an XML document
Check the well-formedness of the document (This is because making xml reader check well-formedness will introduce both performance and memory problems, also it will violate the SRP).
Translate character references (optional)
Translate entity references
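Character reference translation, listed above as an optional feature, could look roughly like the following. This is a sketch under assumptions: parse_char_ref and to_utf8 are hypothetical helper names, and the output encoding is assumed to be UTF-8. The accepted range follows production [2] (Char) of the XML 1.0 specification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// XML 1.0 production [2] Char.
constexpr bool is_xml_char(std::uint32_t c) {
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

constexpr int digit_value(char ch, unsigned base) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (base == 16 && ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (base == 16 && ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// Parses "&#N;" (decimal) or "&#xH;" (hexadecimal) at the start of `in`.
// Returns the code point, or nullopt if the reference is malformed or
// names a character that XML 1.0 does not allow.
inline std::optional<char32_t> parse_char_ref(std::string_view in) {
    if (in.substr(0, 2) != "&#") return std::nullopt;
    std::size_t i = 2;
    unsigned base = 10;
    if (i < in.size() && in[i] == 'x') { base = 16; ++i; }
    std::uint32_t value = 0;
    std::size_t digits = 0;
    for (; i < in.size() && in[i] != ';'; ++i, ++digits) {
        const int d = digit_value(in[i], base);
        if (d < 0) return std::nullopt;
        value = value * base + static_cast<std::uint32_t>(d);
        if (value > 0x10FFFF) return std::nullopt;  // also prevents overflow
    }
    if (digits == 0 || i == in.size()) return std::nullopt;  // no digits or no ';'
    if (!is_xml_char(value)) return std::nullopt;
    return static_cast<char32_t>(value);
}

// Encodes a code point as UTF-8 (the caller passes a valid XML Char).
inline std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Usage sketch.
inline void demo_char_ref() {
    assert(parse_char_ref("&#65;").value() == U'A');
    assert(to_utf8(parse_char_ref("&#x20AC;").value()) == "\xE2\x82\xAC");
    assert(!parse_char_ref("&#0;").has_value());
}
```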

Basic API Design

Event: Determines what kind of event(action) has occurred in the parser

Event Types: The following events can be generated by the XML reader

EVENT	Description
START_TAG	Parser read a starting tag
END_TAG	Parser read an ending tag
EMPTY_ELEMENT	Parser read an empty element
CHARACTER	Parser read Inner text of a tag
COMMENT	Parser read a comment
ENTITY_REFERENCE	Parser read an entity reference
PROCESSING_INSTRUCTION	Parser read a processing instruction
ERRORED	Parser is halted or has resulted in some fatal error
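One possible way to express the events in the table as a C++ type is a scoped enumeration. The sketch below is only an assumption about how this might look: the enum name event, the lowercase enumerators and the to_string helper are hypothetical and the final names are subject to the API design phase.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical event type mirroring the table above.
enum class event {
    start_tag,
    end_tag,
    empty_element,
    character,
    comment,
    entity_reference,
    processing_instruction,
    errored
};

// Maps each event to the spelling used in the table, e.g. for logging.
constexpr std::string_view to_string(event e) noexcept {
    switch (e) {
    case event::start_tag:              return "START_TAG";
    case event::end_tag:                return "END_TAG";
    case event::empty_element:          return "EMPTY_ELEMENT";
    case event::character:              return "CHARACTER";
    case event::comment:                return "COMMENT";
    case event::entity_reference:       return "ENTITY_REFERENCE";
    case event::processing_instruction: return "PROCESSING_INSTRUCTION";
    case event::errored:                return "ERRORED";
    }
    return "UNKNOWN";
}

// Usage sketch.
inline void demo_event_names() {
    assert(to_string(event::start_tag) == "START_TAG");
    assert(to_string(event::errored) == "ERRORED");
}
```

A scoped enum keeps the event names out of the enclosing namespace and lets the compiler warn when a switch over events misses a case.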

XML reader also allows the user to query metadata through several convenience methods. The value returned by these convenience methods is dependent on the current event reported

The list given below represents some of the convenience methods provided by the reader. (The querying interface is similar to the CopperSpice XML API, which is derived from Qt; this is done to flatten the learning curve for users already familiar with those libraries.) This list is by no means exhaustive, and the signatures or the methods themselves are subject to change upon further research.

at_end() : Returns true if the parser has reached the end of document or errored.
encoding(): Returns the encoding based on prolog or BOM in the file
xml_version(): Returns the XML version to which the document adheres (1.0, 1.1)
error_string() : Returns the error message associated with the parser
has_error(): Determines if the parser is in halted state/fatal errored
is_document_end(): Indicates if the parser has reached the end of the document
is_entity_reference() : Indicates if the current token is an entity reference
is_character_reference(): Indicates if the current token is a character reference
is_processing_instruction() : Returns true if the current element is a processing instruction
is_standalone_document(): Returns true if the XML document is a standalone document
attributes() : Returns a map of attributes with their values if a start tag has been read
get_processing_instruction() : Returns a pair containing the target and instruction
is_cdata(): Determines if we are inside a CDATA section
cdata_text(): Returns the text inside the current CDATA section ...

Control Methods :

next(): Parses the input and generates the next token (along with the appropriate event)
clear(): Clears the parser state entirely. After calling clear(), the parser is in the same state as if it had just been default constructed
next_start_tag(): Returns the next start-tag

...

Basic Example

xml::reader xml_reader;
// Set up the input stream
while(!xml_reader.is_document_end()){
	xml_reader.next();

	if(xml_reader.is_processing_instruction()){
		auto[target,instruction] = xml_reader.get_processing_instruction();
		fmt::print("Target : {}\n Instr : {}",target,instruction);
	}

}

xml::writer

Overview

xml::writer is a simple streaming API that is used for generating XML documents. It serves as the counterpart of xml::reader and operates on an output buffer that is supplied by the user.

xml::writer provides the user with a collection of methods that can be used for modifying or editing various aspects of an XML document.

Design Goals

I am working on improving the API design for xml::writer and hence would like to present the main design goals in the form of bullet points

Easy to use API for creating XML documents
The API should possess a high level of abstraction, and the user must not be concerned with the pointy-bracket (angle-bracket) notation
The documents generated by the writer must at least be well-formed
The user should be able to connect the writer to a reader so that the reader's current token from a different file can be written out directly.
The writer should be fast and must occupy only a small memory footprint

Error Handling

The errors must be handled in the same way as that of the reader.

What the xml writer can do?

Generate well formed xml documents.
Support substitution of character references.
Provide the user with a higher-level API such that the user is not at all concerned with the pointy-bracket notation (e.g. ending tags)

What the xml writer cannot do?

Cannot produce a valid XML document (currently it lacks the support for DTD)
No support for substitution of entity references (will be given after building up the basic writer)

Basic API design

As mentioned before, xml::writer provides the user with a bunch of methods that allows the user to generate xml document under a high level of abstraction.

Some of the methods are: (This is by no means an exhaustive list, and the signatures or the methods themselves are subject to change in the future)

is_errored() : Returns true if the writer is unable to write into the stream
write_prolog(version,encoding) : Writes the prolog for the XML document with XML version set as version and XML encoding set as encoding
write_start_element(): Writes the start element.
write_attribute(): Takes a key value pair as argument and writes the attribute with its value into the stream.
write_attributes(): Takes pairs of attributes and their values and writes them to the stream for the particular element
write_end_element(): Generates the end element corresponding to the start element
write_end_document(): Closes all the remaining open elements and ends the stream with a newline

....
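How a writer can hide ending tags from the user can be sketched with a stack of open element names. This is a minimal illustration, not the proposed xml::writer: the class name sketch_writer and the choice of a std::string as the user-supplied output buffer are assumptions, and only a few of the methods listed above are modelled. Attribute values are escaped so that the output stays well-formed.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

class sketch_writer {
public:
    // The buffer is owned by the user and must outlive the writer.
    explicit sketch_writer(std::string& out) : out_(out) {}

    void write_start_element(std::string_view name) {
        close_pending_start_tag();
        out_ += '<';
        out_ += name;
        open_.emplace_back(name);
        start_tag_open_ = true;
    }

    void write_attribute(std::string_view key, std::string_view value) {
        assert(start_tag_open_ && "attributes must directly follow a start element");
        out_ += ' ';
        out_ += key;
        out_ += "=\"";
        append_escaped(value);
        out_ += '"';
    }

    // No name needed: the writer remembers which element is open.
    void write_end_element() {
        assert(!open_.empty() && "no element left to close");
        if (start_tag_open_) {
            out_ += "/>";  // nothing was written inside: emit an empty element
            start_tag_open_ = false;
        } else {
            out_ += "</";
            out_ += open_.back();
            out_ += '>';
        }
        open_.pop_back();
    }

    // Closes all remaining open elements and ends with a newline.
    void write_end_document() {
        while (!open_.empty()) write_end_element();
        out_ += '\n';
    }

private:
    void close_pending_start_tag() {
        if (start_tag_open_) { out_ += '>'; start_tag_open_ = false; }
    }

    void append_escaped(std::string_view value) {
        for (char c : value) {
            switch (c) {
            case '&': out_ += "&amp;";  break;
            case '<': out_ += "&lt;";   break;
            case '"': out_ += "&quot;"; break;
            default:  out_ += c;
            }
        }
    }

    std::string& out_;
    std::vector<std::string> open_;  // names of elements not yet closed
    bool start_tag_open_ = false;    // true while "<name ..." still lacks '>'
};

// Usage sketch.
inline void demo_sketch_writer() {
    std::string out;
    sketch_writer w(out);
    w.write_start_element("a");
    w.write_start_element("b");
    w.write_end_document();  // closes b, then a
    assert(out == "<a><b/></a>\n");
}
```

Because the stack always knows what is open, write_end_document() can guarantee balanced tags even if the user forgets some write_end_element() calls.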

Basic Example

xml::writer xml_writer;
xml_writer.set_output_stream(/*set the stream*/);

xml_writer.write_prolog(1.0,"UTF-8");
xml_writer.write_start_element("meow");
xml_writer.write_attribute("arch","meow64");
xml_writer.write_end_element();
xml_writer.write_end_document(); 

// Destructor can also invoke write_end_document()  automatically if required but the buffer lifetime should be greater than the writer object lifetime

PROPOSED MILESTONES AND SCHEDULE

Instead of giving day-to-day activities during the GSOC period (which becomes highly granular), I would like to start at a higher level of planning and flesh it out to a reasonable point

There are three major milestones in the Boost.XML project

API Design Freeze (M1)
Completion of xml::reader (M2)
Completion of xml::writer (M3)

Community Bonding Period: 3 weeks (17 May - 7 June)

During the community bonding period, the first two weeks will be spent researching more about the reader and writer API.

This involves the following activities

Deciding the feature set for both reader and writer with the mentor
Studying the XML specification ( A good source is annotated-xml.com) and noting down and discussing the important points with the mentor
Researching and designing the external/front-facing API based on existing libraries of different ecosystems

At the end of 2 weeks, the first major milestone (M1) should be achieved. This means the API design for both reader and writer will be frozen and we will try our best to not make any major changes to the design in later stages.

This does not mean we are stuck with the design. As the coding starts there will be a lot of details that will unfold themselves. We will make sure to adapt them to the design after an internal review. Achieving the first milestone is simply a way to generate a basic scaffolding for the developer to keep track of what is going on.

During the second and third weeks (overlapping with the above), CI/CD will also be set up to ensure compatibility across different operating systems and compilers. Along with that, both Boost.Build and CMake will be configured for the project so that developers can easily get started with development in the future. If my mentor permits, I would also add a .devcontainer configuration along with a Dockerfile for containerized development (this ensures that a developer working on the code will have no problems setting up dependencies, environment, etc.)

Phase 1: 6 weeks (7 June - 16 July)

The first 4 weeks will be spent writing and debugging the entire XML reader class. The next two weeks will be spent generating benchmarks, finding critical hotspots, and trying to optimize them. The library used for benchmarking the code will be Google Benchmark.

After 6 weeks of development the second major milestone (M2) should be achieved

Phase 1 Evaluations: 5 days (12 July - 16 July)

These five days will be spent answering queries and defending the decisions that will be challenged by the mentor. Based on the result, changes will be made if necessary

Phase 2: 1 Month (16 July - 16 August)

The first 3 weeks will be dedicated to writing and debugging the entire XML writer class. The following week will be spent benchmarking the code against well-known libraries and finding hotspots in the code to reduce memory footprint and improve performance

After 4 weeks of development the third major milestone (M3) should be achieved

Phase 2 Evaluations: 7 days (16 August - 23 August)

Again, as in the Phase 1 evaluations, the 7 days will be spent answering queries and defending decisions about the overall design of the library that are challenged by the mentor. Changes will only be made after thinking critically about their consequences.

Astute readers might have noticed that I have not mentioned unit tests and documentation anywhere in the schedule. This is because unit tests and documentation are not separate activities but an integral part of coding itself. I usually follow the TDD (Test Driven Development) style of development. TDD is especially useful in designing libraries and APIs because it helps the developer see the bigger picture of what each method does. In the traditional style of development, there is a very high chance that we might code and code and eventually perform a lot of hacks in the code, which might reduce the modularity and flexibility of the library
