Name : Gopi Krishna Menon
College : Dronacharya College of Engineering, Gurgaon, Haryana
Course : Computer Science Engineering
Degree Program : B.TECH (Bachelor of Technology)
Email : [email protected]
How much time do you plan to spend on your GSOC?
I like to keep myself punctual and plan my day in the morning itself 🙂. For GSOC, I intend to spend an average of 36 hours per week ( 6 hours * 6 Days) during the 3 months.
Effective Timings :
Weekdays:
6:00pm - 12:00am : Monday - Friday
Weekends:
10:00am - 2:00pm : Morning Adventures (Coding)
4:00pm - 5:00pm : Evening Adventures (Coding)
7:00pm - 9:00pm : Night Adventures (Coding)
What are your intended start and end dates?
I can start working on the project from 18th May onwards. Initially, during the community bonding period, I will require about 2 weeks to research the existing APIs of different libraries and design the interface for both the XML reader and writer. I intend to achieve all the proposed milestones by August 10, 2021.
What other factors affect your availability?
During the GSOC period, I don't have any major engagements that would affect my availability. Only in July would I require 2-3 days for submitting my project report in college.
I am a senior at Dronacharya College of Engineering, Gurgaon, Haryana. Some of the courses that I have pursued to date are:
- Object Oriented Programming in C++
- Principles of Operating System
- Digital System Design
- Digital Electronics
- Theory of Computation
- Core Java
- Discrete Mathematics
- Computer Architecture and Organization
- Data Structures and Algorithms in C++
- Microprocessors and Interfacing
- Computer Graphics
- Compiler Design
- Software Project Management
- Software Testing
Academic Performance
- 3rd Semester :
  - Total Marks Secured : 926/1150
  - University Rank : 16
  - College Rank : 2
- 4th Semester :
  - Total Marks Secured : 936/1150
  - University Rank : 12
  - College Rank : 1
- 5th Semester :
  - Total Marks Secured : 986/1150
  - University Rank : 2
  - College Rank : 2
- 6th Semester :
  - Total Marks Secured : 1009/1150
  - University Rank : 2
  - College Rank : 2
I also secured 100/100 marks in Computer Science in the Class 12 CBSE Board Exams. I completed my high school education at St. Thomas Sr. Sec. School, Bahadurgarh, Haryana.
Internship/Jobs/Courses Audited
- Google Summer of Code 2020 : For the development of FITS parser in Boost.Astronomy library
- Student Faculty Status at Dronacharya College of Engineering : I conducted a 6-week course on Advanced C++ for students of all semesters and received a testimonial for selfless contribution and outstanding performance during the course period.
- Programming in C++ : An 8-week course conducted by the National Programme on Technology Enhanced Learning. I finished in the top 5% category.
My Articles
Structured Binding in C++ : https://www.geeksforgeeks.org/structured-binding-c/
Attributes in C++ : https://www.geeksforgeeks.org/attributes-in-c/
std::any class in C++ : https://www.geeksforgeeks.org/stdany-class-in-c/
Programming Interests
C++ has been the daily driver for all my needs. I am quite comfortable working and experimenting with C++. Apart from C++, I also love working in C#, Python, Rust and Visual Basic.
Reason for Contributing to Boost Libraries
I am a C++ fanboy and one of the reasons is simply Boost libraries. Here are some of the reasons why I love Boost Libraries
- Exceptional Code Quality
- Purely Open Source
- Flexible/Adaptable design
- Gold Class Industry Standard
- Influenced many of the libraries and features in the standard
Boost has been one of the main influences on the development of modern C++, and it would be my honour and pride to become a part of developing this beautiful language (C++) and its ecosystem.
Have you done any previous work in this area before or on similar projects?
Yes. One of my early projects was developing a pseudo programming language called GML. Its parser is a push parser (I didn't know that back then). I started that language as an experiment to avoid the massive amount of typing required by markup languages such as SGML and XML. Eventually, I understood that many decisions in XML follow from its design, which has a lot of advantages. Apart from that, last year I also wrote a FITS parser for the Boost.Astronomy project, which we intend to submit for review by the end of this year.
What are your plans beyond this Summer of Code time frame for your proposed work?
I believe that Boost.XML is a very vast project. There are several aspects of this library and during the GSOC period, we are tackling just one of the aspects of this project.
At the end of GSOC, I believe that we will have a low-level parser that will act as the foundation for all the features and tools that will be added/proposed later to the Boost.XML project.
After completing GSoC, I would fine-tune the existing low-level parser, continue my research on the further development of the Boost.XML project, and propose designs and prototypes to my mentor for features such as:
- Validating XML Processor
- DOM support
- Improving the handling of Unicode characters
Please rate, from 0 to 5 (0 being no experience, 5 being expert), your knowledge of the following languages, technologies, or tools:
- C++ 98/03 (traditional C++) : 4
- C++ 11-20 (modern C++) : 3
- C++ Standard Library : 4
- Boost C++ Libraries : 3
- Git : 4
What software development environments are you most familiar with (Visual Studio, Eclipse, KDevelop, etc.)?
Windows Platform : Visual Studio, Visual Studio Code, CLion
Linux Platform : Visual Studio Code
Since 6th grade I have been a big fan of Visual Studio, but this year I decided to migrate my dev environment to Visual Studio Code. In VS Code I usually develop inside a container (to avoid dependency problems, unnecessary environment setup, managing multiple versions, etc.). Building a project scaffolding is nowadays very simple thanks to containerized development. I just create a folder, open it in a container that uses my custom cpp-base image, and boom, I have a new dev environment with all the base packages, libraries, tools, etc. (Boost, vcpkg, ...).
What software documentation tool are you most familiar with (Doxygen, DocBook, Quickbook, etc.)
I am quite familiar with Doxygen, Sphinx, etc. I love building the documentation with Doxygen and then using Sphinx with Exhale to generate beautiful documentation.
Earlier I only used Doxygen with mosra's m.css framework to generate beautiful and clean documentation, but it was quite hard to work with. This is the reason why I moved on to Sphinx with Exhale for generating project documentation.
In the current scope of Boost.XML, the xml namespace consists of two main classes:
- reader
- writer
xml::reader: As the name suggests, the reader class is used for parsing an XML document. The parser in turn generates lower-level events, which can then be used either by the application or by the user for different purposes.
xml::writer: The writer class is used for generating an XML document.
Given below is a brief overview of both the reader and writer class of Boost.XML. The information for each of the two classes is presented under 4 different sections dealing with the design and implementation details of the class.
Overview:
As mentioned above, xml::reader is a class used for parsing XML documents to generate lower-level events, which can then be used by the application or the user for different purposes (preparing a DOM tree, performing operations on a particular element, etc.).
Along with generating the lower-level events, xml::reader also provides the user/application with a bunch of convenient methods that can be used for querying different types of data (based on current event) associated with the markup.
Examples:
- Querying attribute values on reading a start tag
- Querying the namespace URI on reading the start tag
- Querying the prefix of the tag on reading the start tag in an element
- Querying the entity reference or character reference
Diagrammatic Representation of XML::Reader
Design Goals:
Internal Design and Working of Reader
The XML reader will be a simple handwritten Recursive Descent Parser that has been modified according to our design. Inside reader, almost all of the events are represented in the form of mini automatons that generate events such as START DOCUMENT and necessary metadata such as attribute list, namespace URI, prefix, etc.
Jumping from one automaton to another can only occur once the final state is reached within an automaton. If the automaton halts or ends up in a dead state, an ERROR event will be flagged along with an error string that tells what kind of error occurred.
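
To make the idea of a mini automaton concrete, here is a minimal sketch for a single construct, the XML comment. All names (`event`, `automaton_result`, `run_comment_automaton`) are hypothetical and are not part of the proposed API; the sketch only illustrates how reaching the final state yields an event, while a dead state yields an error together with an error string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical names: an illustration only, not the proposed Boost.XML API.
enum class event { comment, errored };

struct automaton_result {
    event       ev;            // event produced by the automaton
    std::string text;          // comment body when the final state is reached
    std::string error_string;  // reason for the dead state otherwise
    std::size_t consumed;      // number of input characters consumed
};

// Mini automaton for:  Comment ::= '<!--' ... '-->'
// (XML 1.0 also forbids "--" inside the comment body.)
inline automaton_result run_comment_automaton(std::string_view in) {
    constexpr std::string_view open = "<!--";
    if (in.substr(0, open.size()) != open)
        return {event::errored, "", "expected '<!--'", 0};

    std::size_t pos = open.size();
    std::string body;
    while (pos < in.size()) {
        if (in.substr(pos, 2) == "--") {
            // "--" must be followed immediately by '>' to reach the final state.
            if (pos + 2 < in.size() && in[pos + 2] == '>')
                return {event::comment, body, "", pos + 3};
            return {event::errored, "", "'--' is not allowed inside a comment", pos};
        }
        body.push_back(in[pos++]);
    }
    // Input ran out before the final state: dead state.
    return {event::errored, "", "unterminated comment", pos};
}
```

Only when such an automaton reports its final state would the reader hand control to the automaton for the next construct.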
Three potential problems with this top-down approach have been considered and resolved:
Grammar for XML: To write the automatons by hand, I need the grammar for XML so that each automaton can be verified. Fortunately, the XML specification lists the production for each markup construct and entity. Hence I can simply refer to the XML 1.0 specification to learn the expected behavior of each mini automaton.
Left Recursive Grammar: So far I haven't found a production that is directly or indirectly left recursive. Even if one turns up, the parser is handwritten, so it would be fairly easy to remove the left recursion.
Left Factoring: In top-down parsers, complexity increases significantly if the productions are not left factored (for an LL(1) parser, such a grammar cannot be parsed at all). Fortunately, we can left factor the grammar easily by extracting the common prefix and treating it as a separate production. Doing so keeps the code clean, easy to understand, and flexible for future additions. Without understanding the grammar and left factoring the productions, the underlying code has a high chance of turning into one big blob.
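
As an illustration of left factoring, consider the markup constructs that all begin with `<`. The sketch below, which uses hypothetical names (`markup_kind`, `classify_markup`) and is not the proposed implementation, consumes the common prefix once and then decides with a small lookahead which production applies.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical names: an illustration of left factoring, not the proposed API.
enum class markup_kind {
    start_tag,               // also covers empty-element tags such as <a/>
    end_tag,
    processing_instruction,
    comment,
    cdata,
    declaration,             // any other "<!" construct, e.g. <!DOCTYPE ...>
    not_markup
};

inline bool starts_with(std::string_view s, std::string_view prefix) {
    return s.substr(0, prefix.size()) == prefix;
}

inline markup_kind classify_markup(std::string_view in) {
    // The common prefix '<' is handled once, as its own production.
    if (in.empty() || in.front() != '<') return markup_kind::not_markup;
    in.remove_prefix(1);
    if (in.empty()) return markup_kind::not_markup;

    switch (in.front()) {
    case '/': return markup_kind::end_tag;
    case '?': return markup_kind::processing_instruction;
    case '!':
        // A second common prefix, "<!", is factored out the same way.
        in.remove_prefix(1);
        if (starts_with(in, "--")) return markup_kind::comment;
        if (starts_with(in, "[CDATA[")) return markup_kind::cdata;
        return markup_kind::declaration;
    default:
        return markup_kind::start_tag;
    }
}
```

Each branch would then hand over to the automaton for that construct, so no automaton ever has to backtrack over the shared prefix.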
Input of XML Reader
The XML reader can be thought of as a Turing machine that computes on a long (conceptually infinite) input tape, over which the machine moves in one direction only.
We can think of the input as a memory stream which supplies data to the reader. Based on the availability of data in the memory stream, the XML processor processes the input data and generates the tokens.
Error Handling
If the error is recoverable (such as latency in a network stream, or no input currently available in a memory stream), the parser will be able to recover from it, provided the user has performed the necessary action to recover.
If the error is nonrecoverable (an automaton reaching a dead state), the parser must handle the error in a draconian fashion, i.e., simply raise a fatal error.
Most errors, both recoverable and nonrecoverable, are reported by the internal automatons themselves through ERROR_EVENT and the error_string method.
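
The difference between the two kinds of error can be sketched with a toy scanner that is fed input in chunks. Everything here (`status`, `chunked_scanner`, `feed`) is a hypothetical illustration rather than the proposed interface: running out of input is recoverable once the user supplies more data, whereas a dead state is fatal and sticks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical names: a toy model of recoverable vs. fatal errors.
enum class status { token, need_more_data, fatal_error };

class chunked_scanner {
public:
    // Supplying more input is the "necessary action" that lets the scanner recover.
    void feed(std::string_view chunk) { buffer_.append(chunk.data(), chunk.size()); }

    // Tries to produce the next raw tag token of the form "<...>".
    status next() {
        if (fatal_) return status::fatal_error;  // draconian: no way back
        const std::size_t open = buffer_.find('<', pos_);
        if (open == std::string::npos) {
            pos_ = buffer_.size();               // character data is skipped in this sketch
            return status::need_more_data;
        }
        for (std::size_t i = open + 1; i < buffer_.size(); ++i) {
            if (buffer_[i] == '>') {
                token_ = buffer_.substr(open, i - open + 1);
                pos_ = i + 1;
                return status::token;
            }
            if (buffer_[i] == '<') {             // dead state
                fatal_ = true;
                error_ = "unexpected '<' inside a tag";
                return status::fatal_error;
            }
        }
        pos_ = open;                             // keep the partial tag and resume after feed()
        return status::need_more_data;
    }

    const std::string& token() const { return token_; }
    const std::string& error_string() const { return error_; }

private:
    std::string buffer_;
    std::size_t pos_ = 0;
    std::string token_;
    std::string error_;
    bool fatal_ = false;
};
```

A caller would loop on `next()`, call `feed()` whenever `need_more_data` is reported, and stop for good on `fatal_error`.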
Being a library designer one must be quite aware of what the library is and what it is not!
What XML Reader can do?
- Generate low level events
- Provide metadata about the different markup constructs and text through convenience functions
- Translate Character references (optional/feature can be removed if required)
What XML Reader cannot do?
- Validate an XML document
- Check the well-formedness of the document (This is because making xml reader check well-formedness will introduce both performance and memory problems, also it will violate the SRP).
- Translate character references (optional)
- Translate entity references
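
The optional translation of character references could look roughly like the sketch below. The function name `decode_char_ref` is hypothetical; the range check follows the `Char` production of XML 1.0, and anything that is not a numeric character reference (for example an entity reference such as `&amp;`) is rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: decodes "&#DDD;" or "&#xHHH;" into a Unicode code point.
// Returns std::nullopt if the reference is malformed or names an illegal XML character.
inline std::optional<char32_t> decode_char_ref(std::string_view ref) {
    if (ref.size() < 4 || ref.substr(0, 2) != "&#" || ref.back() != ';')
        return std::nullopt;
    ref.remove_prefix(2);  // drop "&#"
    ref.remove_suffix(1);  // drop ";"

    std::uint32_t base = 10;
    if (ref.front() == 'x') {
        base = 16;
        ref.remove_prefix(1);
    }
    if (ref.empty()) return std::nullopt;

    std::uint32_t value = 0;
    for (char c : ref) {
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9') digit = static_cast<std::uint32_t>(c - '0');
        else if (base == 16 && c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a') + 10;
        else if (base == 16 && c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A') + 10;
        else return std::nullopt;
        value = value * base + digit;
        if (value > 0x10FFFF) return std::nullopt;  // beyond the Unicode range
    }

    // XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    const bool legal = value == 0x9 || value == 0xA || value == 0xD ||
                       (value >= 0x20 && value <= 0xD7FF) ||
                       (value >= 0xE000 && value <= 0xFFFD) ||
                       (value >= 0x10000 && value <= 0x10FFFF);
    if (!legal) return std::nullopt;
    return static_cast<char32_t>(value);
}
```

Encoding the resulting code point into the document's encoding is deliberately left out of this sketch.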
Basic API Design
Event: Determines what kind of event(action) has occurred in the parser
Event Types
The following events can be generated by the XML reader:

| EVENT | Description |
|---|---|
| START_TAG | Parser read a starting tag |
| END_TAG | Parser read an ending tag |
| EMPTY_ELEMENT | Parser read an empty element |
| CHARACTER | Parser read inner text of a tag |
| COMMENT | Parser read a comment |
| ENTITY REFERENCE | Parser read an entity reference |
| PROCESSING INSTRUCTION | Parser read a processing instruction |
| ERRORED | Parser is halted or has resulted in some fatal error |
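
One possible way to model these events in C++ is a scoped enumeration. The type name `event_type`, the lower-case enumerator spellings, and the `to_string` helper are assumptions made for illustration; the final naming is subject to the API design phase.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical modelling of the event table; names are assumptions.
enum class event_type {
    start_tag,
    end_tag,
    empty_element,
    character,
    comment,
    entity_reference,
    processing_instruction,
    errored
};

// Maps an event to the label used in the table, e.g. for logging.
constexpr std::string_view to_string(event_type e) noexcept {
    switch (e) {
    case event_type::start_tag:              return "START_TAG";
    case event_type::end_tag:                return "END_TAG";
    case event_type::empty_element:          return "EMPTY_ELEMENT";
    case event_type::character:              return "CHARACTER";
    case event_type::comment:                return "COMMENT";
    case event_type::entity_reference:       return "ENTITY REFERENCE";
    case event_type::processing_instruction: return "PROCESSING INSTRUCTION";
    case event_type::errored:                return "ERRORED";
    }
    return "UNKNOWN";  // only reachable for values outside the enumeration
}
```

A scoped enum keeps the event names out of the enclosing namespace and lets the compiler warn about unhandled events in a `switch`.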
The XML reader also allows the user to query metadata through several convenience methods. The value returned by these convenience methods depends on the current event reported.
The list below shows some of the convenience methods provided by the reader. (The querying interface is similar to the CopperSpice XML API, which is derived from Qt; this is done to reduce the learning curve for users of this library.) This list is by no means exhaustive, and the signatures or the methods themselves are subject to change upon further research.
- at_end() : Returns true if the parser has reached the end of the document or has errored
- encoding() : Returns the encoding based on the prolog or the BOM in the file
- xml_version() : Returns the XML version to which the document adheres (1.0, 1.1)
- error_string() : Returns the error message associated with the parser
- has_error() : Determines if the parser is in a halted/fatal error state
- is_document_end() : Indicates if the parser has reached the end of the document
- is_entity_reference() : Indicates if the current token is an entity reference
- is_character_reference() : Indicates if the current token is a character reference
- is_processing_instruction() : Returns true if the current element is a processing instruction
- is_standalone_document() : Returns true if the XML document is a standalone document
- attributes() : Returns a map of attributes with their values if a start tag has been read
- get_processing_instruction() : Returns a pair containing the target and instruction
- is_cdata() : Determines if we are inside a CDATA section
- cdata_text() : Returns the text inside the current CDATA section ...
Control Methods :
- next() : Parses the input and generates the next token (along with the appropriate event)
- clear() : Clears the parser state entirely. After calling clear, the parser behaves as if it had been default constructed
- next_start_tag() : Returns the next start tag
...
Basic Example
    xml::reader xml_reader;
    // Set up the input stream
    while (!xml_reader.is_document_end()) {
        // Advance to the next token/event
        xml_reader.next();
        if (xml_reader.is_processing_instruction()) {
            auto [target, instruction] = xml_reader.get_processing_instruction();
            fmt::print("Target : {}\n Instr : {}", target, instruction);
        }
    }
Overview
xml::writer is a simple streaming API that is used for generating XML documents. It serves as the counterpart of xml::reader and operates on an output buffer that is supplied by the user.
xml::writer provides the user with a collection of methods that can be used for modifying or editing various aspects of an XML document.
Design Goals
I am still working on improving the API design for xml::writer and hence would like to present the main design goals in the form of bullet points:
- Easy to use API for creating XML documents
- The API should possess a high level of abstraction, and the user must not be concerned with the angle-bracket ("pointy bracket") notation
- The documents generated by the writer must at least be well-formed
- The user should be able to connect the writer to a reader, so that the reader's current token (coming from a different file) can be written out directly
- The writer should be fast and must occupy only a small memory footprint
Error Handling
The errors must be handled in the same way as that of the reader.
What the xml writer can do?
- Generate well-formed XML documents.
- Support substitution of character references.
- Provide the user with a higher-level API, such that the user is not at all concerned with the angle-bracket ("pointy bracket") notation (e.g., ending tags)
What the xml writer cannot do?
- Cannot produce a valid XML document (currently it lacks the support for DTD)
- No support for substitution of entity references (will be given after building up the basic writer)
Basic API design
As mentioned before, xml::writer provides the user with a set of methods that allow the user to generate an XML document at a high level of abstraction.
Some of the methods are listed below. (This is by no means an exhaustive list, and the signatures or even entire methods are subject to change in the future.)
- is_errored() : Returns true if the writer is unable to write into the stream
- write_prolog(version,encoding) : Writes the prolog for the XML document with the XML version set as version and the XML encoding set as encoding
- write_start_element(): Writes the start element.
- write_attribute(): Takes a key-value pair as argument and writes the attribute with its value into the stream.
- write_attributes(): Takes pairs of attributes and their values and writes them into the stream for the particular element
- write_end_element(): Generates the end element corresponding to the start element
- write_end_document(): Closes all the remaining open elements and ends the stream with a newline
....
Basic Example
    xml::writer xml_writer;
    xml_writer.set_output_stream(/* set the stream */);
    xml_writer.write_prolog("1.0", "UTF-8");
    xml_writer.write_start_element("meow");
    xml_writer.write_attribute("arch", "meow64");
    xml_writer.write_end_element();
    xml_writer.write_end_document();
    // The destructor could also invoke write_end_document() automatically if required, but the buffer must then outlive the writer object
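
To show how a writer can hide end tags from the user, here is a minimal sketch that keeps a stack of open elements. The class `sketch_writer` is hypothetical and much simpler than the proposed xml::writer (no prolog, no text content); method names are borrowed from the list above, but the signatures are assumptions. Special characters in attribute values are replaced by numeric character references.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified writer: tracks open elements on a stack.
class sketch_writer {
public:
    explicit sketch_writer(std::ostream& out) : out_(out) {}

    void write_start_element(std::string_view name) {
        close_pending_start_tag();
        out_ << '<' << name;
        open_.emplace_back(name);
        start_tag_open_ = true;
    }

    void write_attribute(std::string_view key, std::string_view value) {
        if (!start_tag_open_) { errored_ = true; return; }  // attributes only directly after a start tag
        out_ << ' ' << key << "=\"" << escape(value) << '"';
    }

    void write_end_element() {
        if (open_.empty()) { errored_ = true; return; }
        if (start_tag_open_) {
            out_ << "/>";                         // no content: emit an empty-element tag
            start_tag_open_ = false;
        } else {
            out_ << "</" << open_.back() << '>';
        }
        open_.pop_back();
    }

    // Closes all remaining open elements and ends the stream with a newline.
    void write_end_document() {
        while (!open_.empty()) write_end_element();
        out_ << '\n';
    }

    bool is_errored() const { return errored_ || !out_; }

private:
    void close_pending_start_tag() {
        if (start_tag_open_) { out_ << '>'; start_tag_open_ = false; }
    }

    // Replaces markup-significant characters with numeric character references.
    static std::string escape(std::string_view in) {
        std::string result;
        for (char c : in) {
            switch (c) {
            case '&': result += "&#38;"; break;
            case '<': result += "&#60;"; break;
            case '"': result += "&#34;"; break;
            default:  result += c;       break;
            }
        }
        return result;
    }

    std::ostream& out_;
    std::vector<std::string> open_;
    bool start_tag_open_ = false;
    bool errored_ = false;
};
```

With this sketch, writing a `meow` element with the attribute `arch="meow64"` and then ending the document yields `<meow arch="meow64"/>` followed by a newline; the user never spells out a closing tag.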
Instead of listing day-to-day activities for the GSOC period (which becomes highly granular), I would like to start at a higher level of planning and flesh it out to a reasonable point.
There are three major milestones in Boost.XML project
- API Design Freeze (M1)
- Completion of xml::reader (M2)
- Completion of xml::writer (M3)
Community Bonding Period: 3 weeks (17 May - 7 June)
During the community bonding period, the first two weeks will be spent researching more about the reader and writer API.
This involves the following activities
- Deciding the feature set for both reader and writer with the mentor
- Studying the XML specification ( A good source is annotated-xml.com) and noting down and discussing the important points with the mentor
- Researching and designing the external/front-facing API based on existing libraries of different ecosystems
At the end of 2 weeks, the first major milestone (M1) should be achieved. This means the API design for both reader and writer will be frozen and we will try our best to not make any major changes to the design in later stages.
This does not mean we are stuck with the design. As the coding starts there will be a lot of details that will unfold themselves. We will make sure to adapt them to the design after an internal review. Achieving the first milestone is simply a way to generate a basic scaffolding for the developer to keep track of what is going on.
During the second and third weeks (overlapping with the above), CI/CD will also be set up to ensure compatibility across different operating systems and compilers. Along with that, both Boost.Build and CMake will be configured for the project so that developers can easily get started with development in the future. If my mentor permits, I would also add a .devcontainer configuration along with a Dockerfile for containerized development (this ensures that a developer working on the code will have no problems setting up dependencies, the environment, etc.).
Phase 1: 6 weeks (7 June - 16 July)
The first 4 weeks will be spent writing the entire XML reader class and debugging it. The next two weeks will be spent generating benchmarks, finding critical hotspots, and optimizing them. The library used for benchmarking the code will be Google Benchmark.
After 6 weeks of development, the second major milestone (M2) should be achieved.
Phase 1 Evaluations: 5 days (12 July - 16 July)
These five days will be spent answering queries and defending the decisions that will be challenged by the mentor. Based on the result, changes will be made if necessary
Phase 2: 1 Month (16 July - 16 August)
The first 3 weeks will be dedicated to writing the entire XML writer class and debugging it. The following week will be spent benchmarking the code against well-known libraries and finding hotspots in the code to reduce the memory footprint and improve performance.
After 4 weeks of development, the third major milestone (M3) should be achieved.
Phase 2 Evaluations: 7 days (16 August - 23 August)
Again, as in the Phase 1 evaluations, the 7 days will be spent answering queries and defending decisions about the overall design of the library that are challenged by the mentor. Changes will only be made after thinking critically about their consequences.
Astute readers might have noticed that I have not mentioned unit tests and documentation anywhere in the schedule. This is because unit tests and documentation are not separate activities but an integral part of coding itself. I usually follow the TDD (Test Driven Development) style of development. TDD is especially useful in designing libraries and APIs because it helps the developer see the bigger picture of what each method does. In the traditional style of development, there is a very high chance that we might code and code and eventually perform a lot of hacks, which might reduce the modularity and flexibility of the library.