GSoC 2018: Final Report

This is the final report for my Google Summer of Code 2018 project. It also serves as my final code submission.

For the last 3 months I have been working with Debian on the project Extracting Data from PDF Invoices and Bills Details. Information about the project can be found here:

My mentor and I agreed to modify the work to be done over the summer. The change was discussed here:

We will advance the ecosystem for machine-readable invoice exchange and make it easily accessible for the whole Python community by making the following contributions:
  • A Python library to read, write, add and edit Factur-X metadata in different XML flavors.
  • A command line interface to process PDF files and access the main library functions.
  • A way to add structured data to existing files or from legacy accounting systems (via the invoice2data project).
  • A new desktop GUI to add, edit, import and export Factur-X metadata into and out of PDF files.

Short overview

The project work can be divided into two parts:
  • Main Deliverable: GUI creation for Factur-X Library
  • Pre-requisites for Main Deliverable: Improvements to invoice2data library and updating Factur-X library to a working state

Contributions to invoice2data

A modular Python library to support your accounting process. Tested on Python 2.7 and 3.4+. Main steps:
  1. extracts text from PDF files using different techniques, like pdftotext, pdfminer or tesseract OCR.
  2. searches for regex in the result using a YAML-based template system
  3. saves results as CSV, JSON or XML or renames PDF files to match the content.
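The second and third steps can be illustrated with a minimal sketch. It does not use the invoice2data API: the `TEMPLATE` dict, the `extract_fields` helper and the sample text are hypothetical stand-ins, with a plain dict taking the place of a YAML template and a string taking the place of the text extracted from a PDF.

```python
import json
import re

# Hypothetical template: field name -> regular expression with one capture group.
# invoice2data keeps such templates in YAML files; a plain dict stands in here.
TEMPLATE = {
    "issuer": r"(Example Corp)",
    "invoice_number": r"Invoice No\.\s*(\w+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "amount": r"Total:\s*EUR\s*(\d+\.\d{2})",
}


def extract_fields(text, template):
    """Return every template field whose regex matches the text."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, text)
        if match:
            result[field] = match.group(1)
    return result


# Stand-in for the text that step 1 would extract from a PDF.
SAMPLE_TEXT = """Example Corp
Invoice No. A1234
Date: 2018-08-01
Total: EUR 119.00
"""

fields = extract_fields(SAMPLE_TEXT, TEMPLATE)

# Step 3: save the result, here as JSON printed to standard output.
print(json.dumps(fields, indent=2, sort_keys=True))
```

Fields whose pattern does not match are simply left out of the result, so a template can list more fields than a given invoice contains.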
My contributions:

Contributions to Factur-X

Factur-X is an EU standard for embedding XML representations of invoices in PDF files. This library provides an interface for reading, editing and saving this metadata.
My contributions:

Organisation Page

An organisation created on GitHub, invoice-x, to bring all the repositories together in one place.
Link to the organisation page:

Organisation Website

A static website briefly explaining the whole project. Link to website:

Main Deliverable Repository

This repository contains the code for the GUI for the Factur-X library. Link to the repository:

invoicex-gui: invoice2data integration with invoicex-gui and factur-x-ng


Pre-requisites for Main Deliverable


To work on GUI creation for Factur-X, I first needed to update the Factur-X library to a working state. My mentor, Manuel, did the initial refactoring of the project after forking the original repository.

Since then I have added a few features to the library:
  • Fixed checking of embedded resources
  • Converted the documentation format from md to rst
  • Added unit tests for factur-x
  • Added new feature to export metadata in JSON and YAML format
  • Cleaned XML template to add
  • Added validation of country and currency codes with ISO standards.
  • Implemented Command Line Options


I started contributing to invoice2data in February. Invoice2data became the first open source project I contributed to. The first contribution was just fixing a typo in the documentation, but this introduced me to the world of Free and Open Source Software (FOSS).

Since being selected for Google Summer of Code 2018, I have added the following commits:
  • Removed required fields in favour of providing flexibility to extract data
  • Added feature to extract all fields mentioned in template
  • Updated README and worked on conversion of md to rst
  • Added checks for dependencies: tesseract and imagemagick
  • Changed subprocess input from a plain string to a list
  • Added more tests and checked coverage locally
  • Fixed the ways invoice2data handles lists
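The dependency check and the change to list-based subprocess input can be sketched like this. The helper names are hypothetical and this is not the code from the commits. It assumes Python 3.3 or later for `shutil.which`, and it assumes ImageMagick is installed under the command name `convert`, which may differ between versions.

```python
import shutil
import subprocess
import sys


def missing_dependencies(commands):
    """Return the commands that cannot be found on the PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def run_tool(args):
    """Run a command given as a list of arguments, without a shell.

    Each list item is passed to the program as one argument, so file
    names with spaces or shell metacharacters need no quoting.
    """
    completed = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    return completed.stdout.decode("utf-8")


# Report external tools that are not installed instead of failing later.
missing = missing_dependencies(["tesseract", "convert"])
if missing:
    print("Missing external tools: " + ", ".join(missing))

# Demonstration that runs anywhere: the Python interpreter echoes back its
# first argument, a file name containing a space.
output = run_tool(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "my invoice.pdf"]
)
print(output.strip())
```

With a single command string, the same file name would have to be quoted correctly for the shell. With a list, it reaches the program as exactly one argument.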

Main Deliverable


My main deliverable was to build a graphical user interface for the Factur-X library. For this I used the PyQt5 framework. The other options were Kivy and wxWidgets. I chose PyQt5 because I had some prior experience with it, and because of a bug in Kivy related to the touchpad driver on Debian.

Making the GUI was one of the most challenging parts of the GSoC project. The lack of documentation for PyQt5 didn’t help. I have 3 years of experience with C++, which I used to learn more about PyQt5 through the original Qt documentation, which is written for C++.

The GUI includes:
  • Selecting a PDF and searching it for any embedded standard
  • If no standard is found, showing a pop-up to select the standard to be added
  • Edit metadata of existing embedded standard
  • Export metadata
  • Validate Metadata
  • Use invoice2data to extract field data from invoice

Weekly Work Done: week 1, week 2, week 3, week 4, week 5, week 6, week 7, week 8, weeks 9 and 10, week 11, week 12

