GSoC 2018: Final Report
This is the final report for my Google Summer of Code 2018 project. It also serves as my final code submission.
For the last 3 months I have been working with Debian on the project Extracting Data from PDF Invoices and Bills Details. Information about the project can be found here:
My mentor and I agreed to modify the work planned for the summer, as already discussed here: http://blog.harshitjoshi.in/2018/05/gsoc-2018-debian-community-bonding.html
We will advance the ecosystem for machine-readable invoice exchange and make it easily accessible for the whole Python community by making the following contributions:
- Python library to read/write/add/edit Factur-X metadata in different XML flavors.
- Command line interface to process PDF files and access the main library functions.
- Way to add structured data to existing files or from legacy accounting systems. (via invoice2data project)
- New desktop GUI to add, edit, import and export Factur-X metadata into and out of PDF files.
Short overview
The project work can be bifurcated into two parts:
- Main Deliverable: GUI creation for Factur-X Library
- Pre-requisites for Main Deliverable: Improvements to invoice2data library and updating Factur-X library to a working state
Contributions to invoice2data
A modular Python library to support your accounting process. Tested on Python 2.7 and 3.4+. Main steps:
- extracts text from PDF files using different techniques, like
- searches for regex in the result using a YAML-based template system
- saves results as CSV, JSON or XML or renames PDF files to match the content.
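The three steps above can be illustrated with a minimal sketch. It is not invoice2data's actual code or API: the template is a plain Python dict standing in for a YAML template, and `TEMPLATE`, `extract_fields` and the sample text are hypothetical names chosen for illustration.

```python
import json
import re

# Hypothetical template; in invoice2data this would live in a YAML file.
TEMPLATE = {
    "keywords": ["ACME Corp"],
    "fields": {
        "invoice_number": r"Invoice No\.\s*(\w+)",
        "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
        "amount": r"Total:\s*(\d+\.\d{2})",
    },
}


def extract_fields(text, template):
    """Return the first regex match per field, or None if the template does not apply."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        if match:
            result[name] = match.group(1)
    return result


# Stand-in for text that has already been extracted from a PDF.
SAMPLE_TEXT = "ACME Corp\nInvoice No. A1234\nDate: 2018-08-01\nTotal: 99.50\n"

fields = extract_fields(SAMPLE_TEXT, TEMPLATE)
print(json.dumps(fields, indent=2))
```

Keeping the patterns in a template rather than in code means that supporting a new invoice layout only requires a new template, not a change to the extraction logic.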
Contributions to Factur-X
Factur-X is an EU standard for embedding XML representations of invoices in PDF files. This library provides an interface for reading, editing and saving this metadata.
My contributions: https://github.com/invoice-x/factur-x-ng/commits?author=duskybomb
Organisation Page
An organisation, invoice-x, was created on GitHub to bring all the repositories together in a single place.
Link to organisation page: https://github.com/invoice-x/
Organisation Website
A static website briefly explaining the whole project. Link to website: https://www.invoice-x.org/
Main Deliverable Repository
This repository contains the code for the GUI for the Factur-X library. Link to the repository: https://github.com/invoice-x/invoicex-gui
Figure: invoicex-gui: invoice2data integration with invoicex-gui and factur-x-ng
Pre-requisites for Main Deliverable
Factur-X
To work on GUI creation for Factur-X, I first needed to bring the Factur-X library to a working state. My mentor, Manuel, did the initial refactoring of the project after forking the original repository, https://github.com/akretion/factur-x.
Since then I have added a few features to the library:
- Fixed checking of embedded resources
- Converted the documentation format from md to rst
- Added unit tests for factur-x
- Added a new feature to export metadata in JSON and YAML formats
- Cleaned XML template to add
- Added validation of country and currency codes with ISO standards.
- Implemented Command Line Options
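As an illustration of the code validation and JSON export features listed above, here is a minimal sketch. It does not use the factur-x-ng API: `validate_codes`, `export_json` and the metadata field names are hypothetical, and the code sets are small hand-picked subsets of ISO 3166-1 alpha-2 and ISO 4217 rather than the full lists.

```python
import json

# Small illustrative subsets; a real check would load the complete ISO lists.
ISO_3166_ALPHA2 = {"DE", "FR", "IN", "US", "GB"}
ISO_4217 = {"EUR", "USD", "INR", "GBP"}


def validate_codes(metadata):
    """Return a list of human-readable problems found in the metadata."""
    errors = []
    country = metadata.get("seller_country")
    currency = metadata.get("currency")
    if country not in ISO_3166_ALPHA2:
        errors.append("unknown country code: %r" % (country,))
    if currency not in ISO_4217:
        errors.append("unknown currency code: %r" % (currency,))
    return errors


def export_json(metadata):
    """Serialise the metadata dict to a JSON string with a stable key order."""
    return json.dumps(metadata, indent=2, sort_keys=True)


good = {"seller_country": "DE", "currency": "EUR", "amount_total": "99.50"}
bad = {"seller_country": "XX", "currency": "EURO"}

print(validate_codes(good))
print(validate_codes(bad))
print(export_json(good))
```

Returning a list of problems instead of raising on the first one lets a caller, such as a GUI, show every invalid field at once.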
Invoice2data
I started contributing to invoice2data in February. It became the first open source project I contributed to. My first contribution was just fixing a typo in the documentation, but it introduced me to the world of Free and Open Source Software (FOSS).
Since being selected for Google Summer of Code 2018, I have added the following commits:
- Removed required fields in favour of providing flexibility to extract data
- Added feature to extract all fields mentioned in template
- Updated README and worked on conversion of md to rst
- Added checks for dependencies: tesseract and imagemagick
- Changed subprocess input from a plain string to a list
- Added more tests and checked coverage locally
- Fixed the ways invoice2data handles lists
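Two of the changes above, checking for external tools and passing subprocess arguments as a list, can be sketched with the standard library alone. This is not the code from invoice2data: `missing_dependencies` and `run_tool` are hypothetical helpers, the executable name `convert` for ImageMagick is an assumption, and the example runs the Python interpreter itself so that it works without tesseract or ImageMagick installed.

```python
import shutil
import subprocess
import sys


def missing_dependencies(executables):
    """Return the names from `executables` that are not found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


def run_tool(args):
    """Run a command given as a list of arguments and return its stdout as text."""
    completed = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    return completed.stdout.decode("utf-8")


# Report which of the assumed external tools are unavailable on this machine.
print(missing_dependencies(["tesseract", "convert"]))

# A filename containing a space is passed through as one argument, no quoting needed.
output = run_tool(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "my invoice.pdf"]
)
print(output.strip())
```

With a list, each argument reaches the program unchanged, so file names with spaces or shell metacharacters do not need escaping and are not interpreted by a shell.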
Invoicex-GUI
My main deliverable was a Graphical User Interface for the Factur-X library. For this I used the PyQt5 framework. The other options were Kivy and wxWidgets. I had some prior experience with PyQt5, and a bug in Kivy related to the touchpad driver on Debian inclined me to use PyQt5.
Making the GUI was one of the most challenging parts of the GSoC project. The lack of documentation for PyQt5 didn’t help much. I have 3 years of experience with C++ and used it to learn more about PyQt5 through the original Qt documentation, which is written for C++.
The GUI includes:
- Select a PDF and search it for any embedded standard
- If no standard is found, give a pop up to select the standard to be added
- Edit metadata of existing embedded standard
- Export metadata
- Validate Metadata
- Use invoice2data to extract field data from invoice
Weekly Work Done
https://lists.debian.org/debian-outreach/2018/05/msg00015.html (week 1)
https://lists.debian.org/debian-outreach/2018/05/msg00029.html (week 2)
https://lists.debian.org/debian-outreach/2018/06/msg00003.html (week 3)
https://lists.debian.org/debian-outreach/2018/06/msg00029.html (week 4)
https://lists.debian.org/debian-outreach/2018/06/msg00078.html (week 5)
https://lists.debian.org/debian-outreach/2018/06/msg00106.html (week 6)
https://lists.debian.org/debian-outreach/2018/06/msg00136.html (week 7)
https://lists.debian.org/debian-outreach/2018/07/msg00019.html (week 8)
https://lists.debian.org/debian-outreach/2018/07/msg00072.html (week 9, 10)
https://lists.debian.org/debian-outreach/2018/07/msg00105.html (week 11)
https://lists.debian.org/debian-outreach/2018/08/msg00011.html (week 12)