
Informatik 5
Information Systems
Prof. Dr. M. Jarke

Contact

Prof. Dr. M. Jarke
RWTH Aachen
Informatik 5
Ahornstr. 55
D-52056 Aachen
Tel +49/241/8021501
Fax +49/241/8022321


Next-generation Sequencing Using Data Lakes - MA / BA

Thesis type
  • Master
Status Open
Supervisor(s)
Advisor(s)

NGS Pipeline on Azure Data Lake

The development of next-generation sequencing (NGS) technologies at the beginning of the 21st century opened the door to revolutionary study setups that give access to a long-closed dimension of knowledge in several fields of research, such as genetics, ecology, and medicine. The so-called “pipeline” links the different work steps and their corresponding tools within the workflow for analyzing an NGS dataset.

Microsoft Azure Data Lake is based on two services. Azure Data Lake Store can store different types of data of almost unlimited size. Azure Data Lake Analytics makes it possible to run massively parallel data transformation and processing programs in U-SQL, R, Python, and .NET over petabytes of data stored in Azure Data Lake Store.

In our company, we develop the system BOA, a pipeline based on Microsoft Azure Data Lake designed for analyzing NGS datasets. To achieve maximum performance and optimal results, the data for the various steps are partitioned and analyzed using different algorithms. Data input and output in BOA are handled via a user-friendly interface. In addition, the system can analyze the data directly against various reference databases.

The goals of this thesis are:

- to evaluate different algorithms for data partitioning in the context of RNA sequences
- to evaluate different algorithms in different languages (C#, R, Python) to optimize the performance of sequence alignment
- to design and implement an incremental loading process of different RNA reference databases for the alignment process
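
To illustrate the first goal, one simple baseline that such an evaluation might include is hash partitioning of reads by their sequence. The following Python sketch is not part of BOA and does not use any Azure API; the function names `partition_index` and `partition_reads`, the partition count, and the example reads are assumptions made purely for illustration.

```python
import hashlib
from collections import defaultdict


def partition_index(sequence: str, num_partitions: int) -> int:
    """Map a sequence to a partition via a stable hash (hypothetical helper)."""
    # hashlib gives the same result across runs, unlike Python's built-in hash().
    digest = hashlib.md5(sequence.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_reads(reads, num_partitions):
    """Group (read_id, sequence) pairs into partitions by sequence hash."""
    partitions = defaultdict(list)
    for read_id, sequence in reads:
        index = partition_index(sequence, num_partitions)
        partitions[index].append((read_id, sequence))
    return dict(partitions)


if __name__ == "__main__":
    # Made-up example reads; real input would come from an NGS dataset.
    example_reads = [
        ("read1", "ACGUACGUAC"),
        ("read2", "GGCAUUCGAA"),
        ("read3", "ACGUACGUAC"),  # same sequence as read1
    ]
    for index, members in sorted(partition_reads(example_reads, 4).items()):
        print(index, [read_id for read_id, _ in members])
```

In this sketch, identical sequences always land in the same partition, which can help with deduplication, but the scheme ignores sequence similarity. Comparing such a baseline with partitioning strategies that take similarity into account is the kind of evaluation the first goal describes.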
