Next Generation Sequencing Experiment

The general NGS sequencing workflow is outlined below.

1. Experimental Design

Contact the Facility Director for advice BEFORE you submit your samples.

2. Sample Preparation

Sample Preparation through iLAB
Next Generation Sequencing Instructions

Genomic DNA prep: DNA should be of high quality, at least 0.1 µg, at a concentration of 10 ng/µl or higher in TE.
ChIP DNA prep: at least 1 ng ChIP-enriched DNA in 15 µl or less of water, in a low retention tube. The ChIP DNA should be mostly shorter than 500 bp.
RNA-Seq Directional Library Prep: at least 0.5 µg total RNA of good quality in 50 µl or less of RNase-free water. The RNA Integrity Number (RIN) from Bioanalyzer QC should be 9.0 or higher.
Small RNA-Seq Library Prep: at least 0.1 µg total RNA in 10 µl or less of RNase-free water, preferably 0.5-1 µg input total RNA.
RNA-Seq Low Input Non-Directional: one or more eukaryotic cells. Please contact the Facility Director for recommendations.
10X Genomics Single Cell RNA-Seq V2 library prep: a single-cell suspension of viable cells in a compatible buffer. Please see the 10X Genomics Single Cell Protocols for recommendations.
Pre-made Libraries: a minimum of 20 µl of 5 ng/µl pre-made library in EB (Tris-Cl 10 mM, pH 8.5) must be provided in a low retention tube. Bioanalyzer QC of library size is highly recommended for each library (before pooling multiplexed libraries).
Custom Primer: the custom sequencing primer should be at a concentration of 100 µM, with a minimum volume of 20 µl.

NOTE: When filling out the NGS form in iLAB, all "Sample IDs" entered should match what you have written on your tubes. Sample IDs should be short, unique and meaningful.

3. Sample Submission

Navigate to Genomics Core Facility iLAB.
Select the Request Services tab and click on the ‘Initiate Request’ button next to the service of interest.
You will be asked to complete a form before submitting a request to the core.
When filling out the form in iLAB, all “Sample IDs” entered should match what you have written on your tubes. Sample IDs should be short, unique and meaningful.
Print out a copy of the form and bring it with you when submitting your samples.
Place the samples along with the iLAB form in the Mini-freezer located within the Genomics Core Facility (appointments are suggested for new users).

4. Sequencing (performed by core staff)

5. Data Retrieval

After sequencing has been completed, you will receive an email from the HTSEQ system telling you that your data is ready. The data from an Illumina sequencing run consists of one FASTQ-formatted file per read: the forward read, the index read, and optionally the reverse read. The quality values use the standard Sanger encoding scheme (each Phred score is stored as the ASCII character with code score + 33), and the files are gzipped to reduce file sizes and download times. In addition, you have access to reads that were filtered out during quality control, in case those are useful to you.

Starting in the summer of 2011, we began spiking 1% PhiX DNA into each sample to assist with quality control. The HTSEQ system filters fragments that map to PhiX into a separate file. A fragment is classified by mapping read 1 (the forward read) to the PhiX genome using Bowtie with default settings.

Your data can be retrieved in two ways: via the HTSeq website or via a network drive.

HTSeq Website

The simplest way to retrieve your data is to log in to HTSeq, search for the assay you are interested in, and click the download button. For detailed instructions, please see the help page. This method allows you to get Sanger-formatted FASTQ files of reads as well as some basic QC analysis. The FASTQ files on HTSeq will be kept for as long as space is available.

Network Drive

In addition to using the interactive HTSEQ website, you can also download your data via a network drive.
This can make scripting downloads simpler and is recommended when downloading data directly to a server such as the TIGRESS Della computing cluster or Lewis-Sigler's Cetus computing cluster. To download your data this way, you need:

Princeton NetID: You must have an active Princeton NetID, which must match your HTSEQ UserID. Please contact Lewis-Sigler computing support if you do not have an active NetID or are having difficulty connecting to the network drive.
An account on the Arrayfiles server: You must print, sign, and submit the Arrayfiles Data Storage and Retrieval Agreement to obtain an account on arrayfiles.princeton.edu.
On-campus or VPN network connection: To access the network drive you must be on the Princeton network or connected via VPN.

The data is available from the htseq share on arrayfiles.princeton.edu. Connect as described below for your operating system.

Windows

In the Windows Explorer Address bar, type \\arrayfiles.princeton.edu\htseq
Enter PRINCETON\yourNetId as the username.
Enter your NetID password in the password field.

Or, for a more permanent way to connect:
Choose Map Network Drive from the Explorer Tools menu.
Select the drive letter of your choice (e.g. "S:" or "Q:") and enter the share address in the Folder field: \\arrayfiles.princeton.edu\htseq
If you tick the Reconnect at logon checkbox, Windows will attempt to reconnect the network drive when you next log in.

Mac OS X

With the Finder active, select Connect to Server... from the Go menu, or press Command-K.
In the Connect to Server window, type the following in the Address field: smb://arrayfiles.princeton.edu/htseq
Click the Connect button and log in using Princeton as the Workgroup/Domain, your NetID as the Username, and your NetID password.

Linux

Browse: Use the smbclient command to connect and transfer data interactively. This is useful for testing your connection to the server, or if you do not have administrator rights on the system.
    smbclient //arrayfiles.princeton.edu/htseq -W PRINCETON -U <princeton_netid>

The password will be your Princeton NetID password.

Mount the drive: To mount the network drive as a directory on your system, use the following procedure:

Make a directory for the mount point:
    mkdir /mnt/<name-of-mount-point>
Mount the share:
    mount -t smbfs -o username=<username>,password=<password> //arrayfiles.princeton.edu/htseq /mnt/<name-of-mount-point>
(On current Linux distributions the smbfs filesystem type has been replaced by cifs, so -t cifs may be needed instead.)
Create a symbolic link to the mounted drive:
    ln -s /mnt/<name-of-mount-point> /<path-of-symlink>

6. Data Analysis

Tutorials

Barcode Splitting Multiplexed Reads
SNP and Indel Detection
ChIP-Seq
RNA-Seq
More tutorials from the Princeton HTSEQ Users Group

HTSeq Database

The HTSEQ database provides access to quality control statistics about your data generated by the Illumina basecaller and the FastQC software package.

Galaxy

The Galaxy workflow system provides a simple way to analyze high-throughput sequencing and other biological datasets. The Lewis-Sigler Institute Bioinformatics Group has set up a local instance of Galaxy for use by Princeton researchers. In addition to many of the tools available from the public instance of Galaxy at Penn State, our local version provides easy upload of data from the HTSEQ system as well as a number of customized analysis tools requested by researchers in the Institute. Find more information about Galaxy, including a list of the most useful tools for high-throughput sequence analysis.

Other Resources

Princeton HTSEQ User Group

The Princeton HTSEQ Users Group provides discussion groups and regular meetings where users share tips and methods for dealing with sequence data. There is a list of tutorials on the website, as well as a list of software that users have found particularly useful.
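For a quick look at downloaded data alongside the QC statistics in the HTSeq database, the Sanger quality encoding described under Data Retrieval (ASCII + 33) can be decoded directly. The following is a minimal Python sketch, not part of the HTSEQ system. The function names are hypothetical, and it assumes a gzipped FASTQ file with the standard four lines per record.

```python
import gzip

# Sanger encoding: each quality character is chr(Phred score + 33).
PHRED_OFFSET = 33


def decode_quality(quality_string):
    """Convert a Sanger-encoded quality string into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a text handle.

    Assumes four lines per record: '@name', sequence, '+', quality.
    """
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality


def mean_quality_per_read(path):
    """Return {read name: mean Phred score} for a gzipped FASTQ file."""
    means = {}
    with gzip.open(path, "rt") as handle:
        for name, _sequence, quality in read_fastq(handle):
            if quality:  # skip records with an empty quality line
                scores = decode_quality(quality)
                means[name] = sum(scores) / len(scores)
    return means
```

For example, decode_quality("I5!") gives [40, 20, 0], since "I", "5" and "!" have ASCII codes 73, 53 and 33. mean_quality_per_read would be called with the path to one of your downloaded gzipped FASTQ files.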
Lewis-Sigler and University Computing Resources

In addition to HTSEQ and Galaxy, the Lewis-Sigler Institute provides its members with access to genomics computing resources including Linux servers, high-performance data storage, and cluster (grid) computing systems.

TIGRESS High Performance Computing Center

The University's TIGRESS High Performance Computing Center is a collaborative facility that brings together funding, support, and participation from the Princeton Institute for Computational Science and Engineering (PICSciE), the Office of Information Technology (OIT), the School of Engineering and Applied Science (SEAS), the Lewis-Sigler Institute for Integrative Genomics (Genomics), the Princeton Institute for the Science and Technology of Materials (PRISM), the Princeton Plasma Physics Laboratory (PPPL), and a number of academic departments and faculty members. The facility is designed to provide a well-balanced set of High Performance Computing (HPC) resources that meet the broad computational requirements of the University research community.

Barcode Splitting Multiplexed Data

If you multiplexed samples into a single lane using Illumina's barcoded TruSeq adapters, you may need to split your reads into separate files for each sample. The simplest way to do this is to use the Galaxy system's "Barcode Splitter" tool. If you don't already have access to Galaxy, you can request it by sending an email to the Lewis-Sigler Bioinformatics Group with your name, Princeton lab affiliation, and Princeton NetID.

Import your data into Galaxy

Log in to Galaxy.
Choose the Get Data => Princeton HTSEQ tool from the left menu.
Log in to the HTSEQ database and select Search => Assay Search from the menu to find the assay you are interested in.
Click the [Upload to Galaxy] button next to the Read 1 passed filter data file, and click Upload to Galaxy to confirm on the next page.
You should now see a new data file in your Galaxy history; it will be yellow while the data imports.
Repeat this process for Read 2 (the index read) and the Barcodes file. If the Barcodes file is missing or inaccurate, you will need to provide a tab-delimited file with two columns: the first is the sample name and the second is the barcode sequence that corresponds to it.
For paired-end runs only: upload the Read 3 data.

Split the data into individual files for each sample

From the top menu, select [Shared Data] => [Published Workflows].
Choose either [Barcode Split (single-end)] or [Barcode Split (paired-end)].
Select [+ Import Workflow] from the top right, and click on "start using this workflow".
Click on the new workflow in the menu, [Imported: Barcode Split (single-end)], and select [Run].
Select the appropriate data files in the menus for Read 1, Barcodes File, and Read 2.
Choose an appropriate number of mismatches for the barcode matching (typically 0 or 1 mismatch is appropriate). For paired-end data, you must enter the same number of mismatches for BOTH barcode splitting steps.
Click on [Run Workflow]. You will receive an email at your princeton.edu email address when the splitting is complete.

Review the results

Each input read file will be split into multiple files: one for each sample in the barcodes file, and one to hold reads that didn't match any known sample. There will also be a small report that indicates how many reads were matched to each sample and the percent of the total.
For more information about how to get started with Galaxy and the available tools, see the Galaxy FAQs.
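To make the barcode splitting steps above more concrete, here is a minimal Python sketch of the underlying idea: read the two-column, tab-delimited barcodes file, compare each index read against every barcode, allow a chosen number of mismatches, and report counts and percentages per sample. This is an illustration only and not the Galaxy tool's implementation. All function names and the "unmatched" label are hypothetical, and it assumes that the barcode sits at the start of the index read and that a tie between two equally close barcodes is treated as unmatched.

```python
import csv
from collections import Counter

UNMATCHED = "unmatched"  # hypothetical label for reads that fit no sample


def load_barcodes(handle):
    """Parse a tab-delimited barcodes file: sample name, then barcode sequence.

    Returns a dict mapping barcode sequence -> sample name.
    """
    barcodes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[0].strip():
            barcodes[row[1].strip().upper()] = row[0].strip()
    return barcodes


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, barcodes, max_mismatches=1):
    """Return the sample whose barcode best matches the start of the index read.

    Reads with no barcode within max_mismatches, or with a tie between
    two equally close barcodes, are reported as UNMATCHED.
    """
    hits = []
    for barcode, sample in barcodes.items():
        prefix = index_read[:len(barcode)].upper()
        if len(prefix) == len(barcode):
            distance = hamming(prefix, barcode)
            if distance <= max_mismatches:
                hits.append((distance, sample))
    if not hits:
        return UNMATCHED
    hits.sort()
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return UNMATCHED  # ambiguous assignment
    return hits[0][1]


def split_report(index_reads, barcodes, max_mismatches=1):
    """Return {sample: (read count, percent of total)} for the index reads."""
    counts = Counter(assign_sample(r, barcodes, max_mismatches) for r in index_reads)
    total = sum(counts.values())
    return {sample: (n, 100.0 * n / total) for sample, n in counts.items()}
```

With 0 mismatches only exact barcode matches are assigned. Allowing 1 mismatch recovers reads with a single sequencing error in the index, at the cost of a slightly higher risk of misassignment when barcodes are similar to one another.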