[Genomic Big Data Analysis Code :: R Programming] Drawing a PC Plot from PCA Results

This post covers the R code that draws PC plots from PC values, after PCA has been run on WGS data. Before running PCA, the data were pruned for linkage disequilibrium (LD): SNPs in LD cause problems in the matrix calculations, so the pruning step removes them first and PCA is run on the pruned data. The library used for the analysis is tidyverse, which is loaded so that the ggplot() function can be used in R.

/// R programming code
library(tidyverse)

Next, the folder and file names are specified. When the same code is run for several scenarios, it is usually better to set up a pipeline that automatically stores the results of each scenario as the input folder and file names change. For this reason, the libraries go in the first block and the input folder and file names go in the second block. The actual code comes after that, and once it is set up, this main part does not need to change in later runs.

In this setup block, the data obtained by merging the per-chromosome VCF files are specified, and if there is a phenotype file and a file to be used for covariates, those are defined as well. Variables holding the locations of the eigenvalue and eigenvector files produced by PCA are also assigned.
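As a rough sketch of what such a setup block might look like, the helper below builds every path from one folder name and one prefix, so only those two values change between runs. The helper `make_paths()`, the folder name, the prefix, and the file extensions are all assumptions made for illustration (PLINK, for example, writes `.eigenval` and `.eigenvec` files), not names taken from the original script.

```r
# Sketch of a setup block; every folder name, prefix, and extension is hypothetical.
make_paths <- function(dir_input, prefix) {
  list(
    eigenval        = file.path(dir_input, paste0(prefix, ".eigenval")),
    eigenvec        = file.path(dir_input, paste0(prefix, ".eigenvec")),
    fname_pheno     = file.path(dir_input, paste0(prefix, "_pheno.txt")),
    fname_covariate = file.path(dir_input, paste0(prefix, "_cov.txt")),
    fname_pheno_new = file.path(dir_input, paste0(prefix, "_pheno_filtered.txt"))
  )
}

# Changing these two arguments is all that is needed for another scenario.
paths <- make_paths("input_run1", "merged_pruned")

# Report which inputs are missing before the main code runs.
inputs <- c(paths$eigenval, paths$eigenvec, paths$fname_pheno, paths$fname_covariate)
missing_inputs <- inputs[!file.exists(inputs)]
if (length(missing_inputs) > 0) {
  message("Missing input files: ", paste(missing_inputs, collapse = ", "))
}
```

Keeping the paths in one list means the main analysis code can refer to `paths$eigenvec` and similar entries without being edited between runs.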

The race information for the samples is loaded as follows.

/// R programming code
df_race = read.delim('race/race_father.txt')

In the race data, the column new_ID holds the sample ID. In the PCA data loaded earlier, the sample ID is in the column ind, so the two data frames can be merged. Use the merge() function, setting by.x and by.y separately so that a different key column is used for each data frame.

/// R programming code
df_merged = merge(pca, df_race, by.x = 'ind', by.y = 'new_ID')

In rare cases the race information is NA or empty, so such rows are excluded: subset() keeps only the rows where Ethnic.Category is neither NA nor an empty string.

/// R programming code
df_subset = subset(df_merged, !is.na(Ethnic.Category) & Ethnic.Category != '')

Next, a new column called race is created that keeps only the leading part of Ethnic.Category, with the leading part delimited by the ':' string. This variable is then used in subset() to extract the IDs of the samples to keep.

/// R programming code
filtered_ID = subset(df_subset, race %in% c("White", "NotStatedWhite"))$ind

Based on the extracted IDs, the phenotype and covariate data are read and restricted to the rows belonging to those IDs. Where an ID appears more than once, only the first row is kept and the rest are removed.

/// R programming code
df_pheno = read.table(fname_pheno)
df_tem = subset(df_pheno, V2 %in% filtered_ID)
df_whiteonly = df_tem[!duplicated(df_tem$V2), ]

The code above filters the phenotypes, and the code below filters the covariates.

/// R programming code
df_covariate = read.table(fname_covariate)
df_tem1 = subset(df_covariate, V2 %in% filtered_ID)
df_whiteonly_cov = df_tem1[!duplicated(df_tem1$V2), ]

The newly created phenotype and covariate data must then be written to new files.
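Before the files are written, the merge, filtering, and de-duplication steps above can be tried end to end on toy data. The sketch below uses only base R; all sample IDs and values are made up, the "White: British" style of category label is an assumption about how Ethnic.Category is formatted, and the `sub()` call is one possible way to build the race column, not necessarily the one used in the original script.

```r
# Toy stand-ins for the PCA scores and the race table (all values are made up).
pca <- data.frame(ind = c("s1", "s2", "s3", "s4"),
                  PC1 = c(0.10, -0.20, 0.05, 0.30),
                  stringsAsFactors = FALSE)
df_race <- data.frame(new_ID = c("s1", "s2", "s3", "s4"),
                      Ethnic.Category = c("White: British", NA, "", "Asian: Indian"),
                      stringsAsFactors = FALSE)

# Join on differently named key columns.
df_merged <- merge(pca, df_race, by.x = "ind", by.y = "new_ID")

# Drop rows whose category is NA or an empty string.
df_subset <- subset(df_merged, !is.na(Ethnic.Category) & Ethnic.Category != "")

# Assumed approach: keep only the text before the first ':' as the race label.
df_subset$race <- sub(":.*", "", df_subset$Ethnic.Category)

# IDs of the samples to keep.
filtered_ID <- subset(df_subset, race %in% c("White"))$ind

# Toy phenotype table in which sample s1 appears twice.
df_pheno <- data.frame(V1 = c("f1", "f1", "f4"),
                       V2 = c("s1", "s1", "s4"),
                       V3 = c(1, 2, 3),
                       stringsAsFactors = FALSE)
df_tem <- subset(df_pheno, V2 %in% filtered_ID)

# duplicated() flags the second and later occurrences, so only the first row per ID stays.
df_whiteonly <- df_tem[!duplicated(df_tem$V2), ]
```

Here s2 and s3 are dropped for missing race information, s4 is dropped by the race filter, and only the first of the two s1 phenotype rows survives.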
Although this script is mainly for drawing the PC plot, this step also builds on the PCA results, so it was implemented in the same script.

/// R programming code
write.table(df_whiteonly, fname_pheno_new, sep = '\t', quote = F, row.names = F, col.names = F)
# the output name and arguments of the next two calls are assumed to mirror the phenotype call
write.table(df_whiteonly_cov, fname_covariate_new, sep = '\t', quote = F, row.names = F, col.names = F)
write.table(df_whiteonly$V2, f_vcf, quote = F, row.names = F, col.names = F)

From here on, the PCA plot and distribution graphs are drawn with the ggplot() function. First, the proportion of variance explained by each PC is calculated and tidied up as follows.
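A minimal sketch of this kind of calculation, assuming the eigenvalues have already been read into a numeric vector, is shown below. The eigenvalues are made up, and the variable names `eigenval`, `pct_var`, and `axis_labels` are hypothetical.

```r
# Made-up eigenvalues standing in for the values read from the PCA output.
eigenval <- c(4.0, 2.0, 1.0, 1.0)

# Share of the summed eigenvalues carried by each PC, as a percentage rounded to one decimal.
pct_var <- round(eigenval / sum(eigenval) * 100, 1)

# Axis labels such as "PC1 (50%)" for use in a plot.
axis_labels <- paste0("PC", seq_along(pct_var), " (", pct_var, "%)")
```

Note that if the PCA output contains only the top few eigenvalues, these percentages are relative to the retained PCs rather than to the total variance. The labels could then be passed to `labs(x = axis_labels[1], y = axis_labels[2])` when building the plot with ggplot().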