Help required with a very specific motif search

Fraser Moss

I'm trying to establish whether any membrane protein with an intracellular C-terminus ends in the residues YKI.

This is obviously a very short sequence and will produce thousands of hits in a search of Swissprot or other protein databases.

Does anyone know how I can set up search parameters to confidently limit my search to membrane proteins, and tell the search engine that I want these residues to be the terminal three residues of the protein? So far I have not found a way to do this in NCBI without manually trawling through thousands of potential hits. A positive control search would be YKV, which I know is present in some receptors.

I have also searched PDZbase, but this service is presently limited to only about 300 known interactions.

Any help would be greatly appreciated.

ryan_m

So basically you want to find out how many protein sequences terminate in YKI, then test the null hypothesis that these are not over-represented in proteins localized in any one cellular component, the alternative being that these proteins are enriched in membrane proteins?

Ryan

bgood

fraser,

First, what are your criteria for accepting that a protein is a membrane protein? Is it sufficient to have annotation to a GO subcellular location of, for example, 'plasma membrane' (GO:0005886)? Do you want to constrain your protein list based on organism?

I don't know, but I suspect that you may need to do a little coding (or get someone else to) in order to answer your question.

You could, for example, retrieve a set of protein sequences from NCBI or other that have the GO annotation you seek, and then quite easily count how many of them end in the motif you are interested in.

bgood

Does anyone know of software that can search protein databases using both semantic constraints (e.g. ontology terms) and regular expression matching (for motif search)?

??

surferchic

You might have a look here - though I had some connection issues so couldn't try it out

http://www.embl-heidelberg.de/~chenna/elm_2.html

ryan_m

surferchic wrote:

You might have a look here - though I had some connection issues so couldn't try it out

http://www.embl-heidelberg.de/~chenna/elm_2.html

I can't get the link to the actual tool to work either (e.g. http://sirw.embl.de/)

Ryan

Fraser Moss

Here are my constraints

organism does not really matter although you could eliminate any plant species

Subcellular localization: plasma membrane

Structure: Membrane protein with at least 1 transmembrane spanning domain and an intracellular carboxy terminus.

The carboxy terminus ends with the sequence YKI.

To reiterate: I just want to determine whether or not any mammalian membrane proteins at all terminate with the sequence YKI. On paper it conforms to a PDZ type II interacting motif, but to date I have not found an example of it occurring in any known membrane protein, whereas the homologous YKV motif does occur and has been shown to interact with GRIP and PICK1. By the way, a negative result (i.e. it does not exist in nature) is a perfectly good result.

However my searching has been limited to manually looking at raw sequences one by one because the shortness of the sequence always overwhelms the search program. I really want to have the computers do the leg work for me.

Any more bright ideas? Thanks for the preliminary help and the speed of your replies so far.

Fraser Moss

hey guys - is this the new link do you reckon?

http://elm.eu.org/index.html

ryan_m

If you have access to a machine running linux or unix with a perl interpreter:

1) download swissprot or any other fasta-formatted protein database of your choice
2) make a file called check_carboxy.pl (see code below)
3) perl check_carboxy.pl < fasta_file.fa
4) all proteins that are printed to the screen are your positives.

Cheers,

Ryan

Code:

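#!/usr/bin/perl
# Print the header of every FASTA record whose sequence ends in the motif.
use strict;
use warnings;

my $motif  = "YKI";
my $header;        # header of the record currently being read
my $seq = "";      # its sequence, with wrapped lines joined together

# Report the current record if its full sequence ends in the motif.
sub report {
    print "$header\n" if defined $header && $seq =~ /\Q$motif\E$/i;
}

while (<>) {
    chomp;
    s/\s+$//;                  # drop trailing whitespace and carriage returns
    if (/^>(.+)/) {
        report();              # finished the previous record, so test it
        ($header, $seq) = ($1, "");
    }
    else {
        $seq .= $_;
    }
}
report();                      # the last record has no header after it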

Fraser Moss

ryan_m wrote:

If you have access to a machine running linux or unix with a perl interpreter:

1) download swissprot or any other fasta-formatted protein database of your choice

By this you mean perform a search, e.g. "plasma membrane", and then download all 10998 hits as FASTA files?

Then apply your code?

Please excuse my naivety, but my bioinformatics skills are pretty green

ryan_m

I was thinking that you would use the script to find all proteins that match your sequence requirement (i.e. ending in YKI), then take those and search their GO terms to see if any localise to the PM. However, the other way round would work too; it all depends on whether you can download a set of sequences already filtered by annotation. Thinking out loud here, I wonder if you could get this using the BioMart portal to the ENSEMBL database. I'll check it out.
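
The "motif first, annotation second" route could be sketched in shell roughly like this. It is only a sketch under assumptions: the function name keep_annotated, the idea of a plain text file holding one plasma-membrane accession per line, and the rule that the accession is either the first word of the header or the second "|"-separated field (as in UniProt-style headers such as sp|P12345|NAME_HUMAN) are all illustrative, so adjust them to whatever your download actually looks like.

```shell
# keep_annotated ACCESSION_LIST < HITS
# Prints only those hit headers whose accession is listed in ACCESSION_LIST
# (one accession per line). Both inputs are hypothetical files.
keep_annotated() {
    awk '
        # First file: remember every accession in the annotation list.
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }
        # Standard input: pull the accession out of each hit header.
        {
            id = $1
            sub(/^>/, "", id)
            n = split(id, part, "|")
            if (n >= 2) id = part[2]   # UniProt style: sp|P12345|NAME_HUMAN
            if (id in wanted) print
        }
    ' "$1" -
}
```

For example, if the headers printed by the motif search were saved as yki_hits.txt and the annotated accessions as pm_accessions.txt (both names made up here), then keep_annotated pm_accessions.txt < yki_hits.txt would list the hits that also carry the annotation.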

Cheers,

Ryan

ryan_m

frasermoss wrote:

ryan_m wrote:
If you have access to a machine running linux or unix with a perl interpreter:

1) download swissprot or any other fasta-formatted protein database of your choice

By this you mean perform a search, e.g. "plasma membrane", and then download all 10998 hits as FASTA files?

Then apply your code?

Please excuse my naivety, but my bioinformatics skills are pretty green

OK. You should be able to use Ensembl BioMart to get the data you need (e.g. set a filter to get only the proteins with the GO term you want). You can set BioMart to download the sequences as protein sequences in FASTA format.

Ryan

Fraser Moss

Job done! Thanks everyone.

ryan_m

Great!
Glad to have helped out. I'm interested now. Did you find the result you hoped for?

Ryan

Fraser Moss

Yep. I did not find any membrane proteins ending in YKI in human, mouse, rat, dog or C. elegans.

If anyone has the time or the inclination to double check for me that would be cool.

ryan_m

frasermoss wrote:

Yep. I did not find any membrane proteins ending in YKI in human, mouse, rat, dog or C. elegans.

If anyone has the time or the inclination to double check for me that would be cool.

I just realized that the Perl script was overkill. Assuming your FASTA file does not have any 'empty' lines separating the records, you could use this command (substitute your own FASTA file for all_5kb_upstream_118.fa):

grep -B 1 '>' all_5kb_upstream_118.fa | grep -v '>' | grep -i 'YKI$'

This prints the last line of any protein sequence that has that string at its terminus. Two caveats: it never looks at the final record in the file, because no header follows it, and it will miss a motif that is split across two wrapped sequence lines. With those in mind, if nothing is printed out, that supports your first analysis. Also, if there is an empty line between the records in your FASTA file, change "grep -B 1" to "grep -B 2".
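
If the caveats of a line-based grep matter for your file, one possible workaround is to let awk join each record's sequence lines before testing the end. The sketch below is an illustration, not a tested pipeline for any particular database: the function name cterm_hits is made up, and stripping a trailing "*" is an assumption based on some peptide exports marking the stop codon that way. It accepts several motifs at once, so the YKV positive control can be run in the same pass.

```shell
# cterm_hits MOTIF [MOTIF...] < proteins.fa
# Prints "MOTIF<TAB>header" for each record whose whole sequence ends in MOTIF.
cterm_hits() {
    awk -v motifs="$*" '
        # Test the record collected so far against every motif.
        function check(    i, n, m, len, tail) {
            if (!seen) return
            n = split(motifs, m, " ")
            for (i = 1; i <= n; i++) {
                len = length(m[i])
                tail = substr(seq, length(seq) - len + 1)
                if (length(seq) >= len && toupper(tail) == toupper(m[i]))
                    print toupper(m[i]) "\t" header
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { check(); header = substr($0, 2); seq = ""; seen = 1; next }
        # Sequence line: drop whitespace and any "*" marker, then append.
        { gsub(/[ \t\r*]/, ""); seq = seq $0 }
        # The final record has no header after it, so test it here.
        END { check() }
    '
}
```

Something like cterm_hits YKI YKV < proteins.fa (with proteins.fa standing in for your own download) should then list YKV-terminated receptors if the control is working, and nothing for YKI if the negative result holds.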

Ryan

GCG Support

Sorry to be joining this thread late, but if you have access to GCG (aka Wisconsin Package) and a local protein database installation, you could use FindPatterns to search for regular expressions and define end constraints using > and <. For example, the pattern YKI> would only be found if it occurs at the end of the sequence range. Your search could be performed using this command:

findpatterns '-pat=yki>'

A UniProt search (4,003,765 sequences) gives 374 total hits. The output is a list file that contains the entry names, hit site locations, and the found match. Here's the first hit on the list:

ACCO3_SOLLC ck: 53 len: 363 ! P10967 solanum lycopersicum (tomato) (lycopersicon esculentum). 1-aminocyc

1 yki>
YKI
361: SALSR YKI

You can use the list file to fetch the corresponding entries for further analysis.